You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
sitc/nlp/4_6_Combining_Features.ipynb

526 lines
19 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Combining Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Dataset](#Dataset)\n",
"* [Loading the dataset](#Loading-the-dataset)\n",
"* [Transformers](#Transformers)\n",
"* [Lexical features](#Lexical-features)\n",
"* [Syntactic features](#Syntactic-features)\n",
"* [Feature Extraction Pipelines](#Feature-Extraction-Pipelines)\n",
"* [Feature Union Pipeline](#Feature-Union-Pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous section we have seen how to analyse lexical, syntactic and semantic features. All these features can help in machine learning techniques.\n",
"\n",
"In this notebook we are going to learn how to combine them. \n",
"\n",
"There are several approaches for combining features, at character, lexical, syntactical, semantic or behavioural levels. \n",
"\n",
"Some authors obtain the different featuras as lists and then join these lists, a good example is shown [here](http://www.aicbt.com/authorship-attribution/) for authorship attribution. Other authors use *FeatureUnion* to join the different sparse matrices, as shown [here](http://es.slideshare.net/PyData/authorship-attribution-forensic-linguistics-with-python-scikit-learn-pandas-kostas-perifanos) and [here](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). Finally, other authors use FeatureUnions with weights, as shown in [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html).\n",
"\n",
"A *FeatureUnion* is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object.\n",
"\n",
"In this chapter we are going to follow the combination of Pipelines and FeatureUnions, as described in scikit-learn, [Zac Stewart](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html), his [Kaggle submission](https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py), and [Michelle Fullwood](https://michelleful.github.io/code-blog/2015/06/20/pipelines/), since it provides a simple and structured approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use one [dataset from Kaggle](https://www.kaggle.com/c/asap-aes/) for automatic essay scoring, a very interesting area for teachers.\n",
"\n",
"The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine's capabilities.\n",
"\n",
"Each of these files contains 28 columns:\n",
"\n",
"* **essay_id**: A unique identifier for each individual student essay\n",
"* **essay_set**: 1-8, an id for each set of essays\n",
"* **essay**: The ascii text of a student's response\n",
"* **rater1_domain1**: Rater 1's domain 1 score; all essays have this\n",
"* **rater2_domain1**: Rater 2's domain 1 score; all essays have this\n",
"* **rater3_domain1**: Rater 3's domain 1 score; only some essays in set 8 have this.\n",
"* **domain1_score**: Resolved score between the raters; all essays have this\n",
"* **rater1_domain2**: Rater 1's domain 2 score; only essays in set 2 have this\n",
"* **rater2_domain2**: Rater 2's domain 2 score; only essays in set 2 have this\n",
"* **domain2_score**: Resolved score between the raters; only essays in set 2 have this\n",
"* **rater1_trait1 score - rater3_trait6 score**: trait scores for sets 7-8\n",
"\n",
"The dataset is provided in the folder *data-kaggle/training_set_rel3.tsv*.\n",
"\n",
"There are cases in the training set that contain ???, \"illegible\", or \"not legible\" on some words. You may choose to discard them if you wish, and essays with illegible words will not be present in the validation or test sets.\n",
"\n",
"The dataset has been anonymized to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group and a variety of other approaches. The relevant entities are identified in the text and then replaced with a string such as \"@PERSON1.\"\n",
"\n",
"The entities identified by NER are: \"PERSON\", \"ORGANIZATION\", \"LOCATION\", \"DATE\", \"TIME\", \"MONEY\", \"PERCENT\"\n",
"\n",
"Other replacements made: \"MONTH\" (any month name not tagged as a date by the NER), \"EMAIL\" (anything that looks like an e-mail address), \"NUM\" (word containing digits or non-alphanumeric symbols), and \"CAPS\" (any capitalized word that doesn't begin a sentence, except in essays where more than 20% of the characters are capitalized letters), \"DR\" (any word following \"Dr.\" with or without the period, with any capitalization, that doesn't fall into any of the above), \"CITY\" and \"STATE\" (various cities and states)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use Pandas to load the dataset. We will not go deeper in analysing the dataset, using the techniques already seen previously."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# The files are coded in ISO-8859-1\n",
"\n",
"df_orig = pd.read_csv(\"data-essays/training_set_rel3.tsv\", encoding='ISO-8859-1', delimiter=\"\\t\", header=0)\n",
"df_orig[0:4]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_orig.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We filter the data of the essay_set number 1, and we keep only two columns for this \n",
"# example\n",
"\n",
"df = df_orig[df_orig['essay_set'] == 1][['essay_id', 'essay', 'domain1_score']].copy()\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df[0:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Define X and Y\n",
"X = df['essay'].values\n",
"y = df['domain1_score'].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every feature extractor should be implemented as a custom Transformer. A transformer can be seen as an object that receives data, applies some changes, and returns the data, usually with the same same that the input. The methods we should implement are:\n",
"* *fit* method, in case we need to learn and train for extracting the feature\n",
"* *transform method*, that applies the defined transformation to unseen data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we show the general approach to develop transformers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generic Transformer \n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"\n",
"class GenericTransformer(BaseEstimator, TransformerMixin):\n",
"\n",
" def transform(self, X, y=None):\n",
" return do_something_to(X, self.vars) # where the actual feature extraction happens\n",
"\n",
" def fit(self, X, y=None):\n",
" return self # used if the feature requires training, for example, clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides a class [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) that makes easy to create new transformers. We have to provide a function that is executed in the method transform()."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we include some examples of lexical features. We have omitted character features (for example, number of exclamation marks)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sample of statistics using nltk\n",
"# Another option is defining a function and pass it as a parameter to FunctionTransformer\n",
"\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"class LexicalStats (BaseEstimator, TransformerMixin):\n",
" \"\"\"Extract lexical features from each document\"\"\"\n",
" \n",
" def number_sentences(self, doc):\n",
" sentences = sent_tokenize(doc, language='english')\n",
" return len(sentences)\n",
"\n",
" def fit(self, x, y=None):\n",
" return self\n",
"\n",
" def transform(self, docs):\n",
" return [{'length': len(doc),\n",
" 'num_sentences': self.number_sentences(doc)}\n",
" for doc in docs]\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk.stem import PorterStemmer\n",
"from nltk import word_tokenize\n",
"from nltk.corpus import stopwords\n",
"import string\n",
"\n",
"def custom_tokenizer(words):\n",
" \"\"\"Preprocessing tokens as seen in the lexical notebook\"\"\"\n",
" tokens = word_tokenize(words.lower())\n",
" porter = PorterStemmer()\n",
" lemmas = [porter.stem(t) for t in tokens]\n",
" stoplist = stopwords.words('english')\n",
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
" punctuation = set(string.punctuation)\n",
" lemmas_punct = [w for w in lemmas_clean if w not in punctuation]\n",
" return lemmas_punct"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Syntactic features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we include and example of syntactic feature extraction."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk import pos_tag\n",
"from collections import Counter \n",
"\n",
"class PosStats(BaseEstimator, TransformerMixin):\n",
" \"\"\"Obtain number of tokens with POS categories\"\"\"\n",
"\n",
" def stats(self, doc):\n",
" tokens = custom_tokenizer(doc)\n",
" tagged = pos_tag(tokens, tagset='universal')\n",
" counts = Counter(tag for word,tag in tagged)\n",
" total = sum(counts.values())\n",
" #copy tags so that we return always the same number of features\n",
" pos_features = {'NOUN': 0, 'ADJ': 0, 'VERB': 0, 'ADV': 0, 'CONJ': 0, \n",
" 'ADP': 0, 'PRON':0, 'NUM': 0}\n",
" \n",
" pos_dic = dict((tag, float(count)/total) for tag,count in counts.items())\n",
" for k in pos_dic:\n",
" if k in pos_features:\n",
" pos_features[k] = pos_dic[k]\n",
" return pos_features\n",
" \n",
" def transform(self, docs, y=None):\n",
" return [self.stats(doc) for doc in docs]\n",
" \n",
" def fit(self, docs, y=None):\n",
" \"\"\"Returns `self` unless something different happens in train and test\"\"\"\n",
" return self"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Extraction Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We define Pipelines to extract the desired features.\n",
"\n",
"In case we want to apply different processing techniques to different part of the corpus (e.g. title or body or, ...), look [here](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html) for an example of how to extract and process the different parts into a Pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline, FeatureUnion\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer\n",
"\n",
"\n",
"ngrams_featurizer = Pipeline([\n",
" ('count_vectorizer', CountVectorizer(ngram_range = (1, 3), encoding = 'ISO-8859-1', \n",
" tokenizer=custom_tokenizer)),\n",
" ('tfidf_transformer', TfidfTransformer())\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Union Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can ensemble the different pipelines to define which features we want to extract, how to combine them, and apply later machine learning techniques to the resulting feature set.\n",
"\n",
"In Feature Union we can pass either a pipeline or a transformer.\n",
"\n",
"The basic idea is:\n",
"* **Pipelines** consist of sequential steps: one step works on the results of the previous step\n",
"* **FeatureUnions** consist of parallel tasks whose result is grouped when all have finished."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn.model_selection import cross_val_score, KFold\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.feature_extraction import DictVectorizer\n",
"from sklearn.preprocessing import FunctionTransformer\n",
"from sklearn.decomposition import NMF, LatentDirichletAllocation\n",
"\n",
"\n",
"\n",
"## All the steps of the Pipeline should end with a sparse vector as the input data\n",
"\n",
"pipeline = Pipeline([\n",
" ('features', FeatureUnion([\n",
" ('lexical_stats', Pipeline([\n",
" ('stats', LexicalStats()),\n",
" ('vectors', DictVectorizer())\n",
" ])),\n",
" ('words', TfidfVectorizer(tokenizer=custom_tokenizer)),\n",
" ('ngrams', ngrams_featurizer),\n",
" ('pos_stats', Pipeline([\n",
" ('pos_stats', PosStats()),\n",
" ('vectors', DictVectorizer())\n",
" ])),\n",
" ('lda', Pipeline([ \n",
" ('count', CountVectorizer(tokenizer=custom_tokenizer)),\n",
" ('lda', LatentDirichletAllocation(n_components=4, max_iter=5,\n",
" learning_method='online', \n",
" learning_offset=50.,\n",
" random_state=0))\n",
" ])),\n",
" ])),\n",
" \n",
" ('clf', MultinomialNB(alpha=.01)) # classifier\n",
" ])\n",
"\n",
"# Using KFold validation\n",
"\n",
"cv = KFold(2, shuffle=True, random_state=33)\n",
"scores = cross_val_score(pipeline, X, y, cv=cv)\n",
"print(\"Scores in every iteration\", scores)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is not very good :(."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}