"* [Feature Union Pipeline](#Feature-Union-Pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous section we have seen how to analyse lexical, syntactic and semantic features. All these features can help in machine learning techniques.\n",
"\n",
"In this notebook we are going to learn how to combine them. \n",
"\n",
"There are several approaches for combining features, at character, lexical, syntactical, semantic or behavioural levels. \n",
"\n",
"Some authors obtain the different featuras as lists and then join these lists, a good example is shown [here](http://www.aicbt.com/authorship-attribution/) for authorship attribution. Other authors use *FeatureUnion* to join the different sparse matrices, as shown [here](http://es.slideshare.net/PyData/authorship-attribution-forensic-linguistics-with-python-scikit-learn-pandas-kostas-perifanos) and [here](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). Finally, other authors use FeatureUnions with weights, as shown in [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html).\n",
"\n",
"A *FeatureUnion* is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object.\n",
"\n",
"In this chapter we are going to follow the combination of Pipelines and FeatureUnions, as described in scikit-learn, [Zac Stewart](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html), his [Kaggle submission](https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py), and [Michelle Fullwood](https://michelleful.github.io/code-blog/2015/06/20/pipelines/), since it provides a simple and structured approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use one [dataset from Kaggle](https://www.kaggle.com/c/asap-aes/) for automatic essay scoring, a very interesting area for teachers.\n",
"\n",
"The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine's capabilities.\n",
"The dataset is provided in the folder *data-kaggle/training_set_rel3.tsv*.\n",
"\n",
"There are cases in the training set that contain ???, \"illegible\", or \"not legible\" on some words. You may choose to discard them if you wish, and essays with illegible words will not be present in the validation or test sets.\n",
"\n",
"The dataset has been anonymized to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group and a variety of other approaches. The relevant entities are identified in the text and then replaced with a string such as \"@PERSON1.\"\n",
"Other replacements made: \"MONTH\" (any month name not tagged as a date by the NER), \"EMAIL\" (anything that looks like an e-mail address), \"NUM\" (word containing digits or non-alphanumeric symbols), and \"CAPS\" (any capitalized word that doesn't begin a sentence, except in essays where more than 20% of the characters are capitalized letters), \"DR\" (any word following \"Dr.\" with or without the period, with any capitalization, that doesn't fall into any of the above), \"CITY\" and \"STATE\" (various cities and states)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use Pandas to load the dataset. We will not go deeper in analysing the dataset, using the techniques already seen previously."
"Every feature extractor should be implemented as a custom Transformer. A transformer can be seen as an object that receives data, applies some changes, and returns the data, usually with the same same that the input. The methods we should implement are:\n",
"* *fit* method, in case we need to learn and train for extracting the feature\n",
"* *transform method*, that applies the defined transformation to unseen data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we show the general approach to develop transformers"
" return do_something_to(X, self.vars) # where the actual feature extraction happens\n",
"\n",
" def fit(self, X, y=None):\n",
" return self # used if the feature requires training, for example, clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides a class [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) that makes easy to create new transformers. We have to provide a function that is executed in the method transform()."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we include some examples of lexical features. We have omitted character features (for example, number of exclamation marks)."
" pos_dic = dict((tag, float(count)/total) for tag,count in counts.items())\n",
" for k in pos_dic:\n",
" if k in pos_features:\n",
" pos_features[k] = pos_dic[k]\n",
" return pos_features\n",
" \n",
" def transform(self, docs, y=None):\n",
" return [self.stats(doc) for doc in docs]\n",
" \n",
" def fit(self, docs, y=None):\n",
" \"\"\"Returns `self` unless something different happens in train and test\"\"\"\n",
" return self"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Extraction Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We define Pipelines to extract the desired features.\n",
"\n",
"In case we want to apply different processing techniques to different part of the corpus (e.g. title or body or, ...), look [here](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html) for an example of how to extract and process the different parts into a Pipeline."
"Now we can ensemble the different pipelines to define which features we want to extract, how to combine them, and apply later machine learning techniques to the resulting feature set.\n",
"\n",
"In Feature Union we can pass either a pipeline or a transformer.\n",
"\n",
"The basic idea is:\n",
"* **Pipelines** consist of sequential steps: one step works on the results of the previous step\n",
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",