{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Combining Features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [Dataset](#Dataset)\n", "* [Loading the dataset](#Loading-the-dataset)\n", "* [Transformers](#Transformers)\n", "* [Lexical features](#Lexical-features)\n", "* [Syntactic features](#Syntactic-features)\n", "* [Feature Extraction Pipelines](#Feature-Extraction-Pipelines)\n", "* [Feature Union Pipeline](#Feature-Union-Pipeline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous section we saw how to analyse lexical, syntactic and semantic features. All of these features can serve as input to machine learning techniques.\n", "\n", "In this notebook we are going to learn how to combine them. \n", "\n", "There are several approaches for combining features, at the character, lexical, syntactic, semantic or behavioural level. \n", "\n", "Some authors obtain the different features as lists and then join these lists; a good example is shown [here](http://www.aicbt.com/authorship-attribution/) for authorship attribution. Other authors use *FeatureUnion* to join the different sparse matrices, as shown [here](http://es.slideshare.net/PyData/authorship-attribution-forensic-linguistics-with-python-scikit-learn-pandas-kostas-perifanos) and [here](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). 
Finally, other authors use FeatureUnions with weights, as shown in the [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html).\n", "\n", "A *FeatureUnion* is built from a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and the value is an estimator object.\n", "\n", "In this chapter we are going to follow the approach of combining Pipelines and FeatureUnions, as described in scikit-learn, [Zac Stewart](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html), his [Kaggle submission](https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py), and [Michelle Fullwood](https://michelleful.github.io/code-blog/2015/06/20/pipelines/), since it provides a simple and structured approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use a [dataset from Kaggle](https://www.kaggle.com/c/asap-aes/) for automatic essay scoring, a very interesting area for teachers.\n", "\n", "For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. 
All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine's capabilities.\n", "\n", "Each of these files contains 28 columns:\n", "\n", "* essay_id: A unique identifier for each individual student essay\n", "* essay_set: 1-8, an id for each set of essays\n", "* essay: The ASCII text of a student's response\n", "* rater1_domain1: Rater 1's domain 1 score; all essays have this\n", "* rater2_domain1: Rater 2's domain 1 score; all essays have this\n", "* rater3_domain1: Rater 3's domain 1 score; only some essays in set 8 have this\n", "* domain1_score: Resolved score between the raters; all essays have this\n", "* rater1_domain2: Rater 1's domain 2 score; only essays in set 2 have this\n", "* rater2_domain2: Rater 2's domain 2 score; only essays in set 2 have this\n", "* domain2_score: Resolved score between the raters; only essays in set 2 have this\n", "* rater1_trait1 score - rater3_trait6 score: trait scores for sets 7-8\n", "\n", "The dataset is provided in the file *data-kaggle/training_set_rel3.tsv*.\n", "\n", "There are cases in the training set that contain ???, \"illegible\", or \"not legible\" in some words. You may choose to discard them if you wish, and essays with illegible words will not be present in the validation or test sets.\n", "\n", "The dataset has been anonymized to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group and a variety of other approaches. 
The relevant entities are identified in the text and then replaced with a string such as \"@PERSON1\".\n", "\n", "The entities identified by NER are: \"PERSON\", \"ORGANIZATION\", \"LOCATION\", \"DATE\", \"TIME\", \"MONEY\", \"PERCENT\"\n", "\n", "Other replacements made: \"MONTH\" (any month name not tagged as a date by the NER), \"EMAIL\" (anything that looks like an e-mail address), \"NUM\" (word containing digits or non-alphanumeric symbols), \"CAPS\" (any capitalized word that doesn't begin a sentence, except in essays where more than 20% of the characters are capitalized letters), \"DR\" (any word following \"Dr.\" with or without the period, with any capitalization, that doesn't fall into any of the above), and \"CITY\" and \"STATE\" (various cities and states)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading the dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use Pandas to load the dataset. We will not analyse the dataset in depth here, since the required techniques have already been covered previously." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
|   | essay_id | essay_set | essay | rater1_domain1 | rater2_domain1 | rater3_domain1 | domain1_score | rater1_domain2 | rater2_domain2 | domain2_score | ... | rater2_trait3 | rater2_trait4 | rater2_trait5 | rater2_trait6 | rater3_trait1 | rater3_trait2 | rater3_trait3 | rater3_trait4 | rater3_trait5 | rater3_trait6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Dear local newspaper, I think effects computer... | 4 | 4 | NaN | 8 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | 1 | Dear @CAPS1 @CAPS2, I believe that using compu... | 5 | 4 | NaN | 9 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 3 | 1 | Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... | 4 | 3 | NaN | 7 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | 1 | Dear Local Newspaper, @CAPS1 I have found that... | 5 | 5 | NaN | 10 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

4 rows × 28 columns
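The second output below keeps only the columns needed for scoring: the identifier, the essay text, and the resolved score. A minimal sketch of this selection step (the inline mini-dataframe is a hypothetical stand-in for the real TSV, whose path and tab separator are given above; the `latin-1` encoding is an assumption about the Kaggle file):

```python
import pandas as pd

# Hypothetical mini-sample mimicking a few columns of the Kaggle TSV
df = pd.DataFrame({
    'essay_id': [1, 2],
    'essay_set': [1, 1],
    'essay': ['Dear local newspaper, ...', 'Dear @CAPS1 @CAPS2, ...'],
    'rater1_domain1': [4, 5],
    'rater2_domain1': [4, 4],
    'domain1_score': [8, 9],
})

# The real file would be loaded along these lines (encoding is an assumption):
# df = pd.read_csv('data-kaggle/training_set_rel3.tsv', sep='\t', encoding='latin-1')

# Keep only the identifier, the essay text and the resolved score
df_simple = df[['essay_id', 'essay', 'domain1_score']]
```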
|   | essay_id | essay | domain1_score |
|---|---|---|---|
| 0 | 1 | Dear local newspaper, I think effects computer... | 8 |
| 1 | 2 | Dear @CAPS1 @CAPS2, I believe that using compu... | 9 |
| 2 | 3 | Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... | 7 |
| 3 | 4 | Dear Local Newspaper, @CAPS1 I have found that... | 10 |
| 4 | 5 | Dear @LOCATION1, I know having computers has a... | 8 |
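The Pipeline-plus-FeatureUnion approach described in the Objectives can be sketched as follows. This is a minimal illustration, not the chapter's final estimator: `LengthTransformer` is a hypothetical hand-crafted feature (essay length in tokens), and `Ridge` is just one reasonable choice of regressor for predicting `domain1_score`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, Pipeline

class LengthTransformer(BaseEstimator, TransformerMixin):
    """Illustrative transformer: essay length in tokens as a single feature."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[len(text.split())] for text in X])

# FeatureUnion takes (key, value) pairs: the key is an arbitrary identifier,
# the value an estimator; the transformed outputs are concatenated column-wise.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('lexical', TfidfVectorizer()),   # word-level tf-idf features
        ('length', LengthTransformer()),  # hand-crafted length feature
    ])),
    ('regressor', Ridge()),               # predict the resolved essay score
])

essays = ["Dear local newspaper, I think effects computers have on people...",
          "Dear @CAPS1 @CAPS2, I believe that using computers..."]
scores = [8, 9]
pipeline.fit(essays, scores)
predictions = pipeline.predict(essays)
```

Because each branch of the FeatureUnion is itself an estimator, each key can later be replaced by a full Pipeline (e.g. a column selector followed by a vectorizer), which is exactly the pattern followed in the rest of this chapter.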