Added NLP notebooks

pull/1/head
cif2cif 8 years ago
parent 31153f8e5d
commit 3054442421

@ -0,0 +1,92 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to Natural Language Processing\n",
" \n",
"In this lab session, we are going to learn how to analyze texts and apply machine learning techniques on textual data sources.\n",
"\n",
"# Objectives\n",
"\n",
"The main objectives of this session are:\n",
"* Learn how to obtain lexical, syntactic and semantic features from texts\n",
"* Learn to use some libraries, such as NLTK, Scikit-learn and gensim for NLP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. [Home](4_0_Intro_NLP.ipynb)\n",
"2. [Lexical processing](4_1_Lexical_Processing.ipynb)\n",
"3. [Syntactic processing](4_2_Syntactic_Processing.ipynb)\n",
"4. [Vector representation](4_3_Vector_Representation.ipynb)\n",
"5. [Classification](4_4_Classification.ipynb)\n",
"6. [Semantic models](4_5_Semantic_Models.ipynb)\n",
"7. [Combining features](4_6_Combining_Features.ipynb)\n",
"8. [Exercises](4_7_Exercises.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,644 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Tools](#Tools)\n",
"* [Cleansing](#Cleansing)\n",
"* [Tokenization](#Tokenization)\n",
"* [Sentence Splitter](#Sentence-Splitter)\n",
"* [Word Splitter](#Word-Splitter)\n",
"* [Stemming and Lemmatization](#Stemming-and-Lemmatization)\n",
"* [Stop word removal](#Stop-word-removal)\n",
"* [Punctuation removal](#Punctuation-removal)\n",
"* [Rare words and spelling](#Rare-words-and-spelling)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to learn how to preprocess texts, a task also known as *text wrangling*. It involves data munging, text cleansing, specific preprocessing, tokenization, stemming or lemmatization, and stop word removal.\n",
"\n",
"The main objectives of this session are:\n",
"* Learn how to preprocess text sources\n",
"* Learn to use some of the most popular NLP libraries\n",
"\n",
"We are going to use as an example part of a computer review included in [Liu's Product Review of IJCA 2015](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets) and a tweet from the [Semeval 2013 Task 2 dataset](https://www.cs.york.ac.uk/semeval-2013/task2/data/uploads/datasets/readme.txt), slightly modified for learning purposes."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"review = \"\"\"I purchased this monitor because of budgetary concerns. This item was the most inexpensive 17 inch monitor \n",
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
"monitor models since I 'm a college student and this particular monitor had as poor of picture quality as \n",
"any I 've seen.\"\"\"\n",
"\n",
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to use several libraries, which provide complementary features:\n",
"* [NLTK](http://www.nltk.org/book_1ed/) - provides functionalities for sentence splitting, tokenization, lemmatization, NER, collocations ... and access to many lexical resources (WordNet, Corpora, ...)\n",
"* [Gensim](https://radimrehurek.com/gensim/) - provides functionalities for corpora management, LDA and LSI, among others.\n",
"* [TextBlob](http://textblob.readthedocs.io/) - provides a simple way to access many of the NLP functions. It is simpler than NLTK and integrates additional functionalities, such as language detection, spelling correction or even sentiment analysis.\n",
"* [CLiPS](http://www.clips.ua.ac.be/pages/pattern-en#parser) -- contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment and mood analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface. Unfortunately, it does not support Python 3 yet.\n",
"\n",
"\n",
"In order to use NLTK, we should first download the lexical resources we are going to use. We can update them later. To do so:\n",
"* Import nltk\n",
"* Run *nltk.download()* (the first time we use it). A window will appear. Select just the collection 'book' and press download.\n",
"\n",
"If you inspect the window, you can get an overview of the available lexical resources (corpora, lexicons and grammars). For example, you can find some relevant sentiment lexicons in corpora (SentiWordNet, Sentence Polarity Dataset, Opinion Lexicon or VADER Sentiment Lexicon)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import nltk\n",
"nltk.download()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleansing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case we will use raw text. In case you need to clean the documents (eliminate HTML markup, etc.), you can use libraries such as [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)."
]
},
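{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of such cleansing, the next cell strips the markup from a small HTML fragment with BeautifulSoup. The fragment and the variable names are made up for illustration, and the cell assumes that the *beautifulsoup4* package is installed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"# Hypothetical HTML snippet, not part of the course datasets\n",
"html = '<html><body><h1>Review</h1><p>My overall experience was <b>very poor</b>.</p></body></html>'\n",
"\n",
"# Parse with the parser from the standard library and keep only the visible text\n",
"soup = BeautifulSoup(html, 'html.parser')\n",
"text = soup.get_text(separator=' ', strip=True)\n",
"print(text)"
]
},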
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Tokenization is the process of transforming a text into tokens. Depending on the input, we may want to split the text into sentences or words. Moreover, some inputs, such as tweets, require handling special tokens such as hashtags.\n",
"\n",
"NLTK provides good support for [tokenization](http://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"Next we are going to practice several of these features."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sentence Splitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use a standard sentence splitter (*sent_tokenize*, which uses a pre-trained Punkt model), or train our own sentence splitter using the class [*PunktSentenceTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
"\n",
"If the text is multilingual, we can install [*textblob*](http://textblob.readthedocs.io/) or [*langdetect*](https://pypi.python.org/pypi/langdetect?) to detect the text language and select the most suitable sentence splitter. NLTK comes with 17 trained languages for sentence splitting."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['I purchased this monitor because of budgetary concerns.', 'This item was the most inexpensive 17 inch monitor \\navailable to me at the time I made the purchase.', 'My overall experience with this monitor was very poor.', \"When the \\nscreen wasn't contracting or glitching the overall picture quality was poor to fair.\", \"I've viewed numerous different \\nmonitor models since I 'm a college student and this particular monitor had as poor of picture quality as \\nany I 've seen.\"]\n"
]
}
],
"source": [
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"sentences = sent_tokenize(review, language='english')\n",
"print(sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Word Splitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is dividing every sentence (or the full text) into words."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['I', 'purchased', 'this', 'monitor', 'because', 'of', 'budgetary', 'concerns', '.'], ['This', 'item', 'was', 'the', 'most', 'inexpensive', '17', 'inch', 'monitor', 'available', 'to', 'me', 'at', 'the', 'time', 'I', 'made', 'the', 'purchase', '.'], ['My', 'overall', 'experience', 'with', 'this', 'monitor', 'was', 'very', 'poor', '.'], ['When', 'the', 'screen', 'was', \"n't\", 'contracting', 'or', 'glitching', 'the', 'overall', 'picture', 'quality', 'was', 'poor', 'to', 'fair', '.'], ['I', \"'ve\", 'viewed', 'numerous', 'different', 'monitor', 'models', 'since', 'I', \"'m\", 'a', 'college', 'student', 'and', 'this', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'picture', 'quality', 'as', 'any', 'I', \"'ve\", 'seen', '.']]\n"
]
}
],
"source": [
"words = [word_tokenize(t) for t in sent_tokenize(review)]\n",
"print(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our case, we are not interested in processing every sentence separately; we have split the text into sentences just for learning purposes. So, we are going to get the word tokens of the full text."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['I', 'purchased', 'this', 'monitor', 'because', 'of', 'budgetary', 'concerns', '.', 'This', 'item', 'was', 'the', 'most', 'inexpensive', '17', 'inch', 'monitor', 'available', 'to', 'me', 'at', 'the', 'time', 'I', 'made', 'the', 'purchase', '.', 'My', 'overall', 'experience', 'with', 'this', 'monitor', 'was', 'very', 'poor', '.', 'When', 'the', 'screen', 'was', \"n't\", 'contracting', 'or', 'glitching', 'the', 'overall', 'picture', 'quality', 'was', 'poor', 'to', 'fair', '.', 'I', \"'ve\", 'viewed', 'numerous', 'different', 'monitor', 'models', 'since', 'I', \"'m\", 'a', 'college', 'student', 'and', 'this', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'picture', 'quality', 'as', 'any', 'I', \"'ve\", 'seen', '.']\n"
]
}
],
"source": [
"words = word_tokenize(review)\n",
"print(words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
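"We can define our own word tokenizer using regular expressions with the class [*RegexpTokenizer*](http://www.nltk.org/api/nltk.tokenize.html). For tweets, NLTK also provides *TweetTokenizer*, which keeps hashtags in one piece and can strip user handles.\n",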
"\n"
]
},
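{
"cell_type": "markdown",
"metadata": {},
"source": [
"A regular expression tokenizer could be sketched as follows. The pattern and the sample sentence are assumptions chosen for illustration: the pattern keeps an optional leading # or @ attached to the word and drops punctuation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.tokenize import RegexpTokenizer\n",
"\n",
"# Illustrative pattern: an optional leading # or @ followed by word characters,\n",
"# so hashtags and mentions stay in one piece and punctuation is dropped\n",
"regexp_tokenizer = RegexpTokenizer(r'[#@]?\\w+')\n",
"\n",
"# Made-up sample text\n",
"sample = 'Lady Gaga is at the #concert tonight!!! WOW @fans'\n",
"print(regexp_tokenizer.tokenize(sample))"
]
},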
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"With TweetTokenizer Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight ! ! ! She still listens to her music ! ! ! WOW ! ! ! #ladygaga #britney\n",
"With word_tokenizer @ concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight ! ! ! She still listens to her music ! ! ! ! WOW ! ! ! # ladygaga # britney\n"
]
}
],
"source": [
"from nltk.tokenize import TweetTokenizer\n",
"tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"tweet_tokens = tknzr.tokenize(tweet)\n",
"print(\"With TweetTokenizer \" + \" \".join(tweet_tokens))\n",
"print(\"With word_tokenizer \" + \" \".join(word_tokenize(tweet)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stemming and Lemmatization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK provides support for stemming and lemmatization in the package [*stem*](http://www.nltk.org/api/nltk.stem.html). Several stemmers and lemmatizers are available, such as PorterStemmer, LancasterStemmer, SnowballStemmer and WordNetLemmatizer. Check the API for more details."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Porter: boy children are have is ha Madrid\n",
"Execution time: 0.00093841552734375\n",
"Lancaster: boy childr ar hav is has madrid\n",
"Execution time: 0.0014300346374511719\n",
"WordNet: boy child are have is ha Madrid\n",
"Execution time: 0.0008304119110107422\n",
"SnowBall: boy children are have is has madrid\n",
"Execution time: 0.0017843246459960938\n"
]
}
],
"source": [
"from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer\n",
"from nltk.stem.snowball import EnglishStemmer\n",
"import time\n",
"\n",
"porter = PorterStemmer()\n",
"lancaster = LancasterStemmer()\n",
"wordnet = WordNetLemmatizer()\n",
"snowball = EnglishStemmer()\n",
"\n",
"words = \"boys children are have is has Madrid\"\n",
"\n",
"start = time.time()\n",
"print(\"Porter: \" + \" \".join([porter.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"Lancaster: \" + \" \".join([lancaster.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))\n",
"start = time.time()\n",
"print(\"SnowBall: \" + \" \".join([snowball.stem(w) for w in word_tokenize(words)]))\n",
"end = time.time()\n",
"print(\"Execution time: \" + str(end - start))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"As we can see, we get the forms *are* and *is* instead of *be*. This is because we have not provided the Part-Of-Speech (POS), and the default POS is 'n' (noun).\n",
"\n",
"The main difference between stemmers and lemmatizers is that stemmers operate on isolated words, while lemmatizers take into account the context (e.g. the POS). However, stemmers are quicker and require fewer resources.\n"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"WordNet: be cry be have have\n"
]
}
],
"source": [
"verbs = \"are crying is have has\"\n",
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w, pos='v') for w in word_tokenize(verbs)]))\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Depending on the application, we can select stemmers or lemmatizers. \n",
"\n",
"Regarding Twitter, we could use specialised software for managing tweets, such as [*TweetNLP*](http://www.cs.cmu.edu/~ark/TweetNLP/).\n",
"\n",
"Now we go back to our example and we apply stemming."
]
},
{
"cell_type": "code",
"execution_count": 82,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['i', 'purchas', 'thi', 'monitor', 'becaus', 'of', 'budgetari', 'concern', '.', 'thi', 'item', 'wa', 'the', 'most', 'inexpens', '17', 'inch', 'monitor', 'avail', 'to', 'me', 'at', 'the', 'time', 'i', 'made', 'the', 'purchas', '.', 'my', 'overal', 'experi', 'with', 'thi', 'monitor', 'wa', 'veri', 'poor', '.', 'when', 'the', 'screen', 'wa', \"n't\", 'contract', 'or', 'glitch', 'the', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'to', 'fair', '.', 'i', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', 'i', \"'m\", 'a', 'colleg', 'student', 'and', 'thi', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'pictur', 'qualiti', 'as', 'ani', 'i', \"'ve\", 'seen', '.']\n",
"['Ladi', 'Gaga', 'is', 'actual', 'at', 'the', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', '!', '!', '!', 'She', 'still', 'listen', 'to', 'her', 'music', '!', '!', '!', 'WOW', '!', '!', '!', '#ladygaga', '#britney']\n"
]
}
],
"source": [
"def preprocess(words, type='doc'):\n",
"    # Tokenize: tweets keep hashtags and drop @handles; documents are lowercased first\n",
"    if type == 'tweet':\n",
"        tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"        tokens = tknzr.tokenize(words)\n",
"    else:\n",
"        tokens = nltk.word_tokenize(words.lower())\n",
"    # Reduce every token to its stem\n",
"    porter = nltk.PorterStemmer()\n",
"    stems = [porter.stem(t) for t in tokens]\n",
"    return stems\n",
"\n",
"print(preprocess(review))\n",
"print(preprocess(tweet, type='tweet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Stop word removal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next step is removing stop words."
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']\n"
]
}
],
"source": [
"from nltk.corpus import stopwords\n",
"\n",
"stoplist = stopwords.words('english')\n",
"print(stoplist)"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['purchas', 'thi', 'monitor', 'becaus', 'budgetari', 'concern', '.', 'thi', 'item', 'wa', 'inexpens', '17', 'inch', 'monitor', 'avail', 'time', 'made', 'purchas', '.', 'overal', 'experi', 'thi', 'monitor', 'wa', 'veri', 'poor', '.', 'screen', 'wa', \"n't\", 'contract', 'glitch', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'fair', '.', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', \"'m\", 'colleg', 'student', 'thi', 'particular', 'monitor', 'poor', 'pictur', 'qualiti', 'ani', \"'ve\", 'seen', '.']\n",
"['Ladi', 'Gaga', 'actual', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', '!', '!', '!', 'She', 'still', 'listen', 'music', '!', '!', '!', 'WOW', '!', '!', '!', '#ladygaga', '#britney']\n"
]
}
],
"source": [
"def preprocess(words, type='doc'):\n",
"    # Tokenize: tweets keep hashtags and drop @handles; documents are lowercased first\n",
"    if type == 'tweet':\n",
"        tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"        tokens = tknzr.tokenize(words)\n",
"    else:\n",
"        tokens = nltk.word_tokenize(words.lower())\n",
"    # Reduce every token to its stem\n",
"    porter = nltk.PorterStemmer()\n",
"    stems = [porter.stem(t) for t in tokens]\n",
"    # Drop the stems that appear in the English stop word list\n",
"    stoplist = stopwords.words('english')\n",
"    stems_clean = [w for w in stems if w not in stoplist]\n",
"    return stems_clean\n",
"\n",
"print(preprocess(review))\n",
"print(preprocess(tweet, type='tweet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Punctuation removal"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Punctuation is useful for sentence splitting and POS tagging. Once we have used it, we can remove it easily."
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['purchas', 'thi', 'monitor', 'becaus', 'budgetari', 'concern', 'thi', 'item', 'wa', 'inexpens', '17', 'inch', 'monitor', 'avail', 'time', 'made', 'purchas', 'overal', 'experi', 'thi', 'monitor', 'wa', 'veri', 'poor', 'screen', 'wa', \"n't\", 'contract', 'glitch', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'fair', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', \"'m\", 'colleg', 'student', 'thi', 'particular', 'monitor', 'poor', 'pictur', 'qualiti', 'ani', \"'ve\", 'seen']\n",
"['Ladi', 'Gaga', 'actual', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', 'She', 'still', 'listen', 'music', 'WOW', '#ladygaga', '#britney']\n"
]
}
],
"source": [
"import string\n",
"\n",
"def preprocess(words, type='doc'):\n",
"    # Tokenize: tweets keep hashtags and drop @handles; documents are lowercased first\n",
"    if type == 'tweet':\n",
"        tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
"        tokens = tknzr.tokenize(words)\n",
"    else:\n",
"        tokens = nltk.word_tokenize(words.lower())\n",
"    # Reduce every token to its stem\n",
"    porter = nltk.PorterStemmer()\n",
"    stems = [porter.stem(t) for t in tokens]\n",
"    # Drop the stems that appear in the English stop word list\n",
"    stoplist = stopwords.words('english')\n",
"    stems_clean = [w for w in stems if w not in stoplist]\n",
"    # Drop the tokens that are a single punctuation character\n",
"    punctuation = set(string.punctuation)\n",
"    return [w for w in stems_clean if w not in punctuation]\n",
"\n",
"print(preprocess(review))\n",
"print(preprocess(tweet, type='tweet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Rare words and spelling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a large corpus, we may want to remove rare words (which are probably typos) and correct the spelling. \n",
"\n",
"For the first task, you can exclude the least frequent words (or compare with their frequency in a corpus). NLTK provides facilities for calculating frequencies.\n",
"\n",
"For the second task, you can use spell-checker packages such as [*textblob*](http://textblob.readthedocs.io/) or [*autocorrect*](https://pypi.python.org/pypi/autocorrect/)."
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Most frequent\n",
"[('.', 5), ('I', 5), ('the', 5), ('monitor', 5), ('was', 4), ('this', 3), ('poor', 3), ('as', 2), ('overall', 2), ('picture', 2)]\n",
"Least frequent\n",
"[('since', 1), (\"'m\", 1), ('a', 1), ('college', 1), ('student', 1), ('and', 1), ('particular', 1), ('had', 1), ('any', 1), ('seen', 1)]\n"
]
}
],
"source": [
"frec = nltk.FreqDist(nltk.word_tokenize(review))\n",
"print(\"Most frequent\")\n",
"print(frec.most_common(10))\n",
"print(\"Least frequent\")\n",
"# most_common() sorts by decreasing frequency, so the tail holds the rarest words\n",
"print(frec.most_common()[-10:])"
]
},
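{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to drop rare words is to keep only the tokens whose frequency reaches a threshold. The following sketch uses a toy token list and a hypothetical threshold *min_count*; both are assumptions for illustration and should be tuned on a real corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk import FreqDist\n",
"\n",
"# Toy token list with one deliberate typo; in practice use the tokens of the whole corpus\n",
"tokens = 'the monitor was poor the monitor was cheap the screeen flickered'.split()\n",
"min_count = 2  # hypothetical threshold\n",
"\n",
"freq = FreqDist(tokens)\n",
"# Keep only the tokens seen at least min_count times\n",
"frequent_tokens = [t for t in tokens if freq[t] >= min_count]\n",
"print(frequent_tokens)\n",
"# Words seen only once (hapaxes) are candidates for typos\n",
"print(freq.hapaxes())"
]
},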
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,595 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Syntactic Processing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Objectives](#Objectives)\n",
"* [POS Tagging](#POS-Tagging)\n",
"* [NER](#NER)\n",
"* [Parsing and Chunking](#Parsing-and-Chunking)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we are going to learn how to analyse the syntax of text. In particular, we will learn to:\n",
"* Understand and perform POS (Part of Speech) tagging\n",
"* Understand and perform NER (Named Entity Recognition)\n",
"* Understand and parse texts\n",
"\n",
"We will use the same examples as in the previous notebook, slightly modified for learning purposes."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"review = \"\"\"I purchased this Dell monitor because of budgetary concerns. This item was the most inexpensive 17 inch Apple monitor \n",
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
"monitor models since I 'm a college student at UPM in Madrid and this particular monitor had as poor of picture quality as \n",
"any I 've seen.\"\"\"\n",
"\n",
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# POS Tagging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"POS Tagging is the process of assigning a grammatical category (known as *part of speech*, POS) to each word. The most common approach relies on an annotated corpus such as Penn Treebank. The tag set (categories) depends on the corpus annotation. Fortunately, NLTK defines a [universal tagset](http://www.nltk.org/book/ch05.html):\n",
"\n",
"\n",
"Tag\t| Meaning | English Examples\n",
"----|---------|------------------\n",
"ADJ\t| adjective | new, good, high, special, big, local\n",
"ADP\t| adposition | on, of, at, with, by, into, under\n",
"ADV\t| adverb | really, already, still, early, now\n",
"CONJ| conjunction | and, or, but, if, while, although\n",
"DET | determiner, article | the, a, some, most, every, no, which\n",
"NOUN | noun\t | year, home, costs, time, Africa\n",
"NUM\t| numeral | twenty-four, fourth, 1991, 14:24\n",
"PRT | particle | at, on, out, over per, that, up, with\n",
"PRON | pronoun | he, their, her, its, my, I, us\n",
"VERB | verb\t| is, say, told, given, playing, would\n",
". | punctuation marks | . , ; !\n",
"X | other | ersatz, esprit, dunno, gr8, univeristy"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('I', 'PRON'), ('purchased', 'VERB'), ('this', 'DET'), ('Dell', 'NOUN'), ('monitor', 'NOUN'), ('because', 'ADP'), ('of', 'ADP'), ('budgetary', 'ADJ'), ('concerns', 'NOUN'), ('.', '.'), ('This', 'DET'), ('item', 'NOUN'), ('was', 'VERB'), ('the', 'DET'), ('most', 'ADV'), ('inexpensive', 'ADJ'), ('17', 'NUM'), ('inch', 'NOUN'), ('Apple', 'NOUN'), ('monitor', 'NOUN'), ('available', 'ADJ'), ('to', 'PRT'), ('me', 'PRON'), ('at', 'ADP'), ('the', 'DET'), ('time', 'NOUN'), ('I', 'PRON'), ('made', 'VERB'), ('the', 'DET'), ('purchase', 'NOUN'), ('.', '.'), ('My', 'PRON'), ('overall', 'ADJ'), ('experience', 'NOUN'), ('with', 'ADP'), ('this', 'DET'), ('monitor', 'NOUN'), ('was', 'VERB'), ('very', 'ADV'), ('poor', 'ADJ'), ('.', '.'), ('When', 'ADV'), ('the', 'DET'), ('screen', 'NOUN'), ('was', 'VERB'), (\"n't\", 'ADV'), ('contracting', 'VERB'), ('or', 'CONJ'), ('glitching', 'VERB'), ('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN'), ('was', 'VERB'), ('poor', 'ADJ'), ('to', 'PRT'), ('fair', 'VERB'), ('.', '.'), ('I', 'PRON'), (\"'ve\", 'VERB'), ('viewed', 'VERB'), ('numerous', 'ADJ'), ('different', 'ADJ'), ('monitor', 'NOUN'), ('models', 'NOUN'), ('since', 'ADP'), ('I', 'PRON'), (\"'m\", 'VERB'), ('a', 'DET'), ('college', 'NOUN'), ('student', 'NOUN'), ('at', 'ADP'), ('UPM', 'NOUN'), ('in', 'ADP'), ('Madrid', 'NOUN'), ('and', 'CONJ'), ('this', 'DET'), ('particular', 'ADJ'), ('monitor', 'NOUN'), ('had', 'VERB'), ('as', 'ADP'), ('poor', 'ADJ'), ('of', 'ADP'), ('picture', 'NOUN'), ('quality', 'NOUN'), ('as', 'ADP'), ('any', 'DET'), ('I', 'PRON'), (\"'ve\", 'VERB'), ('seen', 'VERB'), ('.', '.')]\n"
]
}
],
"source": [
"from nltk import pos_tag, word_tokenize\n",
"print(pos_tag(word_tokenize(review), tagset='universal'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on this POS information, we can now use the WordNetLemmatizer correctly. The WordNetLemmatizer only distinguishes four POS categories: ADJ, ADV, NOUN, and VERB."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['I', 'purchase', 'Dell', 'monitor', 'because', 'of', 'budgetary', 'concern', 'item', 'be', 'most', 'inexpensive', '17', 'inch', 'Apple', 'monitor', 'available', 'me', 'at', 'time', 'I', 'make', 'purchase', 'My', 'overall', 'experience', 'with', 'monitor', 'be', 'very', 'poor', 'When', 'screen', 'be', \"n't\", 'contract', 'or', 'glitching', 'overall', 'picture', 'quality', 'be', 'poor', 'fair', 'I', \"'ve\", 'view', 'numerous', 'different', 'monitor', 'model', 'since', 'I', \"'m\", 'college', 'student', 'at', 'UPM', 'in', 'Madrid', 'and', 'particular', 'monitor', 'have', 'a', 'poor', 'of', 'picture', 'quality', 'a', 'I', \"'ve\", 'see']\n"
]
}
],
"source": [
"from nltk.stem import WordNetLemmatizer\n",
"\n",
"review_postagged = pos_tag(word_tokenize(review), tagset='universal')\n",
"pos_mapping = {'NOUN': 'n', 'ADJ': 'a', 'VERB': 'v', 'ADV': 'r', 'ADP': 'n', 'CONJ': 'n', \n",
" 'PRON': 'n', 'NUM': 'n', 'X': 'n' }\n",
"\n",
"wordnet = WordNetLemmatizer()\n",
"lemmas = [wordnet.lemmatize(w, pos=pos_mapping[tag]) for (w,tag) in review_postagged if tag in pos_mapping.keys()]\n",
"print(lemmas)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# NER"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Named Entity Recognition (NER) is an information extraction task that identifies named entities such as places, organisations or persons. NER usually relies on a tagged corpus. NER algorithms can be trained for new corpora."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(S\n",
" I/PRP\n",
" purchased/VBD\n",
" this/DT\n",
" (ORGANIZATION Dell/NNP)\n",
" monitor/NN\n",
" because/IN\n",
" of/IN\n",
" budgetary/JJ\n",
" concerns/NNS\n",
" ./.\n",
" This/DT\n",
" item/NN\n",
" was/VBD\n",
" the/DT\n",
" most/RBS\n",
" inexpensive/JJ\n",
" 17/CD\n",
" inch/NN\n",
" Apple/NNP\n",
" monitor/NN\n",
" available/JJ\n",
" to/TO\n",
" me/PRP\n",
" at/IN\n",
" the/DT\n",
" time/NN\n",
" I/PRP\n",
" made/VBD\n",
" the/DT\n",
" purchase/NN\n",
" ./.\n",
" My/PRP$\n",
" overall/JJ\n",
" experience/NN\n",
" with/IN\n",
" this/DT\n",
" monitor/NN\n",
" was/VBD\n",
" very/RB\n",
" poor/JJ\n",
" ./.\n",
" When/WRB\n",
" the/DT\n",
" screen/NN\n",
" was/VBD\n",
" n't/RB\n",
" contracting/VBG\n",
" or/CC\n",
" glitching/VBG\n",
" the/DT\n",
" overall/JJ\n",
" picture/NN\n",
" quality/NN\n",
" was/VBD\n",
" poor/JJ\n",
" to/TO\n",
" fair/VB\n",
" ./.\n",
" I/PRP\n",
" 've/VBP\n",
" viewed/VBN\n",
" numerous/JJ\n",
" different/JJ\n",
" monitor/NN\n",
" models/NNS\n",
" since/IN\n",
" I/PRP\n",
" 'm/VBP\n",
" a/DT\n",
" college/NN\n",
" student/NN\n",
" at/IN\n",
" (ORGANIZATION UPM/NNP)\n",
" in/IN\n",
" (GPE Madrid/NNP)\n",
" and/CC\n",
" this/DT\n",
" particular/JJ\n",
" monitor/NN\n",
" had/VBD\n",
" as/IN\n",
" poor/JJ\n",
" of/IN\n",
" picture/NN\n",
" quality/NN\n",
" as/IN\n",
" any/DT\n",
" I/PRP\n",
" 've/VBP\n",
" seen/VBN\n",
" ./.)\n"
]
}
],
"source": [
"from nltk import ne_chunk, pos_tag, word_tokenize\n",
"ne_tagged = ne_chunk(pos_tag(word_tokenize(review)), binary=False)\n",
"print(ne_tagged) "
]
},
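{
"cell_type": "markdown",
"metadata": {},
"source": [
"The named entities can be pulled out of the chunk tree by keeping only the labelled subtrees. The cell below is a minimal sketch, not part of the original session: `extract_entities` is a hypothetical helper, and the tree is built by hand (mirroring part of the output above) so that the cell runs without any NLTK model. In the notebook it could be called on `ne_tagged` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.tree import Tree\n",
"\n",
"# Hand-built tree with the same shape as the output of ne_chunk (illustrative data)\n",
"sample_tree = Tree('S', [('I', 'PRP'), ('purchased', 'VBD'), ('this', 'DT'),\n",
"                         Tree('ORGANIZATION', [('Dell', 'NNP')]), ('monitor', 'NN'),\n",
"                         ('in', 'IN'), Tree('GPE', [('Madrid', 'NNP')])])\n",
"\n",
"def extract_entities(tree):\n",
"    # Named entities are the children of the root that are themselves trees\n",
"    return [(child.label(), ' '.join(word for word, tag in child.leaves()))\n",
"            for child in tree if isinstance(child, Tree)]\n",
"\n",
"extract_entities(sample_tree)"
]
},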
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK comes with other NER implementations. We can also use online services, such as [OpenCalais](http://www.opencalais.com/), [DBpedia Spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service) or [TagME](http://tagme.di.unipi.it/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Parsing and Chunking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Parsing** is the process of obtaining a parse tree for a sentence given a grammar. It can be very useful for understanding the relationships among the words.\n",
"\n",
"As we have seen in class, we can follow a traditional approach and obtain a full parse tree, or use shallow parsing (chunking) and obtain a partial tree.\n",
"\n",
"We can use the Stanford parser that is integrated in NLTK, but it requires configuring the CLASSPATH, which can be a bit annoying. Instead, we are going to see some demos to understand how grammars work. In case you are interested, you can consult the [manual](http://www.nltk.org/api/nltk.parse.html) to run it.\n",
"\n",
"In the following example, you will run an interactive context-free parser, called [shift-reduce parser](http://www.nltk.org/book/ch08.html).\n",
"The pane on the left shows the grammar as a list of production rules. The pane on the right contains the stack and the remaining input.\n",
"\n",
"You should:\n",
"* Press 'step' repeatedly until the sentence is fully analyzed. With each step, the parser either shifts one word onto the stack or reduces one or more subtrees at the top of the stack into a new subtree.\n",
"* Try to act as the parser. Instead of pressing 'step', press 'shift' and 'reduce'. Follow the 'always shift before reduce' rule. It is likely you will reach a state where the parser cannot proceed. You can go back with 'Undo'."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.app import srparser_app\n",
"srparser_app.app()"
]
},
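{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where the graphical demo cannot be opened, the same shift-reduce strategy can be tried in text mode. This is a minimal sketch with a toy grammar chosen for illustration (it is not the grammar used by the demo); `trace=2` asks the parser to print each shift and reduce step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import nltk\n",
"\n",
"# Toy context-free grammar (an assumption made for this example)\n",
"grammar = nltk.CFG.fromstring('''\n",
"S -> NP VP\n",
"VP -> V NP\n",
"NP -> 'Mary' | Det N\n",
"Det -> 'a' | 'the'\n",
"N -> 'dog' | 'cat'\n",
"V -> 'saw' | 'chased'\n",
"''')\n",
"\n",
"parser = nltk.ShiftReduceParser(grammar, trace=2)\n",
"tokens = 'Mary saw a dog'.split()\n",
"\n",
"# parse() yields the trees found; a shift-reduce parser may find none\n",
"# even for a grammatical sentence, because it never backtracks\n",
"for tree in parser.parse(tokens):\n",
"    print(tree)"
]
},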
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Chunking** or **shallow parsing** aims at extracting relevant parts of the sentence. There are [two main approaches](http://www.nltk.org/book/ch07.html) to chunking: using regular expressions based on POS tags, or training a chunk parser.\n",
"\n",
"We are going to illustrate the first technique for extracting NP chunks.\n",
"\n",
"We define regular expressions for the chunks we want to get."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(S\n",
" I/PRON\n",
" purchased/VERB\n",
" (NP this/DET Dell/NOUN monitor/NOUN)\n",
" because/ADP\n",
" of/ADP\n",
" (NP budgetary/ADJ concerns/NOUN)\n",
" ./.\n",
" (NP This/DET item/NOUN)\n",
" was/VERB\n",
" (NP\n",
" the/DET\n",
" most/ADV\n",
" inexpensive/ADJ\n",
" 17/NUM\n",
" inch/NOUN\n",
" Apple/NOUN\n",
" monitor/NOUN)\n",
" available/ADJ\n",
" to/PRT\n",
" me/PRON\n",
" at/ADP\n",
" (NP the/DET time/NOUN)\n",
" I/PRON\n",
" made/VERB\n",
" (NP the/DET purchase/NOUN)\n",
" ./.\n",
" (NP My/PRON overall/ADJ experience/NOUN)\n",
" with/ADP\n",
" (NP this/DET monitor/NOUN)\n",
" was/VERB\n",
" very/ADV\n",
" poor/ADJ\n",
" ./.\n",
" When/ADV\n",
" (NP the/DET screen/NOUN)\n",
" was/VERB\n",
" n't/ADV\n",
" contracting/VERB\n",
" or/CONJ\n",
" glitching/VERB\n",
" (NP the/DET overall/ADJ picture/NOUN quality/NOUN)\n",
" was/VERB\n",
" poor/ADJ\n",
" to/PRT\n",
" fair/VERB\n",
" ./.\n",
" I/PRON\n",
" 've/VERB\n",
" viewed/VERB\n",
" (NP numerous/ADJ different/ADJ monitor/NOUN models/NOUN)\n",
" since/ADP\n",
" I/PRON\n",
" 'm/VERB\n",
" (NP a/DET college/NOUN student/NOUN)\n",
" at/ADP\n",
" (NP UPM/NOUN)\n",
" in/ADP\n",
" (NP Madrid/NOUN)\n",
" and/CONJ\n",
" (NP this/DET particular/ADJ monitor/NOUN)\n",
" had/VERB\n",
" as/ADP\n",
" poor/ADJ\n",
" of/ADP\n",
" (NP picture/NOUN quality/NOUN)\n",
" as/ADP\n",
" any/DET\n",
" I/PRON\n",
" 've/VERB\n",
" seen/VERB\n",
" ./.)\n"
]
}
],
"source": [
"from nltk.chunk.regexp import RegexpParser\n",
"pattern = \"\"\"NP: {<PRON><ADJ><NOUN>+} \n",
" {<DET>?<ADV>?<ADJ|NUM>*?<NOUN>+}\n",
" \"\"\"\n",
"NPChunker = RegexpParser(pattern)\n",
"\n",
"reviews_pos = (pos_tag(word_tokenize(review), tagset='universal'))\n",
"\n",
"chunks_np = NPChunker.parse(reviews_pos)\n",
"print(chunks_np)\n",
"chunks_np.draw()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can traverse the trees and obtain the strings as follows."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[Tree('NP', [('this', 'DET'), ('Dell', 'NOUN'), ('monitor', 'NOUN')]),\n",
" Tree('NP', [('budgetary', 'ADJ'), ('concerns', 'NOUN')]),\n",
" Tree('NP', [('This', 'DET'), ('item', 'NOUN')]),\n",
" Tree('NP', [('the', 'DET'), ('most', 'ADV'), ('inexpensive', 'ADJ'), ('17', 'NUM'), ('inch', 'NOUN'), ('Apple', 'NOUN'), ('monitor', 'NOUN')]),\n",
" Tree('NP', [('the', 'DET'), ('time', 'NOUN')]),\n",
" Tree('NP', [('the', 'DET'), ('purchase', 'NOUN')]),\n",
" Tree('NP', [('My', 'PRON'), ('overall', 'ADJ'), ('experience', 'NOUN')]),\n",
" Tree('NP', [('this', 'DET'), ('monitor', 'NOUN')]),\n",
" Tree('NP', [('the', 'DET'), ('screen', 'NOUN')]),\n",
" Tree('NP', [('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN')]),\n",
" Tree('NP', [('numerous', 'ADJ'), ('different', 'ADJ'), ('monitor', 'NOUN'), ('models', 'NOUN')]),\n",
" Tree('NP', [('a', 'DET'), ('college', 'NOUN'), ('student', 'NOUN')]),\n",
" Tree('NP', [('UPM', 'NOUN')]),\n",
" Tree('NP', [('Madrid', 'NOUN')]),\n",
" Tree('NP', [('this', 'DET'), ('particular', 'ADJ'), ('monitor', 'NOUN')]),\n",
" Tree('NP', [('picture', 'NOUN'), ('quality', 'NOUN')])]"
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def extractTrees(parsed_tree, category='NP'):\n",
" return list(parsed_tree.subtrees(filter=lambda x: x.label()==category))\n",
"\n",
"extractTrees(chunks_np, 'NP')"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['this Dell monitor',\n",
" 'budgetary concerns',\n",
" 'This item',\n",
" 'the most inexpensive 17 inch Apple monitor',\n",
" 'the time',\n",
" 'the purchase',\n",
" 'My overall experience',\n",
" 'this monitor',\n",
" 'the screen',\n",
" 'the overall picture quality',\n",
" 'numerous different monitor models',\n",
" 'a college student',\n",
" 'UPM',\n",
" 'Madrid',\n",
" 'this particular monitor',\n",
" 'picture quality']"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def extractStrings(parsed_tree, category='NP'):\n",
" return [\" \".join(word for word, pos in vp.leaves()) for vp in extractTrees(parsed_tree, category)]\n",
" \n",
"extractStrings(chunks_np)"
]
},
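{
"cell_type": "markdown",
"metadata": {},
"source": [
"The whole chunking workflow can be checked on a small hand-tagged sentence. This is a minimal sketch: the tagged tokens are written by hand (an assumption, so that no tagger model is needed) and the grammar is a simplified version of the one used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from nltk.chunk.regexp import RegexpParser\n",
"\n",
"# Hand-tagged tokens using the universal tagset (illustrative data)\n",
"tagged = [('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN'),\n",
"          ('was', 'VERB'), ('poor', 'ADJ'), ('in', 'ADP'), ('Madrid', 'NOUN')]\n",
"\n",
"# Optional determiner, any number of adjectives, one or more nouns\n",
"chunker = RegexpParser('NP: {<DET>?<ADJ>*<NOUN>+}')\n",
"tree = chunker.parse(tagged)\n",
"\n",
"# Join the words of every NP subtree\n",
"[' '.join(word for word, tag in subtree.leaves())\n",
" for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]"
]
},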
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,742 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vector Representation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Tools](#Tools)\n",
"* [Vector representation: Count vector](#Vector-representation:-Count-vector)\n",
"* [Binary vectors](#Binary-vectors)\n",
"* [Bigram vectors](#Bigram-vectors)\n",
"* [Tf-idf vector representation](#Tf-idf-vector-representation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we are going to transform text into feature vectors, using several representations as presented in class.\n",
"\n",
"We are going to use the examples from the slides."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"doc1 = 'Summer is coming but Summer is short'\n",
"doc2 = 'I like the Summer and I like the Winter'\n",
"doc3 = 'I like sandwiches and I like the Winter'\n",
"documents = [doc1, doc2, doc3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different tools we have presented so far (NLTK, Scikit-Learn, TextBlob and CLiPS) provide overlapping functionalities for obtaining vector representations and applying machine learning algorithms.\n",
"\n",
"We are going to focus on scikit-learn so that we can also easily use Pandas, as we saw in the previous topic.\n",
"\n",
"Scikit-learn provides specific facilities for processing texts, as described in the [manual](http://scikit-learn.org/stable/modules/feature_extraction.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vector representation: Count vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides two classes for count vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). The latter is more memory-efficient but, since hashing is one-way, it does not let us map features back to terms and see which ones are most important, so we use the first class. Nevertheless, they are compatible, so they can be interchanged in production environments.\n",
"\n",
"The first step for vectorizing with scikit-learn is creating a CountVectorizer object; then we call 'fit_transform' to learn the vocabulary and transform the documents into vectors."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
" lowercase=True, max_df=1.0, max_features=5000, min_df=1,\n",
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
" tokenizer=None, vocabulary=None)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"vectorizer = CountVectorizer(analyzer = \"word\", max_features = 5000) \n",
"vectorizer"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"As we can see, [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) comes with many options. We can configure, for example, the minimum or maximum document frequency of a term (*min_df*, *max_df*), the maximum number of features (*max_features*), whether we analyze words or characters (*analyzer*), or whether the output is binary (*binary*). *CountVectorizer* also lets us preprocess the input (*preprocessor*) before tokenizing it (*tokenizer*), and exclude stop words (*stop_words*).\n",
"\n",
"We can use NLTK preprocessing and tokenizer functions to tune *CountVectorizer* using these parameters.\n",
"\n",
"Let us see what the vectors look like."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"<3x10 sparse matrix of type '<class 'numpy.int64'>'\n",
"\twith 15 stored elements in Compressed Sparse Row format>"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors = vectorizer.fit_transform(documents)\n",
"vectors"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"We see the vectors are stored as a sparse matrix of 3x10 dimensions (3 documents, 10 terms).\n",
"We can print the matrix as well as the feature names."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 1 1 2 0 0 1 2 0 0]\n",
" [1 0 0 0 2 0 0 1 2 1]\n",
" [1 0 0 0 2 1 0 0 1 1]]\n",
"['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short', 'summer', 'the', 'winter']\n"
]
}
],
"source": [
"print(vectors.toarray())\n",
"print(vectorizer.get_feature_names())"
]
},
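{
"cell_type": "markdown",
"metadata": {},
"source": [
"The count matrix is easier to read as a table with one row per document and one column per term. This is a minimal sketch with Pandas: the column order is recovered from the fitted `vocabulary_` attribute (which maps each term to its column index), and the row labels `doc1` to `doc3` are just illustrative names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"docs = ['Summer is coming but Summer is short',\n",
"        'I like the Summer and I like the Winter',\n",
"        'I like sandwiches and I like the Winter']\n",
"\n",
"cv = CountVectorizer()\n",
"counts = cv.fit_transform(docs)\n",
"\n",
"# vocabulary_ maps each term to its column index, so sorting by index gives the column order\n",
"terms = sorted(cv.vocabulary_, key=cv.vocabulary_.get)\n",
"pd.DataFrame(counts.toarray(), index=['doc1', 'doc2', 'doc3'], columns=terms)"
]
},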
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"As you can see, the pronoun 'I' has been removed because the default token_pattern only keeps tokens of two or more alphanumeric characters. \n",
"We can change this as follows."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['and',\n",
" 'but',\n",
" 'coming',\n",
" 'i',\n",
" 'is',\n",
" 'like',\n",
" 'sandwiches',\n",
" 'short',\n",
" 'summer',\n",
" 'the',\n",
" 'winter']"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words=None, token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
"vectorizer.get_feature_names()"
]
},
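{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of the two token patterns can be checked directly with Python's `re` module, independently of scikit-learn. Note that this sketch skips the lowercasing that *CountVectorizer* applies by default."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import re\n",
"\n",
"text = 'I like the Summer'\n",
"\n",
"# Default pattern: tokens of two or more word characters, so 'I' is dropped\n",
"print(re.findall(r'(?u)\\b\\w\\w+\\b', text))\n",
"# Modified pattern: tokens of one or more word characters, so 'I' is kept\n",
"print(re.findall(r'(?u)\\b\\w+\\b', text))"
]
},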
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now filter the stop words (it will remove 'and', 'but', 'I', 'is' and 'the')."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
"vectorizer.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"frozenset({'could', 'sixty', 'onto', 'by', 'against', 'up', 'a', 'everything', 'other', 'otherwise', 'ourselves', 'beside', 'nowhere', 'then', 'below', 'put', 'ten', 'such', 'cannot', 'either', 'due', 'hasnt', 'whereupon', 'were', 'once', 'at', 'for', 'front', 'get', 'whereas', 'that', 'eight', 'another', 'except', 'of', 'wherever', 'over', 'to', 'whom', 'you', 'former', 'behind', 'yours', 'yourself', 'what', 'even', 'however', 'go', 'less', 'bottom', 'may', 'along', 'is', 'can', 'move', 'eg', 'somewhere', 'latterly', 'seemed', 'thence', 'becoming', 'himself', 'whether', 'six', 'first', 'off', 'do', 'many', 'namely', 'never', 'because', 'mostly', 'nevertheless', 'thereupon', 'here', 'least', 'anyone', 'one', 'others', 'cry', 'they', 'thereby', 'ie', 'am', 'this', 'would', 'any', 'while', 'see', 'too', 'your', 'somehow', 'within', 'same', 'sometimes', 'thereafter', 'must', 'take', 're', 'both', 'fill', 'nor', 'sometime', 'he', 'third', 'more', 'also', 'most', 'during', 'much', 'our', 'thick', 'enough', 'full', 'toward', 'with', 'mill', 'anyhow', 'nobody', 'why', 'thru', 'although', 'nothing', 'meanwhile', 'or', 'some', 'ltd', 'wherein', 'thus', 'someone', 'whereby', 'who', 'un', 'are', 'hundred', 'whereafter', 'fire', 'twenty', 'only', 'several', 'among', 'no', 'than', 'before', 'been', 'else', 'find', 'fifteen', 'hence', 'ours', 'already', 'be', 'besides', 'next', 'interest', 'whither', 'whole', 'eleven', 'without', 'five', 'show', 'in', 'throughout', 'own', 'amongst', 'will', 'neither', 'everywhere', 'part', 'give', 'my', 'hers', 'his', 'upon', 'well', 'him', 'yourselves', 'whatever', 'cant', 'though', 'had', 'again', 'every', 'noone', 'top', 'which', 'de', 'almost', 'system', 'under', 'down', 'latter', 'above', 'whence', 'found', 'myself', 'three', 'those', 'become', 'moreover', 'but', 'anyway', 'beyond', 'from', 'now', 'as', 'seeming', 'con', 'themselves', 'hereupon', 'each', 'serious', 'two', 'across', 'out', 'the', 'therein', 'between', 'inc', 'where', 
'anything', 'seem', 'co', 'therefore', 'whoever', 'herein', 'about', 'herself', 'should', 'anywhere', 'how', 'we', 'after', 'describe', 'being', 'etc', 'very', 'not', 'an', 'me', 'call', 'per', 'detail', 'still', 'around', 'hereby', 'sincere', 'their', 'has', 'became', 'beforehand', 'everyone', 'hereafter', 'made', 'ever', 'indeed', 'itself', 'something', 'afterwards', 'none', 'done', 'nine', 'alone', 'please', 'its', 'name', 'since', 'on', 'she', 'bill', 'have', 'mine', 'few', 'her', 'seems', 'always', 'side', 'forty', 'further', 'via', 'last', 'amount', 'towards', 'fify', 'through', 'whose', 'couldnt', 'perhaps', 'thin', 'until', 'becomes', 'elsewhere', 'and', 'i', 'them', 'together', 'us', 'was', 'when', 'rather', 'whenever', 'formerly', 'keep', 'so', 'back', 'there', 'amoungst', 'might', 'these', 'all', 'empty', 'often', 'into', 'it', 'twelve', 'yet', 'if', 'four'})\n"
]
}
],
"source": [
"#stop words in scikit-learn for English\n",
"print(vectorizer.get_stop_words())"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0, 0, 1, 2, 0],\n",
" [0, 2, 0, 0, 1, 1],\n",
" [0, 2, 1, 0, 0, 1]], dtype=int64)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Vectors\n",
"f_array = vectors.toarray()\n",
"f_array"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now compute the cosine **distance** (one minus the cosine similarity) between vectors."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.666666666667 1.0 0.166666666667\n"
]
}
],
"source": [
"from scipy.spatial.distance import cosine\n",
"d12 = cosine(f_array[0], f_array[1])\n",
"d13 = cosine(f_array[0], f_array[2])\n",
"d23 = cosine(f_array[1], f_array[2])\n",
"print(d12, d13, d23)"
]
},
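{
"cell_type": "markdown",
"metadata": {},
"source": [
"SciPy's `cosine` returns the cosine *distance*, that is, one minus the cosine similarity. As a check, the first value can be reproduced by hand with NumPy, using the count vectors of the first two documents shown above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"v1 = np.array([1, 0, 0, 1, 2, 0])  # doc1\n",
"v2 = np.array([0, 2, 0, 0, 1, 1])  # doc2\n",
"\n",
"# cosine similarity = dot product divided by the product of the norms\n",
"similarity = v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))\n",
"print(1 - similarity)  # approximately 0.667, as d12 above"
]
},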
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Binary vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also get **binary vectors** as follows."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', binary=True) \n",
"vectors = vectorizer.fit_transform(documents)\n",
"vectorizer.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0, 0, 1, 1, 0],\n",
" [0, 1, 0, 0, 1, 1],\n",
" [0, 1, 1, 0, 0, 1]], dtype=int64)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bigram vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also easy to get bigram vectors."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['coming summer',\n",
" 'like sandwiches',\n",
" 'like summer',\n",
" 'like winter',\n",
" 'sandwiches like',\n",
" 'summer coming',\n",
" 'summer like',\n",
" 'summer short']"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', ngram_range=(2, 2))\n",
"vectors = vectorizer.fit_transform(documents)\n",
"vectorizer.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[1, 0, 0, 0, 0, 1, 0, 1],\n",
" [0, 0, 1, 1, 0, 0, 1, 0],\n",
" [0, 1, 0, 1, 1, 0, 0, 0]], dtype=int64)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors.toarray()"
]
},
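{
"cell_type": "markdown",
"metadata": {},
"source": [
"Unigrams and bigrams can also be combined in a single vocabulary by widening `ngram_range`. This is a minimal sketch on the same three documents; the variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"docs = ['Summer is coming but Summer is short',\n",
"        'I like the Summer and I like the Winter',\n",
"        'I like sandwiches and I like the Winter']\n",
"\n",
"# (1, 2) keeps both single words and pairs of consecutive words\n",
"cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))\n",
"cv.fit(docs)\n",
"sorted(cv.vocabulary_)"
]
},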
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tf-idf vector representation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we can also get a tf-idf vector representation using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) instead of CountVectorizer."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(analyzer=\"word\", stop_words='english')\n",
"vectors = vectorizer.fit_transform(documents)\n",
"vectorizer.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.48148213, 0. , 0. , 0.48148213, 0.73235914,\n",
" 0. ],\n",
" [ 0. , 0.81649658, 0. , 0. , 0.40824829,\n",
" 0.40824829],\n",
" [ 0. , 0.77100584, 0.50689001, 0. , 0. ,\n",
" 0.38550292]])"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors.toarray()"
]
},
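{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see where these numbers come from, the first row can be reproduced by hand. This sketch assumes the default settings of *TfidfVectorizer* (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization); the counts are those of doc1 for the terms 'coming', 'short' and 'summer'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"n_docs = 3\n",
"tf = np.array([1, 1, 2])  # counts of 'coming', 'short', 'summer' in doc1\n",
"df = np.array([1, 1, 2])  # number of documents that contain each term\n",
"\n",
"idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf\n",
"weights = tf * idf\n",
"print(weights / np.linalg.norm(weights))  # approximately [0.481 0.481 0.732]"
]
},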
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now compute the similarity of a query and a set of documents as follows."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train = [doc1, doc2, doc3]\n",
"vectorizer = TfidfVectorizer(analyzer=\"word\", stop_words='english')\n",
"\n",
"# We learn the vocabulary (fit) and transform the docs into vectors\n",
"vectors = vectorizer.fit_transform(train)\n",
"vectorizer.get_feature_names()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.48148213, 0. , 0. , 0.48148213, 0.73235914,\n",
" 0. ],\n",
" [ 0. , 0.81649658, 0. , 0. , 0.40824829,\n",
" 0.40824829],\n",
" [ 0. , 0.77100584, 0.50689001, 0. , 0. ,\n",
" 0.38550292]])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vectors.toarray()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides a method to calculate the cosine similarity between one vector and a set of vectors. Based on this, we can rank the similarity. In this case, the ranking for the query is [d1, d2, d3]."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0.38324078, 0.24713249, 0.23336362]])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics.pairwise import cosine_similarity\n",
"\n",
"query = ['winter short']\n",
"\n",
"# We transform the query into a vector of the learnt vocabulary\n",
"vector_query = vectorizer.transform(query)\n",
"\n",
"# Here we calculate the cosine similarity of the query to the docs\n",
"cosine_similarity(vector_query, vectors)"
]
},
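{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking can be obtained programmatically by sorting the similarity scores. This is a minimal sketch with NumPy, using the scores printed above (the names `scores` and `doc_names` are illustrative)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"scores = np.array([0.38324078, 0.24713249, 0.23336362])\n",
"doc_names = ['d1', 'd2', 'd3']\n",
"\n",
"# argsort sorts in ascending order, so we negate the scores to rank from most to least similar\n",
"ranking = np.argsort(-scores)\n",
"[doc_names[i] for i in ranking]"
]
},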
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same result can be obtained with pairwise metrics (kernels in ML terminology) if we use the linear kernel: tf-idf vectors are L2-normalized by default, so their dot product equals their cosine similarity."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.38324078, 0.24713249, 0.23336362])"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics.pairwise import linear_kernel\n",
"# Use a new name so that we do not shadow the imported cosine_similarity function\n",
"similarities = linear_kernel(vector_query, vectors).flatten()\n",
"similarities"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#converting-text-to-vectors) Scikit-learn Convert Text to Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,439 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Text Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Corpus](#Corpus)\n",
"* [Classifier](#Classifier)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n",
"\n",
"The main objectives of this session are:\n",
"* Understand how to apply machine learning techniques on textual sources\n",
"* Learn the facilities provided by scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Corpus"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use one of the corpora that come prepackaged with Scikit-learn: the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroups dataset contains about 20,000 documents that belong to 20 topics.\n",
"\n",
"We now inspect the corpus using the facilities from Scikit-learn, as explained in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)."
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
]
}
],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"# We remove metadata to avoid bias in the classification\n",
"newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n",
"\n",
"# print categories\n",
"print(list(newsgroups_train.target_names))"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"20\n"
]
}
],
"source": [
"#Number of categories\n",
"print(len(newsgroups_train.target_names))"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Category id 4 comp.sys.mac.hardware\n",
"Doc A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4 m floppies are especially requested.\n",
"\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll. Thanks.\n"
]
}
],
"source": [
"# Show a document\n",
"docid = 1\n",
"doc = newsgroups_train.data[docid]\n",
"cat = newsgroups_train.target[docid]\n",
"\n",
"print(\"Category id \" + str(cat) + \" \" + newsgroups_train.target_names[cat])\n",
"print(\"Doc \" + doc)"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(11314,)"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Number of files\n",
"newsgroups_train.filenames.shape"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(11314, 101323)"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Obtain a vector\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
"\n",
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
"vectors_train.shape"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"66.80510871486653"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The tf-idf vectors are very sparse: on average about 67 non-zero components out of 101,323 dimensions (about 0.07%)\n",
"vectors_train.nnz / float(vectors_train.shape[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we have the vectors, we can train classifiers (or apply other machine learning algorithms, such as clustering), as we saw in the previous notebooks on machine learning with scikit-learn."
]
},
{
"cell_type": "code",
"execution_count": 138,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0.69545360719001303"
]
},
"execution_count": 138,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"from sklearn import metrics\n",
"\n",
"# The vocabulary was learnt from the training set (fit_transform above).\n",
"# The test set is only transformed with that vocabulary (transform, not fit_transform).\n",
"vectors_test = vectorizer.transform(newsgroups_test.data)\n",
"\n",
"model = MultinomialNB(alpha=.01)\n",
"model.fit(vectors_train, newsgroups_train.target)\n",
"\n",
"pred = model.predict(vectors_test)\n",
"\n",
"metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get a weighted F1 score of about 0.70 across the 20 categories. This could be improved (hyperparameter optimization, preprocessing, etc.)."
]
},
{
"cell_type": "code",
"execution_count": 141,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dimensionality: 101323\n",
"density: 1.000000\n"
]
}
],
"source": [
"from sklearn.utils.extmath import density\n",
"\n",
"print(\"dimensionality: %d\" % model.coef_.shape[1])\n",
"print(\"density: %f\" % density(model.coef_))"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"alt.atheism: islam atheists say just religion atheism think don people god\n",
"comp.graphics: looking format 3d know program file files thanks image graphics\n",
"comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n",
"comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n",
"comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n",
"comp.windows.x: using windows x11r5 use application thanks widget server motif window\n",
"misc.forsale: asking email sell price condition new shipping offer 00 sale\n",
"rec.autos: don ford new good dealer just engine like cars car\n",
"rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n",
"rec.sport.baseball: braves players pitching hit runs games game baseball team year\n",
"rec.sport.hockey: league year nhl games season players play hockey team game\n",
"sci.crypt: people use escrow nsa keys government chip clipper encryption key\n",
"sci.electronics: don thanks voltage used know does like circuit power use\n",
"sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n",
"sci.space: just lunar earth shuttle like moon launch orbit nasa space\n",
"soc.religion.christian: believe faith christian christ bible people christians church jesus god\n",
"talk.politics.guns: just law firearms government fbi don weapons people guns gun\n",
"talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n",
"talk.politics.misc: know state clinton president just think tax don government people\n",
"talk.religion.misc: think don koresh objective christians bible people christian jesus god\n"
]
}
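],
"source": [
"# We can review the top features per category in the Naive Bayes model (attribute coef_)\n",
"# Note: recent scikit-learn releases replace get_feature_names() with get_feature_names_out()\n",
"# and expose feature_log_prob_ instead of coef_ for naive Bayes models\n",
"import numpy as np\n",
"\n",
"def show_top10(classifier, vectorizer, categories):\n",
"    feature_names = np.asarray(vectorizer.get_feature_names())\n",
"    for i, category in enumerate(categories):\n",
"        # indices of the 10 features with the highest weight for this category\n",
"        top10 = np.argsort(classifier.coef_[i])[-10:]\n",
"        print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
"\n",
"\n",
"show_top10(model, vectorizer, newsgroups_train.target_names)"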
]
},
{
"cell_type": "code",
"execution_count": 136,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 3 15]\n",
"['comp.sys.ibm.pc.hardware', 'soc.religion.christian']\n"
]
}
],
"source": [
"# We try the classifier on two new documents\n",
"\n",
"new_docs = ['This is a survey of PC computers', 'God is love']\n",
"new_vectors = vectorizer.transform(new_docs)\n",
"\n",
"pred_docs = model.predict(new_vectors)\n",
"print(pred_docs)\n",
"print([newsgroups_train.target_names[i] for i in pred_docs])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,663 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Semantic Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Corpus](#Corpus)\n",
"* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
"* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
"* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
"\n",
"The main objectives of this session are:\n",
"* Understand the models and their differences\n",
"* Learn to use some of the most popular NLP libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Corpus"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use one of the corpora that come prepackaged with scikit-learn: the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). It contains about 20,000 documents that belong to 20 topics.\n",
"\n",
"We now inspect the corpus using the facilities provided by scikit-learn, as explained in the [scikit-learn documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(2034, 2807)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"# We filter only some categories, otherwise we have 20 categories\n",
"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
"# We remove metadata to avoid bias in the classification\n",
"newsgroups_train = fetch_20newsgroups(subset='train', \n",
" remove=('headers', 'footers', 'quotes'), \n",
" categories=categories)\n",
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
" categories=categories)\n",
"\n",
"\n",
"# Obtain a vector\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
"\n",
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
"vectors_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Converting Scikit-learn to gensim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although scikit-learn provides an LDA implementation, the *gensim* package is more popular; it also provides an LSI implementation, among other functionality. Fortunately, scikit-learn sparse matrices can be used in gensim through the function *matutils.Sparse2Corpus()*. If you use LDA intensively, however, it can be more convenient to build the corpus with gensim's own functions.\n",
"\n",
"You should install *gensim* first. Run 'conda install -c anaconda gensim=0.12.4' in a terminal."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
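},
"outputs": [],
"source": [
"from gensim import matutils\n",
"\n",
"# Map each feature index to its token (used as id2word by gensim)\n",
"dictionary = dict(enumerate(vectorizer.get_feature_names()))\n",
"\n",
"# scikit-learn stores documents as rows, whereas Sparse2Corpus expects them as columns by default\n",
"corpus_tfidf = matutils.Sparse2Corpus(vectors_train, documents_columns=False)"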
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Latent Dirichlet Allocation (LDA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As discussed above, we use the LDA implementation from *gensim* rather than the one in scikit-learn, and we train it on the sparse matrix converted with *matutils.Sparse2Corpus()*."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.ldamodel import LdaModel\n",
"\n",
"# It takes a long time\n",
"\n",
"# train the lda model, choosing number of topics equal to 4\n",
"lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0,\n",
" '0.004*objects + 0.004*obtained + 0.003*comets + 0.003*manhattan + 0.003*member + 0.003*beginning + 0.003*center + 0.003*groups + 0.003*aware + 0.003*increased'),\n",
" (1,\n",
" '0.003*activity + 0.002*objects + 0.002*professional + 0.002*eyes + 0.002*manhattan + 0.002*pressure + 0.002*netters + 0.002*chosen + 0.002*attempted + 0.002*medical'),\n",
" (2,\n",
" '0.003*mechanism + 0.003*led + 0.003*platform + 0.003*frank + 0.003*mormons + 0.003*aeronautics + 0.002*concepts + 0.002*header + 0.002*forces + 0.002*profit'),\n",
" (3,\n",
" '0.005*diameter + 0.005*having + 0.004*complex + 0.004*conclusions + 0.004*activity + 0.004*looking + 0.004*action + 0.004*inflatable + 0.004*defined + 0.004*association')]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the topics\n",
"lda.print_topics(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since there are some problems when translating the corpus from scikit-learn for LSI, we are now going to create the corpus 'natively' with gensim."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import the gensim.corpora module to generate the dictionary\n",
"from gensim import corpora\n",
"\n",
"from nltk import word_tokenize\n",
"from nltk.corpus import stopwords\n",
"from nltk import RegexpTokenizer\n",
"\n",
"import string\n",
"\n",
"def preprocess(words):\n",
"    # Note: this pattern only keeps tokens that start with an uppercase letter\n",
"    tokenizer = RegexpTokenizer(r'[A-Z]\\w+')\n",
"    tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
"    # remove English stop words\n",
"    stoplist = stopwords.words('english')\n",
"    tokens_stop = [w for w in tokens if w not in stoplist]\n",
"    # remove tokens that are punctuation marks\n",
"    punctuation = set(string.punctuation)\n",
"    tokens_clean = [w for w in tokens_stop if w not in punctuation]\n",
"    return tokens_clean\n",
"\n",
"#words = preprocess(newsgroups_train.data)\n",
"#dictionary = corpora.Dictionary(newsgroups_train.data)\n",
"\n",
"texts = [preprocess(document) for document in newsgroups_train.data]\n",
"\n",
"dictionary = corpora.Dictionary(texts)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
]
}
],
"source": [
"# You can save the dictionary\n",
"dictionary.save('newsgroup.dict')\n",
"\n",
"print(dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Generate a list of docs, where each doc is a list of words\n",
"\n",
"docs = [preprocess(doc) for doc in newsgroups_train.data]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import the gensim.corpora module to generate dictionary\n",
"from gensim import corpora\n",
"\n",
"dictionary = corpora.Dictionary(docs)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
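},
"outputs": [],
"source": [
"# You can optionally save the dictionary\n",
"dictionary.save('newsgroups.dict')\n",
"\n",
"# Load an LDA model from disk; this assumes it was saved before with lda.save('newsgroups.lda')\n",
"lda = LdaModel.load('newsgroups.lda')"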
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
]
}
],
"source": [
"# We can print the dictionary; it is a mapping between ids and tokens\n",
"\n",
"print(dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# construct the corpus representing each document as a bag-of-words (bow) vector\n",
"corpus = [dictionary.doc2bow(doc) for doc in docs]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import TfidfModel\n",
"\n",
"# calculate tfidf\n",
"tfidf_model = TfidfModel(corpus)\n",
"corpus_tfidf = tfidf_model[corpus]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
]
}
],
"source": [
"#print tf-idf of first document\n",
"print(corpus_tfidf[0])"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.ldamodel import LdaModel\n",
"\n",
"# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
"\n",
"lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0,\n",
" '0.010*targa + 0.007*ns + 0.006*thanks + 0.006*davidian + 0.006*ssrt + 0.006*yayayay + 0.005*craig + 0.005*bull + 0.005*gerald + 0.005*sorry'),\n",
" (1,\n",
" '0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades'),\n",
" (2,\n",
" '0.007*koresh + 0.007*moon + 0.007*western + 0.006*plane + 0.006*jeff + 0.006*unix + 0.005*bible + 0.005*also + 0.005*basically + 0.005*bob'),\n",
" (3,\n",
" '0.011*whatever + 0.008*joy + 0.007*happy + 0.006*virtual + 0.006*reality + 0.004*really + 0.003*samuel___ + 0.003*oh + 0.003*virtually + 0.003*toaster')]"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the topics\n",
"lda_model.print_topics(4)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.085176135689180726), (1, 0.6919655173835938), (2, 0.1377903468164027), (3, 0.0850680001108228)]\n"
]
}
],
"source": [
"# check the LDA vector for the first document\n",
"corpus_lda = lda_model[corpus_tfidf]\n",
"print(corpus_lda[0])"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('lord', 1), ('god', 2)]\n"
]
}
],
"source": [
"#predict topics of a new doc\n",
"new_doc = \"God is love and God is the Lord\"\n",
"#transform into BOW space\n",
"bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
"print([(dictionary[id], count) for id, count in bow_vector])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.062509420435514051), (1, 0.81246608790618835), (2, 0.062508281488992554), (3, 0.062516210169305114)]\n"
]
}
],
"source": [
"#transform into LDA space\n",
"lda_vector = lda_model[bow_vector]\n",
"print(lda_vector)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
]
}
],
"source": [
"# print the document's single most prominent LDA topic\n",
"print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.10392179866025079), (1, 0.68822094221870811), (2, 0.10391916429993264), (3, 0.10393809482110833)]\n",
"0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
]
}
],
"source": [
"lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
"print(lda_vector_tfidf)\n",
"# print the document's single most prominent LDA topic\n",
"print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Latent Semantic Indexing (LSI)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.lsimodel import LsiModel\n",
"\n",
"# It takes a long time\n",
"\n",
"# train the lsi model, choosing number of topics equal to 4\n",
"\n",
"lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"[(0,\n",
" '0.769*\"god\" + 0.346*\"jesus\" + 0.235*\"bible\" + 0.204*\"christian\" + 0.149*\"christians\" + 0.107*\"christ\" + 0.090*\"well\" + 0.085*\"koresh\" + 0.081*\"kent\" + 0.080*\"christianity\"'),\n",
" (1,\n",
" '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.153*\"hi\" + 0.123*\"god\" + -0.112*\"sorry\" + -0.087*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"vga\"'),\n",
" (2,\n",
" '0.780*\"well\" + -0.229*\"god\" + 0.165*\"yes\" + -0.153*\"thanks\" + 0.133*\"ico\" + 0.133*\"tek\" + 0.130*\"bronx\" + 0.130*\"beauchaine\" + 0.130*\"queens\" + 0.129*\"manhattan\"'),\n",
" (3,\n",
" '0.340*\"well\" + -0.335*\"ico\" + -0.334*\"tek\" + -0.328*\"beauchaine\" + -0.328*\"bronx\" + -0.328*\"queens\" + -0.326*\"manhattan\" + -0.305*\"bob\" + -0.305*\"com\" + -0.072*\"god\"')]"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check the topics\n",
"lsi_model.print_topics(4)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
]
}
],
"source": [
"# print the tf-idf vector of the first document\n",
"# (its LSI vector would be obtained with lsi_model[corpus_tfidf[0]])\n",
"print(corpus_tfidf[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,797 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Combining Features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Dataset](#Dataset)\n",
"* [Loading the dataset](#Loading-the-dataset)\n",
"* [Transformers](#Transformers)\n",
"* [Lexical features](#Lexical-features)\n",
"* [Syntactic features](#Syntactic-features)\n",
"* [Feature Extraction Pipelines](#Feature-Extraction-Pipelines)\n",
"* [Feature Union Pipeline](#Feature-Union-Pipeline)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous sections we have seen how to analyse lexical, syntactic and semantic features. All these features can be useful as input to machine learning techniques.\n",
"\n",
"In this notebook we are going to learn how to combine them.\n",
"\n",
"There are several approaches for combining features at the character, lexical, syntactic, semantic or behavioural levels.\n",
"\n",
"Some authors obtain the different features as lists and then join these lists; a good example is shown [here](http://www.aicbt.com/authorship-attribution/) for authorship attribution. Other authors use *FeatureUnion* to join the different sparse matrices, as shown [here](http://es.slideshare.net/PyData/authorship-attribution-forensic-linguistics-with-python-scikit-learn-pandas-kostas-perifanos) and [here](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). Finally, other authors use FeatureUnions with weights, as shown in the [scikit-learn documentation](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html).\n",
"\n",
"A *FeatureUnion* is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object.\n",
"\n",
"In this chapter we are going to follow the combination of Pipelines and FeatureUnions, as described in scikit-learn, [Zac Stewart](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html), his [Kaggle submission](https://github.com/zacstewart/kaggle_seeclickfix/blob/master/estimator.py), and [Michelle Fullwood](https://michelleful.github.io/code-blog/2015/06/20/pipelines/), since it provides a simple and structured approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use a [dataset from Kaggle](https://www.kaggle.com/c/asap-aes/) for automatic essay scoring, a very interesting area for teachers.\n",
"\n",
"For this competition, there are eight essay sets. Each of the sets of essays was generated from a single prompt. Selected essays range from an average length of 150 to 550 words per response. Some of the essays are dependent upon source information and others are not. All responses were written by students ranging in grade levels from Grade 7 to Grade 10. All essays were hand graded and were double-scored. Each of the eight data sets has its own unique characteristics. The variability is intended to test the limits of your scoring engine's capabilities.\n",
"\n",
"Each of these files contains 28 columns:\n",
"\n",
"* essay_id: A unique identifier for each individual student essay\n",
"* essay_set: 1-8, an id for each set of essays\n",
"* essay: The ascii text of a student's response\n",
"* rater1_domain1: Rater 1's domain 1 score; all essays have this\n",
"* rater2_domain1: Rater 2's domain 1 score; all essays have this\n",
"* rater3_domain1: Rater 3's domain 1 score; only some essays in set 8 have this.\n",
"* domain1_score: Resolved score between the raters; all essays have this\n",
"* rater1_domain2: Rater 1's domain 2 score; only essays in set 2 have this\n",
"* rater2_domain2: Rater 2's domain 2 score; only essays in set 2 have this\n",
"* domain2_score: Resolved score between the raters; only essays in set 2 have this\n",
"* rater1_trait1 score - rater3_trait6 score: trait scores for sets 7-8\n",
"\n",
"The dataset is provided in the file *data-essays/training_set_rel3.tsv*.\n",
"\n",
"There are cases in the training set that contain ???, \"illegible\", or \"not legible\" on some words. You may choose to discard them if you wish, and essays with illegible words will not be present in the validation or test sets.\n",
"\n",
"The dataset has been anonymized to remove personally identifying information from the essays using the Named Entity Recognizer (NER) from the Stanford Natural Language Processing group and a variety of other approaches. The relevant entities are identified in the text and then replaced with a string such as \"@PERSON1.\"\n",
"\n",
"The entities identified by NER are: \"PERSON\", \"ORGANIZATION\", \"LOCATION\", \"DATE\", \"TIME\", \"MONEY\", \"PERCENT\"\n",
"\n",
"Other replacements made: \"MONTH\" (any month name not tagged as a date by the NER), \"EMAIL\" (anything that looks like an e-mail address), \"NUM\" (word containing digits or non-alphanumeric symbols), \"CAPS\" (any capitalized word that doesn't begin a sentence, except in essays where more than 20% of the characters are capitalized letters), \"DR\" (any word following \"Dr.\" with or without the period, with any capitalization, that doesn't fall into any of the above), and \"CITY\" and \"STATE\" (various cities and states)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Loading the dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use pandas to load the dataset. We will not analyse the dataset in depth here; the techniques seen previously can be used for that."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>essay_id</th>\n",
" <th>essay_set</th>\n",
" <th>essay</th>\n",
" <th>rater1_domain1</th>\n",
" <th>rater2_domain1</th>\n",
" <th>rater3_domain1</th>\n",
" <th>domain1_score</th>\n",
" <th>rater1_domain2</th>\n",
" <th>rater2_domain2</th>\n",
" <th>domain2_score</th>\n",
" <th>...</th>\n",
" <th>rater2_trait3</th>\n",
" <th>rater2_trait4</th>\n",
" <th>rater2_trait5</th>\n",
" <th>rater2_trait6</th>\n",
" <th>rater3_trait1</th>\n",
" <th>rater3_trait2</th>\n",
" <th>rater3_trait3</th>\n",
" <th>rater3_trait4</th>\n",
" <th>rater3_trait5</th>\n",
" <th>rater3_trait6</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Dear local newspaper, I think effects computer...</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" <td>NaN</td>\n",
" <td>8</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>Dear @CAPS1 @CAPS2, I believe that using compu...</td>\n",
" <td>5</td>\n",
" <td>4</td>\n",
" <td>NaN</td>\n",
" <td>9</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>7</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>Dear Local Newspaper, @CAPS1 I have found that...</td>\n",
" <td>5</td>\n",
" <td>5</td>\n",
" <td>NaN</td>\n",
" <td>10</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>4 rows × 28 columns</p>\n",
"</div>"
],
"text/plain": [
" essay_id essay_set essay \\\n",
"0 1 1 Dear local newspaper, I think effects computer... \n",
"1 2 1 Dear @CAPS1 @CAPS2, I believe that using compu... \n",
"2 3 1 Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... \n",
"3 4 1 Dear Local Newspaper, @CAPS1 I have found that... \n",
"\n",
" rater1_domain1 rater2_domain1 rater3_domain1 domain1_score \\\n",
"0 4 4 NaN 8 \n",
"1 5 4 NaN 9 \n",
"2 4 3 NaN 7 \n",
"3 5 5 NaN 10 \n",
"\n",
" rater1_domain2 rater2_domain2 domain2_score ... \\\n",
"0 NaN NaN NaN ... \n",
"1 NaN NaN NaN ... \n",
"2 NaN NaN NaN ... \n",
"3 NaN NaN NaN ... \n",
"\n",
" rater2_trait3 rater2_trait4 rater2_trait5 rater2_trait6 rater3_trait1 \\\n",
"0 NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN \n",
"\n",
" rater3_trait2 rater3_trait3 rater3_trait4 rater3_trait5 rater3_trait6 \n",
"0 NaN NaN NaN NaN NaN \n",
"1 NaN NaN NaN NaN NaN \n",
"2 NaN NaN NaN NaN NaN \n",
"3 NaN NaN NaN NaN NaN \n",
"\n",
"[4 rows x 28 columns]"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# The files are encoded in ISO-8859-1\n",
"\n",
"df_orig = pd.read_csv(\"data-essays/training_set_rel3.tsv\", encoding='ISO-8859-1', delimiter=\"\\t\", header=0)\n",
"df_orig[0:4]"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(12976, 28)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_orig.shape"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(1783, 3)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We filter the data of essay set number 1 and keep only three columns for this\n",
"# example\n",
"\n",
"df = df_orig[df_orig['essay_set'] == 1][['essay_id', 'essay', 'domain1_score']].copy()\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>essay_id</th>\n",
" <th>essay</th>\n",
" <th>domain1_score</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Dear local newspaper, I think effects computer...</td>\n",
" <td>8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Dear @CAPS1 @CAPS2, I believe that using compu...</td>\n",
" <td>9</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Dear Local Newspaper, @CAPS1 I have found that...</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Dear @LOCATION1, I know having computers has a...</td>\n",
" <td>8</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" essay_id essay domain1_score\n",
"0 1 Dear local newspaper, I think effects computer... 8\n",
"1 2 Dear @CAPS1 @CAPS2, I believe that using compu... 9\n",
"2 3 Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl... 7\n",
"3 4 Dear Local Newspaper, @CAPS1 I have found that... 10\n",
"4 5 Dear @LOCATION1, I know having computers has a... 8"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[0:5]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Define X (essay texts) and y (scores)\n",
"X = df['essay'].values\n",
"y = df['domain1_score'].values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Transformers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every feature extractor should be implemented as a custom Transformer. A transformer can be seen as an object that receives data, applies some changes, and returns the data, usually with the same number of samples as the input. The methods we should implement are:\n",
"* *fit* method, in case we need to learn and train for extracting the feature\n",
"* *transform* method, which applies the defined transformation to unseen data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell shows the general approach to developing transformers:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Generic Transformer\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"\n",
"class GenericTransformer(BaseEstimator, TransformerMixin):\n",
"\n",
"    def transform(self, X, y=None):\n",
"        # do_something_to and self.vars are placeholders for the actual feature extraction\n",
"        return do_something_to(X, self.vars)\n",
"\n",
"    def fit(self, X, y=None):\n",
"        return self  # used if the feature requires training, for example, clustering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides a class [FunctionTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html) that makes easy to create new transformers. We have to provide a function that is executed in the method transform()."
]
},
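{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of how a plain function can be wrapped in a `FunctionTransformer`, the next cell builds a transformer that returns the length of each document. The function `text_lengths` and the sample strings are illustrative assumptions, not part of the course material. `validate=False` is passed explicitly so that the raw strings reach the function unchanged."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import FunctionTransformer\n",
"\n",
"def text_lengths(docs):\n",
"    # Hypothetical feature: one row per document, one column with its length in characters\n",
"    return [[len(doc)] for doc in docs]\n",
"\n",
"# The function is called inside transform(); no training is needed\n",
"length_transformer = FunctionTransformer(text_lengths, validate=False)\n",
"length_transformer.fit_transform(['a short text', 'another, longer text'])"
]
},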
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lexical features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we include some examples of lexical features. We have omitted character features (for example, number of exclamation marks)."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Sample of statistics using nltk\n",
"# Another option is defining a function and passing it as a parameter to FunctionTransformer\n",
"\n",
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
"\n",
"class LexicalStats(BaseEstimator, TransformerMixin):\n",
"    \"\"\"Extract lexical features from each document\"\"\"\n",
"\n",
"    def number_sentences(self, doc):\n",
"        sentences = sent_tokenize(doc, language='english')\n",
"        return len(sentences)\n",
"\n",
"    def fit(self, x, y=None):\n",
"        return self\n",
"\n",
"    def transform(self, docs):\n",
"        # One dict per document; a DictVectorizer later turns these into a matrix\n",
"        return [{'length': len(doc),\n",
"                 'num_sentences': self.number_sentences(doc)}\n",
"                for doc in docs]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk.stem import PorterStemmer\n",
"from nltk import word_tokenize\n",
"from nltk.corpus import stopwords\n",
"import string\n",
"\n",
"def custom_tokenizer(words):\n",
"    \"\"\"Preprocessing tokens as seen in the lexical notebook\"\"\"\n",
"    tokens = word_tokenize(words.lower())\n",
"    porter = PorterStemmer()\n",
"    # Porter stemming yields stems rather than dictionary lemmas\n",
"    stems = [porter.stem(t) for t in tokens]\n",
"    stoplist = set(stopwords.words('english'))\n",
"    stems_clean = [w for w in stems if w not in stoplist]\n",
"    punctuation = set(string.punctuation)\n",
"    stems_punct = [w for w in stems_clean if w not in punctuation]\n",
"    return stems_punct"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Syntactic features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we include an example of syntactic feature extraction."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.base import BaseEstimator, TransformerMixin\n",
"from nltk import pos_tag\n",
"from collections import Counter\n",
"\n",
"class PosStats(BaseEstimator, TransformerMixin):\n",
"    \"\"\"Obtain the proportion of tokens in each POS category\"\"\"\n",
"\n",
"    def stats(self, doc):\n",
"        tokens = custom_tokenizer(doc)\n",
"        tagged = pos_tag(tokens, tagset='universal')\n",
"        counts = Counter(tag for word, tag in tagged)\n",
"        total = sum(counts.values())\n",
"        # initialise all tags so that we always return the same number of features\n",
"        pos_features = {'NOUN': 0, 'ADJ': 0, 'VERB': 0, 'ADV': 0, 'CONJ': 0,\n",
"                        'ADP': 0, 'PRON': 0, 'NUM': 0}\n",
"\n",
"        pos_dic = dict((tag, float(count)/total) for tag, count in counts.items())\n",
"        for k in pos_dic:\n",
"            if k in pos_features:\n",
"                pos_features[k] = pos_dic[k]\n",
"        return pos_features\n",
"\n",
"    def transform(self, docs, y=None):\n",
"        return [self.stats(doc) for doc in docs]\n",
"\n",
"    def fit(self, docs, y=None):\n",
"        \"\"\"Returns `self` unless something different happens in train and test\"\"\"\n",
"        return self"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Extraction Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We define Pipelines to extract the desired features.\n",
"\n",
"In case we want to apply different processing techniques to different parts of each document (e.g. title, body, ...), look [here](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html) for an example of how to extract and process the different parts in a Pipeline."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline, FeatureUnion\n",
"from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer\n",
"\n",
"\n",
"ngrams_featurizer = Pipeline([\n",
"    ('count_vectorizer', CountVectorizer(ngram_range=(1, 3), encoding='ISO-8859-1',\n",
"                                         tokenizer=custom_tokenizer)),\n",
"    ('tfidf_transformer', TfidfTransformer())\n",
"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Union Pipeline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can ensemble the different pipelines to define which features we want to extract, how to combine them, and apply later machine learning techniques to the resulting feature set.\n",
"\n",
"In Feature Union we can pass either a pipeline or a transformer.\n",
"\n",
"The basic idea is:\n",
"* **Pipelines** consist of sequential steps: one step works on the results of the previous step\n",
"* ** FeatureUnions** consist of parallel tasks whose result is grouped when all have finished."
]
},
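{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parallel behaviour of a FeatureUnion can be sketched on a toy input. In the next cell, the functions `char_count` and `word_count` and the two sample strings are illustrative assumptions, not part of the course material. Each branch produces one column, and the union places the columns side by side, so the result has one row per document and two columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.pipeline import FeatureUnion\n",
"from sklearn.preprocessing import FunctionTransformer\n",
"\n",
"def char_count(docs):\n",
"    # Hypothetical feature: number of characters per document\n",
"    return [[len(doc)] for doc in docs]\n",
"\n",
"def word_count(docs):\n",
"    # Hypothetical feature: number of whitespace-separated words per document\n",
"    return [[len(doc.split())] for doc in docs]\n",
"\n",
"toy_union = FeatureUnion([\n",
"    ('chars', FunctionTransformer(char_count, validate=False)),\n",
"    ('words', FunctionTransformer(word_count, validate=False)),\n",
"])\n",
"\n",
"# Both branches receive the same input; their outputs are stacked column-wise\n",
"toy_union.fit_transform(['one two', 'three four five'])"
]
},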
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Scores in every iteration [ 0.39798206 0.27497194]\n",
"Accuracy: 0.34 (+/- 0.12)\n"
]
}
],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"# sklearn.cross_validation was replaced by sklearn.model_selection in scikit-learn 0.18\n",
"from sklearn.model_selection import cross_val_score, KFold\n",
"from sklearn.feature_extraction import DictVectorizer\n",
"from sklearn.decomposition import LatentDirichletAllocation\n",
"\n",
"\n",
"# Every branch of the FeatureUnion must end in a step that outputs a feature matrix\n",
"\n",
"pipeline = Pipeline([\n",
"    ('features', FeatureUnion([\n",
"        ('lexical_stats', Pipeline([\n",
"            ('stats', LexicalStats()),\n",
"            ('vectors', DictVectorizer())\n",
"        ])),\n",
"        ('words', TfidfVectorizer(tokenizer=custom_tokenizer)),\n",
"        ('ngrams', ngrams_featurizer),\n",
"        ('pos_stats', Pipeline([\n",
"            ('pos_stats', PosStats()),\n",
"            ('vectors', DictVectorizer())\n",
"        ])),\n",
"        ('lda', Pipeline([\n",
"            ('count', CountVectorizer(tokenizer=custom_tokenizer)),\n",
"            # n_topics was renamed to n_components in scikit-learn 0.19\n",
"            ('lda', LatentDirichletAllocation(n_components=4, max_iter=5,\n",
"                                              learning_method='online',\n",
"                                              learning_offset=50.,\n",
"                                              random_state=0))\n",
"        ])),\n",
"    ])),\n",
"\n",
"    ('clf', MultinomialNB(alpha=.01))  # classifier\n",
"])\n",
"\n",
"# Using KFold validation with two folds\n",
"\n",
"cv = KFold(n_splits=2, shuffle=True, random_state=33)\n",
"scores = cross_val_score(pipeline, X, y, cv=cv)\n",
"print(\"Scores in every iteration\", scores)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"The result is not very good :( The mean accuracy over the two folds is only 0.34."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}

@ -0,0 +1,144 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Exercises](#Exercises)\n",
"\t* [Exercise 1 - Sentiment classification for Twitter](#Exercise-1---Sentiment-classification-for-Twitter)\n",
"\t* [Exercise 2 - Spam classification](#Exercise-2---Spam-classification)\n",
"\t* [Exercise 3 - Automatic essay classification](#Exercise-3---Automatic-essay-classification)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we propose several exercises; we recommend working on only one of them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1 - Sentiment classification for Twitter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The purpose of this exercise is to:\n",
"* Collect geolocated tweets\n",
"* Analyse their sentiment\n",
"* Represent the result in a map, so that one can understand the sentiment in a geographic region.\n",
"\n",
"The steps (and most of the code) can be found [here](http://pybonacci.org/2015/11/24/como-hacer-analisis-de-sentimiento-en-espanol-2/). \n",
"\n",
"You can select the tweets in any language."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2 - Spam classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The classification of spam is a classical problem. [Here](http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html) you can find a detailed example of how to do it using the Enron-Spam and SpamAssassin datasets. You can try the classification yourself."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 3 - Automatic essay classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you have seen, we did not get great results in the previous notebook. You can try to improve them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
