mirror of https://github.com/gsi-upm/sitc
Added NLP notebooks
parent
31153f8e5d
commit
3054442421
@ -0,0 +1,92 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction to Natural Language Processing\n",
|
||||
" \n",
|
||||
"In this lab session, we are going to learn how to analyze texts and apply machine learning techniques on textual data sources.\n",
|
||||
"\n",
|
||||
"# Objectives\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Learn how to obtain lexical, syntactic and semantic features from texts\n",
|
||||
"* Learn to use some libraries, such as NLTK, Scikit-learn and gensim for NLP"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. [Home](4_0_Intro_NLP.ipynb)\n",
|
||||
"1. [Lexical processing](4_1_Lexical_Processing.ipynb)\n",
|
||||
"1. [Syntactic processing](4_2_Syntactic_Processing.ipynb)\n",
|
||||
"2. [Vector representation](4_3_Vector_Representation.ipynb)\n",
|
||||
"3. [Classification](4_4_Classification.ipynb)\n",
|
||||
"1. [Semantic models](4_5_Semantic_Models.ipynb)\n",
|
||||
"1. [Combining features](4_6_Combining_Features.ipynb)\n",
|
||||
"5. [Exercises](4_7_Exercises.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,644 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Lexical Processing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"* [Objectives](#Objectives)\n",
|
||||
"* [Tools](#Tools)\n",
|
||||
"* [Cleansing](#Cleansing)\n",
|
||||
"* [Tokenization](#Tokenization)\n",
|
||||
"* [Sentence Splitter](#Sentence-Splitter)\n",
|
||||
"* [Word Splitter](#Word-Splitter)\n",
|
||||
"* [Stemming and Lemmatization](#Stemming-and-Lemmatization)\n",
|
||||
"* [Stop word removal](#Stop-word-removal)\n",
|
||||
"* [Punctuation removal](#Punctuation-removal)\n",
|
||||
"* [Rare words and spelling](#Rare-words-and-spelling)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Objectives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we are going to learn how to preprocess texts, also known as *text wrangling*. This task involves data munging, text cleansing, specific preprocessing, tokenization, stemming or lemmatization and stop word removal.\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Learn how to preprocess text sources\n",
|
||||
"* Learn to use some of the most popular NLP libraries\n",
|
||||
"\n",
|
||||
"We are going to use as an example part of a computer review included in [Liu's Product Review of IJCA 2015](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets) and a tweet from the [Semeval 2013 Task 2 dataset](https://www.cs.york.ac.uk/semeval-2013/task2/data/uploads/datasets/readme.txt), slightly modified for learning purposes."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"review = \"\"\"I purchased this monitor because of budgetary concerns. This item was the most inexpensive 17 inch monitor \n",
|
||||
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
|
||||
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
|
||||
"monitor models since I 'm a college student and this particular monitor had as poor of picture quality as \n",
|
||||
"any I 've seen.\"\"\"\n",
|
||||
"\n",
|
||||
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
|
||||
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tools"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we are going to use several libraries, which provide complementary features:\n",
|
||||
"* [NLTK](nltk.org/book_1ed/) - provides functionalities for sentence splitting, tokenization, lemmatization, NER, collocations ... and access to many lexical resources (WordNet, Corpora, ...)\n",
|
||||
"* [Gensim](https://radimrehurek.com/gensim/) - provides functionalities for corpora management, LDA and LSI, among others.\n",
|
||||
"* [TextBlob](http://textblob.readthedocs.io/) - provides a simple way to access to many of the NLP functions. It is simpler than NLTK and integrates additional functionatities, such as language detection, spelling or even sentiment analysis.\n",
|
||||
"* [CLiPS](http://www.clips.ua.ac.be/pages/pattern-en#parser) -- contains a fast part-of-speech tagger for English (identifies nouns, adjectives, verbs, etc. in a sentence), sentiment and mood analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface. Unfortunately, it does not support Python 3 yet.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"In order to use nltk, we should download first the lexical resources we are going to use. We can updated them later. For this, you need:\n",
|
||||
"* import nltk\n",
|
||||
"* Run *nltk.download()* (the first time we use it). A window will appear. You should select just the corpus 'book' and press download.\n",
|
||||
"If you inspect the window, you can get an overview of available lexical resources (corpora, lexicons and grammars). For example, you can find some relevant sentiment lexicons in corpora (SentiWordNet, Sentence Polarity Dataset, Vader, Opinion Lexicon or VADER Sentiment Lexicon)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import nltk\n",
|
||||
"ntlk.download()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Cleansing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this case we will use raw text. In case you need to clean the documents (eliminate HTML markup, etc.), you can use libraries such as [BeautifulSoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tokenization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Tokenization is the process of transforming a text into tokens. Depending on the input, we can want to split the text into sentences or words. Moreover, some input such as Twitter can require taking into account processing special tokens, such as hashtags.\n",
|
||||
"\n",
|
||||
"NLTK provides good support for [tokenization](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
||||
"\n",
|
||||
"Next we are going to practice several of these features."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Sentence Splitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can use a standard sentence splitter (*sent_tokenize* which uses *PunkTonenizer*), or train a sentence splitter, using the class [*PunktSentenceTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
||||
"\n",
|
||||
"If the text is multilingual, we can install [*textblob*](http://textblob.readthedocs.io/) or [*langdetect*](https://pypi.python.org/pypi/langdetect?) to detect the text language and select the most suitable sentence splitter. NLTK comes with 17 trained languages for sentence splitting."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['I purchased this monitor because of budgetary concerns.', 'This item was the most inexpensive 17 inch monitor \\navailable to me at the time I made the purchase.', 'My overall experience with this monitor was very poor.', \"When the \\nscreen wasn't contracting or glitching the overall picture quality was poor to fair.\", \"I've viewed numerous different \\nmonitor models since I 'm a college student and this particular monitor had as poor of picture quality as \\nany I 've seen.\"]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.tokenize import sent_tokenize, word_tokenize\n",
|
||||
"\n",
|
||||
"sentences = sent_tokenize(review, language='english')\n",
|
||||
"print(sentences)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Word Splitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next stem is dividing every sentence (or the full step) into words."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[['I', 'purchased', 'this', 'monitor', 'because', 'of', 'budgetary', 'concerns', '.'], ['This', 'item', 'was', 'the', 'most', 'inexpensive', '17', 'inch', 'monitor', 'available', 'to', 'me', 'at', 'the', 'time', 'I', 'made', 'the', 'purchase', '.'], ['My', 'overall', 'experience', 'with', 'this', 'monitor', 'was', 'very', 'poor', '.'], ['When', 'the', 'screen', 'was', \"n't\", 'contracting', 'or', 'glitching', 'the', 'overall', 'picture', 'quality', 'was', 'poor', 'to', 'fair', '.'], ['I', \"'ve\", 'viewed', 'numerous', 'different', 'monitor', 'models', 'since', 'I', \"'m\", 'a', 'college', 'student', 'and', 'this', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'picture', 'quality', 'as', 'any', 'I', \"'ve\", 'seen', '.']]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"words = [word_tokenize(t) for t in sent_tokenize(review)]\n",
|
||||
"print(words)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In our case, we are not interested in processing every sentence, we have split into sentence just for learning purposes. So, we are going to get the word tokens."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['I', 'purchased', 'this', 'monitor', 'because', 'of', 'budgetary', 'concerns', '.', 'This', 'item', 'was', 'the', 'most', 'inexpensive', '17', 'inch', 'monitor', 'available', 'to', 'me', 'at', 'the', 'time', 'I', 'made', 'the', 'purchase', '.', 'My', 'overall', 'experience', 'with', 'this', 'monitor', 'was', 'very', 'poor', '.', 'When', 'the', 'screen', 'was', \"n't\", 'contracting', 'or', 'glitching', 'the', 'overall', 'picture', 'quality', 'was', 'poor', 'to', 'fair', '.', 'I', \"'ve\", 'viewed', 'numerous', 'different', 'monitor', 'models', 'since', 'I', \"'m\", 'a', 'college', 'student', 'and', 'this', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'picture', 'quality', 'as', 'any', 'I', \"'ve\", 'seen', '.']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"words = word_tokenize(review)\n",
|
||||
"print(words)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can define our own word tokenizer using regular expressions (and using the class [*RegexpTokenizer*](http://www.nltk.org/api/nltk.tokenize.html).\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"With TweetTokenizer Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight ! ! ! She still listens to her music ! ! ! WOW ! ! ! #ladygaga #britney\n",
|
||||
"With word_tokenizer @ concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight ! ! ! She still listens to her music ! ! ! ! WOW ! ! ! # ladygaga # britney\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.tokenize import TweetTokenizer\n",
|
||||
"tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
||||
"tweet_tokens = tknzr.tokenize(tweet)\n",
|
||||
"print (\"With TweetTokenizer \" + \" \".join(tweet_tokens))\n",
|
||||
"print(\"With word_tokenizer \" + \" \".join(word_tokenize(tweet)))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Stemming and Lemmatization"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"NLTK provides support for stemming in the package [*stem*](http://www.nltk.org/api/nltk.stem.html). There are several available stemmers:PorterStemmer, lancaster or WordNetLemmatizer. Check the API for more details."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 75,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Porter: boy children are have is ha Madrid\n",
|
||||
"Execution time: 0.00093841552734375\n",
|
||||
"Lancaster: boy childr ar hav is has madrid\n",
|
||||
"Execution time: 0.0014300346374511719\n",
|
||||
"WordNet: boy child are have is ha Madrid\n",
|
||||
"Execution time: 0.0008304119110107422\n",
|
||||
"SnowBall: boy children are have is has madrid\n",
|
||||
"Execution time: 0.0017843246459960938\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.stem import PorterStemmer, LancasterStemmer, WordNetLemmatizer\n",
|
||||
"from nltk.stem.snowball import EnglishStemmer\n",
|
||||
"import time\n",
|
||||
"\n",
|
||||
"porter = PorterStemmer()\n",
|
||||
"lancaster = LancasterStemmer()\n",
|
||||
"wordnet = WordNetLemmatizer()\n",
|
||||
"snowball = EnglishStemmer()\n",
|
||||
"\n",
|
||||
"words = \"boys children are have is has Madrid\"\n",
|
||||
"\n",
|
||||
"start = time.time()\n",
|
||||
"print(\"Porter: \" + \" \".join([porter.stem(w) for w in word_tokenize(words)]))\n",
|
||||
"end = time.time()\n",
|
||||
"print(\"Execution time: \" + str(end - start))\n",
|
||||
"start = time.time()\n",
|
||||
"print(\"Lancaster: \" + \" \".join([lancaster.stem(w) for w in word_tokenize(words)]))\n",
|
||||
"end = time.time()\n",
|
||||
"print(\"Execution time: \" + str(end - start))\n",
|
||||
"start = time.time()\n",
|
||||
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w) for w in word_tokenize(words)]))\n",
|
||||
"end = time.time()\n",
|
||||
"print(\"Execution time: \" + str(end - start))\n",
|
||||
"start = time.time()\n",
|
||||
"print(\"SnowBall: \" + \" \".join([snowball.stem(w) for w in word_tokenize(words)]))\n",
|
||||
"end = time.time()\n",
|
||||
"print(\"Execution time: \" + str(end - start))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"As we can see, we get the forms *are* and *is* instead of *be*. This is because we have not introduce the Part-Of-Speech (POS), and the default POS is 'n' (name).\n",
|
||||
"\n",
|
||||
"The main difference between stemmers and lemmatizers is that stemmers operate in isolated words, while lemmatizers take into account the context (e.g. POS). However, stemmers are quicker and require fewer resources.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 74,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"WordNet: be cry be have have\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"verbs = \"are crying is have has\"\n",
|
||||
"print(\"WordNet: \" + \" \".join([wordnet.lemmatize(w, pos='v') for w in word_tokenize(verbs)]))\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"Depending of the application, we can select stemmers or lemmatizers. \n",
|
||||
"\n",
|
||||
"Regarding Twitter, we could use specialised software for managing tweets, such as [*TweetNLP*](http://www.cs.cmu.edu/~ark/TweetNLP/).\n",
|
||||
"\n",
|
||||
"Now we go back to our example and we apply stemming."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 82,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['i', 'purchas', 'thi', 'monitor', 'becaus', 'of', 'budgetari', 'concern', '.', 'thi', 'item', 'wa', 'the', 'most', 'inexpens', '17', 'inch', 'monitor', 'avail', 'to', 'me', 'at', 'the', 'time', 'i', 'made', 'the', 'purchas', '.', 'my', 'overal', 'experi', 'with', 'thi', 'monitor', 'wa', 'veri', 'poor', '.', 'when', 'the', 'screen', 'wa', \"n't\", 'contract', 'or', 'glitch', 'the', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'to', 'fair', '.', 'i', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', 'i', \"'m\", 'a', 'colleg', 'student', 'and', 'thi', 'particular', 'monitor', 'had', 'as', 'poor', 'of', 'pictur', 'qualiti', 'as', 'ani', 'i', \"'ve\", 'seen', '.']\n",
|
||||
"['Ladi', 'Gaga', 'is', 'actual', 'at', 'the', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', '!', '!', '!', 'She', 'still', 'listen', 'to', 'her', 'music', '!', '!', '!', 'WOW', '!', '!', '!', '#ladygaga', '#britney']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def preprocess(words, type='doc'):\n",
|
||||
" if (type == 'tweet'):\n",
|
||||
" tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
||||
" tokens = tknzr.tokenize(tweet)\n",
|
||||
" else:\n",
|
||||
" tokens = nltk.word_tokenize(words.lower())\n",
|
||||
" porter = nltk.PorterStemmer()\n",
|
||||
" lemmas = [porter.stem(t) for t in tokens]\n",
|
||||
" return lemmas\n",
|
||||
"print(preprocess(review))\n",
|
||||
"print(preprocess(tweet, type='tweet'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Stop word removal"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next step is removing stop words."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 86,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.corpus import stopwords\n",
|
||||
"\n",
|
||||
"stoplist = stopwords.words('english')\n",
|
||||
"print(stoplist)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 89,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['purchas', 'thi', 'monitor', 'becaus', 'budgetari', 'concern', '.', 'thi', 'item', 'wa', 'inexpens', '17', 'inch', 'monitor', 'avail', 'time', 'made', 'purchas', '.', 'overal', 'experi', 'thi', 'monitor', 'wa', 'veri', 'poor', '.', 'screen', 'wa', \"n't\", 'contract', 'glitch', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'fair', '.', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', \"'m\", 'colleg', 'student', 'thi', 'particular', 'monitor', 'poor', 'pictur', 'qualiti', 'ani', \"'ve\", 'seen', '.']\n",
|
||||
"['Ladi', 'Gaga', 'actual', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', '!', '!', '!', 'She', 'still', 'listen', 'music', '!', '!', '!', 'WOW', '!', '!', '!', '#ladygaga', '#britney']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def preprocess(words, type='doc'):\n",
|
||||
" if (type == 'tweet'):\n",
|
||||
" tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
||||
" tokens = tknzr.tokenize(tweet)\n",
|
||||
" else:\n",
|
||||
" tokens = nltk.word_tokenize(words.lower())\n",
|
||||
" porter = nltk.PorterStemmer()\n",
|
||||
" lemmas = [porter.stem(t) for t in tokens]\n",
|
||||
" stoplist = stopwords.words('english')\n",
|
||||
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
|
||||
" return lemmas_clean\n",
|
||||
"\n",
|
||||
"print(preprocess(review))\n",
|
||||
"print(preprocess(tweet, type='tweet'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Punctuation removal"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Punctuation is useful for sentence splitting and POS tagging. Once we have used it, we can remove it easily."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 104,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['purchas', 'thi', 'monitor', 'becaus', 'budgetari', 'concern', 'thi', 'item', 'wa', 'inexpens', '17', 'inch', 'monitor', 'avail', 'time', 'made', 'purchas', 'overal', 'experi', 'thi', 'monitor', 'wa', 'veri', 'poor', 'screen', 'wa', \"n't\", 'contract', 'glitch', 'overal', 'pictur', 'qualiti', 'wa', 'poor', 'fair', \"'ve\", 'view', 'numer', 'differ', 'monitor', 'model', 'sinc', \"'m\", 'colleg', 'student', 'thi', 'particular', 'monitor', 'poor', 'pictur', 'qualiti', 'ani', \"'ve\", 'seen']\n",
|
||||
"['Ladi', 'Gaga', 'actual', 'Britney', 'Spear', 'Femm', 'Fatal', 'Concert', 'tonight', 'She', 'still', 'listen', 'music', 'WOW', '#ladygaga', '#britney']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import string\n",
|
||||
"\n",
|
||||
"def preprocess(words, type='doc'):\n",
|
||||
" if (type == 'tweet'):\n",
|
||||
" tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)\n",
|
||||
" tokens = tknzr.tokenize(tweet)\n",
|
||||
" else:\n",
|
||||
" tokens = nltk.word_tokenize(words.lower())\n",
|
||||
" porter = nltk.PorterStemmer()\n",
|
||||
" lemmas = [porter.stem(t) for t in tokens]\n",
|
||||
" stoplist = stopwords.words('english')\n",
|
||||
" lemmas_clean = [w for w in lemmas if w not in stoplist]\n",
|
||||
" punctuation = set(string.punctuation)\n",
|
||||
" words = [w for w in lemmas_clean if w not in punctuation]\n",
|
||||
" return words\n",
|
||||
"\n",
|
||||
"print(preprocess(review))\n",
|
||||
"print(preprocess(tweet, type='tweet'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Rare words and spelling"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In large corpus, we may want to clean rare words (probably they are typos) and correct spelling. \n",
|
||||
"\n",
|
||||
"For the first task, you can exclude the least frequent words (or compare with their frequency in a corpus). NLTK provides facilities for calculating frequencies.\n",
|
||||
"\n",
|
||||
"For the second task, you can use spell-checker packages such as [*textblob*](http://textblob.readthedocs.io/) or [*autocorrect*](https://pypi.python.org/pypi/autocorrect/)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 124,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Most frequent\n",
|
||||
"[('.', 5), ('I', 5), ('the', 5), ('monitor', 5), ('was', 4), ('this', 3), ('poor', 3), ('as', 2), ('overall', 2), ('picture', 2)]\n",
|
||||
"Least frequent\n",
|
||||
"['models', '17', 'purchase', 'different', 'most', \"n't\", 'monitor', \"'m\", 'was', 'My']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"frec = nltk.FreqDist(nltk.word_tokenize(review))\n",
|
||||
"print(\"Most frequent\")\n",
|
||||
"print(frec.most_common(10))\n",
|
||||
"print(\"Least frequent\")\n",
|
||||
"print(list(frec.keys())[-10:])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
||||
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,595 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Syntactic Processing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"\n",
|
||||
"* [Objectives](#Objectives)\n",
|
||||
"* [POS Tagging](#POS-Tagging)\n",
|
||||
"* [NER](#NER)\n",
|
||||
"* [Parsing and Chunking](#Parsing-and-Chunking)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Objectives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we are going to learn how to analyse the syntax of text. In particular, we will learn\n",
|
||||
"* Understand and perform POS (Part of Speech) tagging\n",
|
||||
"* Understand and perform NER (Named Entity Recognition)\n",
|
||||
"* Understand and parse texts\n",
|
||||
"\n",
|
||||
"We will use the same examples than in the previous notebook, slightly modified for learning purposes."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"review = \"\"\"I purchased this Dell monitor because of budgetary concerns. This item was the most inexpensive 17 inch Apple monitor \n",
|
||||
"available to me at the time I made the purchase. My overall experience with this monitor was very poor. When the \n",
|
||||
"screen wasn't contracting or glitching the overall picture quality was poor to fair. I've viewed numerous different \n",
|
||||
"monitor models since I 'm a college student at UPM in Madrid and this particular monitor had as poor of picture quality as \n",
|
||||
"any I 've seen.\"\"\"\n",
|
||||
"\n",
|
||||
"tweet = \"\"\"@concert Lady Gaga is actually at the Britney Spears Femme Fatale Concert tonight!!! She still listens to \n",
|
||||
" her music!!!! WOW!!! #ladygaga #britney\"\"\""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# POS Tagging"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"POS Tagging is the process of assigning a grammatical category (known as *part of speech*, POS) to a word. For this purpose, the most common approach is using an annotated corpus such as Penn Treebank. The tag set (categories) depends on the corpus annotation. Fortunately, nltk defines a [universal tagset](http://www.nltk.org/book/ch05.html):\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Tag\t| Meaning | English Examples\n",
|
||||
"----|---------|------------------\n",
|
||||
"ADJ\t| adjective | new, good, high, special, big, local\n",
|
||||
"ADP\t| adposition | on, of, at, with, by, into, under\n",
|
||||
"ADV\t| adverb | really, already, still, early, now\n",
|
||||
"CONJ| conjunction | and, or, but, if, while, although\n",
|
||||
"DET | determiner, article | the, a, some, most, every, no, which\n",
|
||||
"NOUN | noun\t | year, home, costs, time, Africa\n",
|
||||
"NUM\t| numeral | twenty-four, fourth, 1991, 14:24\n",
|
||||
"PRT | particle | at, on, out, over per, that, up, with\n",
|
||||
"PRON | pronoun | he, their, her, its, my, I, us\n",
|
||||
"VERB | verb\t| is, say, told, given, playing, would\n",
|
||||
". | punctuation marks | . , ; !\n",
|
||||
"X | other | ersatz, esprit, dunno, gr8, univeristy"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[('I', 'PRON'), ('purchased', 'VERB'), ('this', 'DET'), ('Dell', 'NOUN'), ('monitor', 'NOUN'), ('because', 'ADP'), ('of', 'ADP'), ('budgetary', 'ADJ'), ('concerns', 'NOUN'), ('.', '.'), ('This', 'DET'), ('item', 'NOUN'), ('was', 'VERB'), ('the', 'DET'), ('most', 'ADV'), ('inexpensive', 'ADJ'), ('17', 'NUM'), ('inch', 'NOUN'), ('Apple', 'NOUN'), ('monitor', 'NOUN'), ('available', 'ADJ'), ('to', 'PRT'), ('me', 'PRON'), ('at', 'ADP'), ('the', 'DET'), ('time', 'NOUN'), ('I', 'PRON'), ('made', 'VERB'), ('the', 'DET'), ('purchase', 'NOUN'), ('.', '.'), ('My', 'PRON'), ('overall', 'ADJ'), ('experience', 'NOUN'), ('with', 'ADP'), ('this', 'DET'), ('monitor', 'NOUN'), ('was', 'VERB'), ('very', 'ADV'), ('poor', 'ADJ'), ('.', '.'), ('When', 'ADV'), ('the', 'DET'), ('screen', 'NOUN'), ('was', 'VERB'), (\"n't\", 'ADV'), ('contracting', 'VERB'), ('or', 'CONJ'), ('glitching', 'VERB'), ('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN'), ('was', 'VERB'), ('poor', 'ADJ'), ('to', 'PRT'), ('fair', 'VERB'), ('.', '.'), ('I', 'PRON'), (\"'ve\", 'VERB'), ('viewed', 'VERB'), ('numerous', 'ADJ'), ('different', 'ADJ'), ('monitor', 'NOUN'), ('models', 'NOUN'), ('since', 'ADP'), ('I', 'PRON'), (\"'m\", 'VERB'), ('a', 'DET'), ('college', 'NOUN'), ('student', 'NOUN'), ('at', 'ADP'), ('UPM', 'NOUN'), ('in', 'ADP'), ('Madrid', 'NOUN'), ('and', 'CONJ'), ('this', 'DET'), ('particular', 'ADJ'), ('monitor', 'NOUN'), ('had', 'VERB'), ('as', 'ADP'), ('poor', 'ADJ'), ('of', 'ADP'), ('picture', 'NOUN'), ('quality', 'NOUN'), ('as', 'ADP'), ('any', 'DET'), ('I', 'PRON'), (\"'ve\", 'VERB'), ('seen', 'VERB'), ('.', '.')]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk import pos_tag, word_tokenize\n",
|
||||
"print (pos_tag(word_tokenize(review), tagset='universal'))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Based on this POS info, we could use correctly now the WordNetLemmatizer. The WordNetLemmatizer only is interesting for 4 POS categories: ADJ, ADV, NOUN, and VERB."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['I', 'purchase', 'Dell', 'monitor', 'because', 'of', 'budgetary', 'concern', 'item', 'be', 'most', 'inexpensive', '17', 'inch', 'Apple', 'monitor', 'available', 'me', 'at', 'time', 'I', 'make', 'purchase', 'My', 'overall', 'experience', 'with', 'monitor', 'be', 'very', 'poor', 'When', 'screen', 'be', \"n't\", 'contract', 'or', 'glitching', 'overall', 'picture', 'quality', 'be', 'poor', 'fair', 'I', \"'ve\", 'view', 'numerous', 'different', 'monitor', 'model', 'since', 'I', \"'m\", 'college', 'student', 'at', 'UPM', 'in', 'Madrid', 'and', 'particular', 'monitor', 'have', 'a', 'poor', 'of', 'picture', 'quality', 'a', 'I', \"'ve\", 'see']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.stem import WordNetLemmatizer\n",
|
||||
"\n",
|
||||
"review_postagged = pos_tag(word_tokenize(review), tagset='universal')\n",
|
||||
"pos_mapping = {'NOUN': 'n', 'ADJ': 'a', 'VERB': 'v', 'ADV': 'r', 'ADP': 'n', 'CONJ': 'n', \n",
|
||||
" 'PRON': 'n', 'NUM': 'n', 'X': 'n' }\n",
|
||||
"\n",
|
||||
"wordnet = WordNetLemmatizer()\n",
|
||||
"lemmas = [wordnet.lemmatize(w, pos=pos_mapping[tag]) for (w,tag) in review_postagged if tag in pos_mapping.keys()]\n",
|
||||
"print(lemmas)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# NER"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Named Entity Recognition (NER) is an information retrieval for identifying named entities of places, organisation of persons. NER usually relies in a tagged corpus. NER algorithms can be trained for new corpora."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(S\n",
|
||||
" I/PRP\n",
|
||||
" purchased/VBD\n",
|
||||
" this/DT\n",
|
||||
" (ORGANIZATION Dell/NNP)\n",
|
||||
" monitor/NN\n",
|
||||
" because/IN\n",
|
||||
" of/IN\n",
|
||||
" budgetary/JJ\n",
|
||||
" concerns/NNS\n",
|
||||
" ./.\n",
|
||||
" This/DT\n",
|
||||
" item/NN\n",
|
||||
" was/VBD\n",
|
||||
" the/DT\n",
|
||||
" most/RBS\n",
|
||||
" inexpensive/JJ\n",
|
||||
" 17/CD\n",
|
||||
" inch/NN\n",
|
||||
" Apple/NNP\n",
|
||||
" monitor/NN\n",
|
||||
" available/JJ\n",
|
||||
" to/TO\n",
|
||||
" me/PRP\n",
|
||||
" at/IN\n",
|
||||
" the/DT\n",
|
||||
" time/NN\n",
|
||||
" I/PRP\n",
|
||||
" made/VBD\n",
|
||||
" the/DT\n",
|
||||
" purchase/NN\n",
|
||||
" ./.\n",
|
||||
" My/PRP$\n",
|
||||
" overall/JJ\n",
|
||||
" experience/NN\n",
|
||||
" with/IN\n",
|
||||
" this/DT\n",
|
||||
" monitor/NN\n",
|
||||
" was/VBD\n",
|
||||
" very/RB\n",
|
||||
" poor/JJ\n",
|
||||
" ./.\n",
|
||||
" When/WRB\n",
|
||||
" the/DT\n",
|
||||
" screen/NN\n",
|
||||
" was/VBD\n",
|
||||
" n't/RB\n",
|
||||
" contracting/VBG\n",
|
||||
" or/CC\n",
|
||||
" glitching/VBG\n",
|
||||
" the/DT\n",
|
||||
" overall/JJ\n",
|
||||
" picture/NN\n",
|
||||
" quality/NN\n",
|
||||
" was/VBD\n",
|
||||
" poor/JJ\n",
|
||||
" to/TO\n",
|
||||
" fair/VB\n",
|
||||
" ./.\n",
|
||||
" I/PRP\n",
|
||||
" 've/VBP\n",
|
||||
" viewed/VBN\n",
|
||||
" numerous/JJ\n",
|
||||
" different/JJ\n",
|
||||
" monitor/NN\n",
|
||||
" models/NNS\n",
|
||||
" since/IN\n",
|
||||
" I/PRP\n",
|
||||
" 'm/VBP\n",
|
||||
" a/DT\n",
|
||||
" college/NN\n",
|
||||
" student/NN\n",
|
||||
" at/IN\n",
|
||||
" (ORGANIZATION UPM/NNP)\n",
|
||||
" in/IN\n",
|
||||
" (GPE Madrid/NNP)\n",
|
||||
" and/CC\n",
|
||||
" this/DT\n",
|
||||
" particular/JJ\n",
|
||||
" monitor/NN\n",
|
||||
" had/VBD\n",
|
||||
" as/IN\n",
|
||||
" poor/JJ\n",
|
||||
" of/IN\n",
|
||||
" picture/NN\n",
|
||||
" quality/NN\n",
|
||||
" as/IN\n",
|
||||
" any/DT\n",
|
||||
" I/PRP\n",
|
||||
" 've/VBP\n",
|
||||
" seen/VBN\n",
|
||||
" ./.)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk import ne_chunk, pos_tag, word_tokenize\n",
|
||||
"ne_tagged = ne_chunk(pos_tag(word_tokenize(review)), binary=False)\n",
|
||||
"print(ne_tagged) "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"NLTK comes with other NER implementations. We can also use online services, such as [OpenCalais](http://www.opencalais.com/), [DBpedia Spotlight](https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service) or [TagME](http://tagme.di.unipi.it/)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Parsing and Chunking"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Parsing** is the process of obtaining a parsing tree given a grammar. It which can be very useful to understand the relationship among the words.\n",
|
||||
"\n",
|
||||
"As we have seen in class, we can follow a traditional approach and obtain a full parsing tree or shallow parsing (chunking) and obtain a partial tree.\n",
|
||||
"\n",
|
||||
"We can use the StandfordParser that is integrated in NLTK, but it requires to configure the CLASSPATH, which can be a bit annoying. Instead, we are going to see some demos to understand how grammars work. In case you are interested, you can consult the [manual](http://www.nltk.org/api/nltk.parse.html) to run it.\n",
|
||||
"\n",
|
||||
"In the following example, you will run an interactive context-free parser, called [shift-reduce parser](http://www.nltk.org/book/ch08.html).\n",
|
||||
"The pane on the left shows the grammar as a list of production rules. The pane on the right contains the stack and the remaining input.\n",
|
||||
"\n",
|
||||
"You should:\n",
|
||||
"* Run pressing 'step' until the sentence is fully analyzed. With each step, the parser either shifts one word onto the stack or reduces two subtrees of the stack into a new subtree.\n",
|
||||
"* Try to act as the parser. Instead of pressing 'step', press 'shift' and 'reduce'. Follow the 'always shift before reduce' rule. It is likely you will reach a state where the parser cannot proceed. You can go back with 'Undo'."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from nltk.app import srparser_app\n",
|
||||
"srparser_app.app()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"**Chunking** o **shallow parsing** aims at extracting relevant parts of the sentence. There are (two main approaches)[http://www.nltk.org/book/ch07.html] to chunking: using regular expressions based on POS tags, or training a chunk parser.\n",
|
||||
"\n",
|
||||
"We are going to illustrate the first technique for extracting NP chunks.\n",
|
||||
"\n",
|
||||
"We define regular expressions for the chunks we want to get."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 42,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"(S\n",
|
||||
" I/PRON\n",
|
||||
" purchased/VERB\n",
|
||||
" (NP this/DET Dell/NOUN monitor/NOUN)\n",
|
||||
" because/ADP\n",
|
||||
" of/ADP\n",
|
||||
" (NP budgetary/ADJ concerns/NOUN)\n",
|
||||
" ./.\n",
|
||||
" (NP This/DET item/NOUN)\n",
|
||||
" was/VERB\n",
|
||||
" (NP\n",
|
||||
" the/DET\n",
|
||||
" most/ADV\n",
|
||||
" inexpensive/ADJ\n",
|
||||
" 17/NUM\n",
|
||||
" inch/NOUN\n",
|
||||
" Apple/NOUN\n",
|
||||
" monitor/NOUN)\n",
|
||||
" available/ADJ\n",
|
||||
" to/PRT\n",
|
||||
" me/PRON\n",
|
||||
" at/ADP\n",
|
||||
" (NP the/DET time/NOUN)\n",
|
||||
" I/PRON\n",
|
||||
" made/VERB\n",
|
||||
" (NP the/DET purchase/NOUN)\n",
|
||||
" ./.\n",
|
||||
" (NP My/PRON overall/ADJ experience/NOUN)\n",
|
||||
" with/ADP\n",
|
||||
" (NP this/DET monitor/NOUN)\n",
|
||||
" was/VERB\n",
|
||||
" very/ADV\n",
|
||||
" poor/ADJ\n",
|
||||
" ./.\n",
|
||||
" When/ADV\n",
|
||||
" (NP the/DET screen/NOUN)\n",
|
||||
" was/VERB\n",
|
||||
" n't/ADV\n",
|
||||
" contracting/VERB\n",
|
||||
" or/CONJ\n",
|
||||
" glitching/VERB\n",
|
||||
" (NP the/DET overall/ADJ picture/NOUN quality/NOUN)\n",
|
||||
" was/VERB\n",
|
||||
" poor/ADJ\n",
|
||||
" to/PRT\n",
|
||||
" fair/VERB\n",
|
||||
" ./.\n",
|
||||
" I/PRON\n",
|
||||
" 've/VERB\n",
|
||||
" viewed/VERB\n",
|
||||
" (NP numerous/ADJ different/ADJ monitor/NOUN models/NOUN)\n",
|
||||
" since/ADP\n",
|
||||
" I/PRON\n",
|
||||
" 'm/VERB\n",
|
||||
" (NP a/DET college/NOUN student/NOUN)\n",
|
||||
" at/ADP\n",
|
||||
" (NP UPM/NOUN)\n",
|
||||
" in/ADP\n",
|
||||
" (NP Madrid/NOUN)\n",
|
||||
" and/CONJ\n",
|
||||
" (NP this/DET particular/ADJ monitor/NOUN)\n",
|
||||
" had/VERB\n",
|
||||
" as/ADP\n",
|
||||
" poor/ADJ\n",
|
||||
" of/ADP\n",
|
||||
" (NP picture/NOUN quality/NOUN)\n",
|
||||
" as/ADP\n",
|
||||
" any/DET\n",
|
||||
" I/PRON\n",
|
||||
" 've/VERB\n",
|
||||
" seen/VERB\n",
|
||||
" ./.)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from nltk.chunk.regexp import *\n",
|
||||
"pattern = \"\"\"NP: {<PRON><ADJ><NOUN>+} \n",
|
||||
" {<DET>?<ADV>?<ADJ|NUM>*?<NOUN>+}\n",
|
||||
" \"\"\"\n",
|
||||
"NPChunker = RegexpParser(pattern)\n",
|
||||
"\n",
|
||||
"reviews_pos = (pos_tag(word_tokenize(review), tagset='universal'))\n",
|
||||
"\n",
|
||||
"chunks_np = NPChunker.parse(reviews_pos)\n",
|
||||
"print(chunks_np)\n",
|
||||
"chunks_np.draw()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we can traverse the trees and obtain the strings as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 54,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[Tree('NP', [('this', 'DET'), ('Dell', 'NOUN'), ('monitor', 'NOUN')]),\n",
|
||||
" Tree('NP', [('budgetary', 'ADJ'), ('concerns', 'NOUN')]),\n",
|
||||
" Tree('NP', [('This', 'DET'), ('item', 'NOUN')]),\n",
|
||||
" Tree('NP', [('the', 'DET'), ('most', 'ADV'), ('inexpensive', 'ADJ'), ('17', 'NUM'), ('inch', 'NOUN'), ('Apple', 'NOUN'), ('monitor', 'NOUN')]),\n",
|
||||
" Tree('NP', [('the', 'DET'), ('time', 'NOUN')]),\n",
|
||||
" Tree('NP', [('the', 'DET'), ('purchase', 'NOUN')]),\n",
|
||||
" Tree('NP', [('My', 'PRON'), ('overall', 'ADJ'), ('experience', 'NOUN')]),\n",
|
||||
" Tree('NP', [('this', 'DET'), ('monitor', 'NOUN')]),\n",
|
||||
" Tree('NP', [('the', 'DET'), ('screen', 'NOUN')]),\n",
|
||||
" Tree('NP', [('the', 'DET'), ('overall', 'ADJ'), ('picture', 'NOUN'), ('quality', 'NOUN')]),\n",
|
||||
" Tree('NP', [('numerous', 'ADJ'), ('different', 'ADJ'), ('monitor', 'NOUN'), ('models', 'NOUN')]),\n",
|
||||
" Tree('NP', [('a', 'DET'), ('college', 'NOUN'), ('student', 'NOUN')]),\n",
|
||||
" Tree('NP', [('UPM', 'NOUN')]),\n",
|
||||
" Tree('NP', [('Madrid', 'NOUN')]),\n",
|
||||
" Tree('NP', [('this', 'DET'), ('particular', 'ADJ'), ('monitor', 'NOUN')]),\n",
|
||||
" Tree('NP', [('picture', 'NOUN'), ('quality', 'NOUN')])]"
|
||||
]
|
||||
},
|
||||
"execution_count": 54,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def extractTrees(parsed_tree, category='NP'):\n",
|
||||
" return list(parsed_tree.subtrees(filter=lambda x: x.label()==category))\n",
|
||||
"\n",
|
||||
"extractTrees(chunks_np, 'NP')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 90,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['this Dell monitor',\n",
|
||||
" 'budgetary concerns',\n",
|
||||
" 'This item',\n",
|
||||
" 'the most inexpensive 17 inch Apple monitor',\n",
|
||||
" 'the time',\n",
|
||||
" 'the purchase',\n",
|
||||
" 'My overall experience',\n",
|
||||
" 'this monitor',\n",
|
||||
" 'the screen',\n",
|
||||
" 'the overall picture quality',\n",
|
||||
" 'numerous different monitor models',\n",
|
||||
" 'a college student',\n",
|
||||
" 'UPM',\n",
|
||||
" 'Madrid',\n",
|
||||
" 'this particular monitor',\n",
|
||||
" 'picture quality']"
|
||||
]
|
||||
},
|
||||
"execution_count": 90,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"def extractStrings(parsed_tree, category='NP'):\n",
|
||||
" return [\" \".join(word for word, pos in vp.leaves()) for vp in extractTrees(parsed_tree, category)]\n",
|
||||
" \n",
|
||||
"extractStrings(chunks_np)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
||||
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,742 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Vector Representation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"* [Objectives](#Objectives)\n",
|
||||
"* [Tools](#Tools)\n",
|
||||
"* [Vector representation: Count vector](#Vector-representation:-Count-vector)\n",
|
||||
"* [Binary vectors](#Binary-vectors)\n",
|
||||
"* [Bigram vectors](#Bigram-vectors)\n",
|
||||
"* [Tf-idf vector representation](#Tf-idf-vector-representation)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Objectives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this notebook we are going to transform text into feature vectors, using several representations as presented in class.\n",
|
||||
"\n",
|
||||
"We are going to use the examples from the slides."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"doc1 = 'Summer is coming but Summer is short'\n",
|
||||
"doc2 = 'I like the Summer and I like the Winter'\n",
|
||||
"doc3 = 'I like sandwiches and I like the Winter'\n",
|
||||
"documents = [doc1, doc2, doc3]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"# Tools"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The different tools we have presented so far (NLTK, Scikit-Learn, TextBlob and CLiPS) provide overlapping functionalities for obtaining vector representations and apply machine learning algorithms.\n",
|
||||
"\n",
|
||||
"We are going to focus on the use of scikit-learn so that we can also use easily Pandas as we saw in the previous topic.\n",
|
||||
"\n",
|
||||
"Scikit-learn provides specific facililities for processing texts, as described in the [manual](http://scikit-learn.org/stable/modules/feature_extraction.html)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Vector representation: Count vector"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Scikit-learn provides two classes for binary vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). The latter is more efficient but does not allow to understand which features are more important, so we use the first class. Nevertheless, they are compatible, so, they can be interchanged for production environments.\n",
|
||||
"\n",
|
||||
"The first step for vectorizing with scikit-learn is creating a CountVectorizer object and then we should call 'fit_transform' to fit the vocabulary."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"CountVectorizer(analyzer='word', binary=False, decode_error='strict',\n",
|
||||
" dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',\n",
|
||||
" lowercase=True, max_df=1.0, max_features=5000, min_df=1,\n",
|
||||
" ngram_range=(1, 1), preprocessor=None, stop_words=None,\n",
|
||||
" strip_accents=None, token_pattern='(?u)\\\\b\\\\w\\\\w+\\\\b',\n",
|
||||
" tokenizer=None, vocabulary=None)"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
||||
"\n",
|
||||
"vectorizer = CountVectorizer(analyzer = \"word\", max_features = 5000) \n",
|
||||
"vectorizer"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"As we can see, [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) comes with many options. We can define many configuration options, such as the maximum or minimum frequency of a term (*min_fd*, *max_df*), maximum number of features (*max_features*), if we analyze words or characters (*analyzer*), or if the output is binary or not (*binary*). *CountVectorizer* also allows us to include if we want to preprocess the input (*preprocessor*) before tokenizing it (*tokenizer*) and exclude stop words (*stop_words*).\n",
|
||||
"\n",
|
||||
"We can use NLTK preprocessing and tokenizer functions to tune *CountVectorizer* using these parameters.\n",
|
||||
"\n",
|
||||
"We are going to see how the vectors look like."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<3x10 sparse matrix of type '<class 'numpy.int64'>'\n",
|
||||
"\twith 15 stored elements in Compressed Sparse Row format>"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"We see the vectors are stored as a sparse matrix of 3x6 dimensions.\n",
|
||||
"We can print the matrix as well as the feature names."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[0 1 1 2 0 0 1 2 0 0]\n",
|
||||
" [1 0 0 0 2 0 0 1 2 1]\n",
|
||||
" [1 0 0 0 2 1 0 0 1 1]]\n",
|
||||
"['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short', 'summer', 'the', 'winter']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(vectors.toarray())\n",
|
||||
"print(vectorizer.get_feature_names())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"source": [
|
||||
"As you can see, the pronoun 'I' has been removed because of the default token_pattern. \n",
|
||||
"We can change this as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['and',\n",
|
||||
" 'but',\n",
|
||||
" 'coming',\n",
|
||||
" 'i',\n",
|
||||
" 'is',\n",
|
||||
" 'like',\n",
|
||||
" 'sandwiches',\n",
|
||||
" 'short',\n",
|
||||
" 'summer',\n",
|
||||
" 'the',\n",
|
||||
" 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words=None, token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now filter the stop words (it will remove 'and', 'but', 'I', 'is' and 'the')."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"frozenset({'could', 'sixty', 'onto', 'by', 'against', 'up', 'a', 'everything', 'other', 'otherwise', 'ourselves', 'beside', 'nowhere', 'then', 'below', 'put', 'ten', 'such', 'cannot', 'either', 'due', 'hasnt', 'whereupon', 'were', 'once', 'at', 'for', 'front', 'get', 'whereas', 'that', 'eight', 'another', 'except', 'of', 'wherever', 'over', 'to', 'whom', 'you', 'former', 'behind', 'yours', 'yourself', 'what', 'even', 'however', 'go', 'less', 'bottom', 'may', 'along', 'is', 'can', 'move', 'eg', 'somewhere', 'latterly', 'seemed', 'thence', 'becoming', 'himself', 'whether', 'six', 'first', 'off', 'do', 'many', 'namely', 'never', 'because', 'mostly', 'nevertheless', 'thereupon', 'here', 'least', 'anyone', 'one', 'others', 'cry', 'they', 'thereby', 'ie', 'am', 'this', 'would', 'any', 'while', 'see', 'too', 'your', 'somehow', 'within', 'same', 'sometimes', 'thereafter', 'must', 'take', 're', 'both', 'fill', 'nor', 'sometime', 'he', 'third', 'more', 'also', 'most', 'during', 'much', 'our', 'thick', 'enough', 'full', 'toward', 'with', 'mill', 'anyhow', 'nobody', 'why', 'thru', 'although', 'nothing', 'meanwhile', 'or', 'some', 'ltd', 'wherein', 'thus', 'someone', 'whereby', 'who', 'un', 'are', 'hundred', 'whereafter', 'fire', 'twenty', 'only', 'several', 'among', 'no', 'than', 'before', 'been', 'else', 'find', 'fifteen', 'hence', 'ours', 'already', 'be', 'besides', 'next', 'interest', 'whither', 'whole', 'eleven', 'without', 'five', 'show', 'in', 'throughout', 'own', 'amongst', 'will', 'neither', 'everywhere', 'part', 'give', 'my', 'hers', 'his', 'upon', 'well', 'him', 'yourselves', 'whatever', 'cant', 'though', 'had', 'again', 'every', 'noone', 'top', 'which', 'de', 'almost', 'system', 'under', 'down', 'latter', 'above', 'whence', 'found', 'myself', 'three', 'those', 'become', 'moreover', 'but', 'anyway', 'beyond', 'from', 'now', 'as', 'seeming', 'con', 'themselves', 'hereupon', 'each', 'serious', 'two', 'across', 'out', 'the', 'therein', 'between', 'inc', 'where', 'anything', 'seem', 'co', 'therefore', 'whoever', 'herein', 'about', 'herself', 'should', 'anywhere', 'how', 'we', 'after', 'describe', 'being', 'etc', 'very', 'not', 'an', 'me', 'call', 'per', 'detail', 'still', 'around', 'hereby', 'sincere', 'their', 'has', 'became', 'beforehand', 'everyone', 'hereafter', 'made', 'ever', 'indeed', 'itself', 'something', 'afterwards', 'none', 'done', 'nine', 'alone', 'please', 'its', 'name', 'since', 'on', 'she', 'bill', 'have', 'mine', 'few', 'her', 'seems', 'always', 'side', 'forty', 'further', 'via', 'last', 'amount', 'towards', 'fify', 'through', 'whose', 'couldnt', 'perhaps', 'thin', 'until', 'becomes', 'elsewhere', 'and', 'i', 'them', 'together', 'us', 'was', 'when', 'rather', 'whenever', 'formerly', 'keep', 'so', 'back', 'there', 'amoungst', 'might', 'these', 'all', 'empty', 'often', 'into', 'it', 'twelve', 'yet', 'if', 'four'})\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#stop words in scikit-learn for English\n",
|
||||
"print(vectorizer.get_stop_words())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[1, 0, 0, 1, 2, 0],\n",
|
||||
" [0, 2, 0, 0, 1, 1],\n",
|
||||
" [0, 2, 1, 0, 0, 1]], dtype=int64)"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Vectors\n",
|
||||
"f_array = vectors.toarray()\n",
|
||||
"f_array"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can compute now the **distance** between vectors."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.666666666667 1.0 0.166666666667\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from scipy.spatial.distance import cosine\n",
|
||||
"d12 = cosine(f_array[0], f_array[1])\n",
|
||||
"d13 = cosine(f_array[0], f_array[2])\n",
|
||||
"d23 = cosine(f_array[1], f_array[2])\n",
|
||||
"print(d12, d13, d23)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Binary vectors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also get **binary vectors** as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', binary=True) \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[1, 0, 0, 1, 1, 0],\n",
|
||||
" [0, 1, 0, 0, 1, 1],\n",
|
||||
" [0, 1, 1, 0, 0, 1]], dtype=int64)"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors.toarray()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Bigram vectors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"It is also easy to get bigram vectors."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming summer',\n",
|
||||
" 'like sandwiches',\n",
|
||||
" 'like summer',\n",
|
||||
" 'like winter',\n",
|
||||
" 'sandwiches like',\n",
|
||||
" 'summer coming',\n",
|
||||
" 'summer like',\n",
|
||||
" 'summer short']"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', ngram_range=[2,2]) \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[1, 0, 0, 0, 0, 1, 0, 1],\n",
|
||||
" [0, 0, 1, 1, 0, 0, 1, 0],\n",
|
||||
" [0, 1, 0, 1, 1, 0, 0, 0]], dtype=int64)"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors.toarray()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Tf-idf vector representation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we can also get a tf-idf vector representation using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) instead of CountVectorizer."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||||
"\n",
|
||||
"vectorizer = TfidfVectorizer(analyzer=\"word\", stop_words='english')\n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[ 0.48148213, 0. , 0. , 0.48148213, 0.73235914,\n",
|
||||
" 0. ],\n",
|
||||
" [ 0. , 0.81649658, 0. , 0. , 0.40824829,\n",
|
||||
" 0.40824829],\n",
|
||||
" [ 0. , 0.77100584, 0.50689001, 0. , 0. ,\n",
|
||||
" 0.38550292]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors.toarray()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can now compute the similarity of a query and a set of documents as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 30,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 30,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"train = [doc1, doc2, doc3]\n",
|
||||
"vectorizer = TfidfVectorizer(analyzer=\"word\", stop_words='english')\n",
|
||||
"\n",
|
||||
"# We learn the vocabulary (fit) and tranform the docs into vectors\n",
|
||||
"vectors = vectorizer.fit_transform(train)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"metadata": {
|
||||
"collapsed": false,
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[ 0.48148213, 0. , 0. , 0.48148213, 0.73235914,\n",
|
||||
" 0. ],\n",
|
||||
" [ 0. , 0.81649658, 0. , 0. , 0.40824829,\n",
|
||||
" 0.40824829],\n",
|
||||
" [ 0. , 0.77100584, 0.50689001, 0. , 0. ,\n",
|
||||
" 0.38550292]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 31,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors.toarray()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Scikit-learn provides a method to calculate the cosine similarity between one vector and a set of vectors. Based on this, we can rank the similarity. In this case, the ranking for the query is [d1, d2, d3]."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[ 0.38324078, 0.24713249, 0.23336362]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.metrics.pairwise import cosine_similarity\n",
|
||||
"\n",
|
||||
"query = ['winter short']\n",
|
||||
"\n",
|
||||
"# We transform the query into a vector of the learnt vocabulary\n",
|
||||
"vector_query = vectorizer.transform(query)\n",
|
||||
"\n",
|
||||
"# Here we calculate the distance of the query to the docs\n",
|
||||
"cosine_similarity(vector_query, vectors)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The same result can be obtained with pairwise metrics (kernels in ML terminology) if we use the linear kernel."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([ 0.38324078, 0.24713249, 0.23336362])"
|
||||
]
|
||||
},
|
||||
"execution_count": 29,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.metrics.pairwise import linear_kernel\n",
|
||||
"cosine_similarity = linear_kernel(vector_query, vectors).flatten()\n",
|
||||
"cosine_similarity"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [Scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#converting-text-to-vectors) Scikit-learn Convert Text to Vectors"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,439 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Text Classification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"* [Objectives](#Objectives)\n",
|
||||
"* [Corpus](#Corpus)\n",
|
||||
"* [Classifier](#Classifier)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Objectives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understand how to apply machine learning techniques on textual sources\n",
|
||||
"* Learn the facilities provided by scikit-learn"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Corpus"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n",
|
||||
"\n",
|
||||
"We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 93,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.datasets import fetch_20newsgroups\n",
|
||||
"\n",
|
||||
"# We remove metadata to avoid bias in the classification\n",
|
||||
"newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
|
||||
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n",
|
||||
"\n",
|
||||
"# print categories\n",
|
||||
"print(list(newsgroups_train.target_names))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 106,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"20\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#Number of categories\n",
|
||||
"print(len(newsgroups_train.target_names))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 94,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Category id 4 comp.sys.mac.hardware\n",
|
||||
"Doc A fair number of brave souls who upgraded their SI clock oscillator have\n",
|
||||
"shared their experiences for this poll. Please send a brief message detailing\n",
|
||||
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
|
||||
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
|
||||
"functionality with 800 and 1.4 m floppies are especially requested.\n",
|
||||
"\n",
|
||||
"I will be summarizing in the next two days, so please add to the network\n",
|
||||
"knowledge base if you have done the clock upgrade and haven't answered this\n",
|
||||
"poll. Thanks.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Show a document\n",
|
||||
"docid = 1\n",
|
||||
"doc = newsgroups_train.data[docid]\n",
|
||||
"cat = newsgroups_train.target[docid]\n",
|
||||
"\n",
|
||||
"print(\"Category id \" + str(cat) + \" \" + newsgroups_train.target_names[cat])\n",
|
||||
"print(\"Doc \" + doc)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 95,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(11314,)"
|
||||
]
|
||||
},
|
||||
"execution_count": 95,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#Number of files\n",
|
||||
"newsgroups_train.filenames.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 96,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(11314, 101323)"
|
||||
]
|
||||
},
|
||||
"execution_count": 96,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Obtain a vector\n",
|
||||
"\n",
|
||||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||||
"\n",
|
||||
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
|
||||
"\n",
|
||||
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
|
||||
"vectors_train.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 97,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"66.80510871486653"
|
||||
]
|
||||
},
|
||||
"execution_count": 97,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# The tf-idf vectors are very sparse with an average of 66 non zero components in 101.323 dimensions (.06%)\n",
|
||||
"vectors_train.nnz / float(vectors_train.shape[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Classifier"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Once we have vectors, we can create classifiers (or other machine learning algorithms such as clustering) as we saw previously in the notebooks of machine learning with scikit-learn."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 138,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.69545360719001303"
|
||||
]
|
||||
},
|
||||
"execution_count": 138,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.naive_bayes import MultinomialNB\n",
|
||||
"\n",
|
||||
"from sklearn import metrics\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)\n",
|
||||
"# Nevertheless, we only transform the test dataset into vectors (transform, not fit_transform)\n",
|
||||
"\n",
|
||||
"model = MultinomialNB(alpha=.01)\n",
|
||||
"model.fit(vectors_train, newsgroups_train.target)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"pred = model.predict(vectors_test)\n",
|
||||
"\n",
|
||||
"metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are getting F1 of 0.69 for 20 categories this could be improved (optimization, preprocessing, etc.)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 141,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"dimensionality: 101323\n",
|
||||
"density: 1.000000\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.utils.extmath import density\n",
|
||||
"\n",
|
||||
"print(\"dimensionality: %d\" % model.coef_.shape[1])\n",
|
||||
"print(\"density: %f\" % density(model.coef_))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 140,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"alt.atheism: islam atheists say just religion atheism think don people god\n",
|
||||
"comp.graphics: looking format 3d know program file files thanks image graphics\n",
|
||||
"comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n",
|
||||
"comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n",
|
||||
"comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n",
|
||||
"comp.windows.x: using windows x11r5 use application thanks widget server motif window\n",
|
||||
"misc.forsale: asking email sell price condition new shipping offer 00 sale\n",
|
||||
"rec.autos: don ford new good dealer just engine like cars car\n",
|
||||
"rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n",
|
||||
"rec.sport.baseball: braves players pitching hit runs games game baseball team year\n",
|
||||
"rec.sport.hockey: league year nhl games season players play hockey team game\n",
|
||||
"sci.crypt: people use escrow nsa keys government chip clipper encryption key\n",
|
||||
"sci.electronics: don thanks voltage used know does like circuit power use\n",
|
||||
"sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n",
|
||||
"sci.space: just lunar earth shuttle like moon launch orbit nasa space\n",
|
||||
"soc.religion.christian: believe faith christian christ bible people christians church jesus god\n",
|
||||
"talk.politics.guns: just law firearms government fbi don weapons people guns gun\n",
|
||||
"talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n",
|
||||
"talk.politics.misc: know state clinton president just think tax don government people\n",
|
||||
"talk.religion.misc: think don koresh objective christians bible people christian jesus god\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# We can review the top features per topic in Bayes (attribute coef_)\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"def show_top10(classifier, vectorizer, categories):\n",
|
||||
" feature_names = np.asarray(vectorizer.get_feature_names())\n",
|
||||
" for i, category in enumerate(categories):\n",
|
||||
" top10 = np.argsort(classifier.coef_[i])[-10:]\n",
|
||||
" print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
|
||||
"\n",
|
||||
" \n",
|
||||
"show_top10(model, vectorizer, newsgroups_train.target_names)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 136,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[ 3 15]\n",
|
||||
"['comp.sys.ibm.pc.hardware', 'soc.religion.christian']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# We try the classifier in two new docs\n",
|
||||
"\n",
|
||||
"new_docs = ['This is a survey of PC computers', 'God is love']\n",
|
||||
"new_vectors = vectorizer.transform(new_docs)\n",
|
||||
"\n",
|
||||
"pred_docs = model.predict(new_vectors)\n",
|
||||
"print(pred_docs)\n",
|
||||
"print([newsgroups_train.target_names[i] for i in pred_docs])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
||||
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,663 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Semantic Models"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"* [Objectives](#Objectives)\n",
|
||||
"* [Corpus](#Corpus)\n",
|
||||
"* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
|
||||
"* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
|
||||
"* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Objectives"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understand the models and their differences\n",
|
||||
"* Learn to use some of the most popular NLP libraries"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Corpus"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n",
|
||||
"\n",
|
||||
"We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(2034, 2807)"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.datasets import fetch_20newsgroups\n",
|
||||
"\n",
|
||||
"# We filter only some categories, otherwise we have 20 categories\n",
|
||||
"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
|
||||
"# We remove metadata to avoid bias in the classification\n",
|
||||
"newsgroups_train = fetch_20newsgroups(subset='train', \n",
|
||||
" remove=('headers', 'footers', 'quotes'), \n",
|
||||
" categories=categories)\n",
|
||||
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
|
||||
" categories=categories)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Obtain a vector\n",
|
||||
"\n",
|
||||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||||
"\n",
|
||||
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
|
||||
"\n",
|
||||
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
|
||||
"vectors_train.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Converting Scikit-learn to gensim"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*. Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.\n",
|
||||
"\n",
|
||||
"You should install first *gensim*. Run 'conda install -c anaconda gensim=0.12.4' in a terminal."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim import matutils\n",
|
||||
"\n",
|
||||
"vocab = vectorizer.get_feature_names()\n",
|
||||
"\n",
|
||||
"dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])\n",
|
||||
"corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Latent Dirichlet Allocation (LDA)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim.models.ldamodel import LdaModel\n",
|
||||
"\n",
|
||||
"# It takes a long time\n",
|
||||
"\n",
|
||||
"# train the lda model, choosing number of topics equal to 4\n",
|
||||
"lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.004*objects + 0.004*obtained + 0.003*comets + 0.003*manhattan + 0.003*member + 0.003*beginning + 0.003*center + 0.003*groups + 0.003*aware + 0.003*increased'),\n",
|
||||
" (1,\n",
|
||||
" '0.003*activity + 0.002*objects + 0.002*professional + 0.002*eyes + 0.002*manhattan + 0.002*pressure + 0.002*netters + 0.002*chosen + 0.002*attempted + 0.002*medical'),\n",
|
||||
" (2,\n",
|
||||
" '0.003*mechanism + 0.003*led + 0.003*platform + 0.003*frank + 0.003*mormons + 0.003*aeronautics + 0.002*concepts + 0.002*header + 0.002*forces + 0.002*profit'),\n",
|
||||
" (3,\n",
|
||||
" '0.005*diameter + 0.005*having + 0.004*complex + 0.004*conclusions + 0.004*activity + 0.004*looking + 0.004*action + 0.004*inflatable + 0.004*defined + 0.004*association')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check the topics\n",
|
||||
"lda.print_topics(4)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# import the gensim.corpora module to generate dictionary\n",
|
||||
"from gensim import corpora\n",
|
||||
"\n",
|
||||
"from nltk import word_tokenize\n",
|
||||
"from nltk.corpus import stopwords\n",
|
||||
"from nltk import RegexpTokenizer\n",
|
||||
"\n",
|
||||
"import string\n",
|
||||
"\n",
|
||||
"def preprocess(words):\n",
|
||||
" tokenizer = RegexpTokenizer('[A-Z]\\w+')\n",
|
||||
" tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
|
||||
" stoplist = stopwords.words('english')\n",
|
||||
" tokens_stop = [w for w in tokens if w not in stoplist]\n",
|
||||
" punctuation = set(string.punctuation)\n",
|
||||
" tokens_clean = [w for w in tokens_stop if w not in punctuation]\n",
|
||||
" return tokens_clean\n",
|
||||
"\n",
|
||||
"#words = preprocess(newsgroups_train.data)\n",
|
||||
"#dictionary = corpora.Dictionary(newsgroups_train.data)\n",
|
||||
"\n",
|
||||
"texts = [preprocess(document) for document in newsgroups_train.data]\n",
|
||||
"\n",
|
||||
"dictionary = corpora.Dictionary(texts)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# You can save the dictionary\n",
|
||||
"dictionary.save('newsgroup.dict')\n",
|
||||
"\n",
|
||||
"print(dictionary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Generate a list of docs, where each doc is a list of words\n",
|
||||
"\n",
|
||||
"docs = [preprocess(doc) for doc in newsgroups_train.data]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# import the gensim.corpora module to generate dictionary\n",
|
||||
"from gensim import corpora\n",
|
||||
"\n",
|
||||
"dictionary = corpora.Dictionary(docs)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# You can optionally save the dictionary \n",
|
||||
"\n",
|
||||
"dictionary.save('newsgroups.dict')\n",
|
||||
"lda = LdaModel.load('newsgroups.lda')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# We can print the dictionary, it is a mappying of id and tokens\n",
|
||||
"\n",
|
||||
"print(dictionary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {
|
||||
"collapsed": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# construct the corpus representing each document as a bag-of-words (bow) vector\n",
|
||||
"corpus = [dictionary.doc2bow(doc) for doc in docs]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim.models import TfidfModel\n",
|
||||
"\n",
|
||||
"# calculate tfidf\n",
|
||||
"tfidf_model = TfidfModel(corpus)\n",
|
||||
"corpus_tfidf = tfidf_model[corpus]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#print tf-idf of first document\n",
|
||||
"print(corpus_tfidf[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim.models.ldamodel import LdaModel\n",
|
||||
"\n",
|
||||
"# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
|
||||
"\n",
|
||||
"lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.010*targa + 0.007*ns + 0.006*thanks + 0.006*davidian + 0.006*ssrt + 0.006*yayayay + 0.005*craig + 0.005*bull + 0.005*gerald + 0.005*sorry'),\n",
|
||||
" (1,\n",
|
||||
" '0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades'),\n",
|
||||
" (2,\n",
|
||||
" '0.007*koresh + 0.007*moon + 0.007*western + 0.006*plane + 0.006*jeff + 0.006*unix + 0.005*bible + 0.005*also + 0.005*basically + 0.005*bob'),\n",
|
||||
" (3,\n",
|
||||
" '0.011*whatever + 0.008*joy + 0.007*happy + 0.006*virtual + 0.006*reality + 0.004*really + 0.003*samuel___ + 0.003*oh + 0.003*virtually + 0.003*toaster')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check the topics\n",
|
||||
"lda_model.print_topics(4)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.085176135689180726), (1, 0.6919655173835938), (2, 0.1377903468164027), (3, 0.0850680001108228)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check the lsa vector for the first document\n",
|
||||
"corpus_lda = lda_model[corpus_tfidf]\n",
|
||||
"print(corpus_lda[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 20,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[('lord', 1), ('god', 2)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#predict topics of a new doc\n",
|
||||
"new_doc = \"God is love and God is the Lord\"\n",
|
||||
"#transform into BOW space\n",
|
||||
"bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
|
||||
"print([(dictionary[id], count) for id, count in bow_vector])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 21,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.062509420435514051), (1, 0.81246608790618835), (2, 0.062508281488992554), (3, 0.062516210169305114)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#transform into LDA space\n",
|
||||
"lda_vector = lda_model[bow_vector]\n",
|
||||
"print(lda_vector)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 22,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# print the document's single most prominent LDA topic\n",
|
||||
"print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.10392179866025079), (1, 0.68822094221870811), (2, 0.10391916429993264), (3, 0.10393809482110833)]\n",
|
||||
"0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
|
||||
"print(lda_vector_tfidf)\n",
|
||||
"# print the document's single most prominent LDA topic\n",
|
||||
"print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Latent Semantic Indexing (LSI)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 25,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim.models.lsimodel import LsiModel\n",
|
||||
"\n",
|
||||
"#It takes a long time\n",
|
||||
"\n",
|
||||
"# train the lsi model, choosing number of topics equal to 20\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.769*\"god\" + 0.346*\"jesus\" + 0.235*\"bible\" + 0.204*\"christian\" + 0.149*\"christians\" + 0.107*\"christ\" + 0.090*\"well\" + 0.085*\"koresh\" + 0.081*\"kent\" + 0.080*\"christianity\"'),\n",
|
||||
" (1,\n",
|
||||
" '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.153*\"hi\" + 0.123*\"god\" + -0.112*\"sorry\" + -0.087*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"vga\"'),\n",
|
||||
" (2,\n",
|
||||
" '0.780*\"well\" + -0.229*\"god\" + 0.165*\"yes\" + -0.153*\"thanks\" + 0.133*\"ico\" + 0.133*\"tek\" + 0.130*\"bronx\" + 0.130*\"beauchaine\" + 0.130*\"queens\" + 0.129*\"manhattan\"'),\n",
|
||||
" (3,\n",
|
||||
" '0.340*\"well\" + -0.335*\"ico\" + -0.334*\"tek\" + -0.328*\"beauchaine\" + -0.328*\"bronx\" + -0.328*\"queens\" + -0.326*\"manhattan\" + -0.305*\"bob\" + -0.305*\"com\" + -0.072*\"god\"')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 26,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check the topics\n",
|
||||
"lsi_model.print_topics(4)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {
|
||||
"collapsed": false
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check the lsi vector for the first document\n",
|
||||
"print(corpus_tfidf[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# References"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
||||
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
@ -0,0 +1,144 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"\n",
|
||||
"* [Exercises](#Exercises)\n",
|
||||
"\t* [Exercise 1 - Sentiment classification for Twitter](#Exercise-1---Sentiment-classification-for-Twitter)\n",
|
||||
"\t* [Exercise 2 - Spam classification](#Exercise-2---Spam-classification)\n",
|
||||
"\t* [Exercise 3 - Automatic essay classification](#Exercise-3---Automatic-essay-classification)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Here we propose several exercises, it is recommended to work only in one of them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 1 - Sentiment classification for Twitter"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The purpose of this exercise is:\n",
|
||||
"* Collect geolocated tweets\n",
|
||||
"* Analyse their sentiment\n",
|
||||
"* Represent the result in a map, so that one can understand the sentiment in a geographic region.\n",
|
||||
"\n",
|
||||
"The steps (and most of the code) can be found [here](http://pybonacci.org/2015/11/24/como-hacer-analisis-de-sentimiento-en-espanol-2/). \n",
|
||||
"\n",
|
||||
"You can select the tweets in any language."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 2 - Spam classification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The classification of spam is a classical problem. [Here](http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html) you can find a detailed example of how to do it using the datasets Enron-Spama and SpamAssassin. You can try to test yourself the classification."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Exercise 3 - Automatic essay classification"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As you have seen, we did not got great results in the previous notebook. You can try to improve them."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.5.1"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 0
|
||||
}
|
File diff suppressed because one or more lines are too long
Binary file not shown.
After Width: | Height: | Size: 3.1 KiB |
Loading…
Reference in New Issue