4_4.ipynb (439 lines, Normal file)
							| @@ -0,0 +1,439 @@ | ||||
| { | ||||
|  "cells": [ | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Course Notes for Learning Intelligent Systems" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Text Classification" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Table of Contents\n", | ||||
|     "* [Objectives](#Objectives)\n", | ||||
|     "* [Corpus](#Corpus)\n", | ||||
|     "* [Classifier](#Classifier)" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Objectives" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n", | ||||
|     "\n", | ||||
|     "The main objectives of this session are:\n", | ||||
|     "* Understand how to apply machine learning techniques on textual sources\n", | ||||
|     "* Learn the facilities provided by scikit-learn" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Corpus" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20  newsgroup dataset contains 20k documents that belong to 20 topics.\n", | ||||
|     "\n", | ||||
|     "We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 1, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "from sklearn.datasets import fetch_20newsgroups\n", | ||||
|     "\n", | ||||
|     "# We remove metadata to avoid bias in the classification\n", | ||||
|     "newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n", | ||||
|     "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n", | ||||
|     "\n", | ||||
|     "# print categories\n", | ||||
|     "print(list(newsgroups_train.target_names))" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 2, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "20\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "#Number of categories\n", | ||||
|     "print(len(newsgroups_train.target_names))" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 3, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "Category id 4 comp.sys.mac.hardware\n", | ||||
|       "Doc A fair number of brave souls who upgraded their SI clock oscillator have\n", | ||||
|       "shared their experiences for this poll. Please send a brief message detailing\n", | ||||
|       "your experiences with the procedure. Top speed attained, CPU rated speed,\n", | ||||
|       "add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n", | ||||
|       "functionality with 800 and 1.4 m floppies are especially requested.\n", | ||||
|       "\n", | ||||
|       "I will be summarizing in the next two days, so please add to the network\n", | ||||
|       "knowledge base if you have done the clock upgrade and haven't answered this\n", | ||||
|       "poll. Thanks.\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "# Show a document\n", | ||||
|     "docid = 1\n", | ||||
|     "doc = newsgroups_train.data[docid]\n", | ||||
|     "cat = newsgroups_train.target[docid]\n", | ||||
|     "\n", | ||||
|     "print(\"Category id \" +  str(cat) + \" \" + newsgroups_train.target_names[cat])\n", | ||||
|     "print(\"Doc \" + doc)" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 4, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "data": { | ||||
|       "text/plain": [ | ||||
|        "(11314,)" | ||||
|       ] | ||||
|      }, | ||||
|      "execution_count": 4, | ||||
|      "metadata": {}, | ||||
|      "output_type": "execute_result" | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "#Number of files\n", | ||||
|     "newsgroups_train.filenames.shape" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 5, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "data": { | ||||
|       "text/plain": [ | ||||
|        "(11314, 101322)" | ||||
|       ] | ||||
|      }, | ||||
|      "execution_count": 5, | ||||
|      "metadata": {}, | ||||
|      "output_type": "execute_result" | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "# Obtain a vector\n", | ||||
|     "\n", | ||||
|     "from sklearn.feature_extraction.text import TfidfVectorizer\n", | ||||
|     "\n", | ||||
|     "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n", | ||||
|     "\n", | ||||
|     "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n", | ||||
|     "vectors_train.shape" | ||||
|    ] | ||||
|   }, | ||||
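|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "As a quick reminder of what these weights are: with scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, assumed here), `TfidfVectorizer` weights a term $t$ in a document $d$ as\n", | ||||
|     "\n", | ||||
|     "$$\\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\left( \\ln \\frac{1 + n}{1 + \\mathrm{df}(t)} + 1 \\right)$$\n", | ||||
|     "\n", | ||||
|     "where $n$ is the number of documents and $\\mathrm{df}(t)$ is the number of documents that contain $t$; each document vector is then L2-normalized." | ||||
|    ] | ||||
|   }, | ||||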
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 6, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "data": { | ||||
|       "text/plain": [ | ||||
|        "66.802987449178" | ||||
|       ] | ||||
|      }, | ||||
|      "execution_count": 6, | ||||
|      "metadata": {}, | ||||
|      "output_type": "execute_result" | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "# The tf-idf vectors are very sparse with an average of 66 non zero components in 101.323 dimensions (.06%)\n", | ||||
|     "vectors_train.nnz / float(vectors_train.shape[0])" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "# Classifier" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "Once we have vectors, we can create classifiers (or other machine learning algorithms such as clustering) as we saw previously in the notebooks of machine learning with scikit-learn." | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 7, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "data": { | ||||
|       "text/plain": [ | ||||
|        "0.69545360719001303" | ||||
|       ] | ||||
|      }, | ||||
|      "execution_count": 7, | ||||
|      "metadata": {}, | ||||
|      "output_type": "execute_result" | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "from sklearn.naive_bayes import MultinomialNB\n", | ||||
|     "\n", | ||||
|     "from sklearn import metrics\n", | ||||
|     "\n", | ||||
|     "\n", | ||||
|     "# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)\n", | ||||
|     "# Nevertheless, we only transform the test dataset into vectors  (transform, not fit_transform)\n", | ||||
|     "\n", | ||||
|     "model = MultinomialNB(alpha=.01)\n", | ||||
|     "model.fit(vectors_train, newsgroups_train.target)\n", | ||||
|     "\n", | ||||
|     "vectors_test = vectorizer.transform(newsgroups_test.data)\n", | ||||
|     "pred = model.predict(vectors_test)\n", | ||||
|     "\n", | ||||
|     "metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n" | ||||
|    ] | ||||
|   }, | ||||
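|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "For a per-category view of the results, we can also print a classification report (a quick sketch, not executed here):" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": null, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [], | ||||
|    "source": [ | ||||
|     "# Precision, recall and F1 for each of the 20 categories\n", | ||||
|     "print(metrics.classification_report(newsgroups_test.target, pred,\n", | ||||
|     "                                    target_names=newsgroups_test.target_names))" | ||||
|    ] | ||||
|   }, | ||||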
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "We are getting F1 of 0.69 for 20 categories this could be improved (optimization, preprocessing, etc.)" | ||||
|    ] | ||||
|   }, | ||||
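|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "One possible improvement path is to tune the vectorizer and the classifier jointly. The following sketch (not executed here; it assumes scikit-learn >= 0.18 for `sklearn.model_selection`, and the parameter grid is only illustrative) wraps both steps in a pipeline and runs a grid search:" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": null, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [], | ||||
|    "source": [ | ||||
|     "from sklearn.pipeline import Pipeline\n", | ||||
|     "from sklearn.model_selection import GridSearchCV\n", | ||||
|     "\n", | ||||
|     "pipeline = Pipeline([\n", | ||||
|     "    ('tfidf', TfidfVectorizer(analyzer='word', stop_words='english')),\n", | ||||
|     "    ('nb', MultinomialNB())\n", | ||||
|     "])\n", | ||||
|     "\n", | ||||
|     "# Illustrative search space; adjust to your time budget\n", | ||||
|     "parameters = {\n", | ||||
|     "    'tfidf__ngram_range': [(1, 1), (1, 2)],\n", | ||||
|     "    'nb__alpha': [0.01, 0.1, 1.0]\n", | ||||
|     "}\n", | ||||
|     "\n", | ||||
|     "grid = GridSearchCV(pipeline, parameters, scoring='f1_weighted', cv=3)\n", | ||||
|     "grid.fit(newsgroups_train.data, newsgroups_train.target)\n", | ||||
|     "print(grid.best_params_)\n", | ||||
|     "print(grid.best_score_)" | ||||
|    ] | ||||
|   }, | ||||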
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 8, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "dimensionality: 101322\n", | ||||
|       "density: 1.000000\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "from sklearn.utils.extmath import density\n", | ||||
|     "\n", | ||||
|     "print(\"dimensionality: %d\" % model.coef_.shape[1])\n", | ||||
|     "print(\"density: %f\" % density(model.coef_))" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 9, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "alt.atheism: islam atheists say just religion atheism think don people god\n", | ||||
|       "comp.graphics: looking format 3d know program file files thanks image graphics\n", | ||||
|       "comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n", | ||||
|       "comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n", | ||||
|       "comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n", | ||||
|       "comp.windows.x: using windows x11r5 use application thanks widget server motif window\n", | ||||
|       "misc.forsale: asking email sell price condition new shipping offer 00 sale\n", | ||||
|       "rec.autos: don ford new good dealer just engine like cars car\n", | ||||
|       "rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n", | ||||
|       "rec.sport.baseball: braves players pitching hit runs games game baseball team year\n", | ||||
|       "rec.sport.hockey: league year nhl games season players play hockey team game\n", | ||||
|       "sci.crypt: people use escrow nsa keys government chip clipper encryption key\n", | ||||
|       "sci.electronics: don thanks voltage used know does like circuit power use\n", | ||||
|       "sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n", | ||||
|       "sci.space: just lunar earth shuttle like moon launch orbit nasa space\n", | ||||
|       "soc.religion.christian: believe faith christian christ bible people christians church jesus god\n", | ||||
|       "talk.politics.guns: just law firearms government fbi don weapons people guns gun\n", | ||||
|       "talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n", | ||||
|       "talk.politics.misc: know state clinton president just think tax don government people\n", | ||||
|       "talk.religion.misc: think don koresh objective christians bible people christian jesus god\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "# We can review the top features per topic in Bayes (attribute coef_)\n", | ||||
|     "import numpy as np\n", | ||||
|     "\n", | ||||
|     "def show_top10(classifier, vectorizer, categories):\n", | ||||
|     "    feature_names = np.asarray(vectorizer.get_feature_names())\n", | ||||
|     "    for i, category in enumerate(categories):\n", | ||||
|     "        top10 = np.argsort(classifier.coef_[i])[-10:]\n", | ||||
|     "        print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n", | ||||
|     "\n", | ||||
|     "        \n", | ||||
|     "show_top10(model, vectorizer, newsgroups_train.target_names)" | ||||
|    ] | ||||
|   }, | ||||
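|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "Note that recent scikit-learn releases removed `MultinomialNB.coef_` and `get_feature_names()`. If you run this notebook on scikit-learn >= 1.0 (assumed in this sketch, not executed here), an equivalent helper is:" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": null, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [], | ||||
|    "source": [ | ||||
|     "# Equivalent helper for scikit-learn >= 1.0, where coef_ was removed\n", | ||||
|     "def show_top10_recent(classifier, vectorizer, categories):\n", | ||||
|     "    feature_names = np.asarray(vectorizer.get_feature_names_out())\n", | ||||
|     "    for i, category in enumerate(categories):\n", | ||||
|     "        # feature_log_prob_ holds the per-class log probabilities that coef_ mirrored\n", | ||||
|     "        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]\n", | ||||
|     "        print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))" | ||||
|    ] | ||||
|   }, | ||||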
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": 10, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [ | ||||
|     { | ||||
|      "name": "stdout", | ||||
|      "output_type": "stream", | ||||
|      "text": [ | ||||
|       "[ 2 15]\n", | ||||
|       "['comp.os.ms-windows.misc', 'soc.religion.christian']\n" | ||||
|      ] | ||||
|     } | ||||
|    ], | ||||
|    "source": [ | ||||
|     "# We try the classifier in two new docs\n", | ||||
|     "\n", | ||||
|     "new_docs = ['This is a survey of PC computers', 'God is love']\n", | ||||
|     "new_vectors = vectorizer.transform(new_docs)\n", | ||||
|     "\n", | ||||
|     "pred_docs = model.predict(new_vectors)\n", | ||||
|     "print(pred_docs)\n", | ||||
|     "print([newsgroups_train.target_names[i] for i in pred_docs])" | ||||
|    ] | ||||
|   }, | ||||
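|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "Beyond hard label predictions, `MultinomialNB` also exposes class probabilities via `predict_proba`, which can serve as a rough confidence measure. A quick sketch (not executed here):" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "code", | ||||
|    "execution_count": null, | ||||
|    "metadata": { | ||||
|     "collapsed": false | ||||
|    }, | ||||
|    "outputs": [], | ||||
|    "source": [ | ||||
|     "# Show the three most probable categories for each new document\n", | ||||
|     "probs = model.predict_proba(new_vectors)\n", | ||||
|     "for doc, p in zip(new_docs, probs):\n", | ||||
|     "    print(doc)\n", | ||||
|     "    for i in np.argsort(p)[-3:][::-1]:\n", | ||||
|     "        print('    %s: %.3f' % (newsgroups_train.target_names[i], p[i]))" | ||||
|    ] | ||||
|   }, | ||||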
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "## References\n", | ||||
|     "\n" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n", | ||||
|     "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "## Licence" | ||||
|    ] | ||||
|   }, | ||||
|   { | ||||
|    "cell_type": "markdown", | ||||
|    "metadata": {}, | ||||
|    "source": [ | ||||
|     "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n", | ||||
|     "\n", | ||||
|     "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid." | ||||
|    ] | ||||
|   } | ||||
|  ], | ||||
|  "metadata": { | ||||
|   "kernelspec": { | ||||
|    "display_name": "Python 3", | ||||
|    "language": "python", | ||||
|    "name": "python3" | ||||
|   }, | ||||
|   "language_info": { | ||||
|    "codemirror_mode": { | ||||
|     "name": "ipython", | ||||
|     "version": 3 | ||||
|    }, | ||||
|    "file_extension": ".py", | ||||
|    "mimetype": "text/x-python", | ||||
|    "name": "python", | ||||
|    "nbconvert_exporter": "python", | ||||
|    "pygments_lexer": "ipython3", | ||||
|    "version": "3.5.2" | ||||
|   } | ||||
|  }, | ||||
|  "nbformat": 4, | ||||
|  "nbformat_minor": 0 | ||||
| } | ||||
README.md (12 lines, Normal file)
							| @@ -0,0 +1,12 @@ | ||||
| Clone the repo: | ||||
| ``` | ||||
| git clone https://github.com/gsi-upm/sitc | ||||
| ``` | ||||
|  | ||||
| Run Jupyter either through the `jupyter notebook` command or with Docker: | ||||
|  | ||||
| ``` | ||||
| docker run -v $PWD/:/home/jovyan/work/ -p 8888:8888 jupyter/scipy-notebook | ||||
| ``` | ||||
|  | ||||
| Visit the URL printed in the console output, or copy the access token it shows. | ||||
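|  | ||||
| To run Jupyter locally instead of using Docker, install the dependencies first (a typical set for these notebooks; adjust as needed): | ||||
|  | ||||
| ``` | ||||
| pip install jupyter scikit-learn numpy | ||||
| ``` | ||||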