mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-10 00:42:28 +00:00
464 lines
13 KiB
Plaintext
464 lines
13 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Text Classification"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Objectives](#Objectives)\n",
|
|
"* [Corpus](#Corpus)\n",
|
|
"* [Classifier](#Classifier)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Objectives"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n",
|
|
"\n",
|
|
"The main objectives of this session are:\n",
|
|
"* Understand how to apply machine learning techniques on textual sources\n",
|
|
"* Learn the facilities provided by scikit-learn"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Corpus"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n",
|
|
"\n",
|
|
"We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.datasets import fetch_20newsgroups\n",
|
|
"\n",
|
|
"# We remove metadata to avoid bias in the classification\n",
|
|
"newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
|
|
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n",
|
|
"\n",
|
|
"# print categories\n",
|
|
"print(list(newsgroups_train.target_names))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"20\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"#Number of categories\n",
|
|
"print(len(newsgroups_train.target_names))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Category id 4 comp.sys.mac.hardware\n",
|
|
"Doc A fair number of brave souls who upgraded their SI clock oscillator have\n",
|
|
"shared their experiences for this poll. Please send a brief message detailing\n",
|
|
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
|
|
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
|
|
"functionality with 800 and 1.4 m floppies are especially requested.\n",
|
|
"\n",
|
|
"I will be summarizing in the next two days, so please add to the network\n",
|
|
"knowledge base if you have done the clock upgrade and haven't answered this\n",
|
|
"poll. Thanks.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Show a document\n",
|
|
"docid = 1\n",
|
|
"doc = newsgroups_train.data[docid]\n",
|
|
"cat = newsgroups_train.target[docid]\n",
|
|
"\n",
|
|
"print(\"Category id \" + str(cat) + \" \" + newsgroups_train.target_names[cat])\n",
|
|
"print(\"Doc \" + doc)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(11314,)"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Number of files\n",
|
|
"newsgroups_train.filenames.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/home/cif/anaconda3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n",
|
|
" VisibleDeprecationWarning)\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(11314, 101323)"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Obtain a vector\n",
|
|
"\n",
|
|
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
|
"\n",
|
|
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
|
|
"\n",
|
|
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
|
|
"vectors_train.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"66.80510871486653"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# The tf-idf vectors are very sparse with an average of 66 non zero components in 101.323 dimensions (.06%)\n",
|
|
"vectors_train.nnz / float(vectors_train.shape[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Classifier"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Once we have vectors, we can create classifiers (or other machine learning algorithms such as clustering) as we saw previously in the notebooks of machine learning with scikit-learn."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/home/cif/anaconda3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n",
|
|
" VisibleDeprecationWarning)\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0.69545360719001303"
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.naive_bayes import MultinomialNB\n",
|
|
"\n",
|
|
"from sklearn import metrics\n",
|
|
"\n",
|
|
"\n",
|
|
"# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)\n",
|
|
"# Nevertheless, we only transform the test dataset into vectors (transform, not fit_transform)\n",
|
|
"\n",
|
|
"model = MultinomialNB(alpha=.01)\n",
|
|
"model.fit(vectors_train, newsgroups_train.target)\n",
|
|
"\n",
|
|
"vectors_test = vectorizer.transform(newsgroups_test.data)\n",
|
|
"pred = model.predict(vectors_test)\n",
|
|
"\n",
|
|
"metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We are getting F1 of 0.69 for 20 categories this could be improved (optimization, preprocessing, etc.)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"dimensionality: 101323\n",
|
|
"density: 1.000000\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.utils.extmath import density\n",
|
|
"\n",
|
|
"print(\"dimensionality: %d\" % model.coef_.shape[1])\n",
|
|
"print(\"density: %f\" % density(model.coef_))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"alt.atheism: islam atheists say just religion atheism think don people god\n",
|
|
"comp.graphics: looking format 3d know program file files thanks image graphics\n",
|
|
"comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n",
|
|
"comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n",
|
|
"comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n",
|
|
"comp.windows.x: using windows x11r5 use application thanks widget server motif window\n",
|
|
"misc.forsale: asking email sell price condition new shipping offer 00 sale\n",
|
|
"rec.autos: don ford new good dealer just engine like cars car\n",
|
|
"rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n",
|
|
"rec.sport.baseball: braves players pitching hit runs games game baseball team year\n",
|
|
"rec.sport.hockey: league year nhl games season players play hockey team game\n",
|
|
"sci.crypt: people use escrow nsa keys government chip clipper encryption key\n",
|
|
"sci.electronics: don thanks voltage used know does like circuit power use\n",
|
|
"sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n",
|
|
"sci.space: just lunar earth shuttle like moon launch orbit nasa space\n",
|
|
"soc.religion.christian: believe faith christian christ bible people christians church jesus god\n",
|
|
"talk.politics.guns: just law firearms government fbi don weapons people guns gun\n",
|
|
"talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n",
|
|
"talk.politics.misc: know state clinton president just think tax don government people\n",
|
|
"talk.religion.misc: think don koresh objective christians bible people christian jesus god\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# We can review the top features per topic in Bayes (attribute coef_)\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"def show_top10(classifier, vectorizer, categories):\n",
|
|
" feature_names = np.asarray(vectorizer.get_feature_names())\n",
|
|
" for i, category in enumerate(categories):\n",
|
|
" top10 = np.argsort(classifier.coef_[i])[-10:]\n",
|
|
" print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
|
|
"\n",
|
|
" \n",
|
|
"show_top10(model, vectorizer, newsgroups_train.target_names)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[ 2 15]\n",
|
|
"['comp.os.ms-windows.misc', 'soc.religion.christian']\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/home/cif/anaconda3/lib/python3.5/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: `rank` is deprecated; use the `ndim` attribute or function instead. To find the rank of a matrix see `numpy.linalg.matrix_rank`.\n",
|
|
" VisibleDeprecationWarning)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# We try the classifier in two new docs\n",
|
|
"\n",
|
|
"new_docs = ['This is a survey of PC computers', 'God is love']\n",
|
|
"new_vectors = vectorizer.transform(new_docs)\n",
|
|
"\n",
|
|
"pred_docs = model.predict(new_vectors)\n",
|
|
"print(pred_docs)\n",
|
|
"print([newsgroups_train.target_names[i] for i in pred_docs])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
|
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|