sitc/nlp/4_4_Classification.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [Corpus](#Corpus)\n",
    "* [Classifier](#Classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we provide a quick overview of how the vector models presented previously can be used to apply machine learning techniques, such as classification.\n",
    "\n",
    "The main objectives of this session are:\n",
    "* Understand how to apply machine learning techniques to textual sources\n",
    "* Learn to use the facilities provided by scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use one of the corpora that scikit-learn makes available (it is downloaded on first use): the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroups dataset contains about 20,000 documents that belong to 20 topics.\n",
    "\n",
    "We now inspect the corpus using the facilities provided by scikit-learn, as explained in the [scikit-learn documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "# We remove metadata to avoid bias in the classification\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n",
    "\n",
    "# print categories\n",
    "print(list(newsgroups_train.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Number of categories\n",
    "print(len(newsgroups_train.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show a document\n",
    "docid = 1\n",
    "doc = newsgroups_train.data[docid]\n",
    "cat = newsgroups_train.target[docid]\n",
    "\n",
    "print(\"Category id \" +  str(cat) + \" \" + newsgroups_train.target_names[cat])\n",
    "print(\"Doc \" + doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Number of files\n",
    "newsgroups_train.filenames.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Obtain a vector\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
    "\n",
    "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
    "vectors_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The tf-idf vectors are very sparse: on average about 66 non-zero components out of 101,323 dimensions (about 0.06%)\n",
    "# nnz is the total number of non-zero entries, so nnz / number of documents is the average per document\n",
    "vectors_train.nnz / float(vectors_train.shape[0])"
   ]
  },
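  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The average number of non-zero components per document can be related to the vocabulary size to obtain the density of the matrix. The cell below is a minimal sketch of that computation. It uses a hypothetical toy corpus (`toy_docs`, `toy_vectorizer` and `toy_vectors` are illustrative names, not part of the dataset), so it runs without downloading anything."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "# Hypothetical toy corpus, used only to illustrate the computation\n",
    "toy_docs = ['the cat sat on the mat',\n",
    "            'the dog chased the cat',\n",
    "            'stock markets fell sharply today']\n",
    "\n",
    "toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
    "toy_vectors = toy_vectorizer.fit_transform(toy_docs)\n",
    "\n",
    "n_docs, n_terms = toy_vectors.shape\n",
    "# Average number of non-zero components per document\n",
    "avg_nnz = toy_vectors.nnz / n_docs\n",
    "# Fraction of entries of the document-term matrix that are non-zero\n",
    "toy_density = toy_vectors.nnz / (n_docs * n_terms)\n",
    "\n",
    "print('documents: %d, vocabulary size: %d' % (n_docs, n_terms))\n",
    "print('average non-zero components per document: %.2f' % avg_nnz)\n",
    "print('density: %.4f' % toy_density)"
   ]
  },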
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have vectors, we can train classifiers (or apply other machine learning techniques, such as clustering), as we saw previously in the scikit-learn machine learning notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "from sklearn import metrics\n",
    "\n",
    "\n",
    "# The vocabulary was learnt from the training set when we called fit_transform on the vectorizer above.\n",
    "# The test set is only transformed with that vocabulary (transform, not fit_transform),\n",
    "# so no information from the test documents leaks into the features.\n",
    "\n",
    "model = MultinomialNB(alpha=.01)\n",
    "model.fit(vectors_train, newsgroups_train.target)\n",
    "\n",
    "vectors_test = vectorizer.transform(newsgroups_test.data)\n",
    "pred = model.predict(vectors_test)\n",
    "\n",
    "metrics.f1_score(newsgroups_test.target, pred, average='weighted')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We obtain a weighted F1 score of 0.69 over 20 categories. This could be improved (hyperparameter optimization, preprocessing, etc.)."
   ]
  },
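  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of the optimization step, one possible approach is to chain the vectorizer and the classifier in a `Pipeline` and search over the smoothing parameter `alpha` with `GridSearchCV`. This is a sketch under our own assumptions rather than part of the original session: the toy documents, the labels and the grid of `alpha` values are hypothetical, chosen so that the cell is self-contained and runs quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "# Hypothetical toy dataset (made-up documents, 0 = computers, 1 = religion)\n",
    "toy_docs = ['pc graphics card drivers', 'windows pc keyboard',\n",
    "            'computer monitor for my pc', 'pc memory upgrade',\n",
    "            'god is love', 'faith in god',\n",
    "            'the church and god', 'prayer and faith']\n",
    "toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]\n",
    "\n",
    "# Chaining both steps means the vocabulary is learnt only from the training folds\n",
    "pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),\n",
    "                     ('nb', MultinomialNB())])\n",
    "\n",
    "# Candidate smoothing values; this grid is only an example\n",
    "param_grid = {'nb__alpha': [0.01, 0.1, 1.0]}\n",
    "search = GridSearchCV(pipeline, param_grid, scoring='f1_weighted', cv=2)\n",
    "search.fit(toy_docs, toy_labels)\n",
    "\n",
    "print(search.best_params_)\n",
    "print('cross-validated weighted F1: %.2f' % search.best_score_)"
   ]
  },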
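  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils.extmath import density\n",
    "\n",
    "# feature_log_prob_ holds the per-class log probabilities of each feature;\n",
    "# it replaces coef_, which recent scikit-learn releases no longer provide for MultinomialNB\n",
    "print(\"dimensionality: %d\" % model.feature_log_prob_.shape[1])\n",
    "print(\"density: %f\" % density(model.feature_log_prob_))"
   ]
  },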
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can review the top features per topic in Naive Bayes (attribute feature_log_prob_)\n",
    "import numpy as np\n",
    "\n",
    "def show_top10(classifier, vectorizer, categories):\n",
    "    # get_feature_names_out() requires scikit-learn >= 1.0 (it replaces get_feature_names())\n",
    "    feature_names = np.asarray(vectorizer.get_feature_names_out())\n",
    "    for i, category in enumerate(categories):\n",
    "        # Indices of the 10 features with the highest log probability for this category\n",
    "        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]\n",
    "        print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
    "\n",
    "\n",
    "show_top10(model, vectorizer, newsgroups_train.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We try the classifier on two new documents\n",
    "\n",
    "new_docs = ['This is a survey of PC computers', 'God is love']\n",
    "new_vectors = vectorizer.transform(new_docs)\n",
    "\n",
    "pred_docs = model.predict(new_vectors)\n",
    "print(pred_docs)\n",
    "print([newsgroups_train.target_names[i] for i in pred_docs])"
   ]
  },
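  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`predict` returns only the most likely category. To see how confident the classifier is, `MultinomialNB` also offers `predict_proba`, which returns one probability per category. The cell below is a minimal sketch on a hypothetical toy training set (the documents, the labels and the `toy_*` names are made up for illustration), so it can be run independently of the cells above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "# Hypothetical toy training set with two made-up categories\n",
    "toy_docs = ['cheap graphics card for my pc',\n",
    "            'new monitor and keyboard for the computer',\n",
    "            'faith hope and love',\n",
    "            'the church choir sang hymns']\n",
    "toy_labels = ['computers', 'computers', 'religion', 'religion']\n",
    "\n",
    "toy_vectorizer = TfidfVectorizer(stop_words='english')\n",
    "toy_model = MultinomialNB(alpha=.01)\n",
    "toy_model.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)\n",
    "\n",
    "# New documents are only transformed with the vocabulary learnt above\n",
    "toy_new_vectors = toy_vectorizer.transform(['a survey of pc hardware', 'love and faith'])\n",
    "\n",
    "# classes_ gives the category order of the probability columns\n",
    "for doc_probs in toy_model.predict_proba(toy_new_vectors):\n",
    "    print({str(c): round(float(p), 3) for c, p in zip(toy_model.classes_, doc_probs)})"
   ]
  },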
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}