sitc/nlp/4_4_Classification.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [Corpus](#Corpus)\n",
    "* [Classifier](#Classifier)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n",
    "\n",
    "The main objectives of this session are:\n",
    "* Understand how to apply machine learning techniques on textual sources\n",
    "* Learn the facilities provided by scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20  newsgroup dataset contains 20k documents that belong to 20 topics.\n",
    "\n",
    "We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
     ]
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "# We remove metadata to avoid bias in the classification\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n",
    "\n",
    "# print categories\n",
    "print(list(newsgroups_train.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "20\n"
     ]
    }
   ],
   "source": [
    "#Number of categories\n",
    "print(len(newsgroups_train.target_names))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Category id 4 comp.sys.mac.hardware\n",
      "Doc A fair number of brave souls who upgraded their SI clock oscillator have\n",
      "shared their experiences for this poll. Please send a brief message detailing\n",
      "your experiences with the procedure. Top speed attained, CPU rated speed,\n",
      "add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
      "functionality with 800 and 1.4 m floppies are especially requested.\n",
      "\n",
      "I will be summarizing in the next two days, so please add to the network\n",
      "knowledge base if you have done the clock upgrade and haven't answered this\n",
      "poll. Thanks.\n"
     ]
    }
   ],
   "source": [
    "# Show a document\n",
    "docid = 1\n",
    "doc = newsgroups_train.data[docid]\n",
    "cat = newsgroups_train.target[docid]\n",
    "\n",
    "print(\"Category id \" +  str(cat) + \" \" + newsgroups_train.target_names[cat])\n",
    "print(\"Doc \" + doc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314,)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Number of files\n",
    "newsgroups_train.filenames.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(11314, 101322)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Obtain a vector\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n",
    "\n",
    "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
    "vectors_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "66.802987449178"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The tf-idf vectors are very sparse with an average of 66 non zero components in 101.323 dimensions (.06%)\n",
    "vectors_train.nnz / float(vectors_train.shape[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we have vectors, we can create classifiers (or other machine learning algorithms such as clustering) as we saw previously in the notebooks of machine learning with scikit-learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.695453607190013"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import MultinomialNB\n",
    "\n",
    "from sklearn import metrics\n",
    "\n",
    "\n",
    "# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)\n",
    "# Nevertheless, we only transform the test dataset into vectors  (transform, not fit_transform)\n",
    "\n",
    "model = MultinomialNB(alpha=.01)\n",
    "model.fit(vectors_train, newsgroups_train.target)\n",
    "\n",
    "vectors_test = vectorizer.transform(newsgroups_test.data)\n",
    "pred = model.predict(vectors_test)\n",
    "\n",
    "metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are getting F1 of 0.69 for 20 categories this could be improved (optimization, preprocessing, etc.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "alt.atheism: islam atheists say just religion atheism think don people god\n",
      "comp.graphics: looking format 3d know program file files thanks image graphics\n",
      "comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n",
      "comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n",
      "comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n",
      "comp.windows.x: using windows x11r5 use application thanks widget server motif window\n",
      "misc.forsale: asking email sell price condition new shipping offer 00 sale\n",
      "rec.autos: don ford new good dealer just engine like cars car\n",
      "rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n",
      "rec.sport.baseball: braves players pitching hit runs games game baseball team year\n",
      "rec.sport.hockey: league year nhl games season players play hockey team game\n",
      "sci.crypt: people use escrow nsa keys government chip clipper encryption key\n",
      "sci.electronics: don thanks voltage used know does like circuit power use\n",
      "sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n",
      "sci.space: just lunar earth shuttle like moon launch orbit nasa space\n",
      "soc.religion.christian: believe faith christian christ bible people christians church jesus god\n",
      "talk.politics.guns: just law firearms government fbi don weapons people guns gun\n",
      "talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n",
      "talk.politics.misc: know state clinton president just think tax don government people\n",
      "talk.religion.misc: think don koresh objective christians bible people christian jesus god\n"
     ]
    }
   ],
   "source": [
    "# We can review the top features per topic in Bayes (attribute feature_log_prob_)\n",
    "import numpy as np\n",
    "\n",
    "def show_top10(classifier, vectorizer, categories):\n",
    "    feature_names = np.asarray(vectorizer.get_feature_names_out())\n",
    "    for i, category in enumerate(categories):\n",
    "        top10 = np.argsort(classifier.feature_log_prob_[i, :])[-10:]\n",
    "        print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
    "\n",
    "        \n",
    "show_top10(model, vectorizer, newsgroups_train.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[ 2 15]\n",
      "['comp.os.ms-windows.misc', 'soc.religion.christian']\n"
     ]
    }
   ],
   "source": [
    "# We try the classifier in two new docs\n",
    "\n",
    "new_docs = ['This is a survey of PC computers', 'God is love']\n",
    "new_vectors = vectorizer.transform(new_docs)\n",
    "\n",
    "pred_docs = model.predict(new_vectors)\n",
    "print(pred_docs)\n",
    "print([newsgroups_train.target_names[i] for i in pred_docs])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}