sitc/nlp/4_5_Semantic_Models.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [Corpus](#Corpus)\n",
    "* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
    "* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
    "* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
    "\n",
    "The main objectives of this session are:\n",
    "* Understand the models and their differences\n",
    "* Learn to use some of the most popular NLP libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20  newsgroup dataset contains 20k documents that belong to 20 topics.\n",
    "\n",
    "We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2034, 2807)"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "# We filter only some categories, otherwise we have 20 categories\n",
    "categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
    "# We remove metadata to avoid bias in the classification\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', \n",
    "                                      remove=('headers', 'footers', 'quotes'), \n",
    "                                      categories=categories)\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
    "                                    categories=categories)\n",
    "\n",
    "\n",
    "# Obtain a vector\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
    "\n",
    "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
    "vectors_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Converting Scikit-learn to gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*. Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.\n",
    "\n",
    "You should install first:\n",
    "\n",
    "* *gensim*. Run 'conda install gensim' in a terminal.\n",
    "* *python-Levenshtein*. Run 'conda install python-Levenshtein' in a terminal"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim import matutils\n",
    "\n",
    "vocab = vectorizer.get_feature_names()\n",
    "\n",
    "dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])\n",
    "corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Latent Dirichlet Allocation (LDA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models.ldamodel import LdaModel\n",
    "\n",
    "# It takes a long time\n",
    "\n",
    "#  train the lda model, choosing number of topics equal to 4\n",
    "lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 61,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.011*\"baptist\" + 0.010*\"koresh\" + 0.009*\"bible\" + 0.006*\"reality\" + 0.006*\"virtual\" + 0.005*\"scarlet\" + 0.005*\"shag\" + 0.004*\"tootsie\" + 0.004*\"kinda\" + 0.004*\"captain\"'),\n",
       " (1,\n",
       "  '0.010*\"targa\" + 0.008*\"thanks\" + 0.008*\"moon\" + 0.007*\"craig\" + 0.007*\"zoroastrians\" + 0.006*\"yayayay\" + 0.005*\"unfortunately\" + 0.005*\"windows\" + 0.005*\"rayshade\" + 0.004*\"tdb\"'),\n",
       " (2,\n",
       "  '0.009*\"mary\" + 0.007*\"whatever\" + 0.006*\"god\" + 0.005*\"ns\" + 0.005*\"lucky\" + 0.005*\"joseph\" + 0.005*\"ssrt\" + 0.005*\"samaritan\" + 0.005*\"crusades\" + 0.004*\"phobos\"'),\n",
       " (3,\n",
       "  '0.009*\"islam\" + 0.008*\"western\" + 0.008*\"plane\" + 0.008*\"jeff\" + 0.007*\"cheers\" + 0.007*\"kent\" + 0.007*\"joy\" + 0.007*\"khomeini\" + 0.007*\"davidian\" + 0.006*\"basically\"')]"
      ]
     },
     "execution_count": 61,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lda.print_topics(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the gensim.corpora module to generate dictionary\n",
    "from gensim import corpora\n",
    "\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk import RegexpTokenizer\n",
    "\n",
    "import string\n",
    "\n",
    "def preprocess(words):\n",
    "    tokenizer = RegexpTokenizer('[A-Z]\\w+')\n",
    "    tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
    "    stoplist = stopwords.words('english')\n",
    "    tokens_stop = [w for w in tokens if w not in stoplist]\n",
    "    punctuation = set(string.punctuation)\n",
    "    tokens_clean = [w for w in tokens_stop if  w not in punctuation]\n",
    "    return tokens_clean\n",
    "\n",
    "#words = preprocess(newsgroups_train.data)\n",
    "#dictionary = corpora.Dictionary(newsgroups_train.data)\n",
    "\n",
    "texts = [preprocess(document) for document in newsgroups_train.data]\n",
    "\n",
    "dictionary = corpora.Dictionary(texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)\n"
     ]
    }
   ],
   "source": [
    "# You can save the dictionary\n",
    "dictionary.save('newsgroup.dict.texts')\n",
    "\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate a list of docs, where each doc is a list of words\n",
    "\n",
    "docs = [preprocess(doc) for doc in newsgroups_train.data]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the gensim.corpora module to generate dictionary\n",
    "from gensim import corpora\n",
    "\n",
    "dictionary = corpora.Dictionary(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)\n"
     ]
    }
   ],
   "source": [
    "# We can print the dictionary, it is a mappying of id and tokens\n",
    "\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [],
   "source": [
    "# construct the corpus representing each document as a bag-of-words (bow) vector\n",
    "corpus = [dictionary.doc2bow(doc) for doc in docs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 73,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import TfidfModel\n",
    "\n",
    "# calculate tfidf\n",
    "tfidf_model = TfidfModel(corpus)\n",
    "corpus_tfidf = tfidf_model[corpus]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
     ]
    }
   ],
   "source": [
    "#print tf-idf of first document\n",
    "print(corpus_tfidf[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models.ldamodel import LdaModel\n",
    "\n",
    "# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
    "\n",
    "lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.009*\"whatever\" + 0.007*\"plane\" + 0.007*\"ns\" + 0.007*\"joy\" + 0.006*\"happy\" + 0.005*\"bob\" + 0.004*\"phil\" + 0.004*\"nasa\" + 0.003*\"purdue\" + 0.003*\"neie\"'),\n",
       " (1,\n",
       "  '0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"'),\n",
       " (2,\n",
       "  '0.010*\"moon\" + 0.007*\"phobos\" + 0.006*\"unfortunately\" + 0.006*\"martian\" + 0.006*\"russian\" + 0.005*\"rayshade\" + 0.005*\"anybody\" + 0.005*\"perturbations\" + 0.005*\"thanks\" + 0.004*\"apollo\"'),\n",
       " (3,\n",
       "  '0.008*\"islam\" + 0.008*\"western\" + 0.007*\"jeff\" + 0.007*\"zoroastrians\" + 0.006*\"davidian\" + 0.006*\"basically\" + 0.005*\"bull\" + 0.005*\"gerald\" + 0.005*\"sorry\" + 0.004*\"kent\"')]"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lda_model.print_topics(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.7154438), (1, 0.10569019), (2, 0.09522807), (3, 0.08363795)]\n"
     ]
    }
   ],
   "source": [
    "# check the lsa vector for the first document\n",
    "corpus_lda = lda_model[corpus_tfidf]\n",
    "print(corpus_lda[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('lord', 1), ('god', 2)]\n"
     ]
    }
   ],
   "source": [
    "#predict topics of a new doc\n",
    "new_doc = \"God is love and God is the Lord\"\n",
    "#transform into BOW space\n",
    "bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
    "print([(dictionary[id], count) for id, count in bow_vector])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.06320839), (1, 0.80878526), (2, 0.06274223), (3, 0.065264106)]\n"
     ]
    }
   ],
   "source": [
    "#transform into LDA space\n",
    "lda_vector = lda_model[bow_vector]\n",
    "print(lda_vector)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"\n"
     ]
    }
   ],
   "source": [
    "# print the document's single most prominent LDA topic\n",
    "print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.10564032), (1, 0.67894983), (2, 0.104482815), (3, 0.11092702)]\n",
      "0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"\n"
     ]
    }
   ],
   "source": [
    "lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
    "print(lda_vector_tfidf)\n",
    "# print the document's single most prominent LDA topic\n",
    "print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Latent Semantic Indexing (LSI)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models.lsimodel import LsiModel\n",
    "\n",
    "#It takes a long time\n",
    "\n",
    "# train the lsi model, choosing number of topics equal to 20\n",
    "\n",
    "\n",
    "lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.769*\"god\" + 0.346*\"jesus\" + 0.235*\"bible\" + 0.204*\"christian\" + 0.148*\"christians\" + 0.107*\"christ\" + 0.090*\"well\" + 0.085*\"koresh\" + 0.081*\"kent\" + 0.080*\"christianity\"'),\n",
       " (1,\n",
       "  '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.152*\"hi\" + 0.124*\"god\" + -0.111*\"sorry\" + -0.088*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"gif\"'),\n",
       " (2,\n",
       "  '-0.780*\"well\" + 0.229*\"god\" + -0.165*\"yes\" + 0.154*\"thanks\" + -0.133*\"ico\" + -0.133*\"tek\" + -0.130*\"queens\" + -0.130*\"bronx\" + -0.130*\"beauchaine\" + -0.130*\"manhattan\"'),\n",
       " (3,\n",
       "  '-0.338*\"well\" + 0.336*\"ico\" + 0.334*\"tek\" + 0.328*\"bronx\" + 0.328*\"beauchaine\" + 0.328*\"queens\" + 0.326*\"manhattan\" + 0.305*\"com\" + 0.305*\"bob\" + 0.072*\"god\"')]"
      ]
     },
     "execution_count": 83,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lsi_model.print_topics(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 84,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
     ]
    }
   ],
   "source": [
    "# check the lsi vector for the first document\n",
    "print(corpus_tfidf[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}