sitc/nlp/4_5_Semantic_Models.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Semantic Models"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [Corpus](#Corpus)\n",
    "* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
    "* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
    "* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
    "\n",
    "The main objectives of this session are:\n",
    "* Understand the models and their differences\n",
    "* Learn to use some of the most popular NLP libraries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Corpus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20  newsgroup dataset contains 20k documents that belong to 20 topics.\n",
    "\n",
    "We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(2034, 2807)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.datasets import fetch_20newsgroups\n",
    "\n",
    "# We filter only some categories, otherwise we have 20 categories\n",
    "categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
    "# We remove metadata to avoid bias in the classification\n",
    "newsgroups_train = fetch_20newsgroups(subset='train', \n",
    "                                      remove=('headers', 'footers', 'quotes'), \n",
    "                                      categories=categories)\n",
    "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
    "                                    categories=categories)\n",
    "\n",
    "\n",
    "# Obtain a vector\n",
    "\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
    "\n",
    "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
    "vectors_train.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Converting Scikit-learn to gensim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*. Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.\n",
    "\n",
    "You should install first *gensim*. Run 'conda install -c anaconda gensim=0.12.4' in a terminal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim import matutils\n",
    "\n",
    "vocab = vectorizer.get_feature_names()\n",
    "\n",
    "dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])\n",
    "corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Latent Dirichlet Allocation (LDA)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.models.ldamodel import LdaModel\n",
    "\n",
    "# It takes a long time\n",
    "\n",
    "#  train the lda model, choosing number of topics equal to 4\n",
    "lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.004*objects + 0.004*obtained + 0.003*comets + 0.003*manhattan + 0.003*member + 0.003*beginning + 0.003*center + 0.003*groups + 0.003*aware + 0.003*increased'),\n",
       " (1,\n",
       "  '0.003*activity + 0.002*objects + 0.002*professional + 0.002*eyes + 0.002*manhattan + 0.002*pressure + 0.002*netters + 0.002*chosen + 0.002*attempted + 0.002*medical'),\n",
       " (2,\n",
       "  '0.003*mechanism + 0.003*led + 0.003*platform + 0.003*frank + 0.003*mormons + 0.003*aeronautics + 0.002*concepts + 0.002*header + 0.002*forces + 0.002*profit'),\n",
       " (3,\n",
       "  '0.005*diameter + 0.005*having + 0.004*complex + 0.004*conclusions + 0.004*activity + 0.004*looking + 0.004*action + 0.004*inflatable + 0.004*defined + 0.004*association')]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lda.print_topics(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# import the gensim.corpora module to generate dictionary\n",
    "from gensim import corpora\n",
    "\n",
    "from nltk import word_tokenize\n",
    "from nltk.corpus import stopwords\n",
    "from nltk import RegexpTokenizer\n",
    "\n",
    "import string\n",
    "\n",
    "def preprocess(words):\n",
    "    tokenizer = RegexpTokenizer('[A-Z]\\w+')\n",
    "    tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
    "    stoplist = stopwords.words('english')\n",
    "    tokens_stop = [w for w in tokens if w not in stoplist]\n",
    "    punctuation = set(string.punctuation)\n",
    "    tokens_clean = [w for w in tokens_stop if  w not in punctuation]\n",
    "    return tokens_clean\n",
    "\n",
    "#words = preprocess(newsgroups_train.data)\n",
    "#dictionary = corpora.Dictionary(newsgroups_train.data)\n",
    "\n",
    "texts = [preprocess(document) for document in newsgroups_train.data]\n",
    "\n",
    "dictionary = corpora.Dictionary(texts)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
     ]
    }
   ],
   "source": [
    "# You can save the dictionary\n",
    "dictionary.save('newsgroup.dict')\n",
    "\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Generate a list of docs, where each doc is a list of words\n",
    "\n",
    "docs = [preprocess(doc) for doc in newsgroups_train.data]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# import the gensim.corpora module to generate dictionary\n",
    "from gensim import corpora\n",
    "\n",
    "dictionary = corpora.Dictionary(docs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# You can optionally save the  dictionary \n",
    "\n",
    "dictionary.save('newsgroups.dict')\n",
    "lda = LdaModel.load('newsgroups.lda')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"
     ]
    }
   ],
   "source": [
    "# We can print the dictionary, it is a mappying of id and tokens\n",
    "\n",
    "print(dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# construct the corpus representing each document as a bag-of-words (bow) vector\n",
    "corpus = [dictionary.doc2bow(doc) for doc in docs]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.models import TfidfModel\n",
    "\n",
    "# calculate tfidf\n",
    "tfidf_model = TfidfModel(corpus)\n",
    "corpus_tfidf = tfidf_model[corpus]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
     ]
    }
   ],
   "source": [
    "#print tf-idf of first document\n",
    "print(corpus_tfidf[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.models.ldamodel import LdaModel\n",
    "\n",
    "# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
    "\n",
    "lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.010*targa + 0.007*ns + 0.006*thanks + 0.006*davidian + 0.006*ssrt + 0.006*yayayay + 0.005*craig + 0.005*bull + 0.005*gerald + 0.005*sorry'),\n",
       " (1,\n",
       "  '0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades'),\n",
       " (2,\n",
       "  '0.007*koresh + 0.007*moon + 0.007*western + 0.006*plane + 0.006*jeff + 0.006*unix + 0.005*bible + 0.005*also + 0.005*basically + 0.005*bob'),\n",
       " (3,\n",
       "  '0.011*whatever + 0.008*joy + 0.007*happy + 0.006*virtual + 0.006*reality + 0.004*really + 0.003*samuel___ + 0.003*oh + 0.003*virtually + 0.003*toaster')]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lda_model.print_topics(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.085176135689180726), (1, 0.6919655173835938), (2, 0.1377903468164027), (3, 0.0850680001108228)]\n"
     ]
    }
   ],
   "source": [
    "# check the lsa vector for the first document\n",
    "corpus_lda = lda_model[corpus_tfidf]\n",
    "print(corpus_lda[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('lord', 1), ('god', 2)]\n"
     ]
    }
   ],
   "source": [
    "#predict topics of a new doc\n",
    "new_doc = \"God is love and God is the Lord\"\n",
    "#transform into BOW space\n",
    "bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
    "print([(dictionary[id], count) for id, count in bow_vector])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.062509420435514051), (1, 0.81246608790618835), (2, 0.062508281488992554), (3, 0.062516210169305114)]\n"
     ]
    }
   ],
   "source": [
    "#transform into LDA space\n",
    "lda_vector = lda_model[bow_vector]\n",
    "print(lda_vector)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
     ]
    }
   ],
   "source": [
    "# print the document's single most prominent LDA topic\n",
    "print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.10392179866025079), (1, 0.68822094221870811), (2, 0.10391916429993264), (3, 0.10393809482110833)]\n",
      "0.011*god + 0.010*mary + 0.008*baptist + 0.008*islam + 0.006*zoroastrians + 0.006*joseph + 0.006*lucky + 0.006*khomeini + 0.006*samaritan + 0.005*crusades\n"
     ]
    }
   ],
   "source": [
    "lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
    "print(lda_vector_tfidf)\n",
    "# print the document's single most prominent LDA topic\n",
    "print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Latent Semantic Indexing (LSI)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from gensim.models.lsimodel import LsiModel\n",
    "\n",
    "#It takes a long time\n",
    "\n",
    "# train the lsi model, choosing number of topics equal to 20\n",
    "\n",
    "\n",
    "lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[(0,\n",
       "  '0.769*\"god\" + 0.346*\"jesus\" + 0.235*\"bible\" + 0.204*\"christian\" + 0.149*\"christians\" + 0.107*\"christ\" + 0.090*\"well\" + 0.085*\"koresh\" + 0.081*\"kent\" + 0.080*\"christianity\"'),\n",
       " (1,\n",
       "  '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.153*\"hi\" + 0.123*\"god\" + -0.112*\"sorry\" + -0.087*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"vga\"'),\n",
       " (2,\n",
       "  '0.780*\"well\" + -0.229*\"god\" + 0.165*\"yes\" + -0.153*\"thanks\" + 0.133*\"ico\" + 0.133*\"tek\" + 0.130*\"bronx\" + 0.130*\"beauchaine\" + 0.130*\"queens\" + 0.129*\"manhattan\"'),\n",
       " (3,\n",
       "  '0.340*\"well\" + -0.335*\"ico\" + -0.334*\"tek\" + -0.328*\"beauchaine\" + -0.328*\"bronx\" + -0.328*\"queens\" + -0.326*\"manhattan\" + -0.305*\"bob\" + -0.305*\"com\" + -0.072*\"god\"')]"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# check the topics\n",
    "lsi_model.print_topics(4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
     ]
    }
   ],
   "source": [
    "# check the lsi vector for the first document\n",
    "print(corpus_tfidf[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
Added NLP notebooks 2016-05-26 12:33:27 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![](images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Semantic Models"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Table of Contents\n",`
			`"* [Objectives](#Objectives)\n",`
			`"* [Corpus](#Corpus)\n",`
			`"* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",`
			`"* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",`
			`"* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Objectives"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",`
			`"\n",`
			`"The main objectives of this session are:\n",`
			`"* Understand the models and their differences\n",`
			`"* Learn to use some of the most popular NLP libraries"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Corpus"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n",`
			`"\n",`
			`"We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 4,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"(2034, 2807)"`
			`]`
			`},`
			`"execution_count": 4,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"from sklearn.datasets import fetch_20newsgroups\n",`
			`"\n",`
			`"# We filter only some categories, otherwise we have 20 categories\n",`
			`"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",`
			`"# We remove metadata to avoid bias in the classification\n",`
			`"newsgroups_train = fetch_20newsgroups(subset='train', \n",`
			`" remove=('headers', 'footers', 'quotes'), \n",`
			`" categories=categories)\n",`
			`"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",`
			`" categories=categories)\n",`
			`"\n",`
			`"\n",`
			`"# Obtain a vector\n",`
			`"\n",`
			`"from sklearn.feature_extraction.text import TfidfVectorizer\n",`
			`"\n",`
			`"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",`
			`"\n",`
			`"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",`
			`"vectors_train.shape"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Converting Scikit-learn to gensim"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Although scikit-learn provides an LDA implementation, it is more popular the package gensim, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function matutils.Sparse2Corpus(). Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.\n",`
			`"\n",`
			`"You should install first gensim. Run 'conda install -c anaconda gensim=0.12.4' in a terminal."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from gensim import matutils\n",`
			`"\n",`
			`"vocab = vectorizer.get_feature_names()\n",`
			`"\n",`
			`"dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])\n",`
			`"corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Latent Dirichlet Allocation (LDA)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Although scikit-learn provides an LDA implementation, it is more popular the package gensim, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function matutils.Sparse2Corpus()."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from gensim.models.ldamodel import LdaModel\n",`
			`"\n",`
			`"# It takes a long time\n",`
			`"\n",`
			`"# train the lda model, choosing number of topics equal to 4\n",`
			`"lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 7,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[(0,\n",`
			`" '0.004objects + 0.004obtained + 0.003comets + 0.003manhattan + 0.003member + 0.003beginning + 0.003center + 0.003groups + 0.003aware + 0.003increased'),\n",`
			`" (1,\n",`
			`" '0.003activity + 0.002objects + 0.002professional + 0.002eyes + 0.002manhattan + 0.002pressure + 0.002netters + 0.002chosen + 0.002attempted + 0.002medical'),\n",`
			`" (2,\n",`
			`" '0.003mechanism + 0.003led + 0.003platform + 0.003frank + 0.003mormons + 0.003aeronautics + 0.002concepts + 0.002header + 0.002forces + 0.002profit'),\n",`
			`" (3,\n",`
			`" '0.005diameter + 0.005having + 0.004complex + 0.004conclusions + 0.004activity + 0.004looking + 0.004action + 0.004inflatable + 0.004defined + 0.004association')]"`
			`]`
			`},`
			`"execution_count": 7,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# check the topics\n",`
			`"lda.print_topics(4)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 8,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"# import the gensim.corpora module to generate dictionary\n",`
			`"from gensim import corpora\n",`
			`"\n",`
			`"from nltk import word_tokenize\n",`
			`"from nltk.corpus import stopwords\n",`
			`"from nltk import RegexpTokenizer\n",`
			`"\n",`
			`"import string\n",`
			`"\n",`
			`"def preprocess(words):\n",`
			`" tokenizer = RegexpTokenizer('[A-Z]\\w+')\n",`
			`" tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",`
			`" stoplist = stopwords.words('english')\n",`
			`" tokens_stop = [w for w in tokens if w not in stoplist]\n",`
			`" punctuation = set(string.punctuation)\n",`
			`" tokens_clean = [w for w in tokens_stop if w not in punctuation]\n",`
			`" return tokens_clean\n",`
			`"\n",`
			`"#words = preprocess(newsgroups_train.data)\n",`
			`"#dictionary = corpora.Dictionary(newsgroups_train.data)\n",`
			`"\n",`
			`"texts = [preprocess(document) for document in newsgroups_train.data]\n",`
			`"\n",`
			`"dictionary = corpora.Dictionary(texts)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# You can save the dictionary\n",`
			`"dictionary.save('newsgroup.dict')\n",`
			`"\n",`
			`"print(dictionary)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"# Generate a list of docs, where each doc is a list of words\n",`
			`"\n",`
			`"docs = [preprocess(doc) for doc in newsgroups_train.data]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"# import the gensim.corpora module to generate dictionary\n",`
			`"from gensim import corpora\n",`
			`"\n",`
			`"dictionary = corpora.Dictionary(docs)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 12,`
			`"metadata": {`
			`"collapsed": true`
			`},`
			`"outputs": [],`
			`"source": [`
			`"# You can optionally save the dictionary \n",`
			`"\n",`
			`"dictionary.save('newsgroups.dict')\n",`
			`"lda = LdaModel.load('newsgroups.lda')"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 13,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Dictionary(10913 unique tokens: ['whose', 'used', 'hoc', 'transfinite', 'newtek']...)\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# We can print the dictionary, it is a mappying of id and tokens\n",`
			`"\n",`
			`"print(dictionary)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 14,`
			`"metadata": {`
			`"collapsed": true`
			`},`
			`"outputs": [],`
			`"source": [`
			`"# construct the corpus representing each document as a bag-of-words (bow) vector\n",`
			`"corpus = [dictionary.doc2bow(doc) for doc in docs]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 15,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from gensim.models import TfidfModel\n",`
			`"\n",`
			`"# calculate tfidf\n",`
			`"tfidf_model = TfidfModel(corpus)\n",`
			`"corpus_tfidf = tfidf_model[corpus]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 16,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"#print tf-idf of first document\n",`
			`"print(corpus_tfidf[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 17,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from gensim.models.ldamodel import LdaModel\n",`
			`"\n",`
			`"# train the lda model, choosing number of topics equal to 4, it takes a long time\n",`
			`"\n",`
			`"lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 18,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[(0,\n",`
			`" '0.010targa + 0.007ns + 0.006thanks + 0.006davidian + 0.006ssrt + 0.006yayayay + 0.005craig + 0.005bull + 0.005gerald + 0.005sorry'),\n",`
			`" (1,\n",`
			`" '0.011god + 0.010mary + 0.008baptist + 0.008islam + 0.006zoroastrians + 0.006joseph + 0.006lucky + 0.006khomeini + 0.006samaritan + 0.005crusades'),\n",`
			`" (2,\n",`
			`" '0.007koresh + 0.007moon + 0.007western + 0.006plane + 0.006jeff + 0.006unix + 0.005bible + 0.005also + 0.005basically + 0.005bob'),\n",`
			`" (3,\n",`
			`" '0.011whatever + 0.008joy + 0.007happy + 0.006virtual + 0.006reality + 0.004really + 0.003samuel___ + 0.003oh + 0.003virtually + 0.003toaster')]"`
			`]`
			`},`
			`"execution_count": 18,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# check the topics\n",`
			`"lda_model.print_topics(4)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 19,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[(0, 0.085176135689180726), (1, 0.6919655173835938), (2, 0.1377903468164027), (3, 0.0850680001108228)]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# check the lsa vector for the first document\n",`
			`"corpus_lda = lda_model[corpus_tfidf]\n",`
			`"print(corpus_lda[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 20,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[('lord', 1), ('god', 2)]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"#predict topics of a new doc\n",`
			`"new_doc = \"God is love and God is the Lord\"\n",`
			`"#transform into BOW space\n",`
			`"bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",`
			`"print([(dictionary[id], count) for id, count in bow_vector])"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 21,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[(0, 0.062509420435514051), (1, 0.81246608790618835), (2, 0.062508281488992554), (3, 0.062516210169305114)]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"#transform into LDA space\n",`
			`"lda_vector = lda_model[bow_vector]\n",`
			`"print(lda_vector)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 22,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"0.011god + 0.010mary + 0.008baptist + 0.008islam + 0.006zoroastrians + 0.006joseph + 0.006lucky + 0.006khomeini + 0.006samaritan + 0.005crusades\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# print the document's single most prominent LDA topic\n",`
			`"print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 23,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[(0, 0.10392179866025079), (1, 0.68822094221870811), (2, 0.10391916429993264), (3, 0.10393809482110833)]\n",`
			`"0.011god + 0.010mary + 0.008baptist + 0.008islam + 0.006zoroastrians + 0.006joseph + 0.006lucky + 0.006khomeini + 0.006samaritan + 0.005crusades\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",`
			`"print(lda_vector_tfidf)\n",`
			`"# print the document's single most prominent LDA topic\n",`
			`"print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Latent Semantic Indexing (LSI)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 25,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"from gensim.models.lsimodel import LsiModel\n",`
			`"\n",`
			`"#It takes a long time\n",`
			`"\n",`
			`"# train the lsi model, choosing number of topics equal to 20\n",`
			`"\n",`
			`"\n",`
			`"lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 26,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[(0,\n",`
			`" '0.769\"god\" + 0.346\"jesus\" + 0.235\"bible\" + 0.204\"christian\" + 0.149\"christians\" + 0.107\"christ\" + 0.090\"well\" + 0.085\"koresh\" + 0.081\"kent\" + 0.080\"christianity\"'),\n",`
			`" (1,\n",`
			`" '-0.863\"thanks\" + -0.255\"please\" + -0.159\"hello\" + -0.153\"hi\" + 0.123\"god\" + -0.112\"sorry\" + -0.087\"could\" + -0.074\"windows\" + -0.067\"jpeg\" + -0.063\"vga\"'),\n",`
			`" (2,\n",`
			`" '0.780\"well\" + -0.229\"god\" + 0.165\"yes\" + -0.153\"thanks\" + 0.133\"ico\" + 0.133\"tek\" + 0.130\"bronx\" + 0.130\"beauchaine\" + 0.130\"queens\" + 0.129\"manhattan\"'),\n",`
			`" (3,\n",`
			`" '0.340\"well\" + -0.335\"ico\" + -0.334\"tek\" + -0.328\"beauchaine\" + -0.328\"bronx\" + -0.328\"queens\" + -0.326\"manhattan\" + -0.305\"bob\" + -0.305\"com\" + -0.072\"god\"')]"`
			`]`
			`},`
			`"execution_count": 26,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# check the topics\n",`
			`"lsi_model.print_topics(4)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 27,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"[(0, 0.1598114653031772), (1, 0.10438175896914427), (2, 0.5700978153855775), (3, 0.24093628445650234), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"# check the lsi vector for the first document\n",`
			`"print(corpus_tfidf[0])"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# References"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",`
			`"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Licence"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
			`"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.5.1"`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 0`
			`}`