{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Semantic Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [Corpus](#Corpus)\n", "* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n", "* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n", "* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session we provide a quick overview of the semantic models presented during the classes. This time we use a real corpus so that we can extract meaningful patterns.\n", "\n", "The main objectives of this session are:\n", "* Understand the models and their differences\n", "* Learn to use some of the most popular NLP libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use one of the corpora that come prepackaged with scikit-learn: the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). 
The 20 newsgroups dataset contains about 20k documents that belong to 20 topics.\n", "\n", "We now inspect the corpus using the facilities from scikit-learn, as explained in the [scikit-learn documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "# We keep only some categories; otherwise we would load all 20\n", "categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n", "# We remove metadata to avoid bias in the classification\n", "newsgroups_train = fetch_20newsgroups(subset='train',\n", "                                      remove=('headers', 'footers', 'quotes'),\n", "                                      categories=categories)\n", "newsgroups_test = fetch_20newsgroups(subset='test',\n", "                                     remove=('headers', 'footers', 'quotes'),\n", "                                     categories=categories)\n", "\n", "\n", "# Obtain the TF-IDF document-term matrix\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n", "\n", "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n", "vectors_train.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Converting Scikit-learn to gensim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although scikit-learn provides an LDA implementation, the *gensim* package is more popular; it also provides an LSI implementation, as well as other functionality. Fortunately, scikit-learn sparse matrices can be used in gensim through the function *matutils.Sparse2Corpus()*. However, if you use LDA intensively, it can be more convenient to create the corpus with gensim's own functions.\n", "\n", "You should first install *gensim*. Run 'conda install -c anaconda gensim=0.12.4' in a terminal." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim import matutils\n", "\n", "# Vocabulary learned by the vectorizer (newer scikit-learn releases rename this method to get_feature_names_out())\n", "vocab = vectorizer.get_feature_names()\n", "\n", "# Map each feature index to its term, as gensim expects for id2word\n", "dictionary = dict(enumerate(vocab))\n", "# scikit-learn stores documents as rows, so tell gensim that documents are not columns\n", "corpus_tfidf = matutils.Sparse2Corpus(vectors_train, documents_columns=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Latent Dirichlet Allocation (LDA)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We first train an LDA model with gensim on the corpus converted from scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.ldamodel import LdaModel\n", "\n", "# Training takes a long time\n", "\n", "# train the LDA model, choosing number of topics equal to 4\n", "lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the topics\n", "lda.print_topics(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since there are some problems translating the corpus from scikit-learn for LSI, we now create the corpus natively with gensim." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the gensim.corpora module to generate dictionary\n", "from gensim import corpora\n", "\n", "from nltk import word_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk import RegexpTokenizer\n", "\n", "import string\n", "\n", "def preprocess(words):\n", "    # Match an uppercase letter followed by one or more word characters\n", "    tokenizer = RegexpTokenizer(r'[A-Z]\\w+')\n", "    tokens = [w.lower() for w in tokenizer.tokenize(words)]\n", "    # Requires the NLTK stopwords corpus: nltk.download('stopwords')\n", "    stoplist = stopwords.words('english')\n", "    tokens_stop = [w for w in tokens if w not in stoplist]\n", "    punctuation = set(string.punctuation)\n", "    tokens_clean = [w for w in tokens_stop if w not in punctuation]\n", "    return tokens_clean\n", "\n", "#words = preprocess(newsgroups_train.data)\n", "#dictionary = corpora.Dictionary(newsgroups_train.data)\n", "\n", "texts = [preprocess(document) for document in newsgroups_train.data]\n", "\n", "dictionary = corpora.Dictionary(texts)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can save the dictionary\n", "dictionary.save('newsgroup.dict')\n", "\n", "print(dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate a list of docs, where each doc is a list of words\n", "\n", "docs = [preprocess(doc) for doc in newsgroups_train.data]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# import the gensim.corpora module to generate dictionary\n", "from gensim import corpora\n", "\n", "dictionary = corpora.Dictionary(docs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# You can optionally save the dictionary\n", "\n", "dictionary.save('newsgroups.dict')\n", "# A model previously stored with lda.save('newsgroups.lda') could be reloaded with:\n", "# lda = LdaModel.load('newsgroups.lda')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can print the dictionary, it is a mapping 
of id and tokens\n", "\n", "print(dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# construct the corpus representing each document as a bag-of-words (bow) vector\n", "corpus = [dictionary.doc2bow(doc) for doc in docs]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models import TfidfModel\n", "\n", "# calculate tfidf\n", "tfidf_model = TfidfModel(corpus)\n", "corpus_tfidf = tfidf_model[corpus]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print tf-idf of first document\n", "print(corpus_tfidf[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.ldamodel import LdaModel\n", "\n", "# train the LDA model, choosing number of topics equal to 4; it takes a long time\n", "\n", "lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the topics\n", "lda_model.print_topics(4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the LDA vector for the first document\n", "corpus_lda = lda_model[corpus_tfidf]\n", "print(corpus_lda[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# predict topics of a new doc\n", "new_doc = \"God is love and God is the Lord\"\n", "# transform into BOW space\n", "bow_vector = dictionary.doc2bow(preprocess(new_doc))\n", "print([(dictionary[token_id], count) for token_id, count in bow_vector])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# transform into LDA space\n", "lda_vector = lda_model[bow_vector]\n", "print(lda_vector)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# print the 
document's single most prominent LDA topic\n", "print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# transform the new document into TF-IDF space first, then into LDA space\n", "lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n", "print(lda_vector_tfidf)\n", "# print the document's single most prominent LDA topic\n", "print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Latent Semantic Indexing (LSI)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from gensim.models.lsimodel import LsiModel\n", "\n", "# Training can take a long time\n", "\n", "# train the LSI model, choosing number of topics equal to 4\n", "\n", "\n", "lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the topics\n", "lsi_model.print_topics(4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# check the LSI vector for the first document\n", "print(lsi_model[corpus_tfidf[0]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009](http://www.nltk.org/book_1ed/)\n", "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }