{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Semantic Models"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [Corpus](#Corpus)\n",
"* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
"* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
"* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Objectives"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
"\n",
"The main objectives of this session are:\n",
"* Understand the models and their differences\n",
"* Learn to use some of the most popular NLP libraries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Corpus"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to use one of the corpora available through Scikit-learn: the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroups dataset contains around 20,000 documents that belong to 20 topics.\n",
"\n",
"We now inspect the corpus using the facilities provided by Scikit-learn, as explained in the [scikit-learn documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"\n",
"# We filter only some categories, otherwise we have 20 categories\n",
"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
"# We remove metadata to avoid bias in the classification\n",
"newsgroups_train = fetch_20newsgroups(subset='train', \n",
" remove=('headers', 'footers', 'quotes'), \n",
" categories=categories)\n",
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
" categories=categories)\n",
"\n",
"\n",
"# Obtain a vector\n",
"\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"\n",
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
"\n",
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
"vectors_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Converting Scikit-learn to gensim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although scikit-learn provides an LDA implementation, the *gensim* package is more widely used for topic modelling; it also provides an LSI implementation, among other features. Fortunately, scikit-learn sparse matrices can be used in gensim through the function *matutils.Sparse2Corpus()*. That said, if you use LDA intensively, it can be more convenient to create the corpus with gensim's own functions.\n",
"\n",
"You should install *gensim* first. Run 'conda install -c anaconda gensim=0.12.4' in a terminal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
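"from gensim import matutils\n",
"\n",
"# Vocabulary learnt by the vectorizer (in scikit-learn >= 1.0 the method is get_feature_names_out)\n",
"vocab = vectorizer.get_feature_names()\n",
"\n",
"# Mapping from term id to term, used later as id2word\n",
"dictionary = dict(enumerate(vocab))\n",
"# scikit-learn stores one document per row, so the documents are not in the columns\n",
"corpus_tfidf = matutils.Sparse2Corpus(vectors_train, documents_columns=False)"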
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Latent Dirichlet Allocation (LDA)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now train an LDA model with gensim on the corpus converted from scikit-learn. We set the number of topics to 4, one for each of the newsgroup categories selected above."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.ldamodel import LdaModel\n",
"\n",
"# It takes a long time\n",
"\n",
"# train the lda model, choosing number of topics equal to 4\n",
"lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check the topics\n",
"lda.print_topics(4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since converting the corpus from Scikit-learn to gensim can be problematic, we are now going to create the corpus 'natively' with gensim."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import the gensim.corpora module to generate the dictionary\n",
"from gensim import corpora\n",
"\n",
"# the stop word list requires nltk.download('stopwords') to have been run once\n",
"from nltk.corpus import stopwords\n",
"from nltk import RegexpTokenizer\n",
"\n",
"import string\n",
"\n",
"# Build these once instead of on every call\n",
"# Tokens are runs of two or more word characters that start with a letter\n",
"tokenizer = RegexpTokenizer(r'[A-Za-z]\\w+')\n",
"stoplist = set(stopwords.words('english'))\n",
"punctuation = set(string.punctuation)\n",
"\n",
"def preprocess(words):\n",
"    # Tokenize, lowercase, and drop stop words and punctuation\n",
"    tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
"    tokens_stop = [w for w in tokens if w not in stoplist]\n",
"    tokens_clean = [w for w in tokens_stop if w not in punctuation]\n",
"    return tokens_clean\n",
"\n",
"texts = [preprocess(document) for document in newsgroups_train.data]\n",
"\n",
"dictionary = corpora.Dictionary(texts)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# You can save the dictionary\n",
"dictionary.save('newsgroup.dict')\n",
"\n",
"print(dictionary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Generate a list of docs, where each doc is a list of words\n",
"\n",
"docs = [preprocess(doc) for doc in newsgroups_train.data]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import the gensim.corpora module to generate dictionary\n",
"from gensim import corpora\n",
"\n",
"dictionary = corpora.Dictionary(docs)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# You can optionally save the dictionary and load it back later\n",
"\n",
"dictionary.save('newsgroups.dict')\n",
"dictionary = corpora.Dictionary.load('newsgroups.dict')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# We can print the dictionary; it is a mapping between ids and tokens\n",
"\n",
"print(dictionary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# construct the corpus representing each document as a bag-of-words (bow) vector\n",
"corpus = [dictionary.doc2bow(doc) for doc in docs]"
]
},
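{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the bag-of-words representation concrete, the next cell is a minimal pure-Python sketch of the kind of `(token_id, count)` pairs that a conversion such as `doc2bow` produces. The toy documents and the names `toy_docs`, `token2id` and `to_bow` are hypothetical, and gensim may assign ids in a different order."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Toy tokenized documents (hypothetical, not taken from the newsgroups corpus)\n",
"toy_docs = [['space', 'nasa', 'space'], ['god', 'nasa']]\n",
"\n",
"# Assign an integer id to each distinct token, in order of first appearance\n",
"token2id = {}\n",
"for doc in toy_docs:\n",
"    for token in doc:\n",
"        token2id.setdefault(token, len(token2id))\n",
"\n",
"def to_bow(doc):\n",
"    # Sorted list of (token_id, count) pairs; unknown tokens are skipped\n",
"    counts = Counter(token2id[t] for t in doc if t in token2id)\n",
"    return sorted(counts.items())\n",
"\n",
"toy_corpus = [to_bow(doc) for doc in toy_docs]\n",
"print(toy_corpus)"
]
},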
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models import TfidfModel\n",
"\n",
"# calculate tfidf\n",
"tfidf_model = TfidfModel(corpus)\n",
"corpus_tfidf = tfidf_model[corpus]"
]
},
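{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tf-idf weighting can be sketched in a few lines of pure Python. This is only an illustration on hypothetical toy data, assuming the common weighting of term count times log2(N / document frequency) followed by scaling each document to unit length; the details of gensim's `TfidfModel` may differ. Note that a token appearing in every document gets weight zero and is dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import math\n",
"\n",
"# Toy bag-of-words corpus: lists of (token_id, count) pairs (hypothetical data)\n",
"toy_bow = [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]\n",
"\n",
"n_docs = len(toy_bow)\n",
"# Document frequency: number of documents that contain each token id\n",
"doc_freq = {}\n",
"for doc in toy_bow:\n",
"    for token_id, _ in doc:\n",
"        doc_freq[token_id] = doc_freq.get(token_id, 0) + 1\n",
"\n",
"def to_tfidf(doc):\n",
"    # Weight = count * log2(N / document frequency), then scale to unit length\n",
"    weights = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in doc]\n",
"    norm = math.sqrt(sum(w * w for _, w in weights))\n",
"    if norm == 0.0:\n",
"        return []\n",
"    return [(tid, w / norm) for tid, w in weights if w > 0.0]\n",
"\n",
"toy_tfidf = [to_tfidf(doc) for doc in toy_bow]\n",
"print(toy_tfidf)"
]
},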
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#print tf-idf of first document\n",
"print(corpus_tfidf[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.ldamodel import LdaModel\n",
"\n",
"# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
"\n",
"lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check the topics\n",
"lda_model.print_topics(4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check the lda vector for the first document\n",
"corpus_lda = lda_model[corpus_tfidf]\n",
"print(corpus_lda[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#predict topics of a new doc\n",
"new_doc = \"God is love and God is the Lord\"\n",
"#transform into BOW space\n",
"bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
"print([(dictionary[token_id], count) for token_id, count in bow_vector])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#transform into LDA space\n",
"lda_vector = lda_model[bow_vector]\n",
"print(lda_vector)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# print the document's single most prominent LDA topic\n",
"print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
"print(lda_vector_tfidf)\n",
"# print the document's single most prominent LDA topic\n",
"print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Latent Semantic Indexing (LSI)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from gensim.models.lsimodel import LsiModel\n",
"\n",
"# It takes a long time\n",
"\n",
"# train the lsi model, choosing number of topics equal to 4\n",
"\n",
"lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check the topics\n",
"lsi_model.print_topics(4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# check the lsi vector for the first document\n",
"corpus_lsi = lsi_model[corpus_tfidf]\n",
"print(corpus_lsi[0])"
]
},
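{
"cell_type": "markdown",
"metadata": {},
"source": [
"LSI is based on a truncated singular value decomposition (SVD) of the term-document matrix. The next cell is a minimal NumPy sketch of that idea on a hypothetical toy matrix; it is not how gensim implements `LsiModel`, and the scaling of the document vectors is an assumption made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Toy term-document matrix (rows: terms, columns: documents); hypothetical values\n",
"X = np.array([[1.0, 1.0, 0.0],\n",
"              [1.0, 0.0, 0.0],\n",
"              [0.0, 1.0, 1.0],\n",
"              [0.0, 0.0, 1.0]])\n",
"\n",
"k = 2  # number of latent topics to keep\n",
"U, s, Vt = np.linalg.svd(X, full_matrices=False)\n",
"\n",
"# Keep the k largest singular values and their singular vectors\n",
"U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]\n",
"\n",
"# Each column holds one document's coordinates in the k-dimensional topic space\n",
"docs_topics = np.diag(s_k) @ Vt_k\n",
"# Rank-k approximation of the original matrix\n",
"X_k = U_k @ docs_topics\n",
"print(docs_topics.shape, X_k.shape)"
]
},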
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}