mirror of
https://github.com/gsi-upm/sitc
synced 2025-01-08 04:01:27 +00:00
661 lines
20 KiB
Plaintext
661 lines
20 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Semantic Models"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Objectives](#Objectives)\n",
|
|
"* [Corpus](#Corpus)\n",
|
|
"* [Converting Scikit-learn to gensim](#Converting-Scikit-learn-to-gensim)\n",
|
|
"* [Latent Dirichlet Allocation (LDA)](#Latent-Dirichlet-Allocation-%28LDA%29)\n",
|
|
"* [Latent Semantic Indexing (LSI)](#Latent-Semantic-Indexing-%28LSI%29)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Objectives"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this session we provide a quick overview of the semantic models presented during the classes. In this case, we will use a real corpus so that we can extract meaningful patterns.\n",
|
|
"\n",
|
|
"The main objectives of this session are:\n",
|
|
"* Understand the models and their differences\n",
|
|
"* Learn to use some of the most popular NLP libraries"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Corpus"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n",
|
|
"\n",
|
|
"We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(2034, 2807)"
|
|
]
|
|
},
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"from sklearn.datasets import fetch_20newsgroups\n",
|
|
"\n",
|
|
"# We filter only some categories, otherwise we have 20 categories\n",
|
|
"categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']\n",
|
|
"# We remove metadata to avoid bias in the classification\n",
|
|
"newsgroups_train = fetch_20newsgroups(subset='train', \n",
|
|
" remove=('headers', 'footers', 'quotes'), \n",
|
|
" categories=categories)\n",
|
|
"newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'),\n",
|
|
" categories=categories)\n",
|
|
"\n",
|
|
"\n",
|
|
"# Obtain a vector\n",
|
|
"\n",
|
|
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
|
"\n",
|
|
"vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', min_df=10)\n",
|
|
"\n",
|
|
"vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n",
|
|
"vectors_train.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Converting Scikit-learn to gensim"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*. Anyway, if you are using intensively LDA,it can be convenient to create the corpus with their functions.\n",
|
|
"\n",
|
|
"You should install first:\n",
|
|
"\n",
|
|
"* *gensim*. Run 'conda install gensim' in a terminal.\n",
|
|
"* *python-Levenshtein*. Run 'conda install python-Levenshtein' in a terminal"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: gensim in /home/cif/anaconda3/lib/python3.10/site-packages (4.3.1)\n",
|
|
"Requirement already satisfied: scipy>=1.7.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (1.10.1)\n",
|
|
"Requirement already satisfied: smart-open>=1.8.1 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (6.3.0)\n",
|
|
"Requirement already satisfied: numpy>=1.18.5 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (1.24.2)\n",
|
|
"Note: you may need to restart the kernel to use updated packages.\n",
|
|
"Requirement already satisfied: python-Levenshtein in /home/cif/anaconda3/lib/python3.10/site-packages (0.21.0)\n",
|
|
"Requirement already satisfied: Levenshtein==0.21.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from python-Levenshtein) (0.21.0)\n",
|
|
"Requirement already satisfied: rapidfuzz<4.0.0,>=2.3.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from Levenshtein==0.21.0->python-Levenshtein) (3.0.0)\n",
|
|
"Note: you may need to restart the kernel to use updated packages.\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"%pip install gensim\n",
|
|
"%pip install python-Levenshtein"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from gensim import matutils\n",
|
|
"\n",
|
|
"vocab = vectorizer.get_feature_names_out()\n",
|
|
"\n",
|
|
"dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names_out())])\n",
|
|
"corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Latent Dirichlet Allocation (LDA)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Although scikit-learn provides an LDA implementation, it is more popular the package *gensim*, which also provides an LSI implementation, as well as other functionalities. Fortunately, scikit-learn sparse matrices can be used in Gensim using the function *matutils.Sparse2Corpus()*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from gensim.models.ldamodel import LdaModel\n",
|
|
"\n",
|
|
"# It takes a long time\n",
|
|
"\n",
|
|
"# train the lda model, choosing number of topics equal to 4\n",
|
|
"lda = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[(0,\n",
|
|
" '0.004*\"central\" + 0.004*\"assumptions\" + 0.004*\"matthew\" + 0.004*\"define\" + 0.004*\"holes\" + 0.003*\"killing\" + 0.003*\"item\" + 0.003*\"curious\" + 0.003*\"going\" + 0.003*\"presentations\"'),\n",
|
|
" (1,\n",
|
|
" '0.002*\"mechanism\" + 0.002*\"led\" + 0.002*\"apple\" + 0.002*\"color\" + 0.002*\"mormons\" + 0.002*\"activity\" + 0.002*\"concepts\" + 0.002*\"frank\" + 0.002*\"platform\" + 0.002*\"fault\"'),\n",
|
|
" (2,\n",
|
|
" '0.005*\"objects\" + 0.005*\"obtained\" + 0.003*\"manhattan\" + 0.003*\"capability\" + 0.003*\"education\" + 0.003*\"men\" + 0.003*\"photo\" + 0.003*\"decent\" + 0.003*\"environmental\" + 0.003*\"pain\"'),\n",
|
|
" (3,\n",
|
|
" '0.004*\"car\" + 0.004*\"contain\" + 0.004*\"groups\" + 0.004*\"center\" + 0.004*\"evil\" + 0.004*\"maintain\" + 0.004*\"comets\" + 0.004*\"88\" + 0.004*\"density\" + 0.003*\"company\"')]"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# check the topics\n",
|
|
"lda.print_topics(4)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Since there are some problems for translating the corpus from Scikit-Learn to LSI, we are now going to create 'natively' the corpus with Gensim."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# import the gensim.corpora module to generate dictionary\n",
|
|
"from gensim import corpora\n",
|
|
"\n",
|
|
"from nltk import word_tokenize\n",
|
|
"from nltk.corpus import stopwords\n",
|
|
"from nltk import RegexpTokenizer\n",
|
|
"\n",
|
|
"import string\n",
|
|
"\n",
|
|
"def preprocess(words):\n",
|
|
" tokenizer = RegexpTokenizer('[A-Z]\\w+')\n",
|
|
" tokens = [w.lower() for w in tokenizer.tokenize(words)]\n",
|
|
" stoplist = stopwords.words('english')\n",
|
|
" tokens_stop = [w for w in tokens if w not in stoplist]\n",
|
|
" punctuation = set(string.punctuation)\n",
|
|
" tokens_clean = [w for w in tokens_stop if w not in punctuation]\n",
|
|
" return tokens_clean\n",
|
|
"\n",
|
|
"#words = preprocess(newsgroups_train.data)\n",
|
|
"#dictionary = corpora.Dictionary(newsgroups_train.data)\n",
|
|
"\n",
|
|
"texts = [preprocess(document) for document in newsgroups_train.data]\n",
|
|
"\n",
|
|
"dictionary = corpora.Dictionary(texts)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Dictionary<10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...>\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# You can save the dictionary\n",
|
|
"dictionary.save('newsgroup.dict.texts')\n",
|
|
"\n",
|
|
"print(dictionary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Generate a list of docs, where each doc is a list of words\n",
|
|
"\n",
|
|
"docs = [preprocess(doc) for doc in newsgroups_train.data]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# import the gensim.corpora module to generate dictionary\n",
|
|
"from gensim import corpora\n",
|
|
"\n",
|
|
"dictionary = corpora.Dictionary(docs)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Dictionary<10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...>\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# We can print the dictionary, it is a mappying of id and tokens\n",
|
|
"\n",
|
|
"print(dictionary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# construct the corpus representing each document as a bag-of-words (bow) vector\n",
|
|
"corpus = [dictionary.doc2bow(doc) for doc in docs]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from gensim.models import TfidfModel\n",
|
|
"\n",
|
|
"# calculate tfidf\n",
|
|
"tfidf_model = TfidfModel(corpus)\n",
|
|
"corpus_tfidf = tfidf_model[corpus]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"#print tf-idf of first document\n",
|
|
"print(corpus_tfidf[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from gensim.models.ldamodel import LdaModel\n",
|
|
"\n",
|
|
"# train the lda model, choosing number of topics equal to 4, it takes a long time\n",
|
|
"\n",
|
|
"lda_model = LdaModel(corpus_tfidf, num_topics=4, passes=20, id2word=dictionary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[(0,\n",
|
|
" '0.011*\"mary\" + 0.007*\"ns\" + 0.006*\"joseph\" + 0.006*\"lucky\" + 0.006*\"ssrt\" + 0.005*\"god\" + 0.005*\"unfortunately\" + 0.004*\"rayshade\" + 0.004*\"phil\" + 0.004*\"nasa\"'),\n",
|
|
" (1,\n",
|
|
" '0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"'),\n",
|
|
" (2,\n",
|
|
" '0.008*\"moon\" + 0.008*\"really\" + 0.008*\"western\" + 0.007*\"plane\" + 0.006*\"samaritan\" + 0.006*\"crusades\" + 0.006*\"baltimore\" + 0.005*\"bob\" + 0.005*\"septuagint\" + 0.005*\"virtual\"'),\n",
|
|
" (3,\n",
|
|
" '0.009*\"koresh\" + 0.008*\"bible\" + 0.008*\"jeff\" + 0.007*\"basically\" + 0.006*\"gerald\" + 0.006*\"bull\" + 0.005*\"pd\" + 0.004*\"also\" + 0.003*\"dam\" + 0.003*\"feiner\"')]"
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# check the topics\n",
|
|
"lda_model.print_topics(4)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[(0, 0.09161347), (1, 0.1133858), (2, 0.103424065), (3, 0.69157666)]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# check the lsa vector for the first document\n",
|
|
"corpus_lda = lda_model[corpus_tfidf]\n",
|
|
"print(corpus_lda[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[('lord', 1), ('god', 2)]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"#predict topics of a new doc\n",
|
|
"new_doc = \"God is love and God is the Lord\"\n",
|
|
"#transform into BOW space\n",
|
|
"bow_vector = dictionary.doc2bow(preprocess(new_doc))\n",
|
|
"print([(dictionary[id], count) for id, count in bow_vector])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[(0, 0.066217005), (1, 0.8084562), (2, 0.062542014), (3, 0.0627848)]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"#transform into LDA space\n",
|
|
"lda_vector = lda_model[bow_vector]\n",
|
|
"print(lda_vector)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# print the document's single most prominent LDA topic\n",
|
|
"print(lda_model.print_topic(max(lda_vector, key=lambda item: item[1])[0]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[(0, 0.11006463), (1, 0.6813435), (2, 0.10399808), (3, 0.10459379)]\n",
|
|
"0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"lda_vector_tfidf = lda_model[tfidf_model[bow_vector]]\n",
|
|
"print(lda_vector_tfidf)\n",
|
|
"# print the document's single most prominent LDA topic\n",
|
|
"print(lda_model.print_topic(max(lda_vector_tfidf, key=lambda item: item[1])[0]))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Latent Semantic Indexing (LSI)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from gensim.models.lsimodel import LsiModel\n",
|
|
"\n",
|
|
"#It takes a long time\n",
|
|
"\n",
|
|
"# train the lsi model, choosing number of topics equal to 20\n",
|
|
"\n",
|
|
"\n",
|
|
"lsi_model = LsiModel(corpus_tfidf, num_topics=4, id2word=dictionary)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"[(0,\n",
|
|
" '-0.769*\"god\" + -0.345*\"jesus\" + -0.235*\"bible\" + -0.203*\"christian\" + -0.149*\"christians\" + -0.107*\"christ\" + -0.089*\"well\" + -0.085*\"koresh\" + -0.082*\"kent\" + -0.081*\"christianity\"'),\n",
|
|
" (1,\n",
|
|
" '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.152*\"hi\" + 0.123*\"god\" + -0.112*\"sorry\" + -0.088*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"gif\"'),\n",
|
|
" (2,\n",
|
|
" '0.779*\"well\" + -0.229*\"god\" + 0.165*\"yes\" + -0.154*\"thanks\" + 0.135*\"ico\" + 0.134*\"tek\" + 0.131*\"queens\" + 0.131*\"bronx\" + 0.131*\"beauchaine\" + 0.131*\"manhattan\"'),\n",
|
|
" (3,\n",
|
|
" '-0.342*\"well\" + 0.335*\"ico\" + 0.333*\"tek\" + 0.327*\"bronx\" + 0.327*\"queens\" + 0.327*\"beauchaine\" + 0.325*\"manhattan\" + 0.305*\"bob\" + 0.304*\"com\" + 0.073*\"god\"')]"
|
|
]
|
|
},
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# check the topics\n",
|
|
"lsi_model.print_topics(4)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[(0, 0.24093628445650234), (1, 0.5700978153855775), (2, 0.10438175896914427), (3, 0.1598114653031772), (4, 0.722808853369507), (5, 0.24093628445650234)]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# check the lsi vector for the first document\n",
|
|
"print(corpus_tfidf[0])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
|
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"datacleaner": {
|
|
"position": {
|
|
"top": "50px"
|
|
},
|
|
"python": {
|
|
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
|
},
|
|
"window_display": false
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.10"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|