"doc1 = 'Summer is coming but Summer is short'\n",
"doc2 = 'I like the Summer and I like the Winter'\n",
"doc3 = 'I like sandwiches and I like the Winter'\n",
"documents = [doc1, doc2, doc3]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Tools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different tools we have presented so far (NLTK, Scikit-Learn, TextBlob and CLiPS) provide overlapping functionalities for obtaining vector representations and apply machine learning algorithms.\n",
"\n",
"We are going to focus on the use of scikit-learn so that we can also use easily Pandas as we saw in the previous topic.\n",
"\n",
"Scikit-learn provides specific facililities for processing texts, as described in the [manual](http://scikit-learn.org/stable/modules/feature_extraction.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vector representation: Count vector"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scikit-learn provides two classes for binary vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). The latter is more efficient but does not allow to understand which features are more important, so we use the first class. Nevertheless, they are compatible, so, they can be interchanged for production environments.\n",
"\n",
"The first step for vectorizing with scikit-learn is creating a CountVectorizer object and then we should call 'fit_transform' to fit the vocabulary."
"As we can see, [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) comes with many options. We can define many configuration options, such as the maximum or minimum frequency of a term (*min_fd*, *max_df*), maximum number of features (*max_features*), if we analyze words or characters (*analyzer*), or if the output is binary or not (*binary*). *CountVectorizer* also allows us to include if we want to preprocess the input (*preprocessor*) before tokenizing it (*tokenizer*) and exclude stop words (*stop_words*).\n",
"\n",
"We can use NLTK preprocessing and tokenizer functions to tune *CountVectorizer* using these parameters.\n",
"Finally, we can also get a tf-idf vector representation using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) instead of CountVectorizer."
"Scikit-learn provides a method to calculate the cosine similarity between one vector and a set of vectors. Based on this, we can rank the similarity. In this case, the ranking for the query is [d1, d2, d3]."
"* [Scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#converting-text-to-vectors) Scikit-learn Convert Text to Vectors"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",