![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias

# Vector Representation

# Table of Contents
* [Objectives](#Objectives)
* [Tools](#Tools)
* [Vector representation: Count vector](#Vector-representation:-Count-vector)
* [Binary vectors](#Binary-vectors)
* [Bigram vectors](#Bigram-vectors)
* [Tf-idf vector representation](#Tf-idf-vector-representation)

# Objectives

In this notebook we are going to transform text into feature vectors, using several representations as presented in class.

We are going to use the examples from the slides.

In [2]:
doc1 = 'Summer is coming but Summer is short'
doc2 = 'I like the Summer and I like the Winter'
doc3 = 'I like sandwiches and I like the Winter'
documents = [doc1, doc2, doc3]

# Tools

The different tools we have presented so far (NLTK, Scikit-Learn, TextBlob and CLiPS) provide overlapping functionalities for obtaining vector representations and applying machine learning algorithms.

We are going to focus on scikit-learn, so that we can also easily use Pandas, as we saw in the previous topic.

Scikit-learn provides specific facilities for processing texts, as described in the [manual](http://scikit-learn.org/stable/modules/feature_extraction.html).

# Vector representation: Count vector

Scikit-learn provides two classes for count vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). The latter is more memory-efficient because it hashes tokens instead of storing a vocabulary, but for the same reason it does not let us see which terms the most important features correspond to, so we use the first class. Nevertheless, both classes are compatible, so they can be interchanged in production environments.

The first step for vectorizing with scikit-learn is creating a CountVectorizer object. We then call 'fit_transform' to learn the vocabulary and transform the documents into vectors.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = "word", max_features = 5000) 
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

As we can see, [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) comes with many options. We can set the minimum or maximum document frequency of a term (*min_df*, *max_df*), the maximum number of features (*max_features*), whether we analyze words or characters (*analyzer*), and whether the output is binary or not (*binary*). *CountVectorizer* also lets us supply a function to preprocess the input (*preprocessor*) before tokenizing it (*tokenizer*), and exclude stop words (*stop_words*).

We can use NLTK preprocessing and tokenizer functions to tune *CountVectorizer* using these parameters.

We are going to see what the vectors look like.

In [5]:
vectors = vectorizer.fit_transform(documents)
vectors

<3x10 sparse matrix of type '<class 'numpy.int64'>'
	with 15 stored elements in Compressed Sparse Row format>

We see the vectors are stored as a sparse matrix of 3x10 dimensions (3 documents, 10 terms).
We can print the matrix as well as the feature names.

In [6]:
print(vectors.toarray())
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())

[[0 1 1 2 0 0 1 2 0 0]
 [1 0 0 0 2 0 0 1 2 1]
 [1 0 0 0 2 1 0 0 1 1]]
['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short', 'summer', 'the', 'winter']
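
To make explicit what the count vectors contain, the following is a minimal sketch that rebuilds the same matrix using only the Python standard library. It assumes lowercasing plus the default token pattern shown above; the names `tokenize`, `counts`, `vocabulary` and `matrix` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

documents = [
    "Summer is coming but Summer is short",
    "I like the Summer and I like the Winter",
    "I like sandwiches and I like the Winter",
]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters
    # (the same regular expression as the default token_pattern).
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# One term -> frequency mapping per document
counts = [Counter(tokenize(doc)) for doc in documents]

# The vocabulary is the sorted set of all terms seen in any document
vocabulary = sorted(set().union(*counts))

# Each row holds the frequency of every vocabulary term in one document
matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each column corresponds to one term of the alphabetically sorted vocabulary, which is why 'summer' (column 8) has the value 2 in the first document.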


As you can see, the pronoun 'I' has been removed because the default token_pattern only matches tokens of two or more characters.
We can change this as follows.

In [7]:
vectorizer = CountVectorizer(analyzer="word", stop_words=None, token_pattern='(?u)\\b\\w+\\b') 
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['and',
 'but',
 'coming',
 'i',
 'is',
 'like',
 'sandwiches',
 'short',
 'summer',
 'the',
 'winter']
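
The difference between the two token patterns can be checked directly with Python's `re` module. This is only a sketch of the tokenization step on one of our documents (lowercased by hand here), not of the whole vectorizer.

```python
import re

text = "I like the Summer and I like the Winter".lower()

default_pattern = r"(?u)\b\w\w+\b"  # two or more word characters
relaxed_pattern = r"(?u)\b\w+\b"    # one or more word characters

default_tokens = re.findall(default_pattern, text)
relaxed_tokens = re.findall(relaxed_pattern, text)

print(default_tokens)  # 'i' is missing
print(relaxed_tokens)  # 'i' is kept
```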

We can now filter the stop words (it will remove 'and', 'but', 'I', 'is' and 'the').

In [8]:
vectorizer = CountVectorizer(analyzer="word", stop_words='english', token_pattern='(?u)\\b\\w+\\b') 
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [9]:
#stop words in scikit-learn for English
print(vectorizer.get_stop_words())

frozenset({'could', 'sixty', 'onto', 'by', 'against', 'up', 'a', 'everything', 'other', 'otherwise', 'ourselves', 'beside', 'nowhere', 'then', 'below', 'put', 'ten', 'such', 'cannot', 'either', 'due', 'hasnt', 'whereupon', 'were', 'once', 'at', 'for', 'front', 'get', 'whereas', 'that', 'eight', 'another', 'except', 'of', 'wherever', 'over', 'to', 'whom', 'you', 'former', 'behind', 'yours', 'yourself', 'what', 'even', 'however', 'go', 'less', 'bottom', 'may', 'along', 'is', 'can', 'move', 'eg', 'somewhere', 'latterly', 'seemed', 'thence', 'becoming', 'himself', 'whether', 'six', 'first', 'off', 'do', 'many', 'namely', 'never', 'because', 'mostly', 'nevertheless', 'thereupon', 'here', 'least', 'anyone', 'one', 'others', 'cry', 'they', 'thereby', 'ie', 'am', 'this', 'would', 'any', 'while', 'see', 'too', 'your', 'somehow', 'within', 'same', 'sometimes', 'thereafter', 'must', 'take', 're', 'both', 'fill', 'nor', 'sometime', 'he', 'third', 'more', 'also', 'most', 'during', 'much', 'our', 't

In [10]:
# Vectors
f_array = vectors.toarray()
f_array

array([[1, 0, 0, 1, 2, 0],
       [0, 2, 0, 0, 1, 1],
       [0, 2, 1, 0, 0, 1]], dtype=int64)

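We can now compute the cosine **distance** between vectors (1 minus the cosine similarity, so 0 means the same direction and 1 means no terms in common).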

In [11]:
from scipy.spatial.distance import cosine
d12 = cosine(f_array[0], f_array[1])
d13 = cosine(f_array[0], f_array[2])
d23 = cosine(f_array[1], f_array[2])
print(d12, d13, d23)

0.666666666667 1.0 0.166666666667
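
As a cross-check, the cosine distance can be written out by hand. The sketch below uses only the standard library and the count vectors printed above; `cosine_distance` is an illustrative helper, not the SciPy function.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Count vectors over ['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']
f_array = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]

d12 = cosine_distance(f_array[0], f_array[1])
d13 = cosine_distance(f_array[0], f_array[2])
d23 = cosine_distance(f_array[1], f_array[2])
print(d12, d13, d23)
```

Documents 1 and 3 share no terms after stop-word removal, so their dot product is 0 and the distance is 1. Documents 2 and 3 share 'like' and 'winter', so they are the closest pair.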


# Binary vectors

We can also get **binary vectors** as follows.

In [12]:
vectorizer = CountVectorizer(analyzer="word", stop_words='english', binary=True) 
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [13]:
vectors.toarray()

array([[1, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 1, 1],
       [0, 1, 1, 0, 0, 1]], dtype=int64)
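
A binary vector only records whether a term is present, so it can be derived from the count matrix by clipping every non-zero count to 1. A minimal sketch with plain lists:

```python
# Count vectors over ['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']
count_matrix = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]

# Presence/absence: any positive count becomes 1
binary_matrix = [[1 if count > 0 else 0 for count in row] for row in count_matrix]

for row in binary_matrix:
    print(row)
```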

# Bigram vectors

It is also easy to get bigram vectors.

In [14]:
vectorizer = CountVectorizer(analyzer="word", stop_words='english', ngram_range=(2, 2))
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming summer',
 'like sandwiches',
 'like summer',
 'like winter',
 'sandwiches like',
 'summer coming',
 'summer like',
 'summer short']

In [15]:
vectors.toarray()

array([[1, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 1, 1, 0, 0, 1, 0],
       [0, 1, 0, 1, 1, 0, 0, 0]], dtype=int64)
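
Note that bigrams such as 'coming summer' never occur literally in the text: the stop words are removed first, and the bigrams are built from the remaining tokens. The sketch below reproduces this with the standard library. The `STOP_WORDS` set is an assumption that contains only the stop words occurring in our three documents (scikit-learn's English list is much longer), and the helper names are illustrative.

```python
import re

documents = [
    "Summer is coming but Summer is short",
    "I like the Summer and I like the Winter",
    "I like sandwiches and I like the Winter",
]

# Assumption: just the stop words that appear in these documents
STOP_WORDS = {"and", "but", "i", "is", "the"}

def content_tokens(text):
    # Tokenize with the default pattern, then drop stop words
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def bigrams(tokens):
    # Pair each token with its successor
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

doc_bigrams = [bigrams(content_tokens(doc)) for doc in documents]
vocabulary = sorted({bigram for doc in doc_bigrams for bigram in doc})
matrix = [[doc.count(term) for term in vocabulary] for doc in doc_bigrams]

print(vocabulary)
for row in matrix:
    print(row)
```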

# Tf-idf vector representation

Finally, we can also get a tf-idf vector representation using the class [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) instead of CountVectorizer.

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer="word", stop_words='english')
vectors = vectorizer.fit_transform(documents)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [17]:
vectors.toarray()

array([[ 0.48148213,  0.        ,  0.        ,  0.48148213,  0.73235914,
         0.        ],
       [ 0.        ,  0.81649658,  0.        ,  0.        ,  0.40824829,
         0.40824829],
       [ 0.        ,  0.77100584,  0.50689001,  0.        ,  0.        ,
         0.38550292]])
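
To see where these numbers come from, the sketch below recomputes them by hand. It assumes the default settings of TfidfVectorizer: a smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The variable names are illustrative.

```python
import math

vocabulary = ["coming", "like", "sandwiches", "short", "summer", "winter"]

# Term counts per document (the count vectors obtained earlier)
count_matrix = [
    [1, 0, 0, 1, 2, 0],
    [0, 2, 0, 0, 1, 1],
    [0, 2, 1, 0, 0, 1],
]
n_docs = len(count_matrix)

# Document frequency: number of documents that contain each term
df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(len(vocabulary))]

# Smoothed inverse document frequency (assumed default formula)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(vector):
    # Scale the vector to unit Euclidean length
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

# tf-idf = term count * idf, then each row is normalized
tfidf = [l2_normalize([tf * w for tf, w in zip(row, idf)]) for row in count_matrix]

for row in tfidf:
    print([round(x, 8) for x in row])
```

Terms that appear in only one document ('coming', 'sandwiches', 'short') get a higher idf than terms that appear in two ('like', 'summer', 'winter').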

We can now compute the similarity of a query and a set of documents as follows.

In [30]:
train = [doc1, doc2, doc3]
vectorizer = TfidfVectorizer(analyzer="word", stop_words='english')

# We learn the vocabulary (fit) and transform the docs into vectors
vectors = vectorizer.fit_transform(train)
vectorizer.get_feature_names()

['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']

In [31]:
vectors.toarray()

array([[ 0.48148213,  0.        ,  0.        ,  0.48148213,  0.73235914,
         0.        ],
       [ 0.        ,  0.81649658,  0.        ,  0.        ,  0.40824829,
         0.40824829],
       [ 0.        ,  0.77100584,  0.50689001,  0.        ,  0.        ,
         0.38550292]])

Scikit-learn provides a function to calculate the cosine similarity between one vector and a set of vectors. Based on this, we can rank the documents by their similarity to the query. In this case, the ranking for the query is [d1, d2, d3].

In [33]:
from sklearn.metrics.pairwise import cosine_similarity

query = ['winter short']

# We transform the query into a vector of the learnt vocabulary
vector_query = vectorizer.transform(query)

# Here we calculate the similarity of the query to the docs
cosine_similarity(vector_query, vectors)

array([[ 0.38324078,  0.24713249,  0.23336362]])

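The same result can be obtained with pairwise metrics (kernels in ML terminology) if we use the linear kernel. The linear kernel is a plain dot product, and it matches the cosine similarity here because the tf-idf vectors are already normalized to unit length.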

In [29]:
from sklearn.metrics.pairwise import linear_kernel
# Use a new name so that we do not shadow the cosine_similarity function imported above
similarities = linear_kernel(vector_query, vectors).flatten()
similarities

array([ 0.38324078,  0.24713249,  0.23336362])
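
The following sketch checks, with the standard library only, why both computations agree: for unit-length vectors the dot product and the cosine similarity coincide. The document vectors are copied from the tf-idf output above; the query weights assume the smoothed idf formula ln((1 + n) / (1 + df)) + 1, and the helpers `dot` and `norm` are illustrative.

```python
import math

# tf-idf rows copied from the output above (already L2-normalized)
doc_vectors = [
    [0.48148213, 0.0, 0.0, 0.48148213, 0.73235914, 0.0],
    [0.0, 0.81649658, 0.0, 0.0, 0.40824829, 0.40824829],
    [0.0, 0.77100584, 0.50689001, 0.0, 0.0, 0.38550292],
]

# Query 'winter short': 'short' occurs in 1 of 3 documents, 'winter' in 2 of 3
raw_query = [0.0, 0.0, 0.0, math.log(4 / 2) + 1, 0.0, math.log(4 / 3) + 1]
query_norm = math.sqrt(sum(x * x for x in raw_query))
query = [x / query_norm for x in raw_query]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Linear kernel: plain dot product
linear = [dot(query, d) for d in doc_vectors]

# Cosine similarity: dot product divided by the product of the norms
cosine = [dot(query, d) / (norm(query) * norm(d)) for d in doc_vectors]

print(linear)
print(cosine)
```

With unnormalized vectors (for example raw counts) the two values would differ, so the shortcut only applies when every vector has unit length.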

## References



* [Scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#converting-text-to-vectors) Scikit-learn Convert Text to Vectors

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid.