![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

# Lexical Processing

# Table of Contents
* [Objectives](#Objectives)
* [NLP Basics](#NLP-Basics)
    * [Spacy installation](#Spacy-installation)
    * [Spacy pipelines](#Spacy-pipelines)
    * [Tokenization](#Tokenization)
    * [Noun chunks](#Noun-chunks)
    * [Sentence segmentation](#Sentence-segmentation)
    * [Stemming](#Stemming)
    * [Lemmatization](#Lemmatization)
    * [Stopwords](#Stopwords)
    * [POS](#POS)
    * [NER](#NER)
* [Text Feature Extraction](#Text-Feature-Extraction)
* [Classifying spam](#Classifying-spam)
* [Vectors and Similarity](#Vectors-and-Similarity)

# Objectives

In this session, we are going to learn to process text so that we can apply machine learning techniques.

# NLP Basics
In this notebook, we are going to use two popular NLP libraries:
* NLTK (Natural Language Toolkit, https://www.nltk.org/) 
* Spacy (https://spacy.io/)

Main characteristics:
* both are open source and very popular
* NLTK was first released in 2001, and Spacy in 2015
* Spacy provides very efficient implementations

## Spacy installation

Install Spacy if it is not already installed:
* `pip install spacy`
* or `conda install -c conda-forge spacy`

Then download the small English model:
* `python -m spacy download en_core_web_sm`

## Spacy pipelines

Calling **nlp** on raw text runs it through a pipeline of processing steps (tokenization, tagging, parsing, NER, ...) and returns a processed document.
![](spacy/spacy-pipeline.svg "Spacy pipelines")

From text to doc through the pipeline

In [1]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Albert Einstein won the Nobel Prize for Physics in 1921')
doc2 = nlp(u'"Let\'s go to N.Y.!"')

## Tokenization
From text to tokens
![](spacy/tokenization.svg "Tokenization")

The tokenizer checks:

* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.
* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.
* **Suffix:** Character(s) at the end, e.g. km, ”, !.
* **Infix:** Character(s) in between, e.g. -, --, /, ….
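
The rules listed above can be sketched with a toy tokenizer. This is only an illustration under strong assumptions: `toy_tokenize` and its small rule tables (`EXCEPTIONS`, `PREFIXES`, `SUFFIXES`) are hypothetical names invented here, they are far simpler than Spacy's real rules, and infix handling is left out.

```python
# Toy illustration only: NOT Spacy's actual algorithm or rule tables.
EXCEPTIONS = {"Let's": ["Let", "'s"], "N.Y.": ["N.Y."]}  # special cases
PREFIXES = ('"', "(", "$")                               # split off at the start
SUFFIXES = ('"', ")", "!", ".", ",")                     # split off at the end


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():  # start from whitespace-separated chunks
        prefixes, suffixes = [], []
        # Peel punctuation off both ends until an exception matches
        # or nothing more can be removed.
        while chunk and chunk not in EXCEPTIONS:
            if chunk.startswith(PREFIXES):
                prefixes.append(chunk[0])
                chunk = chunk[1:]
            elif chunk.endswith(SUFFIXES):
                suffixes.insert(0, chunk[-1])
                chunk = chunk[:-1]
            else:
                break
        tokens.extend(prefixes)
        if chunk:
            # An exception may expand into several tokens or protect one token.
            tokens.extend(EXCEPTIONS.get(chunk, [chunk]))
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize('"Let\'s go to N.Y.!"'))
```

Checking for exceptions before stripping punctuation is what keeps `N.Y.` in one piece while the trailing `!` and `"` are still split off.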

Let's do it!

In [2]:
# print tokens
for token in doc:
    print(token.text)

Albert
Einstein
won
the
Nobel
Prize
for
Physics
in
1921


In [3]:
for token in doc2:
    print(token.text)

"
Let
's
go
to
N.Y.
!
"


## Noun chunks
Noun phrases

In [4]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward


In [5]:
from spacy import displacy
displacy.render(doc, style='dep', jupyter=True,options={'distance':130})

## Sentence segmentation


In [6]:
doc3 =nlp(u'This is a sentence. This is another sentence.')
for sent in doc3.sents:
    print(sent)

This is a sentence.
This is another sentence.


## Stemming
Spacy does not include a stemmer, so we will use NLTK.
Stemming strips word endings using rules, without checking that the result is a dictionary word (for example, *flies* becomes *fli*).

In [7]:
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ['caresses', 'flies','is', 'been', 'generously']
stems = [stemmer.stem(word) for word in words]
for stem in stems:
    print(stem)

caress
fli
is
been
gener
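
To make the idea of rule-based ending removal concrete, here is a minimal sketch of a single step of the Porter algorithm (step 1a, which handles plural endings). The function name `porter_step_1a` is an assumption for this example. NLTK's `PorterStemmer` applies several further steps and extra special cases, so its output differs for many words.

```python
def porter_step_1a(word):
    """Simplified sketch of Porter step 1a (plural endings only)."""
    if len(word) <= 2:          # very short words are left untouched
        return word
    if word.endswith("sses"):   # caresses -> caress
        return word[:-2]
    if word.endswith("ies"):    # flies -> fli
        return word[:-2]
    if word.endswith("ss"):     # caress -> caress (unchanged)
        return word
    if word.endswith("s"):      # cats -> cat
        return word[:-1]
    return word


for w in ["caresses", "flies", "is", "cats", "caress"]:
    print(w, "->", porter_step_1a(w))
```

The rules are tried in order, longest suffix first, and only the first matching rule is applied.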


## Lemmatization
Unlike stemming, lemmatization uses morphological analysis to map each word to its dictionary form (lemma), e.g. *was* becomes *be*.

In [8]:
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])

['I', 'be', 'read', 'the', 'paper', '.']


In [9]:
for token in doc:
    print(token.text, '\t', token.pos_,  '\t', token.lemma_)

I 	 PRON 	 I
was 	 AUX 	 be
reading 	 VERB 	 read
the 	 DET 	 the
paper 	 NOUN 	 paper
. 	 PUNCT 	 .


## Stopwords

In [10]:
nlp = spacy.load('en_core_web_sm')
print(nlp.Defaults.stop_words)

{'other', 'toward', 'everywhere', 'whether', 'i', 'his', 'afterwards', 'whenever', 'except', 'but', 'for', 'noone', 'whereas', 'she', 'name', 'per', 'among', 'where', 'am', '’re', 'on', 'thereby', 'fifty', "'ll", '‘m', 'would', 'we', 'once', 'can', 'meanwhile', 'anything', 'still', 'these', 'without', 'below', 'rather', 'were', 'six', 'them', 'latter', 'sometime', 'seemed', 'next', 'move', 'there', 'various', 'fifteen', '’m', 'onto', 'n’t', 'much', 'one', 'due', 'our', 'although', 'whom', 'done', 'made', 'at', 'empty', 'about', 'herself', 'down', 'never', 'thereafter', "'re", 'off', 'he', 'hereafter', 'whereafter', 'what', 'above', 'along', 'a', '’ve', '’s', 'else', 'or', 'sometimes', 'twenty', "n't", 'over', 'both', 'someone', 'beside', 'therein', 'whole', 'make', 'any', 'becomes', 'anyhow', 'quite', 'hence', 'here', 'same', 'which', 'whereby', 'whereupon', 'must', 'me', 'part', 'serious', 'into', 'namely', 'hers', 'enough', 'with', 'because', 'own', 'give', 'see', 'somehow', 'since',

In [11]:
nlp.vocab['for'].is_stop

True

In [12]:
nlp.vocab['day'].is_stop

False

In [13]:
nlp.vocab['btw'].is_stop

False

In [14]:
# Add a custom stop word: register it in the defaults and flag its vocabulary entry
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True
nlp.vocab['btw'].is_stop

True

In [15]:
en_stopwords = nlp.Defaults.stop_words
text = "Nick likes to play football, however he is not too fond of tennis."

lst=[]
for token in text.split():
    if token.lower() not in en_stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst))

Original Text
Nick likes to play football, however he is not too fond of tennis. 


Text after removing stop words
Nick likes play football, fond tennis.
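
Note that splitting on whitespace leaves punctuation attached to words (`football,` and `tennis.` above). A possible variation, sketched here with the standard library only, separates punctuation with a regular expression before filtering. The `STOP_WORDS` set is a small hand-picked subset chosen for this example, not Spacy's full list.

```python
import re

# Illustrative subset only; Spacy's default list is much longer.
STOP_WORDS = {"to", "however", "he", "is", "not", "too", "of"}

text = "Nick likes to play football, however he is not too fond of tennis."

# Words become one token each; every punctuation mark becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Keep alphanumeric tokens that are not stop words.
kept = [t for t in tokens if t.isalnum() and t.lower() not in STOP_WORDS]

print(" ".join(kept))
```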


We can also use methods from the token class (https://spacy.io/api/token), such as:

* **is_stop:** is the token a stop word?
* **is_punct:** is the token punctuation?
* **like_email:** does the token resemble an email address?
* **is_digit:** Does the token consist of digits? 

## POS

In [16]:
# POS
# information available at https://spacy.io/usage/linguistic-features/#pos-tagging
for token in doc:
    print(token.text, '\t', token.pos_, '\t', spacy.explain(token.pos_))

I 	 PRON 	 pronoun
was 	 AUX 	 auxiliary
reading 	 VERB 	 verb
the 	 DET 	 determiner
paper 	 NOUN 	 noun
. 	 PUNCT 	 punctuation


In [17]:
for token in doc:
    print(token.text, '\t', token.pos_, '\t', spacy.explain(token.pos_), '\t', token.dep_)

I 	 PRON 	 pronoun 	 nsubj
was 	 AUX 	 auxiliary 	 aux
reading 	 VERB 	 verb 	 ROOT
the 	 DET 	 determiner 	 det
paper 	 NOUN 	 noun 	 dobj
. 	 PUNCT 	 punctuation 	 punct


In [18]:
# List the pipeline
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f74618d3d00>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f74618d3880>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f74619006d0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f74618ffd00>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f746171d900>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f74619007b0>)]

In [19]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

We can also compute frequency statistics for the POS, TAG, or DEP annotations.

In [20]:
doc = nlp(u'Nick likes to play football, however he is not too fond of tennis.')
POS_counts = doc.count_by(spacy.attrs.POS)

for code,freq in sorted(POS_counts.items()):
    print(doc.vocab[code].text, freq)

ADJ 1
ADP 1
ADV 2
AUX 1
NOUN 2
PART 2
PRON 1
PROPN 1
PUNCT 2
VERB 2


In [21]:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for code,freq in sorted(TAG_counts.items()):
    print(doc.vocab[code].text, freq)

RB 3
IN 1
, 1
TO 1
JJ 1
. 1
PRP 1
VBZ 2
VB 1
NN 2
NNP 1


In [22]:
DEP_counts = doc.count_by(spacy.attrs.DEP)

for code,freq in sorted(DEP_counts.items()):
    print(doc.vocab[code].text, freq)

acomp 1
advmod 2
aux 1
ccomp 1
dobj 1
neg 1
nsubj 2
pobj 1
prep 1
punct 2
xcomp 1
ROOT 1


## NER

In [23]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [24]:
displacy.render(doc, style='ent', jupyter=True,options={'distance':130})

In [25]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')
displacy.render(doc, style='ent', jupyter=True,options={'distance':130})

# Text Feature Extraction

## CountVectorizer
Transforming text into a vector

In [26]:
import sklearn
# Count vectorization
texts = ["Summer is coming but Summer is short", 
         "I like the Summer and I like the Winter", 
         "I like sandwiches and I like the Winter"]


from sklearn.feature_extraction.text import CountVectorizer



# Count occurrences of unique words
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html 

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
vectorizer.get_feature_names_out()

array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',
       'summer', 'the', 'winter'], dtype=object)

In [27]:
print(X.toarray())

[[0 1 1 2 0 0 1 2 0 0]
 [1 0 0 0 2 0 0 1 2 1]
 [1 0 0 0 2 1 0 0 1 1]]
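
To see what the vectorizer is doing, the same count matrix can be rebuilt by hand. This sketch assumes the default `CountVectorizer` behaviour of lowercasing the text and keeping only tokens of two or more word characters, which is why `I` does not appear in the vocabulary. The helper name `tokenize` is introduced just for this example.

```python
import re
from collections import Counter

texts = ["Summer is coming but Summer is short",
         "I like the Summer and I like the Winter",
         "I like sandwiches and I like the Winter"]


def tokenize(text):
    # Lowercase, then keep tokens made of at least two word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: all distinct tokens, sorted alphabetically.
vocab = sorted({tok for text in texts for tok in tokenize(text)})

# One row per document, one column per vocabulary word.
matrix = []
for text in texts:
    counts = Counter(tokenize(text))
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```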


In [28]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2)) # bigrams only; use ngram_range=(1, 2) for unigrams and bigrams
X2 = vectorizer2.fit_transform(texts)
vectorizer2.get_feature_names_out()

array(['and like', 'but summer', 'coming but', 'is coming', 'is short',
       'like sandwiches', 'like the', 'sandwiches and', 'summer and',
       'summer is', 'the summer', 'the winter'], dtype=object)

In [29]:
print(X2.toarray())

[[0 1 1 1 1 0 0 0 0 2 0 0]
 [1 0 0 0 0 0 2 0 1 0 1 1]
 [1 0 0 0 0 1 1 1 0 0 0 1]]
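
A bigram is simply a pair of consecutive tokens. The following sketch extracts the bigrams of the first text by hand; the helper `ngrams` is a hypothetical name used only here.

```python
from collections import Counter


def ngrams(tokens, n):
    # Slide a window of n tokens over the list and join each window.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "summer is coming but summer is short".split()
bigrams = ngrams(tokens, 2)
bigram_counts = Counter(bigrams)

print(bigrams)
print(bigram_counts["summer is"])
```

The bigram `summer is` occurs twice, which corresponds to the value 2 in the first row of the matrix above.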


Remove stop words

In [30]:
vectorizer3 = CountVectorizer(analyzer='word', stop_words='english') # drop English stop words
X3 = vectorizer3.fit_transform(texts)
vectorizer3.get_feature_names_out()

array(['coming', 'like', 'sandwiches', 'short', 'summer', 'winter'],
      dtype=object)

In [31]:
print(X3.toarray())

[[1 0 0 1 2 0]
 [0 2 0 0 1 1]
 [0 2 1 0 0 1]]


## TF-IDF

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
XTIDF = vect.fit_transform(texts)
vect.get_feature_names_out()

array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',
       'summer', 'the', 'winter'], dtype=object)

In [33]:
# Counter
print(X.toarray())
# TF-IDF
print(XTIDF.toarray())

[[0 1 1 2 0 0 1 2 0 0]
 [1 0 0 0 2 0 0 1 2 1]
 [1 0 0 0 2 1 0 0 1 1]]
[[0.         0.32767345 0.32767345 0.65534691 0.         0.
  0.32767345 0.49840822 0.         0.        ]
 [0.30151134 0.         0.         0.         0.60302269 0.
  0.         0.30151134 0.60302269 0.30151134]
 [0.33846987 0.         0.         0.         0.67693975 0.44504721
  0.         0.         0.33846987 0.33846987]]
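
The TF-IDF values above can be reproduced by hand. This sketch assumes the default `TfidfVectorizer` settings: raw term counts, the smoothed inverse document frequency `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and L2 normalisation of each row. The helper names are introduced only for this example.

```python
import math
import re
from collections import Counter

texts = ["Summer is coming but Summer is short",
         "I like the Summer and I like the Winter",
         "I like sandwiches and I like the Winter"]


def tokenize(text):
    # Same assumption as the default vectorizer: lowercase, tokens of 2+ characters.
    return re.findall(r"\b\w\w+\b", text.lower())


docs = [Counter(tokenize(t)) for t in texts]   # term counts per document
vocab = sorted(set().union(*docs))             # sorted vocabulary
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = {w: sum(1 for d in docs if w in d) for w in vocab}
# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}


def tfidf_row(doc):
    raw = [doc[w] * idf[w] for w in vocab]         # tf * idf
    norm = math.sqrt(sum(v * v for v in raw))      # L2 norm of the row
    return [v / norm for v in raw]


tfidf = [tfidf_row(d) for d in docs]
for row in tfidf:
    print([round(v, 4) for v in row])
```

Words that appear in fewer documents (such as `sandwiches`) get a higher weight than equally frequent words that appear in several documents (such as `and`).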


In [34]:
vect3 = TfidfVectorizer(analyzer='word', stop_words='english')
XTIDF3 = vect3.fit_transform(texts)
print(vect3.get_feature_names_out())
print(X3.toarray())
print(XTIDF3.toarray())

['coming' 'like' 'sandwiches' 'short' 'summer' 'winter']
[[1 0 0 1 2 0]
 [0 2 0 0 1 1]
 [0 2 1 0 0 1]]
[[0.48148213 0.         0.         0.48148213 0.73235914 0.        ]
 [0.         0.81649658 0.         0.         0.40824829 0.40824829]
 [0.         0.77100584 0.50689001 0.         0.         0.38550292]]


# Classifying spam
We will use the sms spam collection dataset taken from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

In [35]:
import numpy as np
import pandas as pd

In [36]:
column_names = ['label', 'message']
df = pd.read_csv('SMSSpamCollection', sep='\t', names=column_names, header=None)
df[0:5]

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [37]:
#check missing values
df.isnull().sum()

label      0
message    0
dtype: int64

In [None]:
# df['label'].replace('', np.nan, inplace=True)
# df['message'].replace('', np.nan, inplace=True)
# df.dropna(inplace=True)

In [38]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [39]:
# split into training and test sets
from sklearn.model_selection import train_test_split

In [40]:
X = df['message']
y = df['label']

In [41]:
# hold out 33% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [42]:
from sklearn.feature_extraction.text import CountVectorizer

In [43]:
count_vect = CountVectorizer()

In [44]:
# fit vectorizer to data: build dictionary, count words,...
# transform: transform original text message to the vector
X_train_counts = count_vect.fit_transform(X_train)

In [45]:
X_train.shape

(3733,)

In [46]:
X_train_counts

<3733x7082 sparse matrix of type '<class 'numpy.int64'>'
	with 49992 stored elements in Compressed Sparse Row format>

The vocabulary has 7082 words, but the matrix is sparse: most of its entries are zero.
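
How sparse the matrix is can be worked out from the numbers reported above (3733 rows, 7082 columns, 49992 stored elements). A small arithmetic sketch:

```python
n_rows, n_cols, stored = 3733, 7082, 49992

# Fraction of cells that actually hold a non-zero count.
density = stored / (n_rows * n_cols)
# Average number of distinct vocabulary words per message.
avg_words_per_message = stored / n_rows

print(f"density: {density:.4%}")
print(f"distinct words per message: {avg_words_per_message:.1f}")
```

Fewer than 0.2% of the cells are non-zero, which is why scikit-learn stores the result in a sparse format.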

In [47]:
from sklearn.feature_extraction.text import TfidfTransformer

In [48]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)

Instead of running count vectorization followed by a TF-IDF transformation, we can use `TfidfVectorizer`, which performs both steps at once.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

In [None]:
X_train_tfidf = vectorizer.fit_transform(X_train)

## Classifier

In [49]:
from sklearn.svm import LinearSVC

In [50]:
clf = LinearSVC()

In [51]:
clf.fit(X_train_tfidf, y_train)

## Pipeline
A pipeline is a simple way to chain the processing steps so that the whole sequence can be re-applied to new data.

In [52]:
from sklearn.pipeline import Pipeline
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [53]:
text_clf.fit(X_train, y_train)

In [54]:
predictions = text_clf.predict(X_test)

In [55]:
from sklearn.metrics import confusion_matrix, classification_report

In [56]:
print(confusion_matrix(y_test, predictions))

[[1586    7]
 [  12  234]]


In [57]:
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [58]:
from sklearn import metrics
metrics.accuracy_score(y_test, predictions)

0.989668297988037
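
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them for the `spam` class, assuming the usual scikit-learn layout (rows are true labels, columns are predicted labels, in the order `ham`, `spam`).

```python
# Confusion matrix from above: [[1586, 7], [12, 234]]
tn, fp = 1586, 7    # true ham:  predicted ham / predicted spam
fn, tp = 12, 234    # true spam: predicted ham / predicted spam

precision = tp / (tp + fp)            # how many predicted spam are really spam
recall = tp / (tp + fn)               # how many real spam messages were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.6f}")
```

Because the classes are imbalanced, precision and recall on `spam` are more informative than accuracy alone.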

In [59]:
text_clf.predict(["This is a summer school"])

array(['ham'], dtype=object)

In [60]:
text_clf.predict(["Free tickets and CASH"])

array(['spam'], dtype=object)

# Vectors and Similarity
Install Spacy first if it is not already installed:
* `pip install spacy`
* or `conda install -c conda-forge spacy`

Then download the medium or large English model (the code below uses the large one):
* `python -m spacy download en_core_web_md`
* `python -m spacy download en_core_web_lg`


In [1]:
import spacy
nlp = spacy.load('en_core_web_lg')

In [2]:
nlp(u'girl').vector

array([ 1.8023e+00,  3.9075e+00, -4.2940e+00, -7.6117e+00, -3.7172e+00,
       -1.5229e-01, -1.1368e+00, -6.8427e-01, -9.3067e-01,  5.6531e+00,
        4.2536e+00, -4.1175e+00, -8.3049e-01,  2.7701e+00,  6.4474e+00,
       -6.6389e-02, -8.3026e-01, -7.4532e+00,  1.7888e-01,  2.5130e+00,
       -4.4785e-01,  8.4806e+00, -2.7056e+00, -6.9836e+00,  9.2242e-01,
       -3.3579e+00, -3.2071e+00,  1.2901e-01,  3.5933e+00, -4.8096e+00,
        3.2596e-01, -3.0782e-01, -3.8023e+00, -1.2818e-01,  9.7322e-02,
        1.0876e+00, -4.5140e+00, -8.5375e-02, -4.4139e+00, -1.4073e+00,
       -2.4729e+00,  1.3307e-01,  3.1949e+00,  2.9971e+00,  5.3643e+00,
       -3.2407e+00, -2.7512e+00,  3.6586e-01,  2.7333e-01,  6.6513e+00,
        4.8740e+00,  1.3732e+00, -7.3595e-01, -2.3265e+00,  1.4045e+00,
        1.5080e-01,  3.1985e+00, -5.7459e+00,  3.5059e+00,  8.1671e-01,
       -1.1113e+00, -8.9306e-01, -4.2963e+00,  8.4042e-01, -8.3586e-01,
       -2.5407e+00, -1.1414e+00, -5.5050e+00, -3.6670e+00,  1.73

In [3]:
nlp(u'girl').vector.shape

(300,)

In [4]:
# Document vector: the average of the token vectors
nlp(u'the girl is blond').vector.shape

(300,)
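
The averaging mentioned in the comment above can be illustrated with toy data. The 3-dimensional vectors below are made up for the example (real model vectors have 300 dimensions), and `mean_vector` is a hypothetical helper.

```python
# Made-up 3-dimensional vectors, for illustration only.
word_vectors = {
    "the":  [0.0, 1.0, 0.0],
    "girl": [1.0, 0.0, 2.0],
}


def mean_vector(words):
    vectors = [word_vectors[w] for w in words]
    # Average each dimension across the word vectors.
    return [sum(column) / len(vectors) for column in zip(*vectors)]


doc_vector = mean_vector(["the", "girl"])
print(doc_vector)
```

The resulting vector has the same number of dimensions as each word vector, regardless of how many words the document contains.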

In [5]:
doc = nlp(u'cat lion dog pet')
#doc = nlp(u'buy sell rent')

In [6]:
for word1 in doc:
    for word2 in doc:
        print(word1.text, word2.text, word1.similarity(word2))

cat cat 1.0
cat lion 0.3854507803916931
cat dog 0.8220816850662231
cat pet 0.732966423034668
lion cat 0.3854507803916931
lion lion 1.0
lion dog 0.2949307858943939
lion pet 0.20031584799289703
dog cat 0.8220816850662231
dog lion 0.2949307858943939
dog dog 1.0
dog pet 0.7856059074401855
pet cat 0.732966423034668
pet lion 0.20031584799289703
pet dog 0.7856059074401855
pet pet 1.0


In [7]:
nlp.vocab.vectors.shape

(514157, 300)

In [8]:
doc = nlp(u'catr')
token = doc[0]
print(token.has_vector)
print(token.is_oov)

False
True


In [9]:
from scipy import spatial

cosine_similarity = lambda v1, v2: 1- spatial.distance.cosine(v1, v2)
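
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A dependency-free sketch, named `cosine_sim` here to avoid clashing with the lambda above:

```python
import math


def cosine_sim(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)


print(cosine_sim([1, 2], [2, 4]))   # same direction
print(cosine_sim([1, 0], [0, 1]))   # orthogonal
print(cosine_sim([1, 0], [-1, 0]))  # opposite direction
```

The value depends only on the angle between the vectors, not on their lengths, so it ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction).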

In [10]:
king = nlp.vocab['king'].vector
man = nlp.vocab['man'].vector
woman = nlp.vocab['woman'].vector

In [11]:
# king - man + woman
new_vector = king-man+ woman

In [12]:
computed_similarity = []
for key in nlp.vocab.vectors:
    word = nlp.vocab[key]
    # keep only lowercase alphabetic words that have a vector
    if word.has_vector and word.is_lower and word.is_alpha:
        similarity = cosine_similarity(new_vector, word.vector)
        computed_similarity.append((word, similarity))

In [13]:
# sort by similarity, highest first, and show the 50 closest words
computed_similarity = sorted(computed_similarity, key=lambda item: -item[1])
print([t[0].text for t in computed_similarity[:50]])

['king', 'kings', 'princes', 'consort', 'princeling', 'monarch', 'princelings', 'princesses', 'prince', 'kingship', 'princess', 'ruler', 'consorts', 'kingi', 'princedom', 'rulers', 'kingii', 'enthronement', 'monarchical', 'queen', 'monarchs', 'enthroning', 'queening', 'regents', 'principality', 'kingsize', 'throne', 'princesa', 'dynastic', 'princedoms', 'nobility', 'monarchic', 'imperial', 'princesse', 'rulership', 'courtiers', 'dynasties', 'monarchial', 'kingdom', 'predynastic', 'enthrone', 'succession', 'princely', 'royal', 'kingly', 'mcqueen', 'dethronement', 'royally', 'emperor', 'princeps']


## References



* [Spacy](https://spacy.io/usage/spacy-101/#annotations) 
* [NLTK stemmer](https://www.nltk.org/howto/stem.html)
* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)
* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)
* Natural Language Processing with Python, José Portilla, 2019.

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© Carlos A. Iglesias, Universidad Politécnica de Madrid.