![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias

# Text Classification

# Table of Contents
* [Objectives](#Objectives)
* [Corpus](#Corpus)
* [Classifier](#Classifier)

# Objectives

In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.

The main objectives of this session are:
* Understand how to apply machine learning techniques to textual sources
* Learn the facilities provided by scikit-learn

# Corpus

We are going to use one of the corpora that come prepackaged with scikit-learn: the [20 newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroups dataset contains approximately 20,000 documents that belong to 20 topics.

We now inspect the corpus using the facilities from scikit-learn, as explained in the [scikit-learn documentation](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups).

In [None]:
from sklearn.datasets import fetch_20newsgroups

# We remove metadata to avoid bias in the classification
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# print categories
print(list(newsgroups_train.target_names))

In [None]:
#Number of categories
print(len(newsgroups_train.target_names))

In [None]:
# Show a document
docid = 1
doc = newsgroups_train.data[docid]
cat = newsgroups_train.target[docid]

print("Category id " + str(cat) + " " + newsgroups_train.target_names[cat])
print("Doc " + doc)

In [None]:
#Number of files
newsgroups_train.filenames.shape

In [None]:
# Obtain a vector

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_train.shape
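
The difference between `fit_transform` and `transform` can be seen on a tiny example. The sketch below is only illustrative: the two training sentences and the test sentence are hypothetical, not part of the 20 newsgroups corpus. It shows that the vocabulary is learned from the training documents only, so a word that appears only at test time (here `printer`) has no column of its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, used only for illustration
train_docs = ["the pc has a new graphics card", "god is love"]
test_docs = ["a new pc and a new printer"]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')

# fit_transform learns the vocabulary and idf weights from the training documents
train_vectors = toy_vectorizer.fit_transform(train_docs)

# transform reuses the learned vocabulary; unseen words such as 'printer' are ignored
test_vectors = toy_vectorizer.transform(test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(train_vectors.shape, test_vectors.shape)
```

Both matrices have the same number of columns, which is what allows a model trained on one to be applied to the other.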

In [None]:
# The tf-idf vectors are very sparse: on average 66 non-zero components out of 101,323 dimensions (0.06%)
vectors_train.nnz / float(vectors_train.shape[0])
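
As a rough illustration of how these sparsity figures are obtained, the following sketch computes the average number of non-zero components per document and the overall density of a tf-idf matrix. The three sentences are a hypothetical toy corpus, so the numbers will differ from those of the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the computation
toy_docs = [
    "the cat sat on the mat",
    "dogs chase cats in the garden",
    "stock markets fell sharply today",
]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
toy_vectors = toy_vectorizer.fit_transform(toy_docs)

n_docs, n_terms = toy_vectors.shape

# nnz is the number of stored non-zero entries of the sparse matrix
avg_nonzero = toy_vectors.nnz / n_docs
# Fraction of cells in the document-term matrix that are non-zero
toy_density = toy_vectors.nnz / (n_docs * n_terms)

print("documents: %d, vocabulary size: %d" % (n_docs, n_terms))
print("average non-zero components per document: %.2f" % avg_nonzero)
print("density: %.2f%%" % (100 * toy_density))
```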

# Classifier

Once we have vectors, we can train classifiers (or apply other machine learning algorithms, such as clustering), as we saw previously in the notebooks on machine learning with scikit-learn.

In [None]:
from sklearn.naive_bayes import MultinomialNB

from sklearn import metrics


# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)
# Nevertheless, we only transform the test dataset into vectors (transform, not fit_transform)

model = MultinomialNB(alpha=.01)
model.fit(vectors_train, newsgroups_train.target)

vectors_test = vectorizer.transform(newsgroups_test.data)
pred = model.predict(vectors_test)

metrics.f1_score(newsgroups_test.target, pred, average='weighted')


We obtain an F1 score of 0.69 over 20 categories. This could be improved (optimization, preprocessing, etc.).
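
One possible way to approach the optimization mentioned above is to wrap the vectorizer and the classifier in a `Pipeline` and search over hyperparameters with `GridSearchCV`. The sketch below is a minimal illustration under stated assumptions: the eight sentences, their labels, and the grid of `alpha` values are hypothetical choices, and it does not claim to reproduce or improve the score reported for the newsgroups data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 0 = computing, 1 = religion
toy_docs = [
    "new graphics card for my pc",
    "the pc monitor and keyboard are broken",
    "install the driver on your computer",
    "computer memory and disk upgrade",
    "god is love and faith",
    "the church prayer and faith",
    "bible study and prayer group",
    "god and the bible",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline refits the vectorizer inside each cross-validation fold,
# so the vocabulary is never learned from the held-out documents
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

# Candidate smoothing values (an assumption, chosen only for illustration)
param_grid = {"nb__alpha": [0.01, 0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, scoring="f1_weighted", cv=2)
search.fit(toy_docs, toy_labels)

print(search.best_params_)
print(search.predict(["faith and prayer", "pc keyboard"]))
```

On the real corpus, the same pattern would be fitted on `newsgroups_train.data` and `newsgroups_train.target`, and the grid could also cover vectorizer options.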

In [None]:
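from sklearn.utils.extmath import density

# MultinomialNB.coef_ was removed in scikit-learn 1.1; feature_log_prob_ holds the per-class feature log probabilities
print("dimensionality: %d" % model.feature_log_prob_.shape[1])
print("density: %f" % density(model.feature_log_prob_))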

In [None]:
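# We can review the top features per topic in Bayes
# (attribute feature_log_prob_; coef_ was removed from MultinomialNB in scikit-learn 1.1)
import numpy as np

def show_top10(classifier, vectorizer, categories):
    # get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # Indices of the 10 features with the highest log probability for this category
        top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))


show_top10(model, vectorizer, newsgroups_train.target_names)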

In [None]:
# We try the classifier on two new docs

new_docs = ['This is a survey of PC computers', 'God is love']
new_vectors = vectorizer.transform(new_docs)

pred_docs = model.predict(new_vectors)
print(pred_docs)
print([newsgroups_train.target_names[i] for i in pred_docs])
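
Besides the predicted label, it can be useful to look at how confident the classifier is. The following sketch uses `predict_proba` on a small hypothetical training set (four invented sentences with two invented category names), so that it runs without downloading the corpus. With the model trained above, the same call would be `model.predict_proba(new_vectors)`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set: 0 = computing, 1 = religion
toy_docs = [
    "my pc needs a new graphics card",
    "the computers in the lab run linux",
    "god is love and faith",
    "the church held a prayer service",
]
toy_labels = [0, 0, 1, 1]
toy_names = ["computing", "religion"]

toy_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')
toy_model = MultinomialNB(alpha=.01)
toy_model.fit(toy_vectorizer.fit_transform(toy_docs), toy_labels)

new_docs = ['This is a survey of PC computers', 'God is love']
# One row per document, one column per class; each row sums to 1
probs = toy_model.predict_proba(toy_vectorizer.transform(new_docs))

for doc, row in zip(new_docs, probs):
    best = row.argmax()
    print("%s -> %s (p=%.2f)" % (doc, toy_names[best], row[best]))
```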

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid.