{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [Corpus](#Corpus)\n", "* [Classifier](#Classifier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this session we provide a quick overview of how the vector models we have presented previously can be used for applying machine learning techniques, such as classification.\n", "\n", "The main objectives of this session are:\n", "* Understand how to apply machine learning techniques on textual sources\n", "* Learn the facilities provided by scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Corpus" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to use on of the corpus that come prepackaged with Scikit-learn: the [20 newsgroup datase](http://qwone.com/~jason/20Newsgroups/). The 20 newsgroup dataset contains 20k documents that belong to 20 topics.\n", "\n", "We inspect now the corpus using the facilities from Scikit-learn, as explain in [scikit-learn](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html#newsgroups)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "\n", "# We remove metadata to avoid bias in the classification\n", "newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))\n", "newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))\n", "\n", "# print categories\n", "print(list(newsgroups_train.target_names))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Number of categories\n", "print(len(newsgroups_train.target_names))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show a document\n", "docid = 1\n", "doc = newsgroups_train.data[docid]\n", "cat = newsgroups_train.target[docid]\n", "\n", "print(\"Category id \" + str(cat) + \" \" + newsgroups_train.target_names[cat])\n", "print(\"Doc \" + doc)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Number of files\n", "newsgroups_train.filenames.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Obtain a vector\n", "\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vectorizer = TfidfVectorizer(analyzer='word', stop_words='english')\n", "\n", "vectors_train = vectorizer.fit_transform(newsgroups_train.data)\n", "vectors_train.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The tf-idf vectors are very sparse with an average of 66 non zero components in 101.323 dimensions (.06%)\n", "vectors_train.nnz / float(vectors_train.shape[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Classifier" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Once we have vectors, we can create classifiers (or other machine learning algorithms such as clustering) as we saw previously in the notebooks of machine learning with scikit-learn." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.naive_bayes import MultinomialNB\n", "\n", "from sklearn import metrics\n", "\n", "\n", "# We learn the vocabulary (fit) with the train dataset and transform into vectors (fit_transform)\n", "# Nevertheless, we only transform the test dataset into vectors (transform, not fit_transform)\n", "\n", "model = MultinomialNB(alpha=.01)\n", "model.fit(vectors_train, newsgroups_train.target)\n", "\n", "vectors_test = vectorizer.transform(newsgroups_test.data)\n", "pred = model.predict(vectors_test)\n", "\n", "metrics.f1_score(newsgroups_test.target, pred, average='weighted')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are getting F1 of 0.69 for 20 categories this could be improved (optimization, preprocessing, etc.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.utils.extmath import density\n", "\n", "print(\"dimensionality: %d\" % model.coef_.shape[1])\n", "print(\"density: %f\" % density(model.coef_))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can review the top features per topic in Bayes (attribute coef_)\n", "import numpy as np\n", "\n", "def show_top10(classifier, vectorizer, categories):\n", " feature_names = np.asarray(vectorizer.get_feature_names())\n", " for i, category in enumerate(categories):\n", " top10 = np.argsort(classifier.coef_[i])[-10:]\n", " print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n", "\n", " \n", "show_top10(model, vectorizer, newsgroups_train.target_names)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We try the classifier in two new docs\n", "\n", "new_docs = ['This is a survey of PC computers', 'God is love']\n", "new_vectors = vectorizer.transform(new_docs)\n", "\n", "pred_docs = model.predict(new_vectors)\n", "print(pred_docs)\n", "print([newsgroups_train.target_names[i] for i in pred_docs])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n", "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }