{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Vector Representation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [Tools](#Tools)\n", "* [Vector representation: Count vector](#Vector-representation:-Count-vector)\n", "* [Binary vectors](#Binary-vectors)\n", "* [Bigram vectors](#Bigram-vectors)\n", "* [Tf-idf vector representation](#Tf-idf-vector-representation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook we are going to transform text into feature vectors, using several representations as presented in class.\n", "\n", "We are going to use the examples from the slides." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "doc1 = 'Summer is coming but Summer is short'\n", "doc2 = 'I like the Summer and I like the Winter'\n", "doc3 = 'I like sandwiches and I like the Winter'\n", "documents = [doc1, doc2, doc3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Tools" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The different tools we have presented so far (NLTK, Scikit-Learn, TextBlob and CLiPS) provide overlapping functionality for obtaining vector representations and applying machine learning algorithms.\n", "\n", "We are going to focus on scikit-learn, so that we can also easily use Pandas, as we saw in the previous topic.\n", "\n", "Scikit-learn provides specific facilities for processing text, as described in the [manual](http://scikit-learn.org/stable/modules/feature_extraction.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Vector representation: Count vector" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn provides two classes for building count vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) and [HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html). The latter is more efficient, but because it hashes tokens it does not let us inspect which features are most important, so we use the first class. Nevertheless, the two are compatible, so they can be interchanged in production environments.\n", "\n", "The first step in vectorizing with scikit-learn is to create a CountVectorizer object; we then call 'fit_transform' to fit the vocabulary." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
CountVectorizer(max_features=5000)