sitc/nlp/0_1_NLP_Slides.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Lexical Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [NLP Basics](#NLP-Basics)\n",
    "    * [Spacy installation](#Spacy-installation)\n",
    "    * [Spacy pipeline](#Spacy-pipeline)\n",
    "    * [Tokenization](#Tokenization)\n",
    "    * [Noun chunks](#Noun-chunks)\n",
    "    * [Stemming](#Stemming)\n",
    "    * [Sentence segmentation](#Sentence-segmentation)\n",
    "    * [Lemmatization](#Lemmatization)\n",
    "    * [Stop words](#Stop-words)\n",
    "    * [POS](#POS)\n",
    "    * [NER](#NER)\n",
    "* [Text Feature extraction](#Text-Feature-extraction)\n",
    "* [Classifying spam](#Classifying-spam)\n",
    "* [Vectors and similarity](#Vectors-and-similarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "In this session we are going to learn to process text so that can apply machine learning techniques."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# NLP Basics\n",
    "In this notebook we are going to use two popular NLP libraries:\n",
    "* NLTK (Natural Language Toolkit, https://www.nltk.org/) \n",
    "* Spacy (https://spacy.io/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Main characteristics:\n",
    "* both are open source and very popular\n",
    "* NLTK was released in 2001 while Spacy was in 2015\n",
    "* Spacy provides very efficient implementations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Spacy installation\n",
    "\n",
    "You need to install previously spacy if not installed:\n",
    "* `pip install spacy`\n",
    "* or `conda install -c conda-forge spacy`\n",
    "\n",
    "and install the small English model \n",
    "* `python -m spacy download en_core_web_sm`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Spacy pipelines\n",
    "\n",
    "The function **nlp** takes a raw text and perform several operations (tokenization, tagger, NER, ...)\n",
    "![](spacy/spacy-pipeline.svg \"Spacy pipelines\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "From text to doc trough the pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "doc = nlp(u'Albert Einstein won the Nobel Prize for Physics in 1921')\n",
    "doc2 = nlp(u'\"Let\\'s go to N.Y.!\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Tokenization\n",
    "From text to tokens\n",
    "![](spacy/tokenization.svg \"Tokenization\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The tokenizer checks:\n",
    "\n",
    "* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.\n",
    "* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.\n",
    "* **Suffix:** Character(s) at the end, e.g. km, ), ”, !.\n",
    "* **Infix:** Character(s) in between, e.g. -, --, /, …."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Let's do it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Albert\n",
      "Einstein\n",
      "won\n",
      "the\n",
      "Nobel\n",
      "Prize\n",
      "for\n",
      "Physics\n",
      "in\n",
      "1921\n"
     ]
    }
   ],
   "source": [
    "# print tokens\n",
    "for token in doc:\n",
    "    print(token.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"\n",
      "Let\n",
      "'s\n",
      "go\n",
      "to\n",
      "N.Y.\n",
      "!\n",
      "\"\n"
     ]
    }
   ],
   "source": [
    "for token in doc2:\n",
    "    print(token.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Noun chunks\n",
    "Noun phrases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Autonomous cars cars nsubj shift\n",
      "insurance liability liability dobj shift\n",
      "manufacturers manufacturers pobj toward\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(\"Autonomous cars shift insurance liability toward manufacturers\")\n",
    "for chunk in doc.noun_chunks:\n",
    "    print(chunk.text, chunk.root.text, chunk.root.dep_,\n",
    "            chunk.root.head.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"4ea90bf45ed945c3978bf5e93dc2e120-0\" class=\"displacy\" width=\"960\" height=\"332.0\" direction=\"ltr\" style=\"max-width: none; height: 332.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Autonomous</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"180\">cars</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"180\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"310\">shift</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"310\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"440\">insurance</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"440\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"570\">liability</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"570\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"700\">toward</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"700\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"830\">manufacturers</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"830\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" stroke-width=\"2px\" d=\"M70,197.0 C70,132.0 170.0,132.0 170.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M70,199.0 L62,187.0 78,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" stroke-width=\"2px\" d=\"M200,197.0 C200,132.0 300.0,132.0 300.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M200,199.0 L192,187.0 208,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" stroke-width=\"2px\" d=\"M460,197.0 C460,132.0 560.0,132.0 560.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M460,199.0 L452,187.0 468,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" stroke-width=\"2px\" d=\"M330,197.0 C330,67.0 565.0,67.0 565.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M565.0,199.0 L573.0,187.0 557.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" stroke-width=\"2px\" d=\"M330,197.0 C330,2.0 700.0,2.0 700.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M700.0,199.0 L708.0,187.0 692.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" stroke-width=\"2px\" d=\"M720,197.0 C720,132.0 820.0,132.0 820.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M820.0,199.0 L828.0,187.0 812.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from spacy import displacy\n",
    "displacy.render(doc, style='dep', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Sentence segmentation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is a sentence.\n",
      "This is another sentence.\n"
     ]
    }
   ],
   "source": [
    "doc3 =nlp(u'This is a sentence. This is another sentence.')\n",
    "for sent in doc3.sents:\n",
    "    print(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stemming\n",
    "Spacy does not include a stemmer. \n",
    "We will use nltk.\n",
    "The purpose is removing the ending of a word based on rules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "caress\n",
      "fli\n",
      "is\n",
      "been\n",
      "gener\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "\n",
    "stemmer = PorterStemmer()\n",
    "words = ['caresses', 'flies','is', 'been', 'generously']\n",
    "stems = [stemmer.stem(word) for word in words]\n",
    "for stem in stems:\n",
    "    print(stem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Lemmatization\n",
    "Lemmatization includes a morphological analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'be', 'read', 'the', 'paper', '.']\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(\"I was reading the paper.\")\n",
    "print([token.lemma_ for token in doc])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t I\n",
      "was \t AUX \t be\n",
      "reading \t VERB \t read\n",
      "the \t DET \t the\n",
      "paper \t NOUN \t paper\n",
      ". \t PUNCT \t .\n"
     ]
    }
   ],
   "source": [
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_,  '\\t', token.lemma_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'other', 'toward', 'everywhere', 'whether', 'i', 'his', 'afterwards', 'whenever', 'except', 'but', 'for', 'noone', 'whereas', 'she', 'name', 'per', 'among', 'where', 'am', '’re', 'on', 'thereby', 'fifty', \"'ll\", '‘m', 'would', 'we', 'once', 'can', 'meanwhile', 'anything', 'still', 'these', 'without', 'below', 'rather', 'were', 'six', 'them', 'latter', 'sometime', 'seemed', 'next', 'move', 'there', 'various', 'fifteen', '’m', 'onto', 'n’t', 'much', 'one', 'due', 'our', 'although', 'whom', 'done', 'made', 'at', 'empty', 'about', 'herself', 'down', 'never', 'thereafter', \"'re\", 'off', 'he', 'hereafter', 'whereafter', 'what', 'above', 'along', 'a', '’ve', '’s', 'else', 'or', 'sometimes', 'twenty', \"n't\", 'over', 'both', 'someone', 'beside', 'therein', 'whole', 'make', 'any', 'becomes', 'anyhow', 'quite', 'hence', 'here', 'same', 'which', 'whereby', 'whereupon', 'must', 'me', 'part', 'serious', 'into', 'namely', 'hers', 'enough', 'with', 'because', 'own', 'give', 'see', 'somehow', 'since', 'just', 'seems', 'top', 'across', 'ourselves', 'in', 'anywhere', 'few', 'myself', 'say', 'all', 'together', 'had', 'back', 'besides', 'please', 'n‘t', 'many', 'whoever', \"'d\", 'nothing', 'be', 'did', 'yours', 'how', 'also', 'those', 'that', 'throughout', '‘ve', 'amount', 'others', 'keep', 'hundred', 'up', 'already', 'amongst', 'front', 'thru', 'if', '‘s', 'least', 'us', 'anyone', 'might', 'thereupon', 'third', 'nor', 'sixty', 'nobody', 'more', 'her', 'by', 'himself', 'each', 'than', 're', 'behind', 'almost', 'seeming', 'when', 'is', 'mostly', 'so', 'ours', 'becoming', 'towards', 'some', 'two', 'seem', 'put', 'four', 'you', 'call', 'thence', 'moreover', 'nowhere', 'do', 'former', 'formerly', 'elsewhere', 'full', 'after', 'thus', 'less', 'go', 'ever', 'nine', 'anyway', 'somewhere', 'will', 'three', 'within', 'been', 'before', 'of', 'has', 'beyond', 'such', 'why', 'none', 'whose', 'eight', 'either', 'no', 'itself', 'doing', \"'ve\", '‘d', 'though', 'neither', 'while', 'get', 'around', 'your', 'twelve', 'even', 'and', 'something', 'always', \"'m\", 'until', 'an', 'during', 'it', 'most', 'the', 'yourselves', '’d', '‘ll', 'very', 'really', 'show', 'too', 'everyone', 'wherein', 'therefore', 'whither', 'who', 'may', 'latterly', 'between', 'its', 'however', '‘re', 'through', 'everything', 'another', 'does', 'whatever', 'their', 'then', 'beforehand', 'my', 'eleven', 'now', 'should', 'cannot', 'have', 'take', \"'s\", 'perhaps', 'regarding', 'against', 'used', 'indeed', 'upon', 'him', 'using', 'under', 'became', 'to', 'hereupon', 'bottom', 'not', 'from', 'hereby', 'side', 'yet', 'this', 'could', 'themselves', '’ll', 'ca', 'nevertheless', 'via', 'five', 'become', 'as', 'only', 'otherwise', 'are', 'was', 'first', 'further', 'ten', 'herein', 'yourself', 'forty', 'alone', 'every', 'again', 'well', 'last', 'several', 'wherever', 'mine', 'often', 'whence', 'out', 'they', 'being', 'unless'}\n"
     ]
    }
   ],
   "source": [
    "nlp = spacy.load('en_core_web_sm')\n",
    "print(nlp.Defaults.stop_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['for'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['day'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['btw'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#add stop words\n",
    "nlp.Defaults.stop_words.add('btw')\n",
    "nlp.vocab['btw'].is_stop = True\n",
    "nlp.vocab['btw'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original Text\n",
      "Nick likes to play football, however he is not too fond of tennis. \n",
      "\n",
      "\n",
      "Text after removing stop words\n",
      "Nick likes play football, fond tennis.\n"
     ]
    }
   ],
   "source": [
    "en_stopwords = nlp.Defaults.stop_words\n",
    "text = \"Nick likes to play football, however he is not too fond of tennis.\"\n",
    "\n",
    "lst=[]\n",
    "for token in text.split():\n",
    "    if token.lower() not in en_stopwords:\n",
    "        lst.append(token)\n",
    "\n",
    "print('Original Text')        \n",
    "print(text,'\\n\\n')\n",
    "\n",
    "print('Text after removing stop words')\n",
    "print(' '.join(lst))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We can also use methods from the token class (https://spacy.io/api/token), such as:\n",
    "\n",
    "* **is_stop:** is the token a stop word?\n",
    "* **is_punct:** is the token punctuation?\n",
    "* **like_email:** does the token resemble an email address?\n",
    "* **is_digit:** Does the token consist of digits? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## POS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t pronoun\n",
      "was \t AUX \t auxiliary\n",
      "reading \t VERB \t verb\n",
      "the \t DET \t determiner\n",
      "paper \t NOUN \t noun\n",
      ". \t PUNCT \t punctuation\n"
     ]
    }
   ],
   "source": [
    "# POS\n",
    "# information available at https://spacy.io/usage/linguistic-features/#pos-tagging\n",
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t pronoun \t nsubj\n",
      "was \t AUX \t auxiliary \t aux\n",
      "reading \t VERB \t verb \t ROOT\n",
      "the \t DET \t determiner \t det\n",
      "paper \t NOUN \t noun \t dobj\n",
      ". \t PUNCT \t punctuation \t punct\n"
     ]
    }
   ],
   "source": [
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_), '\\t', token.dep_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f74618d3d00>),\n",
       " ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f74618d3880>),\n",
       " ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f74619006d0>),\n",
       " ('attribute_ruler',\n",
       "  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f74618ffd00>),\n",
       " ('lemmatizer',\n",
       "  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f746171d900>),\n",
       " ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f74619007b0>)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List the pipeline\n",
    "nlp.pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.pipe_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "We can also get some statistics, regarding the frequency of tags in POS, DEP or TAG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ADJ 1\n",
      "ADP 1\n",
      "ADV 2\n",
      "AUX 1\n",
      "NOUN 2\n",
      "PART 2\n",
      "PRON 1\n",
      "PROPN 1\n",
      "PUNCT 2\n",
      "VERB 2\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(u'Nick likes to play football, however he is not too fond of tennis.')\n",
    "POS_counts = doc.count_by(spacy.attrs.POS)\n",
    "\n",
    "for code,freq in sorted(POS_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RB 3\n",
      "IN 1\n",
      ", 1\n",
      "TO 1\n",
      "JJ 1\n",
      ". 1\n",
      "PRP 1\n",
      "VBZ 2\n",
      "VB 1\n",
      "NN 2\n",
      "NNP 1\n"
     ]
    }
   ],
   "source": [
    "TAG_counts = doc.count_by(spacy.attrs.TAG)\n",
    "\n",
    "for code,freq in sorted(TAG_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "acomp 1\n",
      "advmod 2\n",
      "aux 1\n",
      "ccomp 1\n",
      "dobj 1\n",
      "neg 1\n",
      "nsubj 2\n",
      "pobj 1\n",
      "prep 1\n",
      "punct 2\n",
      "xcomp 1\n",
      "ROOT 1\n"
     ]
    }
   ],
   "source": [
    "DEP_counts = doc.count_by(spacy.attrs.DEP)\n",
    "\n",
    "for code,freq in sorted(DEP_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## NER"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Apple 0 5 ORG\n",
      "U.K. 27 31 GPE\n",
      "$1 billion 44 54 MONEY\n"
     ]
    }
   ],
   "source": [
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')\n",
    "\n",
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Apple\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " is looking at buying \n",
       "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    U.K.\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
       "</mark>\n",
       " startup for \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    $1 billion\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       "</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">Over \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    the last quarter\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Apple\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " sold \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    nearly 20 thousand\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    iPods\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
       "</mark>\n",
       " for a profit of \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    $6 million\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       ".</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')\n",
    "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Text Feature Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## CountVectorizer\n",
    "Transforming text into a vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
       "       'summer', 'the', 'winter'], dtype=object)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sklearn\n",
    "# Count vectorization\n",
    "texts = [\"Summer is coming but Summer is short\", \n",
    "         \"I like the Summer and I like the Winter\", \n",
    "         \"I like sandwiches and I like the Winter\"]\n",
    "\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "\n",
    "\n",
    "# Count occurrences of unique words\n",
    "# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html \n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "X = vectorizer.fit_transform(texts)\n",
    "vectorizer.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 2 0 0 1 2 0 0]\n",
      " [1 0 0 0 2 0 0 1 2 1]\n",
      " [1 0 0 0 2 1 0 0 1 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X.toarray())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and like', 'but summer', 'coming but', 'is coming', 'is short',\n",
       "       'like sandwiches', 'like the', 'sandwiches and', 'summer and',\n",
       "       'summer is', 'the summer', 'the winter'], dtype=object)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2)) # only bigrams; (1,2):unigram, bigram\n",
    "X2 = vectorizer2.fit_transform(texts)\n",
    "vectorizer2.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 1 1 0 0 0 0 2 0 0]\n",
      " [1 0 0 0 0 0 2 0 1 0 1 1]\n",
      " [1 0 0 0 0 1 1 1 0 0 0 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X2.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Remove stop words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['coming', 'like', 'sandwiches', 'short', 'summer', 'winter'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer3 = CountVectorizer(analyzer='word', stop_words='english') # only bigrams\n",
    "X3 = vectorizer3.fit_transform(texts)\n",
    "vectorizer3.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 0 0 1 2 0]\n",
      " [0 2 0 0 1 1]\n",
      " [0 2 1 0 0 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X3.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## TF-IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
       "       'summer', 'the', 'winter'], dtype=object)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vect = TfidfVectorizer()\n",
    "XTIDF = vect.fit_transform(texts)\n",
    "vect.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 2 0 0 1 2 0 0]\n",
      " [1 0 0 0 2 0 0 1 2 1]\n",
      " [1 0 0 0 2 1 0 0 1 1]]\n",
      "[[0.         0.32767345 0.32767345 0.65534691 0.         0.\n",
      "  0.32767345 0.49840822 0.         0.        ]\n",
      " [0.30151134 0.         0.         0.         0.60302269 0.\n",
      "  0.         0.30151134 0.60302269 0.30151134]\n",
      " [0.33846987 0.         0.         0.         0.67693975 0.44504721\n",
      "  0.         0.         0.33846987 0.33846987]]\n"
     ]
    }
   ],
   "source": [
    "# Counter\n",
    "print(X.toarray())\n",
    "# TF-IDF\n",
    "print(XTIDF.toarray())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['coming' 'like' 'sandwiches' 'short' 'summer' 'winter']\n",
      "[[1 0 0 1 2 0]\n",
      " [0 2 0 0 1 1]\n",
      " [0 2 1 0 0 1]]\n",
      "[[0.48148213 0.         0.         0.48148213 0.73235914 0.        ]\n",
      " [0.         0.81649658 0.         0.         0.40824829 0.40824829]\n",
      " [0.         0.77100584 0.50689001 0.         0.         0.38550292]]\n"
     ]
    }
   ],
   "source": [
    "vect3 = TfidfVectorizer(analyzer='word', stop_words='english')\n",
    "XTIDF3 = vect3.fit_transform(texts)\n",
    "print(vect3.get_feature_names_out())\n",
    "print(X3.toarray())\n",
    "print(XTIDF3.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Classifying spam\n",
    "We will use the sms spam collection dataset taken from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ham</td>\n",
       "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ham</td>\n",
       "      <td>Ok lar... Joking wif u oni...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>spam</td>\n",
       "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ham</td>\n",
       "      <td>U dun say so early hor... U c already then say...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ham</td>\n",
       "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  label                                            message\n",
       "0   ham  Go until jurong point, crazy.. Available only ...\n",
       "1   ham                      Ok lar... Joking wif u oni...\n",
       "2  spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
       "3   ham  U dun say so early hor... U c already then say...\n",
       "4   ham  Nah I don't think he goes to usf, he lives aro..."
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "column_names = ['label', 'message']\n",
    "df = pd.read_csv('SMSSpamCollection', sep='\\t', names=column_names, header=None)\n",
    "df[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "label      0\n",
       "message    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#check missing values\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# df['label'].replace('', np.nan, inplace=True)\n",
    "# df['message'].replace('', np.nan, inplace=True)\n",
    "# df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ham     4825\n",
       "spam     747\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['label'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# spling training and testing\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "X = df['message']\n",
    "y = df['label']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "#train\n",
    "X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "count_vect = CountVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# fit vectorizer to data: build dictionary, count words,...\n",
    "# transform: transform original text message to the vector\n",
    "X_train_counts = count_vect.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3733,)"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3733x7082 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 49992 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We see the vocabulary are 7082 words, but most values are zeros"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3733, 7082)"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf_transformer = TfidfTransformer()\n",
    "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n",
    "X_train_tfidf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Instead of first count vectorization and then tf-idf transformation, better TF-IDF vectorizer, which makes these two things"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vectorizer = TfidfVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "X_train_tfidf = vectorizer.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.svm import LinearSVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "clf = LinearSVC()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LinearSVC()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearSVC</label><div class=\"sk-toggleable__content\"><pre>LinearSVC()</pre></div></div></div></div></div>"
      ],
      "text/plain": [
       "LinearSVC()"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train_tfidf, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Pipeline\n",
    "Simple way to define the processing steps for repeating the operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[(&#x27;tfidf&#x27;, TfidfVectorizer()), (&#x27;clf&#x27;, LinearSVC())])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[(&#x27;tfidf&#x27;, TfidfVectorizer()), (&#x27;clf&#x27;, LinearSVC())])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">TfidfVectorizer</label><div class=\"sk-toggleable__content\"><pre>TfidfVectorizer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearSVC</label><div class=\"sk-toggleable__content\"><pre>LinearSVC()</pre></div></div></div></div></div></div></div>"
      ],
      "text/plain": [
       "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "predictions = text_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1586    7]\n",
      " [  12  234]]\n"
     ]
    }
   ],
   "source": [
    "print(confusion_matrix(y_test, predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "         ham       0.99      1.00      0.99      1593\n",
      "        spam       0.97      0.95      0.96       246\n",
      "\n",
      "    accuracy                           0.99      1839\n",
      "   macro avg       0.98      0.97      0.98      1839\n",
      "weighted avg       0.99      0.99      0.99      1839\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(y_test, predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.989668297988037"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import metrics\n",
    "metrics.accuracy_score(y_test, predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['ham'], dtype=object)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.predict([\"This is a summer school\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['spam'], dtype=object)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.predict([\"Free tickets and CASH\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Vectors and Similarity\n",
    "You need to install previously spacy if not installed:\n",
    "* `pip install spacy`\n",
    "* or `conda install -c conda-forge spacy`\n",
    "\n",
    "and install the English models (large or medium):\n",
    "* `python -m spacy download en_core_web_md`\n",
    "* `python -m spacy download en_core_web_lg`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_lg')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.8023e+00,  3.9075e+00, -4.2940e+00, -7.6117e+00, -3.7172e+00,\n",
       "       -1.5229e-01, -1.1368e+00, -6.8427e-01, -9.3067e-01,  5.6531e+00,\n",
       "        4.2536e+00, -4.1175e+00, -8.3049e-01,  2.7701e+00,  6.4474e+00,\n",
       "       -6.6389e-02, -8.3026e-01, -7.4532e+00,  1.7888e-01,  2.5130e+00,\n",
       "       -4.4785e-01,  8.4806e+00, -2.7056e+00, -6.9836e+00,  9.2242e-01,\n",
       "       -3.3579e+00, -3.2071e+00,  1.2901e-01,  3.5933e+00, -4.8096e+00,\n",
       "        3.2596e-01, -3.0782e-01, -3.8023e+00, -1.2818e-01,  9.7322e-02,\n",
       "        1.0876e+00, -4.5140e+00, -8.5375e-02, -4.4139e+00, -1.4073e+00,\n",
       "       -2.4729e+00,  1.3307e-01,  3.1949e+00,  2.9971e+00,  5.3643e+00,\n",
       "       -3.2407e+00, -2.7512e+00,  3.6586e-01,  2.7333e-01,  6.6513e+00,\n",
       "        4.8740e+00,  1.3732e+00, -7.3595e-01, -2.3265e+00,  1.4045e+00,\n",
       "        1.5080e-01,  3.1985e+00, -5.7459e+00,  3.5059e+00,  8.1671e-01,\n",
       "       -1.1113e+00, -8.9306e-01, -4.2963e+00,  8.4042e-01, -8.3586e-01,\n",
       "       -2.5407e+00, -1.1414e+00, -5.5050e+00, -3.6670e+00,  1.7393e+00,\n",
       "       -1.9284e+00,  2.7994e+00,  4.4476e+00, -1.0855e+00,  2.5439e+00,\n",
       "       -1.8681e+00,  2.1162e+00,  5.5460e+00,  2.8248e+00, -1.1810e+00,\n",
       "       -9.3259e-01, -1.8681e+00, -3.0654e-02, -3.4096e+00,  2.0261e+00,\n",
       "       -5.4005e-01,  8.2070e-01,  4.3283e+00, -3.4484e+00, -2.1291e+00,\n",
       "        1.2265e+00, -4.4106e-01,  3.8392e+00, -5.8643e+00, -7.3440e-01,\n",
       "        1.9785e+00,  4.1928e+00,  1.4577e+00,  2.8668e+00, -6.3762e+00,\n",
       "        2.7575e+00,  1.7991e+00,  1.3388e-02, -5.1316e-01, -6.3303e+00,\n",
       "       -2.5989e+00, -1.0406e+00,  1.8325e+00,  9.6654e-02,  4.4002e+00,\n",
       "       -1.3231e+00,  2.7717e+00,  4.3340e+00,  2.9027e-01,  7.2542e+00,\n",
       "       -1.2149e+00, -1.7366e+00, -5.2755e+00,  7.5762e-01, -6.0150e+00,\n",
       "        2.1634e+00, -1.6577e+00, -6.4410e+00,  2.5107e+00, -7.6881e+00,\n",
       "       -6.3143e-01,  6.0914e+00,  4.7114e+00,  1.0778e+00,  1.8121e+00,\n",
       "       -3.1133e+00, -5.5923e+00,  5.0992e-01, -2.2783e+00,  1.3641e+00,\n",
       "        3.4367e+00, -1.0224e+00, -3.1824e+00,  2.0683e+00,  2.0398e+00,\n",
       "       -8.2011e+00,  4.5388e-01,  2.7002e+00,  3.9199e+00, -5.5184e-01,\n",
       "       -3.3309e+00, -3.8620e+00,  1.7020e-01,  4.9659e+00,  6.9592e-01,\n",
       "        3.4792e+00, -2.7438e+00, -6.0489e-01,  1.9883e-02,  2.3192e-01,\n",
       "       -4.0591e-01,  3.9470e+00,  1.4145e+00, -8.4031e-01, -1.9433e+00,\n",
       "       -2.5783e+00, -6.8732e+00, -3.7792e+00,  6.4090e+00, -2.3963e+00,\n",
       "       -3.1485e+00,  2.2938e+00, -1.3649e+00, -1.3070e+00, -7.4143e-01,\n",
       "        3.5752e+00,  3.1999e+00, -2.7599e+00,  3.9996e+00, -2.6275e+00,\n",
       "       -3.2632e+00, -2.7695e+00, -2.0046e+00,  3.4848e-01, -3.7322e+00,\n",
       "        3.9018e+00,  1.1883e-02,  6.7589e+00, -4.2182e+00, -1.7291e+00,\n",
       "        1.3949e+00,  5.9161e-01, -4.0226e+00,  1.7388e+00, -1.9609e+00,\n",
       "       -5.4280e-02,  1.4707e+00, -4.2497e+00, -7.4698e-01,  5.7317e+00,\n",
       "       -5.9729e+00,  4.3627e-01,  6.9487e+00, -2.9021e+00,  2.8235e+00,\n",
       "        4.4695e+00, -2.7154e+00, -1.7771e+00, -1.6288e+00, -4.9338e+00,\n",
       "       -2.1144e+00,  1.4976e+00, -4.4156e+00, -3.3974e+00, -9.0295e+00,\n",
       "        2.1685e+00, -1.7372e+00, -8.9336e-03,  2.2437e+00, -1.3924e+00,\n",
       "       -2.5530e+00, -2.0714e+00,  2.0850e+00, -5.2257e+00,  8.4517e-01,\n",
       "        1.6804e+00, -7.9530e+00,  3.8700e+00,  9.2134e+00, -4.5150e+00,\n",
       "        2.8401e+00,  5.1596e-01, -3.7684e+00,  2.3126e+00,  2.2748e+00,\n",
       "       -4.7895e+00, -2.3299e+00, -2.3546e+00, -2.0999e+00, -3.7111e+00,\n",
       "        1.4847e+00, -1.6953e+00,  4.9883e+00,  2.5845e-01,  4.1598e+00,\n",
       "       -8.4808e-01,  3.1341e+00,  4.1797e+00, -9.9561e-01,  1.1814e+00,\n",
       "       -3.0735e+00, -2.7010e+00, -9.5470e-01,  1.4944e+00,  2.4461e+00,\n",
       "       -1.2699e+00,  2.3195e+00, -7.0078e-01,  2.6868e+00, -2.9822e+00,\n",
       "        3.8670e+00,  3.1915e+00,  3.2350e+00, -3.2919e+00,  4.2211e-01,\n",
       "        3.9947e+00, -1.4124e+00, -2.1844e+00, -3.0904e+00,  2.3693e+00,\n",
       "       -2.8532e+00, -7.5463e-01, -3.6133e+00, -7.8667e+00,  4.7647e+00,\n",
       "        2.6976e+00, -2.6137e-01, -5.2056e+00, -2.2392e+00,  2.7426e+00,\n",
       "       -1.2172e+00, -4.4441e-02, -3.1014e+00, -4.7598e+00,  5.2652e+00,\n",
       "       -4.0911e+00, -4.9625e+00,  2.8234e-01,  1.5329e+00,  5.3542e+00,\n",
       "       -1.5295e+00, -3.5151e+00, -1.5575e+00, -3.6066e+00, -3.2199e+00,\n",
       "        4.5560e+00, -3.6332e-01,  1.6928e+00, -2.5321e+00, -4.1381e+00,\n",
       "       -3.4422e+00,  2.4066e+00,  6.1191e+00, -1.1493e+00,  3.0401e+00],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp(u'girl').vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300,)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp(u'girl').vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300,)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Document vector: vector with the average of single words\n",
    "nlp(u'the girl is blond').vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "doc = nlp(u'cat lion dog pet')\n",
    "#doc = nlp(u'buy sell rent')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cat cat 1.0\n",
      "cat lion 0.3854507803916931\n",
      "cat dog 0.8220816850662231\n",
      "cat pet 0.732966423034668\n",
      "lion cat 0.3854507803916931\n",
      "lion lion 1.0\n",
      "lion dog 0.2949307858943939\n",
      "lion pet 0.20031584799289703\n",
      "dog cat 0.8220816850662231\n",
      "dog lion 0.2949307858943939\n",
      "dog dog 1.0\n",
      "dog pet 0.7856059074401855\n",
      "pet cat 0.732966423034668\n",
      "pet lion 0.20031584799289703\n",
      "pet dog 0.7856059074401855\n",
      "pet pet 1.0\n"
     ]
    }
   ],
   "source": [
    "for word1 in doc:\n",
    "    for word2 in doc:\n",
    "        print(word1.text, word2.text, word1.similarity(word2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(514157, 300)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab.vectors.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(u'catr')\n",
    "token = doc[0]\n",
    "print(token.has_vector)\n",
    "print(token.is_oov)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from scipy import spatial\n",
    "\n",
    "cosine_similarity = lambda v1, v2: 1- spatial.distance.cosine(v1, v2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "king = nlp.vocab['king'].vector\n",
    "man = nlp.vocab['man'].vector\n",
    "woman = nlp.vocab['woman'].vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# king - man + woman\n",
    "new_vector = king-man+ woman"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "computed_similarity = []\n",
    "for id in nlp.vocab.vectors:\n",
    "    word = nlp.vocab[id]\n",
    "    if word.has_vector:\n",
    "        if word.is_lower:\n",
    "            if word.is_alpha: \n",
    "                similarity = cosine_similarity(new_vector, word.vector)\n",
    "                computed_similarity.append((word, similarity))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['king', 'kings', 'princes', 'consort', 'princeling', 'monarch', 'princelings', 'princesses', 'prince', 'kingship', 'princess', 'ruler', 'consorts', 'kingi', 'princedom', 'rulers', 'kingii', 'enthronement', 'monarchical', 'queen', 'monarchs', 'enthroning', 'queening', 'regents', 'principality', 'kingsize', 'throne', 'princesa', 'dynastic', 'princedoms', 'nobility', 'monarchic', 'imperial', 'princesse', 'rulership', 'courtiers', 'dynasties', 'monarchial', 'kingdom', 'predynastic', 'enthrone', 'succession', 'princely', 'royal', 'kingly', 'mcqueen', 'dethronement', 'royally', 'emperor', 'princeps']\n"
     ]
    }
   ],
   "source": [
    "computed_similariy = sorted(computed_similarity,key=lambda item:-item[1])\n",
    "print([t[0].text for t in computed_similariy[:50]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## References\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "* [Spacy](https://spacy.io/usage/spacy-101/#annotations) \n",
    "* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
    "* Natural Language Processing with Python, José Portilla, 2019."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}