sitc/nlp/0_1_NLP_Slides.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Lexical Processing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Table of Contents\n",
    "* [Objectives](#Objectives)\n",
    "* [NLP Basics](#NLP-Basics)\n",
    "    * [Spacy installation](#Spacy-installation)\n",
    "    * [Spacy pipeline](#Spacy-pipeline)\n",
    "    * [Tokenization](#Tokenization)\n",
    "    * [Noun chunks](#Noun-chunks)\n",
    "    * [Stemming](#Stemming)\n",
    "    * [Sentence segmentation](#Sentence-segmentation)\n",
    "    * [Lemmatization](#Lemmatization)\n",
    "    * [Stop words](#Stop-words)\n",
    "    * [POS](#POS)\n",
    "    * [NER](#NER)\n",
    "* [Text Feature extraction](#Text-Feature-extraction)\n",
    "* [Classifying spam](#Classifying-spam)\n",
    "* [Vectors and similarity](#Vectors-and-similarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Objectives"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "In this session we are going to learn to process text so that can apply machine learning techniques."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# NLP Basics\n",
    "In this notebook we are going to use two popular NLP libraries:\n",
    "* NLTK (Natural Language Toolkit, https://www.nltk.org/) \n",
    "* Spacy (https://spacy.io/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Main characteristics:\n",
    "* both are open source and very popular\n",
    "* NLTK was released in 2001 while Spacy was in 2015\n",
    "* Spacy provides very efficient implementations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Spacy installation\n",
    "\n",
    "You need to install previously spacy if not installed:\n",
    "* `pip install spacy`\n",
    "* or `conda install -c conda-forge spacy`\n",
    "\n",
    "and install the small English model \n",
    "* `python -m spacy download en_core_web_sm`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Spacy pipelines\n",
    "\n",
    "The function **nlp** takes a raw text and perform several operations (tokenization, tagger, NER, ...)\n",
    "![](spacy/spacy-pipeline.svg \"Spacy pipelines\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "From text to doc trough the pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "\n",
    "nlp = spacy.load('en_core_web_sm')\n",
    "doc = nlp(u'Albert Einstein won the Nobel Prize for Physics in 1921')\n",
    "doc2 = nlp(u'\"Let\\'s go to N.Y.!\"')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Tokenization\n",
    "From text to tokens\n",
    "![](spacy/tokenization.svg \"Tokenization\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The tokenizer checks:\n",
    "\n",
    "* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.\n",
    "* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.\n",
    "* **Suffix:** Character(s) at the end, e.g. km, ), ”, !.\n",
    "* **Infix:** Character(s) in between, e.g. -, --, /, …."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Let's do it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Albert\n",
      "Einstein\n",
      "won\n",
      "the\n",
      "Nobel\n",
      "Prize\n",
      "for\n",
      "Physics\n",
      "in\n",
      "1921\n"
     ]
    }
   ],
   "source": [
    "# print tokens\n",
    "for token in doc:\n",
    "    print(token.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\"\n",
      "Let\n",
      "'s\n",
      "go\n",
      "to\n",
      "N.Y.\n",
      "!\n",
      "\"\n"
     ]
    }
   ],
   "source": [
    "for token in doc2:\n",
    "    print(token.text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Noun chunks\n",
    "Noun phrases"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Autonomous cars cars nsubj shift\n",
      "insurance liability liability dobj shift\n",
      "manufacturers manufacturers pobj toward\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(\"Autonomous cars shift insurance liability toward manufacturers\")\n",
    "for chunk in doc.noun_chunks:\n",
    "    print(chunk.text, chunk.root.text, chunk.root.dep_,\n",
    "            chunk.root.head.text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"4ea90bf45ed945c3978bf5e93dc2e120-0\" class=\"displacy\" width=\"960\" height=\"332.0\" direction=\"ltr\" style=\"max-width: none; height: 332.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Autonomous</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADJ</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"180\">cars</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"180\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"310\">shift</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"310\">VERB</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"440\">insurance</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"440\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"570\">liability</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"570\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"700\">toward</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"700\">ADP</tspan>\n",
       "</text>\n",
       "\n",
       "<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
       "    <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"830\">manufacturers</tspan>\n",
       "    <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"830\">NOUN</tspan>\n",
       "</text>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" stroke-width=\"2px\" d=\"M70,197.0 C70,132.0 170.0,132.0 170.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M70,199.0 L62,187.0 78,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" stroke-width=\"2px\" d=\"M200,197.0 C200,132.0 300.0,132.0 300.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M200,199.0 L192,187.0 208,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" stroke-width=\"2px\" d=\"M460,197.0 C460,132.0 560.0,132.0 560.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M460,199.0 L452,187.0 468,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" stroke-width=\"2px\" d=\"M330,197.0 C330,67.0 565.0,67.0 565.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M565.0,199.0 L573.0,187.0 557.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" stroke-width=\"2px\" d=\"M330,197.0 C330,2.0 700.0,2.0 700.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M700.0,199.0 L708.0,187.0 692.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "\n",
       "<g class=\"displacy-arrow\">\n",
       "    <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" stroke-width=\"2px\" d=\"M720,197.0 C720,132.0 820.0,132.0 820.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
       "    <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
       "        <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
       "    </text>\n",
       "    <path class=\"displacy-arrowhead\" d=\"M820.0,199.0 L828.0,187.0 812.0,187.0\" fill=\"currentColor\"/>\n",
       "</g>\n",
       "</svg></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from spacy import displacy\n",
    "displacy.render(doc, style='dep', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Sentence segmentation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "This is a sentence.\n",
      "This is another sentence.\n"
     ]
    }
   ],
   "source": [
    "doc3 =nlp(u'This is a sentence. This is another sentence.')\n",
    "for sent in doc3.sents:\n",
    "    print(sent)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stemming\n",
    "Spacy does not include a stemmer. \n",
    "We will use nltk.\n",
    "The purpose is removing the ending of a word based on rules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "caress\n",
      "fli\n",
      "is\n",
      "been\n",
      "gener\n"
     ]
    }
   ],
   "source": [
    "import nltk\n",
    "from nltk.stem.porter import PorterStemmer\n",
    "\n",
    "stemmer = PorterStemmer()\n",
    "words = ['caresses', 'flies','is', 'been', 'generously']\n",
    "stems = [stemmer.stem(word) for word in words]\n",
    "for stem in stems:\n",
    "    print(stem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Lemmatization\n",
    "Lemmatization includes a morphological analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['I', 'be', 'read', 'the', 'paper', '.']\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(\"I was reading the paper.\")\n",
    "print([token.lemma_ for token in doc])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t I\n",
      "was \t AUX \t be\n",
      "reading \t VERB \t read\n",
      "the \t DET \t the\n",
      "paper \t NOUN \t paper\n",
      ". \t PUNCT \t .\n"
     ]
    }
   ],
   "source": [
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_,  '\\t', token.lemma_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Stopwords"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'other', 'toward', 'everywhere', 'whether', 'i', 'his', 'afterwards', 'whenever', 'except', 'but', 'for', 'noone', 'whereas', 'she', 'name', 'per', 'among', 'where', 'am', '’re', 'on', 'thereby', 'fifty', \"'ll\", '‘m', 'would', 'we', 'once', 'can', 'meanwhile', 'anything', 'still', 'these', 'without', 'below', 'rather', 'were', 'six', 'them', 'latter', 'sometime', 'seemed', 'next', 'move', 'there', 'various', 'fifteen', '’m', 'onto', 'n’t', 'much', 'one', 'due', 'our', 'although', 'whom', 'done', 'made', 'at', 'empty', 'about', 'herself', 'down', 'never', 'thereafter', \"'re\", 'off', 'he', 'hereafter', 'whereafter', 'what', 'above', 'along', 'a', '’ve', '’s', 'else', 'or', 'sometimes', 'twenty', \"n't\", 'over', 'both', 'someone', 'beside', 'therein', 'whole', 'make', 'any', 'becomes', 'anyhow', 'quite', 'hence', 'here', 'same', 'which', 'whereby', 'whereupon', 'must', 'me', 'part', 'serious', 'into', 'namely', 'hers', 'enough', 'with', 'because', 'own', 'give', 'see', 'somehow', 'since', 'just', 'seems', 'top', 'across', 'ourselves', 'in', 'anywhere', 'few', 'myself', 'say', 'all', 'together', 'had', 'back', 'besides', 'please', 'n‘t', 'many', 'whoever', \"'d\", 'nothing', 'be', 'did', 'yours', 'how', 'also', 'those', 'that', 'throughout', '‘ve', 'amount', 'others', 'keep', 'hundred', 'up', 'already', 'amongst', 'front', 'thru', 'if', '‘s', 'least', 'us', 'anyone', 'might', 'thereupon', 'third', 'nor', 'sixty', 'nobody', 'more', 'her', 'by', 'himself', 'each', 'than', 're', 'behind', 'almost', 'seeming', 'when', 'is', 'mostly', 'so', 'ours', 'becoming', 'towards', 'some', 'two', 'seem', 'put', 'four', 'you', 'call', 'thence', 'moreover', 'nowhere', 'do', 'former', 'formerly', 'elsewhere', 'full', 'after', 'thus', 'less', 'go', 'ever', 'nine', 'anyway', 'somewhere', 'will', 'three', 'within', 'been', 'before', 'of', 'has', 'beyond', 'such', 'why', 'none', 'whose', 'eight', 'either', 'no', 'itself', 'doing', \"'ve\", '‘d', 'though', 'neither', 'while', 'get', 'around', 'your', 'twelve', 'even', 'and', 'something', 'always', \"'m\", 'until', 'an', 'during', 'it', 'most', 'the', 'yourselves', '’d', '‘ll', 'very', 'really', 'show', 'too', 'everyone', 'wherein', 'therefore', 'whither', 'who', 'may', 'latterly', 'between', 'its', 'however', '‘re', 'through', 'everything', 'another', 'does', 'whatever', 'their', 'then', 'beforehand', 'my', 'eleven', 'now', 'should', 'cannot', 'have', 'take', \"'s\", 'perhaps', 'regarding', 'against', 'used', 'indeed', 'upon', 'him', 'using', 'under', 'became', 'to', 'hereupon', 'bottom', 'not', 'from', 'hereby', 'side', 'yet', 'this', 'could', 'themselves', '’ll', 'ca', 'nevertheless', 'via', 'five', 'become', 'as', 'only', 'otherwise', 'are', 'was', 'first', 'further', 'ten', 'herein', 'yourself', 'forty', 'alone', 'every', 'again', 'well', 'last', 'several', 'wherever', 'mine', 'often', 'whence', 'out', 'they', 'being', 'unless'}\n"
     ]
    }
   ],
   "source": [
    "nlp = spacy.load('en_core_web_sm')\n",
    "print(nlp.Defaults.stop_words)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['for'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['day'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "False"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab['btw'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#add stop words\n",
    "nlp.Defaults.stop_words.add('btw')\n",
    "nlp.vocab['btw'].is_stop = True\n",
    "nlp.vocab['btw'].is_stop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Original Text\n",
      "Nick likes to play football, however he is not too fond of tennis. \n",
      "\n",
      "\n",
      "Text after removing stop words\n",
      "Nick likes play football, fond tennis.\n"
     ]
    }
   ],
   "source": [
    "en_stopwords = nlp.Defaults.stop_words\n",
    "text = \"Nick likes to play football, however he is not too fond of tennis.\"\n",
    "\n",
    "lst=[]\n",
    "for token in text.split():\n",
    "    if token.lower() not in en_stopwords:\n",
    "        lst.append(token)\n",
    "\n",
    "print('Original Text')        \n",
    "print(text,'\\n\\n')\n",
    "\n",
    "print('Text after removing stop words')\n",
    "print(' '.join(lst))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We can also use methods from the token class (https://spacy.io/api/token), such as:\n",
    "\n",
    "* **is_stop:** is the token a stop word?\n",
    "* **is_punct:** is the token punctuation?\n",
    "* **like_email:** does the token resemble an email address?\n",
    "* **is_digit:** Does the token consist of digits? "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## POS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t pronoun\n",
      "was \t AUX \t auxiliary\n",
      "reading \t VERB \t verb\n",
      "the \t DET \t determiner\n",
      "paper \t NOUN \t noun\n",
      ". \t PUNCT \t punctuation\n"
     ]
    }
   ],
   "source": [
    "# POS\n",
    "# information available at https://spacy.io/usage/linguistic-features/#pos-tagging\n",
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I \t PRON \t pronoun \t nsubj\n",
      "was \t AUX \t auxiliary \t aux\n",
      "reading \t VERB \t verb \t ROOT\n",
      "the \t DET \t determiner \t det\n",
      "paper \t NOUN \t noun \t dobj\n",
      ". \t PUNCT \t punctuation \t punct\n"
     ]
    }
   ],
   "source": [
    "for token in doc:\n",
    "    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_), '\\t', token.dep_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f74618d3d00>),\n",
       " ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f74618d3880>),\n",
       " ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f74619006d0>),\n",
       " ('attribute_ruler',\n",
       "  <spacy.pipeline.attributeruler.AttributeRuler at 0x7f74618ffd00>),\n",
       " ('lemmatizer',\n",
       "  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f746171d900>),\n",
       " ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f74619007b0>)]"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List the pipeline\n",
    "nlp.pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.pipe_names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "We can also get some statistics, regarding the frequency of tags in POS, DEP or TAG."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "ADJ 1\n",
      "ADP 1\n",
      "ADV 2\n",
      "AUX 1\n",
      "NOUN 2\n",
      "PART 2\n",
      "PRON 1\n",
      "PROPN 1\n",
      "PUNCT 2\n",
      "VERB 2\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(u'Nick likes to play football, however he is not too fond of tennis.')\n",
    "POS_counts = doc.count_by(spacy.attrs.POS)\n",
    "\n",
    "for code,freq in sorted(POS_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RB 3\n",
      "IN 1\n",
      ", 1\n",
      "TO 1\n",
      "JJ 1\n",
      ". 1\n",
      "PRP 1\n",
      "VBZ 2\n",
      "VB 1\n",
      "NN 2\n",
      "NNP 1\n"
     ]
    }
   ],
   "source": [
    "TAG_counts = doc.count_by(spacy.attrs.TAG)\n",
    "\n",
    "for code,freq in sorted(TAG_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "acomp 1\n",
      "advmod 2\n",
      "aux 1\n",
      "ccomp 1\n",
      "dobj 1\n",
      "neg 1\n",
      "nsubj 2\n",
      "pobj 1\n",
      "prep 1\n",
      "punct 2\n",
      "xcomp 1\n",
      "ROOT 1\n"
     ]
    }
   ],
   "source": [
    "DEP_counts = doc.count_by(spacy.attrs.DEP)\n",
    "\n",
    "for code,freq in sorted(DEP_counts.items()):\n",
    "    print(doc.vocab[code].text, freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## NER"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Apple 0 5 ORG\n",
      "U.K. 27 31 GPE\n",
      "$1 billion 44 54 MONEY\n"
     ]
    }
   ],
   "source": [
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')\n",
    "\n",
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Apple\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " is looking at buying \n",
       "<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    U.K.\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
       "</mark>\n",
       " startup for \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    $1 billion\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       "</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">Over \n",
       "<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    the last quarter\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    Apple\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
       "</mark>\n",
       " sold \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    nearly 20 thousand\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
       "</mark>\n",
       " \n",
       "<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    iPods\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
       "</mark>\n",
       " for a profit of \n",
       "<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
       "    $6 million\n",
       "    <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
       "</mark>\n",
       ".</div></span>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')\n",
    "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Text Feature Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## CountVectorizer\n",
    "Transforming text into a vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
       "       'summer', 'the', 'winter'], dtype=object)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import sklearn\n",
    "# Count vectorization\n",
    "texts = [\"Summer is coming but Summer is short\", \n",
    "         \"I like the Summer and I like the Winter\", \n",
    "         \"I like sandwiches and I like the Winter\"]\n",
    "\n",
    "\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "\n",
    "\n",
    "# Count occurrences of unique words\n",
    "# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html \n",
    "\n",
    "vectorizer = CountVectorizer()\n",
    "X = vectorizer.fit_transform(texts)\n",
    "vectorizer.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 2 0 0 1 2 0 0]\n",
      " [1 0 0 0 2 0 0 1 2 1]\n",
      " [1 0 0 0 2 1 0 0 1 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X.toarray())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and like', 'but summer', 'coming but', 'is coming', 'is short',\n",
       "       'like sandwiches', 'like the', 'sandwiches and', 'summer and',\n",
       "       'summer is', 'the summer', 'the winter'], dtype=object)"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2)) # only bigrams; (1,2):unigram, bigram\n",
    "X2 = vectorizer2.fit_transform(texts)\n",
    "vectorizer2.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 1 1 0 0 0 0 2 0 0]\n",
      " [1 0 0 0 0 0 2 0 1 0 1 1]\n",
      " [1 0 0 0 0 1 1 1 0 0 0 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X2.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Remove stop words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['coming', 'like', 'sandwiches', 'short', 'summer', 'winter'],\n",
       "      dtype=object)"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vectorizer3 = CountVectorizer(analyzer='word', stop_words='english') # only bigrams\n",
    "X3 = vectorizer3.fit_transform(texts)\n",
    "vectorizer3.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1 0 0 1 2 0]\n",
      " [0 2 0 0 1 1]\n",
      " [0 2 1 0 0 1]]\n"
     ]
    }
   ],
   "source": [
    "print(X3.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## TF-IDF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
       "       'summer', 'the', 'winter'], dtype=object)"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "\n",
    "vect = TfidfVectorizer()\n",
    "XTIDF = vect.fit_transform(texts)\n",
    "vect.get_feature_names_out()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0 1 1 2 0 0 1 2 0 0]\n",
      " [1 0 0 0 2 0 0 1 2 1]\n",
      " [1 0 0 0 2 1 0 0 1 1]]\n",
      "[[0.         0.32767345 0.32767345 0.65534691 0.         0.\n",
      "  0.32767345 0.49840822 0.         0.        ]\n",
      " [0.30151134 0.         0.         0.         0.60302269 0.\n",
      "  0.         0.30151134 0.60302269 0.30151134]\n",
      " [0.33846987 0.         0.         0.         0.67693975 0.44504721\n",
      "  0.         0.         0.33846987 0.33846987]]\n"
     ]
    }
   ],
   "source": [
    "# Counter\n",
    "print(X.toarray())\n",
    "# TF-IDF\n",
    "print(XTIDF.toarray())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['coming' 'like' 'sandwiches' 'short' 'summer' 'winter']\n",
      "[[1 0 0 1 2 0]\n",
      " [0 2 0 0 1 1]\n",
      " [0 2 1 0 0 1]]\n",
      "[[0.48148213 0.         0.         0.48148213 0.73235914 0.        ]\n",
      " [0.         0.81649658 0.         0.         0.40824829 0.40824829]\n",
      " [0.         0.77100584 0.50689001 0.         0.         0.38550292]]\n"
     ]
    }
   ],
   "source": [
    "vect3 = TfidfVectorizer(analyzer='word', stop_words='english')\n",
    "XTIDF3 = vect3.fit_transform(texts)\n",
    "print(vect3.get_feature_names_out())\n",
    "print(X3.toarray())\n",
    "print(XTIDF3.toarray())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Classifying spam\n",
    "We will use the sms spam collection dataset taken from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>label</th>\n",
       "      <th>message</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ham</td>\n",
       "      <td>Go until jurong point, crazy.. Available only ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ham</td>\n",
       "      <td>Ok lar... Joking wif u oni...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>spam</td>\n",
       "      <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ham</td>\n",
       "      <td>U dun say so early hor... U c already then say...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ham</td>\n",
       "      <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  label                                            message\n",
       "0   ham  Go until jurong point, crazy.. Available only ...\n",
       "1   ham                      Ok lar... Joking wif u oni...\n",
       "2  spam  Free entry in 2 a wkly comp to win FA Cup fina...\n",
       "3   ham  U dun say so early hor... U c already then say...\n",
       "4   ham  Nah I don't think he goes to usf, he lives aro..."
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "column_names = ['label', 'message']\n",
    "df = pd.read_csv('SMSSpamCollection', sep='\\t', names=column_names, header=None)\n",
    "df[0:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "label      0\n",
       "message    0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#check missing values\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# df['label'].replace('', np.nan, inplace=True)\n",
    "# df['message'].replace('', np.nan, inplace=True)\n",
    "# df.dropna(inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "ham     4825\n",
       "spam     747\n",
       "Name: label, dtype: int64"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['label'].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# spling training and testing\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "X = df['message']\n",
    "y = df['label']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "#train\n",
    "X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.33, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "count_vect = CountVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "# fit vectorizer to data: build dictionary, count words,...\n",
    "# transform: transform original text message to the vector\n",
    "X_train_counts = count_vect.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3733,)"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<3733x7082 sparse matrix of type '<class 'numpy.int64'>'\n",
       "\twith 49992 stored elements in Compressed Sparse Row format>"
      ]
     },
     "execution_count": 46,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We see the vocabulary are 7082 words, but most values are zeros"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(3733, 7082)"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tfidf_transformer = TfidfTransformer()\n",
    "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n",
    "X_train_tfidf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Instead of first count vectorization and then tf-idf transformation, better TF-IDF vectorizer, which makes these two things"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "vectorizer = TfidfVectorizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "X_train_tfidf = vectorizer.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Classifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.svm import LinearSVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "clf = LinearSVC()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\
      ],
      "text/plain": [
       "LinearSVC()"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.fit(X_train_tfidf, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Pipeline\n",
    "Simple way to define the processing steps for repeating the operation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\
      ],
      "text/plain": [
       "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "predictions = text_clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import confusion_matrix, classification_report"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1586    7]\n",
      " [  12  234]]\n"
     ]
    }
   ],
   "source": [
    "print(confusion_matrix(y_test, predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              precision    recall  f1-score   support\n",
      "\n",
      "         ham       0.99      1.00      0.99      1593\n",
      "        spam       0.97      0.95      0.96       246\n",
      "\n",
      "    accuracy                           0.99      1839\n",
      "   macro avg       0.98      0.97      0.98      1839\n",
      "weighted avg       0.99      0.99      0.99      1839\n",
      "\n"
     ]
    }
   ],
   "source": [
    "print(classification_report(y_test, predictions))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.989668297988037"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import metrics\n",
    "metrics.accuracy_score(y_test, predictions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['ham'], dtype=object)"
      ]
     },
     "execution_count": 59,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.predict([\"This is a summer school\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['spam'], dtype=object)"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text_clf.predict([\"Free tickets and CASH\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Vectors and Similarity\n",
    "You need to install previously spacy if not installed:\n",
    "* `pip install spacy`\n",
    "* or `conda install -c conda-forge spacy`\n",
    "\n",
    "and install the English models (large or medium):\n",
    "* `python -m spacy download en_core_web_md`\n",
    "* `python -m spacy download en_core_web_lg`\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load('en_core_web_lg')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 1.8023e+00,  3.9075e+00, -4.2940e+00, -7.6117e+00, -3.7172e+00,\n",
       "       -1.5229e-01, -1.1368e+00, -6.8427e-01, -9.3067e-01,  5.6531e+00,\n",
       "        4.2536e+00, -4.1175e+00, -8.3049e-01,  2.7701e+00,  6.4474e+00,\n",
       "       -6.6389e-02, -8.3026e-01, -7.4532e+00,  1.7888e-01,  2.5130e+00,\n",
       "       -4.4785e-01,  8.4806e+00, -2.7056e+00, -6.9836e+00,  9.2242e-01,\n",
       "       -3.3579e+00, -3.2071e+00,  1.2901e-01,  3.5933e+00, -4.8096e+00,\n",
       "        3.2596e-01, -3.0782e-01, -3.8023e+00, -1.2818e-01,  9.7322e-02,\n",
       "        1.0876e+00, -4.5140e+00, -8.5375e-02, -4.4139e+00, -1.4073e+00,\n",
       "       -2.4729e+00,  1.3307e-01,  3.1949e+00,  2.9971e+00,  5.3643e+00,\n",
       "       -3.2407e+00, -2.7512e+00,  3.6586e-01,  2.7333e-01,  6.6513e+00,\n",
       "        4.8740e+00,  1.3732e+00, -7.3595e-01, -2.3265e+00,  1.4045e+00,\n",
       "        1.5080e-01,  3.1985e+00, -5.7459e+00,  3.5059e+00,  8.1671e-01,\n",
       "       -1.1113e+00, -8.9306e-01, -4.2963e+00,  8.4042e-01, -8.3586e-01,\n",
       "       -2.5407e+00, -1.1414e+00, -5.5050e+00, -3.6670e+00,  1.7393e+00,\n",
       "       -1.9284e+00,  2.7994e+00,  4.4476e+00, -1.0855e+00,  2.5439e+00,\n",
       "       -1.8681e+00,  2.1162e+00,  5.5460e+00,  2.8248e+00, -1.1810e+00,\n",
       "       -9.3259e-01, -1.8681e+00, -3.0654e-02, -3.4096e+00,  2.0261e+00,\n",
       "       -5.4005e-01,  8.2070e-01,  4.3283e+00, -3.4484e+00, -2.1291e+00,\n",
       "        1.2265e+00, -4.4106e-01,  3.8392e+00, -5.8643e+00, -7.3440e-01,\n",
       "        1.9785e+00,  4.1928e+00,  1.4577e+00,  2.8668e+00, -6.3762e+00,\n",
       "        2.7575e+00,  1.7991e+00,  1.3388e-02, -5.1316e-01, -6.3303e+00,\n",
       "       -2.5989e+00, -1.0406e+00,  1.8325e+00,  9.6654e-02,  4.4002e+00,\n",
       "       -1.3231e+00,  2.7717e+00,  4.3340e+00,  2.9027e-01,  7.2542e+00,\n",
       "       -1.2149e+00, -1.7366e+00, -5.2755e+00,  7.5762e-01, -6.0150e+00,\n",
       "        2.1634e+00, -1.6577e+00, -6.4410e+00,  2.5107e+00, -7.6881e+00,\n",
       "       -6.3143e-01,  6.0914e+00,  4.7114e+00,  1.0778e+00,  1.8121e+00,\n",
       "       -3.1133e+00, -5.5923e+00,  5.0992e-01, -2.2783e+00,  1.3641e+00,\n",
       "        3.4367e+00, -1.0224e+00, -3.1824e+00,  2.0683e+00,  2.0398e+00,\n",
       "       -8.2011e+00,  4.5388e-01,  2.7002e+00,  3.9199e+00, -5.5184e-01,\n",
       "       -3.3309e+00, -3.8620e+00,  1.7020e-01,  4.9659e+00,  6.9592e-01,\n",
       "        3.4792e+00, -2.7438e+00, -6.0489e-01,  1.9883e-02,  2.3192e-01,\n",
       "       -4.0591e-01,  3.9470e+00,  1.4145e+00, -8.4031e-01, -1.9433e+00,\n",
       "       -2.5783e+00, -6.8732e+00, -3.7792e+00,  6.4090e+00, -2.3963e+00,\n",
       "       -3.1485e+00,  2.2938e+00, -1.3649e+00, -1.3070e+00, -7.4143e-01,\n",
       "        3.5752e+00,  3.1999e+00, -2.7599e+00,  3.9996e+00, -2.6275e+00,\n",
       "       -3.2632e+00, -2.7695e+00, -2.0046e+00,  3.4848e-01, -3.7322e+00,\n",
       "        3.9018e+00,  1.1883e-02,  6.7589e+00, -4.2182e+00, -1.7291e+00,\n",
       "        1.3949e+00,  5.9161e-01, -4.0226e+00,  1.7388e+00, -1.9609e+00,\n",
       "       -5.4280e-02,  1.4707e+00, -4.2497e+00, -7.4698e-01,  5.7317e+00,\n",
       "       -5.9729e+00,  4.3627e-01,  6.9487e+00, -2.9021e+00,  2.8235e+00,\n",
       "        4.4695e+00, -2.7154e+00, -1.7771e+00, -1.6288e+00, -4.9338e+00,\n",
       "       -2.1144e+00,  1.4976e+00, -4.4156e+00, -3.3974e+00, -9.0295e+00,\n",
       "        2.1685e+00, -1.7372e+00, -8.9336e-03,  2.2437e+00, -1.3924e+00,\n",
       "       -2.5530e+00, -2.0714e+00,  2.0850e+00, -5.2257e+00,  8.4517e-01,\n",
       "        1.6804e+00, -7.9530e+00,  3.8700e+00,  9.2134e+00, -4.5150e+00,\n",
       "        2.8401e+00,  5.1596e-01, -3.7684e+00,  2.3126e+00,  2.2748e+00,\n",
       "       -4.7895e+00, -2.3299e+00, -2.3546e+00, -2.0999e+00, -3.7111e+00,\n",
       "        1.4847e+00, -1.6953e+00,  4.9883e+00,  2.5845e-01,  4.1598e+00,\n",
       "       -8.4808e-01,  3.1341e+00,  4.1797e+00, -9.9561e-01,  1.1814e+00,\n",
       "       -3.0735e+00, -2.7010e+00, -9.5470e-01,  1.4944e+00,  2.4461e+00,\n",
       "       -1.2699e+00,  2.3195e+00, -7.0078e-01,  2.6868e+00, -2.9822e+00,\n",
       "        3.8670e+00,  3.1915e+00,  3.2350e+00, -3.2919e+00,  4.2211e-01,\n",
       "        3.9947e+00, -1.4124e+00, -2.1844e+00, -3.0904e+00,  2.3693e+00,\n",
       "       -2.8532e+00, -7.5463e-01, -3.6133e+00, -7.8667e+00,  4.7647e+00,\n",
       "        2.6976e+00, -2.6137e-01, -5.2056e+00, -2.2392e+00,  2.7426e+00,\n",
       "       -1.2172e+00, -4.4441e-02, -3.1014e+00, -4.7598e+00,  5.2652e+00,\n",
       "       -4.0911e+00, -4.9625e+00,  2.8234e-01,  1.5329e+00,  5.3542e+00,\n",
       "       -1.5295e+00, -3.5151e+00, -1.5575e+00, -3.6066e+00, -3.2199e+00,\n",
       "        4.5560e+00, -3.6332e-01,  1.6928e+00, -2.5321e+00, -4.1381e+00,\n",
       "       -3.4422e+00,  2.4066e+00,  6.1191e+00, -1.1493e+00,  3.0401e+00],\n",
       "      dtype=float32)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp(u'girl').vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300,)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp(u'girl').vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300,)"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Document vector: vector with the average of single words\n",
    "nlp(u'the girl is blond').vector.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "doc = nlp(u'cat lion dog pet')\n",
    "#doc = nlp(u'buy sell rent')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "cat cat 1.0\n",
      "cat lion 0.3854507803916931\n",
      "cat dog 0.8220816850662231\n",
      "cat pet 0.732966423034668\n",
      "lion cat 0.3854507803916931\n",
      "lion lion 1.0\n",
      "lion dog 0.2949307858943939\n",
      "lion pet 0.20031584799289703\n",
      "dog cat 0.8220816850662231\n",
      "dog lion 0.2949307858943939\n",
      "dog dog 1.0\n",
      "dog pet 0.7856059074401855\n",
      "pet cat 0.732966423034668\n",
      "pet lion 0.20031584799289703\n",
      "pet dog 0.7856059074401855\n",
      "pet pet 1.0\n"
     ]
    }
   ],
   "source": [
    "for word1 in doc:\n",
    "    for word2 in doc:\n",
    "        print(word1.text, word2.text, word1.similarity(word2))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(514157, 300)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nlp.vocab.vectors.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "False\n",
      "True\n"
     ]
    }
   ],
   "source": [
    "doc = nlp(u'catr')\n",
    "token = doc[0]\n",
    "print(token.has_vector)\n",
    "print(token.is_oov)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "from scipy import spatial\n",
    "\n",
    "cosine_similarity = lambda v1, v2: 1- spatial.distance.cosine(v1, v2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "king = nlp.vocab['king'].vector\n",
    "man = nlp.vocab['man'].vector\n",
    "woman = nlp.vocab['woman'].vector"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "# king - man + woman\n",
    "new_vector = king-man+ woman"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "computed_similarity = []\n",
    "for id in nlp.vocab.vectors:\n",
    "    word = nlp.vocab[id]\n",
    "    if word.has_vector:\n",
    "        if word.is_lower:\n",
    "            if word.is_alpha: \n",
    "                similarity = cosine_similarity(new_vector, word.vector)\n",
    "                computed_similarity.append((word, similarity))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['king', 'kings', 'princes', 'consort', 'princeling', 'monarch', 'princelings', 'princesses', 'prince', 'kingship', 'princess', 'ruler', 'consorts', 'kingi', 'princedom', 'rulers', 'kingii', 'enthronement', 'monarchical', 'queen', 'monarchs', 'enthroning', 'queening', 'regents', 'principality', 'kingsize', 'throne', 'princesa', 'dynastic', 'princedoms', 'nobility', 'monarchic', 'imperial', 'princesse', 'rulership', 'courtiers', 'dynasties', 'monarchial', 'kingdom', 'predynastic', 'enthrone', 'succession', 'princely', 'royal', 'kingly', 'mcqueen', 'dethronement', 'royally', 'emperor', 'princeps']\n"
     ]
    }
   ],
   "source": [
    "computed_similariy = sorted(computed_similarity,key=lambda item:-item[1])\n",
    "print([t[0].text for t in computed_similariy[:50]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## References\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "* [Spacy](https://spacy.io/usage/spacy-101/#annotations) \n",
    "* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
    "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
    "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
    "* Natural Language Processing with Python, José Portilla, 2019."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}