mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-04 23:21:42 +00:00
{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Course Notes for Learning Intelligent Systems"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Lexical Processing"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
"# Table of Contents\n",
"* [Objectives](#Objectives)\n",
"* [NLP Basics](#NLP-Basics)\n",
"    * [Spacy installation](#Spacy-installation)\n",
"    * [Spacy pipeline](#Spacy-pipeline)\n",
"    * [Tokenization](#Tokenization)\n",
"    * [Noun chunks](#Noun-chunks)\n",
"    * [Sentence segmentation](#Sentence-segmentation)\n",
"    * [Stemming](#Stemming)\n",
"    * [Lemmatization](#Lemmatization)\n",
"    * [Stop words](#Stop-words)\n",
"    * [POS](#POS)\n",
"    * [NER](#NER)\n",
"* [Text Feature extraction](#Text-Feature-extraction)\n",
"* [Classifying spam](#Classifying-spam)\n",
"* [Vectors and similarity](#Vectors-and-similarity)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Objectives"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
"In this session we are going to learn to process text so that we can apply machine learning techniques."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# NLP Basics\n",
|
||
"In this notebook we are going to use two popular NLP libraries:\n",
|
||
"* NLTK (Natural Language Toolkit, https://www.nltk.org/) \n",
|
||
"* Spacy (https://spacy.io/)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Main characteristics:\n",
|
||
"* both are open source and very popular\n",
"* NLTK was first released in 2001, while Spacy was released in 2015\n",
|
||
"* Spacy provides very efficient implementations"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Spacy installation\n",
|
||
"\n",
"You need to install Spacy first if it is not already installed:\n",
"* `pip install spacy`\n",
"* or `conda install -c conda-forge spacy`\n",
"\n",
"Then install the small English model:\n",
|
||
"* `python -m spacy download en_core_web_sm`"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
"# Spacy pipeline\n",
"\n",
"The function **nlp** takes raw text and performs several operations (tokenization, tagging, NER, ...).\n",
|
||
"![](spacy/spacy-pipeline.svg \"Spacy pipelines\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
"From text to doc through the pipeline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import spacy\n",
|
||
"\n",
|
||
"nlp = spacy.load('en_core_web_sm')\n",
|
||
"doc = nlp(u'Albert Einstein won the Nobel Prize for Physics in 1921')\n",
|
||
"doc2 = nlp(u'\"Let\\'s go to N.Y.!\"')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Tokenization\n",
|
||
"From text to tokens\n",
|
||
"![](spacy/tokenization.svg \"Tokenization\")"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"The tokenizer checks:\n",
|
||
"\n",
|
||
"* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.\n",
|
||
"* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.\n",
|
||
"* **Suffix:** Character(s) at the end, e.g. km, ), ”, !.\n",
|
||
"* **Infix:** Character(s) in between, e.g. -, --, /, …."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Let's do it!"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Albert\n",
|
||
"Einstein\n",
|
||
"won\n",
|
||
"the\n",
|
||
"Nobel\n",
|
||
"Prize\n",
|
||
"for\n",
|
||
"Physics\n",
|
||
"in\n",
|
||
"1921\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"# print tokens\n",
"for token in doc:\n",
"    print(token.text)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"\"\n",
|
||
"Let\n",
|
||
"'s\n",
|
||
"go\n",
|
||
"to\n",
|
||
"N.Y.\n",
|
||
"!\n",
|
||
"\"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"for token in doc2:\n",
"    print(token.text)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Noun chunks\n",
|
||
"Noun phrases"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Autonomous cars cars nsubj shift\n",
|
||
"insurance liability liability dobj shift\n",
|
||
"manufacturers manufacturers pobj toward\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"doc = nlp(\"Autonomous cars shift insurance liability toward manufacturers\")\n",
"for chunk in doc.noun_chunks:\n",
"    print(chunk.text, chunk.root.text, chunk.root.dep_,\n",
"          chunk.root.head.text)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<span class=\"tex2jax_ignore\"><svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" xml:lang=\"en\" id=\"4ea90bf45ed945c3978bf5e93dc2e120-0\" class=\"displacy\" width=\"960\" height=\"332.0\" direction=\"ltr\" style=\"max-width: none; height: 332.0px; color: #000000; background: #ffffff; font-family: Arial; direction: ltr\">\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"50\">Autonomous</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"50\">ADJ</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"180\">cars</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"180\">NOUN</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"310\">shift</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"310\">VERB</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"440\">insurance</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"440\">NOUN</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"570\">liability</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"570\">NOUN</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"700\">toward</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"700\">ADP</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<text class=\"displacy-token\" fill=\"currentColor\" text-anchor=\"middle\" y=\"242.0\">\n",
|
||
" <tspan class=\"displacy-word\" fill=\"currentColor\" x=\"830\">manufacturers</tspan>\n",
|
||
" <tspan class=\"displacy-tag\" dy=\"2em\" fill=\"currentColor\" x=\"830\">NOUN</tspan>\n",
|
||
"</text>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" stroke-width=\"2px\" d=\"M70,197.0 C70,132.0 170.0,132.0 170.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-0\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">amod</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M70,199.0 L62,187.0 78,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" stroke-width=\"2px\" d=\"M200,197.0 C200,132.0 300.0,132.0 300.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-1\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">nsubj</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M200,199.0 L192,187.0 208,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" stroke-width=\"2px\" d=\"M460,197.0 C460,132.0 560.0,132.0 560.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-2\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">compound</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M460,199.0 L452,187.0 468,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" stroke-width=\"2px\" d=\"M330,197.0 C330,67.0 565.0,67.0 565.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-3\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">dobj</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M565.0,199.0 L573.0,187.0 557.0,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" stroke-width=\"2px\" d=\"M330,197.0 C330,2.0 700.0,2.0 700.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-4\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">prep</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M700.0,199.0 L708.0,187.0 692.0,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"\n",
|
||
"<g class=\"displacy-arrow\">\n",
|
||
" <path class=\"displacy-arc\" id=\"arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" stroke-width=\"2px\" d=\"M720,197.0 C720,132.0 820.0,132.0 820.0,197.0\" fill=\"none\" stroke=\"currentColor\"/>\n",
|
||
" <text dy=\"1.25em\" style=\"font-size: 0.8em; letter-spacing: 1px\">\n",
|
||
" <textPath xlink:href=\"#arrow-4ea90bf45ed945c3978bf5e93dc2e120-0-5\" class=\"displacy-label\" startOffset=\"50%\" side=\"left\" fill=\"currentColor\" text-anchor=\"middle\">pobj</textPath>\n",
|
||
" </text>\n",
|
||
" <path class=\"displacy-arrowhead\" d=\"M820.0,199.0 L828.0,187.0 812.0,187.0\" fill=\"currentColor\"/>\n",
|
||
"</g>\n",
|
||
"</svg></span>"
|
||
],
|
||
"text/plain": [
|
||
"<IPython.core.display.HTML object>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"from spacy import displacy\n",
"displacy.render(doc, style='dep', jupyter=True, options={'distance': 130})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Sentence segmentation\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"This is a sentence.\n",
|
||
"This is another sentence.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"doc3 = nlp(u'This is a sentence. This is another sentence.')\n",
"for sent in doc3.sents:\n",
"    print(sent)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
"## Stemming\n",
"Spacy does not include a stemmer, so we will use NLTK.\n",
"Stemming removes the ending of a word based on rules, so the result is not always a dictionary word (e.g. *flies* becomes *fli*)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"caress\n",
|
||
"fli\n",
|
||
"is\n",
|
||
"been\n",
|
||
"gener\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"import nltk\n",
|
||
"from nltk.stem.porter import PorterStemmer\n",
|
||
"\n",
|
||
"stemmer = PorterStemmer()\n",
"words = ['caresses', 'flies', 'is', 'been', 'generously']\n",
"stems = [stemmer.stem(word) for word in words]\n",
"for stem in stems:\n",
"    print(stem)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Lemmatization\n",
"Unlike stemming, lemmatization relies on morphological analysis to reduce each word to its dictionary form (lemma)."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"['I', 'be', 'read', 'the', 'paper', '.']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"doc = nlp(\"I was reading the paper.\")\n",
|
||
"print([token.lemma_ for token in doc])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"I \t PRON \t I\n",
|
||
"was \t AUX \t be\n",
|
||
"reading \t VERB \t read\n",
|
||
"the \t DET \t the\n",
|
||
"paper \t NOUN \t paper\n",
|
||
". \t PUNCT \t .\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"for token in doc:\n",
"    print(token.text, '\\t', token.pos_, '\\t', token.lemma_)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
"## Stop words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
"{'other', 'toward', 'everywhere', 'whether', 'i', 'his', 'afterwards', 'whenever', 'except', 'but', 'for', 'noone', 'whereas', 'she', 'name', 'per', 'among', 'where', 'am', '’re', 'on', 'thereby', 'fifty', \"'ll\", '‘m', 'would', 'we', 'once', 'can', 'meanwhile', 'anything', 'still', 'these', 'without', 'below', 'rather', 'were', 'six', 'them', 'latter', 'sometime', 'seemed', 'next', 'move', 'there', 'various', 'fifteen', '’m', 'onto', 'n’t', 'much', 'one', 'due', 'our', 'although', 'whom', 'done', 'made', 'at', 'empty', 'about', 'herself', 'down', 'never', 'thereafter', \"'re\", 'off', 'he', 'hereafter', 'whereafter', 'what', 'above', 'along', 'a', '’ve', '’s', 'else', 'or', 'sometimes', 'twenty', \"n't\", 'over', 'both', 'someone', 'beside', 'therein', 'whole', 'make', 'any', 'becomes', 'anyhow', 'quite', 'hence', 'here', 'same', 'which', 'whereby', 'whereupon', 'must', 'me', 'part', 'serious', 'into', 'namely', 'hers', 'enough', 'with', 'because', 'own', 'give', 'see', 'somehow', 'since', 'just', 'seems', 'top', 'across', 'ourselves', 'in', 'anywhere', 'few', 'myself', 'say', 'all', 'together', 'had', 'back', 'besides', 'please', 'n‘t', 'many', 'whoever', \"'d\", 'nothing', 'be', 'did', 'yours', 'how', 'also', 'those', 'that', 'throughout', '‘ve', 'amount', 'others', 'keep', 'hundred', 'up', 'already', 'amongst', 'front', 'thru', 'if', '‘s', 'least', 'us', 'anyone', 'might', 'thereupon', 'third', 'nor', 'sixty', 'nobody', 'more', 'her', 'by', 'himself', 'each', 'than', 're', 'behind', 'almost', 'seeming', 'when', 'is', 'mostly', 'so', 'ours', 'becoming', 'towards', 'some', 'two', 'seem', 'put', 'four', 'you', 'call', 'thence', 'moreover', 'nowhere', 'do', 'former', 'formerly', 'elsewhere', 'full', 'after', 'thus', 'less', 'go', 'ever', 'nine', 'anyway', 'somewhere', 'will', 'three', 'within', 'been', 'before', 'of', 'has', 'beyond', 'such', 'why', 'none', 'whose', 'eight', 'either', 'no', 'itself', 'doing', \"'ve\", '‘d', 'though', 'neither', 'while', 'get', 'around', 'your', 'twelve', 'even', 'and', 'something', 'always', \"'m\", 'until', 'an', 'during', 'it', 'most', 'the', 'yourselves', '’d', '‘ll', 'very', 'really', 'show', 'too', 'everyone', 'wherein', 'therefore', 'whither', 'who', 'may', 'latterly', 'between', 'its', 'however', '‘re', 'through', 'everything', 'another', 'does', 'whatever', 'their', 'then', 'beforehand', 'my', 'eleven', 'now', 'should', 'cannot', 'have', 'take', \"'s\", 'perhaps', 'regarding', 'against', 'used', 'indeed', 'upon', 'him', 'using', 'under', 'became', 'to', 'hereupon', 'bottom', 'not', 'from', 'hereby', 'side', 'yet', 'this', 'could', 'themselves', '’ll', 'ca', 'nevertheless', 'via', 'five', 'become', 'as', 'only', 'otherwise', 'are', 'was', 'first', 'further', 'ten', 'herein', 'yourself', 'forty', 'alone', 'every', 'again', 'well', 'last', 'several', 'wherever', 'mine', 'often', 'whence', 'out', 'they', 'being', 'unless'}\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp = spacy.load('en_core_web_sm')\n",
|
||
"print(nlp.Defaults.stop_words)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"True"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp.vocab['for'].is_stop"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"False"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp.vocab['day'].is_stop"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"False"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp.vocab['btw'].is_stop"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"True"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"# add stop words\n",
|
||
"nlp.Defaults.stop_words.add('btw')\n",
|
||
"nlp.vocab['btw'].is_stop = True\n",
|
||
"nlp.vocab['btw'].is_stop"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Original Text\n",
|
||
"Nick likes to play football, however he is not too fond of tennis. \n",
|
||
"\n",
|
||
"\n",
|
||
"Text after removing stop words\n",
|
||
"Nick likes play football, fond tennis.\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"en_stopwords = nlp.Defaults.stop_words\n",
"text = \"Nick likes to play football, however he is not too fond of tennis.\"\n",
"\n",
"# keep only the whitespace-separated tokens that are not stop words\n",
"lst = []\n",
"for token in text.split():\n",
"    if token.lower() not in en_stopwords:\n",
"        lst.append(token)\n",
"\n",
"print('Original Text')\n",
"print(text, '\\n\\n')\n",
"\n",
"print('Text after removing stop words')\n",
"print(' '.join(lst))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
"We can also use attributes of the Token class (https://spacy.io/api/token), such as:\n",
"\n",
"* **is_stop:** is the token a stop word?\n",
"* **is_punct:** is the token punctuation?\n",
"* **like_email:** does the token resemble an email address?\n",
"* **is_digit:** does the token consist of digits?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## POS"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"I \t PRON \t pronoun\n",
|
||
"was \t AUX \t auxiliary\n",
|
||
"reading \t VERB \t verb\n",
|
||
"the \t DET \t determiner\n",
|
||
"paper \t NOUN \t noun\n",
|
||
". \t PUNCT \t punctuation\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# POS\n",
|
||
"# information available at https://spacy.io/usage/linguistic-features/#pos-tagging\n",
"for token in doc:\n",
"    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"I \t PRON \t pronoun \t nsubj\n",
|
||
"was \t AUX \t auxiliary \t aux\n",
|
||
"reading \t VERB \t verb \t ROOT\n",
|
||
"the \t DET \t determiner \t det\n",
|
||
"paper \t NOUN \t noun \t dobj\n",
|
||
". \t PUNCT \t punctuation \t punct\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"for token in doc:\n",
"    print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_), '\\t', token.dep_)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7f74618d3d00>),\n",
|
||
" ('tagger', <spacy.pipeline.tagger.Tagger at 0x7f74618d3880>),\n",
|
||
" ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7f74619006d0>),\n",
|
||
" ('attribute_ruler',\n",
|
||
" <spacy.pipeline.attributeruler.AttributeRuler at 0x7f74618ffd00>),\n",
|
||
" ('lemmatizer',\n",
|
||
" <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7f746171d900>),\n",
|
||
" ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7f74619007b0>)]"
|
||
]
|
||
},
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# List the pipeline\n",
|
||
"nlp.pipeline"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp.pipe_names"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
"We can also compute statistics on how often each POS, TAG, or DEP label occurs."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"ADJ 1\n",
|
||
"ADP 1\n",
|
||
"ADV 2\n",
|
||
"AUX 1\n",
|
||
"NOUN 2\n",
|
||
"PART 2\n",
|
||
"PRON 1\n",
|
||
"PROPN 1\n",
|
||
"PUNCT 2\n",
|
||
"VERB 2\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"doc = nlp(u'Nick likes to play football, however he is not too fond of tennis.')\n",
|
||
"POS_counts = doc.count_by(spacy.attrs.POS)\n",
|
||
"\n",
"for code, freq in sorted(POS_counts.items()):\n",
"    print(doc.vocab[code].text, freq)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"RB 3\n",
|
||
"IN 1\n",
|
||
", 1\n",
|
||
"TO 1\n",
|
||
"JJ 1\n",
|
||
". 1\n",
|
||
"PRP 1\n",
|
||
"VBZ 2\n",
|
||
"VB 1\n",
|
||
"NN 2\n",
|
||
"NNP 1\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"TAG_counts = doc.count_by(spacy.attrs.TAG)\n",
|
||
"\n",
"for code, freq in sorted(TAG_counts.items()):\n",
"    print(doc.vocab[code].text, freq)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"acomp 1\n",
|
||
"advmod 2\n",
|
||
"aux 1\n",
|
||
"ccomp 1\n",
|
||
"dobj 1\n",
|
||
"neg 1\n",
|
||
"nsubj 2\n",
|
||
"pobj 1\n",
|
||
"prep 1\n",
|
||
"punct 2\n",
|
||
"xcomp 1\n",
|
||
"ROOT 1\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"DEP_counts = doc.count_by(spacy.attrs.DEP)\n",
|
||
"\n",
"for code, freq in sorted(DEP_counts.items()):\n",
"    print(doc.vocab[code].text, freq)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## NER"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Apple 0 5 ORG\n",
|
||
"U.K. 27 31 GPE\n",
|
||
"$1 billion 44 54 MONEY\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp = spacy.load(\"en_core_web_sm\")\n",
|
||
"doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')\n",
|
||
"\n",
"for ent in doc.ents:\n",
"    print(ent.text, ent.start_char, ent.end_char, ent.label_)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">\n",
|
||
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" Apple\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
||
"</mark>\n",
|
||
" is looking at buying \n",
|
||
"<mark class=\"entity\" style=\"background: #feca74; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" U.K.\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">GPE</span>\n",
|
||
"</mark>\n",
|
||
" startup for \n",
|
||
"<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" $1 billion\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
|
||
"</mark>\n",
|
||
"</div></span>"
|
||
],
|
||
"text/plain": [
|
||
"<IPython.core.display.HTML object>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">Over \n",
|
||
"<mark class=\"entity\" style=\"background: #bfe1d9; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" the last quarter\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">DATE</span>\n",
|
||
"</mark>\n",
|
||
" \n",
|
||
"<mark class=\"entity\" style=\"background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" Apple\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">ORG</span>\n",
|
||
"</mark>\n",
|
||
" sold \n",
|
||
"<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" nearly 20 thousand\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">CARDINAL</span>\n",
|
||
"</mark>\n",
|
||
" \n",
|
||
"<mark class=\"entity\" style=\"background: #bfeeb7; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" iPods\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">PRODUCT</span>\n",
|
||
"</mark>\n",
|
||
" for a profit of \n",
|
||
"<mark class=\"entity\" style=\"background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n",
|
||
" $6 million\n",
|
||
" <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MONEY</span>\n",
|
||
"</mark>\n",
|
||
".</div></span>"
|
||
],
|
||
"text/plain": [
|
||
"<IPython.core.display.HTML object>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')\n",
|
||
"displacy.render(doc, style='ent', jupyter=True,options={'distance':130})"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Text Feature Extraction"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## CountVectorizer\n",
|
||
"Transforming text into a vector"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
|
||
" 'summer', 'the', 'winter'], dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 26,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"import sklearn\n",
|
||
"# Count vectorization\n",
|
||
"texts = [\"Summer is coming but Summer is short\", \n",
|
||
" \"I like the Summer and I like the Winter\", \n",
|
||
" \"I like sandwiches and I like the Winter\"]\n",
|
||
"\n",
|
||
"\n",
|
||
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
||
"\n",
|
||
"\n",
|
||
"\n",
|
||
"# Count occurrences of unique words\n",
|
||
"# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html \n",
|
||
"\n",
|
||
"vectorizer = CountVectorizer()\n",
|
||
"X = vectorizer.fit_transform(texts)\n",
|
||
"vectorizer.get_feature_names_out()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[0 1 1 2 0 0 1 2 0 0]\n",
|
||
" [1 0 0 0 2 0 0 1 2 1]\n",
|
||
" [1 0 0 0 2 1 0 0 1 1]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(X.toarray())"
|
||
]
|
||
},
|
||
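The count matrix above can be reproduced by hand. The following is a minimal standard-library sketch, not scikit-learn's implementation: it assumes the default `CountVectorizer` behaviour of lowercasing the text and keeping only tokens of two or more word characters. The names `tokenize`, `token_counts`, `vocabulary` and `matrix` are illustrative.

```python
import re
from collections import Counter

texts = ["Summer is coming but Summer is short",
         "I like the Summer and I like the Winter",
         "I like sandwiches and I like the Winter"]

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    # This is why the single-letter word "I" never reaches the vocabulary.
    return re.findall(r"\b\w\w+\b", text.lower())

# One bag of words (token -> count) per document
token_counts = [Counter(tokenize(t)) for t in texts]

# The vocabulary is the sorted set of all tokens; each one becomes a column
vocabulary = sorted(set().union(*token_counts))

# One row per document, one column per vocabulary entry
matrix = [[counts[word] for word in vocabulary] for counts in token_counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Each column of the matrix corresponds to one vocabulary entry, so the value 2 in the first row under `summer` means that "Summer" appears twice in the first text.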
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['and like', 'but summer', 'coming but', 'is coming', 'is short',\n",
|
||
" 'like sandwiches', 'like the', 'sandwiches and', 'summer and',\n",
|
||
" 'summer is', 'the summer', 'the winter'], dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 28,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2)) # only bigrams; (1,2):unigram, bigram\n",
|
||
"X2 = vectorizer2.fit_transform(texts)\n",
|
||
"vectorizer2.get_feature_names_out()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 29,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[0 1 1 1 1 0 0 0 0 2 0 0]\n",
|
||
" [1 0 0 0 0 0 2 0 1 0 1 1]\n",
|
||
" [1 0 0 0 0 1 1 1 0 0 0 1]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(X2.toarray())"
|
||
]
|
||
},
|
||
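A bigram is a pair of consecutive tokens. As a rough standard-library sketch of what `ngram_range=(2, 2)` produces (again assuming the default lowercasing and two-character minimum token length; `tokenize` and `bigrams` are illustrative helper names, not scikit-learn functions):

```python
import re
from collections import Counter

texts = ["Summer is coming but Summer is short",
         "I like the Summer and I like the Winter",
         "I like sandwiches and I like the Winter"]

def tokenize(text):
    # Same assumed tokenization as the default word analyzer
    return re.findall(r"\b\w\w+\b", text.lower())

def bigrams(tokens):
    # Pair every token with its successor and join the pair with a space
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bigram_counts = [Counter(bigrams(tokenize(t))) for t in texts]
bigram_vocabulary = sorted(set().union(*bigram_counts))
bigram_matrix = [[c[b] for b in bigram_vocabulary] for c in bigram_counts]

print(bigram_vocabulary)
for row in bigram_matrix:
    print(row)
```

Note that the single-letter "I" is dropped before pairing, which is why `and like` appears as a bigram even though the original text reads "and I like".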
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Remove stop words"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 30,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['coming', 'like', 'sandwiches', 'short', 'summer', 'winter'],\n",
|
||
" dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 30,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"vectorizer3 = CountVectorizer(analyzer='word', stop_words='english') # only bigrams\n",
|
||
"X3 = vectorizer3.fit_transform(texts)\n",
|
||
"vectorizer3.get_feature_names_out()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 31,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[1 0 0 1 2 0]\n",
|
||
" [0 2 0 0 1 1]\n",
|
||
" [0 2 1 0 0 1]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(X3.toarray())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## TF-IDF"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 32,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n",
|
||
" 'summer', 'the', 'winter'], dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 32,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||
"\n",
|
||
"vect = TfidfVectorizer()\n",
|
||
"XTIDF = vect.fit_transform(texts)\n",
|
||
"vect.get_feature_names_out()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 33,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[0 1 1 2 0 0 1 2 0 0]\n",
|
||
" [1 0 0 0 2 0 0 1 2 1]\n",
|
||
" [1 0 0 0 2 1 0 0 1 1]]\n",
|
||
"[[0. 0.32767345 0.32767345 0.65534691 0. 0.\n",
|
||
" 0.32767345 0.49840822 0. 0. ]\n",
|
||
" [0.30151134 0. 0. 0. 0.60302269 0.\n",
|
||
" 0. 0.30151134 0.60302269 0.30151134]\n",
|
||
" [0.33846987 0. 0. 0. 0.67693975 0.44504721\n",
|
||
" 0. 0. 0.33846987 0.33846987]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Counter\n",
|
||
"print(X.toarray())\n",
|
||
"# TF-IDF\n",
|
||
"print(XTIDF.toarray())"
|
||
]
|
||
},
|
||
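The TF-IDF values above can be derived from the raw counts. The sketch below uses only the standard library and assumes the default `TfidfVectorizer` settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper names are illustrative.

```python
import math
import re
from collections import Counter

texts = ["Summer is coming but Summer is short",
         "I like the Summer and I like the Winter",
         "I like sandwiches and I like the Winter"]

def tokenize(text):
    # Assumed default tokenization: lowercase, tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(t)) for t in texts]
vocabulary = sorted(set().union(*counts))
n_docs = len(texts)

# Document frequency: in how many documents does each term occur?
doc_freq = {w: sum(1 for c in counts if w in c) for w in vocabulary}

# Smoothed idf: terms that occur in fewer documents get a larger weight
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(c):
    # Term count times idf, then scale the row to unit Euclidean length
    weights = [c[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

tfidf = [tfidf_row(c) for c in counts]
for row in tfidf:
    print([round(x, 4) for x in row])
```

In the first text, `is` and `summer` both occur twice, yet `is` receives the higher weight because it appears in only one document while `summer` appears in two.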
{
|
||
"cell_type": "code",
|
||
"execution_count": 34,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"['coming' 'like' 'sandwiches' 'short' 'summer' 'winter']\n",
|
||
"[[1 0 0 1 2 0]\n",
|
||
" [0 2 0 0 1 1]\n",
|
||
" [0 2 1 0 0 1]]\n",
|
||
"[[0.48148213 0. 0. 0.48148213 0.73235914 0. ]\n",
|
||
" [0. 0.81649658 0. 0. 0.40824829 0.40824829]\n",
|
||
" [0. 0.77100584 0.50689001 0. 0. 0.38550292]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"vect3 = TfidfVectorizer(analyzer='word', stop_words='english')\n",
|
||
"XTIDF3 = vect3.fit_transform(texts)\n",
|
||
"print(vect3.get_feature_names_out())\n",
|
||
"print(X3.toarray())\n",
|
||
"print(XTIDF3.toarray())"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Classifying spam\n",
|
||
"We will use the sms spam collection dataset taken from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 35,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import numpy as np\n",
|
||
"import pandas as pd"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 36,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>label</th>\n",
|
||
" <th>message</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>ham</td>\n",
|
||
" <td>Go until jurong point, crazy.. Available only ...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>ham</td>\n",
|
||
" <td>Ok lar... Joking wif u oni...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>2</th>\n",
|
||
" <td>spam</td>\n",
|
||
" <td>Free entry in 2 a wkly comp to win FA Cup fina...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3</th>\n",
|
||
" <td>ham</td>\n",
|
||
" <td>U dun say so early hor... U c already then say...</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>4</th>\n",
|
||
" <td>ham</td>\n",
|
||
" <td>Nah I don't think he goes to usf, he lives aro...</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" label message\n",
|
||
"0 ham Go until jurong point, crazy.. Available only ...\n",
|
||
"1 ham Ok lar... Joking wif u oni...\n",
|
||
"2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n",
|
||
"3 ham U dun say so early hor... U c already then say...\n",
|
||
"4 ham Nah I don't think he goes to usf, he lives aro..."
|
||
]
|
||
},
|
||
"execution_count": 36,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"column_names = ['label', 'message']\n",
|
||
"df = pd.read_csv('SMSSpamCollection', sep='\\t', names=column_names, header=None)\n",
|
||
"df[0:5]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 37,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"label 0\n",
|
||
"message 0\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 37,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"#check missing values\n",
|
||
"df.isnull().sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# df['label'].replace('', np.nan, inplace=True)\n",
|
||
"# df['message'].replace('', np.nan, inplace=True)\n",
|
||
"# df.dropna(inplace=True)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 38,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"ham 4825\n",
|
||
"spam 747\n",
|
||
"Name: label, dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 38,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"df['label'].value_counts()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 39,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# spling training and testing\n",
|
||
"from sklearn.model_selection import train_test_split"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 40,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"X = df['message']\n",
|
||
"y = df['label']"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 41,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#train\n",
|
||
"X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.33, random_state=42)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 42,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.feature_extraction.text import CountVectorizer"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 43,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"count_vect = CountVectorizer()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 44,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# fit vectorizer to data: build dictionary, count words,...\n",
|
||
"# transform: transform original text message to the vector\n",
|
||
"X_train_counts = count_vect.fit_transform(X_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 45,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(3733,)"
|
||
]
|
||
},
|
||
"execution_count": 45,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"X_train.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 46,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"<3733x7082 sparse matrix of type '<class 'numpy.int64'>'\n",
|
||
"\twith 49992 stored elements in Compressed Sparse Row format>"
|
||
]
|
||
},
|
||
"execution_count": 46,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"X_train_counts"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"source": [
|
||
"We see the vocabulary are 7082 words, but most values are zeros"
|
||
]
|
||
},
|
||
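How sparse the matrix is can be worked out from the figures in the output above. This is a small arithmetic sketch; the numbers are copied from the sparse matrix summary, and the variable names are illustrative.

```python
# Figures reported for X_train_counts above
n_messages, n_terms = 3733, 7082
stored = 49992  # non-zero entries kept by the sparse format

# Fraction of the full 3733 x 7082 grid that is actually non-zero
density = stored / (n_messages * n_terms)

# Average number of distinct vocabulary terms per message
avg_terms = stored / n_messages

print(f"{density:.2%} of the entries are non-zero")
print(f"{avg_terms:.1f} distinct terms per message on average")
```

This is why the result is stored in a sparse format: a dense array would spend almost all of its memory on zeros.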
{
|
||
"cell_type": "code",
|
||
"execution_count": 47,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.feature_extraction.text import TfidfTransformer"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 48,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(3733, 7082)"
|
||
]
|
||
},
|
||
"execution_count": 48,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"tfidf_transformer = TfidfTransformer()\n",
|
||
"X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n",
|
||
"X_train_tfidf.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"source": [
|
||
"Instead of first count vectorization and then tf-idf transformation, better TF-IDF vectorizer, which makes these two things"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||
"vectorizer = TfidfVectorizer()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"X_train_tfidf = vectorizer.fit_transform(X_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Classifier"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 49,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.svm import LinearSVC"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 50,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"clf = LinearSVC()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 51,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>LinearSVC()</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearSVC</label><div class=\"sk-toggleable__content\"><pre>LinearSVC()</pre></div></div></div></div></div>"
|
||
],
|
||
"text/plain": [
|
||
"LinearSVC()"
|
||
]
|
||
},
|
||
"execution_count": 51,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"clf.fit(X_train_tfidf, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Pipeline\n",
|
||
"Simple way to define the processing steps for repeating the operation."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 52,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.pipeline import Pipeline\n",
|
||
"text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 53,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<style>#sk-container-id-2 {color: black;background-color: white;}#sk-container-id-2 pre{padding: 0;}#sk-container-id-2 div.sk-toggleable {background-color: white;}#sk-container-id-2 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-2 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-2 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-2 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-2 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-2 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-2 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-2 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-2 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-2 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-2 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-2 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-2 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-2 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-2 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-2 div.sk-item {position: relative;z-index: 1;}#sk-container-id-2 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-2 div.sk-item::before, #sk-container-id-2 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-2 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-2 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-2 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-2 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-2 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-2 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-2 div.sk-label-container {text-align: center;}#sk-container-id-2 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-2 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item sk-dashed-wrapped\"><div class=\"sk-label-container\"><div class=\"sk-label sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" ><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">Pipeline</label><div class=\"sk-toggleable__content\"><pre>Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])</pre></div></div></div><div class=\"sk-serial\"><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-3\" type=\"checkbox\" ><label for=\"sk-estimator-id-3\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">TfidfVectorizer</label><div class=\"sk-toggleable__content\"><pre>TfidfVectorizer()</pre></div></div></div><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-4\" type=\"checkbox\" ><label for=\"sk-estimator-id-4\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">LinearSVC</label><div class=\"sk-toggleable__content\"><pre>LinearSVC()</pre></div></div></div></div></div></div></div>"
|
||
],
|
||
"text/plain": [
|
||
"Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])"
|
||
]
|
||
},
|
||
"execution_count": 53,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"text_clf.fit(X_train, y_train)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 54,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"predictions = text_clf.predict(X_test)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 55,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from sklearn.metrics import confusion_matrix, classification_report"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 56,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"[[1586 7]\n",
|
||
" [ 12 234]]\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(confusion_matrix(y_test, predictions))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 57,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
" precision recall f1-score support\n",
|
||
"\n",
|
||
" ham 0.99 1.00 0.99 1593\n",
|
||
" spam 0.97 0.95 0.96 246\n",
|
||
"\n",
|
||
" accuracy 0.99 1839\n",
|
||
" macro avg 0.98 0.97 0.98 1839\n",
|
||
"weighted avg 0.99 0.99 0.99 1839\n",
|
||
"\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"print(classification_report(y_test, predictions))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 58,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"0.989668297988037"
|
||
]
|
||
},
|
||
"execution_count": 58,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"from sklearn import metrics\n",
|
||
"metrics.accuracy_score(y_test, predictions)"
|
||
]
|
||
},
|
||
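The scores reported above can be recomputed from the confusion matrix alone. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels, both in sorted order (`ham`, `spam`); the variable names are illustrative.

```python
# Confusion matrix printed above: rows = true label, columns = prediction
confusion = [[1586, 7],
             [12, 234]]

ham_as_ham, ham_as_spam = confusion[0]
spam_as_ham, spam_as_spam = confusion[1]
total = sum(map(sum, confusion))

# Accuracy: share of all test messages that were classified correctly
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision for spam: of the messages flagged as spam, how many were spam?
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)

# Recall for spam: of the real spam messages, how many were caught?
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

# F1: harmonic mean of precision and recall
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f} precision={spam_precision:.2f} "
      f"recall={spam_recall:.2f} f1={spam_f1:.2f}")
```

Because ham is far more frequent than spam, accuracy alone hides the 12 spam messages that slipped through, which is why the per-class recall is worth reading as well.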
{
|
||
"cell_type": "code",
|
||
"execution_count": 59,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['ham'], dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 59,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"text_clf.predict([\"This is a summer school\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 60,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array(['spam'], dtype=object)"
|
||
]
|
||
},
|
||
"execution_count": 60,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"text_clf.predict([\"Free tickets and CASH\"])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "slide"
|
||
}
|
||
},
|
||
"source": [
|
||
"# Vectors and similarity\n",
|
||
"You need to install spaCy first if it is not already installed:\n",
|
||
"* `pip install spacy`\n",
|
||
"* or `conda install -c conda-forge spacy`\n",
|
||
"\n",
|
||
"Then install an English model that includes word vectors (medium or large):\n",
|
||
"* `python -m spacy download en_core_web_md`\n",
|
||
"* `python -m spacy download en_core_web_lg`\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 1,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"import spacy\n",
|
||
"nlp = spacy.load('en_core_web_lg')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"array([ 1.8023e+00, 3.9075e+00, -4.2940e+00, -7.6117e+00, -3.7172e+00,\n",
|
||
" -1.5229e-01, -1.1368e+00, -6.8427e-01, -9.3067e-01, 5.6531e+00,\n",
|
||
" 4.2536e+00, -4.1175e+00, -8.3049e-01, 2.7701e+00, 6.4474e+00,\n",
|
||
" -6.6389e-02, -8.3026e-01, -7.4532e+00, 1.7888e-01, 2.5130e+00,\n",
|
||
" -4.4785e-01, 8.4806e+00, -2.7056e+00, -6.9836e+00, 9.2242e-01,\n",
|
||
" -3.3579e+00, -3.2071e+00, 1.2901e-01, 3.5933e+00, -4.8096e+00,\n",
|
||
" 3.2596e-01, -3.0782e-01, -3.8023e+00, -1.2818e-01, 9.7322e-02,\n",
|
||
" 1.0876e+00, -4.5140e+00, -8.5375e-02, -4.4139e+00, -1.4073e+00,\n",
|
||
" -2.4729e+00, 1.3307e-01, 3.1949e+00, 2.9971e+00, 5.3643e+00,\n",
|
||
" -3.2407e+00, -2.7512e+00, 3.6586e-01, 2.7333e-01, 6.6513e+00,\n",
|
||
" 4.8740e+00, 1.3732e+00, -7.3595e-01, -2.3265e+00, 1.4045e+00,\n",
|
||
" 1.5080e-01, 3.1985e+00, -5.7459e+00, 3.5059e+00, 8.1671e-01,\n",
|
||
" -1.1113e+00, -8.9306e-01, -4.2963e+00, 8.4042e-01, -8.3586e-01,\n",
|
||
" -2.5407e+00, -1.1414e+00, -5.5050e+00, -3.6670e+00, 1.7393e+00,\n",
|
||
" -1.9284e+00, 2.7994e+00, 4.4476e+00, -1.0855e+00, 2.5439e+00,\n",
|
||
" -1.8681e+00, 2.1162e+00, 5.5460e+00, 2.8248e+00, -1.1810e+00,\n",
|
||
" -9.3259e-01, -1.8681e+00, -3.0654e-02, -3.4096e+00, 2.0261e+00,\n",
|
||
" -5.4005e-01, 8.2070e-01, 4.3283e+00, -3.4484e+00, -2.1291e+00,\n",
|
||
" 1.2265e+00, -4.4106e-01, 3.8392e+00, -5.8643e+00, -7.3440e-01,\n",
|
||
" 1.9785e+00, 4.1928e+00, 1.4577e+00, 2.8668e+00, -6.3762e+00,\n",
|
||
" 2.7575e+00, 1.7991e+00, 1.3388e-02, -5.1316e-01, -6.3303e+00,\n",
|
||
" -2.5989e+00, -1.0406e+00, 1.8325e+00, 9.6654e-02, 4.4002e+00,\n",
|
||
" -1.3231e+00, 2.7717e+00, 4.3340e+00, 2.9027e-01, 7.2542e+00,\n",
|
||
" -1.2149e+00, -1.7366e+00, -5.2755e+00, 7.5762e-01, -6.0150e+00,\n",
|
||
" 2.1634e+00, -1.6577e+00, -6.4410e+00, 2.5107e+00, -7.6881e+00,\n",
|
||
" -6.3143e-01, 6.0914e+00, 4.7114e+00, 1.0778e+00, 1.8121e+00,\n",
|
||
" -3.1133e+00, -5.5923e+00, 5.0992e-01, -2.2783e+00, 1.3641e+00,\n",
|
||
" 3.4367e+00, -1.0224e+00, -3.1824e+00, 2.0683e+00, 2.0398e+00,\n",
|
||
" -8.2011e+00, 4.5388e-01, 2.7002e+00, 3.9199e+00, -5.5184e-01,\n",
|
||
" -3.3309e+00, -3.8620e+00, 1.7020e-01, 4.9659e+00, 6.9592e-01,\n",
|
||
" 3.4792e+00, -2.7438e+00, -6.0489e-01, 1.9883e-02, 2.3192e-01,\n",
|
||
" -4.0591e-01, 3.9470e+00, 1.4145e+00, -8.4031e-01, -1.9433e+00,\n",
|
||
" -2.5783e+00, -6.8732e+00, -3.7792e+00, 6.4090e+00, -2.3963e+00,\n",
|
||
" -3.1485e+00, 2.2938e+00, -1.3649e+00, -1.3070e+00, -7.4143e-01,\n",
|
||
" 3.5752e+00, 3.1999e+00, -2.7599e+00, 3.9996e+00, -2.6275e+00,\n",
|
||
" -3.2632e+00, -2.7695e+00, -2.0046e+00, 3.4848e-01, -3.7322e+00,\n",
|
||
" 3.9018e+00, 1.1883e-02, 6.7589e+00, -4.2182e+00, -1.7291e+00,\n",
|
||
" 1.3949e+00, 5.9161e-01, -4.0226e+00, 1.7388e+00, -1.9609e+00,\n",
|
||
" -5.4280e-02, 1.4707e+00, -4.2497e+00, -7.4698e-01, 5.7317e+00,\n",
|
||
" -5.9729e+00, 4.3627e-01, 6.9487e+00, -2.9021e+00, 2.8235e+00,\n",
|
||
" 4.4695e+00, -2.7154e+00, -1.7771e+00, -1.6288e+00, -4.9338e+00,\n",
|
||
" -2.1144e+00, 1.4976e+00, -4.4156e+00, -3.3974e+00, -9.0295e+00,\n",
|
||
" 2.1685e+00, -1.7372e+00, -8.9336e-03, 2.2437e+00, -1.3924e+00,\n",
|
||
" -2.5530e+00, -2.0714e+00, 2.0850e+00, -5.2257e+00, 8.4517e-01,\n",
|
||
" 1.6804e+00, -7.9530e+00, 3.8700e+00, 9.2134e+00, -4.5150e+00,\n",
|
||
" 2.8401e+00, 5.1596e-01, -3.7684e+00, 2.3126e+00, 2.2748e+00,\n",
|
||
" -4.7895e+00, -2.3299e+00, -2.3546e+00, -2.0999e+00, -3.7111e+00,\n",
|
||
" 1.4847e+00, -1.6953e+00, 4.9883e+00, 2.5845e-01, 4.1598e+00,\n",
|
||
" -8.4808e-01, 3.1341e+00, 4.1797e+00, -9.9561e-01, 1.1814e+00,\n",
|
||
" -3.0735e+00, -2.7010e+00, -9.5470e-01, 1.4944e+00, 2.4461e+00,\n",
|
||
" -1.2699e+00, 2.3195e+00, -7.0078e-01, 2.6868e+00, -2.9822e+00,\n",
|
||
" 3.8670e+00, 3.1915e+00, 3.2350e+00, -3.2919e+00, 4.2211e-01,\n",
|
||
" 3.9947e+00, -1.4124e+00, -2.1844e+00, -3.0904e+00, 2.3693e+00,\n",
|
||
" -2.8532e+00, -7.5463e-01, -3.6133e+00, -7.8667e+00, 4.7647e+00,\n",
|
||
" 2.6976e+00, -2.6137e-01, -5.2056e+00, -2.2392e+00, 2.7426e+00,\n",
|
||
" -1.2172e+00, -4.4441e-02, -3.1014e+00, -4.7598e+00, 5.2652e+00,\n",
|
||
" -4.0911e+00, -4.9625e+00, 2.8234e-01, 1.5329e+00, 5.3542e+00,\n",
|
||
" -1.5295e+00, -3.5151e+00, -1.5575e+00, -3.6066e+00, -3.2199e+00,\n",
|
||
" 4.5560e+00, -3.6332e-01, 1.6928e+00, -2.5321e+00, -4.1381e+00,\n",
|
||
" -3.4422e+00, 2.4066e+00, 6.1191e+00, -1.1493e+00, 3.0401e+00],\n",
|
||
" dtype=float32)"
|
||
]
|
||
},
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp(u'girl').vector"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(300,)"
|
||
]
|
||
},
|
||
"execution_count": 3,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp(u'girl').vector.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(300,)"
|
||
]
|
||
},
|
||
"execution_count": 4,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"# Document vector: the average of the individual token vectors\n",
|
||
"nlp(u'the girl is blond').vector.shape"
|
||
]
|
||
},
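{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The claim that a document vector is the average of its token vectors can be checked directly. The cell below is a small sketch that assumes the `nlp` pipeline loaded above ships static word vectors (as the medium and large English models do). The names `sentence` and `mean_vector` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"sentence = nlp('the girl is blond')\n",
"# Average the per-token vectors by hand\n",
"mean_vector = np.mean([token.vector for token in sentence], axis=0)\n",
"# Compare with spaCy's document vector, allowing for float32 rounding\n",
"print(np.allclose(sentence.vector, mean_vector, atol=1e-5))"
]
},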
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"doc = nlp(u'cat lion dog pet')\n",
|
||
"#doc = nlp(u'buy sell rent')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 6,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"cat cat 1.0\n",
|
||
"cat lion 0.3854507803916931\n",
|
||
"cat dog 0.8220816850662231\n",
|
||
"cat pet 0.732966423034668\n",
|
||
"lion cat 0.3854507803916931\n",
|
||
"lion lion 1.0\n",
|
||
"lion dog 0.2949307858943939\n",
|
||
"lion pet 0.20031584799289703\n",
|
||
"dog cat 0.8220816850662231\n",
|
||
"dog lion 0.2949307858943939\n",
|
||
"dog dog 1.0\n",
|
||
"dog pet 0.7856059074401855\n",
|
||
"pet cat 0.732966423034668\n",
|
||
"pet lion 0.20031584799289703\n",
|
||
"pet dog 0.7856059074401855\n",
|
||
"pet pet 1.0\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"for word1 in doc:\n",
|
||
" for word2 in doc:\n",
|
||
" print(word1.text, word2.text, word1.similarity(word2))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(514157, 300)"
|
||
]
|
||
},
|
||
"execution_count": 7,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"nlp.vocab.vectors.shape"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"False\n",
|
||
"True\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"doc = nlp(u'catr')\n",
|
||
"token = doc[0]\n",
|
||
"print(token.has_vector)\n",
|
||
"print(token.is_oov)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"from scipy import spatial\n",
|
||
"\n",
|
||
"cosine_similarity = lambda v1, v2: 1 - spatial.distance.cosine(v1, v2)"
|
||
]
|
||
},
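{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"SciPy returns the cosine *distance*, so the helper above subtracts it from 1 to obtain the similarity:\n",
"\n",
"$$\\cos(v_1, v_2) = \\frac{v_1 \\cdot v_2}{\\lVert v_1 \\rVert \\, \\lVert v_2 \\rVert}$$\n",
"\n",
"As a sanity check, the sketch below writes the same formula with NumPy and compares both versions on two word vectors. `cosine_similarity_np` is a hypothetical helper name, and the cell assumes `nlp` and `cosine_similarity` are already defined as above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def cosine_similarity_np(v1, v2):\n",
"    # Dot product divided by the product of the Euclidean norms\n",
"    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))\n",
"\n",
"cat = nlp.vocab['cat'].vector\n",
"dog = nlp.vocab['dog'].vector\n",
"# Both values should agree up to floating-point rounding\n",
"print(cosine_similarity(cat, dog), cosine_similarity_np(cat, dog))"
]
},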
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"king = nlp.vocab['king'].vector\n",
|
||
"man = nlp.vocab['man'].vector\n",
|
||
"woman = nlp.vocab['woman'].vector"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Word analogy: king - man + woman\n",
|
||
"new_vector = king - man + woman"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "subslide"
|
||
}
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"computed_similarity = []\n",
|
||
"for key in nlp.vocab.vectors:\n",
|
||
"    word = nlp.vocab[key]\n",
|
||
" if word.has_vector:\n",
|
||
" if word.is_lower:\n",
|
||
" if word.is_alpha: \n",
|
||
" similarity = cosine_similarity(new_vector, word.vector)\n",
|
||
" computed_similarity.append((word, similarity))"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "fragment"
|
||
}
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"['king', 'kings', 'princes', 'consort', 'princeling', 'monarch', 'princelings', 'princesses', 'prince', 'kingship', 'princess', 'ruler', 'consorts', 'kingi', 'princedom', 'rulers', 'kingii', 'enthronement', 'monarchical', 'queen', 'monarchs', 'enthroning', 'queening', 'regents', 'principality', 'kingsize', 'throne', 'princesa', 'dynastic', 'princedoms', 'nobility', 'monarchic', 'imperial', 'princesse', 'rulership', 'courtiers', 'dynasties', 'monarchial', 'kingdom', 'predynastic', 'enthrone', 'succession', 'princely', 'royal', 'kingly', 'mcqueen', 'dethronement', 'royally', 'emperor', 'princeps']\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"computed_similarity = sorted(computed_similarity, key=lambda item: -item[1])\n",
|
||
"print([t[0].text for t in computed_similarity[:50]])"
|
||
]
|
||
},
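{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"In the ranking printed above, `king` itself comes first and `queen` only appears around position 20. A common convention in word-analogy experiments is to leave the query words out of the candidate list. The cell below is a sketch of that filtering step. It reuses `computed_similarity` from the previous cells, and the names `query_words`, `ranked` and `candidates` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"query_words = {'king', 'man', 'woman'}\n",
"# Sort by decreasing similarity (harmless if the list is already sorted)\n",
"ranked = sorted(computed_similarity, key=lambda item: -item[1])\n",
"# Drop the words that were used to build the analogy vector\n",
"candidates = [lexeme.text for lexeme, _ in ranked if lexeme.text not in query_words]\n",
"print(candidates[:10])"
]
},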
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"## References\n",
|
||
"\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"* [spaCy](https://spacy.io/usage/spacy-101/#annotations)\n",
|
||
"* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
|
||
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
||
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
|
||
"* Natural Language Processing with Python, José Portilla, 2019."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"## Licence"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"slideshow": {
|
||
"slide_type": "skip"
|
||
}
|
||
},
|
||
"source": [
|
||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
|
||
"\n",
|
||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"celltoolbar": "Slideshow",
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.10.10"
|
||
},
|
||
"latex_envs": {
|
||
"LaTeX_envs_menu_present": true,
|
||
"autocomplete": true,
|
||
"bibliofile": "biblio.bib",
|
||
"cite_by": "apalike",
|
||
"current_citInitial": 1,
|
||
"eqLabelWithNumbers": true,
|
||
"eqNumInitial": 1,
|
||
"hotkeys": {
|
||
"equation": "Ctrl-E",
|
||
"itemize": "Ctrl-I"
|
||
},
|
||
"labels_anchors": false,
|
||
"latex_user_defs": false,
|
||
"report_style_numbering": false,
|
||
"user_envs_cfg": false
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 1
|
||
}
|