{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Lexical Processing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [NLP Basics](#NLP-Basics)\n", " * [Spacy installation](#Spacy-installation)\n", " * [Spacy pipeline](#Spacy-pipeline)\n", " * [Tokenization](#Tokenization)\n", " * [Noun chunks](#Noun-chunks)\n", " * [Stemming](#Stemming)\n", " * [Sentence segmentation](#Sentence-segmentation)\n", " * [Lemmatization](#Lemmatization)\n", " * [Stop words](#Stop-words)\n", " * [POS](#POS)\n", " * [NER](#NER)\n", "* [Text Feature extraction](#Text-Feature-extraction)\n", "* [Classifying spam](#Classifying-spam)\n", "* [Vectors and similarity](#Vectors-and-similarity)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In this session we are going to learn to process text so that can apply machine learning techniques." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# NLP Basics\n", "In this notebook we are going to use two popular NLP libraries:\n", "* NLTK (Natural Language Toolkit, https://www.nltk.org/) \n", "* Spacy (https://spacy.io/)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Main characteristics:\n", "* both are open source and very popular\n", "* NLTK was released in 2001 while Spacy was in 2015\n", "* Spacy provides very efficient implementations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Spacy installation\n", "\n", "You need to install previously spacy if not installed:\n", "* `pip install spacy`\n", "* or `conda install -c conda-forge spacy`\n", "\n", "and install the small English model \n", "* `python -m spacy download en_core_web_sm`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Spacy pipelines\n", "\n", "The function **nlp** takes a raw text and perform several operations (tokenization, tagger, NER, ...)\n", "![](spacy/spacy-pipeline.svg \"Spacy pipelines\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "From text to doc trough the pipeline" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import spacy\n", "\n", "nlp = spacy.load('en_core_web_sm')\n", "doc = nlp(u'Albert Einstein won the Nobel Prize for Physics in 1921')\n", "doc2 = nlp(u'\"Let\\'s go to N.Y.!\"')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Tokenization\n", "From text to tokens\n", "![](spacy/tokenization.svg \"Tokenization\")" ] }, { "cell_type": "markdown", 
"metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The tokenizer checks:\n", "\n", "* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.\n", "* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.\n", "* **Suffix:** Character(s) at the end, e.g. km, ), ”, !.\n", "* **Infix:** Character(s) in between, e.g. -, --, /, …." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's do it!" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Albert\n", "Einstein\n", "won\n", "the\n", "Nobel\n", "Prize\n", "for\n", "Physics\n", "in\n", "1921\n" ] } ], "source": [ "# print tokens\n", "for token in doc:\n", " print(token.text)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"\n", "Let\n", "'s\n", "go\n", "to\n", "N.Y.\n", "!\n", "\"\n" ] } ], "source": [ "for token in doc2:\n", " print(token.text)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Noun chunks\n", "Noun phrases" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Autonomous cars cars nsubj shift\n", "insurance liability liability dobj shift\n", "manufacturers manufacturers pobj toward\n" ] } ], "source": [ "doc = nlp(\"Autonomous cars shift insurance liability toward manufacturers\")\n", "for chunk in doc.noun_chunks:\n", " print(chunk.text, chunk.root.text, chunk.root.dep_,\n", " chunk.root.head.text)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "\n", "\n", " Autonomous\n", " ADJ\n", "\n", "\n", "\n", " cars\n", " NOUN\n", "\n", "\n", "\n", " shift\n", " VERB\n", "\n", "\n", "\n", " insurance\n", " NOUN\n", "\n", "\n", "\n", " liability\n", " NOUN\n", "\n", "\n", "\n", " toward\n", " ADP\n", "\n", "\n", "\n", " manufacturers\n", " NOUN\n", "\n", "\n", "\n", " \n", " \n", " amod\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " prep\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " pobj\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from spacy import displacy\n", "displacy.render(doc, style='dep', jupyter=True,options={'distance':130})" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sentence segmentation\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This is a sentence.\n", "This is another sentence.\n" ] } ], "source": [ "doc3 =nlp(u'This is a sentence. This is another sentence.')\n", "for sent in doc3.sents:\n", " print(sent)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Stemming\n", "Spacy does not include a stemmer. 
\n", "We will use nltk.\n", "The purpose is removing the ending of a word based on rules." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "caress\n", "fli\n", "is\n", "been\n", "gener\n" ] } ], "source": [ "import nltk\n", "from nltk.stem.porter import PorterStemmer\n", "\n", "stemmer = PorterStemmer()\n", "words = ['caresses', 'flies','is', 'been', 'generously']\n", "stems = [stemmer.stem(word) for word in words]\n", "for stem in stems:\n", " print(stem)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lemmatization\n", "Lemmatization includes a morphological analysis." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['I', 'be', 'read', 'the', 'paper', '.']\n" ] } ], "source": [ "doc = nlp(\"I was reading the paper.\")\n", "print([token.lemma_ for token in doc])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I \t PRON \t I\n", "was \t AUX \t be\n", "reading \t VERB \t read\n", "the \t DET \t the\n", "paper \t NOUN \t paper\n", ". \t PUNCT \t .\n" ] } ], "source": [ "for token in doc:\n", " print(token.text, '\\t', token.pos_, '\\t', token.lemma_)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Stopwords" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'other', 'toward', 'everywhere', 'whether', 'i', 'his', 'afterwards', 'whenever', 'except', 'but', 'for', 'noone', 'whereas', 'she', 'name', 'per', 'among', 'where', 'am', '’re', 'on', 'thereby', 'fifty', \"'ll\", '‘m', 'would', 'we', 'once', 'can', 'meanwhile', 'anything', 'still', 'these', 'without', 'below', 'rather', 'were', 'six', 'them', 'latter', 'sometime', 'seemed', 'next', 'move', 'there', 'various', 'fifteen', '’m', 'onto', 'n’t', 'much', 'one', 'due', 'our', 'although', 'whom', 'done', 'made', 'at', 'empty', 'about', 'herself', 'down', 'never', 'thereafter', \"'re\", 'off', 'he', 'hereafter', 'whereafter', 'what', 'above', 'along', 'a', '’ve', '’s', 'else', 'or', 'sometimes', 'twenty', \"n't\", 'over', 'both', 'someone', 'beside', 'therein', 'whole', 'make', 'any', 'becomes', 'anyhow', 'quite', 'hence', 'here', 'same', 'which', 'whereby', 'whereupon', 'must', 'me', 'part', 'serious', 'into', 'namely', 'hers', 'enough', 'with', 'because', 'own', 'give', 'see', 'somehow', 'since', 'just', 'seems', 'top', 'across', 'ourselves', 'in', 'anywhere', 'few', 'myself', 'say', 'all', 'together', 'had', 'back', 'besides', 'please', 'n‘t', 'many', 'whoever', \"'d\", 'nothing', 'be', 'did', 'yours', 'how', 'also', 'those', 'that', 'throughout', '‘ve', 'amount', 'others', 'keep', 'hundred', 'up', 'already', 'amongst', 'front', 'thru', 'if', '‘s', 'least', 'us', 'anyone', 'might', 'thereupon', 'third', 'nor', 'sixty', 'nobody', 'more', 'her', 'by', 'himself', 'each', 'than', 're', 'behind', 'almost', 'seeming', 'when', 'is', 'mostly', 'so', 'ours', 'becoming', 'towards', 'some', 'two', 'seem', 'put', 'four', 'you', 'call', 'thence', 'moreover', 'nowhere', 'do', 'former', 'formerly', 'elsewhere', 'full', 'after', 'thus', 
'less', 'go', 'ever', 'nine', 'anyway', 'somewhere', 'will', 'three', 'within', 'been', 'before', 'of', 'has', 'beyond', 'such', 'why', 'none', 'whose', 'eight', 'either', 'no', 'itself', 'doing', \"'ve\", '‘d', 'though', 'neither', 'while', 'get', 'around', 'your', 'twelve', 'even', 'and', 'something', 'always', \"'m\", 'until', 'an', 'during', 'it', 'most', 'the', 'yourselves', '’d', '‘ll', 'very', 'really', 'show', 'too', 'everyone', 'wherein', 'therefore', 'whither', 'who', 'may', 'latterly', 'between', 'its', 'however', '‘re', 'through', 'everything', 'another', 'does', 'whatever', 'their', 'then', 'beforehand', 'my', 'eleven', 'now', 'should', 'cannot', 'have', 'take', \"'s\", 'perhaps', 'regarding', 'against', 'used', 'indeed', 'upon', 'him', 'using', 'under', 'became', 'to', 'hereupon', 'bottom', 'not', 'from', 'hereby', 'side', 'yet', 'this', 'could', 'themselves', '’ll', 'ca', 'nevertheless', 'via', 'five', 'become', 'as', 'only', 'otherwise', 'are', 'was', 'first', 'further', 'ten', 'herein', 'yourself', 'forty', 'alone', 'every', 'again', 'well', 'last', 'several', 'wherever', 'mine', 'often', 'whence', 'out', 'they', 'being', 'unless'}\n" ] } ], "source": [ "nlp = spacy.load('en_core_web_sm')\n", "print(nlp.Defaults.stop_words)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['for'].is_stop" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['day'].is_stop" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab['btw'].is_stop" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#add stop words\n", "nlp.Defaults.stop_words.add('btw')\n", "nlp.vocab['btw'].is_stop = True\n", "nlp.vocab['btw'].is_stop" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original Text\n", "Nick likes to play football, however he is not too fond of tennis. 
\n", "\n", "\n", "Text after removing stop words\n", "Nick likes play football, fond tennis.\n" ] } ], "source": [ "en_stopwords = nlp.Defaults.stop_words\n", "text = \"Nick likes to play football, however he is not too fond of tennis.\"\n", "\n", "lst=[]\n", "for token in text.split():\n", " if token.lower() not in en_stopwords:\n", " lst.append(token)\n", "\n", "print('Original Text') \n", "print(text,'\\n\\n')\n", "\n", "print('Text after removing stop words')\n", "print(' '.join(lst))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also use methods from the token class (https://spacy.io/api/token), such as:\n", "\n", "* **is_stop:** is the token a stop word?\n", "* **is_punct:** is the token punctuation?\n", "* **like_email:** does the token resemble an email address?\n", "* **is_digit:** Does the token consist of digits? " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## POS" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I \t PRON \t pronoun\n", "was \t AUX \t auxiliary\n", "reading \t VERB \t verb\n", "the \t DET \t determiner\n", "paper \t NOUN \t noun\n", ". \t PUNCT \t punctuation\n" ] } ], "source": [ "# POS\n", "# information available at https://spacy.io/usage/linguistic-features/#pos-tagging\n", "for token in doc:\n", " print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_))" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I \t PRON \t pronoun \t nsubj\n", "was \t AUX \t auxiliary \t aux\n", "reading \t VERB \t verb \t ROOT\n", "the \t DET \t determiner \t det\n", "paper \t NOUN \t noun \t dobj\n", ". \t PUNCT \t punctuation \t punct\n" ] } ], "source": [ "for token in doc:\n", " print(token.text, '\\t', token.pos_, '\\t', spacy.explain(token.pos_), '\\t', token.dep_)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "[('tok2vec', ),\n", " ('tagger', ),\n", " ('parser', ),\n", " ('attribute_ruler',\n", " ),\n", " ('lemmatizer',\n", " ),\n", " ('ner', )]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List the pipeline\n", "nlp.pipeline" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.pipe_names" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We can also get some statistics, regarding the frequency of tags in POS, DEP or TAG." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ADJ 1\n", "ADP 1\n", "ADV 2\n", "AUX 1\n", "NOUN 2\n", "PART 2\n", "PRON 1\n", "PROPN 1\n", "PUNCT 2\n", "VERB 2\n" ] } ], "source": [ "doc = nlp(u'Nick likes to play football, however he is not too fond of tennis.')\n", "POS_counts = doc.count_by(spacy.attrs.POS)\n", "\n", "for code,freq in sorted(POS_counts.items()):\n", " print(doc.vocab[code].text, freq)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RB 3\n", "IN 1\n", ", 1\n", "TO 1\n", "JJ 1\n", ". 1\n", "PRP 1\n", "VBZ 2\n", "VB 1\n", "NN 2\n", "NNP 1\n" ] } ], "source": [ "TAG_counts = doc.count_by(spacy.attrs.TAG)\n", "\n", "for code,freq in sorted(TAG_counts.items()):\n", " print(doc.vocab[code].text, freq)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "acomp 1\n", "advmod 2\n", "aux 1\n", "ccomp 1\n", "dobj 1\n", "neg 1\n", "nsubj 2\n", "pobj 1\n", "prep 1\n", "punct 2\n", "xcomp 1\n", "ROOT 1\n" ] } ], "source": [ "DEP_counts = doc.count_by(spacy.attrs.DEP)\n", "\n", "for code,freq in sorted(DEP_counts.items()):\n", " print(doc.vocab[code].text, freq)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## NER" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Apple 0 5 ORG\n", "U.K. 27 31 GPE\n", "$1 billion 44 54 MONEY\n" ] } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')\n", "\n", "for ent in doc.ents:\n", " print(ent.text, ent.start_char, ent.end_char, ent.label_)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " Apple\n", " ORG\n", "\n", " is looking at buying \n", "\n", " U.K.\n", " GPE\n", "\n", " startup for \n", "\n", " $1 billion\n", " MONEY\n", "\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
Over \n", "\n", " the last quarter\n", " DATE\n", "\n", " \n", "\n", " Apple\n", " ORG\n", "\n", " sold \n", "\n", " nearly 20 thousand\n", " CARDINAL\n", "\n", " \n", "\n", " iPods\n", " PRODUCT\n", "\n", " for a profit of \n", "\n", " $6 million\n", " MONEY\n", "\n", ".
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')\n", "displacy.render(doc, style='ent', jupyter=True,options={'distance':130})" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Text Feature Extraction" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## CountVectorizer\n", "Transforming text into a vector" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n", " 'summer', 'the', 'winter'], dtype=object)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sklearn\n", "# Count vectorization\n", "texts = [\"Summer is coming but Summer is short\", \n", " \"I like the Summer and I like the Winter\", \n", " \"I like sandwiches and I like the Winter\"]\n", "\n", "\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "\n", "\n", "# Count occurrences of unique words\n", "# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html \n", "\n", "vectorizer = CountVectorizer()\n", "X = vectorizer.fit_transform(texts)\n", "vectorizer.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 2 0 0 1 2 0 0]\n", " [1 0 0 0 2 0 0 1 2 1]\n", " [1 0 0 0 2 1 0 0 1 1]]\n" ] } ], "source": [ "print(X.toarray())" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array(['and like', 'but summer', 'coming but', 'is coming', 'is short',\n", " 'like sandwiches', 'like the', 'sandwiches and', 'summer and',\n", " 'summer is', 'the summer', 'the winter'], dtype=object)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2)) # only bigrams; (1,2):unigram, bigram\n", "X2 = vectorizer2.fit_transform(texts)\n", "vectorizer2.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 1 1 0 0 0 0 2 0 0]\n", " [1 0 0 0 0 0 2 0 1 0 1 1]\n", " [1 0 0 0 0 1 1 1 0 0 0 1]]\n" ] } ], "source": [ "print(X2.toarray())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Remove stop words" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array(['coming', 'like', 'sandwiches', 'short', 'summer', 'winter'],\n", " dtype=object)" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer3 = CountVectorizer(analyzer='word', stop_words='english') # only bigrams\n", "X3 = vectorizer3.fit_transform(texts)\n", "vectorizer3.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "[[1 0 0 1 2 0]\n", " [0 2 0 0 1 1]\n", " [0 2 1 0 0 1]]\n" ] } ], "source": [ "print(X3.toarray())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TF-IDF" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array(['and', 'but', 'coming', 'is', 'like', 'sandwiches', 'short',\n", " 'summer', 'the', 'winter'], dtype=object)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "vect = TfidfVectorizer()\n", "XTIDF = vect.fit_transform(texts)\n", "vect.get_feature_names_out()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[0 1 1 2 0 0 1 2 0 0]\n", " [1 0 0 0 2 0 0 1 2 1]\n", " [1 0 0 0 2 1 0 0 1 1]]\n", "[[0. 0.32767345 0.32767345 0.65534691 0. 0.\n", " 0.32767345 0.49840822 0. 0. ]\n", " [0.30151134 0. 0. 0. 0.60302269 0.\n", " 0. 0.30151134 0.60302269 0.30151134]\n", " [0.33846987 0. 0. 0. 0.67693975 0.44504721\n", " 0. 0. 0.33846987 0.33846987]]\n" ] } ], "source": [ "# Counter\n", "print(X.toarray())\n", "# TF-IDF\n", "print(XTIDF.toarray())" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['coming' 'like' 'sandwiches' 'short' 'summer' 'winter']\n", "[[1 0 0 1 2 0]\n", " [0 2 0 0 1 1]\n", " [0 2 1 0 0 1]]\n", "[[0.48148213 0. 0. 0.48148213 0.73235914 0. ]\n", " [0. 0.81649658 0. 0. 0.40824829 0.40824829]\n", " [0. 0.77100584 0.50689001 0. 0. 0.38550292]]\n" ] } ], "source": [ "vect3 = TfidfVectorizer(analyzer='word', stop_words='english')\n", "XTIDF3 = vect3.fit_transform(texts)\n", "print(vect3.get_feature_names_out())\n", "print(X3.toarray())\n", "print(XTIDF3.toarray())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Classifying spam\n", "We will use the sms spam collection dataset taken from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelmessage
0hamGo until jurong point, crazy.. Available only ...
1hamOk lar... Joking wif u oni...
2spamFree entry in 2 a wkly comp to win FA Cup fina...
3hamU dun say so early hor... U c already then say...
4hamNah I don't think he goes to usf, he lives aro...
\n", "
" ], "text/plain": [ " label message\n", "0 ham Go until jurong point, crazy.. Available only ...\n", "1 ham Ok lar... Joking wif u oni...\n", "2 spam Free entry in 2 a wkly comp to win FA Cup fina...\n", "3 ham U dun say so early hor... U c already then say...\n", "4 ham Nah I don't think he goes to usf, he lives aro..." ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "column_names = ['label', 'message']\n", "df = pd.read_csv('SMSSpamCollection', sep='\\t', names=column_names, header=None)\n", "df[0:5]" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "label 0\n", "message 0\n", "dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#check missing values\n", "df.isnull().sum()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# df['label'].replace('', np.nan, inplace=True)\n", "# df['message'].replace('', np.nan, inplace=True)\n", "# df.dropna(inplace=True)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "ham 4825\n", "spam 747\n", "Name: label, dtype: int64" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['label'].value_counts()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# spling training and testing\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "X = df['message']\n", "y = df['label']" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "#train\n", "X_train, X_test, y_train, y_test, = train_test_split(X, y, test_size=0.33, random_state=42)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "count_vect = CountVectorizer()" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# fit vectorizer to data: build dictionary, count words,...\n", "# transform: transform original text message to the vector\n", "X_train_counts = count_vect.fit_transform(X_train)" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(3733,)" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "<3733x7082 sparse matrix of type ''\n", "\twith 49992 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train_counts" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, 
"source": [ "We see the vocabulary are 7082 words, but most values are zeros" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfTransformer" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(3733, 7082)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_transformer = TfidfTransformer()\n", "X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)\n", "X_train_tfidf.shape" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Instead of first count vectorization and then tf-idf transformation, better TF-IDF vectorizer, which makes these two things" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "vectorizer = TfidfVectorizer()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "X_train_tfidf = vectorizer.fit_transform(X_train)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Classifier" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.svm import LinearSVC" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "clf = LinearSVC()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
LinearSVC()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "LinearSVC()" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(X_train_tfidf, y_train)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pipeline\n", "Simple way to define the processing steps for repeating the operation." ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" ], "text/plain": [ "Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_clf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "predictions = text_clf.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from sklearn.metrics import confusion_matrix, classification_report" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[1586 7]\n", " [ 12 234]]\n" ] } ], "source": [ "print(confusion_matrix(y_test, predictions))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " ham 0.99 1.00 0.99 1593\n", " spam 0.97 0.95 0.96 246\n", "\n", " accuracy 0.99 1839\n", " macro avg 0.98 0.97 0.98 1839\n", "weighted avg 0.99 0.99 0.99 1839\n", "\n" ] } ], "source": [ "print(classification_report(y_test, predictions))" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "0.989668297988037" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import metrics\n", "metrics.accuracy_score(y_test, predictions)" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array(['ham'], dtype=object)" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_clf.predict([\"This is a summer school\"])" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "array(['spam'], dtype=object)" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_clf.predict([\"Free tickets and CASH\"])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Vectors and Similarity\n", "You need to install previously spacy if not installed:\n", "* `pip install spacy`\n", "* or `conda install -c conda-forge spacy`\n", "\n", "and install the English models (large or medium):\n", "* `python -m spacy download en_core_web_md`\n", "* `python -m spacy download en_core_web_lg`\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import spacy\n", "nlp = spacy.load('en_core_web_lg')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "array([ 1.8023e+00, 3.9075e+00, -4.2940e+00, -7.6117e+00, -3.7172e+00,\n", " -1.5229e-01, -1.1368e+00, -6.8427e-01, -9.3067e-01, 5.6531e+00,\n", " 4.2536e+00, -4.1175e+00, -8.3049e-01, 2.7701e+00, 6.4474e+00,\n", " -6.6389e-02, -8.3026e-01, -7.4532e+00, 1.7888e-01, 2.5130e+00,\n", " -4.4785e-01, 8.4806e+00, -2.7056e+00, -6.9836e+00, 9.2242e-01,\n", " -3.3579e+00, -3.2071e+00, 1.2901e-01, 3.5933e+00, -4.8096e+00,\n", " 3.2596e-01, -3.0782e-01, 
-3.8023e+00, -1.2818e-01, 9.7322e-02,\n", " 1.0876e+00, -4.5140e+00, -8.5375e-02, -4.4139e+00, -1.4073e+00,\n", " -2.4729e+00, 1.3307e-01, 3.1949e+00, 2.9971e+00, 5.3643e+00,\n", " -3.2407e+00, -2.7512e+00, 3.6586e-01, 2.7333e-01, 6.6513e+00,\n", " 4.8740e+00, 1.3732e+00, -7.3595e-01, -2.3265e+00, 1.4045e+00,\n", " 1.5080e-01, 3.1985e+00, -5.7459e+00, 3.5059e+00, 8.1671e-01,\n", " -1.1113e+00, -8.9306e-01, -4.2963e+00, 8.4042e-01, -8.3586e-01,\n", " -2.5407e+00, -1.1414e+00, -5.5050e+00, -3.6670e+00, 1.7393e+00,\n", " -1.9284e+00, 2.7994e+00, 4.4476e+00, -1.0855e+00, 2.5439e+00,\n", " -1.8681e+00, 2.1162e+00, 5.5460e+00, 2.8248e+00, -1.1810e+00,\n", " -9.3259e-01, -1.8681e+00, -3.0654e-02, -3.4096e+00, 2.0261e+00,\n", " -5.4005e-01, 8.2070e-01, 4.3283e+00, -3.4484e+00, -2.1291e+00,\n", " 1.2265e+00, -4.4106e-01, 3.8392e+00, -5.8643e+00, -7.3440e-01,\n", " 1.9785e+00, 4.1928e+00, 1.4577e+00, 2.8668e+00, -6.3762e+00,\n", " 2.7575e+00, 1.7991e+00, 1.3388e-02, -5.1316e-01, -6.3303e+00,\n", " -2.5989e+00, -1.0406e+00, 1.8325e+00, 9.6654e-02, 4.4002e+00,\n", " -1.3231e+00, 2.7717e+00, 4.3340e+00, 2.9027e-01, 7.2542e+00,\n", " -1.2149e+00, -1.7366e+00, -5.2755e+00, 7.5762e-01, -6.0150e+00,\n", " 2.1634e+00, -1.6577e+00, -6.4410e+00, 2.5107e+00, -7.6881e+00,\n", " -6.3143e-01, 6.0914e+00, 4.7114e+00, 1.0778e+00, 1.8121e+00,\n", " -3.1133e+00, -5.5923e+00, 5.0992e-01, -2.2783e+00, 1.3641e+00,\n", " 3.4367e+00, -1.0224e+00, -3.1824e+00, 2.0683e+00, 2.0398e+00,\n", " -8.2011e+00, 4.5388e-01, 2.7002e+00, 3.9199e+00, -5.5184e-01,\n", " -3.3309e+00, -3.8620e+00, 1.7020e-01, 4.9659e+00, 6.9592e-01,\n", " 3.4792e+00, -2.7438e+00, -6.0489e-01, 1.9883e-02, 2.3192e-01,\n", " -4.0591e-01, 3.9470e+00, 1.4145e+00, -8.4031e-01, -1.9433e+00,\n", " -2.5783e+00, -6.8732e+00, -3.7792e+00, 6.4090e+00, -2.3963e+00,\n", " -3.1485e+00, 2.2938e+00, -1.3649e+00, -1.3070e+00, -7.4143e-01,\n", " 3.5752e+00, 3.1999e+00, -2.7599e+00, 3.9996e+00, -2.6275e+00,\n", " -3.2632e+00, -2.7695e+00, -2.0046e+00, 3.4848e-01, -3.7322e+00,\n", " 3.9018e+00, 1.1883e-02, 6.7589e+00, -4.2182e+00, -1.7291e+00,\n", " 1.3949e+00, 5.9161e-01, -4.0226e+00, 1.7388e+00, -1.9609e+00,\n", " -5.4280e-02, 1.4707e+00, -4.2497e+00, -7.4698e-01, 5.7317e+00,\n", " -5.9729e+00, 4.3627e-01, 6.9487e+00, -2.9021e+00, 2.8235e+00,\n", " 4.4695e+00, -2.7154e+00, -1.7771e+00, -1.6288e+00, -4.9338e+00,\n", " -2.1144e+00, 1.4976e+00, -4.4156e+00, -3.3974e+00, -9.0295e+00,\n", " 2.1685e+00, -1.7372e+00, -8.9336e-03, 2.2437e+00, -1.3924e+00,\n", " -2.5530e+00, -2.0714e+00, 2.0850e+00, -5.2257e+00, 8.4517e-01,\n", " 1.6804e+00, -7.9530e+00, 3.8700e+00, 9.2134e+00, -4.5150e+00,\n", " 2.8401e+00, 5.1596e-01, -3.7684e+00, 2.3126e+00, 2.2748e+00,\n", " -4.7895e+00, -2.3299e+00, -2.3546e+00, -2.0999e+00, -3.7111e+00,\n", " 1.4847e+00, -1.6953e+00, 4.9883e+00, 2.5845e-01, 4.1598e+00,\n", " -8.4808e-01, 3.1341e+00, 4.1797e+00, -9.9561e-01, 1.1814e+00,\n", " -3.0735e+00, -2.7010e+00, -9.5470e-01, 1.4944e+00, 2.4461e+00,\n", " -1.2699e+00, 2.3195e+00, -7.0078e-01, 2.6868e+00, -2.9822e+00,\n", " 3.8670e+00, 3.1915e+00, 3.2350e+00, -3.2919e+00, 4.2211e-01,\n", " 3.9947e+00, -1.4124e+00, -2.1844e+00, -3.0904e+00, 2.3693e+00,\n", " -2.8532e+00, -7.5463e-01, -3.6133e+00, -7.8667e+00, 4.7647e+00,\n", " 2.6976e+00, -2.6137e-01, -5.2056e+00, -2.2392e+00, 2.7426e+00,\n", " -1.2172e+00, -4.4441e-02, -3.1014e+00, -4.7598e+00, 5.2652e+00,\n", " -4.0911e+00, -4.9625e+00, 2.8234e-01, 1.5329e+00, 5.3542e+00,\n", " -1.5295e+00, -3.5151e+00, -1.5575e+00, -3.6066e+00, -3.2199e+00,\n", " 
4.5560e+00, -3.6332e-01, 1.6928e+00, -2.5321e+00, -4.1381e+00,\n", " -3.4422e+00, 2.4066e+00, 6.1191e+00, -1.1493e+00, 3.0401e+00],\n", " dtype=float32)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp(u'girl').vector" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(300,)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp(u'girl').vector.shape" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "(300,)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Document vector: vector with the average of single words\n", "nlp(u'the girl is blond').vector.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "doc = nlp(u'cat lion dog pet')\n", "#doc = nlp(u'buy sell rent')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "cat cat 1.0\n", "cat lion 0.3854507803916931\n", "cat dog 0.8220816850662231\n", "cat pet 0.732966423034668\n", "lion cat 0.3854507803916931\n", "lion lion 1.0\n", "lion dog 0.2949307858943939\n", "lion pet 0.20031584799289703\n", "dog cat 0.8220816850662231\n", "dog lion 0.2949307858943939\n", "dog dog 1.0\n", "dog pet 0.7856059074401855\n", "pet cat 0.732966423034668\n", "pet lion 0.20031584799289703\n", "pet dog 0.7856059074401855\n", "pet pet 1.0\n" ] } ], "source": [ "for word1 in doc:\n", " for word2 in doc:\n", " print(word1.text, word2.text, word1.similarity(word2))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "(514157, 300)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nlp.vocab.vectors.shape" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n", "True\n" ] } ], "source": [ "doc = nlp(u'catr')\n", "token = doc[0]\n", "print(token.has_vector)\n", "print(token.is_oov)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "from scipy import spatial\n", "\n", "cosine_similarity = lambda v1, v2: 1- spatial.distance.cosine(v1, v2)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "king = nlp.vocab['king'].vector\n", "man = nlp.vocab['man'].vector\n", "woman = nlp.vocab['woman'].vector" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "# king - man + woman\n", "new_vector = king-man+ woman" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "computed_similarity = []\n", "for id in nlp.vocab.vectors:\n", " word = nlp.vocab[id]\n", " if word.has_vector:\n", " if word.is_lower:\n", " if word.is_alpha: \n", " similarity = cosine_similarity(new_vector, word.vector)\n", " computed_similarity.append((word, 
similarity))" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['king', 'kings', 'princes', 'consort', 'princeling', 'monarch', 'princelings', 'princesses', 'prince', 'kingship', 'princess', 'ruler', 'consorts', 'kingi', 'princedom', 'rulers', 'kingii', 'enthronement', 'monarchical', 'queen', 'monarchs', 'enthroning', 'queening', 'regents', 'principality', 'kingsize', 'throne', 'princesa', 'dynastic', 'princedoms', 'nobility', 'monarchic', 'imperial', 'princesse', 'rulership', 'courtiers', 'dynasties', 'monarchial', 'kingdom', 'predynastic', 'enthrone', 'succession', 'princely', 'royal', 'kingly', 'mcqueen', 'dethronement', 'royally', 'emperor', 'princeps']\n" ] } ], "source": [ "computed_similarity = sorted(computed_similarity, key=lambda item: -item[1])\n", "print([t[0].text for t in computed_similarity[:50]])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## References\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* [Spacy](https://spacy.io/usage/spacy-101/#annotations)\n", "* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n", "* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009](http://www.nltk.org/book_1ed/)\n", "* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n", "* Natural Language Processing with Python, José Portilla, 2019." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 1 }