{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Table of Contents\n", "* [Objectives](#Objectives)\n", "* [NLP Basics](#NLP-Basics)\n", " * [Spacy installation](#Spacy-installation)\n", " * [Spacy pipeline](#Spacy-pipeline)\n", " * [Tokenization](#Tokenization)\n", " * [Noun chunks](#Noun-chunks)\n", " * [Stemming](#Stemming)\n", " * [Sentence segmentation](#Sentence-segmentation)\n", " * [Lemmatization](#Lemmatization)\n", " * [Stop words](#Stop-words)\n", " * [POS](#POS)\n", " * [NER](#NER)\n", "* [Text Feature extraction](#Text-Feature-extraction)\n", "* [Classifying spam](#Classifying-spam)\n", "* [Vectors and similarity](#Vectors-and-similarity)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Objectives" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In this session we are going to learn the power of transformers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transformers\n", "As we saw, transformers are an extremely powerful architecture capable of performing many popular NLP tasks.\n", "\n", "A well-known transformer model repository is available at https://huggingface.co/. \n", "\n", "Let's see how to use it. To go deeper, consult the Hugging tutorial (https://huggingface.co/learn/nlp-course/chapter1/1).\n", "\n", "The transformers package requires to have installed Pytorch or TensorFlow. Check the installation details if you want to configure your environment well. For learning purposes, we are going to install Pytorch.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First of all, you should install Hugging Face. 
Execute:\n", "* pip install torch transformers" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: torch in /home/cif/anaconda3/lib/python3.11/site-packages (2.3.0)\n", "Requirement already satisfied: transformers in /home/cif/anaconda3/lib/python3.11/site-packages (4.40.2)\n", "Requirement already satisfied: filelock in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (3.13.4)\n", "Requirement already satisfied: typing-extensions>=4.8.0 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (4.11.0)\n", "Requirement already satisfied: sympy in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (1.12)\n", "Requirement already satisfied: networkx in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (3.3)\n", "Requirement already satisfied: jinja2 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (3.1.3)\n", "Requirement already satisfied: fsspec in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (2023.10.0)\n", "Requirement already satisfied: nvidia-cuda-nvrtc-cu12==12.1.105 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-runtime-cu12==12.1.105 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.105)\n", "Requirement already satisfied: nvidia-cuda-cupti-cu12==12.1.105 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.105)\n", "Requirement already satisfied: nvidia-cudnn-cu12==8.9.2.26 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (8.9.2.26)\n", "Requirement already satisfied: nvidia-cublas-cu12==12.1.3.1 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.3.1)\n", "Requirement already satisfied: nvidia-cufft-cu12==11.0.2.54 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (11.0.2.54)\n", "Requirement already satisfied: nvidia-curand-cu12==10.3.2.106 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (10.3.2.106)\n", "Requirement already satisfied: nvidia-cusolver-cu12==11.4.5.107 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (11.4.5.107)\n", "Requirement already satisfied: nvidia-cusparse-cu12==12.1.0.106 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.0.106)\n", "Requirement already satisfied: nvidia-nccl-cu12==2.20.5 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (2.20.5)\n", "Requirement already satisfied: nvidia-nvtx-cu12==12.1.105 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (12.1.105)\n", "Requirement already satisfied: triton==2.3.0 in /home/cif/anaconda3/lib/python3.11/site-packages (from torch) (2.3.0)\n", "Requirement already satisfied: nvidia-nvjitlink-cu12 in /home/cif/anaconda3/lib/python3.11/site-packages (from nvidia-cusolver-cu12==11.4.5.107->torch) (12.4.127)\n", "Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (0.23.0)\n", "Requirement already satisfied: numpy>=1.17 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (1.26.4)\n", "Requirement already satisfied: packaging>=20.0 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (24.0)\n", "Requirement already satisfied: pyyaml>=5.1 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (6.0.1)\n", "Requirement already satisfied: 
regex!=2019.12.17 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (2023.12.25)\n", "Requirement already satisfied: requests in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (2.31.0)\n", "Requirement already satisfied: tokenizers<0.20,>=0.19 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (0.19.1)\n", "Requirement already satisfied: safetensors>=0.4.1 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (0.4.3)\n", "Requirement already satisfied: tqdm>=4.27 in /home/cif/anaconda3/lib/python3.11/site-packages (from transformers) (4.66.2)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /home/cif/anaconda3/lib/python3.11/site-packages (from jinja2->torch) (2.1.5)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /home/cif/anaconda3/lib/python3.11/site-packages (from requests->transformers) (2.0.4)\n", "Requirement already satisfied: idna<4,>=2.5 in /home/cif/anaconda3/lib/python3.11/site-packages (from requests->transformers) (3.7)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/cif/anaconda3/lib/python3.11/site-packages (from requests->transformers) (2.2.1)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/cif/anaconda3/lib/python3.11/site-packages (from requests->transformers) (2024.2.2)\n", "Requirement already satisfied: mpmath>=0.19 in /home/cif/anaconda3/lib/python3.11/site-packages (from sympy->torch) (1.3.0)\n" ] } ], "source": [ "!pip install torch transformers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Use cases: how to use the pipeline API" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sentiment Analysis\n", "Let's classify sentiments" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'label': 'positive', 'score': 0.974217414855957}]\n", "[{'label': 'negative', 'score': 0.9310991168022156}, {'label': 'neutral', 'score': 0.5152537226676941}]\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "from transformers import logging\n", "\n", "logging.set_verbosity_error()\n", "#logging.set_verbosity_warning()\n", "\n", "model_sentiment = \"cardiffnlp/twitter-roberta-base-sentiment-latest\"\n", "\n", "sentiment_pipe = pipeline(\"sentiment-analysis\", model=model_sentiment)\n", "\n", "print(sentiment_pipe(\"I love LLMs.\"))\n", "\n", "# We can also pass a list of sentences\n", "print(sentiment_pipe([\"I hate LLMs.\", \"I don't care about LLMs\"]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Translation\n", "Let's translate a sentence" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'translation_text': 'Il s’agit du cours de traitement des langues naturelles'}]\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "# if no model is specified, this task defaults to Google's T5 model\n", "translator_en_fr = pipeline(\"translation_en_to_fr\")\n", "\n", "print(translator_en_fr(\"This is the course of Natural Language Processing\", max_length=40))" ] }
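, { "cell_type": "markdown", "metadata": {}, "source": [ "Any translation model from the Hub can also be passed explicitly instead of relying on the task default. Below is a minimal sketch (not executed here) that assumes the Helsinki-NLP/opus-mt-en-es English-to-Spanish model:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "# A sketch: pick an explicit English-to-Spanish model from the Hub\n", "# instead of the task default.\n", "translator_en_es = pipeline(\"translation\", model=\"Helsinki-NLP/opus-mt-en-es\")\n", "\n", "print(translator_en_es(\"This is the course of Natural Language Processing\"))" ] }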
, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conversation\n", "Let's create a chatbot" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Conversation id: e0ab78fe-bdbf-4941-a623-e4b8705e9e2e\n", "user: Hi, I'm Peter, how are you?\n", "assistant: I'm doing well. How are you doing this evening? I just got home from work.\n", "\n", "Conversation id: 20180c79-dd50-4541-a432-c92378b8bc31\n", "user: Can I have lunch with you?\n", "assistant: Sure, what do you want to eat? I'll make you a sandwich. I love sandwiches.\n", "\n", "Conversation id: a7b3d314-6ff0-4756-8a0b-25806c9281a5\n", "user: Do you like Paella?\n", "assistant: I love it! I make it at least once a week. It's one of my favorite dishes.\n", "\n", "Conversation id: c9b6dfd0-10a6-4d5e-abfc-7f090119e63a\n", "user: What do you know about Paella?\n", "assistant: I know that it is a traditional Italian dish consisting of rice and meatballs.\n", "\n" ] } ], "source": [ "from transformers import pipeline, Conversation\n", "\n", "chatbot = pipeline(task=\"conversational\", model=\"facebook/blenderbot-400M-distill\")\n", "\n", "conversation = Conversation(\"Hi, I'm Peter, how are you?\")\n", "conversation = chatbot(conversation)\n", "print(conversation)\n", "\n", "conversation = Conversation(\"Can I have lunch with you?\")\n", "conversation = chatbot(conversation)\n", "print(conversation)\n", "\n", "conversation = Conversation('Do you like Paella?')\n", "conversation = chatbot(conversation)\n", "print(conversation)\n", "\n", "conversation = Conversation('What do you know about Paella?')\n", "conversation = chatbot(conversation)\n", "print(conversation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Masked word completion\n", "Let's generate candidate words for a masked token" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'score': 0.32434332370758057,\n", " 'token': 2795,\n", " 'token_str': 'table',\n", " 'sequence': 'hello, im am eating at a table.'},\n", " {'score': 0.3150143623352051,\n", " 'token': 4825,\n", " 'token_str': 'restaurant',\n", " 'sequence': 'hello, im am eating at a restaurant.'},\n", " {'score': 0.07178690284490585,\n", " 'token': 3347,\n", " 'token_str': 'bar',\n", " 'sequence': 'hello, im am eating at a bar.'},\n", " {'score': 0.04275984689593315,\n", " 'token': 15736,\n", " 'token_str': 'diner',\n", " 'sequence': 'hello, im am eating at a diner.'},\n", " {'score': 0.032276701182127,\n", " 'token': 28305,\n", " 'token_str': 'buffet',\n", " 'sequence': 'hello, im am eating at a buffet.'}]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "unmasker = pipeline('fill-mask', model='bert-base-uncased')\n", "unmasker(\"Hello, Im am eating at a [MASK].\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### NER\n", "Let's detect named entities" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'entity_group': 'PER',\n", " 'score': 0.9992756,\n", " 'word': 'Peter',\n", " 'start': 0,\n", " 'end': 5},\n", " {'entity_group': 'ORG',\n", " 'score': 0.98043567,\n", " 'word': 'Universidad Politécnica de Madrid',\n", " 'start': 21,\n", " 'end': 54},\n", " {'entity_group': 'LOC',\n", " 'score': 0.9985493,\n", " 'word': 'Madrid',\n", " 'start': 58,\n", " 'end': 64},\n", " {'entity_group': 'LOC',\n", " 'score': 0.99971014,\n", " 'word': 'Spain',\n", " 'start': 66,\n", " 'end': 71}]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "ner = pipeline(\"ner\", aggregation_strategy=\"simple\")\n", "ner(\"Peter has studied at Universidad Politécnica de Madrid in Madrid, Spain\")" ] }
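, { "cell_type": "markdown", "metadata": {}, "source": [ "The aggregation_strategy=\"simple\" argument merges word pieces into whole entities. Without it, the pipeline emits one prediction per word piece. A minimal sketch (not executed here) of the token-level output:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import pipeline\n", "\n", "# A sketch: with no aggregation strategy the pipeline returns one\n", "# prediction per word piece instead of grouped entities.\n", "ner_raw = pipeline(\"ner\")\n", "\n", "for token in ner_raw(\"Peter has studied at Universidad Politécnica de Madrid\"):\n", "    print(token[\"word\"], token[\"entity\"], round(token[\"score\"], 3))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 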
Summarization\n", "Let's generate a summary." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'summary_text': 'Wopke Hoekstra, the EU commissioner for climate action, said Europe had no choice but to press ahead with strong measures to cut greenhouse gases. He said more attention was needed to help businesses thrive in a low-carbon world.'}]\n" ] } ], "source": [ "from transformers import pipeline\n", "\n", "summarizer = pipeline(\"summarization\", model=\"facebook/bart-large-cnn\")\n", "\n", "article = \"\"\"\n", "Europe’s climate chief has warned against politicians trying to use the climate crisis as a wedge issue in the forthcoming EU parliament elections, calling instead for climate policy that will bring wider economic benefits.\n", "Wopke Hoekstra, the EU commissioner for climate action, said Europe had no choice but to press ahead with strong measures to cut greenhouse gases, whoever was in power, but added that more attention was needed to help businesses thrive in a low-carbon world.\n", "He said: “There is no alternative than to continue with climate action. We need to continue in the direction of travel we have set. We need to speed up our pace.”\n", "Rightwing parties are forecast in polls to do well in the election, to be held from 6 to 9 June, largely at the expense of the Greens and socialist parties. Protests by farmers in EU capitals have attacked climate policies, and some rightwing parties have stepped up anti-green rhetoric.\n", "\"\"\"\n", "print(summarizer(article, max_length=130, min_length=30, do_sample=False)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Zero-shot classification\n", "Classification without examples!" 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sequence': 'one day I will see the world',\n", " 'labels': ['travel', 'cooking', 'dancing'],\n", " 'scores': [0.979964017868042, 0.010604988783597946, 0.00943098682910204]}" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "classifier = pipeline('zero-shot-classification', model='roberta-large-mnli')\n", "\n", "sequence_to_classify = \"one day I will see the world\"\n", "candidate_labels = ['travel', 'cooking', 'dancing']\n", "classifier(sequence_to_classify, candidate_labels)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'sequence': 'The CEO had a strong handshake.',\n", " 'labels': ['male', 'female'],\n", " 'scores': [0.8384838104248047, 0.16151617467403412]}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sequence_to_classify = \"The CEO had a strong handshake.\"\n", "candidate_labels = ['male', 'female']\n", "hypothesis_template = \"This text speaks about a {} profession.\"\n", "classifier(sequence_to_classify, candidate_labels, hypothesis_template=hypothesis_template)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'sequence': 'Nadal has won the last match',\n", " 'labels': ['sport', 'culture', 'economics', 'politics'],\n", " 'scores': [0.8608443140983582,\n", " 0.07932569086551666,\n", " 0.03197338432073593,\n", " 0.027856575325131416]},\n", " {'sequence': 'There is an election in Bulgaria',\n", " 'labels': ['politics', 'culture', 'economics', 'sport'],\n", " 'scores': [0.962326169013977,\n", " 0.01514720730483532,\n", " 0.012851395644247532,\n", " 0.009675216861069202]},\n", " {'sequence': 'The oil price is very high',\n", " 'labels': ['economics', 'culture', 'politics', 'sport'],\n", " 'scores': [0.8462415933609009,\n", " 0.06119668856263161,\n", " 0.04652851074934006,\n", " 0.04603322222828865]},\n", " {'sequence': 'The new film by Almodovar has been just released',\n", " 'labels': ['culture', 'politics', 'sport', 'economics'],\n", " 'scores': [0.711652934551239,\n", " 0.12886476516723633,\n", " 0.10017038881778717,\n", " 0.0593118779361248]}]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentences = [\"Nadal has won the last match\", \"There is an election in Bulgaria\", \"The oil price is very high\", \"The new film by Almodovar has been just released\"]\n", "candidate_labels = ['sport', 'politics', 'culture', 'economics']\n", "classifier(sentences, candidate_labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text generation\n", "Let's generate" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'generated_text': \"This articles aims evaluating transformers' capabilities, effectiveness and cost of use.\"}]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "generation = pipeline(\"text-generation\")\n", "\n", "generation(\"This articles aims evaluating transformers' capabilities\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question-Answering\n", "Let's create a QA!" 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'score': 0.8235691785812378,\n", " 'start': 52,\n", " 'end': 77,\n", " 'answer': 'Alcobendas, Madrid, Spain'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "qa = pipeline(\"question-answering\")\n", "\n", "qa(\n", " question = \"Where was born Penelope Cruz?\",\n", " context = '''\n", " Cruz was born on April 28, 1974, in Alcobendas, Madrid, Spain. \n", " In July 2010, Cruz married her Vicky Cristina Barcelona co-star, \n", " Spanish actor Javier Bardem. The couple had begun dating early into filming, in 2007.\n", " '''\n", ")" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'score': 0.4068852663040161,\n", " 'start': 126,\n", " 'end': 187,\n", " 'answer': 'Vicky Cristina Barcelona co-star, Spanish actor Javier Bardem'}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "qa(\n", " question = \"Who is Penelope Cruz' husband?\",\n", " context = '''\n", " Cruz was born on April 28, 1974, in Alcobendas, Madrid, Spain. \n", " In July 2010, Cruz married her Vicky Cristina Barcelona co-star,\n", " Spanish actor Javier Bardem. The couple had begun dating early into filming, in 2007.\n", " '''\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Text-to-Speech" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from transformers import pipeline\n", "\n", "pipe = pipeline(\"text-to-speech\", model=\"suno/bark-small\")\n", "text = \"[clears throat] This is a test ... and I just took a long pause.\"\n", "output = pipe(text)\n", "\n", "from IPython.display import Audio \n", "\n", "Audio(output[\"audio\"], rate=output[\"sampling_rate\"])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## References\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "* [Hugging Face Tutorial](https://huggingface.co/learn/nlp-course/chapter1/1) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." 
] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.7" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 4 }