mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-16 19:42:28 +00:00
334 lines
7.4 KiB
Plaintext
334 lines
7.4 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [First steps](#First-steps)\n",
|
|
"* [Movie review](#Movie-review)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# First steps\n",
|
|
"Given the text taken from https://www.romania-insider.com/baneasa-airport-reopening-date-jul-2022.\n",
|
|
"\n",
|
|
"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"text = \"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "subslide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 1. List the first 10 tokens of the doc"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 2. Number of tokens of the text."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 3. List the Noun chunks\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 4. Print the sentences of the text"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 5. Print the number of sentences of the text\n",
|
|
"Hint: build a list first"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 6. Print the second sentence. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 7. Visualize the dependency grammar analysis of the second sentence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 8. Listing lemmas and deps\n",
|
|
"For every token in the second sentence, print the text token, the grammatical category, and the lemma in four columns.\n",
|
|
"\n",
|
|
"Example:\n",
|
|
"\n",
|
|
"you  PRON  you  nsubj\n",
|
|
"\n",
|
|
"Hint: format the columns. You can use expandtabs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 9. List frequencies of POS in the document in a table "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 10. Preprocessing\n",
|
|
"\n",
|
|
"Remove from the doc stopwords, digits and punctuation.\n",
|
|
"\n",
|
|
"Hint: check the token api https://spacy.io/api/token\n",
|
|
"\n",
|
|
"Print the number of tokens before and after preprocessing."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 11. Entities of the document\n",
|
|
"Print the entities of the document, the type of the entity and what the explanation of the entity in a table with three columns.\n",
|
|
"\n",
|
|
"Example:\n",
|
|
"\n",
|
|
"Ubuntu    ORG    Companies, agencies, institutions, etc."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"### 12. Visualize the entities\n",
|
|
"Show the entities in a graph."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "slide"
|
|
}
|
|
},
|
|
"source": [
|
|
"# Movie review\n",
|
|
"\n",
|
|
"Classify the rmoview reviews from the following dataset https://data.world/rajeevsharma993/movie-reviews"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"## References\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"* [Spacy](https://spacy.io/usage/spacy-101/#annotations) \n",
|
|
"* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
|
|
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
|
|
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
|
|
"* Natural Language Processing with Python, José Portilla, 2019."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"slideshow": {
|
|
"slide_type": "skip"
|
|
}
|
|
},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"celltoolbar": "Slideshow",
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.10"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|