mirror of https://github.com/gsi-upm/sitc synced 2025-11-21 16:18:17 +00:00

Compare commits

156 Commits

Author SHA1 Message Date
Carlos A. Iglesias
9844820e66 Delete xai/readme 2025-06-06 17:24:29 +03:00
Carlos A. Iglesias
d10434362e Add files via upload 2025-06-06 17:24:05 +03:00
Carlos A. Iglesias
fb2135cea6 Create readme 2025-06-06 17:23:37 +03:00
Carlos A. Iglesias
ba6e533e0b Add files via upload
XAI notebook
2025-06-06 17:23:16 +03:00
Carlos A. Iglesias
4f5e976918 Create readme 2025-06-06 17:22:33 +03:00
Carlos A. Iglesias
b58370a19a Update .gitignore 2025-06-02 17:23:44 +03:00
Carlos A. Iglesias
5c203b0884 Update spiral.py
Fixed typo
2025-06-02 17:22:55 +03:00
Carlos A. Iglesias
5bf815f60f Update 2_4_2_Exercise_Optional.ipynb
Changed image path
2025-06-02 17:22:16 +03:00
Carlos A. Iglesias
90a3ff098b Update 2_4_1_Exercise.ipynb
Changed image path
2025-06-02 17:21:25 +03:00
Carlos A. Iglesias
945a8a7fb6 Update 2_4_0_Intro_NN.ipynb
Changed image path
2025-06-02 17:19:19 +03:00
Carlos A. Iglesias
6532ef1b27 Update 2_8_Conclusions.ipynb
Changed image path
2025-06-02 17:18:31 +03:00
Carlos A. Iglesias
3a73b2b286 Update 2_7_Model_Persistence.ipynb
Changed image path
2025-06-02 17:17:43 +03:00
Carlos A. Iglesias
2e4ec3cfdc Update 2_6_Model_Tuning.ipynb 2025-06-02 17:16:53 +03:00
Carlos A. Iglesias
21e7ae2f57 Update 2_5_2_Decision_Tree_Model.ipynb
Changed image path
2025-06-02 17:13:49 +03:00
Carlos A. Iglesias
7b4d16964d Update 2_5_1_kNN_Model.ipynb
Changed image path
2025-06-02 17:11:45 +03:00
Carlos A. Iglesias
c5967746ea Update 2_5_0_Machine_Learning.ipynb 2025-06-02 17:09:42 +03:00
Carlos A. Iglesias
ed7f0f3e1c Update 2_5_0_Machine_Learning.ipynb 2025-06-02 17:09:13 +03:00
Carlos A. Iglesias
9324516c19 Update 2_5_0_Machine_Learning.ipynb
Changed image path
2025-06-02 17:08:03 +03:00
Carlos A. Iglesias
6fc5565ea0 Update 2_2_Read_Data.ipynb 2025-06-02 17:05:17 +03:00
Carlos A. Iglesias
1113485833 Add files via upload 2025-06-02 17:03:20 +03:00
Carlos A. Iglesias
0c3f317a85 Add files via upload 2025-06-02 17:02:46 +03:00
Carlos A. Iglesias
0b550c837b Update 2_2_Read_Data.ipynb
Added figures
2025-06-02 17:00:58 +03:00
Carlos A. Iglesias
d7ce6df7fe Update 2_2_Read_Data.ipynb 2025-06-02 16:57:54 +03:00
Carlos A. Iglesias
e2edae6049 Update 2_2_Read_Data.ipynb 2025-06-02 16:54:37 +03:00
Carlos A. Iglesias
4ea0146def Update 2_2_Read_Data.ipynb 2025-06-02 16:54:06 +03:00
Carlos A. Iglesias
e7b2cee795 Add files via upload 2025-06-02 16:31:20 +03:00
Carlos A. Iglesias
9e1d0e5534 Add files via upload 2025-06-02 16:30:13 +03:00
Carlos A. Iglesias
f82203f371 Update 2_4_Preprocessing.ipynb
Changed image path
2025-06-02 16:29:26 +03:00
Carlos A. Iglesias
b9ecccdeab Update 2_3_1_Advanced_Visualisation.ipynb 2025-06-02 16:28:06 +03:00
Carlos A. Iglesias
44a555ac2d Update 2_3_1_Advanced_Visualisation.ipynb
Changed image path
2025-06-02 16:09:55 +03:00
Carlos A. Iglesias
ec11ff2d5e Update 2_3_0_Visualisation.ipynb
Changed image path
2025-06-02 16:06:53 +03:00
Carlos A. Iglesias
ec02125396 Update 2_2_Read_Data.ipynb 2025-06-02 16:04:57 +03:00
Carlos A. Iglesias
b5f1a7dd22 Update 2_0_0_Intro_ML.ipynb 2025-06-02 16:03:03 +03:00
Carlos A. Iglesias
1cc1e45673 Update 2_2_Read_Data.ipynb
Changed image path
2025-06-02 16:02:45 +03:00
Carlos A. Iglesias
a2ad2c0e92 Update 2_1_Intro_ScikitLearn.ipynb
Changed images path
2025-06-02 16:00:59 +03:00
Carlos A. Iglesias
1add6a4c8e Update 2_0_1_Objectives.ipynb
Changed image path
2025-06-02 15:58:32 +03:00
Carlos A. Iglesias
af78e6480d Update 2_0_0_Intro_ML.ipynb
changed path to image
2025-06-02 15:57:25 +03:00
Carlos A. Iglesias
cae7d8cbb2 Updated LLM 2025-05-05 16:39:20 +02:00
Carlos A. Iglesias
f58aa6c0b8 Delete nlp/0_1_LLM.ipynb 2025-05-05 16:38:41 +02:00
Carlos A. Iglesias
6e8448f22f Update 0_2_NLP_Assignment.ipynb 2025-04-24 18:31:56 +02:00
Carlos A. Iglesias
8f2a5c17d8 Update 0_1_NLP_Slides.ipynb 2025-04-24 18:30:18 +02:00
Carlos A. Iglesias
36d117e417 Delete nlp/spacy/readme.md 2025-04-21 18:59:11 +02:00
Carlos A. Iglesias
2fc057f6f9 Add files via upload 2025-04-21 18:58:47 +02:00
Carlos A. Iglesias
5b0d4f2a5d Add files via upload 2025-04-21 18:58:15 +02:00
Carlos A. Iglesias
7afa2b3b22 Create readme.md 2025-04-21 18:57:59 +02:00
Carlos A. Iglesias
4e0f9159e8 Update 2_5_1_Exercise.ipynb 2025-04-03 18:54:52 +02:00
Carlos A. Iglesias
82aa552976 Update 2_5_1_Exercise.ipynb 2025-04-03 18:53:35 +02:00
Carlos A. Iglesias
3ebff69cf8 Update 2_5_1_Exercise.ipynb 2025-04-03 18:43:58 +02:00
Carlos A. Iglesias
0f228bbec3 Update 2_5_1_Exercise.ipynb 2025-04-03 18:43:34 +02:00
Carlos A. Iglesias
64c8854741 Update 2_5_1_Exercise.ipynb 2025-04-03 18:41:49 +02:00
Carlos A. Iglesias
3e081e5d83 Update 2_5_1_Exercise.ipynb 2025-04-03 18:38:26 +02:00
Carlos A. Iglesias
065797b886 Update 2_5_1_Exercise.ipynb 2025-04-03 18:37:26 +02:00
Carlos A. Iglesias
8d2f625b7e Update 2_5_1_Exercise.ipynb 2025-04-03 18:36:31 +02:00
Carlos A. Iglesias
26eda30a71 Update 2_5_1_Exercise.ipynb 2025-04-03 18:35:53 +02:00
Carlos A. Iglesias
55365ae927 Update 2_5_1_Exercise.ipynb 2025-04-03 18:34:50 +02:00
Carlos A. Iglesias
152125b3da Update 2_5_1_Exercise.ipynb 2025-04-03 18:33:47 +02:00
Carlos A. Iglesias
97362545ea Update 2_5_1_Exercise.ipynb
Added https://sklearn-genetic-opt.readthedocs.io/en/stable/index.html
2025-04-03 18:32:32 +02:00
cif
c49c866a2e Update notebook with pivot_table examples 2025-03-06 16:05:16 +01:00
Carlos A. Iglesias
3f7694e330 Add files via upload
Added ttl
2025-02-20 19:14:13 +01:00
Carlos A. Iglesias
bf684d6e6e Updated index 2024-06-07 17:54:18 +03:00
Carlos A. Iglesias
d935b85b26 Add files via upload
Added images
2024-06-03 14:44:28 +02:00
Carlos A. Iglesias
1d8e777236 Create .p 2024-06-03 15:42:13 +03:00
Carlos A. Iglesias
23ebe2f390 Update 3_1_Read_Data.ipynb
Updated table markdown
2024-05-21 14:30:26 +02:00
Carlos A. Iglesias
01eb89ada4 New notebook about transformers 2024-05-14 09:55:02 +02:00
Carlos A. Iglesias
e4fdcd65a1 Update 2_6_1_Q-Learning_Basic.ipynb
Updated installation with new version of gymnasium
2024-04-24 18:46:54 +02:00
Carlos A. Iglesias
9f46c534f7 Update 2_5_1_Exercise.ipynb
Added optional exercises.
2024-04-18 18:04:43 +02:00
Carlos A. Iglesias
743c57691f Delete sna/t.txt 2024-04-17 17:24:12 +02:00
Carlos A. Iglesias
2c53b81299 Uploaded SNA files 2024-04-17 17:23:28 +02:00
Carlos A. Iglesias
dd6c053109 Add files via upload 2024-04-17 17:22:36 +02:00
Carlos A. Iglesias
e35e0a11e9 Create t.txt 2024-04-17 17:22:20 +02:00
Carlos A. Iglesias
7315b681e4 Update README.md 2024-04-17 17:21:21 +02:00
Carlos A. Iglesias
3fac9c6f78 Add files via upload 2024-04-04 18:27:48 +02:00
Carlos A. Iglesias
21819abeae Added visualization notebooks 2024-04-03 22:53:02 +02:00
Carlos A. Iglesias
0d4c0c706d Added images 2024-04-03 22:51:58 +02:00
Carlos A. Iglesias
8de629b495 Create .gitkeep 2024-04-03 22:51:19 +02:00
Carlos A. Iglesias
86114b4a56 Added preprocessing notebooks 2024-04-03 22:50:36 +02:00
Carlos A. Iglesias
1a3f618995 Add files via upload 2024-04-03 21:52:25 +02:00
Carlos A. Iglesias
a1121c03a5 Create .gitkeep - Added preprocessing notebooks 2024-04-03 21:51:34 +02:00
Carlos A. Iglesias
715d0cb77f Create .gitkeep
Added new set of exercises
2024-04-03 21:50:50 +02:00
Carlos A. Iglesias
0150ce7cf7 Update 3_7_SVM.ipynb
Updated formatted table
2024-02-22 12:23:08 +01:00
Carlos A. Iglesias
08dfe5c147 Update 3_4_Visualisation_Pandas.ipynb
Updated code to last version of seaborn
2024-02-22 11:55:35 +01:00
Carlos A. Iglesias
78e62af098 Update 3_3_Data_Munging_with_Pandas.ipynb
Updated to last version of scikit
2024-02-21 12:29:04 +01:00
Carlos A. Iglesias
3f5eba3e84 Update 3_2_Pandas.ipynb
Updated links
2024-02-21 12:16:12 +01:00
Carlos A. Iglesias
2de1cda8f1 Update 3_1_Read_Data.ipynb
Updated links
2024-02-21 12:14:25 +01:00
Carlos A. Iglesias
cc442c35f3 Update 3_0_0_Intro_ML_2.ipynb
Updated links
2024-02-21 12:12:14 +01:00
Carlos A. Iglesias
1100c352fa Update 2_6_Model_Tuning.ipynb
updated links
2024-02-21 11:47:34 +01:00
Carlos A. Iglesias
9b573d292d Update 2_5_2_Decision_Tree_Model.ipynb
Updated links
2024-02-21 11:41:42 +01:00
Carlos A. Iglesias
dd8a4f50d8 Update 2_5_2_Decision_Tree_Model.ipynb
Updated links
2024-02-21 11:40:59 +01:00
Carlos A. Iglesias
47148f2ccc Update util_ds.py
Updated links
2024-02-21 11:40:06 +01:00
Carlos A. Iglesias
8ffda8123a Update 2_5_1_kNN_Model.ipynb
Updated links
2024-02-21 11:07:38 +01:00
Carlos A. Iglesias
6629837e7d Update 2_5_0_Machine_Learning.ipynb
Updated links
2024-02-21 11:06:21 +01:00
Carlos A. Iglesias
ba08a9a264 Update 2_4_Preprocessing.ipynb
Updated links
2024-02-21 11:02:09 +01:00
Carlos A. Iglesias
4b8fd30f42 Update 2_3_1_Advanced_Visualisation.ipynb
Updated links
2024-02-21 11:00:53 +01:00
Carlos A. Iglesias
d879369930 Update 2_3_0_Visualisation.ipynb
Updated links
2024-02-21 10:57:34 +01:00
Carlos A. Iglesias
4da01f3ae6 Update 2_0_0_Intro_ML.ipynb
Updated links
2024-02-21 10:44:43 +01:00
Carlos A. Iglesias
da9a01e26b Update 2_0_1_Objectives.ipynb
Updated links
2024-02-21 10:43:40 +01:00
Carlos A. Iglesias
dc23b178d7 Delete python/plurals.py 2024-02-08 18:32:43 +01:00
Carlos A. Iglesias
5410d6115d Delete python/catalog.py 2024-02-08 18:32:18 +01:00
Carlos A. Iglesias
6749aa5deb Added files for modules 2024-02-08 18:26:08 +01:00
Carlos A. Iglesias
c31e6c1676 Update 1_2_Numbers_Strings.ipynb 2024-02-08 17:47:42 +01:00
Carlos A. Iglesias
1c7496c8ac Update 1_2_Numbers_Strings.ipynb
Improved formatting.
2024-02-08 17:46:18 +01:00
Carlos A. Iglesias
35b1ae4ec8 Update 1_8_Classes.ipynb
Improved formatting.
2024-02-08 17:43:25 +01:00
Carlos A. Iglesias
58fc6f5e9c Update 1_4_Sets.ipynb
Typo corrected.
2024-02-08 17:42:45 +01:00
Carlos A. Iglesias
91147becee Update 1_3_Sequences.ipynb
Formatting improvement.
2024-02-08 17:41:15 +01:00
Carlos A. Iglesias
1530995243 Update 1_0_Intro_Python.ipynb
Updated links.
2024-02-08 17:36:46 +01:00
Carlos A. Iglesias
0c0960cec7 Update 1_7_Variables.ipynb typo in bold markdown
Typo in bold markdown
2024-02-08 17:33:48 +01:00
cif
3363c953f4 Deleted previous version 2023-04-27 15:43:44 +02:00
cif
542ce2708d Updated assignment to gymnasium and extended it 2023-04-27 15:42:01 +02:00
cif
380340d66d Updated 4_4 to use get_feature_names_out() instead of get_feature_names 2023-04-23 16:41:53 +02:00
cif
7f49f8990b Updated 4_4 - using feature_log_prob_ instead of coef_ (deprecated) 2023-04-23 16:37:48 +02:00
Carlos A. Iglesias
419ea57824 Slides with Spacy 2023-04-20 18:20:44 +02:00
Carlos A. Iglesias
7d6010114d Upload data for assignment 2023-04-20 18:17:12 +02:00
Carlos A. Iglesias
f9d8234e14 Added exercise with Spacy 2023-04-20 16:20:28 +02:00
Carlos A. Iglesias
d41fa61c65 Delete 0_2_NLP_Assignment.ipynb 2023-04-20 16:19:57 +02:00
Carlos A. Iglesias
05a4588acf Exercise with Spacy 2023-04-20 16:18:47 +02:00
Carlos A. Iglesias
50933f6c94 Update 3_7_SVM.ipynb
Fixed typo and updated link
2023-03-09 18:04:14 +01:00
J. Fernando Sánchez
68ba528dd7 Fix typos 2023-02-20 19:43:36 +01:00
J. Fernando Sánchez
897bb487b1 Update LOD exercises 2023-02-13 18:26:14 +01:00
Oscar Araque
41d3bdea75 minor typos in ml1 2022-09-05 18:20:29 +02:00
Carlos A. Iglesias
0a9cd3bd5e Update 3_7_SVM.ipynb
Fixed typo in a comment
2022-03-17 17:58:09 +01:00
Carlos A. Iglesias
2c7c9e58e0 Update 3_7_SVM.ipynb
Fixed bug in ROC curve visualization
2022-03-17 17:50:27 +01:00
cif
f0278aea33 Updated 2022-03-07 14:19:44 +01:00
cif
7bf0fb6479 Updated 2022-03-07 14:17:02 +01:00
cif
4d87b07ed9 Updated visualization 2022-03-07 14:16:14 +01:00
cif
7d71ba5f7a Updated references 2022-03-07 13:03:48 +01:00
cif
1124c9129c Fixed URL 2022-03-07 13:01:21 +01:00
cif
df6449b55f Updated to last version of seaborn 2022-03-07 12:57:17 +01:00
cif
d99eeb733a Updated median with only numeric values 2022-03-07 12:44:14 +01:00
cif
a43fb4c78c Updated references 2022-03-07 12:28:10 +01:00
Carlos A. Iglesias
bf21e3ceab Update 3_1_Read_Data.ipynb
Updated references
2022-03-07 11:01:34 +01:00
Carlos A. Iglesias
e41d233828 Update 3_0_0_Intro_ML_2.ipynb
Updated bibliography
2022-03-07 10:58:29 +01:00
Carlos A. Iglesias
a7c6be5b96 Update 2_6_Model_Tuning.ipynb
Fixed typo.
2022-02-28 12:51:18 +01:00
Carlos A. Iglesias
11a1ea80d3 Update 2_6_Model_Tuning.ipynb
Fixed typos.
2022-02-28 12:45:40 +01:00
Carlos A. Iglesias
a209d18a5b Update 2_5_1_kNN_Model.ipynb
Fixed typo.
2022-02-28 12:38:27 +01:00
cif
ffefd8c2e3 Updated bibliography 2022-02-21 13:55:09 +01:00
cif
f43cde73e4 Updated bibliography 2022-02-21 13:51:21 +01:00
cif
8784fdc773 Updated bibliography 2022-02-21 13:39:33 +01:00
cif
a6d5f9ddeb Updated bibliography 2022-02-21 13:32:07 +01:00
cif
2e72a4d729 Updated bibliography 2022-02-21 13:29:33 +01:00
cif
9426b4c061 Updated bibliography 2022-02-21 13:26:24 +01:00
cif
5e5979d515 Updated links 2022-02-21 13:22:46 +01:00
cif
270dcec611 Updated links 2022-02-21 13:09:21 +01:00
Carlos A. Iglesias
e6e52b43ee Update 2_4_Preprocessing.ipynb
Updated Packt link.
2022-02-21 12:57:53 +01:00
Carlos A. Iglesias
3b7675fa3f Update 2_3_0_Visualisation.ipynb
Updated Packt bibliography link.
2022-02-21 12:56:22 +01:00
Carlos A. Iglesias
44c63412f9 Update 2_2_Read_Data.ipynb
Updated scikit url
2022-02-21 12:26:30 +01:00
Carlos A. Iglesias
5febbc21a4 Update 2_1_Intro_ScikitLearn.ipynb
Fixed typo in dimensionality.
2022-02-21 12:22:15 +01:00
J. Fernando Sánchez
66ed4ba258 Minor changes LOD 01 and 03 2022-02-15 20:48:49 +01:00
Carlos A. Iglesias
95cd25aef4 Update 1__10_Modules_Packages.ipynb
Fixed link to module tutorial
2022-02-10 17:51:32 +01:00
J. Fernando Sánchez
955e74fc8e Add requirements
Now the dependencies should be automatically installed if you open the repo
through Jupyter Binder
2021-11-10 08:48:54 +01:00
cif2cif
6743dad100 Cleaned output 2021-06-07 10:38:53 +02:00
cif2cif
729f7684c2 Cleaned output 2021-06-07 10:36:12 +02:00
cif2cif
ae8d3d3ba2 Updated with the new libraries 2021-05-07 11:10:21 +02:00
cif2cif
2ba0e2f3d9 updated to last version of OpenGym 2021-04-19 19:10:03 +02:00
cif2cif
c9114cc796 Fixed broken link and bug of sklearn-deap with scikit 0.24 2021-04-19 17:47:22 +02:00
cif2cif
b80c097362 Merge branch 'master' of https://github.com/gsi-upm/sitc 2021-04-06 10:21:25 +02:00
cif2cif
161cd8492b Fixed bug in substrings_in_string and set default df[AgeGroup] to np.nan 2021-04-06 10:20:29 +02:00
124 changed files with 139680 additions and 1531 deletions

View File

@@ -1,7 +1,7 @@
 # sitc
 Exercises for Intelligent Systems Course at Universidad Politécnica de Madrid, Telecommunication Engineering School. This material is used in the subjects
-- SITC (Sistemas Inteligentes y Tecnologías del Conocimiento) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
-- TIAD (Tecnologías Inteligentes de Análisis de Datos) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos
+- CDAW (Ciencia de datos y aprendizaje en automático en la web de datos) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
+- ABID (Analítica de Big Data) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos
 For following this course:
 - Follow the instructions to install the environment: https://github.com/gsi-upm/sitc/blob/master/python/1_1_Notebooks.ipynb (Just install 'conda')
@@ -9,11 +9,13 @@ For following this course:
 - Run in a terminal in the folder sitc: jupyter notebook (and enjoy)
 Topics
-* Python: quick introduction to Python
+* Python: a quick introduction to Python
 * ML-1: introduction to machine learning with scikit-learn
 * ML-2: introduction to machine learning with pandas and scikit-learn
+* ML-21: preprocessing and visualization
 * ML-3: introduction to machine learning. Neural Computing
 * ML-4: introduction to Evolutionary Computing
 * ML-5: introduction to Reinforcement Learning
 * NLP: introduction to NLP
 * LOD: Linked Open Data, exercises and example code
+* SNA: Social Network Analysis

1
images/.p Normal file
View File

@@ -0,0 +1 @@

BIN images/EscUpmPolit_p.gif Normal file (binary file not shown; 3.1 KiB)
BIN images/cart.png Normal file (binary file not shown; 95 KiB)
BIN images/data-chart-type.png Normal file (binary file not shown; 34 KiB)
BIN (file name not shown) (binary file not shown; 54 KiB)
BIN images/frozenlake-world.png Normal file (binary file not shown; 67 KiB)
BIN images/gym-maze.gif Normal file (binary file not shown; 222 KiB)
BIN images/iris-classes.png Normal file (binary file not shown; 1.4 MiB)
BIN images/iris-dataset.jpg Normal file (binary file not shown; 44 KiB)
BIN images/iris-features.png Normal file (binary file not shown; 944 KiB)
BIN (file name not shown) (binary file not shown; 237 KiB)
BIN (file name not shown) (binary file not shown; 87 KiB)
BIN (file name not shown) (binary file not shown; 56 KiB)
BIN (file name not shown) (binary file not shown; 87 KiB)
BIN (file name not shown) (binary file not shown; 58 KiB)
BIN images/qlearning-algo.png Normal file (binary file not shown; 85 KiB)
BIN images/recording.gif Normal file (binary file not shown; 1.8 MiB)
BIN images/titanic.jpg Normal file (binary file not shown; 152 KiB)

View File

@@ -124,7 +124,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql https://live.dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT ?s ?p ?o\n",
 "WHERE {\n",
@@ -149,7 +149,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql https://live.dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT *\n",
 "WHERE\n",
@@ -445,7 +445,7 @@
 "window_display": false
 },
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -459,7 +459,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.1"
+"version": "3.8.10"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

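Both hunks above repoint the cells from the retired live.dbpedia.org mirror to the main DBpedia endpoint. As a minimal sketch of the kind of cell being updated (the %%sparql magic is the notebook's own, as shown in the hunks):

%%sparql https://dbpedia.org/sparql

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
}
LIMIT 10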
View File

@@ -790,11 +790,12 @@
 "\n",
 "SELECT *\n",
 "WHERE { ... }\n",
-"ORDER BY <variable> <variable> ... DESC(<variable>) ASC(<variable>)\n",
+"ORDER BY <variable> <variable> ... \n",
 "... other statements like LIMIT ...\n",
 "```\n",
 "\n",
-"The results can be sorted in ascending or descending order, and using several variables."
+"The results can be sorted in ascending or descending order, and using several variables.\n",
+"By default the results are ordered in ascending order, but you can indicate the order using an optional modifier (`ASC(<variable>)`, or `DESC(<variable>)`). \n"
 ]
 },
 {
@@ -880,7 +881,7 @@
 " rdfs:label \"Ringo Starr\" .\n",
 "```\n",
 "\n",
-"Using this structure, and the SPARQL statements you already know, to get the **names** of all musicians that collaborated in at least one song.\n"
+"Using this structure, and the SPARQL statements you already know, get the **names** of all musicians that collaborated in at least one song.\n"
 ]
 },
 {
@@ -954,13 +955,13 @@
 "\n",
 "Results can be aggregated using different functions.\n",
 "One of the simplest functions is `COUNT`.\n",
-"The syntax for COUNT is:\n",
+"The syntax for `COUNT` is:\n",
 " \n",
 "```sparql\n",
 "SELECT (COUNT(?variable) as ?count_name)\n",
 "```\n",
 "\n",
-"Use `COUNT` to get the number of songs in which Ringo collaborated."
+"Use `COUNT` to get the number of songs in which Ringo collaborated. Your query should return a column named `number`."
 ]
 },
 {
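As an illustrative sketch of the COUNT shape this exercise asks for (not the graded solution; the s: prefix and the rdfs:label pattern are assumed from the surrounding cells of this notebook):

SELECT (COUNT(?song) as ?number)
WHERE {
  ?song a s:Song ;
        ?instrument ?musician .
  ?musician rdfs:label "Ringo Starr" .
}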
@@ -1038,7 +1039,7 @@
 "\n",
 "Once results are grouped, they can be aggregated using any aggregation function, such as `COUNT`.\n",
 "\n",
-"Using `GROUP BY` and `COUNT`, get the count of songs that use each instrument:"
+"Using `GROUP BY` and `COUNT`, get the count of songs in which Ringo Starr has played each of the instruments:"
 ]
 },
 {
@@ -1143,7 +1144,9 @@
 "Now, use the same principle to get the count of **different** instruments in each song.\n",
 "Some songs have several musicians playing the same instrument, but we only care about *different* instruments in each song.\n",
 "\n",
-"Use `?number` for the count."
+"Use `?song` for the song and `?number` for the count.\n",
+"\n",
+"Take into consideration that instruments are entities of type `i:Instrument`."
 ]
 },
 {
@@ -1153,7 +1156,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "2d0633303eedd0655e9b64bb00317dba",
+"checksum": "3139d9b7e620266946ffe1ae0cf67581",
 "grade": false,
 "grade_id": "cell-ee208c762d00da9c",
 "locked": false,
@@ -1173,6 +1176,8 @@
 " [] a s:Song ;\n",
 " rdfs:label ?song ;\n",
 " ?instrument ?musician .\n",
+" \n",
+"?instrument a s:Instrument .\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
 "ORDER BY DESC(?number)"
@@ -1186,7 +1191,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "301aa479241fa02534ee047cf7577eee",
+"checksum": "5abf6eb7a67ebc9f7612b876105c1960",
 "grade": true,
 "grade_id": "cell-ddeec32b8ac3d894",
 "locked": true,
@@ -1198,7 +1203,7 @@
 "outputs": [],
 "source": [
 "s = solution()\n",
-"assert s['columns']['number'][0] == '27'"
+"assert s['columns']['number'][0] == '25'"
 ]
 },
 {
@@ -1243,10 +1248,10 @@
 "metadata": {},
 "source": [
 "However, there are some songs that do not have a vocalist (at least, in the dataset).\n",
-"Those songs will not appear in the list above, because we they do not match part of the `WHERE` clause.\n",
+"Those songs will not appear in the list above, because they do not match part of the `WHERE` clause.\n",
 "\n",
 "In these cases, we can specify optional values in a query using the `OPTIONAL` keyword.\n",
-"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query.\n",
+"When a set of clauses are inside an `OPTIONAL` group, the SPARQL endpoint will try to use them in the query.\n",
 "If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
 "\n",
 "To exemplify this, we can use a property that **does not exist in the dataset**:"
@@ -1504,7 +1509,9 @@
 "source": [
 "Now, count how many instruments each musician has played in a song.\n",
 "\n",
-"**Do not count lead (`i:vocals`) or backing vocals (`i:backingvocals`) as instruments**."
+"**Do not count lead (`i:vocals`) or backing vocals (`i:backingvocals`) as instruments**.\n",
+"\n",
+"Use `?musician` for the musician and `?number` for the count."
 ]
 },
 {
@@ -1570,7 +1577,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Which songs had Ringo in dums OR Lennon in lead vocals? (UNION)"
+"### Which songs had Ringo in drums OR Lennon in lead vocals? (UNION)"
 ]
 },
 {
@@ -1636,7 +1643,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "d583b30a1e00960df3a4411b6854c8c8",
+"checksum": "11061e79ec06ccb3a9c496319a528366",
 "grade": true,
 "grade_id": "cell-409402df0e801d09",
 "locked": true,
@@ -1647,7 +1654,7 @@
 },
 "outputs": [],
 "source": [
-"assert len(solution()['tuples']) == 246"
+"assert len(solution()['tuples']) == 209"
 ]
 },
 {
@@ -1770,7 +1777,9 @@
 "\n",
 "Using `GROUP_CONCAT`, get a list of the instruments that each musician could play.\n",
 "\n",
-"You can consult how to use GROUP_CONCAT [here](https://www.w3.org/TR/sparql11-query/)."
+"You can consult how to use GROUP_CONCAT [here](https://www.w3.org/TR/sparql11-query/).\n",
+"\n",
+"Use `?musician` for the musician and `?instruments` for the list of instruments."
 ]
 },
 {
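A hedged sketch of the GROUP_CONCAT shape asked for above: the separator argument is standard SPARQL 1.1, str() casts the instrument URIs to strings, and the triple pattern is assumed from this notebook's dataset:

SELECT ?musician (GROUP_CONCAT(str(?instrument); separator=", ") as ?instruments)
WHERE {
  [] a s:Song ;
     ?instrument ?musician .
}
GROUP BY ?musician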
@@ -1815,7 +1824,9 @@
 "\n",
 "You can check if a string or URI matches a regular expression with `regex(?variable, \"<regex>\", \"i\")`.\n",
 "\n",
-"The documentation for regular expressions in SPARQL is [here](https://www.w3.org/TR/rdf-sparql-query/)."
+"The documentation for regular expressions in SPARQL is [here](https://www.w3.org/TR/rdf-sparql-query/).\n",
+"\n",
+"Use `?instrument` for the instrument and `?ins` for the url of the type."
 ]
 },
 {
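And a sketch of the regex filter that hunk describes, matching type URIs against a pattern; the "guitar" pattern is only an example, and str() converts the URI before matching:

SELECT ?instrument ?ins
WHERE {
  ?instrument a ?ins .
  FILTER(regex(str(?ins), "guitar", "i"))
}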
@@ -1873,7 +1884,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -1887,9 +1898,22 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.1"
+"version": "3.8.10"
+},
+"toc": {
+"base_numbering": 1,
+"nav_menu": {},
+"number_sections": true,
+"sideBar": true,
+"skip_h1_title": false,
+"title_cell": "Table of Contents",
+"title_sidebar": "Contents",
+"toc_cell": false,
+"toc_position": {},
+"toc_section_display": true,
+"toc_window_display": false
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

View File

@@ -441,7 +441,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -455,7 +455,20 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.1"
+"version": "3.8.10"
+},
+"toc": {
+"base_numbering": 1,
+"nav_menu": {},
+"number_sections": true,
+"sideBar": true,
+"skip_h1_title": false,
+"title_cell": "Table of Contents",
+"title_sidebar": "Contents",
+"toc_cell": false,
+"toc_position": {},
+"toc_section_display": true,
+"toc_window_display": false
 }
 },
 "nbformat": 4,

View File

@@ -189,8 +189,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Let's start with a simple query. We will get a list of cities and towns in Madrid.\n",
-"If we take a look at the DBpedia ontology or the page of any town we already know, we discover that the property that links towns to their community is [`isPartOf`](http://dbpedia.org/ontology/isPartOf), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
+"Let's start with a simple query. We will get a list of towns and other populated areas within the Community of Madrid.\n",
+"If we take a look at the DBpedia ontology, or the page of any town we already know, we discover that the property that links towns to their community is [`subdivision`](http://dbpedia.org/ontology/subdivision), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
 "\n",
 "Since there are potentially many cities to get, we will limit our results to the first 10 results:"
 ]
@@ -201,11 +201,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT ?localidad\n",
 "WHERE {\n",
-" ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid>\n",
+" ?localidad <http://dbpedia.org/ontology/subdivision> <http://dbpedia.org/resource/Community_of_Madrid>\n",
 "}\n",
 "LIMIT 10"
 ]
@@ -224,14 +224,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 " \n",
 "SELECT ?localidad\n",
 "WHERE {\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
+" ?localidad dbo:subdivision dbr:Community_of_Madrid.\n",
 "}\n",
 "LIMIT 10"
 ]
@@ -259,10 +259,11 @@
 "source": [
 "Now that you have some experience under your belt, it is time to design your own query.\n",
 "\n",
-"Your first task it to get a list of Spanish Novelits, using the skeleton below and the previous query to guide you.\n",
+"Your first task is to get a list of writers, using the skeleton below and the previous query to guide you.\n",
 "\n",
-"Pages for Spanish novelists are grouped in the *Spanish novelists* DBpedia category. You can use that fact to get your list.\n",
-"In other words, the difference from the previous query will be using `dct:subject` instead of `dbo:isPartOf`, and `dbc:Spanish_novelists` instead of `dbr:Community_of_Madrid`."
+"The DBpedia vocabulary has a special class for writers: `<http://dbpedia.org/ontology/Writer>`.\n",
+"\n",
+"In other words, the difference from the previous query will be using `a` instead of `dbo:isPartOf`, and `dbo:Writer` instead of `dbr:Community_of_Madrid`."
 ]
 },
 {
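Since the exercise skeletons below hide the answers, here is the general shape of a type query with the `a` shorthand for rdf:type and the dbo:Writer class named above (an illustrative sketch, not the graded solution):

%%sparql https://dbpedia.org/sparql

PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor
WHERE {
  ?escritor a dbo:Writer .
}
LIMIT 10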
@@ -272,7 +273,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "eef1c62e2797bd3ef01f2061da6f83c4",
+"checksum": "2a5c55e8bca983aca6cc2293f4560f31",
 "grade": false,
 "grade_id": "cell-7a9509ff3c34127e",
 "locked": false,
@@ -282,10 +283,10 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
+"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor\n",
 "\n",
@@ -324,7 +325,7 @@
 "source": [
 "### Using more criteria\n",
 "\n",
-"We can get more than one property in the same query. Let us modify our query to get the population of the cities as well."
+"We can get more than one property in the same query. Let us modify our query to get the total area of the towns we found before."
 ]
 },
 {
@@ -333,22 +334,21 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 "PREFIX dbp: <http://dbpedia.org/property/>\n",
 " \n",
-"SELECT ?localidad ?pop ?when\n",
+"SELECT ?localidad ?area\n",
 "\n",
 "WHERE {\n",
-" ?localidad dbo:populationTotal ?pop .\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
-" ?localidad dbp:populationAsOf ?when .\n",
+" ?localidad dbo:areaTotal ?area .\n",
+" ?localidad dbo:subdivision dbr:Community_of_Madrid .\n",
 "}\n",
 "\n",
-"LIMIT 100"
+"LIMIT 1000"
 ]
 },
 {
@@ -358,8 +358,7 @@
 "outputs": [],
 "source": [
 "assert 'localidad' in solution()['columns']\n",
-"assert 'http://dbpedia.org/resource/Parla' in solution()['columns']['localidad']\n",
-"assert ('http://dbpedia.org/resource/San_Sebastián_de_los_Reyes', '75912', '2009') in solution()['tuples']"
+"assert ('http://dbpedia.org/resource/Lozoya', '5.794e+07') in solution()['tuples']"
 ]
 },
 {
@@ -368,7 +367,7 @@
 "source": [
 "Time to try it yourself.\n",
 "\n",
-"Get the list of Spanish novelists AND their name (using rdfs:label)."
+"Get the list of writers AND their name (using rdfs:label)."
 ]
 },
 {
@@ -378,7 +377,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "9d4193612dea95da2d91762b638ad5e6",
+"checksum": "2ebdc8d3f3420bb961e2c8c77d027c3b",
 "grade": false,
 "grade_id": "cell-83dcaae0d09657b5",
 "locked": false,
@@ -388,7 +387,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -399,7 +398,7 @@
 "WHERE {\n",
 "# YOUR ANSWER HERE\n",
 "}\n",
-"LIMIT 10"
+"LIMIT 100"
 ]
 },
 {
@@ -410,7 +409,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "86115c2a8982ad12b7250cf4341ae9c3",
+"checksum": "d779d690d5d1865973fdcf113b74c221",
 "grade": true,
 "grade_id": "cell-8afd28aada7a896c",
 "locked": true,
@@ -422,8 +421,8 @@
 "outputs": [],
 "source": [
 "assert 'escritor' in solution()['columns']\n",
-"assert 'http://dbpedia.org/resource/Eduardo_Mendoza_Garriga' in solution()['columns']['escritor']\n",
-"assert ('http://dbpedia.org/resource/Eduardo_Mendoza_Garriga', 'Eduardo Mendoza') in solution()['tuples']"
+"assert 'http://dbpedia.org/resource/Alison_Stine' in solution()['columns']['escritor']\n",
+"assert ('http://dbpedia.org/resource/Alistair_MacLeod', 'Alistair MacLeod') in solution()['tuples']"
 ]
 },
 {
@@ -440,11 +439,12 @@
 "In the previous example, we saw that we got what seemed to be duplicated answers.\n",
 "\n",
 "This happens because entities can have labels in different languages (e.g. English, Spanish).\n",
-"To restrict the search to only those results we're interested in, we can use filtering.\n",
+"We can filter results using the `FILTER` keyword.\n",
 "\n",
-"We can also decide the order in which our results are shown.\n",
+"We can also decide the order in which our results are shown using the `ORDER BY` sentence.\n",
+"We can order in ascending (`ASC`) or descending (`DESC`) order.\n",
 "\n",
-"For instance, this is how we could use filtering to get only large cities in our example, ordered by population:"
+"For instance, this is how we could use filtering to get only large areas in our example, in descending order:"
 ]
 },
 {
@@ -453,21 +453,20 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 " \n",
-"SELECT ?localidad ?pop ?when\n",
+"SELECT ?localidad ?area\n",
 "\n",
 "WHERE {\n",
-" ?localidad dbo:populationTotal ?pop .\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
-" ?localidad dbp:populationAsOf ?when .\n",
-" FILTER(?pop > 100000)\n",
+" ?localidad dbo:areaTotal ?area .\n",
+" ?localidad dbo:type dbr:Municipalities_of_Spain .\n",
+" FILTER(?area > 100000)\n",
 "}\n",
-"ORDER BY ?pop\n",
+"ORDER BY DESC(?area)\n",
 "LIMIT 100"
 ]
 },
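The duplicate-label problem discussed above is usually handled with a language filter on rdfs:label; a sketch combining it with ordering (lang() is standard SPARQL, and the rest mirrors the queries in these hunks):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre .
  FILTER(lang(?nombre) = "es")
}
ORDER BY ?nombre
LIMIT 100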
@@ -486,7 +485,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "a38cb1aea7b1f01f6b37c088384e0a3d",
+"checksum": "1e09f3c1749dd3c9256a1d0bbc14ff2d",
 "grade": true,
 "grade_id": "cell-cb7b8283568cd349",
 "locked": true,
@@ -498,10 +497,9 @@
 "outputs": [],
 "source": [
 "# We still have the biggest city\n",
-"assert ('http://dbpedia.org/resource/Madrid', '3141991', '2014') in solution()['tuples']\n",
+"assert 'http://dbpedia.org/resource/Úbeda' in solution()['columns']['localidad']\n",
 "# But the smaller ones are gone\n",
-"assert 'http://dbpedia.org/resource/Tres_Cantos' not in solution()['columns']['localidad']\n",
-"assert 'http://dbpedia.org/resource/San_Sebastián_de_los_Reyes' not in solution()['columns']['localidad']"
+"assert 'http://dbpedia.org/resource/El_Cañaveral' not in solution()['columns']['localidad']"
 ]
 },
 {
@@ -518,7 +516,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "b6aaac8ab30d52a042c1efefbbff7550",
+"checksum": "b200ff7d97fe03bab726040d16b636fe",
 "grade": false,
 "grade_id": "cell-ff3d611cb0304b01",
 "locked": false,
@@ -528,11 +526,11 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
+"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor ?nombre\n",
 "\n",
@@ -540,7 +538,7 @@
 "# YOUR ANSWER HERE\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
-"LIMIT 1000"
+"LIMIT 100"
 ]
 },
 {
@@ -551,7 +549,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "3441fbd2267002acbb0d46d9ce94ba97",
+"checksum": "637f8a2e0eb286f968f22b0e0fa2215a",
 "grade": true,
 "grade_id": "cell-d70cc6ea394741bc",
 "locked": true,
@@ -563,8 +561,8 @@
 "outputs": [],
 "source": [
 "assert len(solution()['tuples']) >= 50\n",
-"assert 'Adelaida García Morales' in solution()['columns']['nombre']\n",
-"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Adelaida_García_Morales') == 1"
+"assert 'Abraham Abulafia' in solution()['columns']['nombre']\n",
+"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Abraham_Abulafia') == 1"
 ]
 },
 {
@@ -579,8 +577,9 @@
 "metadata": {},
 "source": [
 "From now on, we will focus on our Writers example.\n",
+"More specifically, we will be interested in writers born in the XX century.\n",
 "\n",
-"First, we will search for writers born in the XX century, using the [20th-century Spanish novelists](http://dbpedia.org/page/Category:20th-century_Spanish_novelists) category."
+"To do that, we will filter our novelists to only those born (`dbo:birthDate`) in the 20th century (after 1900)."
 ]
 },
 {
@@ -611,7 +610,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "cacdd08a8a267c1173304e319ffff563",
+"checksum": "e896e64c21f317aeacf82ccd46811059",
 "grade": true,
 "grade_id": "cell-cf3821f2d33fb0f6",
 "locked": true,
@@ -622,9 +621,9 @@
 },
 "outputs": [],
 "source": [
-"assert 'Camilo José Cela' in solution()['columns']['nombre']\n",
-"assert 'Javier Marías' in solution()['columns']['nombre']\n",
-"assert all(x > '1850-12-31' and x < '2001-01-01' for x in solution()['columns']['nac'])"
+"assert 'Kiku Amino' in solution()['columns']['nombre']\n",
+"assert 'Albert Hackett' in solution()['columns']['nombre']\n",
+"assert all(x > '1900-01-01' and x < '2001-01-01' for x in solution()['columns']['nac'])"
 ]
 },
 {
@@ -647,7 +646,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "f4170cbbf042644e394d1eb9acf12ce3",
+"checksum": "df4364d90fd37ec886bec8f39f6df8ee",
 "grade": false,
 "grade_id": "cell-254a18dd973e82ed",
 "locked": false,
@@ -657,11 +656,10 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
 "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor ?nombre ?fechaNac ?fechaDef\n",
@@ -670,7 +668,7 @@
 "# YOUR ANSWER HERE\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
-"LIMIT 200"
+"LIMIT 100"
 ]
 },
 {
@@ -681,7 +679,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "29c6362adbdb5606e158f696594e1052",
+"checksum": "26d08d050ac6963b20595f52b5d14781",
 "grade": true,
 "grade_id": "cell-4d6a64dde67f0e11",
 "locked": true,
@@ -692,8 +690,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Wenceslao Fernández Flórez' in solution()['columns']['nombre']\n",
-"assert '1879-2-11' in solution()['columns']['fechaNac']\n",
+"assert 'Alister McGrath' in solution()['columns']['nombre']\n",
+"# assert '1879-2-11' in solution()['columns']['fechaNac']\n",
 "assert '' in solution()['columns']['fechaNac'] # Not all birthdates are defined\n",
 "assert '' in solution()['columns']['fechaDef'] # Some deathdates are not defined"
 ]
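For the birth-date exercises above, the pattern is a comparison on dbo:birthDate plus OPTIONAL for death dates that may be missing; a sketch under those assumptions (dbo:deathDate is the usual DBpedia property, assumed here):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?escritor ?nombre ?fechaNac ?fechaDef
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre ;
            dbo:birthDate ?fechaNac .
  OPTIONAL { ?escritor dbo:deathDate ?fechaDef . }
  FILTER(?fechaNac > "1900-01-01"^^xsd:date)
}
LIMIT 100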
@@ -722,7 +720,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Get the list of Spanish novelists that are still alive.\n",
+"Get the list of writers that are still alive.\n",
 "A person is alive if their death date is not defined and they were born less than 100 years ago"
 ]
 },
@@ -733,7 +731,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "f3c11121eb0d1328d2f5da3580f8d648",
+"checksum": "7527bd597f9550ec14d454732f6b2183",
 "grade": false,
 "grade_id": "cell-474b1a72dec6827c",
 "locked": false,
@@ -743,7 +741,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -769,7 +767,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "770bbddef5210c28486a1929e4513ada",
+"checksum": "8f8c783af97cd3024b90a8f5b7fd7027",
 "grade": true,
 "grade_id": "cell-46b62dd2856bc919",
 "locked": true,
@@ -781,7 +779,7 @@
 "outputs": [],
 "source": [
 "assert 'Fernando Arrabal' in solution()['columns']['nombre']\n",
-"assert 'Albert Espinosa' in solution()['columns']['nombre']\n",
+"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
 "for year in solution()['columns']['nac']:\n",
 " assert int(year) >= 1918"
 ]
@@ -790,7 +788,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now, get the list of Spanish novelists that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
+"Now, get the list of writers that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
 "\n",
 "Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
 "\n",
@@ -804,7 +802,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "ed34857649c9a6926eb0a3a0e1d8198d",
+"checksum": "2e608b808ceceb2c8515f892a6b98d06",
 "grade": false,
 "grade_id": "cell-ceefd3c8fbd39d79",
 "locked": false,
@@ -814,7 +812,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -838,7 +836,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "18bb2d8d586bf4a5231973e69958ab75",
+"checksum": "ec821397f67619e5bfa02a19bdd597fc",
 "grade": true,
 "grade_id": "cell-461cd6ccc6c2dc79",
 "locked": true,
@@ -849,8 +847,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
-"assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']"
+"assert 'Wang Ruowang' in solution()['columns']['nombre']\n",
+"assert 'http://dbpedia.org/resource/Manuel_de_Pedrolo' in solution()['columns']['escritor']"
 ]
 },
 {
@@ -887,7 +885,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "34163ddb0400cd8ddd2c2e2cdf29c20b",
+"checksum": "3d647ccd0f3e861b843af0ec4a33098b",
 "grade": false,
 "grade_id": "cell-2a39adc71d26ae73",
 "locked": false,
@@ -897,7 +895,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -921,7 +919,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "84ab7d64a45e03e6dd902216a2aad030",
+"checksum": "524d152d46d3c1166052b6d5871c6aa5",
 "grade": true,
 "grade_id": "cell-542e0e36347fd5d1",
 "locked": true,
@@ -932,8 +930,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
-"assert 'http://dbpedia.org/resource/Albert_Espinosa' in solution()['columns']['escritor']\n",
+"assert 'Anna Langfus' in solution()['columns']['nombre']\n",
+"assert 'http://dbpedia.org/resource/Paul_Celan' in solution()['columns']['escritor']\n",
 "\n",
 "from collections import Counter\n",
 "c = Counter(solution()['columns']['nombre'])\n",
@@ -956,7 +954,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Get the list of living Spanish novelists born in Madrid.\n",
+"Get the list of living novelists born in Madrid.\n",
 "\n",
 "Hint: use `dbr:Madrid` and `dbo:birthPlace`"
 ]
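Following the hint, the birth-place restriction adds one triple pattern; a sketch (combining it with the liveness filters is left to the exercise):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?escritor ?nombre
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre ;
            dbo:birthPlace dbr:Madrid .
}
LIMIT 100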
@@ -968,7 +966,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "25c8edcee216d536aac98fc9aa2b6422", "checksum": "f067a70a247b62d7eb5cc526efdc53c4",
"grade": false, "grade": false,
"grade_id": "cell-d175e41da57c889b", "grade_id": "cell-d175e41da57c889b",
"locked": false, "locked": false,
@@ -978,7 +976,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1042,7 +1040,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "c1f22b82c4d0bd4102a6c38f7f933dc6", "checksum": "64ea2ef341901ce486bb1dcbed6c3785",
"grade": false, "grade": false,
"grade_id": "cell-e4b99af9ef91ff6f", "grade_id": "cell-e4b99af9ef91ff6f",
"locked": false, "locked": false,
@@ -1052,7 +1050,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1066,7 +1064,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"LIMIT 10000" "LIMIT 1000"
] ]
}, },
{ {
@@ -1077,7 +1075,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "51acaeb26379c6bd2f8c767001ef79ec", "checksum": "fe47b48969b20b50a16a4ce4ad75e97d",
"grade": true, "grade": true,
"grade_id": "cell-68661b73c2140e4f", "grade_id": "cell-68661b73c2140e4f",
"locked": true, "locked": true,
@@ -1088,8 +1086,8 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'http://dbpedia.org/resource/A_Heart_So_White' in solution()['columns']['obra']\n", "assert 'http://dbpedia.org/resource/Cristina_Guzmán_(novel)' in solution()['columns']['obra']\n",
"assert 'http://dbpedia.org/resource/Tomorrow_in_the_Battle_Think_on_Me' in solution()['columns']['obra']\n", "assert 'http://dbpedia.org/resource/Life_Is_a_Dream' in solution()['columns']['obra']\n",
"assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia" "assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia"
] ]
}, },
@@ -1097,14 +1095,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Traversing the graph" "### Traversing the graph II"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Get a list of living Spanish novelists born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n", "Get a list of writers born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n",
"\n", "\n",
"If the query is right, you should see a list of writers after running the test code.\n", "If the query is right, you should see a list of writers after running the test code.\n",
"\n", "\n",
@@ -1118,7 +1116,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "e3f8e18a006a763f5cdbe49c97b73f5f", "checksum": "d3636d90f8d6a3c824b17ce87ba6c423",
"grade": false, "grade": false,
"grade_id": "cell-b1f71c67dd71dad4", "grade_id": "cell-b1f71c67dd71dad4",
"locked": false, "locked": false,
@@ -1128,7 +1126,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1142,7 +1140,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"ORDER BY ?nombre\n", "ORDER BY ?nombre\n",
"LIMIT 100" "LIMIT 5"
] ]
}, },
{ {
@@ -1208,7 +1206,8 @@
"source": [ "source": [
"Using UNION, get a list of distinct spanish novelists AND poets.\n", "Using UNION, get a list of distinct spanish novelists AND poets.\n",
"\n", "\n",
"Hint: Category: Spanish_poets" "In this query, instead of looking for writers, try to find the right entities by looking at the `dct:subject` property.\n",
"The entities we are looking after should be in the `Spanish_poets` and `Spanish_novelists` categories."
] ]
}, },
{ {
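A hedged sketch of the UNION shape this hint describes: each branch binds ?escritor through dct:subject, and DISTINCT collapses authors that belong to both categories. This is an illustrative fragment, not the graded answer.

query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>

SELECT DISTINCT ?escritor
WHERE {
  { ?escritor dct:subject dbc:Spanish_novelists . }
  UNION
  { ?escritor dct:subject dbc:Spanish_poets . }
}
LIMIT 100
"""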
@@ -1218,7 +1217,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "9c0da379841474601397f5623abc6a9c", "checksum": "2547e55ac68b37687efddd50c768eb5b",
"grade": false, "grade": false,
"grade_id": "cell-21eb6323b6d0011d", "grade_id": "cell-21eb6323b6d0011d",
"locked": false, "locked": false,
@@ -1228,7 +1227,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1242,7 +1241,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"LIMIT 10000" "LIMIT 100"
] ]
}, },
{ {
@@ -1253,7 +1252,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "f22c7db423410fcf3e8fce4ec0a8e9f9", "checksum": "565dac8ae632765bc3f128f830e70993",
"grade": true, "grade": true,
"grade_id": "cell-004e021e877c6ace", "grade_id": "cell-004e021e877c6ace",
"locked": true, "locked": true,
@@ -1264,7 +1263,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Garcilaso de la Vega' in solution()['columns']['nombre']" "assert 'Antonio Gala' in solution()['columns']['nombre']"
] ]
}, },
{ {
@@ -1289,7 +1288,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "cd7ce9212f587afe311c7631b3908de2", "checksum": "f8cca6da3b6830a5474eac28c3c8ebde",
"grade": false, "grade": false,
"grade_id": "cell-e35414e191c5bf16", "grade_id": "cell-e35414e191c5bf16",
"locked": false, "locked": false,
@@ -1299,7 +1298,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1389,13 +1388,13 @@
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© 2018 Universidad Politécnica de Madrid." "© 2023 Universidad Politécnica de Madrid."
] ]
} }
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -1409,7 +1408,20 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.1" "version": "3.8.10"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
} }
}, },
"nbformat": 4, "nbformat": 4,


@@ -150,7 +150,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "69e23e6e3dc06ca9d2b5d878c2baba94", "checksum": "1a23c8b9a53f7ae28f28b1c23b9706b5",
"grade": false, "grade": false,
"grade_id": "cell-ab7755944d46f9ca", "grade_id": "cell-ab7755944d46f9ca",
"locked": false, "locked": false,
@@ -160,19 +160,19 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct: <http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n", "PREFIX dbc: <http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n", "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"\n", "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n", "\n",
"SELECT ?escritor ?nombre (year(?fechaNac) as ?nac)\n",
"WHERE {\n", "WHERE {\n",
" ?escritor dct:subject dbc:Spanish_novelists .\n", " ?escritor dct:subject dbc:Spanish_novelists ;\n",
" ?escritor rdfs:label ?nombre .\n", " rdfs:label ?nombre ;\n",
" ?escritor dbo:birthDate ?fechaNac .\n", " dbo:birthDate ?fechaNac .\n",
" FILTER(lang(?nombre) = \"es\") .\n", " FILTER(lang(?nombre) = \"es\") .\n",
" # YOUR ANSWER HERE\n", " # YOUR ANSWER HERE\n",
"}\n", "}\n",
@@ -188,7 +188,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "211c632634327a1fd805326fa0520cdd", "checksum": "e261d808f509c1e29227db94d1eab784",
"grade": true, "grade": true,
"grade_id": "cell-cf3821f2d33fb0f6", "grade_id": "cell-cf3821f2d33fb0f6",
"locked": true, "locked": true,
@@ -199,8 +199,8 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Camilo José Cela' in solution()['columns']['nombre']\n", "assert 'Ramiro Ledesma' in solution()['columns']['nombre']\n",
"assert 'Javier Marías' in solution()['columns']['nombre']\n", "assert 'Ray Loriga' in solution()['columns']['nombre']\n",
"assert all(int(x) > 1899 and int(x) < 2001 for x in solution()['columns']['nac'])" "assert all(int(x) > 1899 and int(x) < 2001 for x in solution()['columns']['nac'])"
] ]
}, },
@@ -304,7 +304,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "2a24f623c23116fd23877facb487dd16", "checksum": "e55173801ab36337ad356a1bc286dbd1",
"grade": false, "grade": false,
"grade_id": "cell-ceefd3c8fbd39d79", "grade_id": "cell-ceefd3c8fbd39d79",
"locked": false, "locked": false,
@@ -314,7 +314,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -341,7 +341,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "18bb2d8d586bf4a5231973e69958ab75", "checksum": "1b77cfaefb8b2ec286ce7b0c70804fe0",
"grade": true, "grade": true,
"grade_id": "cell-461cd6ccc6c2dc79", "grade_id": "cell-461cd6ccc6c2dc79",
"locked": true, "locked": true,
@@ -353,7 +353,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Javier Sierra' in solution()['columns']['nombre']\n", "assert 'Javier Sierra' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']" "assert 'http://dbpedia.org/resource/José_Ángel_Mañas' in solution()['columns']['escritor']"
] ]
}, },
{ {
@@ -392,7 +392,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"SELECT ?localidad\n", "SELECT ?localidad\n",
"WHERE {\n", "WHERE {\n",
@@ -419,7 +419,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "6e444c20b411033a6c45fd5a566018fa", "checksum": "b70a9a4f102c253e864d2e8aec79ce81",
"grade": false, "grade": false,
"grade_id": "cell-a57d3546a812f689", "grade_id": "cell-a57d3546a812f689",
"locked": false, "locked": false,
@@ -429,7 +429,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -526,7 +526,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n", "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
@@ -535,9 +535,9 @@
"SELECT ?com, GROUP_CONCAT(?name, \",\") as ?places # notice how we rename the variable\n", "SELECT ?com, GROUP_CONCAT(?name, \",\") as ?places # notice how we rename the variable\n",
"\n", "\n",
"WHERE {\n", "WHERE {\n",
" ?localidad dbo:isPartOf ?com .\n", " ?com dct:subject dbc:Autonomous_communities_of_Spain .\n",
" ?com dbo:type dbr:Autonomous_communities_of_Spain .\n", " ?localidad dbo:subdivision ?com ;\n",
" ?localidad rdfs:label ?name .\n", " rdfs:label ?name .\n",
" FILTER (lang(?name)=\"es\")\n", " FILTER (lang(?name)=\"es\")\n",
"}\n", "}\n",
"\n", "\n",
@@ -552,7 +552,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "markdown", "cell_type": "markdown",
"checksum": "e100e2f89c832cf832add62c107e4008", "checksum": "4779fb61645634308d0ed01e0c88e8a4",
"grade": false, "grade": false,
"grade_id": "asdiopjasdoijasdoijasd", "grade_id": "asdiopjasdoijasdoijasd",
"locked": true, "locked": true,
@@ -561,7 +561,7 @@
} }
}, },
"source": [ "source": [
"Try it yourself, to get a list of works by each of these authors:" "Try it yourself, to get a list of works by each of the authors in this query:"
] ]
}, },
{ {
@@ -571,7 +571,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "9f6e26faab2be98c72fb7a917ac5a421", "checksum": "e5d87d1d8eba51c510241ba75981a597",
"grade": false, "grade": false,
"grade_id": "cell-2e3de17c75047652", "grade_id": "cell-2e3de17c75047652",
"locked": false, "locked": false,
@@ -581,7 +581,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -592,26 +592,17 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"\n", "\n",
"WHERE {\n", "WHERE {\n",
" ?escritor dct:subject dbc:Spanish_novelists .\n", " ?escritor a dbo:Writer .\n",
" ?escritor rdfs:label ?nombre .\n", " ?escritor rdfs:label ?nombre .\n",
" ?escritor dbo:birthDate ?fechaNac .\n", " ?escritor dbo:birthDate ?fechaNac .\n",
" ?escritor dbo:birthPlace dbr:Madrid .\n", " ?escritor dbo:birthPlace dbr:Madrid .\n",
" OPTIONAL {\n", " # YOUR ANSWER HERE\n",
" ?obra dbo:author ?escritor .\n",
" ?obra rdfs:label ?titulo .\n",
" }\n",
" OPTIONAL {\n",
" ?escritor dbo:deathDate ?fechaDef .\n",
" }\n",
" FILTER (?fechaNac <= \"2000\"^^xsd:date).\n",
" FILTER (?fechaNac >= \"1918\"^^xsd:date).\n",
" FILTER (!bound(?fechaDef) || (?fechaNac >= \"1918\"^^xsd:date)) .\n",
" FILTER(lang(?nombre) = \"es\") .\n", " FILTER(lang(?nombre) = \"es\") .\n",
" FILTER(!bound(?titulo) || lang(?titulo) = \"en\") .\n", " FILTER(!bound(?titulo) || lang(?titulo) = \"en\") .\n",
"\n", "\n",
"}\n", "}\n",
"ORDER BY ?nombre\n", "ORDER BY ?nombre\n",
"LIMIT 10000" "LIMIT 100"
] ]
}, },
{ {
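The deleted reference lines above illustrate the key trick this exercise now leaves to the student: when an OPTIONAL block fails to match, its variables stay unbound, and FILTER(!bound(?titulo) || ...) is what lets authors with no recorded works survive the filter. A fragment reconstructed from the removed lines, kept here as a hedged reminder:

fragment = """
OPTIONAL {
  ?obra dbo:author ?escritor .
  ?obra rdfs:label ?titulo .
}
FILTER(!bound(?titulo) || lang(?titulo) = "en")
"""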
@@ -639,7 +630,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -653,7 +644,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.1" "version": "3.8.10"
} }
}, },
"nbformat": 4, "nbformat": 4,

lod/BeatlesMusicians.ttl (new file)

File diff suppressed because it is too large


@@ -12,6 +12,7 @@ from urllib.request import Request, urlopen
from urllib.parse import quote_plus, urlencode from urllib.parse import quote_plus, urlencode
from urllib.error import HTTPError from urllib.error import HTTPError
import ssl
import json import json
import sys import sys
@@ -32,7 +33,11 @@ def send_query(query, endpoint):
headers={'content-type': 'application/x-www-form-urlencoded', headers={'content-type': 'application/x-www-form-urlencoded',
'accept': FORMATS}, 'accept': FORMATS},
method='POST') method='POST')
res = urlopen(r) context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
res = urlopen(r, context=context, timeout=2)
data = res.read().decode('utf-8') data = res.read().decode('utf-8')
if res.getcode() == 200: if res.getcode() == 200:
try: try:
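The new context deliberately sets check_hostname = False and verify_mode = CERT_NONE, which silences certificate errors from the endpoint at the cost of disabling TLS verification. If the endpoint's certificate chain validates on your machine, a hedged alternative keeps verification on:

import ssl
from urllib.request import urlopen

# Verifying context: hostname and certificate checks stay enabled.
context = ssl.create_default_context()
# res = urlopen(r, context=context, timeout=2)  # same call as in send_query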

lod/tests.py (new file, empty)

@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -71,8 +71,7 @@
"source": [ "source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n", "* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -80,7 +79,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -88,7 +87,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -102,7 +101,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.7" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -40,10 +40,10 @@
"\n", "\n",
"* Learn to use scikit-learn\n", "* Learn to use scikit-learn\n",
"* Learn the basic steps to apply machine learning techniques: dataset analysis, load, preprocessing, training, validation, optimization and persistence.\n", "* Learn the basic steps to apply machine learning techniques: dataset analysis, load, preprocessing, training, validation, optimization and persistence.\n",
"* Learn how to do a exploratory data analysis\n", "* Learn how to do an exploratory data analysis\n",
"* Learn how to visualise a dataset\n", "* Learn how to visualise a dataset\n",
"* Learn how to load a bundled dataset\n", "* Learn how to load a bundled dataset\n",
"* Learn how to separate the dataset into traning and testing datasets\n", "* Learn how to separate the dataset into training and testing datasets\n",
"* Learn how to train a classifier\n", "* Learn how to train a classifier\n",
"* Learn how to predict with a trained classifier\n", "* Learn how to predict with a trained classifier\n",
"* Learn how to evaluate the predictions\n", "* Learn how to evaluate the predictions\n",
@@ -63,9 +63,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n", "* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -73,7 +71,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## LIcence\n", "## LIcence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -81,7 +79,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -95,7 +93,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.7" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -87,10 +87,10 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"Scikit-learn provides algorithms for solving the following problems:\n", "Scikit-learn provides algorithms for solving the following problems:\n",
"* **Classification**: Identifying to which category an object belongs to. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n", "* **Classification**: Identifying to which category an object belongs. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n",
"* **Clustering**: Automatic grouping of similar objects into sets. Some of the available [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html#clustering) are k-Means, Affinity propagation, etc.\n", "* **Clustering**: Automatic grouping of similar objects into sets. Some of the available [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html#clustering) are k-Means, Affinity propagation, etc.\n",
"* **Regression**: Predicting a continuous-valued attribute associated with an object. Some of the available [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are linear regression, logistic regression, etc.\n", "* **Regression**: Predicting a continuous-valued attribute associated with an object. Some of the available [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are linear regression, logistic regression, etc.\n",
"* ** Dimensionality reduction**: Reducing the number of random variables to consider. Some of the available [dimensionality reduction algorithms](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) are SVD, PCA, etc." "* **Dimensionality reduction**: Reducing the number of random variables to consider. Some of the available [dimensionality reduction algorithms](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) are SVD, PCA, etc."
] ]
}, },
{ {
@@ -105,7 +105,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"In addition, scikit-learn helps in several tasks:\n", "In addition, scikit-learn helps in several tasks:\n",
"* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some of the [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n", "* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n",
"* **Preprocessing**: Several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Some of the available [preprocessing functions](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) are scaling and normalizing data, or imputing missing values." "* **Preprocessing**: Several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Some of the available [preprocessing functions](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) are scaling and normalizing data, or imputing missing values."
] ]
}, },
@@ -128,9 +128,9 @@
"\n", "\n",
"If it is not installed, install it with conda: `conda install scikit-learn`.\n", "If it is not installed, install it with conda: `conda install scikit-learn`.\n",
"\n", "\n",
"If you have installed scipy and numpy, you can also installed using pip: `pip install -U scikit-learn`.\n", "If you have installed scipy and numpy, you can also install using pip: `pip install -U scikit-learn`.\n",
"\n", "\n",
"It is not recommended to use pip for installing scipy and numpy. Instead, use conda or install the linux package *python-sklearn*." "It is not recommended to use pip to install scipy and numpy. Instead, use conda or install the Linux package *python-sklearn*."
] ]
}, },
{ {
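After installing by either route, a quick sanity check confirms the package imports and shows which version is active:

import sklearn
print(sklearn.__version__)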
@@ -156,7 +156,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")\n", "![](./images/EscUpmPolit_p.gif \"UPM\")\n",
"\n", "\n",
"# Course Notes for Learning Intelligent Systems\n", "# Course Notes for Learning Intelligent Systems\n",
"\n", "\n",
@@ -34,11 +34,11 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to read and load a sample dataset.\n", "This notebook aims to learn how to read and load a sample dataset.\n",
"\n", "\n",
"Scikit-learn comes with some bundled [datasets](http://scikit-learn.org/stable/datasets/): iris, digits, boston, etc.\n", "Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, boston, etc.\n",
"\n", "\n",
"In this notebook we are going to use the Iris dataset." "In this notebook, we will use the Iris dataset."
] ]
}, },
{ {
@@ -54,16 +54,25 @@
"source": [ "source": [
"The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n", "The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
"\n", "\n",
"The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features.\n", "The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
"\n", "\n",
"![Iris](files/images/iris-dataset.jpg)" "![Iris dataset](./images/iris-dataset.jpg \"Iris\")"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In ordert to read the dataset, we import the datasets bundle and then load the Iris dataset. " "Here you can see the species and the features.\n",
"![Iris features](./images/iris-features.png \"Iris features\")\n",
"![Iris classes](./images/iris-classes.png \"Iris classes\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read the dataset, we import the datasets bundle and then load the Iris dataset. "
] ]
}, },
{ {
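The loading step described here boils down to two lines; a minimal sketch, with the standard shapes of the bundled dataset noted in comments:

from sklearn import datasets

# Load the bundled Iris dataset: 150 samples, 4 numeric features, 3 classes.
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)  # (150, 4) (150,)
print(iris.target_names)                   # ['setosa' 'versicolor' 'virginica']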
@@ -180,7 +189,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"#Using numpy, I can print the dimensions (here we are working with 2D matriz)\n", "#Using numpy, I can print the dimensions (here we are working with a 2D matrix)\n",
"print(iris.data.ndim)" "print(iris.data.ndim)"
] ]
}, },
@@ -218,7 +227,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In following sessions we will learn how to load a dataset from a file (csv, excel, ...) using the pandas library." "In the following sessions, we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
] ]
}, },
{ {
@@ -246,7 +255,7 @@
"source": [ "source": [
"## Licence\n", "## Licence\n",
"\n", "\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -49,7 +49,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions." "This notebook aims to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions."
] ]
}, },
{ {
@@ -65,13 +65,13 @@
"source": [ "source": [
"This section covers different ways to inspect the distribution of samples per feature.\n", "This section covers different ways to inspect the distribution of samples per feature.\n",
"\n", "\n",
"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n", "First of all, let's see how many samples we have in each class using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
"\n", "\n",
"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n", "A histogram is a graphical representation of the distribution of numerical data. It estimates the probability distribution of a continuous variable (quantitative variable). \n",
"\n", "\n",
"For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n", "For building a histogram, we need to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
"\n", "\n",
"In our case, since the values are not continuous and we have only three values, we do not need to bin them." "Since the values are not continuous and we have only three values, we do not need to bin them."
] ]
}, },
{ {
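A minimal matplotlib sketch of such a histogram (the notebook's own plotting cell is outside this hunk):

import matplotlib.pyplot as plt
from sklearn import datasets

# One bar per class shows the balanced 50/50/50 sample distribution.
iris = datasets.load_iris()
plt.hist(iris.target)
plt.xlabel('class')
plt.ylabel('number of samples')
plt.show()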
@@ -115,7 +115,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"As can be seen, we have the same distribution of samples for every class.\n", "As can be seen, we have the same distribution of samples for every class.\n",
"The next step is to see the distribution of the features" "The next step is to see the distribution of the features."
] ]
}, },
{ {
@@ -184,7 +184,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"As we can see, the Setosa class seems to be linearly separable with these two features.\n", "As we can see, the Setosa class seems linearly separable with these two features.\n",
"\n", "\n",
"Another nice visualisation is given below." "Another nice visualisation is given below."
] ]
@@ -228,7 +228,6 @@
"source": [ "source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n", "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n", "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n", "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n", "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n", "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
@@ -242,7 +241,7 @@
"source": [ "source": [
"## Licence\n", "## Licence\n",
"\n", "\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]

File diff suppressed because one or more lines are too long


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -76,7 +76,7 @@
"source": [ "source": [
"A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n", "A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
"\n", "\n",
"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)." "We will use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
] ]
}, },
{ {
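A sketch of that split: test_size=0.25 gives the 75/25 ratio, and the concrete random_state value here is illustrative, not the notebook's exact choice.

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Reproducible 75/25 partition of features and labels.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)
print(x_train.shape, x_test.shape)  # (112, 4) (38, 4)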
@@ -122,9 +122,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n", "Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might misbehave if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
"\n", "\n",
"The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set." "The preprocessing module further provides a utility class `StandardScaler` to compute a training set's mean and standard deviation. Later, the same transformation will be applied on the testing set."
] ]
}, },
{ {
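A sketch of the scaler usage, assuming x_train and x_test from the split above: the scaler is fitted on the training set only, and the learned mean and standard deviation are reused on the test set to avoid leakage.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)   # learn mean/std on training data only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)    # same transformation, no refitting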
@@ -163,7 +163,6 @@
"source": [ "source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n", "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n", "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n", "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n", "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)" "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
@@ -174,7 +173,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Licences\n", "### Licences\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -53,9 +53,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"This is an introduction of general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n", "This is an introduction to general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n",
"\n", "\n",
"You can skip it during the lab session and read it later," "You can skip it during the lab session and read it later."
] ]
}, },
{ {
@@ -69,20 +69,20 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Machine learning algorithms are programs that learn a model from a dataset with the aim of making predictions or learning structures to organize the data.\n", "Machine learning algorithms are programs that learn a model from a dataset to make predictions or learn structures to organize the data.\n",
"\n", "\n",
"In scikit-learn, machine learning algorithms take as an input a *numpy* array (n_samples, n_features), where\n", "In scikit-learn, machine learning algorithms take as input a *numpy* array (n_samples, n_features), where\n",
"* **n_samples**: number of samples. Each sample is an item to process (i.e. classify). A sample can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n", "* **n_samples**: number of samples. Each sample is an item to process (i.e., classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
"* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner.\n", "* **n_features**: The number of features or distinct traits that can be used to describe each item quantitatively.\n",
"\n", "\n",
"The number of features should be defined in advance. There is a specific type of feature sets that are high dimensional (e.g. millions of features), but most of the values are zero for a given sample. Using (numpy) arrays, all those values that are zero would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n", "The number of features should be defined in advance. A specific type of feature set is high-dimensional (e.g., millions of features), but most values are zero for a given sample. Using (numpy) arrays, all those zero values would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
"\n", "\n",
"The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n", "The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n",
"\n", "\n",
"[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n", "[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n",
"* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n", "* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n",
"* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction and association rule learning.\n", "* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction, and association rule learning.\n",
"* **Semi-supervised learning**:i nput data is a mixture of labeled and unlabeled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression." "* **Semi-supervised learning**: input data is a mixture of labeled and unlabeled examples. There is a desired prediction problem, but the model must learn the structures to organize the data and make predictions. Example problems are classification and regression."
] ]
}, },
{ {
@@ -96,8 +96,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In *supervised machine learning models*, the machine learning algorithm takes as an input a training dataset, composed of feature vectors and labels, and produces a predictive model which is used for make prediction on new data.\n", "In *supervised machine learning models*, the machine learning algorithm takes as input a training dataset, composed of feature vectors and labels, and produces a predictive model used to predict new data.\n",
"![](files/images/plot_ML_flow_chart_1.png)" "![](./images/plot_ML_flow_chart_1.png)"
] ]
}, },
{ {
@@ -111,7 +111,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In *unsupervised machine learning models*, the machine learning model algorithm takes as an input the feature vectors and produces a predictive model that is used to fit its parameters so as to best summarize regularities found in the data.\n", "In *unsupervised machine learning models*, the machine learning model algorithm takes as input the feature vectors. It produces a predictive model that is used to fit its parameters to summarize the best regularities found in the data.\n",
"![](files/images/plot_ML_flow_chart_3.png)" "![](files/images/plot_ML_flow_chart_3.png)"
] ]
}, },
@@ -129,15 +129,15 @@
"scikit-learn has a uniform interface for all the estimators, some methods are only available if the estimator is supervised or unsupervised:\n", "scikit-learn has a uniform interface for all the estimators, some methods are only available if the estimator is supervised or unsupervised:\n",
"\n", "\n",
"* Available in *all estimators*:\n", "* Available in *all estimators*:\n",
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n", " * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g., model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
"\n", "\n",
"* Available in *supervised estimators*:\n", "* Available in *supervised estimators*:\n",
" * **model.predict()**: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.\n", " * **model.predict()**: given a trained model, predict the label of a new dataset. This method accepts one argument, the new data X_new (e.g., model.predict(X_new)), and returns the learned label for each object in the array.\n",
" * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n", " * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n",
"\n", "\n",
"* Available in *unsupervised estimators*:\n", "* Available in *unsupervised estimators*:\n",
" * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n", " * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n",
" * **model.fit_transform()**: some estimators implement this method, which performs a fit and a transform on the same input data.\n", " * **model.fit_transform()**: Some estimators implement this method, which performs a fit and a transform on the same input data.\n",
"\n", "\n",
"\n", "\n",
"![](files/images/plot_ML_flow_chart_2.png)" "![](files/images/plot_ML_flow_chart_2.png)"
@@ -154,7 +154,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [General concepts of machine learning with scikit-learn](http://www.astroml.org/sklearn_tutorial/general_concepts.html)\n", "* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)\n",
"* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)" "* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)"
] ]
}, },
@@ -169,7 +169,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -177,7 +177,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -191,7 +191,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.5.6" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -55,7 +55,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n", "The goal of this notebook is to learn how to train a model, make predictions with that model, and evaluate these predictions.\n",
"\n", "\n",
"The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)." "The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."
] ]
@@ -212,14 +212,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Precision, recall and f-score" "### Precision, recall, and f-score"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n", "For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
"\n", "\n",
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n", "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n", "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
@@ -246,7 +246,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Another useful metric is the confusion matrix" "Another useful metric is the confusion matrix."
] ]
}, },
{ {
@@ -262,7 +262,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We see we classify well all the 'setosa' and 'versicolor' samples. " "We classify all the 'setosa' and 'versicolor' samples well. "
] ]
}, },
{ {
@@ -276,7 +276,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**." "To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
] ]
}, },
{ {
@@ -298,7 +298,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"print(scores)" "print(scores)"
] ]
@@ -307,7 +307,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure" "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
] ]
}, },
{ {
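A sketch of that summary, reusing the scores array from the cross-validation snippet above:

import numpy as np
from scipy.stats import sem

# Mean accuracy across the 10 folds, with its standard error.
print(f"Mean score: {np.mean(scores):.3f} (+/- {sem(scores):.3f})")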
@@ -340,7 +340,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We are going to tune the algorithm, and calculate which is the best value for the k parameter." "We will tune the algorithm and calculate the best value for the k hyperparameter."
] ]
}, },
{ {
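A hedged sketch of the tuning loop, reusing x_iris, y_iris and the KFold iterator cv from the snippet above; the candidate range for k is illustrative.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score each candidate k with the same cross-validation and keep the best.
k_values = range(1, 26)
mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               x_iris, y_iris, cv=cv).mean()
               for k in k_values]
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))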
@@ -365,7 +365,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k." "The result is very dependent on the input data. Execute the train_test_split again and test how the result changes with k."
] ]
}, },
{ {
@@ -379,8 +379,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n", "* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
] ]
}, },
{ {
@@ -388,7 +387,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -405,7 +404,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -419,7 +418,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -56,9 +56,9 @@
"source": [ "source": [
"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n", "The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
"\n", "\n",
"There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n", "There are several well-known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0, and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
"\n", "\n",
"This notebook will follow the same steps that the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n", "This notebook will follow the same steps as the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n",
"\n", "\n",
"You need to install pydotplus: `conda install pydotplus` for the visualization." "You need to install pydotplus: `conda install pydotplus` for the visualization."
] ]
@@ -69,12 +69,12 @@
"source": [ "source": [
"## Load data and preprocessing\n", "## Load data and preprocessing\n",
"\n", "\n",
"Here we repeat the same operations for loading data and preprocessing than in the previous notebooks." "Here we repeat the same operations for loading data and preprocessing as in the previous notebooks."
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@@ -124,9 +124,20 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(max_depth=3, random_state=1)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [ "source": [
"from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n",
"import numpy as np\n", "import numpy as np\n",
@@ -145,9 +156,24 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
" 0 0 0 0 2 2 0 1 1 2 1 0 0 2 1 1 0 1 1 0 2 1 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
" 0]\n",
"Expected [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
" 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 2 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
" 0]\n"
]
}
],
"source": [ "source": [
"print(\"Prediction \", model.predict(x_train))\n", "print(\"Prediction \", model.predict(x_train))\n",
"print(\"Expected \", y_train)" "print(\"Expected \", y_train)"
@@ -162,9 +188,26 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted probabilities [[0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [0. 0.97368421 0.02631579]\n",
" [0. 0.97368421 0.02631579]\n",
" [0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [1. 0. 0. ]\n",
" [0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [0. 0. 1. ]]\n"
]
}
],
"source": [ "source": [
"# Print the \n", "# Print the \n",
"print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))" "print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))"
@@ -172,9 +215,17 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 5,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy in training 0.9821428571428571\n"
]
}
],
"source": [ "source": [
"# Evaluate Accuracy in training\n", "# Evaluate Accuracy in training\n",
"\n", "\n",
@@ -185,9 +236,17 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 6,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy in testing 0.9210526315789473\n"
]
}
],
"source": [ "source": [
"# Now we evaluate error in testing\n", "# Now we evaluate error in testing\n",
"y_test_pred = model.predict(x_test)\n", "y_test_pred = model.predict(x_test)\n",
@@ -203,15 +262,30 @@
"The current version of pydot does not work well in Python 3.\n", "The current version of pydot does not work well in Python 3.\n",
"For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n", "For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n",
"\n", "\n",
"You can skip this example. Since it can require installing additional packages, we include here the result.\n", "You can skip this example. Since it can require installing additional packages, we have included the result here.\n",
"![Decision Tree](files/images/cart.png)" "![Decision Tree](./images/cart.png)"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 7,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"ename": "InvocationException",
"evalue": "GraphViz's executables not found",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mInvocationException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/tmp/ipykernel_47326/3723147494.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mgraph\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpydot\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgraph_from_dot_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdot_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetvalue\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mgraph\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite_png\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'iris-tree.png'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0mImage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgraph\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate_png\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(path, f, prog)\u001b[0m\n\u001b[1;32m 1808\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1809\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfrmt\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1810\u001b[0;31m \u001b[0mprog\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprog\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1811\u001b[0m )\n\u001b[1;32m 1812\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36mwrite\u001b[0;34m(self, path, prog, format)\u001b[0m\n\u001b[1;32m 1916\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1917\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1918\u001b[0;31m \u001b[0mfobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1919\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1920\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36mcreate\u001b[0;34m(self, prog, format)\u001b[0m\n\u001b[1;32m 1957\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprogs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfind_graphviz\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1958\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprogs\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1959\u001b[0;31m raise InvocationException(\n\u001b[0m\u001b[1;32m 1960\u001b[0m 'GraphViz\\'s executables not found')\n\u001b[1;32m 1961\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mInvocationException\u001b[0m: GraphViz's executables not found"
]
}
],
"source": [ "source": [
"from IPython.display import Image \n", "from IPython.display import Image \n",
"from six import StringIO\n", "from six import StringIO\n",
@@ -256,7 +330,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Next we are going to export the pseudocode of the the learnt decision tree." "Next, we will export the pseudocode of the learnt decision tree."
] ]
}, },
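A compact alternative sketch for this step: sklearn.tree.export_text (scikit-learn 0.21+) prints the learnt tree as if/else pseudocode. The fitted model and the iris feature names are assumed from the cells above.

```python
# Sketch: textual pseudocode of the fitted decision tree.
from sklearn.tree import export_text

print(export_text(model, feature_names=list(iris.feature_names)))
```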
{ {
@@ -304,14 +378,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Precision, recall and f-score" "### Precision, recall, and f-score"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n", "For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
"\n", "\n",
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n", "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n", "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
@@ -338,7 +412,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Another useful metric is the confusion matrix" "Another useful metric is the confusion matrix."
] ]
}, },
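A minimal sketch tying the three metrics and the confusion matrix together, assuming the fitted model, x_test, and y_test from the previous cells:

```python
# Sketch: per-class precision/recall/F1 plus the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_test_pred = model.predict(x_test)
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
```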
{ {
@@ -354,7 +428,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We see we classify well all the 'setosa' and 'versicolor' samples. " "We classify all the 'setosa' and 'versicolor' samples well. "
] ]
}, },
{ {
@@ -368,7 +442,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n", "To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n",
"\n", "\n",
"Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split." "Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split."
] ]
@@ -392,7 +466,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"print(scores)" "print(scores)"
] ]
@@ -401,7 +475,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure" "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
] ]
}, },
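As a sketch of that aggregation, assuming the scores array returned by cross_val_score above:

```python
# Sketch: summarise the k fold scores as mean +/- standard error.
import numpy as np
from scipy.stats import sem

print("Mean score: {0:.3f} (+/- {1:.3f})".format(np.mean(scores), sem(scores)))
```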
{ {
@@ -434,10 +508,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n", "* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Parameter estimation using grid search with cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html)\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)" "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
] ]
}, },
@@ -446,7 +518,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -463,7 +535,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -477,7 +549,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -39,7 +39,7 @@
"* [Train classifier](#Train-classifier)\n", "* [Train classifier](#Train-classifier)\n",
"* [More about Pipelines](#More-about-Pipelines)\n", "* [More about Pipelines](#More-about-Pipelines)\n",
"* [Tuning the algorithm](#Tuning-the-algorithm)\n", "* [Tuning the algorithm](#Tuning-the-algorithm)\n",
"\t* [Grid Search for Parameter optimization](#Grid-Search-for-Parameter-optimization)\n", "\t* [Grid Search for Hyperparameter optimization](#Grid-Search-for-Hyperparameter-optimization)\n",
"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n", "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
"\t* [K-Fold validation](#K-Fold-validation)\n", "\t* [K-Fold validation](#K-Fold-validation)\n",
"* [References](#References)\n" "* [References](#References)\n"
@@ -56,9 +56,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 9.47. Could we get a better accuracy if we tune the parameters of the estimator?\n", "In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 9.47. Could we get a better accuracy if we tune the hyperparameters of the estimator?\n",
"\n", "\n",
"The goal of this notebook is to learn how to tune an algorithm by opimizing its parameters using grid search." "This notebook aims to learn how to tune an algorithm by optimizing its hyperparameters using grid search."
] ]
}, },
{ {
@@ -137,7 +137,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"\n", "\n",
"from scipy.stats import sem\n", "from scipy.stats import sem\n",
@@ -189,7 +189,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can get the list of parameters of the model. As you will observe, the parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax. We will use this for tuning the parameters." "We can get the list of model parameters. As you will observe, the parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax. We will use this for tuning the parameters."
] ]
}, },
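A hedged sketch of that syntax; the step names 'scaler' and 'tree' are assumptions of this example, not necessarily the notebook's:

```python
# Sketch: nested hyperparameters are addressed as <step>__<parameter>.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scaler', StandardScaler()),
                 ('tree', DecisionTreeClassifier())])
print('tree__max_depth' in pipe.get_params())  # True
pipe.set_params(tree__max_depth=3)             # tune the nested hyperparameter
```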
{ {
@@ -205,7 +205,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Let's see what happens if we change a parameter" "Let's see what happens if we change a parameter."
] ]
}, },
{ {
@@ -284,7 +284,7 @@
"\n", "\n",
"Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to understand better the algorithm, as well as which parameters can be tuned. As you see, we can change several ones, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n", "Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to understand better the algorithm, as well as which parameters can be tuned. As you see, we can change several ones, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n",
"\n", "\n",
"We can get the full list parameters of an estimator with the method *get_params()*. " "We can get an estimator's full list of parameters with the method *get_params()*. "
] ]
}, },
{ {
@@ -300,30 +300,30 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"You can try different values for these parameters and observe the results." "You can try different values for these hyperparameters and observe the results."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Grid Search for Parameter optimization" "### Grid Search for Hyperparameter optimization"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Changing manually the parameters to find their optimal values is not practical. Instead, we can consider to find the optimal value of the parameters as an *optimization problem*. \n", "Changing manually the hyperparameters to find their optimal values is not practical. Instead, we can consider finding the optimal value of the hyperparameters as an *optimization problem*. \n",
"\n", "\n",
"The sklearn comes with several optimization techniques for this purpose, such as **grid search** and **randomized search**. In this notebook we are going to introduce the former one." "Sklearn has several optimization techniques, such as **grid search** and **randomized search**. In this notebook, we are going to introduce the former one."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The sklearn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. " "Sklearn provides an object that, given data, computes the score during the fit of an estimator on a hyperparameter grid and chooses the hyperparameters to maximize the cross-validation score. "
] ]
}, },
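A minimal sketch of that object, GridSearchCV, assuming x_train and y_train from the earlier cells:

```python
# Sketch: exhaustive search over a small hyperparameter grid.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

gs = GridSearchCV(DecisionTreeClassifier(random_state=1),
                  {'max_depth': np.arange(3, 10)}, cv=10)
gs.fit(x_train, y_train)
print(gs.best_params_, gs.best_score_)
```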
{ {
@@ -351,7 +351,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Now we are going to show the results of grid search" "Now we are going to show the results of the grid search"
] ]
}, },
{ {
@@ -371,7 +371,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can now evaluate the KFold with this optimized parameter as follows." "We can now evaluate the KFold with this optimized hyperparameter as follows."
] ]
}, },
{ {
@@ -392,7 +392,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n", "def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n", " return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
@@ -405,7 +405,7 @@
"source": [ "source": [
"We have got an *improvement* from 0.947 to 0.953 with k-fold.\n", "We have got an *improvement* from 0.947 to 0.953 with k-fold.\n",
"\n", "\n",
"We are now to try to fit the best combination of the parameters of the algorithm. It can take some time to compute it." "We are now trying to fit the best combination of the hyperparameters of the algorithm. It can take some time to compute it."
] ]
}, },
{ {
@@ -414,12 +414,12 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Set the parameters by cross-validation\n", "# Set the hyperparameters by cross-validation\n",
"\n", "\n",
"from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer\n", "from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer\n",
"\n", "\n",
"# set of parameters to test\n", "# set of hyperparameters to test\n",
"tuned_parameters = [{'max_depth': np.arange(3, 10),\n", "tuned_hyperparameters = [{'max_depth': np.arange(3, 10),\n",
"# 'max_weights': [1, 10, 100, 1000]},\n", "# 'max_weights': [1, 10, 100, 1000]},\n",
" 'criterion': ['gini', 'entropy'], \n", " 'criterion': ['gini', 'entropy'], \n",
" 'splitter': ['best', 'random'],\n", " 'splitter': ['best', 'random'],\n",
@@ -431,7 +431,7 @@
"scores = ['precision', 'recall']\n", "scores = ['precision', 'recall']\n",
"\n", "\n",
"for score in scores:\n", "for score in scores:\n",
" print(\"# Tuning hyper-parameters for %s\" % score)\n", " print(\"# Tuning hyperparameters for %s\" % score)\n",
" print()\n", " print()\n",
"\n", "\n",
" if score == 'precision':\n", " if score == 'precision':\n",
@@ -440,10 +440,10 @@
" scorer = make_scorer(recall_score, average='weighted', zero_division=0)\n", " scorer = make_scorer(recall_score, average='weighted', zero_division=0)\n",
" \n", " \n",
" # cv = the fold of the cross-validation cv, defaulted to 5\n", " # cv = the fold of the cross-validation cv, defaulted to 5\n",
" gs = GridSearchCV(DecisionTreeClassifier(), tuned_parameters, cv=10, scoring=scorer)\n", " gs = GridSearchCV(DecisionTreeClassifier(), tuned_hyperparameters, cv=10, scoring=scorer)\n",
" gs.fit(x_train, y_train)\n", " gs.fit(x_train, y_train)\n",
"\n", "\n",
" print(\"Best parameters set found on development set:\")\n", " print(\"Best hyperparameters set found on development set:\")\n",
" print()\n", " print()\n",
" print(gs.best_params_)\n", " print(gs.best_params_)\n",
" print()\n", " print()\n",
@@ -492,7 +492,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n", "def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n", " return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
@@ -517,10 +517,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n", "* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Hyperparameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)" "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
] ]
}, },
@@ -535,7 +533,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -543,7 +541,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -557,7 +555,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.6" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -48,9 +48,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to save a model in the the scikit by using Pythons built-in persistence model, namely pickle\n", "The goal of this notebook is to learn how to save a model in the scikit by using Pythons built-in persistence model, namely pickle\n",
"\n", "\n",
"First we recap the previous tasks: load data, preprocess and train the model." "First, we recap the previous tasks: load data, preprocess, and train the model."
] ]
}, },
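As a sketch of what follows, assuming the fitted model and the x_test split from the recap cells:

```python
# Sketch: serialise the model to a bytes string with pickle and restore it.
import pickle

s = pickle.dumps(model)
model_restored = pickle.loads(s)
print(model_restored.predict(x_test[:5]))
```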
{ {
@@ -107,7 +107,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"A more efficient alternative to pickle is joblib, especially for big data problems. In this case the model can only be saved to a file and not to a string." "A more efficient alternative to pickle is joblib, especially for big data problems. In this case, the model can only be saved to a file and not to a string."
] ]
}, },
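A sketch of the joblib variant; the filename 'model.joblib' is an arbitrary choice for this example:

```python
# Sketch: joblib persists the model to a file rather than to a string.
from joblib import dump, load

dump(model, 'model.joblib')
model_restored = load('model.joblib')
```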
{ {
@@ -136,7 +136,9 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Tutorial scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", "* [Tutorial scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n",
"* [Model persistence in scikit-learn](http://scikit-learn.org/stable/modules/model_persistence.html#model-persistence)" "* [Model persistence in scikit-learn](http://scikit-learn.org/stable/modules/model_persistence.html#model-persistence)\n",
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
] ]
}, },
{ {
@@ -144,7 +146,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -161,7 +163,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -175,7 +177,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -52,7 +52,7 @@
"\n", "\n",
"Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.\n", "Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.\n",
"\n", "\n",
"The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.\n", "The plots show training points in solid colors and testing points in semi-transparent colors. The lower right shows the classification accuracy on the test set.\n",
"\n", "\n",
"The [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) is a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) classifiers. \n", "The [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) is a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) classifiers. \n",
"\n", "\n",
@@ -94,7 +94,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]

BIN ml1/images/iris-classes.png (new file, 1.4 MiB)

BIN (new image file, 944 KiB)


@@ -47,7 +47,7 @@ def get_code(tree, feature_names, target_names,
recurse(left, right, threshold, features, 0, 0) recurse(left, right, threshold, features, 0, 0)
# Taken from http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py # Taken from https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html
import numpy as np import numpy as np
import matplotlib.pyplot as plt import matplotlib.pyplot as plt


@@ -74,9 +74,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n", "* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -92,7 +90,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -106,7 +104,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -50,30 +50,30 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n", "In this session, we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
"\n", "\n",
"![Titanic](images/titanic.jpg)\n", "![Titanic](images/titanic.jpg)\n",
"\n", "\n",
"\n", "\n",
"The main objective is predicting which passengers survived the sinking of the Titanic.\n", "The main objective is to predict which passengers survived the sinking of the Titanic.\n",
"\n", "\n",
"The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n", "The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n",
"\n", "\n",
"\n", "\n",
"Here follows a description of the variables.\n", "Here follows a description of the variables.\n",
"\n", "\n",
"|Variable | Description| Values|\n", "| Variable | Description | Values |\n",
"|-------------------------------|\n", "|------------|---------------------------------|-----------------|\n",
"| survival| Survival| (0 = No; 1 = Yes)|\n", "| survival | Survival |(0 = No; 1 = Yes)|\n",
"|Pclass |Name | |\n", "| Pclass | Name | |\n",
"|Sex |Sex | male, female|\n", "| Sex | Sex | male, female |\n",
"|Age |Age|\n", "| Age | Age | |\n",
"|SibSp |Number of Siblings/Spouses Aboard||\n", "| SibSp |Number of Siblings/Spouses Aboard| |\n",
"|Parch |Number of Parents/Children Aboard||\n", "| Parch |Number of Parents/Children Aboard| |\n",
"|Ticket|Ticket Number||\n", "| Ticket | Ticket Number | |\n",
"|Fare |Passenger Fare||\n", "| Fare | Passenger Fare | |\n",
"|Cabin |Cabin||\n", "| Cabin | Cabin | |\n",
"|Embarked |Port of Embarkation| (C = Cherbourg; Q = Queenstown; S = Southampton)|\n", "| Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
"\n", "\n",
"\n", "\n",
"The definitions used for SibSp and Parch are:\n", "The definitions used for SibSp and Parch are:\n",
@@ -213,8 +213,7 @@
"* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n", "* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",
"* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n", "* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
"* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n", "* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",
"* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n", "* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n"
"* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
] ]
}, },
{ {


@@ -433,10 +433,9 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Pandas](http://pandas.pydata.org/)\n", "* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" "* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)"
] ]
}, },
{ {
@@ -458,7 +457,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -472,7 +471,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -373,8 +373,8 @@
"source": [ "source": [
"#Mean age of passengers per Passenger class\n", "#Mean age of passengers per Passenger class\n",
"\n", "\n",
"#First we calculate the mean\n", "#First we calculate the mean for the numeric columns\n",
"df.groupby('Pclass').mean()" "df.select_dtypes(np.number).groupby('Pclass').mean()"
] ]
}, },
{ {
@@ -404,7 +404,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"#Mean Age and SibSp of passengers grouped by passenger class and sex\n", "#Mean Age and SibSp of passengers grouped by passenger class and sex\n",
"df.groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" "df.groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
] ]
}, },
{ {
@@ -414,7 +414,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n", "#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n",
"df[df.Age > 25].groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" "df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
] ]
}, },
{ {
@@ -424,7 +424,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n", "# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].mean()" "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
] ]
}, },
{ {
@@ -436,7 +436,7 @@
"# We can also decide which function apply in each column\n", "# We can also decide which function apply in each column\n",
"\n", "\n",
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n", "#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].agg({'Age': np.mean, \n", "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
" 'SibSp': np.mean, 'Survived': np.sum})" " 'SibSp': np.mean, 'Survived': np.sum})"
] ]
}, },
@@ -451,7 +451,10 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Pivot tables are an intuitive way to analyze data, and alternative to group columns." "Pivot tables are an intuitive way to analyze data, and an alternative to group columns.\n",
"\n",
"This command makes a table with rows Sex and columns Pclass, and\n",
"averages the result of the column Survived, thereby giving the percentage of survivors in each grouping."
] ]
}, },
{ {
@@ -460,7 +463,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index='Sex')" "pd.pivot_table(df, index='Sex', columns='Pclass', values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
] ]
}, },
{ {
@@ -469,7 +479,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'])" "pd.pivot_table(df, index=['Sex', 'Age'], columns=['Pclass'], values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nevertheless, this is not very useful since we have a row per age. Thus, we define a partition."
] ]
}, },
{ {
@@ -478,7 +495,8 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])" "# Partition each of the passengers into 3 categories based on their age\n",
"age = pd.cut(df['Age'], [0,12,18,80])"
] ]
}, },
{ {
@@ -487,7 +505,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)" "pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can change the function used for aggregating each group."
] ]
}, },
{ {
@@ -496,8 +521,18 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Try np.sum, np.size, len\n", "# default\n",
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])" "pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=np.mean)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Two agg functions\n",
"pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=[np.mean, np.sum])"
] ]
}, },
{ {
@@ -600,8 +635,8 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Fill missing values with the median\n", "# Fill missing values with the median, we avoid empty (None) values with numeric_only\n",
"df_filled = df.fillna(df.median())\n", "df_filled = df.fillna(df.median(numeric_only=True))\n",
"df_filled[-5:]" "df_filled[-5:]"
] ]
}, },
@@ -685,7 +720,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# But we are working on a copy \n", "# But we are working on a copy, so we get a warning\n",
"df.iloc[889]['Sex'] = np.nan" "df.iloc[889]['Sex'] = np.nan"
] ]
}, },
@@ -695,7 +730,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# If we want to change, we should not chain selections\n", "# If we want to change it, we should not chain selections\n",
"# The selection can be done with the column name\n", "# The selection can be done with the column name\n",
"df.loc[889, 'Sex']" "df.loc[889, 'Sex']"
] ]
@@ -932,11 +967,11 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Pandas](http://pandas.pydata.org/)\n", "* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Learning Pandas, Michael Heydt, Packt Publishing, 2017](https://learning.oreilly.com/library/view/learning-pandas/9781787123137/)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n", "* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" "* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)"
] ]
}, },
{ {
@@ -958,7 +993,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -972,7 +1007,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.11.5"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -220,7 +220,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Analise distributon\n", "# Analise distribution\n",
"df.hist(figsize=(10,10))\n", "df.hist(figsize=(10,10))\n",
"plt.show()" "plt.show()"
] ]
@@ -233,7 +233,7 @@
"source": [ "source": [
"# We can see the pairwise correlation between variables. A value near 0 means low correlation\n", "# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
"# while a value near -1 or 1 indicates strong correlation.\n", "# while a value near -1 or 1 indicates strong correlation.\n",
"df.corr()" "df.corr(numeric_only = True)"
] ]
}, },
{ {
@@ -249,11 +249,10 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# General description of relationship betweek variables uwing Seaborn PairGrid\n", "# General description of relationship between variables uwing Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n", "# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n", "g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
"g.map_diag(plt.hist)\n", "g.map(sns.scatterplot)\n",
"g.map_offdiag(plt.scatter)\n",
"g.add_legend()" "g.add_legend()"
] ]
}, },
@@ -367,7 +366,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Now we visualise age and survived to see if there is some relationship\n", "# Now we visualise age and survived to see if there is some relationship\n",
"sns.FacetGrid(df, hue=\"Survived\", size=5).map(sns.kdeplot, \"Age\").add_legend()" "sns.FacetGrid(df, hue=\"Survived\", height=5).map(sns.kdeplot, \"Age\").add_legend()"
] ]
}, },
{ {
@@ -567,7 +566,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Plot with seaborn\n", "# Plot with seaborn\n",
"sns.countplot('Sex', data=df)" "sns.countplot(x='Sex', data=df)"
] ]
}, },
{ {
@@ -683,16 +682,6 @@
"df.groupby('Pclass').size()" "df.groupby('Pclass').size()"
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Distribution\n",
"sns.countplot('Pclass', data=df)"
]
},
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
@@ -725,7 +714,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"sns.factorplot('Pclass',data=df,hue='Sex',kind='count')" "sns.catplot(x='Pclass',data=df,hue='Sex',kind='count')"
] ]
}, },
{ {
@@ -906,7 +895,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('Embarked', data=df)" "sns.countplot(x='Embarked', data=df)"
] ]
}, },
{ {
@@ -997,7 +986,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('SibSp', data=df)" "sns.countplot(x='SibSp', data=df)"
] ]
}, },
{ {
@@ -1180,7 +1169,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('Parch', data=df)" "sns.countplot(x='Parch', data=df)"
] ]
}, },
{ {
@@ -1233,7 +1222,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"df.groupby(['Pclass', 'Sex', 'Parch'])['Parch', 'SibSp', 'Survived'].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})" "df.groupby(['Pclass', 'Sex', 'Parch'])[['Parch', 'SibSp', 'Survived']].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})"
] ]
}, },
{ {
@@ -1576,7 +1565,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -1590,7 +1579,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -72,7 +72,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n", "Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv.\n",
"\n", "\n",
"Print *df*." "Print *df*."
] ]
@@ -214,7 +214,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"df['FamilySize'] = df['SibSp'] + df['Parch']\n", "df['FamilySize'] = df['SibSp'] + df['Parch']\n",
"df.head()" "df"
] ]
}, },
{ {
@@ -377,8 +377,8 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n", "# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
"df['AgeGroup'] = 0\n", "df['AgeGroup'] = np.nan\n",
"df.loc[(.Age<6),'AgeGroup'] = 0\n", "df.loc[(df.Age<6),'AgeGroup'] = 0\n",
"df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n", "df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n",
"df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n", "df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n",
"df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n", "df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n",
@@ -404,8 +404,8 @@
" if np.isnan(big_string):\n", " if np.isnan(big_string):\n",
" return 'X'\n", " return 'X'\n",
" for substring in substrings:\n", " for substring in substrings:\n",
" if big_string.find(substring) != 1:\n", " if substring in big_string:\n",
" return substring\n", " return substring[0::]\n",
" print(big_string)\n", " print(big_string)\n",
" return 'X'\n", " return 'X'\n",
" \n", " \n",
@@ -478,8 +478,17 @@
} }
], ],
"metadata": { "metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -493,7 +502,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -78,7 +78,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015." "* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka and Vahid Mirjalili, Packt Publishing, 2019."
] ]
}, },
{ {
@@ -100,7 +100,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -114,7 +114,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -222,7 +222,7 @@
"kernel = types_of_kernels[0]\n", "kernel = types_of_kernels[0]\n",
"gamma = 3.0\n", "gamma = 3.0\n",
"\n", "\n",
"# Create kNN model\n", "# Create SVM model\n",
"model = SVC(kernel=kernel, probability=True, gamma=gamma)" "model = SVC(kernel=kernel, probability=True, gamma=gamma)"
] ]
}, },
@@ -276,7 +276,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can evaluate the accuracy if the model always predict the most frequent class, following this [refeference](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)." "We can evaluate the accuracy if the model always predict the most frequent class, following this [reference](https://medium.com/analytics-vidhya/model-validation-for-classification-5ff4a0373090)."
] ]
}, },
{ {
@@ -351,10 +351,10 @@
"We can obtain more information from the confussion matrix and the metric F1-score.\n", "We can obtain more information from the confussion matrix and the metric F1-score.\n",
"In a confussion matrix, we can see:\n", "In a confussion matrix, we can see:\n",
"\n", "\n",
"||**Predicted**: 0| **Predicted: 1**|\n", "| |**Predicted**: 0| **Predicted: 1**|\n",
"|---------------------------|\n", "|-------------|----------------|-----------------|\n",
"|**Actual: 0**| TN | FP |\n", "|**Actual: 0**| TN | FP |\n",
"|**Actual: 1**| FN|TP|\n", "|**Actual: 1**| FN | TP |\n",
"\n", "\n",
"* **True negatives (TN)**: actual negatives that were predicted as negatives\n", "* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
"* **False positives (FP)**: actual negatives that were predicted as positives\n", "* **False positives (FP)**: actual negatives that were predicted as positives\n",
@@ -418,7 +418,7 @@
"plt.ylim([0.0, 1.0])\n", "plt.ylim([0.0, 1.0])\n",
"plt.title('ROC curve for Titanic')\n", "plt.title('ROC curve for Titanic')\n",
"plt.xlabel('False Positive Rate (1 - Recall)')\n", "plt.xlabel('False Positive Rate (1 - Recall)')\n",
"plt.xlabel('True Positive Rate (Sensitivity)')\n", "plt.ylabel('True Positive Rate (Sensitivity)')\n",
"plt.grid(True)" "plt.grid(True)"
] ]
}, },
@@ -535,13 +535,13 @@
"source": [ "source": [
"# This step will take some time\n", "# This step will take some time\n",
"# Cross-validationt\n", "# Cross-validationt\n",
"cv = KFold(n_splits=5, shuffle=False, random_state=33)\n", "cv = KFold(n_splits=5, shuffle=True, random_state=33)\n",
"# StratifiedKFold has is a variation of k-fold which returns stratified folds:\n", "# StratifiedKFold has is a variation of k-fold which returns stratified folds:\n",
"# each set contains approximately the same percentage of samples of each target class as the complete set.\n", "# each set contains approximately the same percentage of samples of each target class as the complete set.\n",
"#cv = StratifiedKFold(y, n_folds=3, shuffle=False, random_state=33)\n", "#cv = StratifiedKFold(y, n_folds=3, shuffle=True, random_state=33)\n",
"scores = cross_val_score(model, X, y, cv=cv)\n", "scores = cross_val_score(model, X, y, cv=cv)\n",
"print(\"Scores in every iteration\", scores)\n", "print(\"Scores in every iteration\", scores)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))\n" "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
] ]
}, },
{ {
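The commented-out StratifiedKFold call above still uses the pre-0.18 scikit-learn signature; the current splitter takes no data in its constructor. A sketch of the modern stratified variant, with iris data standing in for the notebook's X, y and model:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel='linear')

# y is passed to split()/cross_val_score, not to the constructor
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
scores = cross_val_score(model, X, y, cv=cv)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))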
@@ -644,7 +644,7 @@
"source": [ "source": [
"* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n", "* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n", "* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n",
"* [Better evaluation of classification models](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)" "* [How to choose the right metric for evaluating an ML model](https://www.kaggle.com/vipulgandhi/how-to-choose-right-metric-for-evaluating-ml-model)"
] ]
}, },
{ {
@@ -666,7 +666,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -680,7 +680,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -39,7 +39,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In this exercise we are going to put in practice what we have learnt in the notebooks of the session. \n", "In this exercise, we are going to put in practice what we have learnt in the notebooks of the session. \n",
"\n", "\n",
"In the previous notebook we have been applying the SVM machine learning algorithm.\n", "In the previous notebook we have been applying the SVM machine learning algorithm.\n",
"\n", "\n",
@@ -67,7 +67,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -81,7 +81,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

BIN
ml2/images/iris-classes.png Normal file (new binary image, 1.4 MiB)

(second new binary image, 944 KiB; content not shown)

View File

@@ -1,21 +1,21 @@
""" """
Taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
======================== ========================
Plotting Learning Curves Plotting Learning Curves
======================== ========================
In the first column, first row the learning curve of a naive Bayes classifier
is shown for the digits dataset. Note that the training score and the
cross-validation score are both not very good at the end. However, the shape
of the curve can be found in more complex datasets very often: the training
score is very high at the beginning and decreases and the cross-validation
score is very low at the beginning and increases. In the second column, first
row we see the learning curve of an SVM with RBF kernel. We can see clearly
that the training score is still around the maximum and the validation score
could be increased with more training samples. The plots in the second row
show the times required by the models to train with various sizes of training
dataset. The plots in the third row show how much time was required to train
the models for each training sizes.
On the left side the learning curve of a naive Bayes classifier is shown for
the digits dataset. Note that the training score and the cross-validation score
are both not very good at the end. However, the shape of the curve can be found
in more complex datasets very often: the training score is very high at the
beginning and decreases and the cross-validation score is very low at the
beginning and increases. On the right side we see the learning curve of an SVM
with RBF kernel. We can see clearly that the training score is still around
the maximum and the validation score could be increased with more training
samples.
""" """
#print(__doc__)
import numpy as np import numpy as np
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
@@ -23,86 +23,181 @@ from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC from sklearn.svm import SVC
from sklearn.datasets import load_digits from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, def plot_learning_curve(
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): estimator,
title,
X,
y,
axes=None,
ylim=None,
cv=None,
n_jobs=None,
train_sizes=np.linspace(0.1, 1.0, 5),
):
""" """
Generate a simple plot of the test and traning learning curve. Generate 3 plots: the test and training learning curve, the training
samples vs fit times curve, the fit times vs score curve.
Parameters Parameters
---------- ----------
estimator : object type that implements the "fit" and "predict" methods estimator : estimator instance
An object of that type which is cloned for each validation. An estimator instance implementing `fit` and `predict` methods which
will be cloned for each validation.
title : string title : str
Title for the chart. Title for the chart.
X : array-like, shape (n_samples, n_features) X : array-like of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and Training vector, where ``n_samples`` is the number of samples and
n_features is the number of features. ``n_features`` is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional y : array-like of shape (n_samples) or (n_samples, n_features)
Target relative to X for classification or regression; Target relative to ``X`` for classification or regression;
None for unsupervised learning. None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional axes : array-like of shape (3,), default=None
Defines minimum and maximum yvalues plotted. Axes to use for plotting the curves.
cv : integer, cross-validation generator, optional ylim : tuple of shape (2,), default=None
If an integer is passed, it is the number of folds (defaults to 3). Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).
Specific cross-validation objects can be passed, see
sklearn.model_selection module for the list of possible objects
n_jobs : integer, optional cv : int, cross-validation generator or an iterable, default=None
Number of jobs to run in parallel (default 1). Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- integer, to specify the number of folds.
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : int or None, default=None
Number of jobs to run in parallel.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
train_sizes : array-like of shape (n_ticks,)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the ``dtype`` is float, it is regarded
as a fraction of the maximum size of the training set (that is
determined by the selected validation method), i.e. it has to be within
(0, 1]. Otherwise it is interpreted as absolute sizes of the training
sets. Note that for classification the number of samples usually have
to be big enough to contain at least one sample from each class.
(default: np.linspace(0.1, 1.0, 5))
""" """
plt.figure() if axes is None:
plt.title(title) _, axes = plt.subplots(1, 3, figsize=(20, 5))
axes[0].set_title(title)
if ylim is not None: if ylim is not None:
plt.ylim(*ylim) axes[0].set_ylim(*ylim)
plt.xlabel("Training examples") axes[0].set_xlabel("Training examples")
plt.ylabel("Score") axes[0].set_ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes) train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
estimator,
X,
y,
cv=cv,
n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True,
)
train_scores_mean = np.mean(train_scores, axis=1) train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1) train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1) test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1) test_scores_std = np.std(test_scores, axis=1)
plt.grid() fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, # Plot learning curve
train_scores_mean + train_scores_std, alpha=0.1, axes[0].grid()
color="r") axes[0].fill_between(
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, train_sizes,
test_scores_mean + test_scores_std, alpha=0.1, color="g") train_scores_mean - train_scores_std,
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", train_scores_mean + train_scores_std,
label="Training score") alpha=0.1,
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", color="r",
label="Cross-validation score") )
axes[0].fill_between(
train_sizes,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.1,
color="g",
)
axes[0].plot(
train_sizes, train_scores_mean, "o-", color="r", label="Training score"
)
axes[0].plot(
train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
)
axes[0].legend(loc="best")
# Plot n_samples vs fit_times
axes[1].grid()
axes[1].plot(train_sizes, fit_times_mean, "o-")
axes[1].fill_between(
train_sizes,
fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std,
alpha=0.1,
)
axes[1].set_xlabel("Training examples")
axes[1].set_ylabel("fit_times")
axes[1].set_title("Scalability of the model")
# Plot fit_time vs score
fit_time_argsort = fit_times_mean.argsort()
fit_time_sorted = fit_times_mean[fit_time_argsort]
test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
test_scores_std_sorted = test_scores_std[fit_time_argsort]
axes[2].grid()
axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
axes[2].fill_between(
fit_time_sorted,
test_scores_mean_sorted - test_scores_std_sorted,
test_scores_mean_sorted + test_scores_std_sorted,
alpha=0.1,
)
axes[2].set_xlabel("fit_times")
axes[2].set_ylabel("Score")
axes[2].set_title("Performance of the model")
plt.legend(loc="best")
return plt return plt
#digits = load_digits() fig, axes = plt.subplots(3, 2, figsize=(10, 15))
#X, y = digits.data, digits.target
X, y = load_digits(return_X_y=True)
#title = "Learning Curves (Naive Bayes)" title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train # Cross validation with 50 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set. # score curves, each time with 20% data randomly selected as a validation set.
#cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=100, cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
# test_size=0.2, random_state=0)
#estimator = GaussianNB() estimator = GaussianNB()
#plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4) plot_learning_curve(
estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)
#title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)" title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations: # SVC is more expensive so we do a lower number of CV iterations:
#cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=10, cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# test_size=0.2, random_state=0) estimator = SVC(gamma=0.001)
#estimator = SVC(gamma=0.001) plot_learning_curve(
#plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4) estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)
#plt.show() plt.show()

View File

@@ -3,7 +3,7 @@ import matplotlib.pyplot as plt
import numpy as np import numpy as np
from sklearn import svm from sklearn import svm
#Taken from http://nbviewer.jupyter.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb # Taken from http://nbviewer.jupyter.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb
def plot_svm(df): def plot_svm(df):
# set plotting parameters # set plotting parameters

ml21/.gitkeep Normal file
View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1,157 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Preprocessing\n",
"In this session, we will get more insight regarding how to preprocess data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Objectives\n",
"The main objectives of this session are:\n",
"* Understanding the need for preprocessing\n",
"* Understanding different preprocessing techniques\n",
"* Experimenting with several environments for preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"1. [Home](00_Intro_Preprocessing.ipynb)\n",
"3. [Initial Check](02_Initial_Check.ipynb)\n",
"4. [Filter Data](03_Filter_Data.ipynb)\n",
"5. [Unknown values](04_Unknown_Values.ipynb)\n",
"6. [Duplicated values](05_Duplicated_Values.ipynb)\n",
"7. [Rescaling Data](06_Rescaling_Data.ipynb)\n",
"8. [Binarize Data](07_Binarize_Data.ipynb)\n",
"9. [Categorial features](08_Categorical.ipynb)\n",
"10. [String Data](09_String_Data.ipynb)\n",
"12. [Handy libraries for preprocessing](11_0_Handy.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,714 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Initial Check with Pandas\n",
"\n",
"We can start with a quick quality check."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Load and check data\n",
"Check which data you are loading."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Check number of columns and rows"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(891, 12)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check names and types of columns\n",
"Check the data and type, for example if dates are of strings or what."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
" dtype='object')\n"
]
},
{
"data": {
"text/plain": [
"PassengerId int64\n",
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"SibSp int64\n",
"Parch int64\n",
"Ticket object\n",
"Fare float64\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get column names\n",
"print(df.columns)\n",
"# Get column data types\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check if the column is unique"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PassengerId is unique: True\n",
"Survived is unique: False\n",
"Pclass is unique: False\n",
"Name is unique: True\n",
"Sex is unique: False\n",
"Age is unique: False\n",
"SibSp is unique: False\n",
"Parch is unique: False\n",
"Ticket is unique: False\n",
"Fare is unique: False\n",
"Cabin is unique: False\n",
"Embarked is unique: False\n"
]
}
],
"source": [
"for i in column_names:\n",
" print('{} is unique: {}'.format(i, df[i].is_unique))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Check if the dataframe has an index\n",
"We will need it to do joins or merges."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=891, step=1)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check if there is an index. If not, you will get 'AtributeError: function object has no atribute index'\n",
"df.index"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,\n",
" 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,\n",
" 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,\n",
" 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,\n",
" 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,\n",
" 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,\n",
" 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,\n",
" 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,\n",
" 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,\n",
" 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,\n",
" 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,\n",
" 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,\n",
" 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,\n",
" 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,\n",
" 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,\n",
" 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,\n",
" 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,\n",
" 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,\n",
" 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,\n",
" 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,\n",
" 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,\n",
" 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,\n",
" 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,\n",
" 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,\n",
" 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,\n",
" 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,\n",
" 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,\n",
" 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,\n",
" 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,\n",
" 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,\n",
" 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,\n",
" 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,\n",
" 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,\n",
" 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,\n",
" 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,\n",
" 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,\n",
" 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,\n",
" 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,\n",
" 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,\n",
" 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,\n",
" 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,\n",
" 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,\n",
" 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,\n",
" 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,\n",
" 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,\n",
" 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,\n",
" 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,\n",
" 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,\n",
" 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,\n",
" 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,\n",
" 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,\n",
" 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,\n",
" 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,\n",
" 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,\n",
" 884, 885, 886, 887, 888, 889, 890])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# # Check the index values\n",
"df.index.values"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# If index does not exist\n",
"df.set_index('column_name_to_use', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count missing vales per column\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Filter Data\n",
"\n",
"Select the columns you want and delete the others."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Create list comprehension of the columns you want to lose\n",
"columns_to_drop = [column_names[i] for i in [1, 3, 5]]\n",
"# Drop unwanted columns \n",
"df.drop(columns_to_drop, inplace=True, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,591 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Unknown values\n",
"\n",
"Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backward\n",
"You can also **propagate** non-null values with these methods:\n",
"\n",
"* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
"* **bfill**: Fill values using the following valid observation to fill the gap.\n",
"* **interpolate**: Fill NaN values using interpolation.\n",
"\n",
"It will fill the next value in the dataframe with the previous non-NaN value. \n",
"\n",
"You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fill forward the value 4.0 and fill the next one (limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df.ffill(limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.ffill()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.bfill()"
]
},
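The third method listed above, interpolate, is not demonstrated in the notebook; a minimal sketch on the same toy column. With the default linear method, leading NaNs stay NaN and trailing NaNs are padded with the last valid value:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': [np.nan, np.nan, 2, 3, 4, np.nan, np.nan]})
print(df.interpolate())  # col1 -> NaN, NaN, 2.0, 3.0, 4.0, 4.0, 4.0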
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop any rows which have any nans\n",
"df1 = df.dropna()\n",
"# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
"df2 = df.dropna(axis=1)\n",
"# Only drop columns which have at least 90% non-NaNs \n",
"df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,198 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Binarize Data\n",
"* We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below are marked 0.\n",
"* This is called binarizing your data or thresholding your data. \n",
"\n",
"* It can be helpful when you have probabilities that you want to make crisp values."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Binarize Data with Scikit-Learn\n",
"We can create new binary attributes in Python using Scikit-learn with the Binarizer class.\n",
"I"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import Binarizer\n",
"\n",
"X = [[ 1., -1., 2.],\n",
" [ 2., 0., 0.],\n",
" [ 0., 1.1, -1.]]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"transformer = Binarizer(threshold=1.0).fit(X) # threshold 1.0"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 1.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformer.transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,812 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Categorical Data\n",
"\n",
"For many ML algorithms, we need to transform categorical data into numbers.\n",
"\n",
"For example:\n",
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*. \n",
"* **'Position'** with values 'phD', *'Professor'*, *'TA'*, *'graduate'*.\n",
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
"\n",
"There are two main approaches:\n",
"* Integer encoding\n",
"* One hot encoding"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Integer Encoding\n",
"We assign a number to every value:\n",
"\n",
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
"\n",
"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
"\n",
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
"\n",
"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",
"\n",
"In our examples, this representation can be suitable for **temperature**, but not for the other two."
]
},
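For reference, a sketch of integer encoding with scikit-learn's LabelEncoder, one way to obtain columns like the sex_encoded and position_encoded that appear in the output below (the toy frame mirrors the notebook's example; remember the ordering caveat above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Sex': ['Male', 'Female', 'Male', 'Female'],
                   'Position': ['graduate', 'professor', 'TA', 'phD']})
# Each encoder maps the sorted unique values of a column to 0..n-1
df['sex_encoded'] = LabelEncoder().fit_transform(df['Sex'])
df['position_encoded'] = LabelEncoder().fit_transform(df['Position'])
print(df)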
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## One Hot Encoding\n",
"A binary column is created for each value of the categorical variable."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Sex M F U\n",
"----- ---------\n",
"M 1 0 0\n",
"F is transformed into 0 1 0\n",
"Unknown 0 0 1\n",
"M 1 0 0 "
]
},
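{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The same table can be reproduced with *get_dummies* on a single Series (a minimal sketch; the full DataFrame example follows below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Minimal sketch: one hot encoding of a single column\n",
"import pandas as pd\n",
"pd.get_dummies(pd.Series(['M', 'F', 'Unknown', 'M'], name='Sex'))"
]
},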
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Transforming categorical data with Scikit-Learn\n",
"\n",
"We can use:\n",
"* **get_dummies()** (one hot encoding)\n",
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding). \n",
"\n",
"We are going to learn the first approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One Hot Encoding\n",
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
" \"Age\": [18, 19, 20, 30],\n",
"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
" }\n",
"df = pd.DataFrame(data)\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
"0 Marius 18 1 1 False True \n",
"1 Maria 19 0 3 True False \n",
"2 John 20 1 0 False True \n",
"3 Carla 30 0 2 True False \n",
"\n",
" Position_TA Position_graduate Position_phD Position_professor \n",
"0 False True False False \n",
"1 False False False True \n",
"2 True False False False \n",
"3 False False True False "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
"df_onehot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use *OneHotEncoder* from Scikit."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
"0 0.0 1.0 0.0 1.0 0.0 \n",
"1 1.0 0.0 0.0 0.0 0.0 \n",
"2 0.0 1.0 1.0 0.0 0.0 \n",
"3 1.0 0.0 0.0 0.0 1.0 \n",
"\n",
" Position_professor Name Age sex_encoded position_encoded \n",
"0 0.0 Marius 18 1 1 \n",
"1 1.0 Maria 19 0 3 \n",
"2 0.0 John 20 1 0 \n",
"3 0.0 Carla 30 0 2 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.compose import make_column_transformer\n",
"\n",
"df_onehotencoder = df\n",
"# create OneHotEncoder object\n",
"encoder = OneHotEncoder()\n",
"\n",
"# Transformer for several columns\n",
"transformer = make_column_transformer(\n",
" (OneHotEncoder(), ['Sex', 'Position']),\n",
" remainder='passthrough',\n",
" verbose_feature_names_out=False)\n",
"\n",
"# transform\n",
"transformed = transformer.fit_transform(df_onehotencoder)\n",
"\n",
"df_onehotencoder = pd.DataFrame(\n",
" transformed,\n",
" columns=transformer.get_feature_names_out())\n",
"df_onehotencoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas' get_dummy is easier for transforming DataFrames. OneHotEncoder is more efficient and can be good for integrating the step in a machine learning pipeline."
]
},
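{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of that integration, here is a hedged sketch of a pipeline that one-hot encodes *Sex* and *Position* before a classifier. The target *y* is hypothetical: this toy dataset has no label column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: OneHotEncoder as a pipeline step (the target y is hypothetical)\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"preprocess = make_column_transformer(\n",
"    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Position']),\n",
"    remainder='drop')\n",
"pipe = Pipeline([('encode', preprocess), ('clf', LogisticRegression())])\n",
"# pipe.fit(df[['Sex', 'Position']], y)  # encoder and model are fitted together"
]
},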
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integer encoding\n",
"We will use **LabelEncoder**. It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"# creating instance of labelencoder\n",
"labelencoder = LabelEncoder()\n",
"df_encoded = df\n",
"# Assigning numerical values and storing in another column\n",
"sex_values = ('Male', 'Female')\n",
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded\n",
"0 Marius 18 Male graduate 1\n",
"1 Maria 19 Female professor 0\n",
"2 John 20 Male TA 1\n",
"3 Carla 30 Female phD 0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded position_encoded\n",
"0 Marius 18 Male graduate 1 1\n",
"1 Maria 19 Female professor 0 3\n",
"2 John 20 Male TA 1 0\n",
"3 Carla 30 Female phD 0 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
"df_encoded"
]
},
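{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, *inverse_transform* recovers the original labels. Since *labelencoder* was last fitted on *Position*, it can invert *position_encoded*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recover the original labels from the integer codes\n",
"labelencoder.inverse_transform(df_encoded['position_encoded'])"
]
},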
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,652 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is widespread to clean string columns to follow a predefined format (e.g., emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"A simple [library](https://github.com/labtocat/beautifier) to cleanup and prettify URL patterns, domains, and so on. The library helps to clean Unicode, special characters, and unnecessary redirection patterns from the URLs and gives you a clean date.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: Some unicode code has been broken. We see the character in a different character dataset.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character encoding. Example: \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixed text for you) to fix it.\n",
"\n",
"First, you should install the library: **conda install ftfy** (or **pip install ftfy**)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n",
"U+005C \\ [Po] REVERSE SOLIDUS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+0028 ( [Ps] LEFT PARENTHESIS\n",
"U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE\n",
"U+0083 \\x83 [Cc] <unknown>\n",
"U+0084 \\x84 [Cc] <unknown>\n",
"U+0029 ) [Pe] RIGHT PARENTHESIS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+002F / [Po] SOLIDUS\n",
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n"
]
}
],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract date from text. We can use regular expressions or handy packages, such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-08 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), , A. Sharma, 2018.\n",
"* [Beautifier](https://github.com/labtocat/beautifier) package\n",
"* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package\n",
"* [python-dateutil](https://dateutil.readthedocs.io/en/stable/)package"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,139 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Handy libraries\n",
"Libraries that help in several preprocessing tasks.\n",
"\n",
"* [datacleaner](11_1_datacleaner.ipynb)\n",
"* [autoclean](11_3_autoclean.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,673 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Datacleaner\n",
"[Datacleaner](https://github.com/rhiever/datacleaner) supports:\n",
"\n",
"* drop rows with missing values\n",
"* replace missing values with the mode or median on a column-by-column basis\n",
"* encode non-numeric variables with numerical equivalents\n",
"\n",
"\n",
"Install with\n",
"\n",
"**pip install datacleaner**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from datacleaner import autoclean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>108</td>\n",
" <td>1</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>523</td>\n",
" <td>7.2500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>190</td>\n",
" <td>0</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>596</td>\n",
" <td>71.2833</td>\n",
" <td>81</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>353</td>\n",
" <td>0</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>669</td>\n",
" <td>7.9250</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>272</td>\n",
" <td>0</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>49</td>\n",
" <td>53.1000</td>\n",
" <td>55</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>15</td>\n",
" <td>1</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>472</td>\n",
" <td>8.0500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>548</td>\n",
" <td>1</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>101</td>\n",
" <td>13.0000</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>303</td>\n",
" <td>0</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>14</td>\n",
" <td>30.0000</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>413</td>\n",
" <td>0</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>675</td>\n",
" <td>23.4500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" <td>1</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>30.0000</td>\n",
" <td>60</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>220</td>\n",
" <td>1</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>466</td>\n",
" <td>7.7500</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \\\n",
"0 1 0 3 108 1 22.0 1 0 523 \n",
"1 2 1 1 190 0 38.0 1 0 596 \n",
"2 3 1 3 353 0 26.0 0 0 669 \n",
"3 4 1 1 272 0 35.0 1 0 49 \n",
"4 5 0 3 15 1 35.0 0 0 472 \n",
".. ... ... ... ... ... ... ... ... ... \n",
"886 887 0 2 548 1 27.0 0 0 101 \n",
"887 888 1 1 303 0 19.0 0 0 14 \n",
"888 889 0 3 413 0 28.0 1 2 675 \n",
"889 890 1 1 81 1 26.0 0 0 8 \n",
"890 891 0 3 220 1 32.0 0 0 466 \n",
"\n",
" Fare Cabin Embarked \n",
"0 7.2500 47 2 \n",
"1 71.2833 81 0 \n",
"2 7.9250 47 2 \n",
"3 53.1000 55 2 \n",
"4 8.0500 47 2 \n",
".. ... ... ... \n",
"886 13.0000 47 2 \n",
"887 30.0000 30 2 \n",
"888 23.4500 47 2 \n",
"889 30.0000 60 0 \n",
"890 7.7500 47 1 \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean(df, copy=True)\n",
"df_clean"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,578 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "849ad57e-6adb-4c2e-afd6-73db37eef572",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"id": "179cc802-9f1d-40b0-bf0c-9d4fb7ea1262",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"id": "9858d815-0390-4e77-a5ff-a8d2a1960981",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"id": "238bab60-75f0-4d29-ab05-66afc463b506",
"metadata": {},
"source": [
"# Autoclean\n",
"A simple library to clean data. [Autoclean](https://github.com/elisemercury/AutoClean) supports:\n",
"AutoClean supports:\n",
"\n",
"* Handling of duplicates\n",
"* Various imputation methods for missing values\n",
"* Handling of outliers\n",
"* Encoding of categorical data (OneHot, Label)\n",
"* Extraction of data time values\n",
"\n",
"Install the package: **pip install py-AutoClean**.\n",
"\n",
"Parameters:\n",
"\n",
"* **duplicates**\n",
" * default: False,\n",
" * other values: 'auto', True\n",
"* **missing_num**\n",
" * default:False,\n",
" * other values:\t'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False\n",
"* **missing_categ**\n",
" * default: False,\n",
" * other values:\t'auto', 'logreg', 'knn', 'most_frequent', 'delete', False\n",
"* **encode_categ**\n",
" * default: False,\n",
" * other values:\t'auto', ['onehot'], ['label'], False ; to encode only specific columns add a list of column names or indexes: ['auto', ['col1', 2]]\n",
"* **extract_datetime**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'D', 'M', 'Y', 'h', 'm', 's'\n",
"* **outliers**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'winz', 'delete'\n",
"* **outlier_param**\tdefault:\t1.5, other values:\tany int or float, False\n",
"* **logfile**\n",
" * default: True,\n",
" * other values:\tFalse\n",
"* **verbose**\n",
" * default: False,\n",
" * other values:\tTrue"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "491b034b-994e-4f06-b4bc-df0590a62aab",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from AutoClean import AutoClean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "d842eedf-3971-4966-a8b4-543bb56dd60d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AutoClean process completed in 0.289385 seconds\n",
"Logfile saved to: /home/cif/GoogleDrive/cursos/summer-school-romania/2019/notebooks/preprocessing/autoclean.log\n"
]
}
],
"source": [
"autoclean = AutoClean(df, mode='auto')\n",
"\n",
"# We can control the preprocessing\n",
"#autoclean = AutoClean(df, mode='auto', duplicates=False, missing_num=False, missing_categ=False, encode_categ=False, extract_datetime=False, outliers=False, outlier_param=1.5, logfile=True, verbose=False)\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "4ede7c55-475a-4748-8cc4-788f46c88b26",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_female</th>\n",
" <th>Sex_male</th>\n",
" <th>Embarked_C</th>\n",
" <th>Embarked_Q</th>\n",
" <th>Embarked_S</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>65.6344</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_female Sex_male \\\n",
"0 0 A/5 21171 7.2500 C128 S False True \n",
"1 0 PC 17599 65.6344 C85 C True False \n",
"2 0 STON/O2. 3101282 7.9250 C128 S True False \n",
"3 0 113803 53.1000 C123 S True False \n",
"4 0 373450 8.0500 C128 S False True \n",
"\n",
" Embarked_C Embarked_Q Embarked_S \n",
"0 False False True \n",
"1 True False False \n",
"2 False False True \n",
"3 False False True \n",
"4 False False True "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean.output\n",
"df_clean[0:5]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,502 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Duplicated values\n",
"\n",
"There are two possible approaches: **remove** these rows or **filling** them. It depends on every case.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
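{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of the **remove** approach on an illustrative frame (the frame itself is an assumption, not part of the original notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# drop_duplicates() removes duplicated rows, keeping the first occurrence\n",
"dup = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['a', 'a', 'b']})\n",
"dup.drop_duplicates()"
]
},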
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
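{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A runnable version of those patterns on a small illustrative frame (the column names are assumptions made for the example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"demo = pd.DataFrame({'name': ['Ana', np.nan, 'Luis'],\n",
"                     'score': [7.0, np.nan, 9.0]})\n",
"demo['name'] = demo['name'].fillna(' ')                     # string column\n",
"demo['score'] = demo['score'].fillna(demo['score'].mean())  # numeric column\n",
"demo"
]
},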
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backwards\n",
"You can also propagate non-null values forward or backwards by putting\n",
"method=pad as the method argument. It will fill the next value in the\n",
"dataframe with the previous non-NaN value. Maybe you just want to fill one\n",
"value ( limit=1 )or you want to fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Propagate the last non-NaN value (4.0) forward, filling at most one NaN (limit=1)\n",
"df.fillna(method='pad', limit=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfill with **bfill**."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill the leading NaN values backwards with the next valid value\n",
"df.fillna(method='bfill')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Drop any rows which have any NaNs\n",
"df.dropna()\n",
"# Drop columns that have any NaNs\n",
"df.dropna(axis=1)\n",
"# Drop columns that have less than 90% non-NaN values\n",
"df.dropna(thresh=int(df.shape[0] * .9), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

View File

@@ -0,0 +1,619 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
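{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For instance, a minimal regular-expression sketch (assuming a hypothetical string column **df['email']**) that normalizes the values before validating them:"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Normalize: strip surrounding whitespace and lowercase\n",
"df['email'] = df['email'].str.strip().str.lower()\n",
"# Keep only values that look like an email address\n",
"df = df[df['email'].str.match(r'^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$', na=False)]"
]
},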
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"Simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains, and so on. The library helps clean Unicode characters, special characters, and unnecessary redirection patterns from URLs, and gives you clean data.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: some Unicode text has been corrupted, so its characters are displayed in an unintended character encoding.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character encoding (e.g., \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixes text for you) to fix it.\n",
"\n",
"First, you should install the library: **conda install ftfy**. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract dates from text. We can use regular expressions or handy packages, such as **python-dateutil**.\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
"* Beautifier https://github.com/labtocat/beautifier\n",
"* Ftfy https://ftfy.readthedocs.io/en/latest/\n",
"* python-dateutil https://dateutil.readthedocs.io/en/stable/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.1 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1,185 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Visualization\n",
" \n",
"In this session, we will gain more insight into how to visualize data.\n",
"\n",
"# Objectives\n",
"\n",
"The main objectives of this session are:\n",
"* Understanding how to visualize data\n",
"* Understanding the purpose of different charts \n",
"* Experimenting with several environments for visualizing data\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Seaborn\n",
"\n",
"Seaborn is a Python data visualization library. Its main characteristics are:\n",
"\n",
"* A dataset-oriented API for examining relationships between multiple variables\n",
"* Specialized support for using categorical variables to show observations or aggregate statistics\n",
"* Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n",
"* Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n",
"* Convenient views of the overall structure of complex datasets\n",
"* High-level abstractions for structuring multi-plot grids that let you quickly build complex visualizations\n",
"* Concise control over matplotlib figure styling with several built-in themes\n",
"* Tools for choosing color palettes that faithfully reveal patterns in your data\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Install\n",
"Use:\n",
"\n",
"**conda install seaborn**\n",
"\n",
"or \n",
"\n",
"**pip install seaborn**"
]
},
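{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of the dataset-oriented API (it uses the bundled *tips* dataset, introduced in the next notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"# Load a bundled example dataset and draw a scatter plot with a categorical hue\n",
"tips = sns.load_dataset('tips')\n",
"sns.relplot(data=tips, x='total_bill', y='tip', hue='smoker')"
]
},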
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"1. [Home](00_Intro_Visualization.ipynb)\n",
"2. [Dataset](01_Dataset.ipynb)\n",
"3. [Comparison Charts](02_Comparison_Charts.ipynb)\n",
" 1. [More Comparison Charts](02_01_More_Comparison_Charts.ipynb)\n",
"4. [Distribution Charts](03_Distribution_Charts.ipynb)\n",
"5. [Hierarchical charts](04_Hierarchical_Charts.ipynb)\n",
"6. [Relational charts](05_Relational_Charts.ipynb)\n",
"7. [Spatial charts](06_Spatial_Charts.ipynb)\n",
"8. [Temporal charts](07_Temporal_Charts.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,363 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Visualization](00_Intro_Visualization.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Dataset\n",
"Seaborn includes several datasets. We can consult the available datasets and load them. \n",
"\n",
"The datasets are also available at https://github.com/mwaskom/seaborn-data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['anagrams',\n",
" 'anscombe',\n",
" 'attention',\n",
" 'brain_networks',\n",
" 'car_crashes',\n",
" 'diamonds',\n",
" 'dots',\n",
" 'dowjones',\n",
" 'exercise',\n",
" 'flights',\n",
" 'fmri',\n",
" 'geyser',\n",
" 'glue',\n",
" 'healthexp',\n",
" 'iris',\n",
" 'mpg',\n",
" 'penguins',\n",
" 'planets',\n",
" 'seaice',\n",
" 'taxis',\n",
" 'tips',\n",
" 'titanic']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.get_dataset_names()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>total_bill</th>\n",
" <th>tip</th>\n",
" <th>sex</th>\n",
" <th>smoker</th>\n",
" <th>day</th>\n",
" <th>time</th>\n",
" <th>size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16.99</td>\n",
" <td>1.01</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10.34</td>\n",
" <td>1.66</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>21.01</td>\n",
" <td>3.50</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>23.68</td>\n",
" <td>3.31</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>24.59</td>\n",
" <td>3.61</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25.29</td>\n",
" <td>4.71</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>8.77</td>\n",
" <td>2.00</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.88</td>\n",
" <td>3.12</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>15.04</td>\n",
" <td>1.96</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>14.78</td>\n",
" <td>3.23</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"0 16.99 1.01 Female No Sun Dinner 2\n",
"1 10.34 1.66 Male No Sun Dinner 3\n",
"2 21.01 3.50 Male No Sun Dinner 3\n",
"3 23.68 3.31 Male No Sun Dinner 2\n",
"4 24.59 3.61 Female No Sun Dinner 4\n",
"5 25.29 4.71 Male No Sun Dinner 4\n",
"6 8.77 2.00 Male No Sun Dinner 2\n",
"7 26.88 3.12 Male No Sun Dinner 4\n",
"8 15.04 1.96 Male No Sun Dinner 2\n",
"9 14.78 3.23 Male No Sun Dinner 2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = sns.load_dataset('tips')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Seaborn](http://seaborn.pydata.org/index.html) documentation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.1 KiB

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -27,14 +27,14 @@
 "source": [
 "# Introduction to Neural Networks\n",
 " \n",
-"In this lab session, we are going to learn how to train a neural network.\n",
+"In this lab session, we will learn how to train a neural network.\n",
 "\n",
 "# Objectives\n",
 "\n",
 "The main objectives of this session are:\n",
 "* Put in practice the notions learn in class about neural computing\n",
 "* Understand what an MLP is\n",
-"* Learn to use some libraries, such as scikit-learn "
+"* Learn to use some libraries, such as Scikit-learn."
 ]
 },
 {
@@ -58,7 +58,7 @@
 "metadata": {},
 "source": [
 "## Licence\n",
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -39,7 +39,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Multilayer perceptrons, also called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
+"Multilayer perceptrons, called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
 ]
 },
 {
@@ -58,7 +58,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this notebook we are going to try the spiral dataset with different algorthms. In particular, we are going to focus our attention on the MLP classifier.\n",
+"In this notebook, we will try the spiral dataset with different algorithms. In particular, we are going to focus our attention on the MLP classifier.\n",
 "\n",
 "\n",
 "Answer directly in your copy of the exercise and submit it as a moodle task."

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -39,10 +39,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this notebook we are going to apply a MLP to a simple regression task: learning the Fresnel functions.\n",
+"In this notebook, we are going to apply an MLP to a simple regression task: learning the Fresnel functions.\n",
 "\n",
 "\n",
-"Answer directly in your copy of the exercise and submit it as a moodle task."
+"Answer directly in your copy of the exercise and submit it as a Moodle task."
 ]
 },
 {
@@ -92,7 +92,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Change this variables to change the train and test dataset."
+"Change these variables to change the train and test dataset."
 ]
 },
 {

View File

@@ -15,7 +15,7 @@ def gen_spiral_dataset(n_examples=500, n_classes=2, a=None, b=None, pi_space=3):
 theta = np.linspace(0,pi_space*pi, num=n_examples)
 xy = np.zeros((n_examples,2))
-# logaritmic spirals
+# logarithmic spirals
 x_golden_parametric = lambda a, b, theta: a**(theta*b) * cos(theta)
 y_golden_parametric = lambda a, b, theta: a**(theta*b) * sin(theta)
 x_golden_parametric = np.vectorize(x_golden_parametric)

View File

@@ -48,7 +48,7 @@
 "# Introduction\n",
 "The purpose of this practice is to understand better how GAs work. \n",
 "\n",
-"There are many libraries that implement GAs, you can find some of then in the [References](#References) section."
+"There are many libraries that implement GAs; you can find some of them in the [References](#References) section."
 ]
 },
 {
@@ -56,7 +56,7 @@
 "metadata": {},
 "source": [
 "# Genetic Algorithms\n",
-"In this section we are going to use the library DEAP [References](#References) for implementing a genetic algorithms.\n",
+"In this section, we are going to use the library [DEAP](https://github.com/DEAP/deap/tree/master) for implementing a genetic algorithms.\n",
 "\n",
 "We are going to implement the OneMax problem as seen in class.\n",
 "\n",
@@ -187,9 +187,9 @@
 "metadata": {},
 "source": [
 "## Comparing\n",
-"Your task is modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
+"Your task is to modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
 "\n",
-"Submit your notebook and include a the modified code, and a comparison of the effects of these changes. \n",
+"Submit your notebook and include a modified code and a comparison of the effects of these changes. \n",
 "\n",
 "Discuss your findings."
 ]
@@ -198,31 +198,24 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Optimizing ML hyperparameters\n",
+"## Optional. Optimizing ML hyperparameters\n",
 "\n",
-"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously we have used GridSearch from Scikit. Using (sklearn-deap)[#References], optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
+"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously, we have used GridSearch from Scikit. Using [sklearn-deap](https://github.com/rsteca/sklearn-deap), optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
 "\n",
 "The same exercise (using the digits dataset) can be found in this [notebook](https://github.com/rsteca/sklearn-deap/blob/master/test.ipynb).\n",
 "\n",
-"Submit a notebook where you include well-crafted conclusions about the exercises, discussing the pros and cons of using genetic algorithms for this purpose.\n"
+"Since there is a problem with Scikit version 0.24, you can just comment on the different approaches.",
+"\n",
+"Alternatively, you can also use the library [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/index.html) and discuss the digit classification example included in the library: [digits decision tree](https://sklearn-genetic-opt.readthedocs.io/en/stable/notebooks/Digits_decision_tree.html)."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Optional exercises\n",
+"## Optional. Optimizing an ML pipeline with a genetic algorithm\n",
 "\n",
-"Here there is a proposed optional exercise."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Optimizing a ML pipeline with a genetic algorithm\n",
-"\n",
-"The library [TPOT](#References) optimizes ML pipelines and comes with a lot of (examples)[https://epistasislab.github.io/tpot/examples/] and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
+"The library [TPOT](https://epistasislab.github.io/tpot/latest/) optimizes ML pipelines and comes with a lot of [examples](https://epistasislab.github.io/tpot/latest/Tutorial/9_Genetic_Algorithm_Overview/) and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
 "\n",
 "Your task is to apply TPOT to the intermediate challenge and write a short essay explaining:\n",
 "* what TPOT does (with your own words).\n",
@@ -240,7 +233,8 @@
 "* [tpot](http://epistasislab.github.io/tpot/)\n",
 "* [gplearn](http://gplearn.readthedocs.io/en/latest/index.html)\n",
 "* [scikit-allel](https://scikit-allel.readthedocs.io/en/latest/)\n",
-"* [scklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)"
+"* [sklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)\n",
+"* [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/)"
 ]
 },
 {
@@ -254,13 +248,22 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]
 }
 ],
 "metadata": {
+"datacleaner": {
+"position": {
+"top": "50px"
+},
+"python": {
+"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
+},
+"window_display": false
+},
 "kernelspec": {
 "display_name": "Python 3",
 "language": "python",
@@ -276,7 +279,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.7.9"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

View File

@@ -48,7 +48,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
+"1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
+"1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
+"1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
 ]
 },
 {
@@ -64,7 +66,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -78,7 +80,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.10.10"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

View File

@@ -1,443 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Introduction](#Introduction)\n",
"* [Getting started with OpenAI Gym](#Getting-started-with-OpenAI-Gym)\n",
"* [The Frozen Lake scenario](#The-Frozen-Lake-scenario)\n",
"* [Q-Learning with the Frozen Lake scenario](#Q-Learning-with-the-Frozen-Lake-scenario)\n",
"* [Exercises](#Exercises)\n",
"* [Optional exercises](#Optional-exercises)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"The purpose of this practice is to understand better Reinforcement Learning (RL) and, in particular, Q-Learning.\n",
"\n",
"We are going to use [OpenAI Gym](https://gym.openai.com/). OpenAI Gym is a toolkit for developing and comparing RL algorithms. Take a look at their [website](https://gym.openai.com/).\n",
"\n",
"It implements [algorithm imitation](http://gym.openai.com/envs/#algorithmic), [classic control problems](http://gym.openai.com/envs/#classic_control), [Atari games](http://gym.openai.com/envs/#atari), [Box2D continuous control](http://gym.openai.com/envs/#box2d), [robotics with MuJoCo, Multi-Joint dynamics with Contact](http://gym.openai.com/envs/#mujoco), and [simple text based environments](http://gym.openai.com/envs/#toy_text).\n",
"\n",
"This notebook is based on [Diving deeper into Reinforcement Learning with Q-Learning](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"\n",
"First of all, install the OpenAI Gym library:\n",
"\n",
"```console\n",
"foo@bar:~$ pip install gym\n",
"```\n",
"\n",
"\n",
"If you get the error message 'NotImplementedError: abstract', [execute](https://github.com/openai/gym/issues/775) \n",
"```console\n",
"foo@bar:~$ pip install pyglet==1.2.4\n",
"```\n",
"\n",
"If you want to try the Atari environment, it is better that you opt for the full installation from the source. Follow the instructions at [OpenAI Gym](https://github.com/openai/gym#id15).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with OpenAI Gym\n",
"\n",
"First of all, read the [introduction](http://gym.openai.com/docs/#getting-started-with-gym) of OpenAI Gym."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environments\n",
"OpenAI Gym provides a number of problems called *environments*. \n",
"\n",
"Try 'CartPole-v0' (or 'MountainCar-v0')."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"\n",
"env = gym.make('CartPole-v0')\n",
"#env = gym.make('MountainCar-v0')\n",
"#env = gym.make('Taxi-v2')\n",
"\n",
"#env = gym.make('Jamesbond-ram-v0')\n",
"\n",
"env.reset()\n",
"for _ in range(1000):\n",
" env.render()\n",
" env.step(env.action_space.sample()) # take a random action"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will launch an external window with the game. If you cannot close that window, just execute in a code cell:\n",
"\n",
"```python\n",
"env.close()\n",
"```\n",
"\n",
"The full list of available environments can be found printing the environment registry as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gym import envs\n",
"print(envs.registry.all())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The environment's **step** function returns four values. These are:\n",
"\n",
"* **observation (object):** an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.\n",
"* **reward (float):** amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.\n",
"* **done (boolean):** whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)\n",
"* **info (dict):** diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.\n",
"\n",
"The typical agent loop consists of first calling the method *reset*, which provides an initial observation. Then the agent executes an action and receives the reward, the new observation, and whether the episode has finished (done is true). \n",
"\n",
"For example, analyze this sample agent loop of 100 steps. The details of the previous variables for this game, as described [here](https://github.com/openai/gym/wiki/CartPole-v0), are:\n",
"* **observation**: Cart Position, Cart Velocity, Pole Angle, Pole Velocity.\n",
"* **action**: 0\t(Push cart to the left), 1\t(Push cart to the right).\n",
"* **reward**: 1 for every step taken, including the termination step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"env = gym.make('CartPole-v0')\n",
"for i_episode in range(20):\n",
" observation = env.reset()\n",
" for t in range(100):\n",
" env.render()\n",
" print(observation)\n",
" action = env.action_space.sample()\n",
" print(\"Action \", action)\n",
" observation, reward, done, info = env.step(action)\n",
" print(\"Observation \", observation, \", reward \", reward, \", done \", done, \", info \" , info)\n",
" if done:\n",
" print(\"Episode finished after {} timesteps\".format(t+1))\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Frozen Lake scenario\n",
"We are going to play the [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/) game.\n",
"\n",
"The problem is a grid where you should go from the 'start' (S) position to the 'goal' position (G) (the pizza!). You can only walk through the 'frozen tiles' (F). Unfortunately, you can fall into a 'hole' (H).\n",
"![](images/frozenlake-problem.png \"Frozen lake problem\")\n",
"\n",
"The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise. The possible actions are going left, right, up or down. However, the ice is slippery, so you won't always move in the direction you intend.\n",
"\n",
"![](images/frozenlake-world.png \"Frozen lake world\")\n",
"\n",
"\n",
"Here you can see several episodes. A full recording is available at [Frozen World](http://gym.openai.com/envs/FrozenLake-v0/).\n",
"\n",
"![](images/recording.gif \"Example running\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Q-Learning with the Frozen Lake scenario\n",
"We are now going to apply Q-Learning for the Frozen Lake scenario. This part of the notebook is taken from [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb).\n",
"\n",
"First, we create the environment and a Q-table initialized with zeros to store the value of each action in a given state. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import random\n",
"\n",
"env = gym.make(\"FrozenLake-v0\")\n",
"\n",
"\n",
"action_size = env.action_space.n\n",
"state_size = env.observation_space.n\n",
"\n",
"\n",
"qtable = np.zeros((state_size, action_size))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we define the hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Q-Learning hyperparameters\n",
"total_episodes = 10000 # Total episodes\n",
"learning_rate = 0.8 # Learning rate\n",
"max_steps = 99 # Max steps per episode\n",
"gamma = 0.95 # Discounting rate\n",
"\n",
"# Exploration hyperparameters\n",
"epsilon = 1.0 # Exploration rate\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
"min_epsilon = 0.01 # Minimum exploration probability \n",
"decay_rate = 0.01 # Exponential decay rate for exploration prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we implement the Q-Learning algorithm.\n",
"\n",
"![](images/qlearning-algo.png \"Q-Learning algorithm\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List of rewards\n",
"rewards = []\n",
"\n",
"# 2 For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
" # Reset the environment\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards = 0\n",
" \n",
" for step in range(max_steps):\n",
" # 3. Choose an action a in the current world state (s)\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = random.uniform(0, 1)\n",
" \n",
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
" if exp_exp_tradeoff > epsilon:\n",
" action = np.argmax(qtable[state,:])\n",
"\n",
" # Else doing a random choice --> exploration\n",
" else:\n",
" action = env.action_space.sample()\n",
"\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" # qtable[new_state,:] : all the actions we can take from new state\n",
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
" \n",
" total_rewards += reward\n",
" \n",
" # Our new state is state\n",
" state = new_state\n",
" \n",
" # If done (if we're dead) : finish episode\n",
" if done == True: \n",
" break\n",
" \n",
" episode += 1\n",
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
" rewards.append(total_rewards)\n",
"\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we use the learned Q-table to play the Frozen Lake game."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"env.reset()\n",
"\n",
"for episode in range(5):\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" print(\"****************************************************\")\n",
" print(\"EPISODE \", episode)\n",
"\n",
" for step in range(max_steps):\n",
" env.render()\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = np.argmax(qtable[state,:])\n",
" \n",
" new_state, reward, done, info = env.step(action)\n",
" \n",
" if done:\n",
" break\n",
" state = new_state\n",
"env.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n",
"## Taxi\n",
"Analyze the [Taxi problem](http://gym.openai.com/envs/Taxi-v2/) and solve it by applying Q-Learning. You can find a solution like the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym).\n",
"\n",
"Analyze the impact of not changing the learning rate (alpha or epsilon, depending on the book) or changing it in a different way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional exercises\n",
"\n",
"## Doom\n",
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.5"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,138 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n",
"\n",
"## Taxi\n",
"Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it by applying Q-Learning. You can find a solution like the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and the notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Take into account that Gymnasium has changed, so you will have to adapt the code; see the sketch below.\n",
"\n",
"Analyze the impact of not changing the learning rate or changing it in a different way. "
]
},
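{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hint, below is a minimal sketch of the current Gymnasium agent loop (a random policy on *Taxi-v3*; the Q-table logic is left to you): **reset()** now returns *(observation, info)* and **step()** returns *(observation, reward, terminated, truncated, info)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gymnasium as gym\n",
"\n",
"env = gym.make('Taxi-v3')\n",
"state, info = env.reset()\n",
"done = False\n",
"while not done:\n",
"    action = env.action_space.sample()  # replace with an argmax over your Q-table\n",
"    state, reward, terminated, truncated, info = env.step(action)\n",
"    done = terminated or truncated\n",
"env.close()"
]
},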
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional exercises\n",
"Select one of the following exercises.\n",
"\n",
"## Blackjack\n",
"Analyze how to apply Q-Learning to solve Blackjack.\n",
"You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
"\n",
"## Doom\n",
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

ml5/qlearning.py (new file, 274 lines)

@@ -0,0 +1,274 @@
# Class definition of QLearning
from pathlib import Path
from typing import NamedTuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map


# Params
class Params(NamedTuple):
    total_episodes: int  # Total episodes
    learning_rate: float  # Learning rate
    gamma: float  # Discounting rate
    epsilon: float  # Exploration probability
    map_size: int  # Number of tiles of one side of the squared environment
    seed: int  # Define a seed so that we get reproducible results
    is_slippery: bool  # If true, the player moves in the intended direction with probability 1/3; otherwise it moves in either perpendicular direction, each with probability 1/3
    n_runs: int  # Number of runs
    action_size: int  # Number of possible actions
    state_size: int  # Number of possible states
    proba_frozen: float  # Probability that a tile is frozen
    savefig_folder: Path  # Root folder where plots are saved


class Qlearning:
    def __init__(self, learning_rate, gamma, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.reset_qtable()

    def update(self, state, action, reward, new_state):
        """Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]."""
        delta = (
            reward
            + self.gamma * np.max(self.qtable[new_state][:])
            - self.qtable[state][action]
        )
        q_update = self.qtable[state][action] + self.learning_rate * delta
        return q_update

    def reset_qtable(self):
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng

    def choose_action(self, action_space, state, qtable):
        """Choose an action `a` in the current world state (s)."""
        # First we randomize a number
        explor_exploit_tradeoff = self.rng.uniform(0, 1)
        # Exploration
        if explor_exploit_tradeoff < self.epsilon:
            action = action_space.sample()
        # Exploitation (taking the biggest Q-value for this state)
        else:
            # Break ties randomly
            # If all actions are the same for this state we choose a random one
            # (otherwise `np.argmax()` would always take the first one)
            if np.all(qtable[state][:] == qtable[state][0]):
                action = action_space.sample()
            else:
                action = np.argmax(qtable[state][:])
        return action


def run_frozen_maps(maps, params, rng):
    """Run FrozenLake in maps and plot results."""
    map_sizes = maps
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()
    for map_size in map_sizes:
        env = gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(
                size=map_size, p=params.proba_frozen, seed=params.seed
            ),
        )
        params = params._replace(action_size=env.action_space.n)
        params = params._replace(state_size=env.observation_space.n)
        env.action_space.seed(
            params.seed
        )  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            learning_rate=params.learning_rate,
            gamma=params.gamma,
            state_size=params.state_size,
            action_size=params.action_size,
        )
        explorer = EpsilonGreedy(
            epsilon=params.epsilon,
            rng=rng
        )
        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)
        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs
        plot_states_actions_distribution(
            states=all_states, actions=all_actions, map_size=map_size, params=params
        )  # Sanity check
        plot_q_values_map(qtable, env, map_size, params)
        env.close()
    return res_all, st_all


def run_env(env, params, learner, explorer):
    rewards = np.zeros((params.total_episodes, params.n_runs))
    steps = np.zeros((params.total_episodes, params.n_runs))
    episodes = np.arange(params.total_episodes)
    qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
    all_states = []
    all_actions = []

    for run in range(params.n_runs):  # Run several times to account for stochasticity
        learner.reset_qtable()  # Reset the Q-table between runs
        for episode in tqdm(
            episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
        ):
            state = env.reset(seed=params.seed)[0]  # Reset the environment
            step = 0
            done = False
            total_rewards = 0
            while not done:
                action = explorer.choose_action(
                    action_space=env.action_space, state=state, qtable=learner.qtable
                )
                # Log all states and actions
                all_states.append(state)
                all_actions.append(action)
                # Take the action (a) and observe the outcome state (s') and reward (r)
                new_state, reward, terminated, truncated, info = env.step(action)
                done = terminated or truncated
                learner.qtable[state, action] = learner.update(
                    state, action, reward, new_state
                )
                total_rewards += reward
                step += 1
                # Our new state is state
                state = new_state
            # Log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step
        qtables[run, :, :] = learner.qtable
    return rewards, steps, episodes, qtables, all_states, all_actions


def postprocess(episodes, params, rewards, steps, map_size):
    """Convert the results of the simulation in dataframes."""
    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(),
            "Steps": steps.flatten(),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])
    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st


def qtable_directions_map(qtable, map_size):
    """Get the best learned action & map it to arrows."""
    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}  # LEFT, DOWN, RIGHT, UP
    qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
    eps = np.finfo(float).eps  # Minimum float number on the machine
    for idx, val in enumerate(qtable_best_action.flatten()):
        if qtable_val_max.flatten()[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]
    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions


def plot_q_values_map(qtable, env, map_size, params):
    """Plot the last frame of the simulation and the policy learned."""
    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    # Plot the last frame
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    ax[0].imshow(env.render())
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.7,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
        annot_kws={"fontsize": "xx-large"},
    ).set(title="Learned Q-values\nArrows represent best action")
    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")
    img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_states_actions_distribution(states, actions, map_size, params):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")
    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")
    fig.tight_layout()
    img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_steps_and_rewards(rewards_df, steps_df, params):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.lineplot(
        data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
    )
    ax[0].set(ylabel="Cumulated rewards")
    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")
    for axi in ax:
        axi.legend(title="map size")
    fig.tight_layout()
    img_title = "frozenlake_steps_and_rewards.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()
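

# ---------------------------------------------------------------------------
# Usage sketch (illustrative, not part of the original module): the values
# below are assumptions chosen for a quick run; adjust them for real use.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    params = Params(
        total_episodes=2000,
        learning_rate=0.8,
        gamma=0.95,
        epsilon=0.1,
        map_size=4,
        seed=123,
        is_slippery=False,
        n_runs=5,
        action_size=None,  # filled in by run_frozen_maps from the environment
        state_size=None,  # filled in by run_frozen_maps from the environment
        proba_frozen=0.9,
        savefig_folder=Path("img"),
    )
    params.savefig_folder.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng(params.seed)
    # Train on two map sizes and plot the aggregated results
    res_all, st_all = run_frozen_maps([4, 7], params, rng)
    plot_steps_and_rewards(res_all, st_all, params)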

nlp/0_1_LLM.ipynb (new file, 742 lines; diff suppressed because one or more lines are too long)

nlp/0_1_NLP_Slides.ipynb (new file, 2538 lines; diff suppressed because one or more lines are too long)

@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents\n",
"* [First steps](#First-steps)\n",
"* [Movie review](#Movie-review)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# First steps\n",
"Given the text taken from https://www.romania-insider.com/baneasa-airport-reopening-date-jul-2022.\n",
"\n",
"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"text = \"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\""
]
},
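{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A possible starting point for the exercises below (assuming spaCy and a small English model such as `en_core_web_sm` are installed) is to build the `doc` object they refer to:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"# Load a small English pipeline (install it first with: python -m spacy download en_core_web_sm)\n",
"nlp = spacy.load('en_core_web_sm')\n",
"doc = nlp(text)"
]
},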
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 1. List the first 10 tokens of the doc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2. Number of tokens of the text."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 3. List the Noun chunks\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 4. Print the sentences of the text"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 5. Print the number of sentences of the text\n",
"Hint: build a list first"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 6. Print the second sentence. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 7. Visualize the dependency grammar analysis of the second sentence."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 8. Listing lemmas and deps\n",
"For every token in the second sentence, print the text token, the grammatical category, and the lemma in four columns.\n",
"\n",
"Example:\n",
"\n",
"you&emsp;&emsp;PRON&emsp;&emsp;you&emsp;&emsp;nsubj\n",
"\n",
"Hint: format the columns. You can use expandtabs."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 9. List the frequencies of POS in the document in a table."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 10. Preprocessing\n",
"\n",
"Remove from the doc stopwords, digits, and punctuation.\n",
"\n",
"Hint: check the token api https://spacy.io/api/token\n",
"\n",
"Print the number of tokens before and after preprocessing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 11. Entities of the document\n",
"Print the entities of the document, the type of the entity, and the explanation of the entity in a table with three columns.\n",
"\n",
"Example:\n",
"\n",
"Ubuntu&emsp;&emsp;&emsp;&emsp;ORG&emsp;&emsp;&emsp;&emsp;Companies, agencies, institutions, etc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 12. Visualize the entities\n",
"Show the entities highlighted in the text."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Movie review\n",
"\n",
"Classify the movie reviews from the following dataset https://data.world/rajeevsharma993/movie-reviews"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"* [Spacy](https://spacy.io/usage/spacy-101/#annotations) \n",
"* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
"* Natural Language Processing with Python, José Portilla, 2019."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}


@@ -105,9 +105,23 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/html": [
+"<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>CountVectorizer(max_features=5000)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">CountVectorizer</label><div class=\"sk-toggleable__content\"><pre>CountVectorizer(max_features=5000)</pre></div></div></div></div></div>"
+],
+"text/plain": [
+"CountVectorizer(max_features=5000)"
+]
+},
+"execution_count": 2,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
@@ -128,9 +142,21 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 3,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/plain": [
+"<3x10 sparse matrix of type '<class 'numpy.int64'>'\n",
+"\twith 15 stored elements in Compressed Sparse Row format>"
+]
+},
+"execution_count": 3,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectors = vectorizer.fit_transform(documents)\n",
"vectors"
@@ -146,12 +172,24 @@
],
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 4,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[[0 1 1 2 0 0 1 2 0 0]\n",
+" [1 0 0 0 2 0 0 1 2 1]\n",
+" [1 0 0 0 2 1 0 0 1 1]]\n",
+"['and' 'but' 'coming' 'is' 'like' 'sandwiches' 'short' 'summer' 'the'\n",
+" 'winter']\n"
+]
+}
+],
"source": [
"print(vectors.toarray())\n",
-"print(vectorizer.get_feature_names())"
+"print(vectorizer.get_feature_names_out())"
]
},
{
@@ -164,13 +202,25 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 5,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/plain": [
+"array(['and', 'but', 'coming', 'i', 'is', 'like', 'sandwiches', 'short',\n",
+" 'summer', 'the', 'winter'], dtype=object)"
+]
+},
+"execution_count": 5,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words=None, token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
-"vectorizer.get_feature_names()"
+"vectorizer.get_feature_names_out()"
]
},
{
@@ -182,20 +232,47 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 6,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/home/cif/anaconda3/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n",
+" warnings.warn(msg, category=FutureWarning)\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
+]
+},
+"execution_count": 6,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
-"vectorizer.get_feature_names()"
+"vectorizer.get_feature_names_out()"
]
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 7,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"frozenset({'or', 'be', 'least', 'ours', 'very', 'noone', 'more', 'can', 'front', 'last', 'co', 'where', 'beyond', 'you', 'was', 'to', 'nine', 'here', 'describe', 'than', 'rather', 'therefore', 'except', 'at', 'again', 'ourselves', 'most', 'anyway', 'thick', 'whither', 'thereupon', 'someone', 'hereupon', 'besides', 'among', 'hasnt', 'across', 'namely', 'because', 'is', 'out', 'same', 'yourself', 'somehow', 'sincere', 'con', 'hereby', 'towards', 'interest', 'much', 'up', 'why', 'myself', 'all', 'nobody', 'though', 'every', 'show', 'not', 'there', 'whether', 'still', 'name', 'when', 'the', 'each', 'six', 'nor', 'and', 'under', 'thereby', 'less', 'either', 'thence', 'into', 'seemed', 'something', 'four', 'sometimes', 'himself', 'those', 'nowhere', 'almost', 'are', 'empty', 'must', 'while', 'afterwards', 'perhaps', 'from', 'detail', 'through', 'any', 'have', 'may', 'he', 'anywhere', 'alone', 'without', 'beforehand', 'had', 'too', 'yourselves', 'our', 'see', 'how', 'please', 'what', 'am', 'do', 'it', 'serious', 'yet', 'down', 'top', 'amount', 'then', 'both', 'fire', 'been', 'wherein', 'done', 'etc', 'whose', 'whereafter', 'who', 'ltd', 'meanwhile', 'further', 'few', 'first', 'behind', 'made', 'yours', 'until', 'toward', 'amoungst', 'anyhow', 'we', 'with', 'give', 'go', 'no', 'back', 'else', 'becomes', 'your', 'fill', 'together', 'another', 'throughout', 'onto', 'de', 'me', 'ten', 'system', 'became', 'per', 'therein', 'everyone', 'often', 'ie', 'put', 'hers', 'herself', 'nevertheless', 'itself', 'eg', 'herein', 'his', 'this', 'cry', 'due', 'bill', 'one', 'on', 'being', 'themselves', 'of', 'some', 'their', 'neither', 'elsewhere', 'since', 'whole', 'eight', 'i', 'a', 'whoever', 'own', 'call', 'them', 'mostly', 'she', 'my', 'cannot', 'us', 'never', 'as', 'thin', 'upon', 'cant', 'un', 'before', 'her', 'otherwise', 'full', 'these', 'next', 'they', 'side', 'somewhere', 'fifty', 'hence', 'so', 'along', 'already', 'three', 'latter', 'anything', 'whom', 'could', 'indeed', 'nothing', 'whereby', 'which', 'sometime', 'become', 'ever', 'amongst', 'by', 'in', 'five', 'after', 'mine', 'fifteen', 'wherever', 'found', 'thereafter', 'third', 'keep', 'anyone', 'will', 'bottom', 'off', 'seem', 'none', 'an', 'whatever', 'over', 'during', 'also', 'latterly', 'via', 'take', 'former', 'above', 'now', 'becoming', 'hereafter', 'such', 'two', 'only', 'about', 'sixty', 're', 'everything', 'others', 'hundred', 'twelve', 'thus', 'even', 'well', 'always', 'once', 'beside', 'get', 'mill', 'seems', 'if', 'whereupon', 'find', 'forty', 'inc', 'whenever', 'around', 'other', 'should', 'many', 'enough', 'however', 'move', 'against', 'several', 'everywhere', 'has', 'whereas', 'that', 'whence', 'eleven', 'its', 'within', 'twenty', 'part', 'although', 'thru', 'couldnt', 'moreover', 'him', 'formerly', 'might', 'seeming', 'but', 'below', 'would', 'between', 'were', 'for'})\n"
+]
+}
+],
"source": [
"#stop words in scikit-learn for English\n",
"print(vectorizer.get_stop_words())"
@@ -442,7 +519,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -456,7 +533,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,

Some files were not shown because too many files have changed in this diff.