mirror of https://github.com/gsi-upm/sitc synced 2025-11-21 16:18:17 +00:00

Compare commits

156 Commits

Author SHA1 Message Date
Carlos A. Iglesias
9844820e66 Delete xai/readme 2025-06-06 17:24:29 +03:00
Carlos A. Iglesias
d10434362e Add files via upload 2025-06-06 17:24:05 +03:00
Carlos A. Iglesias
fb2135cea6 Create readme 2025-06-06 17:23:37 +03:00
Carlos A. Iglesias
ba6e533e0b Add files via upload
XAI notebook
2025-06-06 17:23:16 +03:00
Carlos A. Iglesias
4f5e976918 Create readme 2025-06-06 17:22:33 +03:00
Carlos A. Iglesias
b58370a19a Update .gitignore 2025-06-02 17:23:44 +03:00
Carlos A. Iglesias
5c203b0884 Update spiral.py
Fixed typo
2025-06-02 17:22:55 +03:00
Carlos A. Iglesias
5bf815f60f Update 2_4_2_Exercise_Optional.ipynb
Changed image path
2025-06-02 17:22:16 +03:00
Carlos A. Iglesias
90a3ff098b Update 2_4_1_Exercise.ipynb
Changed image path
2025-06-02 17:21:25 +03:00
Carlos A. Iglesias
945a8a7fb6 Update 2_4_0_Intro_NN.ipynb
Changed image path
2025-06-02 17:19:19 +03:00
Carlos A. Iglesias
6532ef1b27 Update 2_8_Conclusions.ipynb
Changed image path
2025-06-02 17:18:31 +03:00
Carlos A. Iglesias
3a73b2b286 Update 2_7_Model_Persistence.ipynb
Changed image path
2025-06-02 17:17:43 +03:00
Carlos A. Iglesias
2e4ec3cfdc Update 2_6_Model_Tuning.ipynb 2025-06-02 17:16:53 +03:00
Carlos A. Iglesias
21e7ae2f57 Update 2_5_2_Decision_Tree_Model.ipynb
Changed image path
2025-06-02 17:13:49 +03:00
Carlos A. Iglesias
7b4d16964d Update 2_5_1_kNN_Model.ipynb
Changed image path
2025-06-02 17:11:45 +03:00
Carlos A. Iglesias
c5967746ea Update 2_5_0_Machine_Learning.ipynb 2025-06-02 17:09:42 +03:00
Carlos A. Iglesias
ed7f0f3e1c Update 2_5_0_Machine_Learning.ipynb 2025-06-02 17:09:13 +03:00
Carlos A. Iglesias
9324516c19 Update 2_5_0_Machine_Learning.ipynb
Changed image path
2025-06-02 17:08:03 +03:00
Carlos A. Iglesias
6fc5565ea0 Update 2_2_Read_Data.ipynb 2025-06-02 17:05:17 +03:00
Carlos A. Iglesias
1113485833 Add files via upload 2025-06-02 17:03:20 +03:00
Carlos A. Iglesias
0c3f317a85 Add files via upload 2025-06-02 17:02:46 +03:00
Carlos A. Iglesias
0b550c837b Update 2_2_Read_Data.ipynb
Added figures
2025-06-02 17:00:58 +03:00
Carlos A. Iglesias
d7ce6df7fe Update 2_2_Read_Data.ipynb 2025-06-02 16:57:54 +03:00
Carlos A. Iglesias
e2edae6049 Update 2_2_Read_Data.ipynb 2025-06-02 16:54:37 +03:00
Carlos A. Iglesias
4ea0146def Update 2_2_Read_Data.ipynb 2025-06-02 16:54:06 +03:00
Carlos A. Iglesias
e7b2cee795 Add files via upload 2025-06-02 16:31:20 +03:00
Carlos A. Iglesias
9e1d0e5534 Add files via upload 2025-06-02 16:30:13 +03:00
Carlos A. Iglesias
f82203f371 Update 2_4_Preprocessing.ipynb
Changed image path
2025-06-02 16:29:26 +03:00
Carlos A. Iglesias
b9ecccdeab Update 2_3_1_Advanced_Visualisation.ipynb 2025-06-02 16:28:06 +03:00
Carlos A. Iglesias
44a555ac2d Update 2_3_1_Advanced_Visualisation.ipynb
Changed image path
2025-06-02 16:09:55 +03:00
Carlos A. Iglesias
ec11ff2d5e Update 2_3_0_Visualisation.ipynb
Changed image path
2025-06-02 16:06:53 +03:00
Carlos A. Iglesias
ec02125396 Update 2_2_Read_Data.ipynb 2025-06-02 16:04:57 +03:00
Carlos A. Iglesias
b5f1a7dd22 Update 2_0_0_Intro_ML.ipynb 2025-06-02 16:03:03 +03:00
Carlos A. Iglesias
1cc1e45673 Update 2_2_Read_Data.ipynb
Changed image path
2025-06-02 16:02:45 +03:00
Carlos A. Iglesias
a2ad2c0e92 Update 2_1_Intro_ScikitLearn.ipynb
Changed images path
2025-06-02 16:00:59 +03:00
Carlos A. Iglesias
1add6a4c8e Update 2_0_1_Objectives.ipynb
Changed image path
2025-06-02 15:58:32 +03:00
Carlos A. Iglesias
af78e6480d Update 2_0_0_Intro_ML.ipynb
changed path to image
2025-06-02 15:57:25 +03:00
Carlos A. Iglesias
cae7d8cbb2 Updated LLM 2025-05-05 16:39:20 +02:00
Carlos A. Iglesias
f58aa6c0b8 Delete nlp/0_1_LLM.ipynb 2025-05-05 16:38:41 +02:00
Carlos A. Iglesias
6e8448f22f Update 0_2_NLP_Assignment.ipynb 2025-04-24 18:31:56 +02:00
Carlos A. Iglesias
8f2a5c17d8 Update 0_1_NLP_Slides.ipynb 2025-04-24 18:30:18 +02:00
Carlos A. Iglesias
36d117e417 Delete nlp/spacy/readme.md 2025-04-21 18:59:11 +02:00
Carlos A. Iglesias
2fc057f6f9 Add files via upload 2025-04-21 18:58:47 +02:00
Carlos A. Iglesias
5b0d4f2a5d Add files via upload 2025-04-21 18:58:15 +02:00
Carlos A. Iglesias
7afa2b3b22 Create readme.md 2025-04-21 18:57:59 +02:00
Carlos A. Iglesias
4e0f9159e8 Update 2_5_1_Exercise.ipynb 2025-04-03 18:54:52 +02:00
Carlos A. Iglesias
82aa552976 Update 2_5_1_Exercise.ipynb 2025-04-03 18:53:35 +02:00
Carlos A. Iglesias
3ebff69cf8 Update 2_5_1_Exercise.ipynb 2025-04-03 18:43:58 +02:00
Carlos A. Iglesias
0f228bbec3 Update 2_5_1_Exercise.ipynb 2025-04-03 18:43:34 +02:00
Carlos A. Iglesias
64c8854741 Update 2_5_1_Exercise.ipynb 2025-04-03 18:41:49 +02:00
Carlos A. Iglesias
3e081e5d83 Update 2_5_1_Exercise.ipynb 2025-04-03 18:38:26 +02:00
Carlos A. Iglesias
065797b886 Update 2_5_1_Exercise.ipynb 2025-04-03 18:37:26 +02:00
Carlos A. Iglesias
8d2f625b7e Update 2_5_1_Exercise.ipynb 2025-04-03 18:36:31 +02:00
Carlos A. Iglesias
26eda30a71 Update 2_5_1_Exercise.ipynb 2025-04-03 18:35:53 +02:00
Carlos A. Iglesias
55365ae927 Update 2_5_1_Exercise.ipynb 2025-04-03 18:34:50 +02:00
Carlos A. Iglesias
152125b3da Update 2_5_1_Exercise.ipynb 2025-04-03 18:33:47 +02:00
Carlos A. Iglesias
97362545ea Update 2_5_1_Exercise.ipynb
Added https://sklearn-genetic-opt.readthedocs.io/en/stable/index.html
2025-04-03 18:32:32 +02:00
cif
c49c866a2e Update notebook with pivot_table examples 2025-03-06 16:05:16 +01:00
Carlos A. Iglesias
3f7694e330 Add files via upload
Added ttl
2025-02-20 19:14:13 +01:00
Carlos A. Iglesias
bf684d6e6e Updated index 2024-06-07 17:54:18 +03:00
Carlos A. Iglesias
d935b85b26 Add files via upload
Added images
2024-06-03 14:44:28 +02:00
Carlos A. Iglesias
1d8e777236 Create .p 2024-06-03 15:42:13 +03:00
Carlos A. Iglesias
23ebe2f390 Update 3_1_Read_Data.ipynb
Updated table markdown
2024-05-21 14:30:26 +02:00
Carlos A. Iglesias
01eb89ada4 New notebook about transformers 2024-05-14 09:55:02 +02:00
Carlos A. Iglesias
e4fdcd65a1 Update 2_6_1_Q-Learning_Basic.ipynb
Updated installation with new version of gymnasium
2024-04-24 18:46:54 +02:00
Carlos A. Iglesias
9f46c534f7 Update 2_5_1_Exercise.ipynb
Added optional exercises.
2024-04-18 18:04:43 +02:00
Carlos A. Iglesias
743c57691f Delete sna/t.txt 2024-04-17 17:24:12 +02:00
Carlos A. Iglesias
2c53b81299 Uploaded SNA files 2024-04-17 17:23:28 +02:00
Carlos A. Iglesias
dd6c053109 Add files via upload 2024-04-17 17:22:36 +02:00
Carlos A. Iglesias
e35e0a11e9 Create t.txt 2024-04-17 17:22:20 +02:00
Carlos A. Iglesias
7315b681e4 Update README.md 2024-04-17 17:21:21 +02:00
Carlos A. Iglesias
3fac9c6f78 Add files via upload 2024-04-04 18:27:48 +02:00
Carlos A. Iglesias
21819abeae Added visualization notebooks 2024-04-03 22:53:02 +02:00
Carlos A. Iglesias
0d4c0c706d Added images 2024-04-03 22:51:58 +02:00
Carlos A. Iglesias
8de629b495 Create .gitkeep 2024-04-03 22:51:19 +02:00
Carlos A. Iglesias
86114b4a56 Added preprocessing notebooks 2024-04-03 22:50:36 +02:00
Carlos A. Iglesias
1a3f618995 Add files via upload 2024-04-03 21:52:25 +02:00
Carlos A. Iglesias
a1121c03a5 Create .gitkeep - Added preprocessing notebooks 2024-04-03 21:51:34 +02:00
Carlos A. Iglesias
715d0cb77f Create .gitkeep
Added new set of exercises
2024-04-03 21:50:50 +02:00
Carlos A. Iglesias
0150ce7cf7 Update 3_7_SVM.ipynb
Updated formatted table
2024-02-22 12:23:08 +01:00
Carlos A. Iglesias
08dfe5c147 Update 3_4_Visualisation_Pandas.ipynb
Updated code to last version of seaborn
2024-02-22 11:55:35 +01:00
Carlos A. Iglesias
78e62af098 Update 3_3_Data_Munging_with_Pandas.ipynb
Updated to last version of scikit
2024-02-21 12:29:04 +01:00
Carlos A. Iglesias
3f5eba3e84 Update 3_2_Pandas.ipynb
Updated links
2024-02-21 12:16:12 +01:00
Carlos A. Iglesias
2de1cda8f1 Update 3_1_Read_Data.ipynb
Updated links
2024-02-21 12:14:25 +01:00
Carlos A. Iglesias
cc442c35f3 Update 3_0_0_Intro_ML_2.ipynb
Updated links
2024-02-21 12:12:14 +01:00
Carlos A. Iglesias
1100c352fa Update 2_6_Model_Tuning.ipynb
updated links
2024-02-21 11:47:34 +01:00
Carlos A. Iglesias
9b573d292d Update 2_5_2_Decision_Tree_Model.ipynb
Updated links
2024-02-21 11:41:42 +01:00
Carlos A. Iglesias
dd8a4f50d8 Update 2_5_2_Decision_Tree_Model.ipynb
Updated links
2024-02-21 11:40:59 +01:00
Carlos A. Iglesias
47148f2ccc Update util_ds.py
Updated links
2024-02-21 11:40:06 +01:00
Carlos A. Iglesias
8ffda8123a Update 2_5_1_kNN_Model.ipynb
Updated links
2024-02-21 11:07:38 +01:00
Carlos A. Iglesias
6629837e7d Update 2_5_0_Machine_Learning.ipynb
Updated links
2024-02-21 11:06:21 +01:00
Carlos A. Iglesias
ba08a9a264 Update 2_4_Preprocessing.ipynb
Updated links
2024-02-21 11:02:09 +01:00
Carlos A. Iglesias
4b8fd30f42 Update 2_3_1_Advanced_Visualisation.ipynb
Updated links
2024-02-21 11:00:53 +01:00
Carlos A. Iglesias
d879369930 Update 2_3_0_Visualisation.ipynb
Updated links
2024-02-21 10:57:34 +01:00
Carlos A. Iglesias
4da01f3ae6 Update 2_0_0_Intro_ML.ipynb
Updated links
2024-02-21 10:44:43 +01:00
Carlos A. Iglesias
da9a01e26b Update 2_0_1_Objectives.ipynb
Updated links
2024-02-21 10:43:40 +01:00
Carlos A. Iglesias
dc23b178d7 Delete python/plurals.py 2024-02-08 18:32:43 +01:00
Carlos A. Iglesias
5410d6115d Delete python/catalog.py 2024-02-08 18:32:18 +01:00
Carlos A. Iglesias
6749aa5deb Added files for modules 2024-02-08 18:26:08 +01:00
Carlos A. Iglesias
c31e6c1676 Update 1_2_Numbers_Strings.ipynb 2024-02-08 17:47:42 +01:00
Carlos A. Iglesias
1c7496c8ac Update 1_2_Numbers_Strings.ipynb
Improved formatting.
2024-02-08 17:46:18 +01:00
Carlos A. Iglesias
35b1ae4ec8 Update 1_8_Classes.ipynb
Improved formatting.
2024-02-08 17:43:25 +01:00
Carlos A. Iglesias
58fc6f5e9c Update 1_4_Sets.ipynb
Typo corrected.
2024-02-08 17:42:45 +01:00
Carlos A. Iglesias
91147becee Update 1_3_Sequences.ipynb
Formatting improvement.
2024-02-08 17:41:15 +01:00
Carlos A. Iglesias
1530995243 Update 1_0_Intro_Python.ipynb
Updated links.
2024-02-08 17:36:46 +01:00
Carlos A. Iglesias
0c0960cec7 Update 1_7_Variables.ipynb typo in bold markdown
Typo in bold markdown
2024-02-08 17:33:48 +01:00
cif
3363c953f4 Deleted previous version 2023-04-27 15:43:44 +02:00
cif
542ce2708d Updated assignment to gymnasium and extended it 2023-04-27 15:42:01 +02:00
cif
380340d66d Updated 4_4 to use get_feature_names_out() instead of get_feature_names 2023-04-23 16:41:53 +02:00
cif
7f49f8990b Updated 4_4 - using feature_log_prob_ instead of coef_ (deprecated) 2023-04-23 16:37:48 +02:00
Carlos A. Iglesias
419ea57824 Slides with Spacy 2023-04-20 18:20:44 +02:00
Carlos A. Iglesias
7d6010114d Upload data for assignment 2023-04-20 18:17:12 +02:00
Carlos A. Iglesias
f9d8234e14 Added exercise with Spacy 2023-04-20 16:20:28 +02:00
Carlos A. Iglesias
d41fa61c65 Delete 0_2_NLP_Assignment.ipynb 2023-04-20 16:19:57 +02:00
Carlos A. Iglesias
05a4588acf Exercise with Spacy 2023-04-20 16:18:47 +02:00
Carlos A. Iglesias
50933f6c94 Update 3_7_SVM.ipynb
Fixed typo and updated link
2023-03-09 18:04:14 +01:00
J. Fernando Sánchez
68ba528dd7 Fix typos 2023-02-20 19:43:36 +01:00
J. Fernando Sánchez
897bb487b1 Update LOD exercises 2023-02-13 18:26:14 +01:00
Oscar Araque
41d3bdea75 minor typos in ml1 2022-09-05 18:20:29 +02:00
Carlos A. Iglesias
0a9cd3bd5e Update 3_7_SVM.ipynb
Fixed typo in a comment
2022-03-17 17:58:09 +01:00
Carlos A. Iglesias
2c7c9e58e0 Update 3_7_SVM.ipynb
Fixed bug in ROC curve visualization
2022-03-17 17:50:27 +01:00
cif
f0278aea33 Updated 2022-03-07 14:19:44 +01:00
cif
7bf0fb6479 Updated 2022-03-07 14:17:02 +01:00
cif
4d87b07ed9 Updated visualization 2022-03-07 14:16:14 +01:00
cif
7d71ba5f7a Updated references 2022-03-07 13:03:48 +01:00
cif
1124c9129c Fixed URL 2022-03-07 13:01:21 +01:00
cif
df6449b55f Updated to last version of seaborn 2022-03-07 12:57:17 +01:00
cif
d99eeb733a Updated median with only numeric values 2022-03-07 12:44:14 +01:00
cif
a43fb4c78c Updated references 2022-03-07 12:28:10 +01:00
Carlos A. Iglesias
bf21e3ceab Update 3_1_Read_Data.ipynb
Updated references
2022-03-07 11:01:34 +01:00
Carlos A. Iglesias
e41d233828 Update 3_0_0_Intro_ML_2.ipynb
Updated bibliography
2022-03-07 10:58:29 +01:00
Carlos A. Iglesias
a7c6be5b96 Update 2_6_Model_Tuning.ipynb
Fixed typo.
2022-02-28 12:51:18 +01:00
Carlos A. Iglesias
11a1ea80d3 Update 2_6_Model_Tuning.ipynb
Fixed typos.
2022-02-28 12:45:40 +01:00
Carlos A. Iglesias
a209d18a5b Update 2_5_1_kNN_Model.ipynb
Fixed typo.
2022-02-28 12:38:27 +01:00
cif
ffefd8c2e3 Updated bibliography 2022-02-21 13:55:09 +01:00
cif
f43cde73e4 Updated bibliography 2022-02-21 13:51:21 +01:00
cif
8784fdc773 Updated bibliography 2022-02-21 13:39:33 +01:00
cif
a6d5f9ddeb Updated bibliography 2022-02-21 13:32:07 +01:00
cif
2e72a4d729 Updated bibliography 2022-02-21 13:29:33 +01:00
cif
9426b4c061 Updated bibliography 2022-02-21 13:26:24 +01:00
cif
5e5979d515 Updated links 2022-02-21 13:22:46 +01:00
cif
270dcec611 Updated links 2022-02-21 13:09:21 +01:00
Carlos A. Iglesias
e6e52b43ee Update 2_4_Preprocessing.ipynb
Updated Packt link.
2022-02-21 12:57:53 +01:00
Carlos A. Iglesias
3b7675fa3f Update 2_3_0_Visualisation.ipynb
Updated Packt bibliography link.
2022-02-21 12:56:22 +01:00
Carlos A. Iglesias
44c63412f9 Update 2_2_Read_Data.ipynb
Updated scikit url
2022-02-21 12:26:30 +01:00
Carlos A. Iglesias
5febbc21a4 Update 2_1_Intro_ScikitLearn.ipynb
Fixed typo in dimensionality.
2022-02-21 12:22:15 +01:00
J. Fernando Sánchez
66ed4ba258 Minor changes LOD 01 and 03 2022-02-15 20:48:49 +01:00
Carlos A. Iglesias
95cd25aef4 Update 1__10_Modules_Packages.ipynb
Fixed link to module tutorial
2022-02-10 17:51:32 +01:00
J. Fernando Sánchez
955e74fc8e Add requirements
Now the dependencies should be automatically installed if you open the repo
through Jupyter Binder
2021-11-10 08:48:54 +01:00
cif2cif
6743dad100 Cleaned output 2021-06-07 10:38:53 +02:00
cif2cif
729f7684c2 Cleaned output 2021-06-07 10:36:12 +02:00
cif2cif
ae8d3d3ba2 Updated with the new libraries 2021-05-07 11:10:21 +02:00
cif2cif
2ba0e2f3d9 updated to last version of OpenGym 2021-04-19 19:10:03 +02:00
cif2cif
c9114cc796 Fixed broken link and bug of sklearn-deap with scikit 0.24 2021-04-19 17:47:22 +02:00
cif2cif
b80c097362 Merge branch 'master' of https://github.com/gsi-upm/sitc 2021-04-06 10:21:25 +02:00
cif2cif
161cd8492b Fixed bug in substrings_in_string and set default df[AgeGroup] to np.nan 2021-04-06 10:20:29 +02:00
124 changed files with 139680 additions and 1531 deletions

View File

@@ -1,7 +1,7 @@
 # sitc
 Exercises for Intelligent Systems Course at Universidad Politécnica de Madrid, Telecommunication Engineering School. This material is used in the subjects
-- SITC (Sistemas Inteligentes y Tecnologías del Conocimiento) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
-- TIAD (Tecnologías Inteligentes de Análisis de Datos) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos
+- CDAW (Ciencia de datos y aprendizaje en automático en la web de datos) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
+- ABID (Analítica de Big Data) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos
 For following this course:
 - Follow the instructions to install the environment: https://github.com/gsi-upm/sitc/blob/master/python/1_1_Notebooks.ipynb (Just install 'conda')
@@ -9,11 +9,13 @@ For following this course:
 - Run in a terminal in the folder sitc: jupyter notebook (and enjoy)
 Topics
-* Python: quick introduction to Python
+* Python: a quick introduction to Python
 * ML-1: introduction to machine learning with scikit-learn
 * ML-2: introduction to machine learning with pandas and scikit-learn
+* ML-21: preprocessing and visualization
 * ML-3: introduction to machine learning. Neural Computing
 * ML-4: introduction to Evolutionary Computing
 * ML-5: introduction to Reinforcement Learning
 * NLP: introduction to NLP
 * LOD: Linked Open Data, exercises and example code
+* SNA: Social Network Analysis

1
images/.p Normal file
View File

@@ -0,0 +1 @@

BIN images/EscUpmPolit_p.gif Normal file (binary file not shown; 3.1 KiB)
BIN images/cart.png Normal file (binary file not shown; 95 KiB)
BIN images/data-chart-type.png Normal file (binary file not shown; 34 KiB)
BIN (file name not shown) (binary file not shown; 54 KiB)
BIN images/frozenlake-world.png Normal file (binary file not shown; 67 KiB)
BIN images/gym-maze.gif Normal file (binary file not shown; 222 KiB)
BIN images/iris-classes.png Normal file (binary file not shown; 1.4 MiB)
BIN images/iris-dataset.jpg Normal file (binary file not shown; 44 KiB)
BIN images/iris-features.png Normal file (binary file not shown; 944 KiB)
BIN (file name not shown) (binary file not shown; 237 KiB)
BIN (file name not shown) (binary file not shown; 87 KiB)
BIN (file name not shown) (binary file not shown; 56 KiB)
BIN (file name not shown) (binary file not shown; 87 KiB)
BIN (file name not shown) (binary file not shown; 58 KiB)
BIN images/qlearning-algo.png Normal file (binary file not shown; 85 KiB)
BIN images/recording.gif Normal file (binary file not shown; 1.8 MiB)
BIN images/titanic.jpg Normal file (binary file not shown; 152 KiB)

View File

@@ -124,7 +124,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql https://live.dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT ?s ?p ?o\n",
 "WHERE {\n",
@@ -149,7 +149,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql https://live.dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT *\n",
 "WHERE\n",
@@ -445,7 +445,7 @@
 "window_display": false
 },
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -459,7 +459,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.9.1"
+"version": "3.8.10"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

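Both hunks above repoint the cells from the retired live.dbpedia.org mirror to the main DBpedia endpoint. As a minimal sketch of the kind of cell being updated (the %%sparql magic is the notebook's own, as shown in the hunks):

%%sparql https://dbpedia.org/sparql

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
}
LIMIT 10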
View File

@@ -790,11 +790,12 @@
 "\n",
 "SELECT *\n",
 "WHERE { ... }\n",
-"ORDER BY <variable> <variable> ... DESC(<variable>) ASC(<variable>)\n",
+"ORDER BY <variable> <variable> ... \n",
 "... other statements like LIMIT ...\n",
 "```\n",
 "\n",
-"The results can be sorted in ascending or descending order, and using several variables."
+"The results can be sorted in ascending or descending order, and using several variables.\n",
+"By default the results are ordered in ascending order, but you can indicate the order using an optional modifier (`ASC(<variable>)`, or `DESC(<variable>)`). \n"
 ]
 },
 {
@@ -880,7 +881,7 @@
 " rdfs:label \"Ringo Starr\" .\n",
 "```\n",
 "\n",
-"Using this structure, and the SPARQL statements you already know, to get the **names** of all musicians that collaborated in at least one song.\n"
+"Using this structure, and the SPARQL statements you already know, get the **names** of all musicians that collaborated in at least one song.\n"
 ]
 },
 {
@@ -954,13 +955,13 @@
 "\n",
 "Results can be aggregated using different functions.\n",
 "One of the simplest functions is `COUNT`.\n",
-"The syntax for COUNT is:\n",
+"The syntax for `COUNT` is:\n",
 " \n",
 "```sparql\n",
 "SELECT (COUNT(?variable) as ?count_name)\n",
 "```\n",
 "\n",
-"Use `COUNT` to get the number of songs in which Ringo collaborated."
+"Use `COUNT` to get the number of songs in which Ringo collaborated. Your query should return a column named `number`."
 ]
 },
 {
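As an illustrative sketch of the COUNT shape this exercise asks for (not the graded solution; the s: prefix and the rdfs:label pattern are assumed from the surrounding cells of this notebook):

SELECT (COUNT(?song) as ?number)
WHERE {
  ?song a s:Song ;
        ?instrument ?musician .
  ?musician rdfs:label "Ringo Starr" .
}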
@@ -1038,7 +1039,7 @@
 "\n",
 "Once results are grouped, they can be aggregated using any aggregation function, such as `COUNT`.\n",
 "\n",
-"Using `GROUP BY` and `COUNT`, get the count of songs that use each instrument:"
+"Using `GROUP BY` and `COUNT`, get the count of songs in which Ringo Starr has played each of the instruments:"
 ]
 },
 {
@@ -1143,7 +1144,9 @@
 "Now, use the same principle to get the count of **different** instruments in each song.\n",
 "Some songs have several musicians playing the same instrument, but we only care about *different* instruments in each song.\n",
 "\n",
-"Use `?number` for the count."
+"Use `?song` for the song and `?number` for the count.\n",
+"\n",
+"Take into consideration that instruments are entities of type `i:Instrument`."
 ]
 },
 {
@@ -1153,7 +1156,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "2d0633303eedd0655e9b64bb00317dba",
+"checksum": "3139d9b7e620266946ffe1ae0cf67581",
 "grade": false,
 "grade_id": "cell-ee208c762d00da9c",
 "locked": false,
@@ -1173,6 +1176,8 @@
 " [] a s:Song ;\n",
 " rdfs:label ?song ;\n",
 " ?instrument ?musician .\n",
+" \n",
+"?instrument a s:Instrument .\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
 "ORDER BY DESC(?number)"
@@ -1186,7 +1191,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "301aa479241fa02534ee047cf7577eee",
+"checksum": "5abf6eb7a67ebc9f7612b876105c1960",
 "grade": true,
 "grade_id": "cell-ddeec32b8ac3d894",
 "locked": true,
@@ -1198,7 +1203,7 @@
 "outputs": [],
 "source": [
 "s = solution()\n",
-"assert s['columns']['number'][0] == '27'"
+"assert s['columns']['number'][0] == '25'"
 ]
 },
 {
@@ -1243,10 +1248,10 @@
 "metadata": {},
 "source": [
 "However, there are some songs that do not have a vocalist (at least, in the dataset).\n",
-"Those songs will not appear in the list above, because we they do not match part of the `WHERE` clause.\n",
+"Those songs will not appear in the list above, because they do not match part of the `WHERE` clause.\n",
 "\n",
 "In these cases, we can specify optional values in a query using the `OPTIONAL` keyword.\n",
-"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query.\n",
+"When a set of clauses are inside an `OPTIONAL` group, the SPARQL endpoint will try to use them in the query.\n",
 "If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
 "\n",
 "To exemplify this, we can use a property that **does not exist in the dataset**:"
@@ -1504,7 +1509,9 @@
 "source": [
 "Now, count how many instruments each musician has played in a song.\n",
 "\n",
-"**Do not count lead (`i:vocals`) or backing vocals (`i:backingvocals`) as instruments**."
+"**Do not count lead (`i:vocals`) or backing vocals (`i:backingvocals`) as instruments**.\n",
+"\n",
+"Use `?musician` for the musician and `?number` for the count."
 ]
 },
 {
@@ -1570,7 +1577,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Which songs had Ringo in dums OR Lennon in lead vocals? (UNION)"
+"### Which songs had Ringo in drums OR Lennon in lead vocals? (UNION)"
 ]
 },
 {
@@ -1636,7 +1643,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "d583b30a1e00960df3a4411b6854c8c8",
+"checksum": "11061e79ec06ccb3a9c496319a528366",
 "grade": true,
 "grade_id": "cell-409402df0e801d09",
 "locked": true,
@@ -1647,7 +1654,7 @@
 },
 "outputs": [],
 "source": [
-"assert len(solution()['tuples']) == 246"
+"assert len(solution()['tuples']) == 209"
 ]
 },
 {
@@ -1770,7 +1777,9 @@
 "\n",
 "Using `GROUP_CONCAT`, get a list of the instruments that each musician could play.\n",
 "\n",
-"You can consult how to use GROUP_CONCAT [here](https://www.w3.org/TR/sparql11-query/)."
+"You can consult how to use GROUP_CONCAT [here](https://www.w3.org/TR/sparql11-query/).\n",
+"\n",
+"Use `?musician` for the musician and `?instruments` for the list of instruments."
 ]
 },
 {
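A hedged sketch of the GROUP_CONCAT shape asked for above: the separator argument is standard SPARQL 1.1, str() casts the instrument URIs to strings, and the triple pattern is assumed from this notebook's dataset:

SELECT ?musician (GROUP_CONCAT(str(?instrument); separator=", ") as ?instruments)
WHERE {
  [] a s:Song ;
     ?instrument ?musician .
}
GROUP BY ?musician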
@@ -1815,7 +1824,9 @@
 "\n",
 "You can check if a string or URI matches a regular expression with `regex(?variable, \"<regex>\", \"i\")`.\n",
 "\n",
-"The documentation for regular expressions in SPARQL is [here](https://www.w3.org/TR/rdf-sparql-query/)."
+"The documentation for regular expressions in SPARQL is [here](https://www.w3.org/TR/rdf-sparql-query/).\n",
+"\n",
+"Use `?instrument` for the instrument and `?ins` for the url of the type."
 ]
 },
 {
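And a sketch of the regex filter that hunk describes, matching type URIs against a pattern; the "guitar" pattern is only an example, and str() converts the URI before matching:

SELECT ?instrument ?ins
WHERE {
  ?instrument a ?ins .
  FILTER(regex(str(?ins), "guitar", "i"))
}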
@@ -1873,7 +1884,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -1887,9 +1898,22 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.1"
+"version": "3.8.10"
+},
+"toc": {
+"base_numbering": 1,
+"nav_menu": {},
+"number_sections": true,
+"sideBar": true,
+"skip_h1_title": false,
+"title_cell": "Table of Contents",
+"title_sidebar": "Contents",
+"toc_cell": false,
+"toc_position": {},
+"toc_section_display": true,
+"toc_window_display": false
 }
 },
 "nbformat": 4,
-"nbformat_minor": 2
+"nbformat_minor": 4
 }

View File

@@ -441,7 +441,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -455,7 +455,20 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.1"
+"version": "3.8.10"
+},
+"toc": {
+"base_numbering": 1,
+"nav_menu": {},
+"number_sections": true,
+"sideBar": true,
+"skip_h1_title": false,
+"title_cell": "Table of Contents",
+"title_sidebar": "Contents",
+"toc_cell": false,
+"toc_position": {},
+"toc_section_display": true,
+"toc_window_display": false
 }
 },
 "nbformat": 4,

View File

@@ -189,8 +189,8 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Let's start with a simple query. We will get a list of cities and towns in Madrid.\n",
-"If we take a look at the DBpedia ontology or the page of any town we already know, we discover that the property that links towns to their community is [`isPartOf`](http://dbpedia.org/ontology/isPartOf), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
+"Let's start with a simple query. We will get a list of towns and other populated areas within the Community of Madrid.\n",
+"If we take a look at the DBpedia ontology, or the page of any town we already know, we discover that the property that links towns to their community is [`subdivision`](http://dbpedia.org/ontology/subdivision), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
 "\n",
 "Since there are potentially many cities to get, we will limit our results to the first 10 results:"
 ]
@@ -201,11 +201,11 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "SELECT ?localidad\n",
 "WHERE {\n",
-" ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid>\n",
+" ?localidad <http://dbpedia.org/ontology/subdivision> <http://dbpedia.org/resource/Community_of_Madrid>\n",
 "}\n",
 "LIMIT 10"
 ]
@@ -224,14 +224,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 " \n",
 "SELECT ?localidad\n",
 "WHERE {\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
+" ?localidad dbo:subdivision dbr:Community_of_Madrid.\n",
 "}\n",
 "LIMIT 10"
 ]
@@ -259,10 +259,11 @@
 "source": [
 "Now that you have some experience under your belt, it is time to design your own query.\n",
 "\n",
-"Your first task it to get a list of Spanish Novelits, using the skeleton below and the previous query to guide you.\n",
+"Your first task is to get a list of writers, using the skeleton below and the previous query to guide you.\n",
 "\n",
-"Pages for Spanish novelists are grouped in the *Spanish novelists* DBpedia category. You can use that fact to get your list.\n",
-"In other words, the difference from the previous query will be using `dct:subject` instead of `dbo:isPartOf`, and `dbc:Spanish_novelists` instead of `dbr:Community_of_Madrid`."
+"The DBpedia vocabulary has a special class for writers: `<http://dbpedia.org/ontology/Writer>`.\n",
+"\n",
+"In other words, the difference from the previous query will be using `a` instead of `dbo:isPartOf`, and `dbo:Writer` instead of `dbr:Community_of_Madrid`."
 ]
 },
 {
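Since the exercise skeletons below hide the answers, here is the general shape of a type query with the `a` shorthand for rdf:type and the dbo:Writer class named above (an illustrative sketch, not the graded solution):

%%sparql https://dbpedia.org/sparql

PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor
WHERE {
  ?escritor a dbo:Writer .
}
LIMIT 10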
@@ -272,7 +273,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "eef1c62e2797bd3ef01f2061da6f83c4",
+"checksum": "2a5c55e8bca983aca6cc2293f4560f31",
 "grade": false,
 "grade_id": "cell-7a9509ff3c34127e",
 "locked": false,
@@ -282,10 +283,10 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
+"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor\n",
 "\n",
@@ -324,7 +325,7 @@
 "source": [
 "### Using more criteria\n",
 "\n",
-"We can get more than one property in the same query. Let us modify our query to get the population of the cities as well."
+"We can get more than one property in the same query. Let us modify our query to get the total area of the towns we found before."
 ]
 },
 {
@@ -333,22 +334,21 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 "PREFIX dbp: <http://dbpedia.org/property/>\n",
 " \n",
-"SELECT ?localidad ?pop ?when\n",
+"SELECT ?localidad ?area\n",
 "\n",
 "WHERE {\n",
-" ?localidad dbo:populationTotal ?pop .\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
-" ?localidad dbp:populationAsOf ?when .\n",
+" ?localidad dbo:areaTotal ?area .\n",
+" ?localidad dbo:subdivision dbr:Community_of_Madrid .\n",
 "}\n",
 "\n",
-"LIMIT 100"
+"LIMIT 1000"
 ]
 },
 {
@@ -358,8 +358,7 @@
 "outputs": [],
 "source": [
 "assert 'localidad' in solution()['columns']\n",
-"assert 'http://dbpedia.org/resource/Parla' in solution()['columns']['localidad']\n",
-"assert ('http://dbpedia.org/resource/San_Sebastián_de_los_Reyes', '75912', '2009') in solution()['tuples']"
+"assert ('http://dbpedia.org/resource/Lozoya', '5.794e+07') in solution()['tuples']"
 ]
 },
 {
@@ -368,7 +367,7 @@
 "source": [
 "Time to try it yourself.\n",
 "\n",
-"Get the list of Spanish novelists AND their name (using rdfs:label)."
+"Get the list of writers AND their name (using rdfs:label)."
 ]
 },
 {
@@ -378,7 +377,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "9d4193612dea95da2d91762b638ad5e6",
+"checksum": "2ebdc8d3f3420bb961e2c8c77d027c3b",
 "grade": false,
 "grade_id": "cell-83dcaae0d09657b5",
 "locked": false,
@@ -388,7 +387,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -399,7 +398,7 @@
 "WHERE {\n",
 "# YOUR ANSWER HERE\n",
 "}\n",
-"LIMIT 10"
+"LIMIT 100"
 ]
 },
 {
@@ -410,7 +409,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "86115c2a8982ad12b7250cf4341ae9c3",
+"checksum": "d779d690d5d1865973fdcf113b74c221",
 "grade": true,
 "grade_id": "cell-8afd28aada7a896c",
 "locked": true,
@@ -422,8 +421,8 @@
 "outputs": [],
 "source": [
 "assert 'escritor' in solution()['columns']\n",
-"assert 'http://dbpedia.org/resource/Eduardo_Mendoza_Garriga' in solution()['columns']['escritor']\n",
-"assert ('http://dbpedia.org/resource/Eduardo_Mendoza_Garriga', 'Eduardo Mendoza') in solution()['tuples']"
+"assert 'http://dbpedia.org/resource/Alison_Stine' in solution()['columns']['escritor']\n",
+"assert ('http://dbpedia.org/resource/Alistair_MacLeod', 'Alistair MacLeod') in solution()['tuples']"
 ]
 },
 {
@@ -440,11 +439,12 @@
 "In the previous example, we saw that we got what seemed to be duplicated answers.\n",
 "\n",
 "This happens because entities can have labels in different languages (e.g. English, Spanish).\n",
-"To restrict the search to only those results we're interested in, we can use filtering.\n",
+"We can filter results using the `FILTER` keyword.\n",
 "\n",
-"We can also decide the order in which our results are shown.\n",
+"We can also decide the order in which our results are shown using the `ORDER BY` sentence.\n",
+"We can order in ascending (`ASC`) or descending (`DESC`) order.\n",
 "\n",
-"For instance, this is how we could use filtering to get only large cities in our example, ordered by population:"
+"For instance, this is how we could use filtering to get only large areas in our example, in descending order:"
 ]
 },
 {
@@ -453,21 +453,20 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
 "PREFIX dbr: <http://dbpedia.org/resource/>\n",
 " \n",
-"SELECT ?localidad ?pop ?when\n",
+"SELECT ?localidad ?area\n",
 "\n",
 "WHERE {\n",
-" ?localidad dbo:populationTotal ?pop .\n",
-" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
-" ?localidad dbp:populationAsOf ?when .\n",
-" FILTER(?pop > 100000)\n",
+" ?localidad dbo:areaTotal ?area .\n",
+" ?localidad dbo:type dbr:Municipalities_of_Spain .\n",
+" FILTER(?area > 100000)\n",
 "}\n",
-"ORDER BY ?pop\n",
+"ORDER BY DESC(?area)\n",
 "LIMIT 100"
 ]
 },
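The duplicate-label problem discussed above is usually handled with a language filter on rdfs:label; a sketch combining it with ordering (lang() is standard SPARQL, and the rest mirrors the queries in these hunks):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre .
  FILTER(lang(?nombre) = "es")
}
ORDER BY ?nombre
LIMIT 100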
@@ -486,7 +485,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "a38cb1aea7b1f01f6b37c088384e0a3d",
+"checksum": "1e09f3c1749dd3c9256a1d0bbc14ff2d",
 "grade": true,
 "grade_id": "cell-cb7b8283568cd349",
 "locked": true,
@@ -498,10 +497,9 @@
 "outputs": [],
 "source": [
 "# We still have the biggest city\n",
-"assert ('http://dbpedia.org/resource/Madrid', '3141991', '2014') in solution()['tuples']\n",
+"assert 'http://dbpedia.org/resource/Úbeda' in solution()['columns']['localidad']\n",
 "# But the smaller ones are gone\n",
-"assert 'http://dbpedia.org/resource/Tres_Cantos' not in solution()['columns']['localidad']\n",
-"assert 'http://dbpedia.org/resource/San_Sebastián_de_los_Reyes' not in solution()['columns']['localidad']"
+"assert 'http://dbpedia.org/resource/El_Cañaveral' not in solution()['columns']['localidad']"
 ]
 },
 {
@@ -518,7 +516,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "b6aaac8ab30d52a042c1efefbbff7550",
+"checksum": "b200ff7d97fe03bab726040d16b636fe",
 "grade": false,
 "grade_id": "cell-ff3d611cb0304b01",
 "locked": false,
@@ -528,11 +526,11 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
+"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor ?nombre\n",
 "\n",
@@ -540,7 +538,7 @@
 "# YOUR ANSWER HERE\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
-"LIMIT 1000"
+"LIMIT 100"
 ]
 },
 {
@@ -551,7 +549,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "3441fbd2267002acbb0d46d9ce94ba97",
+"checksum": "637f8a2e0eb286f968f22b0e0fa2215a",
 "grade": true,
 "grade_id": "cell-d70cc6ea394741bc",
 "locked": true,
@@ -563,8 +561,8 @@
 "outputs": [],
 "source": [
 "assert len(solution()['tuples']) >= 50\n",
-"assert 'Adelaida García Morales' in solution()['columns']['nombre']\n",
-"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Adelaida_García_Morales') == 1"
+"assert 'Abraham Abulafia' in solution()['columns']['nombre']\n",
+"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Abraham_Abulafia') == 1"
 ]
 },
 {
@@ -579,8 +577,9 @@
 "metadata": {},
 "source": [
 "From now on, we will focus on our Writers example.\n",
+"More specifically, we will be interested in writers born in the XX century.\n",
 "\n",
-"First, we will search for writers born in the XX century, using the [20th-century Spanish novelists](http://dbpedia.org/page/Category:20th-century_Spanish_novelists) category."
+"To do that, we will filter our novelists to only those born (`dbo:birthDate`) in the 20th century (after 1900)."
 ]
 },
 {
@@ -611,7 +610,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "cacdd08a8a267c1173304e319ffff563",
+"checksum": "e896e64c21f317aeacf82ccd46811059",
 "grade": true,
 "grade_id": "cell-cf3821f2d33fb0f6",
 "locked": true,
@@ -622,9 +621,9 @@
 },
 "outputs": [],
 "source": [
-"assert 'Camilo José Cela' in solution()['columns']['nombre']\n",
-"assert 'Javier Marías' in solution()['columns']['nombre']\n",
-"assert all(x > '1850-12-31' and x < '2001-01-01' for x in solution()['columns']['nac'])"
+"assert 'Kiku Amino' in solution()['columns']['nombre']\n",
+"assert 'Albert Hackett' in solution()['columns']['nombre']\n",
+"assert all(x > '1900-01-01' and x < '2001-01-01' for x in solution()['columns']['nac'])"
 ]
 },
 {
@@ -647,7 +646,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "f4170cbbf042644e394d1eb9acf12ce3",
+"checksum": "df4364d90fd37ec886bec8f39f6df8ee",
 "grade": false,
 "grade_id": "cell-254a18dd973e82ed",
 "locked": false,
@@ -657,11 +656,10 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
-"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
 "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
 "\n",
 "SELECT ?escritor ?nombre ?fechaNac ?fechaDef\n",
@@ -670,7 +668,7 @@
 "# YOUR ANSWER HERE\n",
 "}\n",
 "# YOUR ANSWER HERE\n",
-"LIMIT 200"
+"LIMIT 100"
 ]
 },
 {
@@ -681,7 +679,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "29c6362adbdb5606e158f696594e1052",
+"checksum": "26d08d050ac6963b20595f52b5d14781",
 "grade": true,
 "grade_id": "cell-4d6a64dde67f0e11",
 "locked": true,
@@ -692,8 +690,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Wenceslao Fernández Flórez' in solution()['columns']['nombre']\n",
-"assert '1879-2-11' in solution()['columns']['fechaNac']\n",
+"assert 'Alister McGrath' in solution()['columns']['nombre']\n",
+"# assert '1879-2-11' in solution()['columns']['fechaNac']\n",
 "assert '' in solution()['columns']['fechaNac'] # Not all birthdates are defined\n",
 "assert '' in solution()['columns']['fechaDef'] # Some deathdates are not defined"
 ]
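For the birth-date exercises above, the pattern is a comparison on dbo:birthDate plus OPTIONAL for death dates that may be missing; a sketch under those assumptions (dbo:deathDate is the usual DBpedia property, assumed here):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?escritor ?nombre ?fechaNac ?fechaDef
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre ;
            dbo:birthDate ?fechaNac .
  OPTIONAL { ?escritor dbo:deathDate ?fechaDef . }
  FILTER(?fechaNac > "1900-01-01"^^xsd:date)
}
LIMIT 100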
@@ -722,7 +720,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Get the list of Spanish novelists that are still alive.\n",
+"Get the list of writers that are still alive.\n",
 "A person is alive if their death date is not defined and they were born less than 100 years ago"
 ]
 },
@@ -733,7 +731,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "f3c11121eb0d1328d2f5da3580f8d648",
+"checksum": "7527bd597f9550ec14d454732f6b2183",
 "grade": false,
 "grade_id": "cell-474b1a72dec6827c",
 "locked": false,
@@ -743,7 +741,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -769,7 +767,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "770bbddef5210c28486a1929e4513ada",
+"checksum": "8f8c783af97cd3024b90a8f5b7fd7027",
 "grade": true,
 "grade_id": "cell-46b62dd2856bc919",
 "locked": true,
@@ -781,7 +779,7 @@
 "outputs": [],
 "source": [
 "assert 'Fernando Arrabal' in solution()['columns']['nombre']\n",
-"assert 'Albert Espinosa' in solution()['columns']['nombre']\n",
+"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
 "for year in solution()['columns']['nac']:\n",
 " assert int(year) >= 1918"
 ]
@@ -790,7 +788,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now, get the list of Spanish novelists that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
+"Now, get the list of writers that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
 "\n",
 "Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
 "\n",
@@ -804,7 +802,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "ed34857649c9a6926eb0a3a0e1d8198d",
+"checksum": "2e608b808ceceb2c8515f892a6b98d06",
 "grade": false,
 "grade_id": "cell-ceefd3c8fbd39d79",
 "locked": false,
@@ -814,7 +812,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -838,7 +836,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "18bb2d8d586bf4a5231973e69958ab75",
+"checksum": "ec821397f67619e5bfa02a19bdd597fc",
 "grade": true,
 "grade_id": "cell-461cd6ccc6c2dc79",
 "locked": true,
@@ -849,8 +847,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
-"assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']"
+"assert 'Wang Ruowang' in solution()['columns']['nombre']\n",
+"assert 'http://dbpedia.org/resource/Manuel_de_Pedrolo' in solution()['columns']['escritor']"
 ]
 },
 {
@@ -887,7 +885,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "34163ddb0400cd8ddd2c2e2cdf29c20b",
+"checksum": "3d647ccd0f3e861b843af0ec4a33098b",
 "grade": false,
 "grade_id": "cell-2a39adc71d26ae73",
 "locked": false,
@@ -897,7 +895,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://dbpedia.org/sparql\n",
+"%%sparql https://dbpedia.org/sparql\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
 "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -921,7 +919,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "84ab7d64a45e03e6dd902216a2aad030",
+"checksum": "524d152d46d3c1166052b6d5871c6aa5",
 "grade": true,
 "grade_id": "cell-542e0e36347fd5d1",
 "locked": true,
@@ -932,8 +930,8 @@
 },
 "outputs": [],
 "source": [
-"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
-"assert 'http://dbpedia.org/resource/Albert_Espinosa' in solution()['columns']['escritor']\n",
+"assert 'Anna Langfus' in solution()['columns']['nombre']\n",
+"assert 'http://dbpedia.org/resource/Paul_Celan' in solution()['columns']['escritor']\n",
 "\n",
 "from collections import Counter\n",
 "c = Counter(solution()['columns']['nombre'])\n",
@@ -956,7 +954,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Get the list of living Spanish novelists born in Madrid.\n",
+"Get the list of living novelists born in Madrid.\n",
 "\n",
 "Hint: use `dbr:Madrid` and `dbo:birthPlace`"
 ]
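Following the hint, the birth-place restriction adds one triple pattern; a sketch (combining it with the liveness filters is left to the exercise):

%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?escritor ?nombre
WHERE {
  ?escritor a dbo:Writer ;
            rdfs:label ?nombre ;
            dbo:birthPlace dbr:Madrid .
}
LIMIT 100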
@@ -968,7 +966,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "25c8edcee216d536aac98fc9aa2b6422", "checksum": "f067a70a247b62d7eb5cc526efdc53c4",
"grade": false, "grade": false,
"grade_id": "cell-d175e41da57c889b", "grade_id": "cell-d175e41da57c889b",
"locked": false, "locked": false,
@@ -978,7 +976,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1042,7 +1040,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "c1f22b82c4d0bd4102a6c38f7f933dc6", "checksum": "64ea2ef341901ce486bb1dcbed6c3785",
"grade": false, "grade": false,
"grade_id": "cell-e4b99af9ef91ff6f", "grade_id": "cell-e4b99af9ef91ff6f",
"locked": false, "locked": false,
@@ -1052,7 +1050,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1066,7 +1064,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"LIMIT 10000" "LIMIT 1000"
] ]
}, },
{ {
@@ -1077,7 +1075,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "51acaeb26379c6bd2f8c767001ef79ec", "checksum": "fe47b48969b20b50a16a4ce4ad75e97d",
"grade": true, "grade": true,
"grade_id": "cell-68661b73c2140e4f", "grade_id": "cell-68661b73c2140e4f",
"locked": true, "locked": true,
@@ -1088,8 +1086,8 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'http://dbpedia.org/resource/A_Heart_So_White' in solution()['columns']['obra']\n", "assert 'http://dbpedia.org/resource/Cristina_Guzmán_(novel)' in solution()['columns']['obra']\n",
"assert 'http://dbpedia.org/resource/Tomorrow_in_the_Battle_Think_on_Me' in solution()['columns']['obra']\n", "assert 'http://dbpedia.org/resource/Life_Is_a_Dream' in solution()['columns']['obra']\n",
"assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia" "assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia"
] ]
}, },
@@ -1097,14 +1095,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Traversing the graph" "### Traversing the graph II"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Get a list of living Spanish novelists born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n", "Get a list of writers born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n",
"\n", "\n",
"If the query is right, you should see a list of writers after running the test code.\n", "If the query is right, you should see a list of writers after running the test code.\n",
"\n", "\n",
@@ -1118,7 +1116,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "e3f8e18a006a763f5cdbe49c97b73f5f", "checksum": "d3636d90f8d6a3c824b17ce87ba6c423",
"grade": false, "grade": false,
"grade_id": "cell-b1f71c67dd71dad4", "grade_id": "cell-b1f71c67dd71dad4",
"locked": false, "locked": false,
@@ -1128,7 +1126,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1142,7 +1140,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"ORDER BY ?nombre\n", "ORDER BY ?nombre\n",
"LIMIT 100" "LIMIT 5"
] ]
}, },
{ {
@@ -1208,7 +1206,8 @@
"source": [ "source": [
"Using UNION, get a list of distinct spanish novelists AND poets.\n", "Using UNION, get a list of distinct spanish novelists AND poets.\n",
"\n", "\n",
"Hint: Category: Spanish_poets" "In this query, instead of looking for writers, try to find the right entities by looking at the `dct:subject` property.\n",
"The entities we are looking after should be in the `Spanish_poets` and `Spanish_novelists` categories."
] ]
}, },
{ {
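A hedged sketch of the UNION shape this hint describes: each branch binds ?escritor through dct:subject, and DISTINCT collapses authors that belong to both categories. This is an illustrative fragment, not the graded answer.

query = """
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbc: <http://dbpedia.org/resource/Category:>

SELECT DISTINCT ?escritor
WHERE {
  { ?escritor dct:subject dbc:Spanish_novelists . }
  UNION
  { ?escritor dct:subject dbc:Spanish_poets . }
}
LIMIT 100
"""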
@@ -1218,7 +1217,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "9c0da379841474601397f5623abc6a9c", "checksum": "2547e55ac68b37687efddd50c768eb5b",
"grade": false, "grade": false,
"grade_id": "cell-21eb6323b6d0011d", "grade_id": "cell-21eb6323b6d0011d",
"locked": false, "locked": false,
@@ -1228,7 +1227,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1242,7 +1241,7 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"}\n", "}\n",
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"LIMIT 10000" "LIMIT 100"
] ]
}, },
{ {
@@ -1253,7 +1252,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "f22c7db423410fcf3e8fce4ec0a8e9f9", "checksum": "565dac8ae632765bc3f128f830e70993",
"grade": true, "grade": true,
"grade_id": "cell-004e021e877c6ace", "grade_id": "cell-004e021e877c6ace",
"locked": true, "locked": true,
@@ -1264,7 +1263,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Garcilaso de la Vega' in solution()['columns']['nombre']" "assert 'Antonio Gala' in solution()['columns']['nombre']"
] ]
}, },
{ {
@@ -1289,7 +1288,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "cd7ce9212f587afe311c7631b3908de2", "checksum": "f8cca6da3b6830a5474eac28c3c8ebde",
"grade": false, "grade": false,
"grade_id": "cell-e35414e191c5bf16", "grade_id": "cell-e35414e191c5bf16",
"locked": false, "locked": false,
@@ -1299,7 +1298,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql http://dbpedia.org/sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -1389,13 +1388,13 @@
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© 2018 Universidad Politécnica de Madrid." "© 2023 Universidad Politécnica de Madrid."
] ]
} }
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -1409,7 +1408,20 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.1" "version": "3.8.10"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
} }
}, },
"nbformat": 4, "nbformat": 4,


@@ -150,7 +150,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "69e23e6e3dc06ca9d2b5d878c2baba94", "checksum": "1a23c8b9a53f7ae28f28b1c23b9706b5",
"grade": false, "grade": false,
"grade_id": "cell-ab7755944d46f9ca", "grade_id": "cell-ab7755944d46f9ca",
"locked": false, "locked": false,
@@ -160,19 +160,19 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct: <http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n", "PREFIX dbc: <http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n", "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"\n", "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n",
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n", "\n",
"SELECT ?escritor ?nombre (year(?fechaNac) as ?nac)\n",
"WHERE {\n", "WHERE {\n",
" ?escritor dct:subject dbc:Spanish_novelists .\n", " ?escritor dct:subject dbc:Spanish_novelists ;\n",
" ?escritor rdfs:label ?nombre .\n", " rdfs:label ?nombre ;\n",
" ?escritor dbo:birthDate ?fechaNac .\n", " dbo:birthDate ?fechaNac .\n",
" FILTER(lang(?nombre) = \"es\") .\n", " FILTER(lang(?nombre) = \"es\") .\n",
" # YOUR ANSWER HERE\n", " # YOUR ANSWER HERE\n",
"}\n", "}\n",
@@ -188,7 +188,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "211c632634327a1fd805326fa0520cdd", "checksum": "e261d808f509c1e29227db94d1eab784",
"grade": true, "grade": true,
"grade_id": "cell-cf3821f2d33fb0f6", "grade_id": "cell-cf3821f2d33fb0f6",
"locked": true, "locked": true,
@@ -199,8 +199,8 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Camilo José Cela' in solution()['columns']['nombre']\n", "assert 'Ramiro Ledesma' in solution()['columns']['nombre']\n",
"assert 'Javier Marías' in solution()['columns']['nombre']\n", "assert 'Ray Loriga' in solution()['columns']['nombre']\n",
"assert all(int(x) > 1899 and int(x) < 2001 for x in solution()['columns']['nac'])" "assert all(int(x) > 1899 and int(x) < 2001 for x in solution()['columns']['nac'])"
] ]
}, },
@@ -304,7 +304,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "2a24f623c23116fd23877facb487dd16", "checksum": "e55173801ab36337ad356a1bc286dbd1",
"grade": false, "grade": false,
"grade_id": "cell-ceefd3c8fbd39d79", "grade_id": "cell-ceefd3c8fbd39d79",
"locked": false, "locked": false,
@@ -314,7 +314,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -341,7 +341,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "18bb2d8d586bf4a5231973e69958ab75", "checksum": "1b77cfaefb8b2ec286ce7b0c70804fe0",
"grade": true, "grade": true,
"grade_id": "cell-461cd6ccc6c2dc79", "grade_id": "cell-461cd6ccc6c2dc79",
"locked": true, "locked": true,
@@ -353,7 +353,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"assert 'Javier Sierra' in solution()['columns']['nombre']\n", "assert 'Javier Sierra' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']" "assert 'http://dbpedia.org/resource/José_Ángel_Mañas' in solution()['columns']['escritor']"
] ]
}, },
{ {
@@ -392,7 +392,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"SELECT ?localidad\n", "SELECT ?localidad\n",
"WHERE {\n", "WHERE {\n",
@@ -419,7 +419,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "6e444c20b411033a6c45fd5a566018fa", "checksum": "b70a9a4f102c253e864d2e8aec79ce81",
"grade": false, "grade": false,
"grade_id": "cell-a57d3546a812f689", "grade_id": "cell-a57d3546a812f689",
"locked": false, "locked": false,
@@ -429,7 +429,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -526,7 +526,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n", "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
@@ -535,9 +535,9 @@
"SELECT ?com, GROUP_CONCAT(?name, \",\") as ?places # notice how we rename the variable\n", "SELECT ?com, GROUP_CONCAT(?name, \",\") as ?places # notice how we rename the variable\n",
"\n", "\n",
"WHERE {\n", "WHERE {\n",
" ?localidad dbo:isPartOf ?com .\n", " ?com dct:subject dbc:Autonomous_communities_of_Spain .\n",
" ?com dbo:type dbr:Autonomous_communities_of_Spain .\n", " ?localidad dbo:subdivision ?com ;\n",
" ?localidad rdfs:label ?name .\n", " rdfs:label ?name .\n",
" FILTER (lang(?name)=\"es\")\n", " FILTER (lang(?name)=\"es\")\n",
"}\n", "}\n",
"\n", "\n",
@@ -552,7 +552,7 @@
"editable": false, "editable": false,
"nbgrader": { "nbgrader": {
"cell_type": "markdown", "cell_type": "markdown",
"checksum": "e100e2f89c832cf832add62c107e4008", "checksum": "4779fb61645634308d0ed01e0c88e8a4",
"grade": false, "grade": false,
"grade_id": "asdiopjasdoijasdoijasd", "grade_id": "asdiopjasdoijasdoijasd",
"locked": true, "locked": true,
@@ -561,7 +561,7 @@
} }
}, },
"source": [ "source": [
"Try it yourself, to get a list of works by each of these authors:" "Try it yourself, to get a list of works by each of the authors in this query:"
] ]
}, },
{ {
@@ -571,7 +571,7 @@
"deletable": false, "deletable": false,
"nbgrader": { "nbgrader": {
"cell_type": "code", "cell_type": "code",
"checksum": "9f6e26faab2be98c72fb7a917ac5a421", "checksum": "e5d87d1d8eba51c510241ba75981a597",
"grade": false, "grade": false,
"grade_id": "cell-2e3de17c75047652", "grade_id": "cell-2e3de17c75047652",
"locked": false, "locked": false,
@@ -581,7 +581,7 @@
}, },
"outputs": [], "outputs": [],
"source": [ "source": [
"%%sparql\n", "%%sparql https://dbpedia.org/sparql\n",
"\n", "\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n", "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n", "PREFIX dct:<http://purl.org/dc/terms/>\n",
@@ -592,26 +592,17 @@
"# YOUR ANSWER HERE\n", "# YOUR ANSWER HERE\n",
"\n", "\n",
"WHERE {\n", "WHERE {\n",
" ?escritor dct:subject dbc:Spanish_novelists .\n", " ?escritor a dbo:Writer .\n",
" ?escritor rdfs:label ?nombre .\n", " ?escritor rdfs:label ?nombre .\n",
" ?escritor dbo:birthDate ?fechaNac .\n", " ?escritor dbo:birthDate ?fechaNac .\n",
" ?escritor dbo:birthPlace dbr:Madrid .\n", " ?escritor dbo:birthPlace dbr:Madrid .\n",
" OPTIONAL {\n", " # YOUR ANSWER HERE\n",
" ?obra dbo:author ?escritor .\n",
" ?obra rdfs:label ?titulo .\n",
" }\n",
" OPTIONAL {\n",
" ?escritor dbo:deathDate ?fechaDef .\n",
" }\n",
" FILTER (?fechaNac <= \"2000\"^^xsd:date).\n",
" FILTER (?fechaNac >= \"1918\"^^xsd:date).\n",
" FILTER (!bound(?fechaDef) || (?fechaNac >= \"1918\"^^xsd:date)) .\n",
" FILTER(lang(?nombre) = \"es\") .\n", " FILTER(lang(?nombre) = \"es\") .\n",
" FILTER(!bound(?titulo) || lang(?titulo) = \"en\") .\n", " FILTER(!bound(?titulo) || lang(?titulo) = \"en\") .\n",
"\n", "\n",
"}\n", "}\n",
"ORDER BY ?nombre\n", "ORDER BY ?nombre\n",
"LIMIT 10000" "LIMIT 100"
] ]
}, },
{ {
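The deleted reference lines above illustrate the key trick this exercise now leaves to the student: when an OPTIONAL block fails to match, its variables stay unbound, and FILTER(!bound(?titulo) || ...) is what lets authors with no recorded works survive the filter. A fragment reconstructed from the removed lines, kept here as a hedged reminder:

fragment = """
OPTIONAL {
  ?obra dbo:author ?escritor .
  ?obra rdfs:label ?titulo .
}
FILTER(!bound(?titulo) || lang(?titulo) = "en")
"""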
@@ -639,7 +630,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -653,7 +644,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.1" "version": "3.8.10"
} }
}, },
"nbformat": 4, "nbformat": 4,

lod/BeatlesMusicians.ttl (new file)

File diff suppressed because it is too large


@@ -12,6 +12,7 @@ from urllib.request import Request, urlopen
from urllib.parse import quote_plus, urlencode from urllib.parse import quote_plus, urlencode
from urllib.error import HTTPError from urllib.error import HTTPError
import ssl
import json import json
import sys import sys
@@ -32,7 +33,11 @@ def send_query(query, endpoint):
headers={'content-type': 'application/x-www-form-urlencoded', headers={'content-type': 'application/x-www-form-urlencoded',
'accept': FORMATS}, 'accept': FORMATS},
method='POST') method='POST')
res = urlopen(r) context = ssl.create_default_context()
context.check_hostname = False
context.verify_mode = ssl.CERT_NONE
res = urlopen(r, context=context, timeout=2)
data = res.read().decode('utf-8') data = res.read().decode('utf-8')
if res.getcode() == 200: if res.getcode() == 200:
try: try:
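The new context deliberately sets check_hostname = False and verify_mode = CERT_NONE, which silences certificate errors from the endpoint at the cost of disabling TLS verification. If the endpoint's certificate chain validates on your machine, a hedged alternative keeps verification on:

import ssl
from urllib.request import urlopen

# Verifying context: hostname and certificate checks stay enabled.
context = ssl.create_default_context()
# res = urlopen(r, context=context, timeout=2)  # same call as in send_query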

lod/tests.py (new file, empty)

@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -71,8 +71,7 @@
"source": [ "source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n", "* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -80,7 +79,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -88,7 +87,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -102,7 +101,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.7" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -40,10 +40,10 @@
"\n", "\n",
"* Learn to use scikit-learn\n", "* Learn to use scikit-learn\n",
"* Learn the basic steps to apply machine learning techniques: dataset analysis, load, preprocessing, training, validation, optimization and persistence.\n", "* Learn the basic steps to apply machine learning techniques: dataset analysis, load, preprocessing, training, validation, optimization and persistence.\n",
"* Learn how to do a exploratory data analysis\n", "* Learn how to do an exploratory data analysis\n",
"* Learn how to visualise a dataset\n", "* Learn how to visualise a dataset\n",
"* Learn how to load a bundled dataset\n", "* Learn how to load a bundled dataset\n",
"* Learn how to separate the dataset into traning and testing datasets\n", "* Learn how to separate the dataset into training and testing datasets\n",
"* Learn how to train a classifier\n", "* Learn how to train a classifier\n",
"* Learn how to predict with a trained classifier\n", "* Learn how to predict with a trained classifier\n",
"* Learn how to evaluate the predictions\n", "* Learn how to evaluate the predictions\n",
@@ -63,9 +63,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n", "* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -73,7 +71,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## LIcence\n", "## LIcence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -81,7 +79,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -95,7 +93,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.6.7" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -87,10 +87,10 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"Scikit-learn provides algorithms for solving the following problems:\n", "Scikit-learn provides algorithms for solving the following problems:\n",
"* **Classification**: Identifying to which category an object belongs to. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n", "* **Classification**: Identifying to which category an object belongs. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n",
"* **Clustering**: Automatic grouping of similar objects into sets. Some of the available [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html#clustering) are k-Means, Affinity propagation, etc.\n", "* **Clustering**: Automatic grouping of similar objects into sets. Some of the available [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html#clustering) are k-Means, Affinity propagation, etc.\n",
"* **Regression**: Predicting a continuous-valued attribute associated with an object. Some of the available [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are linear regression, logistic regression, etc.\n", "* **Regression**: Predicting a continuous-valued attribute associated with an object. Some of the available [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are linear regression, logistic regression, etc.\n",
"* ** Dimensionality reduction**: Reducing the number of random variables to consider. Some of the available [dimensionality reduction algorithms](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) are SVD, PCA, etc." "* **Dimensionality reduction**: Reducing the number of random variables to consider. Some of the available [dimensionality reduction algorithms](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) are SVD, PCA, etc."
] ]
}, },
{ {
@@ -105,7 +105,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"In addition, scikit-learn helps in several tasks:\n", "In addition, scikit-learn helps in several tasks:\n",
"* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some of the [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n", "* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n",
"* **Preprocessing**: Several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Some of the available [preprocessing functions](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) are scaling and normalizing data, or imputing missing values." "* **Preprocessing**: Several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Some of the available [preprocessing functions](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) are scaling and normalizing data, or imputing missing values."
] ]
}, },
@@ -128,9 +128,9 @@
"\n", "\n",
"If it is not installed, install it with conda: `conda install scikit-learn`.\n", "If it is not installed, install it with conda: `conda install scikit-learn`.\n",
"\n", "\n",
"If you have installed scipy and numpy, you can also installed using pip: `pip install -U scikit-learn`.\n", "If you have installed scipy and numpy, you can also install using pip: `pip install -U scikit-learn`.\n",
"\n", "\n",
"It is not recommended to use pip for installing scipy and numpy. Instead, use conda or install the linux package *python-sklearn*." "It is not recommended to use pip to install scipy and numpy. Instead, use conda or install the Linux package *python-sklearn*."
] ]
}, },
{ {
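After installing by either route, a quick sanity check confirms the package imports and shows which version is active:

import sklearn
print(sklearn.__version__)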
@@ -156,7 +156,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")\n", "![](./images/EscUpmPolit_p.gif \"UPM\")\n",
"\n", "\n",
"# Course Notes for Learning Intelligent Systems\n", "# Course Notes for Learning Intelligent Systems\n",
"\n", "\n",
@@ -34,11 +34,11 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to read and load a sample dataset.\n", "This notebook aims to learn how to read and load a sample dataset.\n",
"\n", "\n",
"Scikit-learn comes with some bundled [datasets](http://scikit-learn.org/stable/datasets/): iris, digits, boston, etc.\n", "Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, boston, etc.\n",
"\n", "\n",
"In this notebook we are going to use the Iris dataset." "In this notebook, we will use the Iris dataset."
] ]
}, },
{ {
@@ -54,16 +54,25 @@
"source": [ "source": [
"The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n", "The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
"\n", "\n",
"The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features.\n", "The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
"\n", "\n",
"![Iris](files/images/iris-dataset.jpg)" "![Iris dataset](./images/iris-dataset.jpg \"Iris\")"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In ordert to read the dataset, we import the datasets bundle and then load the Iris dataset. " "Here you can see the species and the features.\n",
"![Iris features](./images/iris-features.png \"Iris features\")\n",
"![Iris classes](./images/iris-classes.png \"Iris classes\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read the dataset, we import the datasets bundle and then load the Iris dataset. "
] ]
}, },
{ {
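The loading step described here boils down to two lines; a minimal sketch, with the standard shapes of the bundled dataset noted in comments:

from sklearn import datasets

# Load the bundled Iris dataset: 150 samples, 4 numeric features, 3 classes.
iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)  # (150, 4) (150,)
print(iris.target_names)                   # ['setosa' 'versicolor' 'virginica']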
@@ -180,7 +189,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"#Using numpy, I can print the dimensions (here we are working with 2D matriz)\n", "#Using numpy, I can print the dimensions (here we are working with a 2D matrix)\n",
"print(iris.data.ndim)" "print(iris.data.ndim)"
] ]
}, },
@@ -218,7 +227,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In following sessions we will learn how to load a dataset from a file (csv, excel, ...) using the pandas library." "In the following sessions, we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
] ]
}, },
{ {
@@ -246,7 +255,7 @@
"source": [ "source": [
"## Licence\n", "## Licence\n",
"\n", "\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -49,7 +49,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions." "This notebook aims to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions."
] ]
}, },
{ {
@@ -65,13 +65,13 @@
"source": [ "source": [
"This section covers different ways to inspect the distribution of samples per feature.\n", "This section covers different ways to inspect the distribution of samples per feature.\n",
"\n", "\n",
"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n", "First of all, let's see how many samples we have in each class using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
"\n", "\n",
"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n", "A histogram is a graphical representation of the distribution of numerical data. It estimates the probability distribution of a continuous variable (quantitative variable). \n",
"\n", "\n",
"For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n", "For building a histogram, we need to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
"\n", "\n",
"In our case, since the values are not continuous and we have only three values, we do not need to bin them." "Since the values are not continuous and we have only three values, we do not need to bin them."
] ]
}, },
{ {
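A minimal matplotlib sketch of such a histogram (the notebook's own plotting cell is outside this hunk):

import matplotlib.pyplot as plt
from sklearn import datasets

# One bar per class shows the balanced 50/50/50 sample distribution.
iris = datasets.load_iris()
plt.hist(iris.target)
plt.xlabel('class')
plt.ylabel('number of samples')
plt.show()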
@@ -115,7 +115,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"As can be seen, we have the same distribution of samples for every class.\n", "As can be seen, we have the same distribution of samples for every class.\n",
"The next step is to see the distribution of the features" "The next step is to see the distribution of the features."
] ]
}, },
{ {
@@ -184,7 +184,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"As we can see, the Setosa class seems to be linearly separable with these two features.\n", "As we can see, the Setosa class seems linearly separable with these two features.\n",
"\n", "\n",
"Another nice visualisation is given below." "Another nice visualisation is given below."
] ]
@@ -228,7 +228,6 @@
"source": [ "source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n", "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n", "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n", "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n", "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n", "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
@@ -242,7 +241,7 @@
"source": [ "source": [
"## Licence\n", "## Licence\n",
"\n", "\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]

File diff suppressed because one or more lines are too long


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -76,7 +76,7 @@
"source": [ "source": [
"A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n", "A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
"\n", "\n",
"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)." "We will use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
] ]
}, },
{ {
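A sketch of that split: test_size=0.25 gives the 75/25 ratio, and the concrete random_state value here is illustrative, not the notebook's exact choice.

from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# Reproducible 75/25 partition of features and labels.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)
print(x_train.shape, x_test.shape)  # (112, 4) (38, 4)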
@@ -122,9 +122,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n", "Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might misbehave if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
"\n", "\n",
"The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set." "The preprocessing module further provides a utility class `StandardScaler` to compute a training set's mean and standard deviation. Later, the same transformation will be applied on the testing set."
] ]
}, },
{ {
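A sketch of the scaler usage, assuming x_train and x_test from the split above: the scaler is fitted on the training set only, and the learned mean and standard deviation are reused on the test set to avoid leakage.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)   # learn mean/std on training data only
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)    # same transformation, no refitting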
@@ -163,7 +163,6 @@
"source": [ "source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n", "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n", "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n", "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n", "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)" "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
@@ -174,7 +173,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Licences\n", "### Licences\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -53,9 +53,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"This is an introduction of general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n", "This is an introduction to general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n",
"\n", "\n",
"You can skip it during the lab session and read it later," "You can skip it during the lab session and read it later."
] ]
}, },
{ {
@@ -69,20 +69,20 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Machine learning algorithms are programs that learn a model from a dataset with the aim of making predictions or learning structures to organize the data.\n", "Machine learning algorithms are programs that learn a model from a dataset to make predictions or learn structures to organize the data.\n",
"\n", "\n",
"In scikit-learn, machine learning algorithms take as an input a *numpy* array (n_samples, n_features), where\n", "In scikit-learn, machine learning algorithms take as input a *numpy* array (n_samples, n_features), where\n",
"* **n_samples**: number of samples. Each sample is an item to process (i.e. classify). A sample can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n", "* **n_samples**: number of samples. Each sample is an item to process (i.e., classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
"* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner.\n", "* **n_features**: The number of features or distinct traits that can be used to describe each item quantitatively.\n",
"\n", "\n",
"The number of features should be defined in advance. There is a specific type of feature sets that are high dimensional (e.g. millions of features), but most of the values are zero for a given sample. Using (numpy) arrays, all those values that are zero would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n", "The number of features should be defined in advance. A specific type of feature set is high-dimensional (e.g., millions of features), but most values are zero for a given sample. Using (numpy) arrays, all those zero values would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
"\n", "\n",
"The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n", "The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n",
"\n", "\n",
"[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n", "[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n",
"* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n", "* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n",
"* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction and association rule learning.\n", "* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction, and association rule learning.\n",
"* **Semi-supervised learning**:i nput data is a mixture of labeled and unlabeled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression." "* **Semi-supervised learning**: input data is a mixture of labeled and unlabeled examples. There is a desired prediction problem, but the model must learn the structures to organize the data and make predictions. Example problems are classification and regression."
] ]
}, },
{ {
@@ -96,8 +96,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In *supervised machine learning models*, the machine learning algorithm takes as an input a training dataset, composed of feature vectors and labels, and produces a predictive model which is used for make prediction on new data.\n", "In *supervised machine learning models*, the machine learning algorithm takes as input a training dataset, composed of feature vectors and labels, and produces a predictive model used to predict new data.\n",
"![](files/images/plot_ML_flow_chart_1.png)" "![](./images/plot_ML_flow_chart_1.png)"
] ]
}, },
{ {
@@ -111,7 +111,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In *unsupervised machine learning models*, the machine learning model algorithm takes as an input the feature vectors and produces a predictive model that is used to fit its parameters so as to best summarize regularities found in the data.\n", "In *unsupervised machine learning models*, the machine learning model algorithm takes as input the feature vectors. It produces a predictive model that is used to fit its parameters to summarize the best regularities found in the data.\n",
"![](files/images/plot_ML_flow_chart_3.png)" "![](files/images/plot_ML_flow_chart_3.png)"
] ]
}, },
@@ -129,15 +129,15 @@
"scikit-learn has a uniform interface for all the estimators, some methods are only available if the estimator is supervised or unsupervised:\n", "scikit-learn has a uniform interface for all the estimators, some methods are only available if the estimator is supervised or unsupervised:\n",
"\n", "\n",
"* Available in *all estimators*:\n", "* Available in *all estimators*:\n",
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n", " * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g., model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
"\n", "\n",
"* Available in *supervised estimators*:\n", "* Available in *supervised estimators*:\n",
" * **model.predict()**: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.\n", " * **model.predict()**: given a trained model, predict the label of a new dataset. This method accepts one argument, the new data X_new (e.g., model.predict(X_new)), and returns the learned label for each object in the array.\n",
" * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n", " * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n",
"\n", "\n",
"* Available in *unsupervised estimators*:\n", "* Available in *unsupervised estimators*:\n",
" * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n", " * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n",
" * **model.fit_transform()**: some estimators implement this method, which performs a fit and a transform on the same input data.\n", " * **model.fit_transform()**: Some estimators implement this method, which performs a fit and a transform on the same input data.\n",
"\n", "\n",
"\n", "\n",
"![](files/images/plot_ML_flow_chart_2.png)" "![](files/images/plot_ML_flow_chart_2.png)"
@@ -154,7 +154,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [General concepts of machine learning with scikit-learn](http://www.astroml.org/sklearn_tutorial/general_concepts.html)\n", "* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)\n",
"* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)" "* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)"
] ]
}, },
@@ -169,7 +169,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -177,7 +177,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -191,7 +191,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.5.6" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -55,7 +55,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n", "The goal of this notebook is to learn how to train a model, make predictions with that model, and evaluate these predictions.\n",
"\n", "\n",
"The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)." "The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."
] ]
@@ -212,14 +212,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Precision, recall and f-score" "### Precision, recall, and f-score"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n", "For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
"\n", "\n",
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n", "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n", "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
@@ -246,7 +246,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Another useful metric is the confusion matrix" "Another useful metric is the confusion matrix."
] ]
}, },
{ {
@@ -262,7 +262,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We see we classify well all the 'setosa' and 'versicolor' samples. " "We classify all the 'setosa' and 'versicolor' samples well. "
] ]
}, },
{ {
@@ -276,7 +276,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**." "To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
] ]
}, },
{ {
@@ -298,7 +298,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"print(scores)" "print(scores)"
] ]
@@ -307,7 +307,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure" "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
] ]
}, },
{ {
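A sketch of that summary, reusing the scores array from the cross-validation snippet above:

import numpy as np
from scipy.stats import sem

# Mean accuracy across the 10 folds, with its standard error.
print(f"Mean score: {np.mean(scores):.3f} (+/- {sem(scores):.3f})")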
@@ -340,7 +340,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We are going to tune the algorithm, and calculate which is the best value for the k parameter." "We will tune the algorithm and calculate the best value for the k hyperparameter."
] ]
}, },
{ {
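A hedged sketch of the tuning loop, reusing x_iris, y_iris and the KFold iterator cv from the snippet above; the candidate range for k is illustrative.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Score each candidate k with the same cross-validation and keep the best.
k_values = range(1, 26)
mean_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               x_iris, y_iris, cv=cv).mean()
               for k in k_values]
best_k = k_values[int(np.argmax(mean_scores))]
print(best_k, max(mean_scores))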
@@ -365,7 +365,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k." "The result is very dependent on the input data. Execute the train_test_split again and test how the result changes with k."
] ]
}, },
{ {
@@ -379,8 +379,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n", "* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
] ]
}, },
{ {
@@ -388,7 +387,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -405,7 +404,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -419,7 +418,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -56,9 +56,9 @@
"source": [ "source": [
"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n", "The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
"\n", "\n",
"There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n", "There are several well-known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0, and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
"\n", "\n",
"This notebook will follow the same steps that the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n", "This notebook will follow the same steps as the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n",
"\n", "\n",
"You need to install pydotplus: `conda install pydotplus` for the visualization." "You need to install pydotplus: `conda install pydotplus` for the visualization."
] ]
@@ -69,12 +69,12 @@
"source": [ "source": [
"## Load data and preprocessing\n", "## Load data and preprocessing\n",
"\n", "\n",
"Here we repeat the same operations for loading data and preprocessing than in the previous notebooks." "Here we repeat the same operations for loading data and preprocessing as in the previous notebooks."
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 1,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
@@ -124,9 +124,20 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 2,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"data": {
"text/plain": [
"DecisionTreeClassifier(max_depth=3, random_state=1)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [ "source": [
"from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n",
"import numpy as np\n", "import numpy as np\n",
@@ -145,9 +156,24 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 3,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Prediction [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
" 0 0 0 0 2 2 0 1 1 2 1 0 0 2 1 1 0 1 1 0 2 1 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
" 0]\n",
"Expected [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
" 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 2 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
" 0]\n"
]
}
],
"source": [ "source": [
"print(\"Prediction \", model.predict(x_train))\n", "print(\"Prediction \", model.predict(x_train))\n",
"print(\"Expected \", y_train)" "print(\"Expected \", y_train)"
@@ -162,9 +188,26 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 4,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Predicted probabilities [[0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [0. 0.97368421 0.02631579]\n",
" [0. 0.97368421 0.02631579]\n",
" [0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [1. 0. 0. ]\n",
" [0. 0.97368421 0.02631579]\n",
" [1. 0. 0. ]\n",
" [0. 0. 1. ]]\n"
]
}
],
"source": [ "source": [
"# Print the \n", "# Print the \n",
"print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))" "print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))"
@@ -172,9 +215,17 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 5,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy in training 0.9821428571428571\n"
]
}
],
"source": [ "source": [
"# Evaluate Accuracy in training\n", "# Evaluate Accuracy in training\n",
"\n", "\n",
@@ -185,9 +236,17 @@
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 6,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy in testing 0.9210526315789473\n"
]
}
],
"source": [ "source": [
"# Now we evaluate error in testing\n", "# Now we evaluate error in testing\n",
"y_test_pred = model.predict(x_test)\n", "y_test_pred = model.predict(x_test)\n",
@@ -203,15 +262,30 @@
"The current version of pydot does not work well in Python 3.\n", "The current version of pydot does not work well in Python 3.\n",
"For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n", "For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n",
"\n", "\n",
"You can skip this example. Since it can require installing additional packages, we include here the result.\n", "You can skip this example. Since it can require installing additional packages, we have included the result here.\n",
"![Decision Tree](files/images/cart.png)" "![Decision Tree](./images/cart.png)"
] ]
}, },
{ {
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": 7,
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [
{
"ename": "InvocationException",
"evalue": "GraphViz's executables not found",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mInvocationException\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/tmp/ipykernel_47326/3723147494.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 12\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0mgraph\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpydot\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgraph_from_dot_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdot_data\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgetvalue\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 14\u001b[0;31m \u001b[0mgraph\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite_png\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'iris-tree.png'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 15\u001b[0m \u001b[0mImage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgraph\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate_png\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36m<lambda>\u001b[0;34m(path, f, prog)\u001b[0m\n\u001b[1;32m 1808\u001b[0m \u001b[0;32mlambda\u001b[0m \u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1809\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mfrmt\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1810\u001b[0;31m \u001b[0mprog\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m:\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mprog\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1811\u001b[0m )\n\u001b[1;32m 1812\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36mwrite\u001b[0;34m(self, path, prog, format)\u001b[0m\n\u001b[1;32m 1916\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1917\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1918\u001b[0;31m \u001b[0mfobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mwrite\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcreate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mprog\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mformat\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 1919\u001b[0m \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1920\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mclose\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/anaconda3/lib/python3.8/site-packages/pydotplus/graphviz.py\u001b[0m in \u001b[0;36mcreate\u001b[0;34m(self, prog, format)\u001b[0m\n\u001b[1;32m 1957\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprogs\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfind_graphviz\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 1958\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mprogs\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1959\u001b[0;31m raise InvocationException(\n\u001b[0m\u001b[1;32m 1960\u001b[0m 'GraphViz\\'s executables not found')\n\u001b[1;32m 1961\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mInvocationException\u001b[0m: GraphViz's executables not found"
]
}
],
"source": [ "source": [
"from IPython.display import Image \n", "from IPython.display import Image \n",
"from six import StringIO\n", "from six import StringIO\n",
@@ -256,7 +330,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Next we are going to export the pseudocode of the the learnt decision tree." "Next, we will export the pseudocode of the learnt decision tree."
] ]
}, },
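A compact alternative sketch for this step: sklearn.tree.export_text (scikit-learn 0.21+) prints the learnt tree as if/else pseudocode. The fitted model and the iris feature names are assumed from the cells above.

```python
# Sketch: textual pseudocode of the fitted decision tree.
from sklearn.tree import export_text

print(export_text(model, feature_names=list(iris.feature_names)))
```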
{ {
@@ -304,14 +378,14 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Precision, recall and f-score" "### Precision, recall, and f-score"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n", "For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
"\n", "\n",
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n", "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n", "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
@@ -338,7 +412,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Another useful metric is the confusion matrix" "Another useful metric is the confusion matrix."
] ]
}, },
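A minimal sketch tying the three metrics and the confusion matrix together, assuming the fitted model, x_test, and y_test from the previous cells:

```python
# Sketch: per-class precision/recall/F1 plus the confusion matrix.
from sklearn.metrics import classification_report, confusion_matrix

y_test_pred = model.predict(x_test)
print(classification_report(y_test, y_test_pred))
print(confusion_matrix(y_test, y_test_pred))
```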
{ {
@@ -354,7 +428,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We see we classify well all the 'setosa' and 'versicolor' samples. " "We classify all the 'setosa' and 'versicolor' samples well. "
] ]
}, },
{ {
@@ -368,7 +442,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n", "To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n",
"\n", "\n",
"Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split." "Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split."
] ]
@@ -392,7 +466,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"print(scores)" "print(scores)"
] ]
@@ -401,7 +475,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure" "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
] ]
}, },
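As a sketch of that aggregation, assuming the scores array returned by cross_val_score above:

```python
# Sketch: summarise the k fold scores as mean +/- standard error.
import numpy as np
from scipy.stats import sem

print("Mean score: {0:.3f} (+/- {1:.3f})".format(np.mean(scores), sem(scores)))
```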
{ {
@@ -434,10 +508,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n", "* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Parameter estimation using grid search with cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html)\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)" "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
] ]
}, },
@@ -446,7 +518,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -463,7 +535,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -477,7 +549,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -39,7 +39,7 @@
"* [Train classifier](#Train-classifier)\n", "* [Train classifier](#Train-classifier)\n",
"* [More about Pipelines](#More-about-Pipelines)\n", "* [More about Pipelines](#More-about-Pipelines)\n",
"* [Tuning the algorithm](#Tuning-the-algorithm)\n", "* [Tuning the algorithm](#Tuning-the-algorithm)\n",
"\t* [Grid Search for Parameter optimization](#Grid-Search-for-Parameter-optimization)\n", "\t* [Grid Search for Hyperparameter optimization](#Grid-Search-for-Hyperparameter-optimization)\n",
"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n", "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
"\t* [K-Fold validation](#K-Fold-validation)\n", "\t* [K-Fold validation](#K-Fold-validation)\n",
"* [References](#References)\n" "* [References](#References)\n"
@@ -56,9 +56,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 9.47. Could we get a better accuracy if we tune the parameters of the estimator?\n", "In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 9.47. Could we get a better accuracy if we tune the hyperparameters of the estimator?\n",
"\n", "\n",
"The goal of this notebook is to learn how to tune an algorithm by opimizing its parameters using grid search." "This notebook aims to learn how to tune an algorithm by optimizing its hyperparameters using grid search."
] ]
}, },
{ {
@@ -137,7 +137,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"\n", "\n",
"from scipy.stats import sem\n", "from scipy.stats import sem\n",
@@ -189,7 +189,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can get the list of parameters of the model. As you will observe, the parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax. We will use this for tuning the parameters." "We can get the list of model parameters. As you will observe, the parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax. We will use this for tuning the parameters."
] ]
}, },
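A hedged sketch of that syntax; the step names 'scaler' and 'tree' are assumptions of this example, not necessarily the notebook's:

```python
# Sketch: nested hyperparameters are addressed as <step>__<parameter>.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([('scaler', StandardScaler()),
                 ('tree', DecisionTreeClassifier())])
print('tree__max_depth' in pipe.get_params())  # True
pipe.set_params(tree__max_depth=3)             # tune the nested hyperparameter
```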
{ {
@@ -205,7 +205,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Let's see what happens if we change a parameter" "Let's see what happens if we change a parameter."
] ]
}, },
{ {
@@ -284,7 +284,7 @@
"\n", "\n",
"Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to understand better the algorithm, as well as which parameters can be tuned. As you see, we can change several ones, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n", "Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to understand better the algorithm, as well as which parameters can be tuned. As you see, we can change several ones, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n",
"\n", "\n",
"We can get the full list parameters of an estimator with the method *get_params()*. " "We can get an estimator's full list of parameters with the method *get_params()*. "
] ]
}, },
{ {
@@ -300,30 +300,30 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"You can try different values for these parameters and observe the results." "You can try different values for these hyperparameters and observe the results."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"### Grid Search for Parameter optimization" "### Grid Search for Hyperparameter optimization"
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Changing manually the parameters to find their optimal values is not practical. Instead, we can consider to find the optimal value of the parameters as an *optimization problem*. \n", "Changing manually the hyperparameters to find their optimal values is not practical. Instead, we can consider finding the optimal value of the hyperparameters as an *optimization problem*. \n",
"\n", "\n",
"The sklearn comes with several optimization techniques for this purpose, such as **grid search** and **randomized search**. In this notebook we are going to introduce the former one." "Sklearn has several optimization techniques, such as **grid search** and **randomized search**. In this notebook, we are going to introduce the former one."
] ]
}, },
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The sklearn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. " "Sklearn provides an object that, given data, computes the score during the fit of an estimator on a hyperparameter grid and chooses the hyperparameters to maximize the cross-validation score. "
] ]
}, },
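A minimal sketch of that object, GridSearchCV, assuming x_train and y_train from the earlier cells:

```python
# Sketch: exhaustive search over a small hyperparameter grid.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

gs = GridSearchCV(DecisionTreeClassifier(random_state=1),
                  {'max_depth': np.arange(3, 10)}, cv=10)
gs.fit(x_train, y_train)
print(gs.best_params_, gs.best_score_)
```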
{ {
@@ -351,7 +351,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Now we are going to show the results of grid search" "Now we are going to show the results of the grid search"
] ]
}, },
{ {
@@ -371,7 +371,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can now evaluate the KFold with this optimized parameter as follows." "We can now evaluate the KFold with this optimized hyperparameter as follows."
] ]
}, },
{ {
@@ -392,7 +392,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n", "def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n", " return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
@@ -405,7 +405,7 @@
"source": [ "source": [
"We have got an *improvement* from 0.947 to 0.953 with k-fold.\n", "We have got an *improvement* from 0.947 to 0.953 with k-fold.\n",
"\n", "\n",
"We are now to try to fit the best combination of the parameters of the algorithm. It can take some time to compute it." "We are now trying to fit the best combination of the hyperparameters of the algorithm. It can take some time to compute it."
] ]
}, },
{ {
@@ -414,12 +414,12 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Set the parameters by cross-validation\n", "# Set the hyperparameters by cross-validation\n",
"\n", "\n",
"from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer\n", "from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer\n",
"\n", "\n",
"# set of parameters to test\n", "# set of hyperparameters to test\n",
"tuned_parameters = [{'max_depth': np.arange(3, 10),\n", "tuned_hyperparameters = [{'max_depth': np.arange(3, 10),\n",
"# 'max_weights': [1, 10, 100, 1000]},\n", "# 'max_weights': [1, 10, 100, 1000]},\n",
" 'criterion': ['gini', 'entropy'], \n", " 'criterion': ['gini', 'entropy'], \n",
" 'splitter': ['best', 'random'],\n", " 'splitter': ['best', 'random'],\n",
@@ -431,7 +431,7 @@
"scores = ['precision', 'recall']\n", "scores = ['precision', 'recall']\n",
"\n", "\n",
"for score in scores:\n", "for score in scores:\n",
" print(\"# Tuning hyper-parameters for %s\" % score)\n", " print(\"# Tuning hyperparameters for %s\" % score)\n",
" print()\n", " print()\n",
"\n", "\n",
" if score == 'precision':\n", " if score == 'precision':\n",
@@ -440,10 +440,10 @@
" scorer = make_scorer(recall_score, average='weighted', zero_division=0)\n", " scorer = make_scorer(recall_score, average='weighted', zero_division=0)\n",
" \n", " \n",
" # cv = the fold of the cross-validation cv, defaulted to 5\n", " # cv = the fold of the cross-validation cv, defaulted to 5\n",
" gs = GridSearchCV(DecisionTreeClassifier(), tuned_parameters, cv=10, scoring=scorer)\n", " gs = GridSearchCV(DecisionTreeClassifier(), tuned_hyperparameters, cv=10, scoring=scorer)\n",
" gs.fit(x_train, y_train)\n", " gs.fit(x_train, y_train)\n",
"\n", "\n",
" print(\"Best parameters set found on development set:\")\n", " print(\"Best hyperparameters set found on development set:\")\n",
" print()\n", " print()\n",
" print(gs.best_params_)\n", " print(gs.best_params_)\n",
" print()\n", " print()\n",
@@ -492,7 +492,7 @@
"# create a k-fold cross validation iterator of k=10 folds\n", "# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n", "cv = KFold(10, shuffle=True, random_state=33)\n",
"\n", "\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n", "# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n", "def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n", " return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
@@ -517,10 +517,8 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n", "* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n", "* [Hyperparameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)" "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
] ]
}, },
@@ -535,7 +533,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -543,7 +541,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -557,7 +555,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.8.6" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -48,9 +48,9 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"The goal of this notebook is to learn how to save a model in the the scikit by using Pythons built-in persistence model, namely pickle\n", "The goal of this notebook is to learn how to save a model in the scikit by using Pythons built-in persistence model, namely pickle\n",
"\n", "\n",
"First we recap the previous tasks: load data, preprocess and train the model." "First, we recap the previous tasks: load data, preprocess, and train the model."
] ]
}, },
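As a sketch of what follows, assuming the fitted model and the x_test split from the recap cells:

```python
# Sketch: serialise the model to a bytes string with pickle and restore it.
import pickle

s = pickle.dumps(model)
model_restored = pickle.loads(s)
print(model_restored.predict(x_test[:5]))
```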
{ {
@@ -107,7 +107,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"A more efficient alternative to pickle is joblib, especially for big data problems. In this case the model can only be saved to a file and not to a string." "A more efficient alternative to pickle is joblib, especially for big data problems. In this case, the model can only be saved to a file and not to a string."
] ]
}, },
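A sketch of the joblib variant; the filename 'model.joblib' is an arbitrary choice for this example:

```python
# Sketch: joblib persists the model to a file rather than to a string.
from joblib import dump, load

dump(model, 'model.joblib')
model_restored = load('model.joblib')
```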
{ {
@@ -136,7 +136,9 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Tutorial scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n", "* [Tutorial scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)\n",
"* [Model persistence in scikit-learn](http://scikit-learn.org/stable/modules/model_persistence.html#model-persistence)" "* [Model persistence in scikit-learn](http://scikit-learn.org/stable/modules/model_persistence.html#model-persistence)\n",
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
] ]
}, },
{ {
@@ -144,7 +146,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]
@@ -161,7 +163,7 @@
"window_display": false "window_display": false
}, },
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -175,7 +177,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.9" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -4,7 +4,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")" "![](./images/EscUpmPolit_p.gif \"UPM\")"
] ]
}, },
{ {
@@ -52,7 +52,7 @@
"\n", "\n",
"Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.\n", "Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.\n",
"\n", "\n",
"The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.\n", "The plots show training points in solid colors and testing points in semi-transparent colors. The lower right shows the classification accuracy on the test set.\n",
"\n", "\n",
"The [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) is a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) classifiers. \n", "The [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) is a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) classifiers. \n",
"\n", "\n",
@@ -94,7 +94,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"## Licence\n", "## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n", "\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid." "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] ]

BIN ml1/images/iris-classes.png (new file, 1.4 MiB)

BIN (new image file, 944 KiB)


@@ -47,7 +47,7 @@ def get_code(tree, feature_names, target_names,
recurse(left, right, threshold, features, 0, 0) recurse(left, right, threshold, features, 0, 0)
# Taken from http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py # Taken from https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html
import numpy as np import numpy as np
import matplotlib.pyplot as plt import matplotlib.pyplot as plt


@@ -74,9 +74,7 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n", "* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n", "* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015."
] ]
}, },
{ {
@@ -92,7 +90,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -106,7 +104,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -50,30 +50,30 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n", "In this session, we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
"\n", "\n",
"![Titanic](images/titanic.jpg)\n", "![Titanic](images/titanic.jpg)\n",
"\n", "\n",
"\n", "\n",
"The main objective is predicting which passengers survived the sinking of the Titanic.\n", "The main objective is to predict which passengers survived the sinking of the Titanic.\n",
"\n", "\n",
"The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n", "The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n",
"\n", "\n",
"\n", "\n",
"Here follows a description of the variables.\n", "Here follows a description of the variables.\n",
"\n", "\n",
"|Variable | Description| Values|\n", "| Variable | Description | Values |\n",
"|-------------------------------|\n", "|------------|---------------------------------|-----------------|\n",
"| survival| Survival| (0 = No; 1 = Yes)|\n", "| survival | Survival |(0 = No; 1 = Yes)|\n",
"|Pclass |Name | |\n", "| Pclass | Name | |\n",
"|Sex |Sex | male, female|\n", "| Sex | Sex | male, female |\n",
"|Age |Age|\n", "| Age | Age | |\n",
"|SibSp |Number of Siblings/Spouses Aboard||\n", "| SibSp |Number of Siblings/Spouses Aboard| |\n",
"|Parch |Number of Parents/Children Aboard||\n", "| Parch |Number of Parents/Children Aboard| |\n",
"|Ticket|Ticket Number||\n", "| Ticket | Ticket Number | |\n",
"|Fare |Passenger Fare||\n", "| Fare | Passenger Fare | |\n",
"|Cabin |Cabin||\n", "| Cabin | Cabin | |\n",
"|Embarked |Port of Embarkation| (C = Cherbourg; Q = Queenstown; S = Southampton)|\n", "| Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
"\n", "\n",
"\n", "\n",
"The definitions used for SibSp and Parch are:\n", "The definitions used for SibSp and Parch are:\n",
@@ -213,8 +213,7 @@
"* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n", "* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",
"* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n", "* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
"* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n", "* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",
"* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n", "* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n"
"* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
] ]
}, },
{ {


@@ -433,10 +433,9 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Pandas](http://pandas.pydata.org/)\n", "* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" "* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)"
] ]
}, },
{ {
@@ -458,7 +457,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -472,7 +471,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -373,8 +373,8 @@
"source": [ "source": [
"#Mean age of passengers per Passenger class\n", "#Mean age of passengers per Passenger class\n",
"\n", "\n",
"#First we calculate the mean\n", "#First we calculate the mean for the numeric columns\n",
"df.groupby('Pclass').mean()" "df.select_dtypes(np.number).groupby('Pclass').mean()"
] ]
}, },
{ {
@@ -404,7 +404,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"#Mean Age and SibSp of passengers grouped by passenger class and sex\n", "#Mean Age and SibSp of passengers grouped by passenger class and sex\n",
"df.groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" "df.groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
] ]
}, },
{ {
@@ -414,7 +414,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n", "#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n",
"df[df.Age > 25].groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" "df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
] ]
}, },
{ {
@@ -424,7 +424,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n", "# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].mean()" "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
] ]
}, },
{ {
@@ -436,7 +436,7 @@
"# We can also decide which function apply in each column\n", "# We can also decide which function apply in each column\n",
"\n", "\n",
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n", "#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].agg({'Age': np.mean, \n", "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
" 'SibSp': np.mean, 'Survived': np.sum})" " 'SibSp': np.mean, 'Survived': np.sum})"
] ]
}, },
@@ -451,7 +451,10 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Pivot tables are an intuitive way to analyze data, and alternative to group columns." "Pivot tables are an intuitive way to analyze data, and an alternative to group columns.\n",
"\n",
"This command makes a table with rows Sex and columns Pclass, and\n",
"averages the result of the column Survived, thereby giving the percentage of survivors in each grouping."
] ]
}, },
{ {
@@ -460,7 +463,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index='Sex')" "pd.pivot_table(df, index='Sex', columns='Pclass', values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
] ]
}, },
{ {
@@ -469,7 +479,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'])" "pd.pivot_table(df, index=['Sex', 'Age'], columns=['Pclass'], values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nevertheless, this is not very useful since we have a row per age. Thus, we define a partition."
] ]
}, },
{ {
@@ -478,7 +495,8 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])" "# Partition each of the passengers into 3 categories based on their age\n",
"age = pd.cut(df['Age'], [0,12,18,80])"
] ]
}, },
{ {
@@ -487,7 +505,14 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)" "pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can change the function used for aggregating each group."
] ]
}, },
{ {
@@ -496,8 +521,18 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Try np.sum, np.size, len\n", "# default\n",
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])" "pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=np.mean)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Two agg functions\n",
"pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=[np.mean, np.sum])"
] ]
}, },
{ {
@@ -600,8 +635,8 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Fill missing values with the median\n", "# Fill missing values with the median, we avoid empty (None) values with numeric_only\n",
"df_filled = df.fillna(df.median())\n", "df_filled = df.fillna(df.median(numeric_only=True))\n",
"df_filled[-5:]" "df_filled[-5:]"
] ]
}, },
@@ -685,7 +720,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# But we are working on a copy \n", "# But we are working on a copy, so we get a warning\n",
"df.iloc[889]['Sex'] = np.nan" "df.iloc[889]['Sex'] = np.nan"
] ]
}, },
@@ -695,7 +730,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# If we want to change, we should not chain selections\n", "# If we want to change it, we should not chain selections\n",
"# The selection can be done with the column name\n", "# The selection can be done with the column name\n",
"df.loc[889, 'Sex']" "df.loc[889, 'Sex']"
] ]
@@ -932,11 +967,11 @@
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Pandas](http://pandas.pydata.org/)\n", "* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Learning Pandas, Michael Heydt, Packt Publishing, 2017](https://learning.oreilly.com/library/view/learning-pandas/9781787123137/)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n", "* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" "* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)"
] ]
}, },
{ {
@@ -958,7 +993,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -972,7 +1007,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.11.5"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -220,7 +220,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# Analise distributon\n", "# Analise distribution\n",
"df.hist(figsize=(10,10))\n", "df.hist(figsize=(10,10))\n",
"plt.show()" "plt.show()"
] ]
@@ -233,7 +233,7 @@
"source": [ "source": [
"# We can see the pairwise correlation between variables. A value near 0 means low correlation\n", "# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
"# while a value near -1 or 1 indicates strong correlation.\n", "# while a value near -1 or 1 indicates strong correlation.\n",
"df.corr()" "df.corr(numeric_only = True)"
] ]
}, },
{ {
@@ -249,11 +249,10 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"# General description of relationship betweek variables uwing Seaborn PairGrid\n", "# General description of relationship between variables uwing Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n", "# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n", "g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
"g.map_diag(plt.hist)\n", "g.map(sns.scatterplot)\n",
"g.map_offdiag(plt.scatter)\n",
"g.add_legend()" "g.add_legend()"
] ]
}, },
@@ -367,7 +366,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Now we visualise age and survived to see if there is some relationship\n", "# Now we visualise age and survived to see if there is some relationship\n",
"sns.FacetGrid(df, hue=\"Survived\", size=5).map(sns.kdeplot, \"Age\").add_legend()" "sns.FacetGrid(df, hue=\"Survived\", height=5).map(sns.kdeplot, \"Age\").add_legend()"
] ]
}, },
{ {
@@ -567,7 +566,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Plot with seaborn\n", "# Plot with seaborn\n",
"sns.countplot('Sex', data=df)" "sns.countplot(x='Sex', data=df)"
] ]
}, },
{ {
@@ -683,16 +682,6 @@
"df.groupby('Pclass').size()" "df.groupby('Pclass').size()"
] ]
}, },
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Distribution\n",
"sns.countplot('Pclass', data=df)"
]
},
{ {
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
@@ -725,7 +714,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"sns.factorplot('Pclass',data=df,hue='Sex',kind='count')" "sns.catplot(x='Pclass',data=df,hue='Sex',kind='count')"
] ]
}, },
{ {
@@ -906,7 +895,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('Embarked', data=df)" "sns.countplot(x='Embarked', data=df)"
] ]
}, },
{ {
@@ -997,7 +986,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('SibSp', data=df)" "sns.countplot(x='SibSp', data=df)"
] ]
}, },
{ {
@@ -1180,7 +1169,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Distribution\n", "# Distribution\n",
"sns.countplot('Parch', data=df)" "sns.countplot(x='Parch', data=df)"
] ]
}, },
{ {
@@ -1233,7 +1222,7 @@
"metadata": {}, "metadata": {},
"outputs": [], "outputs": [],
"source": [ "source": [
"df.groupby(['Pclass', 'Sex', 'Parch'])['Parch', 'SibSp', 'Survived'].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})" "df.groupby(['Pclass', 'Sex', 'Parch'])[['Parch', 'SibSp', 'Survived']].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})"
] ]
}, },
{ {
@@ -1576,7 +1565,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -1590,7 +1579,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,


@@ -72,7 +72,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n", "Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv.\n",
"\n", "\n",
"Print *df*." "Print *df*."
] ]
@@ -214,7 +214,7 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"df['FamilySize'] = df['SibSp'] + df['Parch']\n", "df['FamilySize'] = df['SibSp'] + df['Parch']\n",
"df.head()" "df"
] ]
}, },
{ {
@@ -377,8 +377,8 @@
"outputs": [], "outputs": [],
"source": [ "source": [
"# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n", "# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
"df['AgeGroup'] = 0\n", "df['AgeGroup'] = np.nan\n",
"df.loc[(.Age<6),'AgeGroup'] = 0\n", "df.loc[(df.Age<6),'AgeGroup'] = 0\n",
"df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n", "df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n",
"df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n", "df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n",
"df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n", "df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n",
@@ -404,8 +404,8 @@
" if np.isnan(big_string):\n", " if np.isnan(big_string):\n",
" return 'X'\n", " return 'X'\n",
" for substring in substrings:\n", " for substring in substrings:\n",
" if big_string.find(substring) != 1:\n", " if substring in big_string:\n",
" return substring\n", " return substring[0::]\n",
" print(big_string)\n", " print(big_string)\n",
" return 'X'\n", " return 'X'\n",
" \n", " \n",
@@ -478,8 +478,17 @@
} }
], ],
"metadata": { "metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -493,7 +502,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -78,7 +78,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015." "* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka and Vahid Mirjalili, Packt Publishing, 2019."
] ]
}, },
{ {
@@ -100,7 +100,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -114,7 +114,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -222,7 +222,7 @@
"kernel = types_of_kernels[0]\n", "kernel = types_of_kernels[0]\n",
"gamma = 3.0\n", "gamma = 3.0\n",
"\n", "\n",
"# Create kNN model\n", "# Create SVM model\n",
"model = SVC(kernel=kernel, probability=True, gamma=gamma)" "model = SVC(kernel=kernel, probability=True, gamma=gamma)"
] ]
}, },
@@ -276,7 +276,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"We can evaluate the accuracy if the model always predict the most frequent class, following this [refeference](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)." "We can evaluate the accuracy if the model always predict the most frequent class, following this [reference](https://medium.com/analytics-vidhya/model-validation-for-classification-5ff4a0373090)."
] ]
}, },
{ {
@@ -351,10 +351,10 @@
"We can obtain more information from the confussion matrix and the metric F1-score.\n", "We can obtain more information from the confussion matrix and the metric F1-score.\n",
"In a confussion matrix, we can see:\n", "In a confussion matrix, we can see:\n",
"\n", "\n",
"||**Predicted**: 0| **Predicted: 1**|\n", "| |**Predicted**: 0| **Predicted: 1**|\n",
"|---------------------------|\n", "|-------------|----------------|-----------------|\n",
"|**Actual: 0**| TN | FP |\n", "|**Actual: 0**| TN | FP |\n",
"|**Actual: 1**| FN|TP|\n", "|**Actual: 1**| FN | TP |\n",
"\n", "\n",
"* **True negatives (TN)**: actual negatives that were predicted as negatives\n", "* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
"* **False positives (FP)**: actual negatives that were predicted as positives\n", "* **False positives (FP)**: actual negatives that were predicted as positives\n",
@@ -418,7 +418,7 @@
"plt.ylim([0.0, 1.0])\n", "plt.ylim([0.0, 1.0])\n",
"plt.title('ROC curve for Titanic')\n", "plt.title('ROC curve for Titanic')\n",
"plt.xlabel('False Positive Rate (1 - Recall)')\n", "plt.xlabel('False Positive Rate (1 - Recall)')\n",
"plt.xlabel('True Positive Rate (Sensitivity)')\n", "plt.ylabel('True Positive Rate (Sensitivity)')\n",
"plt.grid(True)" "plt.grid(True)"
] ]
}, },
@@ -535,13 +535,13 @@
"source": [ "source": [
"# This step will take some time\n", "# This step will take some time\n",
"# Cross-validationt\n", "# Cross-validationt\n",
"cv = KFold(n_splits=5, shuffle=False, random_state=33)\n", "cv = KFold(n_splits=5, shuffle=True, random_state=33)\n",
"# StratifiedKFold has is a variation of k-fold which returns stratified folds:\n", "# StratifiedKFold has is a variation of k-fold which returns stratified folds:\n",
"# each set contains approximately the same percentage of samples of each target class as the complete set.\n", "# each set contains approximately the same percentage of samples of each target class as the complete set.\n",
"#cv = StratifiedKFold(y, n_folds=3, shuffle=False, random_state=33)\n", "#cv = StratifiedKFold(y, n_folds=3, shuffle=True, random_state=33)\n",
"scores = cross_val_score(model, X, y, cv=cv)\n", "scores = cross_val_score(model, X, y, cv=cv)\n",
"print(\"Scores in every iteration\", scores)\n", "print(\"Scores in every iteration\", scores)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))\n" "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
] ]
}, },
{ {
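The commented-out StratifiedKFold call above still uses the pre-0.18 scikit-learn signature; the current splitter takes no data in its constructor. A sketch of the modern stratified variant, with iris data standing in for the notebook's X, y and model:

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC(kernel='linear')

# y is passed to split()/cross_val_score, not to the constructor
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
scores = cross_val_score(model, X, y, cv=cv)
print('Accuracy: %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))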
@@ -644,7 +644,7 @@
"source": [ "source": [
"* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n", "* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n", "* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n",
"* [Better evaluation of classification models](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)" "* [How to choose the right metric for evaluating an ML model](https://www.kaggle.com/vipulgandhi/how-to-choose-right-metric-for-evaluating-ml-model)"
] ]
}, },
{ {
@@ -666,7 +666,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -680,7 +680,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

View File

@@ -39,7 +39,7 @@
"cell_type": "markdown", "cell_type": "markdown",
"metadata": {}, "metadata": {},
"source": [ "source": [
"In this exercise we are going to put in practice what we have learnt in the notebooks of the session. \n", "In this exercise, we are going to put in practice what we have learnt in the notebooks of the session. \n",
"\n", "\n",
"In the previous notebook we have been applying the SVM machine learning algorithm.\n", "In the previous notebook we have been applying the SVM machine learning algorithm.\n",
"\n", "\n",
@@ -67,7 +67,7 @@
], ],
"metadata": { "metadata": {
"kernelspec": { "kernelspec": {
"display_name": "Python 3", "display_name": "Python 3 (ipykernel)",
"language": "python", "language": "python",
"name": "python3" "name": "python3"
}, },
@@ -81,7 +81,7 @@
"name": "python", "name": "python",
"nbconvert_exporter": "python", "nbconvert_exporter": "python",
"pygments_lexer": "ipython3", "pygments_lexer": "ipython3",
"version": "3.7.1" "version": "3.8.12"
}, },
"latex_envs": { "latex_envs": {
"LaTeX_envs_menu_present": true, "LaTeX_envs_menu_present": true,

BIN
ml2/images/iris-classes.png Normal file (new binary image, 1.4 MiB)

(second new binary image, 944 KiB; content not shown)

View File

@@ -1,21 +1,21 @@
""" """
Taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
======================== ========================
Plotting Learning Curves Plotting Learning Curves
======================== ========================
In the first column, first row the learning curve of a naive Bayes classifier
is shown for the digits dataset. Note that the training score and the
cross-validation score are both not very good at the end. However, the shape
of the curve can be found in more complex datasets very often: the training
score is very high at the beginning and decreases and the cross-validation
score is very low at the beginning and increases. In the second column, first
row we see the learning curve of an SVM with RBF kernel. We can see clearly
that the training score is still around the maximum and the validation score
could be increased with more training samples. The plots in the second row
show the times required by the models to train with various sizes of training
dataset. The plots in the third row show how much time was required to train
the models for each training sizes.
On the left side the learning curve of a naive Bayes classifier is shown for
the digits dataset. Note that the training score and the cross-validation score
are both not very good at the end. However, the shape of the curve can be found
in more complex datasets very often: the training score is very high at the
beginning and decreases and the cross-validation score is very low at the
beginning and increases. On the right side we see the learning curve of an SVM
with RBF kernel. We can see clearly that the training score is still around
the maximum and the validation score could be increased with more training
samples.
""" """
#print(__doc__)
import numpy as np import numpy as np
import matplotlib.pyplot as plt import matplotlib.pyplot as plt
@@ -23,86 +23,181 @@ from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC from sklearn.svm import SVC
from sklearn.datasets import load_digits from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, def plot_learning_curve(
n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)): estimator,
title,
X,
y,
axes=None,
ylim=None,
cv=None,
n_jobs=None,
train_sizes=np.linspace(0.1, 1.0, 5),
):
""" """
Generate a simple plot of the test and traning learning curve. Generate 3 plots: the test and training learning curve, the training
samples vs fit times curve, the fit times vs score curve.
Parameters Parameters
---------- ----------
estimator : object type that implements the "fit" and "predict" methods estimator : estimator instance
An object of that type which is cloned for each validation. An estimator instance implementing `fit` and `predict` methods which
will be cloned for each validation.
title : string title : str
Title for the chart. Title for the chart.
X : array-like, shape (n_samples, n_features) X : array-like of shape (n_samples, n_features)
Training vector, where n_samples is the number of samples and Training vector, where ``n_samples`` is the number of samples and
n_features is the number of features. ``n_features`` is the number of features.
y : array-like, shape (n_samples) or (n_samples, n_features), optional y : array-like of shape (n_samples) or (n_samples, n_features)
Target relative to X for classification or regression; Target relative to ``X`` for classification or regression;
None for unsupervised learning. None for unsupervised learning.
ylim : tuple, shape (ymin, ymax), optional axes : array-like of shape (3,), default=None
Defines minimum and maximum yvalues plotted. Axes to use for plotting the curves.
cv : integer, cross-validation generator, optional ylim : tuple of shape (2,), default=None
If an integer is passed, it is the number of folds (defaults to 3). Defines minimum and maximum y-values plotted, e.g. (ymin, ymax).
Specific cross-validation objects can be passed, see
sklearn.model_selection module for the list of possible objects
n_jobs : integer, optional cv : int, cross-validation generator or an iterable, default=None
Number of jobs to run in parallel (default 1). Determines the cross-validation splitting strategy.
Possible inputs for cv are:
- None, to use the default 5-fold cross-validation,
- integer, to specify the number of folds.
- :term:`CV splitter`,
- An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if ``y`` is binary or multiclass,
:class:`StratifiedKFold` used. If the estimator is not a classifier
or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.
Refer :ref:`User Guide <cross_validation>` for the various
cross-validators that can be used here.
n_jobs : int or None, default=None
Number of jobs to run in parallel.
``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
``-1`` means using all processors. See :term:`Glossary <n_jobs>`
for more details.
train_sizes : array-like of shape (n_ticks,)
Relative or absolute numbers of training examples that will be used to
generate the learning curve. If the ``dtype`` is float, it is regarded
as a fraction of the maximum size of the training set (that is
determined by the selected validation method), i.e. it has to be within
(0, 1]. Otherwise it is interpreted as absolute sizes of the training
sets. Note that for classification the number of samples usually have
to be big enough to contain at least one sample from each class.
(default: np.linspace(0.1, 1.0, 5))
""" """
plt.figure() if axes is None:
plt.title(title) _, axes = plt.subplots(1, 3, figsize=(20, 5))
axes[0].set_title(title)
if ylim is not None: if ylim is not None:
plt.ylim(*ylim) axes[0].set_ylim(*ylim)
plt.xlabel("Training examples") axes[0].set_xlabel("Training examples")
plt.ylabel("Score") axes[0].set_ylabel("Score")
train_sizes, train_scores, test_scores = learning_curve(
estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes) train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
estimator,
X,
y,
cv=cv,
n_jobs=n_jobs,
train_sizes=train_sizes,
return_times=True,
)
train_scores_mean = np.mean(train_scores, axis=1) train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1) train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1) test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1) test_scores_std = np.std(test_scores, axis=1)
plt.grid() fit_times_mean = np.mean(fit_times, axis=1)
fit_times_std = np.std(fit_times, axis=1)
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, # Plot learning curve
train_scores_mean + train_scores_std, alpha=0.1, axes[0].grid()
color="r") axes[0].fill_between(
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, train_sizes,
test_scores_mean + test_scores_std, alpha=0.1, color="g") train_scores_mean - train_scores_std,
plt.plot(train_sizes, train_scores_mean, 'o-', color="r", train_scores_mean + train_scores_std,
label="Training score") alpha=0.1,
plt.plot(train_sizes, test_scores_mean, 'o-', color="g", color="r",
label="Cross-validation score") )
axes[0].fill_between(
train_sizes,
test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std,
alpha=0.1,
color="g",
)
axes[0].plot(
train_sizes, train_scores_mean, "o-", color="r", label="Training score"
)
axes[0].plot(
train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
)
axes[0].legend(loc="best")
# Plot n_samples vs fit_times
axes[1].grid()
axes[1].plot(train_sizes, fit_times_mean, "o-")
axes[1].fill_between(
train_sizes,
fit_times_mean - fit_times_std,
fit_times_mean + fit_times_std,
alpha=0.1,
)
axes[1].set_xlabel("Training examples")
axes[1].set_ylabel("fit_times")
axes[1].set_title("Scalability of the model")
# Plot fit_time vs score
fit_time_argsort = fit_times_mean.argsort()
fit_time_sorted = fit_times_mean[fit_time_argsort]
test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
test_scores_std_sorted = test_scores_std[fit_time_argsort]
axes[2].grid()
axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
axes[2].fill_between(
fit_time_sorted,
test_scores_mean_sorted - test_scores_std_sorted,
test_scores_mean_sorted + test_scores_std_sorted,
alpha=0.1,
)
axes[2].set_xlabel("fit_times")
axes[2].set_ylabel("Score")
axes[2].set_title("Performance of the model")
plt.legend(loc="best")
return plt return plt
#digits = load_digits() fig, axes = plt.subplots(3, 2, figsize=(10, 15))
#X, y = digits.data, digits.target
X, y = load_digits(return_X_y=True)
#title = "Learning Curves (Naive Bayes)" title = "Learning Curves (Naive Bayes)"
# Cross validation with 100 iterations to get smoother mean test and train # Cross validation with 50 iterations to get smoother mean test and train
# score curves, each time with 20% data randomly selected as a validation set. # score curves, each time with 20% data randomly selected as a validation set.
#cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=100, cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
# test_size=0.2, random_state=0)
#estimator = GaussianNB() estimator = GaussianNB()
#plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=4) plot_learning_curve(
estimator, title, X, y, axes=axes[:, 0], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)
#title = "Learning Curves (SVM, RBF kernel, $\gamma=0.001$)" title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"
# SVC is more expensive so we do a lower number of CV iterations: # SVC is more expensive so we do a lower number of CV iterations:
#cv = cross_validation.ShuffleSplit(digits.data.shape[0], n_iter=10, cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
# test_size=0.2, random_state=0) estimator = SVC(gamma=0.001)
#estimator = SVC(gamma=0.001) plot_learning_curve(
#plot_learning_curve(estimator, title, X, y, (0.7, 1.01), cv=cv, n_jobs=4) estimator, title, X, y, axes=axes[:, 1], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)
#plt.show() plt.show()

View File

@@ -3,7 +3,7 @@ import matplotlib.pyplot as plt
import numpy as np import numpy as np
from sklearn import svm from sklearn import svm
#Taken from http://nbviewer.jupyter.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb # Taken from http://nbviewer.jupyter.org/github/agconti/kaggle-titanic/blob/master/Titanic.ipynb
def plot_svm(df): def plot_svm(df):
# set plotting parameters # set plotting parameters

ml21/.gitkeep Normal file
View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1,157 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Preprocessing\n",
"In this session, we will get more insight regarding how to preprocess data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Objectives\n",
"The main objectives of this session are:\n",
"* Understanding the need for preprocessing\n",
"* Understanding different preprocessing techniques\n",
"* Experimenting with several environments for preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"1. [Home](00_Intro_Preprocessing.ipynb)\n",
"3. [Initial Check](02_Initial_Check.ipynb)\n",
"4. [Filter Data](03_Filter_Data.ipynb)\n",
"5. [Unknown values](04_Unknown_Values.ipynb)\n",
"6. [Duplicated values](05_Duplicated_Values.ipynb)\n",
"7. [Rescaling Data](06_Rescaling_Data.ipynb)\n",
"8. [Binarize Data](07_Binarize_Data.ipynb)\n",
"9. [Categorial features](08_Categorical.ipynb)\n",
"10. [String Data](09_String_Data.ipynb)\n",
"12. [Handy libraries for preprocessing](11_0_Handy.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,714 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Initial Check with Pandas\n",
"\n",
"We can start with a quick quality check."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Load and check data\n",
"Check which data you are loading."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Check number of columns and rows"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(891, 12)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check names and types of columns\n",
"Check the data and type, for example if dates are of strings or what."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
" dtype='object')\n"
]
},
{
"data": {
"text/plain": [
"PassengerId int64\n",
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"SibSp int64\n",
"Parch int64\n",
"Ticket object\n",
"Fare float64\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get column names\n",
"print(df.columns)\n",
"# Get column data types\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check if the column is unique"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PassengerId is unique: True\n",
"Survived is unique: False\n",
"Pclass is unique: False\n",
"Name is unique: True\n",
"Sex is unique: False\n",
"Age is unique: False\n",
"SibSp is unique: False\n",
"Parch is unique: False\n",
"Ticket is unique: False\n",
"Fare is unique: False\n",
"Cabin is unique: False\n",
"Embarked is unique: False\n"
]
}
],
"source": [
"for i in column_names:\n",
" print('{} is unique: {}'.format(i, df[i].is_unique))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Check if the dataframe has an index\n",
"We will need it to do joins or merges."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=891, step=1)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check if there is an index. If not, you will get 'AtributeError: function object has no atribute index'\n",
"df.index"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,\n",
" 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,\n",
" 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,\n",
" 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,\n",
" 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,\n",
" 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,\n",
" 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,\n",
" 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,\n",
" 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,\n",
" 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,\n",
" 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,\n",
" 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,\n",
" 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,\n",
" 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,\n",
" 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,\n",
" 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,\n",
" 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,\n",
" 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,\n",
" 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,\n",
" 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,\n",
" 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,\n",
" 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,\n",
" 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,\n",
" 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,\n",
" 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,\n",
" 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,\n",
" 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,\n",
" 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,\n",
" 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,\n",
" 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,\n",
" 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,\n",
" 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,\n",
" 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,\n",
" 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,\n",
" 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,\n",
" 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,\n",
" 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,\n",
" 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,\n",
" 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,\n",
" 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,\n",
" 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,\n",
" 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,\n",
" 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,\n",
" 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,\n",
" 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,\n",
" 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,\n",
" 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,\n",
" 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,\n",
" 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,\n",
" 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,\n",
" 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,\n",
" 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,\n",
" 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,\n",
" 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,\n",
" 884, 885, 886, 887, 888, 889, 890])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# # Check the index values\n",
"df.index.values"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# If index does not exist\n",
"df.set_index('column_name_to_use', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count missing vales per column\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Filter Data\n",
"\n",
"Select the columns you want and delete the others."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Create list comprehension of the columns you want to lose\n",
"columns_to_drop = [column_names[i] for i in [1, 3, 5]]\n",
"# Drop unwanted columns \n",
"df.drop(columns_to_drop, inplace=True, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,591 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Unknown values\n",
"\n",
"Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backward\n",
"You can also **propagate** non-null values with these methods:\n",
"\n",
"* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
"* **bfill**: Fill values using the following valid observation to fill the gap.\n",
"* **interpolate**: Fill NaN values using interpolation.\n",
"\n",
"It will fill the next value in the dataframe with the previous non-NaN value. \n",
"\n",
"You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fill forward the value 4.0 and fill the next one (limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df.ffill(limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.ffill()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.bfill()"
]
},
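The third method listed above, interpolate, is not demonstrated in the notebook; a minimal sketch on the same toy column. With the default linear method, leading NaNs stay NaN and trailing NaNs are padded with the last valid value:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'col1': [np.nan, np.nan, 2, 3, 4, np.nan, np.nan]})
print(df.interpolate())  # col1 -> NaN, NaN, 2.0, 3.0, 4.0, 4.0, 4.0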
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop any rows which have any nans\n",
"df1 = df.dropna()\n",
"# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
"df2 = df.dropna(axis=1)\n",
"# Only drop columns which have at least 90% non-NaNs \n",
"df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,198 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Binarize Data\n",
"* We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below are marked 0.\n",
"* This is called binarizing your data or thresholding your data. \n",
"\n",
"* It can be helpful when you have probabilities that you want to make crisp values."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Binarize Data with Scikit-Learn\n",
"We can create new binary attributes in Python using Scikit-learn with the Binarizer class.\n",
"I"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import Binarizer\n",
"\n",
"X = [[ 1., -1., 2.],\n",
" [ 2., 0., 0.],\n",
" [ 0., 1.1, -1.]]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"transformer = Binarizer(threshold=1.0).fit(X) # threshold 1.0"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 1.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformer.transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,812 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Categorical Data\n",
"\n",
"For many ML algorithms, we need to transform categorical data into numbers.\n",
"\n",
"For example:\n",
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*. \n",
"* **'Position'** with values 'phD', *'Professor'*, *'TA'*, *'graduate'*.\n",
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
"\n",
"There are two main approaches:\n",
"* Integer encoding\n",
"* One hot encoding"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Integer Encoding\n",
"We assign a number to every value:\n",
"\n",
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
"\n",
"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
"\n",
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
"\n",
"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",
"\n",
"In our examples, this representation can be suitable for **temperature**, but not for the other two."
]
},
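For reference, a sketch of integer encoding with scikit-learn's LabelEncoder, one way to obtain columns like the sex_encoded and position_encoded that appear in the output below (the toy frame mirrors the notebook's example; remember the ordering caveat above):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Sex': ['Male', 'Female', 'Male', 'Female'],
                   'Position': ['graduate', 'professor', 'TA', 'phD']})
# Each encoder maps the sorted unique values of a column to 0..n-1
df['sex_encoded'] = LabelEncoder().fit_transform(df['Sex'])
df['position_encoded'] = LabelEncoder().fit_transform(df['Position'])
print(df)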
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## One Hot Encoding\n",
"A binary column is created for each value of the categorical variable."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Sex M F U\n",
"----- ---------\n",
"M 1 0 0\n",
"F is transformed into 0 1 0\n",
"Unknown 0 0 1\n",
"M 1 0 0 "
]
},
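{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The same table can be reproduced with *get_dummies* on a single Series (a minimal sketch; the full DataFrame example follows below)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# Minimal sketch: one hot encoding of a single column\n",
"import pandas as pd\n",
"pd.get_dummies(pd.Series(['M', 'F', 'Unknown', 'M'], name='Sex'))"
]
},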
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Transforming categorical data with Scikit-Learn\n",
"\n",
"We can use:\n",
"* **get_dummies()** (one hot encoding)\n",
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding). \n",
"\n",
"We are going to learn the first approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One Hot Encoding\n",
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
" \"Age\": [18, 19, 20, 30],\n",
"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
" }\n",
"df = pd.DataFrame(data)\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
"0 Marius 18 1 1 False True \n",
"1 Maria 19 0 3 True False \n",
"2 John 20 1 0 False True \n",
"3 Carla 30 0 2 True False \n",
"\n",
" Position_TA Position_graduate Position_phD Position_professor \n",
"0 False True False False \n",
"1 False False False True \n",
"2 True False False False \n",
"3 False False True False "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
"df_onehot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use *OneHotEncoder* from Scikit."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
"0 0.0 1.0 0.0 1.0 0.0 \n",
"1 1.0 0.0 0.0 0.0 0.0 \n",
"2 0.0 1.0 1.0 0.0 0.0 \n",
"3 1.0 0.0 0.0 0.0 1.0 \n",
"\n",
" Position_professor Name Age sex_encoded position_encoded \n",
"0 0.0 Marius 18 1 1 \n",
"1 1.0 Maria 19 0 3 \n",
"2 0.0 John 20 1 0 \n",
"3 0.0 Carla 30 0 2 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.compose import make_column_transformer\n",
"\n",
"df_onehotencoder = df\n",
"# create OneHotEncoder object\n",
"encoder = OneHotEncoder()\n",
"\n",
"# Transformer for several columns\n",
"transformer = make_column_transformer(\n",
" (OneHotEncoder(), ['Sex', 'Position']),\n",
" remainder='passthrough',\n",
" verbose_feature_names_out=False)\n",
"\n",
"# transform\n",
"transformed = transformer.fit_transform(df_onehotencoder)\n",
"\n",
"df_onehotencoder = pd.DataFrame(\n",
" transformed,\n",
" columns=transformer.get_feature_names_out())\n",
"df_onehotencoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas' get_dummy is easier for transforming DataFrames. OneHotEncoder is more efficient and can be good for integrating the step in a machine learning pipeline."
]
},
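{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of that integration, here is a hedged sketch of a pipeline that one-hot encodes *Sex* and *Position* before a classifier. The target *y* is hypothetical: this toy dataset has no label column."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: OneHotEncoder as a pipeline step (the target y is hypothetical)\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"preprocess = make_column_transformer(\n",
"    (OneHotEncoder(handle_unknown='ignore'), ['Sex', 'Position']),\n",
"    remainder='drop')\n",
"pipe = Pipeline([('encode', preprocess), ('clf', LogisticRegression())])\n",
"# pipe.fit(df[['Sex', 'Position']], y)  # encoder and model are fitted together"
]
},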
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integer encoding\n",
"We will use **LabelEncoder**. It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"# creating instance of labelencoder\n",
"labelencoder = LabelEncoder()\n",
"df_encoded = df\n",
"# Assigning numerical values and storing in another column\n",
"sex_values = ('Male', 'Female')\n",
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded\n",
"0 Marius 18 Male graduate 1\n",
"1 Maria 19 Female professor 0\n",
"2 John 20 Male TA 1\n",
"3 Carla 30 Female phD 0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded position_encoded\n",
"0 Marius 18 Male graduate 1 1\n",
"1 Maria 19 Female professor 0 3\n",
"2 John 20 Male TA 1 0\n",
"3 Carla 30 Female phD 0 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
"df_encoded"
]
},
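{
"cell_type": "markdown",
"metadata": {},
"source": [
"As mentioned above, *inverse_transform* recovers the original labels. Since *labelencoder* was last fitted on *Position*, it can invert *position_encoded*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Recover the original labels from the integer codes\n",
"labelencoder.inverse_transform(df_encoded['position_encoded'])"
]
},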
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,652 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is widespread to clean string columns to follow a predefined format (e.g., emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"A simple [library](https://github.com/labtocat/beautifier) to cleanup and prettify URL patterns, domains, and so on. The library helps to clean Unicode, special characters, and unnecessary redirection patterns from the URLs and gives you a clean date.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: Some unicode code has been broken. We see the character in a different character dataset.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character encoding. Example: \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixed text for you) to fix it.\n",
"\n",
"First, you should install the library: **conda install ftfy** (or **pip install ftfy**)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n",
"U+005C \\ [Po] REVERSE SOLIDUS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+0028 ( [Ps] LEFT PARENTHESIS\n",
"U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE\n",
"U+0083 \\x83 [Cc] <unknown>\n",
"U+0084 \\x84 [Cc] <unknown>\n",
"U+0029 ) [Pe] RIGHT PARENTHESIS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+002F / [Po] SOLIDUS\n",
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n"
]
}
],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract date from text. We can use regular expressions or handy packages, such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-08 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), , A. Sharma, 2018.\n",
"* [Beautifier](https://github.com/labtocat/beautifier) package\n",
"* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package\n",
"* [python-dateutil](https://dateutil.readthedocs.io/en/stable/)package"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,139 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Handy libraries\n",
"Libraries that help in several preprocessing tasks.\n",
"\n",
"* [datacleaner](11_1_datacleaner.ipynb)\n",
"* [autoclean](11_3_autoclean.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,673 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Datacleaner\n",
"[Datacleaner](https://github.com/rhiever/datacleaner) supports:\n",
"\n",
"* drop rows with missing values\n",
"* replace missing values with the mode or median on a column-by-column basis\n",
"* encode non-numeric variables with numerical equivalents\n",
"\n",
"\n",
"Install with\n",
"\n",
"**pip install datacleaner**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from datacleaner import autoclean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>108</td>\n",
" <td>1</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>523</td>\n",
" <td>7.2500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>190</td>\n",
" <td>0</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>596</td>\n",
" <td>71.2833</td>\n",
" <td>81</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>353</td>\n",
" <td>0</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>669</td>\n",
" <td>7.9250</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>272</td>\n",
" <td>0</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>49</td>\n",
" <td>53.1000</td>\n",
" <td>55</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>15</td>\n",
" <td>1</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>472</td>\n",
" <td>8.0500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>548</td>\n",
" <td>1</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>101</td>\n",
" <td>13.0000</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>303</td>\n",
" <td>0</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>14</td>\n",
" <td>30.0000</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>413</td>\n",
" <td>0</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>675</td>\n",
" <td>23.4500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" <td>1</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>30.0000</td>\n",
" <td>60</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>220</td>\n",
" <td>1</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>466</td>\n",
" <td>7.7500</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \\\n",
"0 1 0 3 108 1 22.0 1 0 523 \n",
"1 2 1 1 190 0 38.0 1 0 596 \n",
"2 3 1 3 353 0 26.0 0 0 669 \n",
"3 4 1 1 272 0 35.0 1 0 49 \n",
"4 5 0 3 15 1 35.0 0 0 472 \n",
".. ... ... ... ... ... ... ... ... ... \n",
"886 887 0 2 548 1 27.0 0 0 101 \n",
"887 888 1 1 303 0 19.0 0 0 14 \n",
"888 889 0 3 413 0 28.0 1 2 675 \n",
"889 890 1 1 81 1 26.0 0 0 8 \n",
"890 891 0 3 220 1 32.0 0 0 466 \n",
"\n",
" Fare Cabin Embarked \n",
"0 7.2500 47 2 \n",
"1 71.2833 81 0 \n",
"2 7.9250 47 2 \n",
"3 53.1000 55 2 \n",
"4 8.0500 47 2 \n",
".. ... ... ... \n",
"886 13.0000 47 2 \n",
"887 30.0000 30 2 \n",
"888 23.4500 47 2 \n",
"889 30.0000 60 0 \n",
"890 7.7500 47 1 \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean(df, copy=True)\n",
"df_clean"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}


@@ -0,0 +1,578 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "849ad57e-6adb-4c2e-afd6-73db37eef572",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"id": "179cc802-9f1d-40b0-bf0c-9d4fb7ea1262",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"id": "9858d815-0390-4e77-a5ff-a8d2a1960981",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"id": "238bab60-75f0-4d29-ab05-66afc463b506",
"metadata": {},
"source": [
"# Autoclean\n",
"A simple library to clean data. [Autoclean](https://github.com/elisemercury/AutoClean) supports:\n",
"AutoClean supports:\n",
"\n",
"* Handling of duplicates\n",
"* Various imputation methods for missing values\n",
"* Handling of outliers\n",
"* Encoding of categorical data (OneHot, Label)\n",
"* Extraction of data time values\n",
"\n",
"Install the package: **pip install py-AutoClean**.\n",
"\n",
"Parameters:\n",
"\n",
"* **duplicates**\n",
" * default: False,\n",
" * other values: 'auto', True\n",
"* **missing_num**\n",
" * default:False,\n",
" * other values:\t'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False\n",
"* **missing_categ**\n",
" * default: False,\n",
" * other values:\t'auto', 'logreg', 'knn', 'most_frequent', 'delete', False\n",
"* **encode_categ**\n",
" * default: False,\n",
" * other values:\t'auto', ['onehot'], ['label'], False ; to encode only specific columns add a list of column names or indexes: ['auto', ['col1', 2]]\n",
"* **extract_datetime**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'D', 'M', 'Y', 'h', 'm', 's'\n",
"* **outliers**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'winz', 'delete'\n",
"* **outlier_param**\tdefault:\t1.5, other values:\tany int or float, False\n",
"* **logfile**\n",
" * default: True,\n",
" * other values:\tFalse\n",
"* **verbose**\n",
" * default: False,\n",
" * other values:\tTrue"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "491b034b-994e-4f06-b4bc-df0590a62aab",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from AutoClean import AutoClean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "d842eedf-3971-4966-a8b4-543bb56dd60d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AutoClean process completed in 0.289385 seconds\n",
"Logfile saved to: /home/cif/GoogleDrive/cursos/summer-school-romania/2019/notebooks/preprocessing/autoclean.log\n"
]
}
],
"source": [
"autoclean = AutoClean(df, mode='auto')\n",
"\n",
"# We can control the preprocessing\n",
"#autoclean = AutoClean(df, mode='auto', duplicates=False, missing_num=False, missing_categ=False, encode_categ=False, extract_datetime=False, outliers=False, outlier_param=1.5, logfile=True, verbose=False)\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "4ede7c55-475a-4748-8cc4-788f46c88b26",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_female</th>\n",
" <th>Sex_male</th>\n",
" <th>Embarked_C</th>\n",
" <th>Embarked_Q</th>\n",
" <th>Embarked_S</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>65.6344</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_female Sex_male \\\n",
"0 0 A/5 21171 7.2500 C128 S False True \n",
"1 0 PC 17599 65.6344 C85 C True False \n",
"2 0 STON/O2. 3101282 7.9250 C128 S True False \n",
"3 0 113803 53.1000 C123 S True False \n",
"4 0 373450 8.0500 C128 S False True \n",
"\n",
" Embarked_C Embarked_Q Embarked_S \n",
"0 False False True \n",
"1 True False False \n",
"2 False False True \n",
"3 False False True \n",
"4 False False True "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean.output\n",
"df_clean[0:5]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}


@@ -0,0 +1,502 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Duplicated values\n",
"\n",
"There are two possible approaches: **remove** these rows or **filling** them. It depends on every case.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
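{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of the **remove** approach on an illustrative frame (the frame itself is an assumption, not part of the original notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"# drop_duplicates() removes duplicated rows, keeping the first occurrence\n",
"dup = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['a', 'a', 'b']})\n",
"dup.drop_duplicates()"
]
},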
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
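{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A runnable version of those patterns on a small illustrative frame (the column names are assumptions made for the example):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"demo = pd.DataFrame({'name': ['Ana', np.nan, 'Luis'],\n",
"                     'score': [7.0, np.nan, 9.0]})\n",
"demo['name'] = demo['name'].fillna(' ')                     # string column\n",
"demo['score'] = demo['score'].fillna(demo['score'].mean())  # numeric column\n",
"demo"
]
},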
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backwards\n",
"You can also propagate non-null values forward or backwards by putting\n",
"method=pad as the method argument. It will fill the next value in the\n",
"dataframe with the previous non-NaN value. Maybe you just want to fill one\n",
"value ( limit=1 )or you want to fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Propagate the last non-NaN value (4.0) forward, filling at most one NaN (limit=1)\n",
"df.fillna(method='pad', limit=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfill with **bfill**."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill the leading NaN values backwards with the next valid value\n",
"df.fillna(method='bfill')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Drop any rows which have any NaNs\n",
"df.dropna()\n",
"# Drop columns that have any NaNs\n",
"df.dropna(axis=1)\n",
"# Drop columns that have less than 90% non-NaN values\n",
"df.dropna(thresh=int(df.shape[0] * .9), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

View File

@@ -0,0 +1,619 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
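{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"For instance, a minimal regular-expression sketch (assuming a hypothetical string column **df['email']**) that normalizes the values before validating them:"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Normalize: strip surrounding whitespace and lowercase\n",
"df['email'] = df['email'].str.strip().str.lower()\n",
"# Keep only values that look like an email address\n",
"df = df[df['email'].str.match(r'^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$', na=False)]"
]
},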
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"Simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains, and so on. The library helps clean Unicode characters, special characters, and unnecessary redirection patterns from URLs, and gives you clean data.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: some Unicode text has been corrupted, so its characters are displayed in an unintended character encoding.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character encoding (e.g., \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixes text for you) to fix it.\n",
"\n",
"First, you should install the library: **conda install ftfy**. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract dates from text. We can use regular expressions or handy packages, such as **python-dateutil**.\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
"* Beautifier https://github.com/labtocat/beautifier\n",
"* Ftfy https://ftfy.readthedocs.io/en/latest/\n",
"* python-dateutil https://dateutil.readthedocs.io/en/stable/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.1 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 152 KiB

View File

@@ -0,0 +1 @@

View File

@@ -0,0 +1,185 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Visualization\n",
" \n",
"In this session, we will gain more insight into how to visualize data.\n",
"\n",
"# Objectives\n",
"\n",
"The main objectives of this session are:\n",
"* Understanding how to visualize data\n",
"* Understanding the purpose of different charts \n",
"* Experimenting with several environments for visualizing data\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Seaborn\n",
"\n",
"Seaborn is a Python data visualization library. Its main characteristics are:\n",
"\n",
"* A dataset-oriented API for examining relationships between multiple variables\n",
"* Specialized support for using categorical variables to show observations or aggregate statistics\n",
"* Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n",
"* Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n",
"* Convenient views of the overall structure of complex datasets\n",
"* High-level abstractions for structuring multi-plot grids that let you quickly build complex visualizations\n",
"* Concise control over matplotlib figure styling with several built-in themes\n",
"* Tools for choosing color palettes that faithfully reveal patterns in your data\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Install\n",
"Use:\n",
"\n",
"**conda install seaborn**\n",
"\n",
"or \n",
"\n",
"**pip install seaborn**"
]
},
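{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A minimal sketch of the dataset-oriented API (it uses the bundled *tips* dataset, introduced in the next notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import seaborn as sns\n",
"\n",
"# Load a bundled example dataset and draw a scatter plot with a categorical hue\n",
"tips = sns.load_dataset('tips')\n",
"sns.relplot(data=tips, x='total_bill', y='tip', hue='smoker')"
]
},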
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"1. [Home](00_Intro_Visualization.ipynb)\n",
"2. [Dataset](01_Dataset.ipynb)\n",
"3. [Comparison Charts](02_Comparison_Charts.ipynb)\n",
" 1. [More Comparison Charts](02_01_More_Comparison_Charts.ipynb)\n",
"4. [Distribution Charts](03_Distribution_Charts.ipynb)\n",
"5. [Hierarchical charts](04_Hierarchical_Charts.ipynb)\n",
"6. [Relational charts](05_Relational_Charts.ipynb)\n",
"7. [Spatial charts](06_Spatial_Charts.ipynb)\n",
"8. [Temporal charts](07_Temporal_Charts.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@@ -0,0 +1,363 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Visualization](00_Intro_Visualization.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Dataset\n",
"Seaborn includes several datasets. We can consult the available datasets and load them. \n",
"\n",
"The datasets are also available at https://github.com/mwaskom/seaborn-data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['anagrams',\n",
" 'anscombe',\n",
" 'attention',\n",
" 'brain_networks',\n",
" 'car_crashes',\n",
" 'diamonds',\n",
" 'dots',\n",
" 'dowjones',\n",
" 'exercise',\n",
" 'flights',\n",
" 'fmri',\n",
" 'geyser',\n",
" 'glue',\n",
" 'healthexp',\n",
" 'iris',\n",
" 'mpg',\n",
" 'penguins',\n",
" 'planets',\n",
" 'seaice',\n",
" 'taxis',\n",
" 'tips',\n",
" 'titanic']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.get_dataset_names()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>total_bill</th>\n",
" <th>tip</th>\n",
" <th>sex</th>\n",
" <th>smoker</th>\n",
" <th>day</th>\n",
" <th>time</th>\n",
" <th>size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16.99</td>\n",
" <td>1.01</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10.34</td>\n",
" <td>1.66</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>21.01</td>\n",
" <td>3.50</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>23.68</td>\n",
" <td>3.31</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>24.59</td>\n",
" <td>3.61</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25.29</td>\n",
" <td>4.71</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>8.77</td>\n",
" <td>2.00</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.88</td>\n",
" <td>3.12</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>15.04</td>\n",
" <td>1.96</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>14.78</td>\n",
" <td>3.23</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"0 16.99 1.01 Female No Sun Dinner 2\n",
"1 10.34 1.66 Male No Sun Dinner 3\n",
"2 21.01 3.50 Male No Sun Dinner 3\n",
"3 23.68 3.31 Male No Sun Dinner 2\n",
"4 24.59 3.61 Female No Sun Dinner 4\n",
"5 25.29 4.71 Male No Sun Dinner 4\n",
"6 8.77 2.00 Male No Sun Dinner 2\n",
"7 26.88 3.12 Male No Sun Dinner 4\n",
"8 15.04 1.96 Male No Sun Dinner 2\n",
"9 14.78 3.23 Male No Sun Dinner 2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = sns.load_dataset('tips')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Seaborn](http://seaborn.pydata.org/index.html) documentation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown.

After

Width:  |  Height:  |  Size: 3.1 KiB

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -27,14 +27,14 @@
 "source": [
 "# Introduction to Neural Networks\n",
 " \n",
-"In this lab session, we are going to learn how to train a neural network.\n",
+"In this lab session, we will learn how to train a neural network.\n",
 "\n",
 "# Objectives\n",
 "\n",
 "The main objectives of this session are:\n",
 "* Put in practice the notions learn in class about neural computing\n",
 "* Understand what an MLP is\n",
-"* Learn to use some libraries, such as scikit-learn "
+"* Learn to use some libraries, such as Scikit-learn."
 ]
 },
 {
@@ -58,7 +58,7 @@
 "metadata": {},
 "source": [
 "## Licence\n",
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -39,7 +39,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Multilayer perceptrons, also called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
+"Multilayer perceptrons, called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
 ]
 },
 {
@@ -58,7 +58,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this notebook we are going to try the spiral dataset with different algorthms. In particular, we are going to focus our attention on the MLP classifier.\n",
+"In this notebook, we will try the spiral dataset with different algorithms. In particular, we are going to focus our attention on the MLP classifier.\n",
 "\n",
 "\n",
 "Answer directly in your copy of the exercise and submit it as a moodle task."

View File

@@ -4,7 +4,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"![](images/EscUpmPolit_p.gif \"UPM\")"
+"![](./images/EscUpmPolit_p.gif \"UPM\")"
 ]
 },
 {
@@ -39,10 +39,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this notebook we are going to apply a MLP to a simple regression task: learning the Fresnel functions.\n",
+"In this notebook, we are going to apply an MLP to a simple regression task: learning the Fresnel functions.\n",
 "\n",
 "\n",
-"Answer directly in your copy of the exercise and submit it as a moodle task."
+"Answer directly in your copy of the exercise and submit it as a Moodle task."
 ]
 },
 {
@@ -92,7 +92,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Change this variables to change the train and test dataset."
+"Change these variables to change the train and test dataset."
 ]
 },
 {

View File

@@ -15,7 +15,7 @@ def gen_spiral_dataset(n_examples=500, n_classes=2, a=None, b=None, pi_space=3):
 theta = np.linspace(0,pi_space*pi, num=n_examples)
 xy = np.zeros((n_examples,2))
-# logaritmic spirals
+# logarithmic spirals
 x_golden_parametric = lambda a, b, theta: a**(theta*b) * cos(theta)
 y_golden_parametric = lambda a, b, theta: a**(theta*b) * sin(theta)
 x_golden_parametric = np.vectorize(x_golden_parametric)

View File

@@ -48,7 +48,7 @@
 "# Introduction\n",
 "The purpose of this practice is to understand better how GAs work. \n",
 "\n",
-"There are many libraries that implement GAs, you can find some of then in the [References](#References) section."
+"There are many libraries that implement GAs; you can find some of them in the [References](#References) section."
 ]
 },
 {
@@ -56,7 +56,7 @@
 "metadata": {},
 "source": [
 "# Genetic Algorithms\n",
-"In this section we are going to use the library DEAP [References](#References) for implementing a genetic algorithms.\n",
+"In this section, we are going to use the library [DEAP](https://github.com/DEAP/deap/tree/master) for implementing a genetic algorithms.\n",
 "\n",
 "We are going to implement the OneMax problem as seen in class.\n",
 "\n",
@@ -187,9 +187,9 @@
 "metadata": {},
 "source": [
 "## Comparing\n",
-"Your task is modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
+"Your task is to modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
 "\n",
-"Submit your notebook and include a the modified code, and a comparison of the effects of these changes. \n",
+"Submit your notebook and include a modified code and a comparison of the effects of these changes. \n",
 "\n",
 "Discuss your findings."
 ]
@@ -198,31 +198,24 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"## Optimizing ML hyperparameters\n",
+"## Optional. Optimizing ML hyperparameters\n",
 "\n",
-"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously we have used GridSearch from Scikit. Using (sklearn-deap)[#References], optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
+"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously, we have used GridSearch from Scikit. Using [sklearn-deap](https://github.com/rsteca/sklearn-deap), optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
 "\n",
 "The same exercise (using the digits dataset) can be found in this [notebook](https://github.com/rsteca/sklearn-deap/blob/master/test.ipynb).\n",
 "\n",
-"Submit a notebook where you include well-crafted conclusions about the exercises, discussing the pros and cons of using genetic algorithms for this purpose.\n"
+"Since there is a problem with Scikit version 0.24, you can just comment on the different approaches.",
+"\n",
+"Alternatively, you can also use the library [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/index.html) and discuss the digit classification example included in the library: [digits decision tree](https://sklearn-genetic-opt.readthedocs.io/en/stable/notebooks/Digits_decision_tree.html)."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Optional exercises\n",
+"## Optional. Optimizing an ML pipeline with a genetic algorithm\n",
 "\n",
-"Here there is a proposed optional exercise."
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"## Optimizing a ML pipeline with a genetic algorithm\n",
-"\n",
-"The library [TPOT](#References) optimizes ML pipelines and comes with a lot of (examples)[https://epistasislab.github.io/tpot/examples/] and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
+"The library [TPOT](https://epistasislab.github.io/tpot/latest/) optimizes ML pipelines and comes with a lot of [examples](https://epistasislab.github.io/tpot/latest/Tutorial/9_Genetic_Algorithm_Overview/) and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
 "\n",
 "Your task is to apply TPOT to the intermediate challenge and write a short essay explaining:\n",
 "* what TPOT does (with your own words).\n",
@@ -240,7 +233,8 @@
 "* [tpot](http://epistasislab.github.io/tpot/)\n",
 "* [gplearn](http://gplearn.readthedocs.io/en/latest/index.html)\n",
 "* [scikit-allel](https://scikit-allel.readthedocs.io/en/latest/)\n",
-"* [scklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)"
+"* [sklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)\n",
+"* [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/)"
 ]
 },
 {
@@ -254,13 +248,22 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]
 }
 ],
 "metadata": {
+"datacleaner": {
+"position": {
+"top": "50px"
+},
+"python": {
+"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
+},
+"window_display": false
+},
 "kernelspec": {
 "display_name": "Python 3",
 "language": "python",
@@ -276,7 +279,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.7.9"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

View File

@@ -48,7 +48,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
+"1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
+"1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
+"1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
 ]
 },
 {
@@ -64,7 +66,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -78,7 +80,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.10.10"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,

View File

@@ -1,443 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Introduction](#Introduction)\n",
"* [Getting started with OpenAI Gym](#Getting-started-with-OpenAI-Gym)\n",
"* [The Frozen Lake scenario](#The-Frozen-Lake-scenario)\n",
"* [Q-Learning with the Frozen Lake scenario](#Q-Learning-with-the-Frozen-Lake-scenario)\n",
"* [Exercises](#Exercises)\n",
"* [Optional exercises](#Optional-exercises)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"The purpose of this practice is to understand better Reinforcement Learning (RL) and, in particular, Q-Learning.\n",
"\n",
"We are going to use [OpenAI Gym](https://gym.openai.com/). OpenAI Gym is a toolkit for developing and comparing RL algorithms. Take a look at their [website](https://gym.openai.com/).\n",
"\n",
"It implements [algorithm imitation](http://gym.openai.com/envs/#algorithmic), [classic control problems](http://gym.openai.com/envs/#classic_control), [Atari games](http://gym.openai.com/envs/#atari), [Box2D continuous control](http://gym.openai.com/envs/#box2d), [robotics with MuJoCo, Multi-Joint dynamics with Contact](http://gym.openai.com/envs/#mujoco), and [simple text based environments](http://gym.openai.com/envs/#toy_text).\n",
"\n",
"This notebook is based on [Diving deeper into Reinforcement Learning with Q-Learning](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"\n",
"First of all, install the OpenAI Gym library:\n",
"\n",
"```console\n",
"foo@bar:~$ pip install gym\n",
"```\n",
"\n",
"\n",
"If you get the error message 'NotImplementedError: abstract', [execute](https://github.com/openai/gym/issues/775) \n",
"```console\n",
"foo@bar:~$ pip install pyglet==1.2.4\n",
"```\n",
"\n",
"If you want to try the Atari environment, it is better that you opt for the full installation from the source. Follow the instructions at [OpenAI Gym](https://github.com/openai/gym#id15).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started with OpenAI Gym\n",
"\n",
"First of all, read the [introduction](http://gym.openai.com/docs/#getting-started-with-gym) of OpenAI Gym."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Environments\n",
"OpenAI Gym provides a number of problems called *environments*. \n",
"\n",
"Try 'CartPole-v0' (or 'MountainCar-v0')."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"\n",
"env = gym.make('CartPole-v0')\n",
"#env = gym.make('MountainCar-v0')\n",
"#env = gym.make('Taxi-v2')\n",
"\n",
"#env = gym.make('Jamesbond-ram-v0')\n",
"\n",
"env.reset()\n",
"for _ in range(1000):\n",
" env.render()\n",
" env.step(env.action_space.sample()) # take a random action"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This will launch an external window with the game. If you cannot close that window, just execute in a code cell:\n",
"\n",
"```python\n",
"env.close()\n",
"```\n",
"\n",
"The full list of available environments can be found printing the environment registry as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from gym import envs\n",
"print(envs.registry.all())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The environment's **step** function returns four values. These are:\n",
"\n",
"* **observation (object):** an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.\n",
"* **reward (float):** amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.\n",
"* **done (boolean):** whether it's time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)\n",
"* **info (dict):** diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment's last state change). However, official evaluations of your agent are not allowed to use this for learning.\n",
"\n",
"The typical agent loop consists of first calling the method *reset*, which provides an initial observation. Then the agent executes an action and receives the reward, the new observation, and whether the episode has finished (done is true). \n",
"\n",
"For example, analyze this sample agent loop of 100 steps. The details of the previous variables for this game, as described [here](https://github.com/openai/gym/wiki/CartPole-v0), are:\n",
"* **observation**: Cart Position, Cart Velocity, Pole Angle, Pole Velocity.\n",
"* **action**: 0\t(Push cart to the left), 1\t(Push cart to the right).\n",
"* **reward**: 1 for every step taken, including the termination step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gym\n",
"env = gym.make('CartPole-v0')\n",
"for i_episode in range(20):\n",
" observation = env.reset()\n",
" for t in range(100):\n",
" env.render()\n",
" print(observation)\n",
" action = env.action_space.sample()\n",
" print(\"Action \", action)\n",
" observation, reward, done, info = env.step(action)\n",
" print(\"Observation \", observation, \", reward \", reward, \", done \", done, \", info \" , info)\n",
" if done:\n",
" print(\"Episode finished after {} timesteps\".format(t+1))\n",
" break"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# The Frozen Lake scenario\n",
"We are going to play the [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/) game.\n",
"\n",
"The problem is a grid where you should go from the 'start' (S) position to the 'goal' position (G) (the pizza!). You can only walk through the 'frozen tiles' (F). Unfortunately, you can fall into a 'hole' (H).\n",
"![](images/frozenlake-problem.png \"Frozen lake problem\")\n",
"\n",
"The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise. The possible actions are going left, right, up or down. However, the ice is slippery, so you won't always move in the direction you intend.\n",
"\n",
"![](images/frozenlake-world.png \"Frozen lake world\")\n",
"\n",
"\n",
"Here you can see several episodes. A full recording is available at [Frozen World](http://gym.openai.com/envs/FrozenLake-v0/).\n",
"\n",
"![](images/recording.gif \"Example running\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Q-Learning with the Frozen Lake scenario\n",
"We are now going to apply Q-Learning for the Frozen Lake scenario. This part of the notebook is taken from [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb).\n",
"\n",
"First, we create the environment and a Q-table initialized with zeros to store the value of each action in a given state. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import gym\n",
"import random\n",
"\n",
"env = gym.make(\"FrozenLake-v0\")\n",
"\n",
"\n",
"action_size = env.action_space.n\n",
"state_size = env.observation_space.n\n",
"\n",
"\n",
"qtable = np.zeros((state_size, action_size))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we define the hyperparameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Q-Learning hyperparameters\n",
"total_episodes = 10000 # Total episodes\n",
"learning_rate = 0.8 # Learning rate\n",
"max_steps = 99 # Max steps per episode\n",
"gamma = 0.95 # Discounting rate\n",
"\n",
"# Exploration hyperparameters\n",
"epsilon = 1.0 # Exploration rate\n",
"max_epsilon = 1.0 # Exploration probability at start\n",
"min_epsilon = 0.01 # Minimum exploration probability \n",
"decay_rate = 0.01 # Exponential decay rate for exploration prob"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we implement the Q-Learning algorithm.\n",
"\n",
"![](images/qlearning-algo.png \"Q-Learning algorithm\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# List of rewards\n",
"rewards = []\n",
"\n",
"# 2 For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
" # Reset the environment\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" total_rewards = 0\n",
" \n",
" for step in range(max_steps):\n",
" # 3. Choose an action a in the current world state (s)\n",
" ## First we randomize a number\n",
" exp_exp_tradeoff = random.uniform(0, 1)\n",
" \n",
" ## If this number > greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
" if exp_exp_tradeoff > epsilon:\n",
" action = np.argmax(qtable[state,:])\n",
"\n",
" # Else doing a random choice --> exploration\n",
" else:\n",
" action = env.action_space.sample()\n",
"\n",
" # Take the action (a) and observe the outcome state(s') and reward (r)\n",
" new_state, reward, done, info = env.step(action)\n",
"\n",
" # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
" # qtable[new_state,:] : all the actions we can take from new state\n",
" qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
" \n",
" total_rewards += reward\n",
" \n",
" # Our new state is state\n",
" state = new_state\n",
" \n",
" # If done (if we're dead) : finish episode\n",
" if done == True: \n",
" break\n",
" \n",
" episode += 1\n",
" # Reduce epsilon (because we need less and less exploration)\n",
" epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode) \n",
" rewards.append(total_rewards)\n",
"\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"print(qtable)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, we use the learned Q-table to play the Frozen Lake game."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"\n",
"env.reset()\n",
"\n",
"for episode in range(5):\n",
" state = env.reset()\n",
" step = 0\n",
" done = False\n",
" print(\"****************************************************\")\n",
" print(\"EPISODE \", episode)\n",
"\n",
" for step in range(max_steps):\n",
" env.render()\n",
" # Take the action (index) that have the maximum expected future reward given that state\n",
" action = np.argmax(qtable[state,:])\n",
" \n",
" new_state, reward, done, info = env.step(action)\n",
" \n",
" if done:\n",
" break\n",
" state = new_state\n",
"env.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n",
"## Taxi\n",
"Analyze the [Taxi problem](http://gym.openai.com/envs/Taxi-v2/) and solve it by applying Q-Learning. You can find a solution like the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym).\n",
"\n",
"Analyze the impact of not changing the learning rate (alpha or epsilon, depending on the book) or changing it in a different way."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional exercises\n",
"\n",
"## Doom\n",
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.5"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,138 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n",
"\n",
"## Taxi\n",
"Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it by applying Q-Learning. You can find a solution like the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and the notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Take into account that Gymnasium has changed, so you will have to adapt the code; see the sketch below.\n",
"\n",
"Analyze the impact of not changing the learning rate or changing it in a different way. "
]
},
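{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hint, below is a minimal sketch of the current Gymnasium agent loop (a random policy on *Taxi-v3*; the Q-table logic is left to you): **reset()** now returns *(observation, info)* and **step()** returns *(observation, reward, terminated, truncated, info)*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import gymnasium as gym\n",
"\n",
"env = gym.make('Taxi-v3')\n",
"state, info = env.reset()\n",
"done = False\n",
"while not done:\n",
"    action = env.action_space.sample()  # replace with an argmax over your Q-table\n",
"    state, reward, terminated, truncated, info = env.step(action)\n",
"    done = terminated or truncated\n",
"env.close()"
]
},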
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Optional exercises\n",
"Select one of the following exercises.\n",
"\n",
"## Blackjack\n",
"Analyze how to apply Q-Learning to solve Blackjack.\n",
"You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
"\n",
"## Doom\n",
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

ml5/qlearning.py (new file, 274 lines)

@@ -0,0 +1,274 @@
# Class definition of QLearning
from pathlib import Path
from typing import NamedTuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map


# Params
class Params(NamedTuple):
    total_episodes: int  # Total episodes
    learning_rate: float  # Learning rate
    gamma: float  # Discounting rate
    epsilon: float  # Exploration probability
    map_size: int  # Number of tiles of one side of the squared environment
    seed: int  # Define a seed so that we get reproducible results
    is_slippery: bool  # If true, the player moves in the intended direction with probability 1/3; otherwise it moves in either perpendicular direction, each with probability 1/3
    n_runs: int  # Number of runs
    action_size: int  # Number of possible actions
    state_size: int  # Number of possible states
    proba_frozen: float  # Probability that a tile is frozen
    savefig_folder: Path  # Root folder where plots are saved


class Qlearning:
    def __init__(self, learning_rate, gamma, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.reset_qtable()

    def update(self, state, action, reward, new_state):
        """Update Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]."""
        delta = (
            reward
            + self.gamma * np.max(self.qtable[new_state][:])
            - self.qtable[state][action]
        )
        q_update = self.qtable[state][action] + self.learning_rate * delta
        return q_update

    def reset_qtable(self):
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng

    def choose_action(self, action_space, state, qtable):
        """Choose an action `a` in the current world state (s)."""
        # First we randomize a number
        explor_exploit_tradeoff = self.rng.uniform(0, 1)
        # Exploration
        if explor_exploit_tradeoff < self.epsilon:
            action = action_space.sample()
        # Exploitation (taking the biggest Q-value for this state)
        else:
            # Break ties randomly
            # If all actions are the same for this state we choose a random one
            # (otherwise `np.argmax()` would always take the first one)
            if np.all(qtable[state][:] == qtable[state][0]):
                action = action_space.sample()
            else:
                action = np.argmax(qtable[state][:])
        return action


def run_frozen_maps(maps, params, rng):
    """Run FrozenLake in maps and plot results."""
    map_sizes = maps
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()
    for map_size in map_sizes:
        env = gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(
                size=map_size, p=params.proba_frozen, seed=params.seed
            ),
        )
        params = params._replace(action_size=env.action_space.n)
        params = params._replace(state_size=env.observation_space.n)
        env.action_space.seed(
            params.seed
        )  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            learning_rate=params.learning_rate,
            gamma=params.gamma,
            state_size=params.state_size,
            action_size=params.action_size,
        )
        explorer = EpsilonGreedy(
            epsilon=params.epsilon,
            rng=rng
        )
        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)
        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs
        plot_states_actions_distribution(
            states=all_states, actions=all_actions, map_size=map_size, params=params
        )  # Sanity check
        plot_q_values_map(qtable, env, map_size, params)
        env.close()
    return res_all, st_all


def run_env(env, params, learner, explorer):
    rewards = np.zeros((params.total_episodes, params.n_runs))
    steps = np.zeros((params.total_episodes, params.n_runs))
    episodes = np.arange(params.total_episodes)
    qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
    all_states = []
    all_actions = []

    for run in range(params.n_runs):  # Run several times to account for stochasticity
        learner.reset_qtable()  # Reset the Q-table between runs
        for episode in tqdm(
            episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
        ):
            state = env.reset(seed=params.seed)[0]  # Reset the environment
            step = 0
            done = False
            total_rewards = 0
            while not done:
                action = explorer.choose_action(
                    action_space=env.action_space, state=state, qtable=learner.qtable
                )
                # Log all states and actions
                all_states.append(state)
                all_actions.append(action)
                # Take the action (a) and observe the outcome state (s') and reward (r)
                new_state, reward, terminated, truncated, info = env.step(action)
                done = terminated or truncated
                learner.qtable[state, action] = learner.update(
                    state, action, reward, new_state
                )
                total_rewards += reward
                step += 1
                # Our new state is state
                state = new_state
            # Log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step
        qtables[run, :, :] = learner.qtable
    return rewards, steps, episodes, qtables, all_states, all_actions


def postprocess(episodes, params, rewards, steps, map_size):
    """Convert the results of the simulation in dataframes."""
    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(),
            "Steps": steps.flatten(),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])
    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st


def qtable_directions_map(qtable, map_size):
    """Get the best learned action & map it to arrows."""
    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}  # LEFT, DOWN, RIGHT, UP
    qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
    eps = np.finfo(float).eps  # Minimum float number on the machine
    for idx, val in enumerate(qtable_best_action.flatten()):
        if qtable_val_max.flatten()[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]
    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions


def plot_q_values_map(qtable, env, map_size, params):
    """Plot the last frame of the simulation and the policy learned."""
    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    # Plot the last frame
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    ax[0].imshow(env.render())
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.7,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
        annot_kws={"fontsize": "xx-large"},
    ).set(title="Learned Q-values\nArrows represent best action")
    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")
    img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_states_actions_distribution(states, actions, map_size, params):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")
    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")
    fig.tight_layout()
    img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_steps_and_rewards(rewards_df, steps_df, params):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.lineplot(
        data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
    )
    ax[0].set(ylabel="Cumulated rewards")
    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")
    for axi in ax:
        axi.legend(title="map size")
    fig.tight_layout()
    img_title = "frozenlake_steps_and_rewards.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()
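

# ---------------------------------------------------------------------------
# Usage sketch (illustrative, not part of the original module): the values
# below are assumptions chosen for a quick run; adjust them for real use.
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    params = Params(
        total_episodes=2000,
        learning_rate=0.8,
        gamma=0.95,
        epsilon=0.1,
        map_size=4,
        seed=123,
        is_slippery=False,
        n_runs=5,
        action_size=None,  # filled in by run_frozen_maps from the environment
        state_size=None,  # filled in by run_frozen_maps from the environment
        proba_frozen=0.9,
        savefig_folder=Path("img"),
    )
    params.savefig_folder.mkdir(parents=True, exist_ok=True)
    rng = np.random.default_rng(params.seed)
    # Train on two map sizes and plot the aggregated results
    res_all, st_all = run_frozen_maps([4, 7], params, rng)
    plot_steps_and_rewards(res_all, st_all, params)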

nlp/0_1_LLM.ipynb (new file, 742 lines; diff suppressed because one or more lines are too long)

nlp/0_1_NLP_Slides.ipynb (new file, 2538 lines; diff suppressed because one or more lines are too long)

@@ -0,0 +1,333 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents\n",
"* [First steps](#First-steps)\n",
"* [Movie review](#Movie-review)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# First steps\n",
"Given the text taken from https://www.romania-insider.com/baneasa-airport-reopening-date-jul-2022.\n",
"\n",
"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"text = \"The Aurel Vlaicu Băneasa Airport will reopen on August 1, with scheduled commercial flights resuming after a nine-year hiatus, George Dorobanțu, the director of the Bucharest National Airports Company (CNAB), announced in an interview with the public radio. Three companies are already ready to start scheduled and charter flights on Băneasa, namely Ryanair, Air Connect, and Fly One, the director said.\""
]
},
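{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"A possible starting point for the exercises below (assuming spaCy and a small English model such as `en_core_web_sm` are installed) is to build the `doc` object they refer to:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"import spacy\n",
"\n",
"# Load a small English pipeline (install it first with: python -m spacy download en_core_web_sm)\n",
"nlp = spacy.load('en_core_web_sm')\n",
"doc = nlp(text)"
]
},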
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### 1. List the first 10 tokens of the doc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 2. Number of tokens of the text."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 3. List the Noun chunks\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 4. Print the sentences of the text"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 5. Print the number of sentences of the text\n",
"Hint: build a list first"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 6. Print the second sentence. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 7. Visualize the dependency grammar analysis of the second sentence."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 8. Listing lemmas and deps\n",
"For every token in the second sentence, print the text token, the grammatical category, and the lemma in four columns.\n",
"\n",
"Example:\n",
"\n",
"you&emsp;&emsp;PRON&emsp;&emsp;you&emsp;&emsp;nsubj\n",
"\n",
"Hint: format the columns. You can use expandtabs."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 9. List the frequencies of POS in the document in a table."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 10. Preprocessing\n",
"\n",
"Remove from the doc stopwords, digits, and punctuation.\n",
"\n",
"Hint: check the token api https://spacy.io/api/token\n",
"\n",
"Print the number of tokens before and after preprocessing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 11. Entities of the document\n",
"Print the entities of the document, the type of the entity, and the explanation of the entity in a table with three columns.\n",
"\n",
"Example:\n",
"\n",
"Ubuntu&emsp;&emsp;&emsp;&emsp;ORG&emsp;&emsp;&emsp;&emsp;Companies, agencies, institutions, etc."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### 12. Visualize the entities\n",
"Show the entities highlighted in the text."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Movie review\n",
"\n",
"Classify the movie reviews from the following dataset https://data.world/rajeevsharma993/movie-reviews"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## References\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"* [Spacy](https://spacy.io/usage/spacy-101/#annotations) \n",
"* [NLTK stemmer](https://www.nltk.org/howto/stem.html)\n",
"* [NLTK Book. Natural Language Processing with Python. Steven Bird, Ewan Klein, and Edward Loper. O'Reilly Media, 2009 ](http://www.nltk.org/book_1ed/)\n",
"* [NLTK Essentials, Nitin Hardeniya, Packt Publishing, 2015](http://proquest.safaribooksonline.com/search?q=NLTK%20Essentials)\n",
"* Natural Language Processing with Python, José Portilla, 2019."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}


@@ -105,9 +105,23 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 2,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/html": [
+"<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>CountVectorizer(max_features=5000)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">CountVectorizer</label><div class=\"sk-toggleable__content\"><pre>CountVectorizer(max_features=5000)</pre></div></div></div></div></div>"
+],
+"text/plain": [
+"CountVectorizer(max_features=5000)"
+]
+},
+"execution_count": 2,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
@@ -128,9 +142,21 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 3,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/plain": [
+"<3x10 sparse matrix of type '<class 'numpy.int64'>'\n",
+"\twith 15 stored elements in Compressed Sparse Row format>"
+]
+},
+"execution_count": 3,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectors = vectorizer.fit_transform(documents)\n",
"vectors"
@@ -146,12 +172,24 @@
],
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 4,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"[[0 1 1 2 0 0 1 2 0 0]\n",
+" [1 0 0 0 2 0 0 1 2 1]\n",
+" [1 0 0 0 2 1 0 0 1 1]]\n",
+"['and' 'but' 'coming' 'is' 'like' 'sandwiches' 'short' 'summer' 'the'\n",
+" 'winter']\n"
+]
+}
+],
"source": [
"print(vectors.toarray())\n",
-"print(vectorizer.get_feature_names())"
+"print(vectorizer.get_feature_names_out())"
]
},
{
@@ -164,13 +202,25 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 5,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"data": {
+"text/plain": [
+"array(['and', 'but', 'coming', 'i', 'is', 'like', 'sandwiches', 'short',\n",
+" 'summer', 'the', 'winter'], dtype=object)"
+]
+},
+"execution_count": 5,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words=None, token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
-"vectorizer.get_feature_names()"
+"vectorizer.get_feature_names_out()"
]
},
{
@@ -182,20 +232,47 @@
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 6,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stderr",
+"output_type": "stream",
+"text": [
+"/home/cif/anaconda3/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n",
+" warnings.warn(msg, category=FutureWarning)\n"
+]
+},
+{
+"data": {
+"text/plain": [
+"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
+]
+},
+"execution_count": 6,
+"metadata": {},
+"output_type": "execute_result"
+}
+],
"source": [
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
"vectors = vectorizer.fit_transform(documents)\n",
-"vectorizer.get_feature_names()"
+"vectorizer.get_feature_names_out()"
]
},
{
"cell_type": "code",
-"execution_count": null,
+"execution_count": 7,
"metadata": {},
-"outputs": [],
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"frozenset({'or', 'be', 'least', 'ours', 'very', 'noone', 'more', 'can', 'front', 'last', 'co', 'where', 'beyond', 'you', 'was', 'to', 'nine', 'here', 'describe', 'than', 'rather', 'therefore', 'except', 'at', 'again', 'ourselves', 'most', 'anyway', 'thick', 'whither', 'thereupon', 'someone', 'hereupon', 'besides', 'among', 'hasnt', 'across', 'namely', 'because', 'is', 'out', 'same', 'yourself', 'somehow', 'sincere', 'con', 'hereby', 'towards', 'interest', 'much', 'up', 'why', 'myself', 'all', 'nobody', 'though', 'every', 'show', 'not', 'there', 'whether', 'still', 'name', 'when', 'the', 'each', 'six', 'nor', 'and', 'under', 'thereby', 'less', 'either', 'thence', 'into', 'seemed', 'something', 'four', 'sometimes', 'himself', 'those', 'nowhere', 'almost', 'are', 'empty', 'must', 'while', 'afterwards', 'perhaps', 'from', 'detail', 'through', 'any', 'have', 'may', 'he', 'anywhere', 'alone', 'without', 'beforehand', 'had', 'too', 'yourselves', 'our', 'see', 'how', 'please', 'what', 'am', 'do', 'it', 'serious', 'yet', 'down', 'top', 'amount', 'then', 'both', 'fire', 'been', 'wherein', 'done', 'etc', 'whose', 'whereafter', 'who', 'ltd', 'meanwhile', 'further', 'few', 'first', 'behind', 'made', 'yours', 'until', 'toward', 'amoungst', 'anyhow', 'we', 'with', 'give', 'go', 'no', 'back', 'else', 'becomes', 'your', 'fill', 'together', 'another', 'throughout', 'onto', 'de', 'me', 'ten', 'system', 'became', 'per', 'therein', 'everyone', 'often', 'ie', 'put', 'hers', 'herself', 'nevertheless', 'itself', 'eg', 'herein', 'his', 'this', 'cry', 'due', 'bill', 'one', 'on', 'being', 'themselves', 'of', 'some', 'their', 'neither', 'elsewhere', 'since', 'whole', 'eight', 'i', 'a', 'whoever', 'own', 'call', 'them', 'mostly', 'she', 'my', 'cannot', 'us', 'never', 'as', 'thin', 'upon', 'cant', 'un', 'before', 'her', 'otherwise', 'full', 'these', 'next', 'they', 'side', 'somewhere', 'fifty', 'hence', 'so', 'along', 'already', 'three', 'latter', 'anything', 'whom', 'could', 'indeed', 'nothing', 'whereby', 'which', 'sometime', 'become', 'ever', 'amongst', 'by', 'in', 'five', 'after', 'mine', 'fifteen', 'wherever', 'found', 'thereafter', 'third', 'keep', 'anyone', 'will', 'bottom', 'off', 'seem', 'none', 'an', 'whatever', 'over', 'during', 'also', 'latterly', 'via', 'take', 'former', 'above', 'now', 'becoming', 'hereafter', 'such', 'two', 'only', 'about', 'sixty', 're', 'everything', 'others', 'hundred', 'twelve', 'thus', 'even', 'well', 'always', 'once', 'beside', 'get', 'mill', 'seems', 'if', 'whereupon', 'find', 'forty', 'inc', 'whenever', 'around', 'other', 'should', 'many', 'enough', 'however', 'move', 'against', 'several', 'everywhere', 'has', 'whereas', 'that', 'whence', 'eleven', 'its', 'within', 'twenty', 'part', 'although', 'thru', 'couldnt', 'moreover', 'him', 'formerly', 'might', 'seeming', 'but', 'below', 'would', 'between', 'were', 'for'})\n"
+]
+}
+],
"source": [
"#stop words in scikit-learn for English\n",
"print(vectorizer.get_stop_words())"
@@ -442,7 +519,7 @@
],
"metadata": {
"kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
@@ -456,7 +533,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.10.10"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,

Some files were not shown because too many files have changed in this diff.