1
0
mirror of https://github.com/gsi-upm/sitc synced 2025-01-07 03:31:28 +00:00

Compare commits

..

1 Commits

Author SHA1 Message Date
emma-guillen-upm
81fafc2c35
Merge 23dfc8663f into 3363c953f4 2023-05-19 23:21:11 +02:00
79 changed files with 62 additions and 26805 deletions

View File

@ -1,7 +1,7 @@
# sitc
Exercises for Intelligent Systems Course at Universidad Politécnica de Madrid, Telecommunication Engineering School. This material is used in the subjects
- CDAW (Ciencia de datos y aprendizaje en automático en la web de datos) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
- ABID (Analítica de Big Data) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos)
- SITC (Sistemas Inteligentes y Tecnologías del Conocimiento) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
- TIAD (Tecnologías Inteligentes de Análisis de Datos) - Master Universitario en Ingeniera de Redes y Servicios Telemáticos)
For following this course:
- Follow the instructions to install the environment: https://github.com/gsi-upm/sitc/blob/master/python/1_1_Notebooks.ipynb (Just install 'conda')
@ -9,13 +9,11 @@ For following this course:
- Run in a terminal in the folder sitc: jupyter notebook (and enjoy)
Topics
* Python: a quick introduction to Python
* Python: quick introduction to Python
* ML-1: introduction to machine learning with scikit-learn
* ML-2: introduction to machine learning with pandas and scikit-learn
* ML-21: preprocessing and visualizatoin
* ML-3: introduction to machine learning. Neural Computing
* ML-4: introduction to Evolutionary Computing
* ML-5: introduction to Reinforcement Learning
* NLP: introduction to NLP
* LOD: Linked Open Data, exercises and example code
* SNA: Social Network Analysis

View File

@ -1 +0,0 @@

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.1 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 95 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 34 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 54 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 222 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 44 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 237 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 87 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 87 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 58 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 85 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 1.8 MiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 152 KiB

View File

@ -71,6 +71,7 @@
"source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
"* [scikit-learn : Machine Learning Simplified](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), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
]
},

View File

@ -63,7 +63,9 @@
"metadata": {},
"source": [
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
]
},
{

View File

@ -228,6 +228,7 @@
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",

View File

@ -408,6 +408,7 @@
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",

View File

@ -163,6 +163,7 @@
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"

View File

@ -154,7 +154,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)\n",
"* [General concepts of machine learning with scikit-learn](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/tutorial/plot_ML_flow_chart.html)\n",
"* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)"
]
},

View File

@ -379,7 +379,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n"
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
]
},
{

View File

@ -509,6 +509,8 @@
"metadata": {},
"source": [
"* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019.\n",
"* [Parameter estimation using grid search with cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
]

View File

@ -518,6 +518,8 @@
"metadata": {},
"source": [
"* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019.\n",
"* [Hyperparameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
]

View File

@ -47,7 +47,7 @@ def get_code(tree, feature_names, target_names,
recurse(left, right, threshold, features, 0, 0)
# Taken from https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html
# Taken from http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py
import numpy as np
import matplotlib.pyplot as plt
@ -114,4 +114,4 @@ def plot_tree_iris():
plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()
plt.show()

View File

@ -74,7 +74,9 @@
"metadata": {},
"source": [
"* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n"
"* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
"* [Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781838826048/), Tarek Amr, Packt Publishing, 2020.\n",
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka and Vahid Mirjalili, Packt Publishing, 2019."
]
},
{

View File

@ -50,30 +50,30 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this session, we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
"In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
"\n",
"![Titanic](images/titanic.jpg)\n",
"\n",
"\n",
"The main objective is to predict which passengers survived the sinking of the Titanic.\n",
"The main objective is predicting which passengers survived the sinking of the Titanic.\n",
"\n",
"The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n",
"\n",
"\n",
"Here follows a description of the variables.\n",
"\n",
"| Variable | Description | Values |\n",
"|------------|---------------------------------|-----------------|\n",
"| survival | Survival |(0 = No; 1 = Yes)|\n",
"| Pclass | Name | |\n",
"| Sex | Sex | male, female |\n",
"| Age | Age | |\n",
"| SibSp |Number of Siblings/Spouses Aboard| |\n",
"| Parch |Number of Parents/Children Aboard| |\n",
"| Ticket | Ticket Number | |\n",
"| Fare | Passenger Fare | |\n",
"| Cabin | Cabin | |\n",
"| Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
"|Variable | Description| Values|\n",
"|-------------------------------|\n",
"| survival| Survival| (0 = No; 1 = Yes)|\n",
"|Pclass |Name | |\n",
"|Sex |Sex | male, female|\n",
"|Age |Age|\n",
"|SibSp |Number of Siblings/Spouses Aboard||\n",
"|Parch |Number of Parents/Children Aboard||\n",
"|Ticket|Ticket Number||\n",
"|Fare |Passenger Fare||\n",
"|Cabin |Cabin||\n",
"|Embarked |Port of Embarkation| (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
"\n",
"\n",
"The definitions used for SibSp and Parch are:\n",
@ -213,7 +213,8 @@
"* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",
"* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
"* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",
"* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n"
"* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n",
"* [NumPy tutorial](https://numpy.org/doc/stable/)"
]
},
{

View File

@ -433,6 +433,7 @@
"metadata": {},
"source": [
"* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2017](https://learning.oreilly.com/library/view/learning-pandas/9781787123137/)\n",
"* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)"

View File

@ -373,8 +373,8 @@
"source": [
"#Mean age of passengers per Passenger class\n",
"\n",
"#First we calculate the mean for the numeric columns\n",
"df.select_dtypes(np.number).groupby('Pclass').mean()"
"#First we calculate the mean\n",
"df.groupby('Pclass').mean()"
]
},
{

View File

@ -220,7 +220,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Analise distribution\n",
"# Analise distributon\n",
"df.hist(figsize=(10,10))\n",
"plt.show()"
]
@ -233,7 +233,7 @@
"source": [
"# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
"# while a value near -1 or 1 indicates strong correlation.\n",
"df.corr(numeric_only = True)"
"df.corr()"
]
},
{
@ -249,10 +249,11 @@
"metadata": {},
"outputs": [],
"source": [
"# General description of relationship between variables uwing Seaborn PairGrid\n",
"# General description of relationship betweek variables uwing Seaborn PairGrid\n",
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
"g.map(sns.scatterplot)\n",
"g.map_diag(plt.hist)\n",
"g.map_offdiag(plt.scatter)\n",
"g.add_legend()"
]
},

View File

@ -351,10 +351,10 @@
"We can obtain more information from the confussion matrix and the metric F1-score.\n",
"In a confussion matrix, we can see:\n",
"\n",
"| |**Predicted**: 0| **Predicted: 1**|\n",
"|-------------|----------------|-----------------|\n",
"|**Actual: 0**| TN | FP |\n",
"|**Actual: 1**| FN | TP |\n",
"||**Predicted**: 0| **Predicted: 1**|\n",
"|---------------------------|\n",
"|**Actual: 0**| TN | FP |\n",
"|**Actual: 1**| FN|TP|\n",
"\n",
"* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
"* **False positives (FP)**: actual negatives that were predicted as positives\n",

View File

@ -1 +0,0 @@

View File

@ -1 +0,0 @@

View File

@ -1,157 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Preprocessing\n",
"In this session, we will get more insight regarding how to preprocess data.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Objectives\n",
"The main objectives of this session are:\n",
"* Understanding the need for preprocessing\n",
"* Understanding different preprocessing techniques\n",
"* Experimenting with several environments for preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"1. [Home](00_Intro_Preprocessing.ipynb)\n",
"3. [Initial Check](02_Initial_Check.ipynb)\n",
"4. [Filter Data](03_Filter_Data.ipynb)\n",
"5. [Unknown values](04_Unknown_Values.ipynb)\n",
"6. [Duplicated values](05_Duplicated_Values.ipynb)\n",
"7. [Rescaling Data](06_Rescaling_Data.ipynb)\n",
"8. [Binarize Data](07_Binarize_Data.ipynb)\n",
"9. [Categorial features](08_Categorical.ipynb)\n",
"10. [String Data](09_String_Data.ipynb)\n",
"12. [Handy libraries for preprocessing](11_0_Handy.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,714 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Initial Check with Pandas\n",
"\n",
"We can start with a quick quality check."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Load and check data\n",
"Check which data you are loading."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Moran, Mr. James</td>\n",
" <td>male</td>\n",
" <td>NaN</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>330877</td>\n",
" <td>8.4583</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>McCarthy, Mr. Timothy J</td>\n",
" <td>male</td>\n",
" <td>54.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>17463</td>\n",
" <td>51.8625</td>\n",
" <td>E46</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Palsson, Master. Gosta Leonard</td>\n",
" <td>male</td>\n",
" <td>2.0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>349909</td>\n",
" <td>21.0750</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
" <td>female</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>347742</td>\n",
" <td>11.1333</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>10</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
" <td>female</td>\n",
" <td>14.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>237736</td>\n",
" <td>30.0708</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"5 6 0 3 \n",
"6 7 0 1 \n",
"7 8 0 3 \n",
"8 9 1 3 \n",
"9 10 1 2 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"5 Moran, Mr. James male NaN 0 \n",
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
"5 0 330877 8.4583 NaN Q \n",
"6 0 17463 51.8625 E46 S \n",
"7 1 349909 21.0750 NaN S \n",
"8 2 347742 11.1333 NaN S \n",
"9 0 237736 30.0708 NaN C "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Check number of columns and rows"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"(891, 12)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check names and types of columns\n",
"Check the data and type, for example if dates are of strings or what."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
" dtype='object')\n"
]
},
{
"data": {
"text/plain": [
"PassengerId int64\n",
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"SibSp int64\n",
"Parch int64\n",
"Ticket object\n",
"Fare float64\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get column names\n",
"print(df.columns)\n",
"# Get column data types\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Check if the column is unique"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"PassengerId is unique: True\n",
"Survived is unique: False\n",
"Pclass is unique: False\n",
"Name is unique: True\n",
"Sex is unique: False\n",
"Age is unique: False\n",
"SibSp is unique: False\n",
"Parch is unique: False\n",
"Ticket is unique: False\n",
"Fare is unique: False\n",
"Cabin is unique: False\n",
"Embarked is unique: False\n"
]
}
],
"source": [
"for i in column_names:\n",
" print('{} is unique: {}'.format(i, df[i].is_unique))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Check if the dataframe has an index\n",
"We will need it to do joins or merges."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=891, step=1)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check if there is an index. If not, you will get 'AtributeError: function object has no atribute index'\n",
"df.index"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,\n",
" 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,\n",
" 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,\n",
" 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,\n",
" 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,\n",
" 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,\n",
" 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,\n",
" 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,\n",
" 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,\n",
" 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,\n",
" 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,\n",
" 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,\n",
" 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,\n",
" 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,\n",
" 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,\n",
" 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,\n",
" 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,\n",
" 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,\n",
" 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,\n",
" 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,\n",
" 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,\n",
" 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,\n",
" 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,\n",
" 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,\n",
" 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,\n",
" 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,\n",
" 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,\n",
" 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,\n",
" 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,\n",
" 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,\n",
" 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,\n",
" 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,\n",
" 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,\n",
" 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,\n",
" 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,\n",
" 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,\n",
" 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,\n",
" 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,\n",
" 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,\n",
" 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,\n",
" 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,\n",
" 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,\n",
" 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,\n",
" 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,\n",
" 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,\n",
" 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,\n",
" 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,\n",
" 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,\n",
" 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,\n",
" 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,\n",
" 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,\n",
" 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,\n",
" 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,\n",
" 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,\n",
" 884, 885, 886, 887, 888, 889, 890])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# # Check the index values\n",
"df.index.values"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# If index does not exist\n",
"df.set_index('column_name_to_use', inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Count missing vales per column\n",
"df.isnull().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,150 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Filter Data\n",
"\n",
"Select the columns you want and delete the others."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Create list comprehension of the columns you want to lose\n",
"columns_to_drop = [column_names[i] for i in [1, 3, 5]]\n",
"# Drop unwanted columns \n",
"df.drop(columns_to_drop, inplace=True, axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,591 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Unknown values\n",
"\n",
"Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backward\n",
"You can also **propagate** non-null values with these methods:\n",
"\n",
"* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
"* **bfill**: Fill values using the following valid observation to fill the gap.\n",
"* **interpolate**: Fill NaN values using interpolation.\n",
"\n",
"It will fill the next value in the dataframe with the previous non-NaN value. \n",
"\n",
"You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fill forward the value 4.0 and fill the next one (limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
" df.ffill(limit = 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.ffill()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.bfill()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Drop any rows which have any nans\n",
"df1 = df.dropna()\n",
"# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
"df2 = df.dropna(axis=1)\n",
"# Only drop columns which have at least 90% non-NaNs \n",
"df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

View File

@ -1,198 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Binarize Data\n",
"* We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below are marked 0.\n",
"* This is called binarizing your data or thresholding your data. \n",
"\n",
"* It can be helpful when you have probabilities that you want to make crisp values."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Binarize Data with Scikit-Learn\n",
"We can create new binary attributes in Python using Scikit-learn with the Binarizer class.\n",
"I"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from sklearn.preprocessing import Binarizer\n",
"\n",
"X = [[ 1., -1., 2.],\n",
" [ 2., 0., 0.],\n",
" [ 0., 1.1, -1.]]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"transformer = Binarizer(threshold=1.0).fit(X) # threshold 1.0"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 1.],\n",
" [1., 0., 0.],\n",
" [0., 1., 0.]])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"transformer.transform(X)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,812 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Categorical Data\n",
"\n",
"For many ML algorithms, we need to transform categorical data into numbers.\n",
"\n",
"For example:\n",
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*. \n",
"* **'Position'** with values 'phD', *'Professor'*, *'TA'*, *'graduate'*.\n",
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
"\n",
"There are two main approaches:\n",
"* Integer encoding\n",
"* One hot encoding"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Integer Encoding\n",
"We assign a number to every value:\n",
"\n",
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
"\n",
"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
"\n",
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
"\n",
"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",
"\n",
"In our examples, this representation can be suitable for **temperature**, but not for the other two."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## One Hot Encoding\n",
"A binary column is created for each value of the categorical variable."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Sex M F U\n",
"----- ---------\n",
"M 1 0 0\n",
"F is transformed into 0 1 0\n",
"Unknown 0 0 1\n",
"M 1 0 0 "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Transforming categorical data with Scikit-Learn\n",
"\n",
"We can use:\n",
"* **get_dummies()** (one hot encoding)\n",
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding). \n",
"\n",
"We are going to learn the first approach."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One Hot Encoding\n",
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
" \"Age\": [18, 19, 20, 30],\n",
"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
" }\n",
"df = pd.DataFrame(data)\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
"0 Marius 18 1 1 False True \n",
"1 Maria 19 0 3 True False \n",
"2 John 20 1 0 False True \n",
"3 Carla 30 0 2 True False \n",
"\n",
" Position_TA Position_graduate Position_phD Position_professor \n",
"0 False True False False \n",
"1 False False False True \n",
"2 True False False False \n",
"3 False False True False "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.\n"
]
}
],
"source": [
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
"df_onehot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use *OneHotEncoder* from Scikit."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
"0 0.0 1.0 0.0 1.0 0.0 \n",
"1 1.0 0.0 0.0 0.0 0.0 \n",
"2 0.0 1.0 1.0 0.0 0.0 \n",
"3 1.0 0.0 0.0 0.0 1.0 \n",
"\n",
" Position_professor Name Age sex_encoded position_encoded \n",
"0 0.0 Marius 18 1 1 \n",
"1 1.0 Maria 19 0 3 \n",
"2 0.0 John 20 1 0 \n",
"3 0.0 Carla 30 0 2 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.compose import make_column_transformer\n",
"\n",
"df_onehotencoder = df\n",
"# create OneHotEncoder object\n",
"encoder = OneHotEncoder()\n",
"\n",
"# Transformer for several columns\n",
"transformer = make_column_transformer(\n",
" (OneHotEncoder(), ['Sex', 'Position']),\n",
" remainder='passthrough',\n",
" verbose_feature_names_out=False)\n",
"\n",
"# transform\n",
"transformed = transformer.fit_transform(df_onehotencoder)\n",
"\n",
"df_onehotencoder = pd.DataFrame(\n",
" transformed,\n",
" columns=transformer.get_feature_names_out())\n",
"df_onehotencoder"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas' get_dummy is easier for transforming DataFrames. OneHotEncoder is more efficient and can be good for integrating the step in a machine learning pipeline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integer encoding\n",
"We will use **LabelEncoder**. It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"# creating instance of labelencoder\n",
"labelencoder = LabelEncoder()\n",
"df_encoded = df\n",
"# Assigning numerical values and storing in another column\n",
"sex_values = ('Male', 'Female')\n",
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded\n",
"0 Marius 18 Male graduate 1\n",
"1 Maria 19 Female professor 0\n",
"2 John 20 Male TA 1\n",
"3 Carla 30 Female phD 0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded position_encoded\n",
"0 Marius 18 Male graduate 1 1\n",
"1 Maria 19 Female professor 0 3\n",
"2 John 20 Male TA 1 0\n",
"3 Carla 30 Female phD 0 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
"df_encoded"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,652 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is widespread to clean string columns to follow a predefined format (e.g., emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"A simple [library](https://github.com/labtocat/beautifier) to cleanup and prettify URL patterns, domains, and so on. The library helps to clean Unicode, special characters, and unnecessary redirection patterns from the URLs and gives you a clean date.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: Some unicode code has been broken. We see the character in a different character dataset.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character encoding. Example: \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixed text for you) to fix it.\n",
"\n",
"First, you should install the library: **conda install ftfy** (or **pip install ftfy**)."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n",
"U+005C \\ [Po] REVERSE SOLIDUS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+0028 ( [Ps] LEFT PARENTHESIS\n",
"U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE\n",
"U+0083 \\x83 [Cc] <unknown>\n",
"U+0084 \\x84 [Cc] <unknown>\n",
"U+0029 ) [Pe] RIGHT PARENTHESIS\n",
"U+005F _ [Pc] LOW LINE\n",
"U+002F / [Po] SOLIDUS\n",
"U+0026 & [Po] AMPERSAND\n",
"U+006D m [Ll] LATIN SMALL LETTER M\n",
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
"U+003B ; [Po] SEMICOLON\n"
]
}
],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract date from text. We can use regular expressions or handy packages, such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-08 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), , A. Sharma, 2018.\n",
"* [Beautifier](https://github.com/labtocat/beautifier) package\n",
"* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package\n",
"* [python-dateutil](https://dateutil.readthedocs.io/en/stable/)package"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,139 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Handy libraries\n",
"Libraries that help in several preprocessing tasks.\n",
"\n",
"* [datacleaner](11_1_datacleaner.ipynb)\n",
"* [autoclean](11_3_autoclean.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,673 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Datacleaner\n",
"[Datacleaner](https://github.com/rhiever/datacleaner) supports:\n",
"\n",
"* drop rows with missing values\n",
"* replace missing values with the mode or median on a column-by-column basis\n",
"* encode non-numeric variables with numerical equivalents\n",
"\n",
"\n",
"Install with\n",
"\n",
"**pip install datacleaner**"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from datacleaner import autoclean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>108</td>\n",
" <td>1</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>523</td>\n",
" <td>7.2500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>190</td>\n",
" <td>0</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>596</td>\n",
" <td>71.2833</td>\n",
" <td>81</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>353</td>\n",
" <td>0</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>669</td>\n",
" <td>7.9250</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>272</td>\n",
" <td>0</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>49</td>\n",
" <td>53.1000</td>\n",
" <td>55</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>15</td>\n",
" <td>1</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>472</td>\n",
" <td>8.0500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>548</td>\n",
" <td>1</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>101</td>\n",
" <td>13.0000</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>303</td>\n",
" <td>0</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>14</td>\n",
" <td>30.0000</td>\n",
" <td>30</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>413</td>\n",
" <td>0</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>675</td>\n",
" <td>23.4500</td>\n",
" <td>47</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>81</td>\n",
" <td>1</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>8</td>\n",
" <td>30.0000</td>\n",
" <td>60</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>220</td>\n",
" <td>1</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>466</td>\n",
" <td>7.7500</td>\n",
" <td>47</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \\\n",
"0 1 0 3 108 1 22.0 1 0 523 \n",
"1 2 1 1 190 0 38.0 1 0 596 \n",
"2 3 1 3 353 0 26.0 0 0 669 \n",
"3 4 1 1 272 0 35.0 1 0 49 \n",
"4 5 0 3 15 1 35.0 0 0 472 \n",
".. ... ... ... ... ... ... ... ... ... \n",
"886 887 0 2 548 1 27.0 0 0 101 \n",
"887 888 1 1 303 0 19.0 0 0 14 \n",
"888 889 0 3 413 0 28.0 1 2 675 \n",
"889 890 1 1 81 1 26.0 0 0 8 \n",
"890 891 0 3 220 1 32.0 0 0 466 \n",
"\n",
" Fare Cabin Embarked \n",
"0 7.2500 47 2 \n",
"1 71.2833 81 0 \n",
"2 7.9250 47 2 \n",
"3 53.1000 55 2 \n",
"4 8.0500 47 2 \n",
".. ... ... ... \n",
"886 13.0000 47 2 \n",
"887 30.0000 30 2 \n",
"888 23.4500 47 2 \n",
"889 30.0000 60 0 \n",
"890 7.7500 47 1 \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean(df, copy=True)\n",
"df_clean"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,578 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "849ad57e-6adb-4c2e-afd6-73db37eef572",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"id": "179cc802-9f1d-40b0-bf0c-9d4fb7ea1262",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"id": "9858d815-0390-4e77-a5ff-a8d2a1960981",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"id": "238bab60-75f0-4d29-ab05-66afc463b506",
"metadata": {},
"source": [
"# Autoclean\n",
"A simple library to clean data. [Autoclean](https://github.com/elisemercury/AutoClean) supports:\n",
"AutoClean supports:\n",
"\n",
"* Handling of duplicates\n",
"* Various imputation methods for missing values\n",
"* Handling of outliers\n",
"* Encoding of categorical data (OneHot, Label)\n",
"* Extraction of data time values\n",
"\n",
"Install the package: **pip install py-AutoClean**.\n",
"\n",
"Parameters:\n",
"\n",
"* **duplicates**\n",
" * default: False,\n",
" * other values: 'auto', True\n",
"* **missing_num**\n",
" * default:False,\n",
" * other values:\t'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False\n",
"* **missing_categ**\n",
" * default: False,\n",
" * other values:\t'auto', 'logreg', 'knn', 'most_frequent', 'delete', False\n",
"* **encode_categ**\n",
" * default: False,\n",
" * other values:\t'auto', ['onehot'], ['label'], False ; to encode only specific columns add a list of column names or indexes: ['auto', ['col1', 2]]\n",
"* **extract_datetime**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'D', 'M', 'Y', 'h', 'm', 's'\n",
"* **outliers**\n",
" * default:\tFalse,\n",
" * other values:\t'auto', 'winz', 'delete'\n",
"* **outlier_param**\tdefault:\t1.5, other values:\tany int or float, False\n",
"* **logfile**\n",
" * default: True,\n",
" * other values:\tFalse\n",
"* **verbose**\n",
" * default: False,\n",
" * other values:\tTrue"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "491b034b-994e-4f06-b4bc-df0590a62aab",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S \n",
".. ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S \n",
"887 0 112053 30.0000 B42 S \n",
"888 2 W./C. 6607 23.4500 NaN S \n",
"889 0 111369 30.0000 C148 C \n",
"890 0 370376 7.7500 NaN Q \n",
"\n",
"[891 rows x 12 columns]"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from AutoClean import AutoClean\n",
"\n",
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "d842eedf-3971-4966-a8b4-543bb56dd60d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"AutoClean process completed in 0.289385 seconds\n",
"Logfile saved to: /home/cif/GoogleDrive/cursos/summer-school-romania/2019/notebooks/preprocessing/autoclean.log\n"
]
}
],
"source": [
"autoclean = AutoClean(df, mode='auto')\n",
"\n",
"# We can control the preprocessing\n",
"#autoclean = AutoClean(df, mode='auto', duplicates=False, missing_num=False, missing_categ=False, encode_categ=False, extract_datetime=False, outliers=False, outlier_param=1.5, logfile=True, verbose=False)\n"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "4ede7c55-475a-4748-8cc4-788f46c88b26",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Sex_female</th>\n",
" <th>Sex_male</th>\n",
" <th>Embarked_C</th>\n",
" <th>Embarked_Q</th>\n",
" <th>Embarked_S</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>65.6344</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>C128</td>\n",
" <td>S</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Sex_female Sex_male \\\n",
"0 0 A/5 21171 7.2500 C128 S False True \n",
"1 0 PC 17599 65.6344 C85 C True False \n",
"2 0 STON/O2. 3101282 7.9250 C128 S True False \n",
"3 0 113803 53.1000 C123 S True False \n",
"4 0 373450 8.0500 C128 S False True \n",
"\n",
" Embarked_C Embarked_Q Embarked_S \n",
"0 False False True \n",
"1 True False False \n",
"2 False False True \n",
"3 False False True \n",
"4 False False True "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_clean = autoclean.output\n",
"df_clean[0:5]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

View File

@ -1,502 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Duplicated values\n",
"\n",
"There are two possible approaches: **remove** these rows or **filling** them. It depends on every case.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Filling NaN values\n",
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
"\n",
"* For **string** fields, we can fill NaN with **' '**.\n",
"\n",
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"# Fill NaN with ' '\n",
"df['col'] = df['col'].fillna(' ')\n",
"# Fill NaN with 99\n",
"df['col'] = df['col'].fillna(99)\n",
"# Fill NaN with the mean of the column\n",
"df['col'] = df['col'].fillna(df['col'].mean())"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Propagate non-null values forward or backwards\n",
"You can also propagate non-null values forward or backwards by putting\n",
"method=pad as the method argument. It will fill the next value in the\n",
"dataframe with the previous non-NaN value. Maybe you just want to fill one\n",
"value ( limit=1 )or you want to fill all the values."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 NaN\n",
"1 NaN\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 4.0\n",
"6 NaN"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We fill forward the value 4.0 and fill the next one (limit = 1)\n",
"df.fillna(method='pad', limit=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can also backfilling with **bfill**."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>col1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" col1\n",
"0 2.0\n",
"1 2.0\n",
"2 2.0\n",
"3 3.0\n",
"4 4.0\n",
"5 NaN\n",
"6 NaN"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill the first two NaN values with the first available value\n",
"df.fillna(method='bfill')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Removing NaN values\n",
"We can remove them by row or column."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"/# Drop any rows which have any nans\n",
"df.dropna()\n",
"/# Drop columns that have any nans\n",
"df.dropna(axis=1)\n",
"/# Only drop columns which have at least 90% non-NaNs\n",
"df.dropna(thresh=int(df.shape[0] * .9), axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

View File

@ -1,619 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# String Data\n",
"It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).\n",
"\n",
"We can do it using regular expressions or specific libraries."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Beautifier\n",
"Simple [library](https://github.com/labtocat/beautifier) to cleanup and prettify url patterns, domains and so on. Library helps to clean unicodes, special characters and unnecessary redirection patterns from the urls and gives you clean date.\n",
"\n",
"Install with **'pip install beautifier'**."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Email cleanup"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Email\n",
"email = Email('me@imsach.in')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'imsach.in'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.domain"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'me'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.username"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email.is_free_email"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email2 = Email('This my address')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email2.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"email3 = Email('pepe@gmail.com')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_valid"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"email3.is_free_email"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## URL cleanup"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"from beautifier import Url\n",
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'https://in.linkedin.com/in/sachinphilip'"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.cleanup"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'in.linkedin.com'"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.domain"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.param"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.parameters"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"'sachinphilip'"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"url.username"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Unicode\n",
"Problem: Some unicode code has been broken. We see the character in a different character dataset.\n",
"\n",
"A **mojibake** is a character displayed in an unintended character enconding. Example: \"<22>\").\n",
"\n",
"We will use the library **ftfy** (fixed text for you) to fix it.\n",
"\n",
"First, you should install the library: ***conda install ftfy**. "
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"¯\\_(ツ)_/¯\n",
"Party\n",
"I'm\n"
]
}
],
"source": [
"import ftfy\n",
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
"bar = '\\ufeffParty'\n",
"baz = '\\001\\033[36;44mI&#x92;m'\n",
"print(ftfy.fix_text(foo))\n",
"print(ftfy.fix_text(bar))\n",
"print(ftfy.fix_text(baz))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We can understand which heuristics ftfy is using."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'ftfy' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-1-4030b963ff0a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mftfy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexplain_unicode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfoo\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mNameError\u001b[0m: name 'ftfy' is not defined"
]
}
],
"source": [
"ftfy.explain_unicode(foo)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Dates\n",
"Sometimes we want to extract date from text. We can use regular expressions or handy packages, such as **python-dateutil**.\n",
"\n",
"Install the library: **pip install python-dateutil**."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:22:46+00:00\n"
]
}
],
"source": [
"from dateutil.parser import parse\n",
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
"print(now)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2019-08-22 10:20:00\n"
]
}
],
"source": [
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
"print(dt)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
"* Beautifier https://github.com/labtocat/beautifier\n",
"* Ftfy https://ftfy.readthedocs.io/en/latest/\n",
"* python-dateutil https://dateutil.readthedocs.io/en/stable/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.1 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 152 KiB

View File

@ -1 +0,0 @@

View File

@ -1,185 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Visualization\n",
" \n",
"In this session, we will get more insight regarding how to visualize data.\n",
"\n",
"# Objectives\n",
"\n",
"The main objectives of this session are:\n",
"* Understanding how to visualize data\n",
"* Understanding the purpose of different charts \n",
"* Experimenting with several environments for visualizing data\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Seaborn\n",
"\n",
"Seaborn is a library that visualizes data in Python. The main characteristics are:\n",
"\n",
"* A dataset-oriented API for examining relationships between multiple variables\n",
"* Specialized support for using categorical variables to show observations or aggregate statistics\n",
"* Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n",
"* Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n",
"* Convenient views of the overall structure of complex datasets\n",
"* High-level abstractions for structuring multi-plot grids that let you quickly build complex visualizations\n",
"* Concise control over matplotlib figure styling with several built-in themes\n",
"* Tools for choosing color palettes that faithfully reveal patterns in your data\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Install\n",
"Use:\n",
"\n",
"**conda install seaborn**\n",
"\n",
"or \n",
"\n",
"**pip install seaborn**"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"1. [Home](00_Intro_Visualization.ipynb)\n",
"2. [Dataset](01_Dataset.ipynb)\n",
"3. [Comparison Charts](02_Comparison_Charts.ipynb)\n",
" 1. [More Comparison Charts](02_01_More_Comparison_Charts.ipynb)\n",
"4. [Distribution Charts](03_Distribution_Charts.ipynb)\n",
"5. [Hierarchical charts](04_Hierarchical_Charts.ipynb)\n",
"6. [Relational charts](05_Relational_Charts.ipynb)\n",
"7. [Spatial charts](06_Spatial_Charts.ipynb)\n",
"8. [Temporal charts](07_Temporal_Charts.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

View File

@ -1,363 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Visualization](00_Intro_Visualization.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"# Dataset\n",
"Seaborn includes several datasets. We can consult the available datasets and load them. \n",
"\n",
"The datasets are also available at https://github.com/mwaskom/seaborn-data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from matplotlib import pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['anagrams',\n",
" 'anscombe',\n",
" 'attention',\n",
" 'brain_networks',\n",
" 'car_crashes',\n",
" 'diamonds',\n",
" 'dots',\n",
" 'dowjones',\n",
" 'exercise',\n",
" 'flights',\n",
" 'fmri',\n",
" 'geyser',\n",
" 'glue',\n",
" 'healthexp',\n",
" 'iris',\n",
" 'mpg',\n",
" 'penguins',\n",
" 'planets',\n",
" 'seaice',\n",
" 'taxis',\n",
" 'tips',\n",
" 'titanic']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sns.get_dataset_names()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>total_bill</th>\n",
" <th>tip</th>\n",
" <th>sex</th>\n",
" <th>smoker</th>\n",
" <th>day</th>\n",
" <th>time</th>\n",
" <th>size</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>16.99</td>\n",
" <td>1.01</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>10.34</td>\n",
" <td>1.66</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>21.01</td>\n",
" <td>3.50</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>23.68</td>\n",
" <td>3.31</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>24.59</td>\n",
" <td>3.61</td>\n",
" <td>Female</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>25.29</td>\n",
" <td>4.71</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>8.77</td>\n",
" <td>2.00</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>26.88</td>\n",
" <td>3.12</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>15.04</td>\n",
" <td>1.96</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>14.78</td>\n",
" <td>3.23</td>\n",
" <td>Male</td>\n",
" <td>No</td>\n",
" <td>Sun</td>\n",
" <td>Dinner</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" total_bill tip sex smoker day time size\n",
"0 16.99 1.01 Female No Sun Dinner 2\n",
"1 10.34 1.66 Male No Sun Dinner 3\n",
"2 21.01 3.50 Male No Sun Dinner 3\n",
"3 23.68 3.31 Male No Sun Dinner 2\n",
"4 24.59 3.61 Female No Sun Dinner 4\n",
"5 25.29 4.71 Male No Sun Dinner 4\n",
"6 8.77 2.00 Male No Sun Dinner 2\n",
"7 26.88 3.12 Male No Sun Dinner 4\n",
"8 15.04 1.96 Male No Sun Dinner 2\n",
"9 14.78 3.23 Male No Sun Dinner 2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = sns.load_dataset('tips')\n",
"df.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Seaborn](http://seaborn.pydata.org/index.html) documentation"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.1 KiB

View File

@ -187,9 +187,9 @@
"metadata": {},
"source": [
"## Comparing\n",
"Your task is to modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
"Your task is modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
"\n",
"Submit your notebook and include a modified code and a comparison of the effects of these changes. \n",
"Submit your notebook and include a the modified code, and a comparison of the effects of these changes. \n",
"\n",
"Discuss your findings."
]
@ -198,7 +198,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional. Optimizing ML hyperparameters\n",
"## Optimizing ML hyperparameters\n",
"\n",
"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously we have used GridSearch from Scikit. Using (sklearn-deap)[[References](#References)], optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
"\n",
@ -206,7 +206,7 @@
"\n",
"Submit a notebook where you include well-crafted conclusions about the exercises, discussing the pros and cons of using genetic algorithms for this purpose.\n",
"\n",
"Note: There is a problem with Scikit version 0.24. Comment on the different approaches."
"Note: There is a problem with the version 0.24 of scikit. Just comment the different approaches."
]
},
{
@ -222,7 +222,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Optional. Optimizing an ML pipeline with a genetic algorithm\n",
"## Optimizing a ML pipeline with a genetic algorithm\n",
"\n",
"The library [TPOT](#References) optimizes ML pipelines and comes with a lot of (examples)[https://epistasislab.github.io/tpot/examples/] and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
"\n",

View File

@ -44,7 +44,7 @@
"First of all, install the Gymnasium library, which is a fork of the OpenAI Gym library:\n",
"\n",
"```console\n",
"foo@bar:~$ pip install gymnasium[classic-control]\n",
"foo@bar:~$ conda install gymnasium\n",
"```\n",
"\n",
"If you get an error 'No module named 'Box2D', install 'pybox2d'.\n"

File diff suppressed because one or more lines are too long

View File

@ -74,6 +74,7 @@
"* [The Python tutorial](https://docs.python.org/3/tutorial/)\n",
"* [Object-Oriented Programming in Python](http://python-textbok.readthedocs.org/en/latest/index.html)\n",
"* [Python3 tutorial](http://www.python-course.eu/python3_course.php)\n",
"* [Python for the Busy Java Developer, Deepak Sarda, 2014](http://antrix.net/static/pages/python-for-java/online/)\n",
"* [Style Guide for Python Code (PEP-0008)](https://www.python.org/dev/peps/pep-0008/)\n",
"* [Python Slides](http://tdc-www.harvard.edu/Python.pdf)\n",
"* [Python for Programmers - 1 day course](http://www.ucs.cam.ac.uk/docs/course-notes/unix-courses/archived/archived-python-courses/PythonProgIntro/files/notes.pdf)\n",

View File

@ -85,7 +85,7 @@
"In Python3, there are the following [numeric types](https://docs.python.org/3/library/stdtypes.html#typesnumeric):\n",
"* integers (int): 1, -1, ...\n",
"* floating point numbers (float): 0.1, 1E2\n",
"* complex numbers (complex): 2 + 3j\n.",
"* complex numbers (complex): 2 + 3j\n",
"Let's play a bit"
]
},

View File

@ -377,7 +377,7 @@
"\n",
"Tuples are faster than lists. Its main usage is when the collection is constant, or you do not want it can be changed (write protected). \n",
"\n",
"Tuples can be converted into lists and vice-versa, with the methods *list()* and *tuple()*."
"Tuples can be converted into lists and vice-versa, with the methods list() and tuple()."
]
},
{

View File

@ -37,7 +37,7 @@
"\n",
"A set object is an unordered collection of distinct objects. There are two built-in set types: **set** (mutable) and **frozenset** (inmutable).\n",
"\n",
"A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is only one builtin mapping type: **dictionary**."
"A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is only one bultin mapping type: **dictionary**."
]
},
{

View File

@ -65,7 +65,7 @@
"Python is a **strongly typed** language and **dynamically typed** language.\n",
"\n",
"This means:\n",
"* **dynamically typed**: variables do not declare a static type (as in Java int a = 2;). Variables have no type themselves, they are just names that hold a reference to some object. The type of the variable is changed dynamically when you change the type of the assigned data object. \n",
"* ** dynamically typed**: variables do not declare a static type (as in Java int a = 2;). Variables have no type themselves, they are just names that hold a reference to some object. The type of the variable is changed dynamically when you change the type of the assigned data object. \n",
"* **strongly typed**: the interpreter tracks variable types. There is no implicit type conversion. This means that all the type variables should be converted manually, preventing from unexpected behaviour. "
]
},

View File

@ -41,7 +41,7 @@
"The first argument of instance class method is self, that refers to the current instance of the class.\n",
"There is a special method, __init__ that initializes the object. It is like a constructor, but the object is already created when __init__ is called.\n",
"\n",
"Instance attributes are define as *self.variables*. (self is the same than this in Java)."
"Instance attributes are define as self.variables. (self is the same than this in Java)."
]
},
{

View File

@ -1,154 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Introduction to Network Analysis\n",
" \n",
"In this session, we are going to get more insight regarding how to analyze and visualize social networks.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Objectives\n",
"\n",
"The main objectives of this session are:\n",
"* Understanding why networks are important in data science\n",
"* Experimenting with network analysis with networkx."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Table of Contents"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"1. [Home](0_Intro_Network_Analysis.ipynb)\n",
"2. [First Steps](1_First_Steps.ipynb)\n",
"3. [Working_with_Graphs](2_Working_with_Graphs.ipynb)\n",
"4. [Network Analysis](3_Network_Analysis.ipynb)\n",
"5. [Social Networks](4_Social_Networks.ipynb)\n",
"6. [Pandas integration](5_Pandas.ipynb)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@ -1,374 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Network Analysis](0_Intro_Network_Analysis.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise: Florentine families"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"G_florentine = nx.florentine_families_graph()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Exercise: Star Wars"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import networkx as nx\n",
"\n",
"# Taken from https://gist.github.com/codingthat/be03565bd97e789a3835b50235ad562f\n",
"# The original dataset is from:\n",
"# Gabasova, E. (2016). Star Wars social network. DOI: https://doi.org/10.5281/zenodo.1411479\n",
"# \n",
"# Simplified by Federico Albanese.\n",
"\n",
"characters = [\"R2-D2\",\n",
" \"CHEWBACCA\",\n",
" \"C-3PO\",\n",
" \"LUKE\",\n",
" \"DARTH VADER\",\n",
" \"CAMIE\",\n",
" \"BIGGS\",\n",
" \"LEIA\",\n",
" \"BERU\",\n",
" \"OWEN\",\n",
" \"OBI-WAN\",\n",
" \"MOTTI\",\n",
" \"TARKIN\",\n",
" \"HAN\",\n",
" \"DODONNA\",\n",
" \"GOLD LEADER\",\n",
" \"WEDGE\",\n",
" \"RED LEADER\",\n",
" \"RED TEN\"]\n",
"\n",
"\n",
"edges = [(\"CHEWBACCA\", \"R2-D2\"),\n",
" (\"C-3PO\",\"R2-D2\"),\n",
" (\"BERU\", \"R2-D2\"),\n",
" (\"LUKE\", \"R2-D2\"),\n",
" (\"OWEN\", \"R2-D2\"),\n",
" (\"OBI-WAN\", \"R2-D2\"),\n",
" (\"LEIA\", \"R2-D2\"),\n",
" (\"BIGGS\", \"R2-D2\"),\n",
" (\"HAN\", \"R2-D2\"),\n",
" (\"CHEWBACCA\", \"OBI-WAN\"),\n",
" (\"C-3PO\", \"CHEWBACCA\"),\n",
" (\"CHEWBACCA\", \"LUKE\"),\n",
" (\"CHEWBACCA\", \"HAN\"),\n",
" (\"CHEWBACCA\", \"LEIA\"),\n",
" (\"CAMIE\", \"LUKE\"),\n",
" (\"BIGGS\", \"CAMIE\"),\n",
" (\"BIGGS\", \"LUKE\"),\n",
" (\"DARTH VADER\", \"LEIA\"),\n",
" (\"BERU\", \"LUKE\"),\n",
" (\"BERU\", \"OWEN\"),\n",
" (\"BERU\", \"C-3PO\"),\n",
" (\"LUKE\", \"OWEN\"),\n",
" (\"C-3PO\", \"LUKE\"),\n",
" (\"C-3PO\", \"OWEN\"),\n",
" (\"C-3PO\", \"LEIA\"),\n",
" (\"LEIA\", \"LUKE\"),\n",
" (\"BERU\", \"LEIA\"),\n",
" (\"LUKE\", \"OBI-WAN\"),\n",
" (\"C-3PO\", \"OBI-WAN\"),\n",
" (\"LEIA\", \"OBI-WAN\"),\n",
" (\"MOTTI\", \"TARKIN\"),\n",
" (\"DARTH VADER\", \"MOTTI\"),\n",
" (\"DARTH VADER\", \"TARKIN\"),\n",
" (\"HAN\", \"OBI-WAN\"),\n",
" (\"HAN\", \"LUKE\"),\n",
" (\"C-3PO\", \"HAN\"),\n",
" (\"LEIA\", \"MOTT\"),\n",
" (\"LEIA\", \"TARKIN\"),\n",
" (\"HAN\", \"LEIA\"),\n",
" (\"DARTH VADER\", \"OBI-WAN\"),\n",
" (\"DODONNA\", \"GOLD LEADER\"),\n",
" (\"DODONNA\", \"WEDGE\"),\n",
" (\"DODONNA\", \"LUKE\"),\n",
" (\"GOLD LEADER\", \"WEDGE\"),\n",
" (\"GOLD LEADER\", \"LUKE\"),\n",
" (\"LUKE\", \"WEDGE\"),\n",
" (\"BIGGS\", \"LEIA\"),\n",
" (\"LEIA\", \"RED LEADER\"),\n",
" (\"LUKE\", \"RED LEADER\"),\n",
" (\"BIGGS\", \"RED LEADER\"),\n",
" (\"BIGGS\", \"C-3PO\"),\n",
" (\"C-3PO\", \"RED LEADER\"),\n",
" (\"RED LEADER\", \"WEDGE\"),\n",
" (\"GOLD LEADER\", \"RED LEADER\"),\n",
" (\"BIGGS\", \"WEDGE\"),\n",
" (\"RED LEADER\", \"RED TEN\"),\n",
" (\"BIGGS\", \"GOLD LEADER\"),\n",
" (\"LUKE\", \"RED TEN\")]\n",
"\n",
"G_starWars = nx.Graph()\n",
"\n",
"\n",
"G_starWars.add_nodes_from(characters)\n",
"G_starWars.add_edges_from(edges)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise\n",
"In this exercise we are going to practice some of the concepts of the session.\n",
"\n",
"Answer the following questions using the object *G_starWars* and *G_florentine*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of nodes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Number of edges"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of nodes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of edges"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Draw the graph\n",
"\n",
"Hint. Use different layouts (circular, spring, ...)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Think of interesting micro, meso and macro metrics"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyze ego networks of interesting nodes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyze communities"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

Binary file not shown.

Before

Width:  |  Height:  |  Size: 3.1 KiB