Compare commits: 419ea57824...master (110 commits)
| Author | SHA1 | Date | |
|---|---|---|---|
| | 9844820e66 | | |
| | d10434362e | | |
| | fb2135cea6 | | |
| | ba6e533e0b | | |
| | 4f5e976918 | | |
| | b58370a19a | | |
| | 5c203b0884 | | |
| | 5bf815f60f | | |
| | 90a3ff098b | | |
| | 945a8a7fb6 | | |
| | 6532ef1b27 | | |
| | 3a73b2b286 | | |
| | 2e4ec3cfdc | | |
| | 21e7ae2f57 | | |
| | 7b4d16964d | | |
| | c5967746ea | | |
| | ed7f0f3e1c | | |
| | 9324516c19 | | |
| | 6fc5565ea0 | | |
| | 1113485833 | | |
| | 0c3f317a85 | | |
| | 0b550c837b | | |
| | d7ce6df7fe | | |
| | e2edae6049 | | |
| | 4ea0146def | | |
| | e7b2cee795 | | |
| | 9e1d0e5534 | | |
| | f82203f371 | | |
| | b9ecccdeab | | |
| | 44a555ac2d | | |
| | ec11ff2d5e | | |
| | ec02125396 | | |
| | b5f1a7dd22 | | |
| | 1cc1e45673 | | |
| | a2ad2c0e92 | | |
| | 1add6a4c8e | | |
| | af78e6480d | | |
| | cae7d8cbb2 | | |
| | f58aa6c0b8 | | |
| | 6e8448f22f | | |
| | 8f2a5c17d8 | | |
| | 36d117e417 | | |
| | 2fc057f6f9 | | |
| | 5b0d4f2a5d | | |
| | 7afa2b3b22 | | |
| | 4e0f9159e8 | | |
| | 82aa552976 | | |
| | 3ebff69cf8 | | |
| | 0f228bbec3 | | |
| | 64c8854741 | | |
| | 3e081e5d83 | | |
| | 065797b886 | | |
| | 8d2f625b7e | | |
| | 26eda30a71 | | |
| | 55365ae927 | | |
| | 152125b3da | | |
| | 97362545ea | | |
| | c49c866a2e | | |
| | 3f7694e330 | | |
| | bf684d6e6e | | |
| | d935b85b26 | | |
| | 1d8e777236 | | |
| | 23ebe2f390 | | |
| | 01eb89ada4 | | |
| | e4fdcd65a1 | | |
| | 9f46c534f7 | | |
| | 743c57691f | | |
| | 2c53b81299 | | |
| | dd6c053109 | | |
| | e35e0a11e9 | | |
| | 7315b681e4 | | |
| | 3fac9c6f78 | | |
| | 21819abeae | | |
| | 0d4c0c706d | | |
| | 8de629b495 | | |
| | 86114b4a56 | | |
| | 1a3f618995 | | |
| | a1121c03a5 | | |
| | 715d0cb77f | | |
| | 0150ce7cf7 | | |
| | 08dfe5c147 | | |
| | 78e62af098 | | |
| | 3f5eba3e84 | | |
| | 2de1cda8f1 | | |
| | cc442c35f3 | | |
| | 1100c352fa | | |
| | 9b573d292d | | |
| | dd8a4f50d8 | | |
| | 47148f2ccc | | |
| | 8ffda8123a | | |
| | 6629837e7d | | |
| | ba08a9a264 | | |
| | 4b8fd30f42 | | |
| | d879369930 | | |
| | 4da01f3ae6 | | |
| | da9a01e26b | | |
| | dc23b178d7 | | |
| | 5410d6115d | | |
| | 6749aa5deb | | |
| | c31e6c1676 | | |
| | 1c7496c8ac | | |
| | 35b1ae4ec8 | | |
| | 58fc6f5e9c | | |
| | 91147becee | | |
| | 1530995243 | | |
| | 0c0960cec7 | | |
| | 3363c953f4 | | |
| | 542ce2708d | | |
| | 380340d66d | | |
| | 7f49f8990b | | |
@@ -1,7 +1,7 @@
|
||||
# sitc
|
||||
Exercises for the Intelligent Systems course at Universidad Politécnica de Madrid, Telecommunication Engineering School. This material is used in the following subjects:
|
||||
- SITC (Sistemas Inteligentes y Tecnologías del Conocimiento) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
|
||||
- TIAD (Tecnologías Inteligentes de Análisis de Datos) - Master Universitario en Ingeniería de Redes y Servicios Telemáticos
|
||||
- CDAW (Ciencia de datos y aprendizaje en automático en la web de datos) - Master Universitario de Ingeniería de Telecomunicación (MUIT)
|
||||
- ABID (Analítica de Big Data) - Master Universitario en Ingeniería de Redes y Servicios Telemáticos
|
||||
|
||||
To follow this course:
|
||||
- Follow the instructions to install the environment: https://github.com/gsi-upm/sitc/blob/master/python/1_1_Notebooks.ipynb (Just install 'conda')
|
||||
@@ -9,11 +9,13 @@ For following this course:
|
||||
- Run in a terminal in the folder sitc: jupyter notebook (and enjoy)
|
||||
|
||||
Topics
|
||||
* Python: quick introduction to Python
|
||||
* Python: a quick introduction to Python
|
||||
* ML-1: introduction to machine learning with scikit-learn
|
||||
* ML-2: introduction to machine learning with pandas and scikit-learn
|
||||
* ML-21: preprocessing and visualization
|
||||
* ML-3: introduction to machine learning. Neural Computing
|
||||
* ML-4: introduction to Evolutionary Computing
|
||||
* ML-5: introduction to Reinforcement Learning
|
||||
* NLP: introduction to NLP
|
||||
* LOD: Linked Open Data, exercises and example code
|
||||
* SNA: Social Network Analysis
|
||||
|
BIN images/EscUpmPolit_p.gif | Normal file | Size: 3.1 KiB
BIN images/cart.png | Normal file | Size: 95 KiB
BIN images/data-chart-type.png | Normal file | Size: 34 KiB
BIN images/frozenlake-problem.png | Normal file | Size: 54 KiB
BIN images/frozenlake-world.png | Normal file | Size: 67 KiB
BIN images/gym-maze.gif | Normal file | Size: 222 KiB
BIN images/iris-classes.png | Normal file | Size: 1.4 MiB
BIN images/iris-dataset.jpg | Normal file | Size: 44 KiB
BIN images/iris-features.png | Normal file | Size: 944 KiB
BIN images/machine-learning-process.jpg | Normal file | Size: 237 KiB
BIN images/multilayerperceptron_network.png | Normal file | Size: 87 KiB
BIN images/plot_ML_flow_chart_1.png | Normal file | Size: 56 KiB
BIN images/plot_ML_flow_chart_2.png | Normal file | Size: 87 KiB
BIN images/plot_ML_flow_chart_3.png | Normal file | Size: 58 KiB
BIN images/qlearning-algo.png | Normal file | Size: 85 KiB
BIN images/recording.gif | Normal file | Size: 1.8 MiB
BIN images/titanic.jpg | Normal file | Size: 152 KiB
4352
lod/BeatlesMusicians.ttl
Normal file
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -71,7 +71,6 @@
|
||||
"source": [
|
||||
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
|
||||
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Markham\n",
|
||||
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
|
||||
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
|
||||
]
|
||||
},
|
||||
@@ -80,7 +79,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -40,10 +40,10 @@
|
||||
"\n",
|
||||
"* Learn to use scikit-learn\n",
|
||||
"* Learn the basic steps to apply machine learning techniques: dataset analysis, load, preprocessing, training, validation, optimization and persistence.\n",
|
||||
"* Learn how to do a exploratory data analysis\n",
|
||||
"* Learn how to do an exploratory data analysis\n",
|
||||
"* Learn how to visualise a dataset\n",
|
||||
"* Learn how to load a bundled dataset\n",
|
||||
"* Learn how to separate the dataset into traning and testing datasets\n",
|
||||
"* Learn how to separate the dataset into training and testing datasets\n",
|
||||
"* Learn how to train a classifier\n",
|
||||
"* Learn how to predict with a trained classifier\n",
|
||||
"* Learn how to evaluate the predictions\n",
|
||||
@@ -63,9 +63,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [Scikit-learn web page](http://scikit-learn.org/stable/)\n",
|
||||
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Markham\n",
|
||||
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
|
||||
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019."
|
||||
"* [Scikit-learn videos](http://blog.kaggle.com/author/kevin-markham/) and [notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Markham\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -73,7 +71,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -87,7 +87,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Scikit-learn provides algorithms for solving the following problems:\n",
|
||||
"* **Classification**: Identifying to which category an object belongs to. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n",
|
||||
"* **Classification**: Identifying to which category an object belongs. Some of the available [classification algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are decision trees (ID3, C4.5, ...), kNN, SVM, Random forest, Perceptron, etc. \n",
|
||||
"* **Clustering**: Automatic grouping of similar objects into sets. Some of the available [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html#clustering) are k-Means, Affinity propagation, etc.\n",
|
||||
"* **Regression**: Predicting a continuous-valued attribute associated with an object. Some of the available [regression algorithms](http://scikit-learn.org/stable/supervised_learning.html#supervised-learning) are linear regression, logistic regression, etc.\n",
|
||||
"* **Dimensionality reduction**: Reducing the number of random variables to consider. Some of the available [dimensionality reduction algorithms](http://scikit-learn.org/stable/modules/decomposition.html#decompositions) are SVD, PCA, etc."
|
||||
@@ -105,7 +105,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In addition, scikit-learn helps in several tasks:\n",
|
||||
"* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some of the [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n",
|
||||
"* **Model selection**: Comparing, validating, choosing parameters and models, and persisting models. Some [available functionalities](http://scikit-learn.org/stable/model_selection.html#model-selection) are cross-validation or grid search for optimizing the parameters. \n",
|
||||
"* **Preprocessing**: Several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators. Some of the available [preprocessing functions](http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) are scaling and normalizing data, or imputing missing values."
|
||||
]
|
||||
},
|
||||
@@ -128,9 +128,9 @@
|
||||
"\n",
|
||||
"If it is not installed, install it with conda: `conda install scikit-learn`.\n",
|
||||
"\n",
|
||||
"If you have installed scipy and numpy, you can also installed using pip: `pip install -U scikit-learn`.\n",
|
||||
"If you have installed scipy and numpy, you can also install using pip: `pip install -U scikit-learn`.\n",
|
||||
"\n",
|
||||
"It is not recommended to use pip for installing scipy and numpy. Instead, use conda or install the linux package *python-sklearn*."
|
||||
"It is not recommended to use pip to install scipy and numpy. Instead, use conda or install the Linux package *python-sklearn*."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -156,7 +156,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Course Notes for Learning Intelligent Systems\n",
|
||||
"\n",
|
||||
@@ -34,11 +34,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The goal of this notebook is to learn how to read and load a sample dataset.\n",
|
||||
"This notebook shows how to read and load a sample dataset.\n",
|
||||
"\n",
|
||||
"Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, boston, etc.\n",
|
||||
"\n",
|
||||
"In this notebook we are going to use the Iris dataset."
|
||||
"In this notebook, we will use the Iris dataset."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -54,16 +54,25 @@
|
||||
"source": [
|
||||
"The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
|
||||
"\n",
|
||||
"The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
|
||||
"The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
|
||||
"\n",
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In order to read the dataset, we import the datasets bundle and then load the Iris dataset. "
|
||||
"Here you can see the species and the features.\n",
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"To read the dataset, we import the datasets bundle and then load the Iris dataset. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -180,7 +189,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Using numpy, I can print the dimensions (here we are working with 2D matriz)\n",
|
||||
"#Using numpy, I can print the dimensions (here we are working with a 2D matrix)\n",
|
||||
"print(iris.data.ndim)"
|
||||
]
|
||||
},
|
||||
@@ -218,7 +227,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In following sessions we will learn how to load a dataset from a file (csv, excel, ...) using the pandas library."
|
||||
"In the following sessions, we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -246,7 +255,7 @@
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -49,7 +49,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The goal of this notebook is to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions."
|
||||
"This notebook shows how to analyse a dataset. Other tasks, such as cleaning or munging (changing the format of) the dataset, are covered in other sessions."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -65,13 +65,13 @@
|
||||
"source": [
|
||||
"This section covers different ways to inspect the distribution of samples per feature.\n",
|
||||
"\n",
|
||||
"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
|
||||
"First of all, let's see how many samples we have in each class using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
|
||||
"\n",
|
||||
"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n",
|
||||
"A histogram is a graphical representation of the distribution of numerical data. It estimates the probability distribution of a continuous variable (quantitative variable). \n",
|
||||
"\n",
|
||||
"For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
|
||||
"For building a histogram, we need to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
|
||||
"\n",
|
||||
"In our case, since the values are not continuous and we have only three values, we do not need to bin them."
|
||||
"Since the values are not continuous and we have only three values, we do not need to bin them."
|
||||
]
|
||||
},
|
||||
{
|
||||
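Note on the hunk above: the binning step it describes can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook's own plotting code: `bin_counts` is a hypothetical helper, and the `lengths` values are made up. The class labels follow the 50-samples-per-class layout the Iris description gives, which is why a plain count per label is enough there.

```python
from collections import Counter


def bin_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == high lands in the last bin.
        idx = min(int((v - low) / width), n_bins - 1)
        counts[idx] += 1
    return counts


# Hypothetical continuous measurements (not taken from the Iris data).
lengths = [1.0, 1.5, 2.0, 4.5, 5.0, 5.5, 6.0]
continuous_hist = bin_counts(lengths, low=1.0, high=6.0, n_bins=5)

# Discrete class labels: counting each distinct value is enough, no bins needed.
labels = [0] * 50 + [1] * 50 + [2] * 50
class_hist = Counter(labels)

print(continuous_hist)
print(dict(class_hist))
```

A plotting library would draw one bar per entry of `continuous_hist` or `class_hist`; the counting itself is all a histogram needs.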
@@ -115,7 +115,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As can be seen, we have the same distribution of samples for every class.\n",
|
||||
"The next step is to see the distribution of the features"
|
||||
"The next step is to see the distribution of the features."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -184,7 +184,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"As we can see, the Setosa class seems to be linearly separable with these two features.\n",
|
||||
"As we can see, the Setosa class seems linearly separable with these two features.\n",
|
||||
"\n",
|
||||
"Another nice visualisation is given below."
|
||||
]
|
||||
@@ -228,7 +228,6 @@
|
||||
"source": [
|
||||
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
|
||||
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
|
||||
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
|
||||
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
|
||||
"* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
|
||||
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
|
||||
@@ -242,7 +241,7 @@
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -52,11 +52,11 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In the previous notebook we developed plots with the [matplotlib](http://matplotlib.org/) plotting library.\n",
|
||||
"In the previous notebook, we developed plots with the [matplotlib](http://matplotlib.org/) plotting library.\n",
|
||||
"\n",
|
||||
"This notebook introduces another plotting library, [**seaborn**](https://stanford.edu/~mwaskom/software/seaborn/), which provides advanced facilities for data visualization.\n",
|
||||
"\n",
|
||||
"*Seaborn* is a library for making attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
|
||||
"*Seaborn* is a library that makes attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
|
||||
"\n",
|
||||
"*Seaborn* requires its input to be *DataFrames* (a structure created with the library *pandas*)."
|
||||
]
|
||||
@@ -197,9 +197,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A very common way to use this plot colors the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
|
||||
"A common way to use this plot is to color the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
|
||||
"\n",
|
||||
"We are going to color each class, so that we can easily identify **clustering** and **linear relationships**."
|
||||
"We are going to color each class, so we can easily identify **clustering** and **linear relationships**."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -220,7 +220,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"By default every numeric column in the dataset is used, but you can focus on particular relationships if you want."
|
||||
"By default, every numeric column in the dataset is used, but you can focus on particular relationships if you want."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -321,7 +321,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# One way we can extend this plot is adding a layer of individual points on top of\n",
|
||||
"# One way we can extend this plot is by adding a layer of individual points on top of\n",
|
||||
"# it through Seaborn's stripplot\n",
|
||||
"# \n",
|
||||
"# We'll use jitter=True so that all the points don't fall in single vertical lines\n",
|
||||
@@ -347,7 +347,7 @@
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# A violin plot combines the benefits of the previous two plots and simplifies them\n",
|
||||
"# Denser regions of the data are fatter, and sparser thiner in a violin plot\n",
|
||||
"# Denser regions of the data are fatter, and sparser thinner in a violin plot\n",
|
||||
"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, size=6)"
|
||||
]
|
||||
},
|
||||
@@ -389,10 +389,10 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Depending on the data, we can choose which visualisation suits better. the following [diagram](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/) guides this selection.\n",
|
||||
"Depending on the data, we can choose which visualisation suits us better. The following [diagram](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/) guides this selection.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -408,7 +408,6 @@
|
||||
"source": [
|
||||
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
|
||||
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
|
||||
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
|
||||
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
|
||||
"* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
|
||||
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
|
||||
@@ -422,7 +421,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -76,7 +76,7 @@
|
||||
"source": [
|
||||
"A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
|
||||
"\n",
|
||||
"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
|
||||
"We will use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
|
||||
]
|
||||
},
|
||||
{
|
||||
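Note on the hunk above: the effect of a seeded 75%/25% split can be sketched in plain Python. This is a simplified stand-in, not scikit-learn's implementation. The notebook itself relies on scikit-learn (presumably `train_test_split` with a `random_state`), whereas `split_train_test`, the seed value 33, and the toy data below are assumptions made for illustration.

```python
import random


def split_train_test(samples, labels, test_ratio=0.25, seed=33):
    """Shuffle indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(samples)))
    # A fixed seed gives the same shuffle, and so the same split, on every run.
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(samples) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(samples, train_idx), pick(samples, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Toy data: 8 samples with 2 features each, and alternating labels.
X = [[float(i), float(i) * 2] for i in range(8)]
y = [i % 2 for i in range(8)]

x_train, x_test, y_train, y_test = split_train_test(X, y)
print(len(x_train), len(x_test))
```

Calling the function again with the same seed returns the same four lists; changing or omitting the seed is what would make results differ between runs.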
@@ -122,9 +122,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
|
||||
"Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might misbehave if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
|
||||
"\n",
|
||||
"The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set."
|
||||
"The preprocessing module further provides a utility class `StandardScaler` to compute a training set's mean and standard deviation. Later, the same transformation will be applied on the testing set."
|
||||
]
|
||||
},
|
||||
{
|
||||
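Note on the hunk above: the key point is that the mean and standard deviation come from the training set only and are then reused on the testing set. The sketch below shows that idea in plain Python. It is not the `StandardScaler` API; `fit_scaler`, `transform`, and the toy rows are hypothetical names and values chosen for illustration.

```python
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-feature mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Shift and scale each feature with previously fitted statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

means, stds = fit_scaler(train)               # statistics from training data only
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)    # same transformation, no refit
print(train_scaled)
print(test_scaled)
```

After the transformation each training feature has zero mean and unit variance, while the test features are merely expressed on the same scale and need not have those properties.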
@@ -163,7 +163,6 @@
|
||||
"source": [
|
||||
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
|
||||
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
|
||||
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
|
||||
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
|
||||
"* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
|
||||
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
|
||||
@@ -174,7 +173,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Licences\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -53,9 +53,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This is an introduction of general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n",
|
||||
"This is an introduction to general ideas about machine learning and the interface of scikit-learn, taken from the [scikit-learn tutorial](http://www.astroml.org/sklearn_tutorial/general_concepts.html). \n",
|
||||
"\n",
|
||||
"You can skip it during the lab session and read it later,"
|
||||
"You can skip it during the lab session and read it later."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -69,21 +69,21 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Machine learning algorithms are programs that learn a model from a dataset with the aim of making predictions or learning structures to organize the data.\n",
|
||||
"Machine learning algorithms are programs that learn a model from a dataset to make predictions or learn structures to organize the data.\n",
|
||||
"\n",
|
||||
"In scikit-learn, machine learning algorithms take as an input a *numpy* array (n_samples, n_features), where\n",
|
||||
"* **n_samples**: number of samples. Each sample is an item to process (i.e. classify). A sample can be a document, a picture, a sound, a video, a row in database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
|
||||
"* **n_features**: The number of features or distinct traits that can be used to describe each item in a quantitative manner.\n",
|
||||
"In scikit-learn, machine learning algorithms take as input a *numpy* array (n_samples, n_features), where\n",
|
||||
"* **n_samples**: number of samples. Each sample is an item to process (e.g., classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
|
||||
"* **n_features**: The number of features or distinct traits that can be used to describe each item quantitatively.\n",
|
||||
"\n",
|
||||
"The number of features should be defined in advance. There is a specific type of feature sets that are high dimensional (e.g. millions of features), but most of the values are zero for a given sample. Using (numpy) arrays, all those values that are zero would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
|
||||
"The number of features should be defined in advance. A specific type of feature set is high-dimensional (e.g., millions of features), but most values are zero for a given sample. Using (numpy) arrays, all those zero values would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
|
||||
"\n",
|
||||
"The first step in machine learning is **identifying the relevant features** from the input data, and the second step is **extracting the features** from the input data. \n",
|
||||
"\n",
|
||||
"[Machine learning algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/) can be classified according to learning style into:\n",
|
||||
"* **Supervised learning**: input data (training dataset) has a known label or result. Example problems are classification and regression. A model is prepared through a training process where it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves a desired level of accuracy on the training data.\n",
|
||||
"* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction and association rule learning.\n",
|
||||
"* **Semi-supervised learning**:i nput data is a mixture of labeled and unlabeled examples. There is a desired prediction problem but the model must learn the structures to organize the data as well as make predictions. Example problems are classification and regression."
|
||||
]
|
||||
"* **Unsupervised learning**: input data is not labeled. A model is prepared by deducing structures present in the input data. This may be to extract general rules. Example problems are clustering, dimensionality reduction, and association rule learning.\n",
|
||||
"* **Semi-supervised learning**: input data is a mixture of labeled and unlabeled examples. There is a desired prediction problem, but the model must learn the structures to organize the data and make predictions. Example problems are classification and regression."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
@@ -96,8 +96,8 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In *supervised machine learning models*, the machine learning algorithm takes as an input a training dataset, composed of feature vectors and labels, and produces a predictive model which is used for make prediction on new data.\n",
|
||||
""
|
||||
"In *supervised machine learning models*, the machine learning algorithm takes as input a training dataset, composed of feature vectors and labels, and produces a predictive model used to make predictions on new data.\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -111,7 +111,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In *unsupervised machine learning models*, the machine learning model algorithm takes as an input the feature vectors and produces a predictive model that is used to fit its parameters so as to best summarize regularities found in the data.\n",
|
||||
"In *unsupervised machine learning models*, the machine learning algorithm takes as input the feature vectors and produces a model whose parameters are fitted to best summarize the regularities found in the data.\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
@@ -129,15 +129,15 @@
|
||||
"scikit-learn has a uniform interface for all the estimators, although some methods are only available if the estimator is supervised or unsupervised:\n",
|
||||
"\n",
|
||||
"* Available in *all estimators*:\n",
|
||||
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
|
||||
" * **model.fit()**: fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g., model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).\n",
|
||||
"\n",
|
||||
"* Available in *supervised estimators*:\n",
|
||||
" * **model.predict()**: given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.\n",
|
||||
" * **model.predict()**: given a trained model, predict the label of a new dataset. This method accepts one argument, the new data X_new (e.g., model.predict(X_new)), and returns the learned label for each object in the array.\n",
|
||||
" * **model.predict_proba()**: For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().\n",
|
||||
"\n",
|
||||
"* Available in *unsupervised estimators*:\n",
|
||||
" * **model.transform()**: given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.\n",
|
||||
" * **model.fit_transform()**: some estimators implement this method, which performs a fit and a transform on the same input data.\n",
|
||||
" * **model.fit_transform()**: Some estimators implement this method, which performs a fit and a transform on the same input data.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
""
|
||||
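The estimator interface listed in the hunk above can be sketched as follows. This is a minimal illustration, not part of the commit: the iris dataset, `KNeighborsClassifier`, and `PCA` are choices made here for the example, and `X_new` is simply the first three rows of the data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Assumption: iris is used only because it is small and ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_new = X[:3]

# Supervised estimator: fit(X, y), then predict / predict_proba on new data.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
labels = clf.predict(X_new)        # one label per row of X_new
probs = clf.predict_proba(X_new)   # one probability per class; each row sums to 1

# Unsupervised estimator: fit(X) takes no labels; transform maps data to the new basis.
pca = PCA(n_components=2)
pca.fit(X)
X_new_2d = pca.transform(X_new)
X_all_2d = pca.fit_transform(X)    # fit and transform on the same input data
```

Both objects expose `fit`, while `predict` and `transform` appear only where they make sense for the estimator.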
@@ -154,7 +154,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [General concepts of machine learning with scikit-learn](https://ogrisel.github.io/scikit-learn.org/sklearn-tutorial/auto_examples/tutorial/plot_ML_flow_chart.html)\n",
|
||||
"* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)\n",
|
||||
"* [A Tour of Machine Learning Algorithms](http://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/)"
|
||||
]
|
||||
},
|
||||
@@ -169,7 +169,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -55,7 +55,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n",
|
||||
"The goal of this notebook is to learn how to train a model, make predictions with that model, and evaluate these predictions.\n",
|
||||
"\n",
|
||||
"The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."
|
||||
]
|
||||
@@ -212,14 +212,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Precision, recall and f-score"
|
||||
"### Precision, recall, and f-score"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",
|
||||
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
|
||||
"\n",
|
||||
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
|
||||
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
|
||||
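As a small worked example of the metrics described in the hunk above, the sketch below computes precision, recall, and F1 by hand and with `sklearn.metrics`. The label vectors are hypothetical values made up for illustration, and F1 is taken as the harmonic mean of precision and recall.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive, 0 = negative), made up for illustration.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# Count true positives, false positives, and false negatives by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)   # how right the classifier is when it says "positive"
recall = tp / (tp + fn)      # how many of the real positives it finds
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn computes the same quantities.
sk_precision = precision_score(y_true, y_pred)
sk_recall = recall_score(y_true, y_pred)
sk_f1 = f1_score(y_true, y_pred)
```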
@@ -246,7 +246,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Another useful metric is the confusion matrix"
|
||||
"Another useful metric is the confusion matrix."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -262,7 +262,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We see we classify well all the 'setosa' and 'versicolor' samples. "
|
||||
"We classify all the 'setosa' and 'versicolor' samples well. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -276,7 +276,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
|
||||
"To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -298,7 +298,7 @@
|
||||
"# create a k-fold cross validation iterator of k=10 folds\n",
|
||||
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
||||
"\n",
|
||||
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
||||
"# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
|
||||
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
||||
"print(scores)"
|
||||
]
|
||||
@@ -307,7 +307,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"
|
||||
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
|
||||
]
|
||||
},
|
||||
{
|
||||
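The step described in the hunk above, reducing the k scores to a mean and a standard error, might look like the sketch below. It reuses the `KFold` settings shown in the diff; the choice of `KNeighborsClassifier(n_neighbors=15)` and of loading iris directly are assumptions for the example.

```python
import numpy as np
from scipy.stats import sem
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

x_iris, y_iris = load_iris(return_X_y=True)

# Assumption: any classifier works here; kNN is used because the notebook is about kNN.
model = KNeighborsClassifier(n_neighbors=15)

# Same iterator as in the notebook: 10 shuffled folds with a fixed seed.
cv = KFold(10, shuffle=True, random_state=33)
scores = cross_val_score(model, x_iris, y_iris, cv=cv)

# Summarize the 10 fold scores as mean +/- standard error of the mean.
mean_score = np.mean(scores)
std_error = sem(scores)
print("Mean score: {0:.3f} (+/- {1:.3f})".format(mean_score, std_error))
```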
@@ -340,7 +340,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We are going to tune the algorithm, and calculate which is the best value for the k hyperparameter."
|
||||
"We will tune the algorithm and calculate the best value for the k hyperparameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -365,7 +365,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k."
|
||||
"The result is very dependent on the input data. Execute the train_test_split again and test how the result changes with k."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -379,8 +379,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
|
||||
"* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
|
||||
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -388,7 +387,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -56,9 +56,9 @@
|
||||
"source": [
|
||||
"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
|
||||
"\n",
|
||||
"There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
|
||||
"There are several well-known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0, and CART. scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
|
||||
"\n",
|
||||
"This notebook will follow the same steps that the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n",
|
||||
"This notebook will follow the same steps as the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and detail some peculiarities of the decision tree algorithms.\n",
|
||||
"\n",
|
||||
"You need to install pydotplus: `conda install pydotplus` for the visualization."
|
||||
]
|
||||
@@ -69,7 +69,7 @@
|
||||
"source": [
|
||||
"## Load data and preprocessing\n",
|
||||
"\n",
|
||||
"Here we repeat the same operations for loading data and preprocessing than in the previous notebooks."
|
||||
"Here we repeat the same operations for loading data and preprocessing as in the previous notebooks."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -262,8 +262,8 @@
|
||||
"The current version of pydot does not work well in Python 3.\n",
|
||||
"For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n",
|
||||
"\n",
|
||||
"You can skip this example. Since it can require installing additional packages, we include here the result.\n",
|
||||
""
|
||||
"You can skip this example. Since it can require installing additional packages, we have included the result here.\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -330,7 +330,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Next we are going to export the pseudocode of the the learnt decision tree."
|
||||
"Next, we will export the pseudocode of the learnt decision tree."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -378,14 +378,14 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Precision, recall and f-score"
|
||||
"### Precision, recall, and f-score"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",
|
||||
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall, and F1-score\n",
|
||||
"\n",
|
||||
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
|
||||
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
|
||||
@@ -412,7 +412,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Another useful metric is the confusion matrix"
|
||||
"Another useful metric is the confusion matrix."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -428,7 +428,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We see we classify well all the 'setosa' and 'versicolor' samples. "
|
||||
"We classify all the 'setosa' and 'versicolor' samples well. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -442,7 +442,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n",
|
||||
"To avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n",
|
||||
"\n",
|
||||
"Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split."
|
||||
]
|
||||
@@ -466,7 +466,7 @@
|
||||
"# create a k-fold cross validation iterator of k=10 folds\n",
|
||||
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
||||
"\n",
|
||||
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
||||
"# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
|
||||
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
||||
"print(scores)"
|
||||
]
|
||||
@@ -475,7 +475,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"
|
||||
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -509,8 +509,6 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
|
||||
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
|
||||
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019.\n",
|
||||
"* [Parameter estimation using grid search with cross-validation](https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html)\n",
|
||||
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
|
||||
]
|
||||
@@ -520,7 +518,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -58,7 +58,7 @@
|
||||
"source": [
|
||||
"In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 0.947. Could we get a better accuracy if we tune the hyperparameters of the estimator?\n",
|
||||
"\n",
|
||||
"The goal of this notebook is to learn how to tune an algorithm by opimizing its hyperparameters using grid search."
|
||||
"This notebook shows how to tune an algorithm by optimizing its hyperparameters using grid search."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -137,7 +137,7 @@
|
||||
"# create a k-fold cross validation iterator of k=10 folds\n",
|
||||
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
||||
"\n",
|
||||
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
||||
"# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
|
||||
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
||||
"\n",
|
||||
"from scipy.stats import sem\n",
|
||||
@@ -189,7 +189,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can get the list of parameters of the model. As you will observe, the parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax. We will use this for tuning the parameters."
|
||||
"We can get the list of model parameters. As you will observe, the parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax. We will use this for tuning the parameters."
|
||||
]
|
||||
},
|
||||
{
|
||||
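The `<estimator>__<parameter>` syntax mentioned in the hunk above can be illustrated with a short sketch. The step names `scaler` and `ds` are hypothetical labels chosen for this example; the notebook's actual pipeline may use different ones.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical pipeline: the step names 'scaler' and 'ds' are made up for this sketch.
model = Pipeline([
    ('scaler', StandardScaler()),
    ('ds', DecisionTreeClassifier(random_state=33)),
])

# get_params() lists nested parameters as <step name>__<parameter name>.
params = model.get_params()
tree_params = sorted(name for name in params if name.startswith('ds__'))

# The same naming scheme is used to change a parameter of one step.
model.set_params(ds__max_depth=3)
```

The same double-underscore names are the keys a parameter grid would use when tuning a pipeline.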
@@ -205,7 +205,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Let's see what happens if we change a parameter"
|
||||
"Let's see what happens if we change a parameter."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -284,7 +284,7 @@
|
||||
"\n",
|
||||
"Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to better understand the algorithm, as well as which parameters can be tuned. As you see, we can change several of them, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n",
|
||||
"\n",
|
||||
"We can get the full list parameters of an estimator with the method *get_params()*. "
|
||||
"We can get an estimator's full list of parameters with the method *get_params()*. "
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -314,16 +314,16 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Changing manually the hyperparameters to find their optimal values is not practical. Instead, we can consider to find the optimal value of the hyperparameters as an *optimization problem*. \n",
|
||||
"Manually changing the hyperparameters to find their optimal values is not practical. Instead, we can consider finding the optimal value of the hyperparameters as an *optimization problem*. \n",
|
||||
"\n",
|
||||
"The sklearn comes with several optimization techniques for this purpose, such as **grid search** and **randomized search**. In this notebook we are going to introduce the former one."
|
||||
"Sklearn has several optimization techniques, such as **grid search** and **randomized search**. In this notebook, we are going to introduce the former one."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The sklearn provides an object that, given data, computes the score during the fit of an estimator on a hyperparameter grid and chooses the hyperparameters to maximize the cross-validation score. "
|
||||
"Sklearn provides an object that, given data, computes the score during the fit of an estimator on a hyperparameter grid and chooses the hyperparameters to maximize the cross-validation score. "
|
||||
]
|
||||
},
|
||||
{
|
||||
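A minimal sketch of the grid search described in the hunk above, assuming `GridSearchCV` is the object the text refers to. The parameter grid below is a hypothetical example, not the grid used in the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

x_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical grid: 4 depths x 2 criteria = 8 candidate combinations.
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'criterion': ['gini', 'entropy'],
}

# Same cross-validation iterator as elsewhere in the notebooks.
cv = KFold(10, shuffle=True, random_state=33)

# Fit every combination with cross-validation and keep the best-scoring one.
gs = GridSearchCV(DecisionTreeClassifier(random_state=33), param_grid, cv=cv)
gs.fit(x_iris, y_iris)

print("Best parameters:", gs.best_params_)
print("Best cross-validation score: {0:.3f}".format(gs.best_score_))
```

After fitting, `gs.best_estimator_` holds a model refitted on the whole dataset with the selected hyperparameters, and `gs.cv_results_` holds the per-combination scores.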
@@ -351,7 +351,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we are going to show the results of grid search"
|
||||
"Now we are going to show the results of the grid search."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -392,7 +392,7 @@
|
||||
"# create a k-fold cross validation iterator of k=10 folds\n",
|
||||
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
||||
"\n",
|
||||
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
||||
"# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
|
||||
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
||||
"def mean_score(scores):\n",
|
||||
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
|
||||
@@ -405,7 +405,7 @@
|
||||
"source": [
|
||||
"We have got an *improvement* from 0.947 to 0.953 with k-fold.\n",
|
||||
"\n",
|
||||
"We are now to try to fit the best combination of the hyperparameters of the algorithm. It can take some time to compute it."
|
||||
"We will now try to find the best combination of the hyperparameters of the algorithm. It can take some time to compute it."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -492,7 +492,7 @@
|
||||
"# create a k-fold cross validation iterator of k=10 folds\n",
|
||||
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
||||
"\n",
|
||||
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
||||
"# by default the score used is the one returned by the score method of the estimator (accuracy)\n",
|
||||
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
||||
"def mean_score(scores):\n",
|
||||
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
|
||||
@@ -518,8 +518,6 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [Plot the decision surface of a decision tree on the iris dataset](https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html)\n",
|
||||
"* [scikit-learn : Machine Learning Simplified](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2017.\n",
|
||||
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka, Packt Publishing, 2019.\n",
|
||||
"* [Hyperparameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
|
||||
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
|
||||
]
|
||||
@@ -535,7 +533,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -48,9 +48,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The goal of this notebook is to learn how to save a model in the the scikit by using Python’s built-in persistence model, namely pickle\n",
|
||||
"The goal of this notebook is to learn how to save a scikit-learn model using Python’s built-in persistence module, pickle.\n",
|
||||
"\n",
|
||||
"First we recap the previous tasks: load data, preprocess and train the model."
|
||||
"First, we recap the previous tasks: load data, preprocess, and train the model."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -107,7 +107,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"A more efficient alternative to pickle is joblib, especially for big data problems. In this case the model can only be saved to a file and not to a string."
|
||||
"A more efficient alternative to pickle is joblib, especially for big data problems. In this case, the model can only be saved to a file and not to a string."
|
||||
]
|
||||
},
|
||||
{
|
||||
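The two persistence options discussed in the hunk above can be sketched as follows. The decision tree model and the temporary file path are assumptions for the example; the notebook's own model and file name may differ.

```python
import os
import pickle
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

x_iris, y_iris = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=33).fit(x_iris, y_iris)

# pickle: the model can be serialized to an in-memory bytes string.
serialized = pickle.dumps(model)
model_from_string = pickle.loads(serialized)

# joblib: the model is written to and read from a file.
# Assumption: a temporary directory is used so the sketch leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "model.joblib")
joblib.dump(model, path)
model_from_file = joblib.load(path)

# Both restored models should predict exactly like the original.
same_pickle = (model_from_string.predict(x_iris) == model.predict(x_iris)).all()
same_joblib = (model_from_file.predict(x_iris) == model.predict(x_iris)).all()
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that saved them.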
@@ -146,7 +146,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -52,7 +52,7 @@
|
||||
"\n",
|
||||
"Particularly in high-dimensional spaces, data can more easily be separated linearly and the simplicity of classifiers such as naive Bayes and linear SVMs might lead to better generalization than is achieved by other classifiers.\n",
|
||||
"\n",
|
||||
"The plots show training points in solid colors and testing points semi-transparent. The lower right shows the classification accuracy on the test set.\n",
|
||||
"The plots show training points in solid colors and testing points in semi-transparent colors. The lower right shows the classification accuracy on the test set.\n",
|
||||
"\n",
|
||||
"The [DummyClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html#sklearn.dummy.DummyClassifier) is a classifier that makes predictions using simple rules. It is useful as a simple baseline to compare with other (real) classifiers. \n",
|
||||
"\n",
|
||||
@@ -94,7 +94,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
BIN
ml1/images/iris-classes.png
Normal file
After Width: | Height: | Size: 1.4 MiB |
BIN
ml1/images/iris-features.png
Normal file
After Width: | Height: | Size: 944 KiB |
@@ -47,7 +47,7 @@ def get_code(tree, feature_names, target_names,
|
||||
|
||||
recurse(left, right, threshold, features, 0, 0)
|
||||
|
||||
# Taken from http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#example-tree-plot-iris-py
|
||||
# Taken from https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
@@ -114,4 +114,4 @@ def plot_tree_iris():
|
||||
|
||||
plt.suptitle("Decision surface of a decision tree using paired features")
|
||||
plt.legend()
|
||||
plt.show()
|
||||
plt.show()
|
||||
|
@@ -74,9 +74,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [IPython Notebook Tutorial for Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
|
||||
"* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Marham\n",
|
||||
"* [Hands-On Machine Learning with scikit-learn and Scientific Python Toolkits](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781838826048/), Tarek Amr, Packt Publishing, 2020.\n",
|
||||
"* [Python Machine Learning](https://learning.oreilly.com/library/view/python-machine-learning/9781789955750/), Sebastian Raschka and Vahid Mirjalili, Packt Publishing, 2019."
|
||||
"* [Scikit-learn videos and notebooks](https://github.com/justmarkham/scikit-learn-videos) by Kevin Markham\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -50,30 +50,30 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
|
||||
"In this session, we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The main objective is predicting which passengers survived the sinking of the Titanic.\n",
|
||||
"The main objective is to predict which passengers survived the sinking of the Titanic.\n",
|
||||
"\n",
|
||||
"The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Here follows a description of the variables.\n",
|
||||
"\n",
|
||||
"|Variable | Description| Values|\n",
|
||||
"|-------------------------------|\n",
|
||||
"| survival| Survival| (0 = No; 1 = Yes)|\n",
|
||||
"|Pclass |Name | |\n",
|
||||
"|Sex |Sex | male, female|\n",
|
||||
"|Age |Age|\n",
|
||||
"|SibSp |Number of Siblings/Spouses Aboard||\n",
|
||||
"|Parch |Number of Parents/Children Aboard||\n",
|
||||
"|Ticket|Ticket Number||\n",
|
||||
"|Fare |Passenger Fare||\n",
|
||||
"|Cabin |Cabin||\n",
|
||||
"|Embarked |Port of Embarkation| (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
|
||||
"| Variable | Description | Values |\n",
|
||||
"|------------|---------------------------------|-----------------|\n",
|
||||
"| survival | Survival |(0 = No; 1 = Yes)|\n",
|
||||
"| Pclass | Name | |\n",
|
||||
"| Sex | Sex | male, female |\n",
|
||||
"| Age | Age | |\n",
|
||||
"| SibSp |Number of Siblings/Spouses Aboard| |\n",
|
||||
"| Parch |Number of Parents/Children Aboard| |\n",
|
||||
"| Ticket | Ticket Number | |\n",
|
||||
"| Fare | Passenger Fare | |\n",
|
||||
"| Cabin | Cabin | |\n",
|
||||
"| Embarked | Port of Embarkation | (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The definitions used for SibSp and Parch are:\n",
|
||||
@@ -213,8 +213,7 @@
|
||||
"* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",
|
||||
"* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
|
||||
"* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",
|
||||
"* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n",
|
||||
"* [NumPy tutorial](https://numpy.org/doc/stable/)"
|
||||
"* [An introduction to NumPy and Scipy](https://sites.engineering.ucsb.edu/~shell/che210d/numpy.pdf)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -433,7 +433,6 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"* [Pandas](http://pandas.pydata.org/)\n",
|
||||
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2017](https://learning.oreilly.com/library/view/learning-pandas/9781787123137/)\n",
|
||||
"* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
|
||||
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
|
||||
"* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)"
|
||||
|
@@ -373,8 +373,8 @@
|
||||
"source": [
|
||||
"#Mean age of passengers per Passenger class\n",
|
||||
"\n",
|
||||
"#First we calculate the mean\n",
|
||||
"df.groupby('Pclass').mean()"
|
||||
"#First we calculate the mean for the numeric columns\n",
|
||||
"df.select_dtypes(np.number).groupby('Pclass').mean()"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -451,7 +451,10 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Pivot tables are an intuitive way to analyze data, and alternative to group columns."
|
||||
"Pivot tables are an intuitive way to analyze data, and an alternative to group columns.\n",
|
||||
"\n",
|
||||
"This command makes a table with rows Sex and columns Pclass, and\n",
|
||||
"averages the result of the column Survived, thereby giving the percentage of survivors in each grouping."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -460,7 +463,14 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pd.pivot_table(df, index='Sex')"
|
||||
"pd.pivot_table(df, index='Sex', columns='Pclass', values=['Survived'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we want to analyze multi-index, the percentage of survivoers, given sex and age, and distributed by Pclass."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -469,7 +479,14 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pd.pivot_table(df, index=['Sex', 'Pclass'])"
|
||||
"pd.pivot_table(df, index=['Sex', 'Age'], columns=['Pclass'], values=['Survived'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Nevertheless, this is not very useful since we have a row per age. Thus, we define a partition."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -478,7 +495,8 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])"
|
||||
"# Partition each of the passengers into 3 categories based on their age\n",
|
||||
"age = pd.cut(df['Age'], [0,12,18,80])"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -487,7 +505,14 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)"
|
||||
"pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can change the function used for aggregating each group."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -496,8 +521,18 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Try np.sum, np.size, len\n",
|
||||
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])"
|
||||
"# default\n",
|
||||
"pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=np.mean)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Two agg functions\n",
|
||||
"pd.pivot_table(df, index=['Sex', age], columns=['Pclass'], values=['Survived'], aggfunc=[np.mean, np.sum])"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -972,7 +1007,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.12"
|
||||
"version": "3.11.5"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
|
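The pivot-table cells in the hunks above combine `pd.cut` with `pd.pivot_table`. The following is a minimal sketch of that pattern, assuming a tiny hand-made DataFrame (hypothetical rows, not the real Titanic data) so that it runs without downloading anything. It passes the string `"mean"` as `aggfunc` instead of `np.mean`, which is an assumption made here to stay clear of deprecation warnings in recent pandas releases.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic training data
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22, 38, 4, 15, 14, 54],
    "Pclass": [3, 1, 3, 3, 2, 1],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Partition ages into the bins (0, 12], (12, 18] and (18, 80]
age = pd.cut(df["Age"], [0, 12, 18, 80])

# The mean of a 0/1 column is the fraction of survivors in each group
table = pd.pivot_table(
    df,
    index=["Sex", age],
    columns="Pclass",
    values="Survived",
    aggfunc="mean",
    observed=False,
)
print(table)
```

Cells with no passengers in a given combination show up as `NaN`, which is worth keeping in mind when reading the real table.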
@@ -220,7 +220,7 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Analise distributon\n",
|
||||
"# Analise distribution\n",
|
||||
"df.hist(figsize=(10,10))\n",
|
||||
"plt.show()"
|
||||
]
|
||||
@@ -233,7 +233,7 @@
|
||||
"source": [
|
||||
"# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
|
||||
"# while a value near -1 or 1 indicates strong correlation.\n",
|
||||
"df.corr()"
|
||||
"df.corr(numeric_only = True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -249,11 +249,10 @@
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# General description of relationship betweek variables uwing Seaborn PairGrid\n",
|
||||
"# General description of relationship between variables uwing Seaborn PairGrid\n",
|
||||
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
|
||||
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
|
||||
"g.map_diag(plt.hist)\n",
|
||||
"g.map_offdiag(plt.scatter)\n",
|
||||
"g.map(sns.scatterplot)\n",
|
||||
"g.add_legend()"
|
||||
]
|
||||
},
|
||||
|
@@ -351,10 +351,10 @@
|
||||
"We can obtain more information from the confussion matrix and the metric F1-score.\n",
|
||||
"In a confussion matrix, we can see:\n",
|
||||
"\n",
|
||||
"||**Predicted**: 0| **Predicted: 1**|\n",
|
||||
"|---------------------------|\n",
|
||||
"|**Actual: 0**| TN | FP |\n",
|
||||
"|**Actual: 1**| FN|TP|\n",
|
||||
"| |**Predicted**: 0| **Predicted: 1**|\n",
|
||||
"|-------------|----------------|-----------------|\n",
|
||||
"|**Actual: 0**| TN | FP |\n",
|
||||
"|**Actual: 1**| FN | TP |\n",
|
||||
"\n",
|
||||
"* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
|
||||
"* **False positives (FP)**: actual negatives that were predicted as positives\n",
|
||||
|
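The four cells of the confusion matrix in the hunk above, and the F1-score mentioned with it, can be computed by hand. This is a minimal sketch in plain Python, assuming binary labels encoded as 0 and 1. The helper names `confusion_counts` and `f1_from_counts` and the label lists are hypothetical and are not part of the notebook.

```python
def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels encoded as 0/1."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 0 and predicted == 0:
            tn += 1  # actual negative predicted as negative
        elif actual == 0 and predicted == 1:
            fp += 1  # actual negative predicted as positive
        elif actual == 1 and predicted == 0:
            fn += 1  # actual positive predicted as negative
        else:
            tp += 1  # actual positive predicted as positive
    return tn, fp, fn, tp


def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
f1 = f1_from_counts(tp, fp, fn)
print(tn, fp, fn, tp, round(f1, 3))
```

In practice a library routine would normally be used for this. The explicit loop is only meant to make the definition of each cell visible.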
BIN
ml2/images/iris-classes.png
Normal file
After Width: | Height: | Size: 1.4 MiB |
BIN
ml2/images/iris-features.png
Normal file
After Width: | Height: | Size: 944 KiB |
1
ml21/.gitkeep
Normal file
@@ -0,0 +1 @@
|
||||
|
1
ml21/preprocessing/.gitkeep
Normal file
@@ -0,0 +1 @@
|
||||
|
157
ml21/preprocessing/00_Intro_Preprocessing.ipynb
Normal file
@@ -0,0 +1,157 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Introduction to Preprocessing\n",
|
||||
"In this session, we will get more insight regarding how to preprocess data.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Objectives\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understanding the need for preprocessing\n",
|
||||
"* Understanding different preprocessing techniques\n",
|
||||
"* Experimenting with several environments for preprocessing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"1. [Home](00_Intro_Preprocessing.ipynb)\n",
|
||||
"3. [Initial Check](02_Initial_Check.ipynb)\n",
|
||||
"4. [Filter Data](03_Filter_Data.ipynb)\n",
|
||||
"5. [Unknown values](04_Unknown_Values.ipynb)\n",
|
||||
"6. [Duplicated values](05_Duplicated_Values.ipynb)\n",
|
||||
"7. [Rescaling Data](06_Rescaling_Data.ipynb)\n",
|
||||
"8. [Binarize Data](07_Binarize_Data.ipynb)\n",
|
||||
"9. [Categorial features](08_Categorical.ipynb)\n",
|
||||
"10. [String Data](09_String_Data.ipynb)\n",
|
||||
"12. [Handy libraries for preprocessing](11_0_Handy.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
714
ml21/preprocessing/02_Initial_Check.ipynb
Normal file
@@ -0,0 +1,714 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Initial Check with Pandas\n",
|
||||
"\n",
|
||||
"We can start with a quick quality check."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Load and check data\n",
|
||||
"Check which data you are loading."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Braund, Mr. Owen Harris</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>A/5 21171</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>PC 17599</td>\n",
|
||||
" <td>71.2833</td>\n",
|
||||
" <td>C85</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Heikkinen, Miss. Laina</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>STON/O2. 3101282</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>113803</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>C123</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Allen, Mr. William Henry</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>373450</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>6</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Moran, Mr. James</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>330877</td>\n",
|
||||
" <td>8.4583</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>Q</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>7</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>McCarthy, Mr. Timothy J</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>54.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>17463</td>\n",
|
||||
" <td>51.8625</td>\n",
|
||||
" <td>E46</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>7</th>\n",
|
||||
" <td>8</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Palsson, Master. Gosta Leonard</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>349909</td>\n",
|
||||
" <td>21.0750</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>8</th>\n",
|
||||
" <td>9</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>27.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>347742</td>\n",
|
||||
" <td>11.1333</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>9</th>\n",
|
||||
" <td>10</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>14.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>237736</td>\n",
|
||||
" <td>30.0708</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass \\\n",
|
||||
"0 1 0 3 \n",
|
||||
"1 2 1 1 \n",
|
||||
"2 3 1 3 \n",
|
||||
"3 4 1 1 \n",
|
||||
"4 5 0 3 \n",
|
||||
"5 6 0 3 \n",
|
||||
"6 7 0 1 \n",
|
||||
"7 8 0 3 \n",
|
||||
"8 9 1 3 \n",
|
||||
"9 10 1 2 \n",
|
||||
"\n",
|
||||
" Name Sex Age SibSp \\\n",
|
||||
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
||||
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
||||
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
||||
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
||||
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
||||
"5 Moran, Mr. James male NaN 0 \n",
|
||||
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
|
||||
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
|
||||
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
|
||||
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
|
||||
"\n",
|
||||
" Parch Ticket Fare Cabin Embarked \n",
|
||||
"0 0 A/5 21171 7.2500 NaN S \n",
|
||||
"1 0 PC 17599 71.2833 C85 C \n",
|
||||
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
||||
"3 0 113803 53.1000 C123 S \n",
|
||||
"4 0 373450 8.0500 NaN S \n",
|
||||
"5 0 330877 8.4583 NaN Q \n",
|
||||
"6 0 17463 51.8625 E46 S \n",
|
||||
"7 1 349909 21.0750 NaN S \n",
|
||||
"8 2 347742 11.1333 NaN S \n",
|
||||
"9 0 237736 30.0708 NaN C "
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
|
||||
"df.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Check number of columns and rows"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(891, 12)"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check names and types of columns\n",
|
||||
"Check the data and type, for example if dates are of strings or what."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
|
||||
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
|
||||
" dtype='object')\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"PassengerId int64\n",
|
||||
"Survived int64\n",
|
||||
"Pclass int64\n",
|
||||
"Name object\n",
|
||||
"Sex object\n",
|
||||
"Age float64\n",
|
||||
"SibSp int64\n",
|
||||
"Parch int64\n",
|
||||
"Ticket object\n",
|
||||
"Fare float64\n",
|
||||
"Cabin object\n",
|
||||
"Embarked object\n",
|
||||
"dtype: object"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Get column names\n",
|
||||
"print(df.columns)\n",
|
||||
"# Get column data types\n",
|
||||
"df.dtypes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check if the column is unique"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"PassengerId is unique: True\n",
|
||||
"Survived is unique: False\n",
|
||||
"Pclass is unique: False\n",
|
||||
"Name is unique: True\n",
|
||||
"Sex is unique: False\n",
|
||||
"Age is unique: False\n",
|
||||
"SibSp is unique: False\n",
|
||||
"Parch is unique: False\n",
|
||||
"Ticket is unique: False\n",
|
||||
"Fare is unique: False\n",
|
||||
"Cabin is unique: False\n",
|
||||
"Embarked is unique: False\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i in column_names:\n",
|
||||
" print('{} is unique: {}'.format(i, df[i].is_unique))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check if the dataframe has an index\n",
|
||||
"We will need it to do joins or merges."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"RangeIndex(start=0, stop=891, step=1)"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check if there is an index. If not, you will get 'AtributeError: function object has no atribute index'\n",
|
||||
"df.index"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
|
||||
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
|
||||
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
|
||||
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
|
||||
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
|
||||
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
|
||||
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
|
||||
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
|
||||
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
|
||||
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
|
||||
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
|
||||
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
|
||||
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
|
||||
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
|
||||
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,\n",
|
||||
" 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,\n",
|
||||
" 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,\n",
|
||||
" 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,\n",
|
||||
" 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,\n",
|
||||
" 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,\n",
|
||||
" 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,\n",
|
||||
" 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,\n",
|
||||
" 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,\n",
|
||||
" 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,\n",
|
||||
" 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,\n",
|
||||
" 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,\n",
|
||||
" 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,\n",
|
||||
" 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,\n",
|
||||
" 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,\n",
|
||||
" 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,\n",
|
||||
" 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,\n",
|
||||
" 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,\n",
|
||||
" 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,\n",
|
||||
" 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,\n",
|
||||
" 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,\n",
|
||||
" 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,\n",
|
||||
" 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,\n",
|
||||
" 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,\n",
|
||||
" 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,\n",
|
||||
" 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,\n",
|
||||
" 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,\n",
|
||||
" 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,\n",
|
||||
" 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,\n",
|
||||
" 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,\n",
|
||||
" 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,\n",
|
||||
" 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,\n",
|
||||
" 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,\n",
|
||||
" 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,\n",
|
||||
" 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,\n",
|
||||
" 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,\n",
|
||||
" 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,\n",
|
||||
" 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,\n",
|
||||
" 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,\n",
|
||||
" 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,\n",
|
||||
" 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,\n",
|
||||
" 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,\n",
|
||||
" 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,\n",
|
||||
" 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,\n",
|
||||
" 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,\n",
|
||||
" 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,\n",
|
||||
" 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,\n",
|
||||
" 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,\n",
|
||||
" 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,\n",
|
||||
" 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,\n",
|
||||
" 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,\n",
|
||||
" 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,\n",
|
||||
" 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,\n",
|
||||
" 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,\n",
|
||||
" 884, 885, 886, 887, 888, 889, 890])"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# # Check the index values\n",
|
||||
"df.index.values"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# If index does not exist\n",
|
||||
"df.set_index('column_name_to_use', inplace=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"PassengerId 0\n",
|
||||
"Survived 0\n",
|
||||
"Pclass 0\n",
|
||||
"Name 0\n",
|
||||
"Sex 0\n",
|
||||
"Age 177\n",
|
||||
"SibSp 0\n",
|
||||
"Parch 0\n",
|
||||
"Ticket 0\n",
|
||||
"Fare 0\n",
|
||||
"Cabin 687\n",
|
||||
"Embarked 2\n",
|
||||
"dtype: int64"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Count missing vales per column\n",
|
||||
"df.isnull().sum()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
150
ml21/preprocessing/03_Filter_Data.ipynb
Normal file
@@ -0,0 +1,150 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Filter Data\n",
|
||||
"\n",
|
||||
"Select the columns you want and delete the others."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Create list comprehension of the columns you want to lose\n",
|
||||
"columns_to_drop = [column_names[i] for i in [1, 3, 5]]\n",
|
||||
"# Drop unwanted columns \n",
|
||||
"df.drop(columns_to_drop, inplace=True, axis=1)"
|
||||
]
|
||||
},
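A minimal runnable sketch of the filtering step above. The DataFrame and its column names are assumptions for illustration; the positions are looked up through `df.columns` rather than a separate `column_names` list.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "age": [20, 30, 40],
    "city": ["x", "y", "z"],
})

# Drop columns by position: translate positions into column names first.
columns_to_drop = [df.columns[i] for i in [1, 3]]
dropped = df.drop(columns=columns_to_drop)

# Equivalent alternative: keep only the columns you want.
kept = df[["id", "age"]]

print(list(dropped.columns))
```

Selecting the columns to keep is often easier to read when only a few columns survive; dropping is handier when only a few are removed.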
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
591
ml21/preprocessing/04_Unknown_Values.ipynb
Normal file
@@ -0,0 +1,591 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Unknown values\n",
|
||||
"\n",
|
||||
"Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Filling NaN values\n",
|
||||
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
|
||||
"\n",
|
||||
"* For **string** fields, we can fill NaN with **' '**.\n",
|
||||
"\n",
|
||||
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Fill NaN with ' '\n",
|
||||
"df['col'] = df['col'].fillna(' ')\n",
|
||||
"# Fill NaN with 99\n",
|
||||
"df['col'] = df['col'].fillna(99)\n",
|
||||
"# Fill NaN with the mean of the column\n",
|
||||
"df['col'] = df['col'].fillna(df['col'].mean())"
|
||||
]
|
||||
},
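The `fillna` templates above can be exercised on a small frame. This is a sketch under stated assumptions: the data and the column names `name` and `score` are invented, and the filled results are stored in new columns only so the mean and median variants can be compared side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing string and one missing number.
df = pd.DataFrame({
    "name": ["Ann", np.nan, "Bob", "Eve"],
    "score": [1.0, np.nan, 3.0, 10.0],
})

# String field: replace NaN with a blank.
df["name"] = df["name"].fillna(" ")

# Numeric field: replace NaN with the mean or the median of the column.
df["score_mean"] = df["score"].fillna(df["score"].mean())
df["score_median"] = df["score"].fillna(df["score"].median())

print(df)
```

With an outlier such as 10.0 in the column, the median (3.0) is a more robust fill value than the mean (about 4.67).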
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Propagate non-null values forward or backward\n",
|
||||
"You can also **propagate** non-null values with these methods:\n",
|
||||
"\n",
|
||||
"* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
|
||||
"* **bfill**: Fill values using the following valid observation to fill the gap.\n",
|
||||
"* **interpolate**: Fill NaN values using interpolation.\n",
|
||||
"\n",
|
||||
"It will fill the next value in the dataframe with the previous non-NaN value. \n",
|
||||
"\n",
|
||||
"You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We fill forward the value 4.0 and fill the next one (limit = 1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 4.0\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
" df.ffill(limit = 1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df.ffill()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 2.0\n",
|
||||
"1 2.0\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df.bfill()"
|
||||
]
|
||||
},
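The third option listed earlier, **interpolate**, is not run in the cells above. A minimal sketch, assuming a small column with NaN values only between two valid observations (the data is invented so the result is easy to check by hand):

```python
import numpy as np
import pandas as pd

# Hypothetical data: a gap of two NaN values between 1.0 and 4.0.
df_gap = pd.DataFrame({"col1": [1.0, np.nan, np.nan, 4.0]})

# Default method is linear: the gap is filled with evenly spaced values.
filled = df_gap.interpolate()

print(filled)
```

Unlike `ffill` and `bfill`, which copy a neighbouring value, linear interpolation estimates the missing values from both neighbours.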
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Removing NaN values\n",
|
||||
"We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0"
|
||||
]
|
||||
},
|
||||
"execution_count": 26,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Drop any rows which have any nans\n",
|
||||
"df1 = df.dropna()\n",
|
||||
"# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
|
||||
"df2 = df.dropna(axis=1)\n",
|
||||
"# Only drop columns which have at least 90% non-NaNs \n",
|
||||
"df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
|
||||
"df1"
|
||||
]
|
||||
},
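With a single column, the effect of `thresh` is hard to see. The sketch below uses an invented two-column frame (the names `mostly_full` and `sparse` are assumptions) to show which column survives the 90% rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 rows, one column nearly complete, one nearly empty.
df_t = pd.DataFrame({
    "mostly_full": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
    "sparse": [1.0] + [np.nan] * 9,
})

# thresh is the minimum number of non-NaN values a column needs to be kept.
min_non_nan = int(df_t.shape[0] * 0.9)
kept = df_t.dropna(thresh=min_non_nan, axis=1)

print(min_non_nan, list(kept.columns))
```

Here `mostly_full` has 9 non-NaN values and is kept, while `sparse` has only 1 and is dropped.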
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
3535
ml21/preprocessing/05_Duplicated_Values.ipynb
Normal file
954
ml21/preprocessing/06_Rescaling_Data.ipynb
Normal file
198
ml21/preprocessing/07_Binarize_Data.ipynb
Normal file
@@ -0,0 +1,198 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Binarize Data\n",
|
||||
"* We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below are marked 0.\n",
|
||||
"* This is called binarizing your data or thresholding your data. \n",
|
||||
"\n",
|
||||
"* It can be helpful when you have probabilities that you want to make crisp values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Binarize Data with Scikit-Learn\n",
|
||||
"We can create new binary attributes in Python using Scikit-learn with the Binarizer class.\n",
|
||||
"I"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import Binarizer\n",
|
||||
"\n",
|
||||
"X = [[ 1., -1., 2.],\n",
|
||||
" [ 2., 0., 0.],\n",
|
||||
" [ 0., 1.1, -1.]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"transformer = Binarizer(threshold=1.0).fit(X) # threshold 1.0"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[0., 0., 1.],\n",
|
||||
" [1., 0., 0.],\n",
|
||||
" [0., 1., 0.]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"transformer.transform(X)"
|
||||
]
|
||||
},
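The earlier remark about turning probabilities into crisp values can be sketched with the same `Binarizer` class. The probability matrix and the 0.5 threshold below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities (rows are samples).
probs = np.array([[0.1, 0.5, 0.9],
                  [0.7, 0.3, 0.51]])

# Values strictly above the threshold become 1, the rest become 0.
labels = Binarizer(threshold=0.5).fit_transform(probs)

print(labels)
```

Note that 0.5 itself maps to 0, because only values strictly greater than the threshold are marked 1.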
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
812
ml21/preprocessing/08_Categorical.ipynb
Normal file
@@ -0,0 +1,812 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Categorical Data\n",
|
||||
"\n",
|
||||
"For many ML algorithms, we need to transform categorical data into numbers.\n",
|
||||
"\n",
|
||||
"For example:\n",
|
||||
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*. \n",
|
||||
"* **'Position'** with values 'phD', *'Professor'*, *'TA'*, *'graduate'*.\n",
|
||||
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
|
||||
"\n",
|
||||
"There are two main approaches:\n",
|
||||
"* Integer encoding\n",
|
||||
"* One hot encoding"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Integer Encoding\n",
|
||||
"We assign a number to every value:\n",
|
||||
"\n",
|
||||
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
|
||||
"\n",
|
||||
"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
|
||||
"\n",
|
||||
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
|
||||
"\n",
|
||||
"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",
|
||||
"\n",
|
||||
"In our examples, this representation can be suitable for **temperature**, but not for the other two."
|
||||
]
|
||||
},
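The integer encodings listed above can be reproduced with Pandas. This is only one possible sketch: it uses `pd.factorize` for the unordered case and an ordered `pd.Categorical` for temperature, which is a swapped-in alternative to the Scikit-Learn encoders named elsewhere in these notes.

```python
import pandas as pd

# Unordered category: codes are assigned in order of first appearance.
sex = pd.Series(["M", "F", "Unknown", "M"])
sex_codes, sex_values = pd.factorize(sex)

# Ordered category: state the order explicitly so the codes respect it.
temperature = pd.Series(["low", "medium", "high", "low"])
ordered = pd.Categorical(temperature,
                         categories=["low", "medium", "high"],
                         ordered=True)
temperature_codes = ordered.codes.tolist()

print(sex_codes.tolist(), temperature_codes)
```

Declaring the category order explicitly avoids relying on alphabetical order, which would place 'high' before 'low'.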
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## One Hot Encoding\n",
|
||||
"A binary column is created for each value of the categorical variable."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Sex M F U\n",
|
||||
"----- ---------\n",
|
||||
"M 1 0 0\n",
|
||||
"F is transformed into 0 1 0\n",
|
||||
"Unknown 0 0 1\n",
|
||||
"M 1 0 0 "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Transforming categorical data with Scikit-Learn\n",
|
||||
"\n",
|
||||
"We can use:\n",
|
||||
"* **get_dummies()** (one hot encoding)\n",
|
||||
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding). \n",
|
||||
"\n",
|
||||
"We are going to learn the first approach."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### One Hot Encoding\n",
|
||||
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
" Name Age Sex Position\n",
|
||||
"0 Marius 18 Male graduate\n",
|
||||
"1 Maria 19 Female professor\n",
|
||||
"2 John 20 Male TA\n",
|
||||
"3 Carla 30 Female phD\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
|
||||
" \"Age\": [18, 19, 20, 30],\n",
|
||||
"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
|
||||
" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
|
||||
" }\n",
|
||||
"df = pd.DataFrame(data)\n",
|
||||
"print(df)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" <th>Sex_Female</th>\n",
|
||||
" <th>Sex_Male</th>\n",
|
||||
" <th>Position_TA</th>\n",
|
||||
" <th>Position_graduate</th>\n",
|
||||
" <th>Position_phD</th>\n",
|
||||
" <th>Position_professor</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
|
||||
"0 Marius 18 1 1 False True \n",
|
||||
"1 Maria 19 0 3 True False \n",
|
||||
"2 John 20 1 0 False True \n",
|
||||
"3 Carla 30 0 2 True False \n",
|
||||
"\n",
|
||||
" Position_TA Position_graduate Position_phD Position_professor \n",
|
||||
"0 False True False False \n",
|
||||
"1 False False False True \n",
|
||||
"2 True False False False \n",
|
||||
"3 False False True False "
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
|
||||
"df_onehot"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also use *OneHotEncoder* from scikit-learn."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Sex_Female</th>\n",
|
||||
" <th>Sex_Male</th>\n",
|
||||
" <th>Position_TA</th>\n",
|
||||
" <th>Position_graduate</th>\n",
|
||||
" <th>Position_phD</th>\n",
|
||||
" <th>Position_professor</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
|
||||
"0 0.0 1.0 0.0 1.0 0.0 \n",
|
||||
"1 1.0 0.0 0.0 0.0 0.0 \n",
|
||||
"2 0.0 1.0 1.0 0.0 0.0 \n",
|
||||
"3 1.0 0.0 0.0 0.0 1.0 \n",
|
||||
"\n",
|
||||
" Position_professor Name Age sex_encoded position_encoded \n",
|
||||
"0 0.0 Marius 18 1 1 \n",
|
||||
"1 1.0 Maria 19 0 3 \n",
|
||||
"2 0.0 John 20 1 0 \n",
|
||||
"3 0.0 Carla 30 0 2 "
|
||||
]
|
||||
},
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import OneHotEncoder\n",
|
||||
"from sklearn.compose import make_column_transformer\n",
|
||||
"\n",
|
||||
"df_onehotencoder = df\n",
|
||||
"# create OneHotEncoder object\n",
|
||||
"encoder = OneHotEncoder()\n",
|
||||
"\n",
|
||||
"# Transformer for several columns\n",
|
||||
"transformer = make_column_transformer(\n",
|
||||
" (OneHotEncoder(), ['Sex', 'Position']),\n",
|
||||
" remainder='passthrough',\n",
|
||||
" verbose_feature_names_out=False)\n",
|
||||
"\n",
|
||||
"# transform\n",
|
||||
"transformed = transformer.fit_transform(df_onehotencoder)\n",
|
||||
"\n",
|
||||
"df_onehotencoder = pd.DataFrame(\n",
|
||||
" transformed,\n",
|
||||
" columns=transformer.get_feature_names_out())\n",
|
||||
"df_onehotencoder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Pandas' *get_dummies* is the simpler option for transforming a DataFrame directly. *OneHotEncoder* learns the categories when it is fitted and can reapply them to new data, which makes it a better fit as a step in a machine learning pipeline."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Integer encoding\n",
|
||||
"We will use **LabelEncoder**, which assigns an integer code to each category. The original values can be recovered with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position\n",
|
||||
"0 Marius 18 Male graduate\n",
|
||||
"1 Maria 19 Female professor\n",
|
||||
"2 John 20 Male TA\n",
|
||||
"3 Carla 30 Female phD"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import LabelEncoder\n",
|
||||
"# creating instance of labelencoder\n",
|
||||
"labelencoder = LabelEncoder()\n",
|
||||
"df_encoded = df\n",
|
||||
"# Assigning numerical values and storing in another column\n",
|
||||
"sex_values = ('Male', 'Female')\n",
|
||||
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position sex_encoded\n",
|
||||
"0 Marius 18 Male graduate 1\n",
|
||||
"1 Maria 19 Female professor 0\n",
|
||||
"2 John 20 Male TA 1\n",
|
||||
"3 Carla 30 Female phD 0"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position sex_encoded position_encoded\n",
|
||||
"0 Marius 18 Male graduate 1 1\n",
|
||||
"1 Maria 19 Female professor 0 3\n",
|
||||
"2 John 20 Male TA 1 0\n",
|
||||
"3 Carla 30 Female phD 0 2"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
|
||||
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), scikit-learn."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
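The integer-encoding cells in the notebook above rely on LabelEncoder and mention that the original values can be recovered. The following is a minimal plain-Python sketch of that idea, not scikit-learn's implementation. It assumes that classes are numbered in sorted order, which is consistent with the `position_encoded` column shown in the notebook output. The function names `fit_classes`, `encode` and `decode` are hypothetical and exist only for this illustration.

```python
def fit_classes(values):
    """Collect the distinct labels in sorted order; the list index is the code."""
    return sorted(set(values))


def encode(classes, values):
    """Map each label to its integer code."""
    index = {label: code for code, label in enumerate(classes)}
    return [index[value] for value in values]


def decode(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


positions = ["graduate", "professor", "TA", "phD"]
classes = fit_classes(positions)
codes = encode(classes, positions)
restored = decode(classes, codes)

print(classes)
print(codes)
print(restored)
```

Because uppercase letters sort before lowercase ones, `TA` receives the smallest code. Note that the resulting integers carry no meaningful order, which is one reason one-hot encoding is often preferred for nominal features.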
|
652
ml21/preprocessing/09_String_Data.ipynb
Normal file
@@ -0,0 +1,652 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# String Data\n",
|
||||
"It is common to clean string columns so that they follow a predefined format (e.g., emails, URLs, ...).\n",
|
||||
"\n",
|
||||
"We can do it using regular expressions or specific libraries."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Beautifier\n",
|
||||
"A simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains, and so on. The library helps to clean Unicode, special characters, and unnecessary redirection patterns from the URLs and gives you clean data.\n",
|
||||
"\n",
|
||||
"Install with **'pip install beautifier'**."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Email cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from beautifier import Email\n",
|
||||
"email = Email('me@imsach.in')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'imsach.in'"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.domain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'me'"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.username"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"False"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.is_free_email"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"email2 = Email('This my address')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"False"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email2.is_valid"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"email3 = Email('pepe@gmail.com')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email3.is_valid"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email3.is_free_email"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## URL cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from beautifier import Url\n",
|
||||
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'https://in.linkedin.com/in/sachinphilip'"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'in.linkedin.com'"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.domain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.param"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.parameters"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'sachinphilip'"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.username"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Unicode\n",
|
||||
"Problem: some Unicode text has been decoded with the wrong character encoding, so we see the wrong characters.\n",
|
||||
"\n",
|
||||
"**Mojibake** is garbled text that results from decoding text with an unintended character encoding (example: \"<22>\").\n",
|
||||
"\n",
|
||||
"We will use the library **ftfy** (fixes text for you) to fix it.\n",
|
||||
"\n",
|
||||
"First, you should install the library: **conda install ftfy** (or **pip install ftfy**)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"¯\\_(ツ)_/¯\n",
|
||||
"Party\n",
|
||||
"I'm\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import ftfy\n",
|
||||
"foo = '&macr;\\\\_(ã\\x83\\x84)_/&macr;'\n",
|
||||
"bar = '\\ufeffParty'\n",
|
||||
"baz = '\\001\\033[36;44mI’m'\n",
|
||||
"print(ftfy.fix_text(foo))\n",
|
||||
"print(ftfy.fix_text(bar))\n",
|
||||
"print(ftfy.fix_text(baz))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"We can inspect the code points of a string with *explain_unicode*, which helps to understand what ftfy has to fix."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"U+0026 & [Po] AMPERSAND\n",
|
||||
"U+006D m [Ll] LATIN SMALL LETTER M\n",
|
||||
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
|
||||
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
|
||||
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
|
||||
"U+003B ; [Po] SEMICOLON\n",
|
||||
"U+005C \\ [Po] REVERSE SOLIDUS\n",
|
||||
"U+005F _ [Pc] LOW LINE\n",
|
||||
"U+0028 ( [Ps] LEFT PARENTHESIS\n",
|
||||
"U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE\n",
|
||||
"U+0083 \\x83 [Cc] <unknown>\n",
|
||||
"U+0084 \\x84 [Cc] <unknown>\n",
|
||||
"U+0029 ) [Pe] RIGHT PARENTHESIS\n",
|
||||
"U+005F _ [Pc] LOW LINE\n",
|
||||
"U+002F / [Po] SOLIDUS\n",
|
||||
"U+0026 & [Po] AMPERSAND\n",
|
||||
"U+006D m [Ll] LATIN SMALL LETTER M\n",
|
||||
"U+0061 a [Ll] LATIN SMALL LETTER A\n",
|
||||
"U+0063 c [Ll] LATIN SMALL LETTER C\n",
|
||||
"U+0072 r [Ll] LATIN SMALL LETTER R\n",
|
||||
"U+003B ; [Po] SEMICOLON\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"ftfy.explain_unicode(foo)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Dates\n",
|
||||
"Sometimes we want to extract dates from text. We can use regular expressions or handy packages, such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).\n",
|
||||
"\n",
|
||||
"Install the library: **pip install python-dateutil**."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"2019-08-22 10:22:46+00:00\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from dateutil.parser import parse\n",
|
||||
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
|
||||
"print(now)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 19,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"2019-08-08 10:20:00\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
|
||||
"print(dt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
|
||||
"* [Beautifier](https://github.com/labtocat/beautifier) package\n",
|
||||
"* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package\n",
|
||||
"* [python-dateutil](https://dateutil.readthedocs.io/en/stable/) package"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
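The String Data notebook above notes that string columns can be cleaned with regular expressions or with specific libraries. As a rough standard-library counterpart to the Beautifier cells, the sketch below uses `urllib.parse` and `re`. It is an illustration under stated assumptions, not how Beautifier works internally: the email pattern `EMAIL_PATTERN` is a deliberately loose, hypothetical check (something, `@`, something, a dot, something), far short of a full address validator.

```python
import re
from urllib.parse import parse_qsl, urlsplit, urlunsplit

raw_url = ("https://in.linkedin.com/in/sachinphilip"
           "?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk")

# Split the URL into scheme, host, path, query and fragment.
parts = urlsplit(raw_url)

# Rebuild the URL without query string and fragment.
clean_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
domain = parts.netloc
params = parse_qsl(parts.query)  # list of (name, value) pairs

# Loose, illustrative email check (an assumption, not a standard-compliant one).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def looks_like_email(text):
    """Return True if the text has the rough shape of an email address."""
    return EMAIL_PATTERN.match(text) is not None


print(clean_url)
print(domain)
print(params)
print(looks_like_email("me@imsach.in"), looks_like_email("This my address"))
```

A dedicated library remains the better choice when extra knowledge is needed, such as recognising free email providers, which a pattern alone cannot provide.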
|
139
ml21/preprocessing/11_0_Handy.ipynb
Normal file
@@ -0,0 +1,139 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Handy libraries\n",
|
||||
"Libraries that help in several preprocessing tasks.\n",
|
||||
"\n",
|
||||
"* [datacleaner](11_1_datacleaner.ipynb)\n",
|
||||
"* [autoclean](11_3_autoclean.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
|
||||
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
673
ml21/preprocessing/11_1_datacleaner.ipynb
Normal file
@@ -0,0 +1,673 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Datacleaner\n",
|
||||
"[Datacleaner](https://github.com/rhiever/datacleaner) supports:\n",
|
||||
"\n",
|
||||
"* drop rows with missing values\n",
|
||||
"* replace missing values with the mode or median on a column-by-column basis\n",
|
||||
"* encode non-numeric variables with numerical equivalents\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Install with\n",
|
||||
"\n",
|
||||
"**pip install datacleaner**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Braund, Mr. Owen Harris</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>A/5 21171</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>PC 17599</td>\n",
|
||||
" <td>71.2833</td>\n",
|
||||
" <td>C85</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Heikkinen, Miss. Laina</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>STON/O2. 3101282</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>113803</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>C123</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Allen, Mr. William Henry</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>373450</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>886</th>\n",
|
||||
" <td>887</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>Montvila, Rev. Juozas</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>27.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>211536</td>\n",
|
||||
" <td>13.0000</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>887</th>\n",
|
||||
" <td>888</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Graham, Miss. Margaret Edith</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>19.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>112053</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>B42</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>888</th>\n",
|
||||
" <td>889</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>W./C. 6607</td>\n",
|
||||
" <td>23.4500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>889</th>\n",
|
||||
" <td>890</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Behr, Mr. Karl Howell</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>111369</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>C148</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>890</th>\n",
|
||||
" <td>891</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Dooley, Mr. Patrick</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>32.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>370376</td>\n",
|
||||
" <td>7.7500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>Q</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>891 rows × 12 columns</p>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass \\\n",
|
||||
"0 1 0 3 \n",
|
||||
"1 2 1 1 \n",
|
||||
"2 3 1 3 \n",
|
||||
"3 4 1 1 \n",
|
||||
"4 5 0 3 \n",
|
||||
".. ... ... ... \n",
|
||||
"886 887 0 2 \n",
|
||||
"887 888 1 1 \n",
|
||||
"888 889 0 3 \n",
|
||||
"889 890 1 1 \n",
|
||||
"890 891 0 3 \n",
|
||||
"\n",
|
||||
" Name Sex Age SibSp \\\n",
|
||||
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
||||
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
||||
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
||||
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
||||
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
||||
".. ... ... ... ... \n",
|
||||
"886 Montvila, Rev. Juozas male 27.0 0 \n",
|
||||
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
|
||||
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
|
||||
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
|
||||
"890 Dooley, Mr. Patrick male 32.0 0 \n",
|
||||
"\n",
|
||||
" Parch Ticket Fare Cabin Embarked \n",
|
||||
"0 0 A/5 21171 7.2500 NaN S \n",
|
||||
"1 0 PC 17599 71.2833 C85 C \n",
|
||||
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
||||
"3 0 113803 53.1000 C123 S \n",
|
||||
"4 0 373450 8.0500 NaN S \n",
|
||||
".. ... ... ... ... ... \n",
|
||||
"886 0 211536 13.0000 NaN S \n",
|
||||
"887 0 112053 30.0000 B42 S \n",
|
||||
"888 2 W./C. 6607 23.4500 NaN S \n",
|
||||
"889 0 111369 30.0000 C148 C \n",
|
||||
"890 0 370376 7.7500 NaN Q \n",
|
||||
"\n",
|
||||
"[891 rows x 12 columns]"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"from datacleaner import autoclean\n",
|
||||
"\n",
|
||||
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>108</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>523</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>190</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>596</td>\n",
|
||||
" <td>71.2833</td>\n",
|
||||
" <td>81</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>353</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>669</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>272</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>49</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>55</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>15</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>472</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>886</th>\n",
|
||||
" <td>887</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>548</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>27.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>101</td>\n",
|
||||
" <td>13.0000</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>887</th>\n",
|
||||
" <td>888</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>303</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>19.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>14</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>888</th>\n",
|
||||
" <td>889</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>413</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>28.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>675</td>\n",
|
||||
" <td>23.4500</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>889</th>\n",
|
||||
" <td>890</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>81</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>8</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>60</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>890</th>\n",
|
||||
" <td>891</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>220</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>32.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>466</td>\n",
|
||||
" <td>7.7500</td>\n",
|
||||
" <td>47</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>891 rows × 12 columns</p>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \\\n",
|
||||
"0 1 0 3 108 1 22.0 1 0 523 \n",
|
||||
"1 2 1 1 190 0 38.0 1 0 596 \n",
|
||||
"2 3 1 3 353 0 26.0 0 0 669 \n",
|
||||
"3 4 1 1 272 0 35.0 1 0 49 \n",
|
||||
"4 5 0 3 15 1 35.0 0 0 472 \n",
|
||||
".. ... ... ... ... ... ... ... ... ... \n",
|
||||
"886 887 0 2 548 1 27.0 0 0 101 \n",
|
||||
"887 888 1 1 303 0 19.0 0 0 14 \n",
|
||||
"888 889 0 3 413 0 28.0 1 2 675 \n",
|
||||
"889 890 1 1 81 1 26.0 0 0 8 \n",
|
||||
"890 891 0 3 220 1 32.0 0 0 466 \n",
|
||||
"\n",
|
||||
" Fare Cabin Embarked \n",
|
||||
"0 7.2500 47 2 \n",
|
||||
"1 71.2833 81 0 \n",
|
||||
"2 7.9250 47 2 \n",
|
||||
"3 53.1000 55 2 \n",
|
||||
"4 8.0500 47 2 \n",
|
||||
".. ... ... ... \n",
|
||||
"886 13.0000 47 2 \n",
|
||||
"887 30.0000 30 2 \n",
|
||||
"888 23.4500 47 2 \n",
|
||||
"889 30.0000 60 0 \n",
|
||||
"890 7.7500 47 1 \n",
|
||||
"\n",
|
||||
"[891 rows x 12 columns]"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_clean = autoclean(df, copy=True)\n",
|
||||
"df_clean"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
|
||||
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": true
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
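The `autoclean(df, copy=True)` call in the notebook above fills missing values and turns text columns into integers. A minimal pandas-only sketch of that idea follows. It is not datacleaner's actual implementation: the helper name `simple_autoclean` and the toy data are hypothetical, and it assumes median imputation for numeric columns, mode imputation for the rest, and integer codes assigned in sorted category order.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def simple_autoclean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute missing values, then label-encode text columns."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        if is_numeric_dtype(out[col]):
            # Numeric column: replace missing values with the column median
            out[col] = out[col].fillna(out[col].median())
        else:
            # Non-numeric column: replace missing values with the most frequent value
            mode = out[col].mode(dropna=True)
            if not mode.empty:
                out[col] = out[col].fillna(mode.iloc[0])
            # Encode each category as an integer, in sorted category order
            categories = sorted(out[col].dropna().unique())
            out[col] = pd.Categorical(out[col], categories=categories).codes
    return out


# Toy data (made up for illustration, loosely shaped like the Titanic columns)
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", None, "female"],
    "Embarked": ["S", "C", "S", "Q"],
})
cleaned = simple_autoclean(toy)
print(cleaned)
```

Encoding names or ticket numbers this way gives every distinct string its own integer, which is rarely a meaningful feature, so such columns are usually dropped or engineered by hand before modelling.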
|
578
ml21/preprocessing/11_3_autoclean.ipynb
Normal file
@@ -0,0 +1,578 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "849ad57e-6adb-4c2e-afd6-73db37eef572",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "179cc802-9f1d-40b0-bf0c-9d4fb7ea1262",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9858d815-0390-4e77-a5ff-a8d2a1960981",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "238bab60-75f0-4d29-ab05-66afc463b506",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Autoclean\n",
|
||||
"A simple library to clean data. [AutoClean](https://github.com/elisemercury/AutoClean) supports:\n",
|
||||
"\n",
|
||||
"* Handling of duplicates\n",
|
||||
"* Various imputation methods for missing values\n",
|
||||
"* Handling of outliers\n",
|
||||
"* Encoding of categorical data (OneHot, Label)\n",
|
||||
"* Extraction of datetime values\n",
|
||||
"\n",
|
||||
"Install the package: **pip install py-AutoClean**.\n",
|
||||
"\n",
|
||||
"Parameters:\n",
|
||||
"\n",
|
||||
"* **duplicates**\n",
|
||||
" * default: False,\n",
|
||||
" * other values: 'auto', True\n",
|
||||
"* **missing_num**\n",
|
||||
" * default: False,\n",
|
||||
" * other values:\t'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False\n",
|
||||
"* **missing_categ**\n",
|
||||
" * default: False,\n",
|
||||
" * other values:\t'auto', 'logreg', 'knn', 'most_frequent', 'delete', False\n",
|
||||
"* **encode_categ**\n",
|
||||
" * default: False,\n",
|
||||
" * other values: 'auto', ['onehot'], ['label'], False; to encode only specific columns, add a list of column names or indexes, e.g. ['auto', ['col1', 2]]\n",
|
||||
"* **extract_datetime**\n",
|
||||
" * default:\tFalse,\n",
|
||||
" * other values:\t'auto', 'D', 'M', 'Y', 'h', 'm', 's'\n",
|
||||
"* **outliers**\n",
|
||||
" * default:\tFalse,\n",
|
||||
" * other values:\t'auto', 'winz', 'delete'\n",
|
||||
"* **outlier_param**\n",
" * default: 1.5,\n",
" * other values: any int or float, False\n",
|
||||
"* **logfile**\n",
|
||||
" * default: True,\n",
|
||||
" * other values:\tFalse\n",
|
||||
"* **verbose**\n",
|
||||
" * default: False,\n",
|
||||
" * other values:\tTrue"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"id": "491b034b-994e-4f06-b4bc-df0590a62aab",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Braund, Mr. Owen Harris</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>A/5 21171</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>PC 17599</td>\n",
|
||||
" <td>71.2833</td>\n",
|
||||
" <td>C85</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Heikkinen, Miss. Laina</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>STON/O2. 3101282</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>113803</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>C123</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Allen, Mr. William Henry</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>373450</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>...</th>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" <td>...</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>886</th>\n",
|
||||
" <td>887</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>Montvila, Rev. Juozas</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>27.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>211536</td>\n",
|
||||
" <td>13.0000</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>887</th>\n",
|
||||
" <td>888</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Graham, Miss. Margaret Edith</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>19.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>112053</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>B42</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>888</th>\n",
|
||||
" <td>889</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>W./C. 6607</td>\n",
|
||||
" <td>23.4500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>889</th>\n",
|
||||
" <td>890</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Behr, Mr. Karl Howell</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>111369</td>\n",
|
||||
" <td>30.0000</td>\n",
|
||||
" <td>C148</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>890</th>\n",
|
||||
" <td>891</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Dooley, Mr. Patrick</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>32.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>370376</td>\n",
|
||||
" <td>7.7500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>Q</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"<p>891 rows × 12 columns</p>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass \\\n",
|
||||
"0 1 0 3 \n",
|
||||
"1 2 1 1 \n",
|
||||
"2 3 1 3 \n",
|
||||
"3 4 1 1 \n",
|
||||
"4 5 0 3 \n",
|
||||
".. ... ... ... \n",
|
||||
"886 887 0 2 \n",
|
||||
"887 888 1 1 \n",
|
||||
"888 889 0 3 \n",
|
||||
"889 890 1 1 \n",
|
||||
"890 891 0 3 \n",
|
||||
"\n",
|
||||
" Name Sex Age SibSp \\\n",
|
||||
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
||||
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
||||
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
||||
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
||||
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
||||
".. ... ... ... ... \n",
|
||||
"886 Montvila, Rev. Juozas male 27.0 0 \n",
|
||||
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
|
||||
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
|
||||
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
|
||||
"890 Dooley, Mr. Patrick male 32.0 0 \n",
|
||||
"\n",
|
||||
" Parch Ticket Fare Cabin Embarked \n",
|
||||
"0 0 A/5 21171 7.2500 NaN S \n",
|
||||
"1 0 PC 17599 71.2833 C85 C \n",
|
||||
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
||||
"3 0 113803 53.1000 C123 S \n",
|
||||
"4 0 373450 8.0500 NaN S \n",
|
||||
".. ... ... ... ... ... \n",
|
||||
"886 0 211536 13.0000 NaN S \n",
|
||||
"887 0 112053 30.0000 B42 S \n",
|
||||
"888 2 W./C. 6607 23.4500 NaN S \n",
|
||||
"889 0 111369 30.0000 C148 C \n",
|
||||
"890 0 370376 7.7500 NaN Q \n",
|
||||
"\n",
|
||||
"[891 rows x 12 columns]"
|
||||
]
|
||||
},
|
||||
"execution_count": 29,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"from AutoClean import AutoClean\n",
|
||||
"\n",
|
||||
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 36,
|
||||
"id": "d842eedf-3971-4966-a8b4-543bb56dd60d",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"AutoClean process completed in 0.289385 seconds\n",
|
||||
"Logfile saved to: /home/cif/GoogleDrive/cursos/summer-school-romania/2019/notebooks/preprocessing/autoclean.log\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"autoclean = AutoClean(df, mode='auto')\n",
|
||||
"\n",
|
||||
"# We can control the preprocessing\n",
|
||||
"#autoclean = AutoClean(df, mode='auto', duplicates=False, missing_num=False, missing_categ=False, encode_categ=False, extract_datetime=False, outliers=False, outlier_param=1.5, logfile=True, verbose=False)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 38,
|
||||
"id": "4ede7c55-475a-4748-8cc4-788f46c88b26",
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" <th>Sex_female</th>\n",
|
||||
" <th>Sex_male</th>\n",
|
||||
" <th>Embarked_C</th>\n",
|
||||
" <th>Embarked_Q</th>\n",
|
||||
" <th>Embarked_S</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Braund, Mr. Owen Harris</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>A/5 21171</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>C128</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>PC 17599</td>\n",
|
||||
" <td>65.6344</td>\n",
|
||||
" <td>C85</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Heikkinen, Miss. Laina</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>STON/O2. 3101282</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>C128</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>113803</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>C123</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Allen, Mr. William Henry</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>373450</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>C128</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass \\\n",
|
||||
"0 1 0 3 \n",
|
||||
"1 2 1 1 \n",
|
||||
"2 3 1 3 \n",
|
||||
"3 4 1 1 \n",
|
||||
"4 5 0 3 \n",
|
||||
"\n",
|
||||
" Name Sex Age SibSp \\\n",
|
||||
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
||||
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
||||
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
||||
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
||||
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
||||
"\n",
|
||||
" Parch Ticket Fare Cabin Embarked Sex_female Sex_male \\\n",
|
||||
"0 0 A/5 21171 7.2500 C128 S False True \n",
|
||||
"1 0 PC 17599 65.6344 C85 C True False \n",
|
||||
"2 0 STON/O2. 3101282 7.9250 C128 S True False \n",
|
||||
"3 0 113803 53.1000 C123 S True False \n",
|
||||
"4 0 373450 8.0500 C128 S False True \n",
|
||||
"\n",
|
||||
" Embarked_C Embarked_Q Embarked_S \n",
|
||||
"0 False False True \n",
|
||||
"1 True False False \n",
|
||||
"2 False False True \n",
|
||||
"3 False False True \n",
|
||||
"4 False False True "
|
||||
]
|
||||
},
|
||||
"execution_count": 38,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_clean = autoclean.output\n",
|
||||
"df_clean[0:5]"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
502
ml21/preprocessing/5_Duplicated_Values.ipynb
Normal file
@@ -0,0 +1,502 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Duplicated values\n",
|
||||
"\n",
|
||||
"There are two possible approaches: **remove** these rows or **filling** them. It depends on every case.\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Filling NaN values\n",
|
||||
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
|
||||
"\n",
|
||||
"* For **string** fields, we can fill NaN with **' '**.\n",
|
||||
"\n",
|
||||
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Fill NaN with ' '\n",
|
||||
"df['col'] = df['col'].fillna(' ')\n",
|
||||
"# Fill NaN with 99\n",
|
||||
"df['col'] = df['col'].fillna(99)\n",
|
||||
"# Fill NaN with the mean of the column\n",
|
||||
"df['col'] = df['col'].fillna(df['col'].mean())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Propagate non-null values forward or backwards\n",
|
||||
"You can also propagate non-null values forward or backwards by putting\n",
|
||||
"method=’pad’ as the method argument. It will fill the next value in the\n",
|
||||
"dataframe with the previous non-NaN value. Maybe you just want to fill one\n",
|
||||
"value ( limit=1 )or you want to fill all the values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 4.0\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# We fill forward the value 4.0 and fill the next one (limit = 1)\n",
|
||||
"df.fillna(method='pad', limit=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"We can also backfilling with **bfill**."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 2.0\n",
|
||||
"1 2.0\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Fill the first two NaN values with the first available value\n",
|
||||
"df.fillna(method='bfill')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Removing NaN values\n",
|
||||
"We can remove them by row or column."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"/# Drop any rows which have any nans\n",
|
||||
"df.dropna()\n",
|
||||
"/# Drop columns that have any nans\n",
|
||||
"df.dropna(axis=1)\n",
|
||||
"/# Only drop columns which have at least 90% non-NaNs\n",
|
||||
"df.dropna(thresh=int(df.shape[0] * .9), axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
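The forward- and backward-fill behaviour shown in 5_Duplicated_Values.ipynb can be sketched in plain Python. This is only an illustration of the semantics, assuming missing values are stored as float NaN; the helpers `forward_fill` and `backward_fill` are hypothetical names and are not part of pandas.

```python
import math


def forward_fill(values, limit=None):
    """Replace NaN entries with the last valid value seen before them.

    `limit` caps how many consecutive NaNs are filled after each valid
    value; None means no cap. Leading NaNs have nothing to copy from
    and are left untouched.
    """
    filled = []
    last_valid = None  # most recent non-NaN value, if any
    run = 0            # NaNs already filled since last_valid
    for value in values:
        is_missing = isinstance(value, float) and math.isnan(value)
        if not is_missing:
            last_valid = value
            run = 0
            filled.append(value)
        elif last_valid is not None and (limit is None or run < limit):
            filled.append(last_valid)
            run += 1
        else:
            filled.append(value)  # keep the NaN
    return filled


def backward_fill(values, limit=None):
    """Backward fill is a forward fill applied to the reversed sequence."""
    return forward_fill(values[::-1], limit)[::-1]


# Same column as in the notebook: two leading and two trailing NaNs.
col = [math.nan, math.nan, 2.0, 3.0, 4.0, math.nan, math.nan]
forward = forward_fill(col, limit=1)   # only the first trailing NaN becomes 4.0
backward = backward_fill(col)          # the two leading NaNs become 2.0
print(forward)
print(backward)
```

Writing the backward fill as a reversed forward fill keeps the `limit` logic in one place; in real projects the pandas methods shown in the notebook remain the better choice.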
619
ml21/preprocessing/9_String_Data.ipynb
Normal file
@@ -0,0 +1,619 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# String Data\n",
|
||||
"It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).\n",
|
||||
"\n",
|
||||
"We can do it using regular expressions or specific libraries."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Beautifier\n",
|
||||
"Simple [library](https://github.com/labtocat/beautifier) to cleanup and prettify url patterns, domains and so on. Library helps to clean unicodes, special characters and unnecessary redirection patterns from the urls and gives you clean date.\n",
|
||||
"\n",
|
||||
"Install with **'pip install beautifier'**."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Email cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from beautifier import Email\n",
|
||||
"email = Email('me@imsach.in')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'imsach.in'"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.domain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'me'"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.username"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"False"
|
||||
]
|
||||
},
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email.is_free_email"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"email2 = Email('This my address')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 15,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"False"
|
||||
]
|
||||
},
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email2.is_valid"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"email3 = Email('pepe@gmail.com')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email3.is_valid"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"True"
|
||||
]
|
||||
},
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"email3.is_free_email"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## URL cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 29,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from beautifier import Url\n",
|
||||
"url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 31,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'https://in.linkedin.com/in/sachinphilip'"
|
||||
]
|
||||
},
|
||||
"execution_count": 31,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.cleanup"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'in.linkedin.com'"
|
||||
]
|
||||
},
|
||||
"execution_count": 33,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.domain"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 35,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']"
|
||||
]
|
||||
},
|
||||
"execution_count": 35,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.param"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 37,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'"
|
||||
]
|
||||
},
|
||||
"execution_count": 37,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.parameters"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 39,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"'sachinphilip'"
|
||||
]
|
||||
},
|
||||
"execution_count": 39,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"url.username"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Unicode\n",
|
||||
"Problem: Some unicode code has been broken. We see the character in a different character dataset.\n",
|
||||
"\n",
|
||||
"A **mojibake** is a character displayed in an unintended character enconding. Example: \"<22>\").\n",
|
||||
"\n",
|
||||
"We will use the library **ftfy** (fixed text for you) to fix it.\n",
|
||||
"\n",
|
||||
"First, you should install the library: ***conda install ftfy**. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 41,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"¯\\_(ツ)_/¯\n",
|
||||
"Party\n",
|
||||
"I'm\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import ftfy\n",
|
||||
"foo = '¯\\\\_(ã\\x83\\x84)_/¯'\n",
|
||||
"bar = '\\ufeffParty'\n",
|
||||
"baz = '\\001\\033[36;44mI’m'\n",
|
||||
"print(ftfy.fix_text(foo))\n",
|
||||
"print(ftfy.fix_text(bar))\n",
|
||||
"print(ftfy.fix_text(baz))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"We can understand which heuristics ftfy is using."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"ename": "NameError",
|
||||
"evalue": "name 'ftfy' is not defined",
|
||||
"output_type": "error",
|
||||
"traceback": [
|
||||
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
||||
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
|
||||
"\u001b[0;32m<ipython-input-1-4030b963ff0a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mftfy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexplain_unicode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfoo\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
|
||||
"\u001b[0;31mNameError\u001b[0m: name 'ftfy' is not defined"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"ftfy.explain_unicode(foo)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Dates\n",
|
||||
"Sometimes we want to extract date from text. We can use regular expressions or handy packages, such as **python-dateutil**.\n",
|
||||
"\n",
|
||||
"Install the library: **pip install python-dateutil**."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"2019-08-22 10:22:46+00:00\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from dateutil.parser import parse\n",
|
||||
"now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n",
|
||||
"print(now)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"2019-08-22 10:20:00\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n",
|
||||
"print(dt)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)\n",
|
||||
"* Beautifier https://github.com/labtocat/beautifier\n",
|
||||
"* Ftfy https://ftfy.readthedocs.io/en/latest/\n",
|
||||
"* python-dateutil https://dateutil.readthedocs.io/en/stable/"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.4"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
BIN
ml21/preprocessing/images/EscUpmPolit_p.gif
Normal file
After Width: | Height: | Size: 3.1 KiB |
BIN
ml21/preprocessing/images/titanic.jpg
Normal file
After Width: | Height: | Size: 152 KiB |
1
ml21/visualization/.gitkeep
Normal file
@@ -0,0 +1 @@
|
||||
|
185
ml21/visualization/00_Intro_Visualization.ipynb
Normal file
@@ -0,0 +1,185 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Introduction to Visualization\n",
|
||||
" \n",
|
||||
"In this session, we will get more insight regarding how to visualize data.\n",
|
||||
"\n",
|
||||
"# Objectives\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understanding how to visualize data\n",
|
||||
"* Understanding the purpose of different charts \n",
|
||||
"* Experimenting with several environments for visualizing data\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Seaborn\n",
|
||||
"\n",
|
||||
"Seaborn is a library that visualizes data in Python. The main characteristics are:\n",
|
||||
"\n",
|
||||
"* A dataset-oriented API for examining relationships between multiple variables\n",
|
||||
"* Specialized support for using categorical variables to show observations or aggregate statistics\n",
|
||||
"* Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n",
|
||||
"* Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n",
|
||||
"* Convenient views of the overall structure of complex datasets\n",
|
||||
"* High-level abstractions for structuring multi-plot grids that let you quickly build complex visualizations\n",
|
||||
"* Concise control over matplotlib figure styling with several built-in themes\n",
|
||||
"* Tools for choosing color palettes that faithfully reveal patterns in your data\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Install\n",
|
||||
"Use:\n",
|
||||
"\n",
|
||||
"**conda install seaborn**\n",
|
||||
"\n",
|
||||
"or \n",
|
||||
"\n",
|
||||
"**pip install seaborn**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"1. [Home](00_Intro_Visualization.ipynb)\n",
|
||||
"2. [Dataset](01_Dataset.ipynb)\n",
|
||||
"3. [Comparison Charts](02_Comparison_Charts.ipynb)\n",
|
||||
" 1. [More Comparison Charts](02_01_More_Comparison_Charts.ipynb)\n",
|
||||
"4. [Distribution Charts](03_Distribution_Charts.ipynb)\n",
|
||||
"5. [Hierarchical charts](04_Hierarchical_Charts.ipynb)\n",
|
||||
"6. [Relational charts](05_Relational_Charts.ipynb)\n",
|
||||
"7. [Spatial charts](06_Spatial_Charts.ipynb)\n",
|
||||
"8. [Temporal charts](07_Temporal_Charts.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
363
ml21/visualization/01_Dataset.ipynb
Normal file
@@ -0,0 +1,363 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Visualization](00_Intro_Visualization.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Dataset\n",
|
||||
"Seaborn includes several datasets. We can consult the available datasets and load them. \n",
|
||||
"\n",
|
||||
"The datasets are also available at https://github.com/mwaskom/seaborn-data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"import seaborn as sns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['anagrams',\n",
|
||||
" 'anscombe',\n",
|
||||
" 'attention',\n",
|
||||
" 'brain_networks',\n",
|
||||
" 'car_crashes',\n",
|
||||
" 'diamonds',\n",
|
||||
" 'dots',\n",
|
||||
" 'dowjones',\n",
|
||||
" 'exercise',\n",
|
||||
" 'flights',\n",
|
||||
" 'fmri',\n",
|
||||
" 'geyser',\n",
|
||||
" 'glue',\n",
|
||||
" 'healthexp',\n",
|
||||
" 'iris',\n",
|
||||
" 'mpg',\n",
|
||||
" 'penguins',\n",
|
||||
" 'planets',\n",
|
||||
" 'seaice',\n",
|
||||
" 'taxis',\n",
|
||||
" 'tips',\n",
|
||||
" 'titanic']"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sns.get_dataset_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>total_bill</th>\n",
|
||||
" <th>tip</th>\n",
|
||||
" <th>sex</th>\n",
|
||||
" <th>smoker</th>\n",
|
||||
" <th>day</th>\n",
|
||||
" <th>time</th>\n",
|
||||
" <th>size</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>16.99</td>\n",
|
||||
" <td>1.01</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>10.34</td>\n",
|
||||
" <td>1.66</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>21.01</td>\n",
|
||||
" <td>3.50</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>23.68</td>\n",
|
||||
" <td>3.31</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>24.59</td>\n",
|
||||
" <td>3.61</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>25.29</td>\n",
|
||||
" <td>4.71</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>8.77</td>\n",
|
||||
" <td>2.00</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>7</th>\n",
|
||||
" <td>26.88</td>\n",
|
||||
" <td>3.12</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>8</th>\n",
|
||||
" <td>15.04</td>\n",
|
||||
" <td>1.96</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>9</th>\n",
|
||||
" <td>14.78</td>\n",
|
||||
" <td>3.23</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" total_bill tip sex smoker day time size\n",
|
||||
"0 16.99 1.01 Female No Sun Dinner 2\n",
|
||||
"1 10.34 1.66 Male No Sun Dinner 3\n",
|
||||
"2 21.01 3.50 Male No Sun Dinner 3\n",
|
||||
"3 23.68 3.31 Male No Sun Dinner 2\n",
|
||||
"4 24.59 3.61 Female No Sun Dinner 4\n",
|
||||
"5 25.29 4.71 Male No Sun Dinner 4\n",
|
||||
"6 8.77 2.00 Male No Sun Dinner 2\n",
|
||||
"7 26.88 3.12 Male No Sun Dinner 4\n",
|
||||
"8 15.04 1.96 Male No Sun Dinner 2\n",
|
||||
"9 14.78 3.23 Male No Sun Dinner 2"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df = sns.load_dataset('tips')\n",
|
||||
"df.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Seaborn](http://seaborn.pydata.org/index.html) documentation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
3192
ml21/visualization/02_01_More_Comparison_Charts.ipynb
Normal file
561
ml21/visualization/02_Comparison_Charts.ipynb
Normal file
1235
ml21/visualization/03_Distribution_Charts.ipynb
Normal file
2126
ml21/visualization/04_Hierarchical_Charts.ipynb
Normal file
500
ml21/visualization/05_Relational_Charts.ipynb
Normal file
689
ml21/visualization/06_Spatial_Charts.ipynb
Normal file
451
ml21/visualization/07_Temporal_Charts.ipynb
Normal file
BIN
ml21/visualization/images/EscUpmPolit_p.gif
Normal file
After Width: | Height: | Size: 3.1 KiB |
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -27,14 +27,14 @@
|
||||
"source": [
|
||||
"# Introduction to Neural Networks\n",
|
||||
" \n",
|
||||
"In this lab session, we are going to learn how to train a neural network.\n",
|
||||
"In this lab session, we will learn how to train a neural network.\n",
|
||||
"\n",
|
||||
"# Objectives\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Put in practice the notions learn in class about neural computing\n",
|
||||
"* Understand what an MLP is\n",
|
||||
"* Learn to use some libraries, such as scikit-learn "
|
||||
"* Learn to use some libraries, such as Scikit-learn."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -58,7 +58,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -39,7 +39,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Multilayer perceptrons, also called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
|
||||
"Multilayer perceptrons, called feedforward neural networks or deep feedforward networks, are the most basic deep learning models."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -58,7 +58,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this notebook we are going to try the spiral dataset with different algorthms. In particular, we are going to focus our attention on the MLP classifier.\n",
|
||||
"In this notebook, we will try the spiral dataset with different algorithms. In particular, we are going to focus our attention on the MLP classifier.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Answer directly in your copy of the exercise and submit it as a Moodle task."
|
||||
|
@@ -4,7 +4,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -39,10 +39,10 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"In this notebook we are going to apply a MLP to a simple regression task: learning the Fresnel functions.\n",
|
||||
"In this notebook, we are going to apply an MLP to a simple regression task: learning the Fresnel functions.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Answer directly in your copy of the exercise and submit it as a moodle task."
|
||||
"Answer directly in your copy of the exercise and submit it as a Moodle task."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -92,7 +92,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Change this variables to change the train and test dataset."
|
||||
"Change these variables to change the train and test dataset."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -15,7 +15,7 @@ def gen_spiral_dataset(n_examples=500, n_classes=2, a=None, b=None, pi_space=3):
|
||||
theta = np.linspace(0,pi_space*pi, num=n_examples)
|
||||
xy = np.zeros((n_examples,2))
|
||||
|
||||
# logaritmic spirals
|
||||
# logarithmic spirals
|
||||
x_golden_parametric = lambda a, b, theta: a**(theta*b) * cos(theta)
|
||||
y_golden_parametric = lambda a, b, theta: a**(theta*b) * sin(theta)
|
||||
x_golden_parametric = np.vectorize(x_golden_parametric)
|
||||
|
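The logarithmic spiral in the hunk above follows r = a**(theta*b), with x = r*cos(theta) and y = r*sin(theta). A minimal pure-Python sketch of that formula is given below. The helper name `log_spiral_points` and the sample values `a=2.0`, `b=0.1` are assumptions for illustration only, not part of the original module.

```python
import math


def log_spiral_points(a, b, n_points=5, pi_space=3):
    """Return (x, y) points on the spiral r = a**(theta*b).

    theta is sampled evenly in [0, pi_space*pi], endpoints included.
    Hypothetical helper; requires n_points >= 2.
    """
    points = []
    for i in range(n_points):
        theta = pi_space * math.pi * i / (n_points - 1)
        r = a ** (theta * b)  # radius grows exponentially with the angle
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


# Illustrative values only: the source leaves a and b as None by default.
points = log_spiral_points(a=2.0, b=0.1)
```

The first point always lies at (1, 0), because theta = 0 gives r = a**0 = 1.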
@@ -48,7 +48,7 @@
|
||||
"# Introduction\n",
|
||||
"The purpose of this practice is to understand better how GAs work. \n",
|
||||
"\n",
|
||||
"There are many libraries that implement GAs, you can find some of then in the [References](#References) section."
|
||||
"There are many libraries that implement GAs; you can find some of them in the [References](#References) section."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -56,7 +56,7 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Genetic Algorithms\n",
|
||||
"In this section we are going to use the library DEAP [[References](#References)] for implementing a genetic algorithms.\n",
|
||||
"In this section, we are going to use the library [DEAP](https://github.com/DEAP/deap/tree/master) for implementing a genetic algorithm.\n",
|
||||
"\n",
|
||||
"We are going to implement the OneMax problem as seen in class.\n",
|
||||
"\n",
|
||||
@@ -187,9 +187,9 @@
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Comparing\n",
|
||||
"Your task is modify the previous code to canonical GA configuration from Holland (look at the lesson's slides). In addition you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
|
||||
"Your task is to modify the previous code to the canonical GA configuration from Holland (look at the lesson's slides). In addition, you should consult the [DEAP API](http://deap.readthedocs.io/en/master/api/tools.html#operators).\n",
|
||||
"\n",
|
||||
"Submit your notebook and include a the modified code, and a comparison of the effects of these changes. \n",
|
||||
"Submit your notebook and include the modified code and a comparison of the effects of these changes. \n",
|
||||
"\n",
|
||||
"Discuss your findings."
|
||||
]
|
||||
@@ -198,33 +198,24 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Optimizing ML hyperparameters\n",
|
||||
"## Optional. Optimizing ML hyperparameters\n",
|
||||
"\n",
|
||||
"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously we have used GridSearch from Scikit. Using (sklearn-deap)[[References](#References)], optimize the Titatic hyperparameters using both GridSearch and Genetic Algorithms. \n",
|
||||
"One of the applications of Genetic Algorithms is the optimization of ML hyperparameters. Previously, we have used GridSearch from Scikit. Using [sklearn-deap](https://github.com/rsteca/sklearn-deap), optimize the Titanic hyperparameters using both GridSearch and Genetic Algorithms. \n",
|
||||
"\n",
|
||||
"The same exercise (using the digits dataset) can be found in this [notebook](https://github.com/rsteca/sklearn-deap/blob/master/test.ipynb).\n",
|
||||
"\n",
|
||||
"Submit a notebook where you include well-crafted conclusions about the exercises, discussing the pros and cons of using genetic algorithms for this purpose.\n",
|
||||
"Since there is a problem with Scikit version 0.24, you can just comment on the different approaches.",
|
||||
"\n",
|
||||
"Note: There is a problem with the version 0.24 of scikit. Just comment the different approaches."
|
||||
"Alternatively, you can also use the library [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/index.html) and discuss the digit classification example included in the library: [digits decision tree](https://sklearn-genetic-opt.readthedocs.io/en/stable/notebooks/Digits_decision_tree.html)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Optional exercises\n",
|
||||
"## Optional. Optimizing an ML pipeline with a genetic algorithm\n",
|
||||
"\n",
|
||||
"Here there is a proposed optional exercise."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Optimizing a ML pipeline with a genetic algorithm\n",
|
||||
"\n",
|
||||
"The library [TPOT](#References) optimizes ML pipelines and comes with a lot of (examples)[https://epistasislab.github.io/tpot/examples/] and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
|
||||
"The library [TPOT](https://epistasislab.github.io/tpot/latest/) optimizes ML pipelines and comes with a lot of [examples](https://epistasislab.github.io/tpot/latest/Tutorial/9_Genetic_Algorithm_Overview/) and even notebooks, for example for the [iris dataset](https://github.com/EpistasisLab/tpot/blob/master/tutorials/IRIS.ipynb).\n",
|
||||
"\n",
|
||||
"Your task is to apply TPOT to the intermediate challenge and write a short essay explaining:\n",
|
||||
"* what TPOT does (with your own words).\n",
|
||||
@@ -242,7 +233,8 @@
|
||||
"* [tpot](http://epistasislab.github.io/tpot/)\n",
|
||||
"* [gplearn](http://gplearn.readthedocs.io/en/latest/index.html)\n",
|
||||
"* [scikit-allel](https://scikit-allel.readthedocs.io/en/latest/)\n",
|
||||
"* [scklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)"
|
||||
"* [sklearn-genetic](https://github.com/manuel-calzolari/sklearn-genetic)\n",
|
||||
"* [sklearn-genetic-opt](https://sklearn-genetic-opt.readthedocs.io/en/stable/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -256,7 +248,7 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
|
@@ -48,7 +48,9 @@
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"1. [Q-Learning](2_6_1_Q-Learning.ipynb)"
|
||||
"1. [Q-Learning](2_6_1_Q-Learning_Basic.ipynb)\n",
|
||||
"1. [Visualization](2_6_1_Q-Learning_Visualization.ipynb)\n",
|
||||
"1. [Exercises](2_6_1_Q-Learning_Exercises.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -64,7 +66,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -78,7 +80,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.1"
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
|
@@ -1,455 +0,0 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2018 Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Table of Contents\n",
|
||||
"\n",
|
||||
"* [Introduction](#Introduction)\n",
|
||||
"* [Getting started with OpenAI Gym](#Getting-started-with-OpenAI-Gym)\n",
|
||||
"* [The Frozen Lake scenario](#The-Frozen-Lake-scenario)\n",
|
||||
"* [Q-Learning with the Frozen Lake scenario](#Q-Learning-with-the-Frozen-Lake-scenario)\n",
|
||||
"* [Exercises](#Exercises)\n",
|
||||
"* [Optional exercises](#Optional-exercises)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Introduction\n",
|
||||
"The purpose of this practice is to understand better Reinforcement Learning (RL) and, in particular, Q-Learning.\n",
|
||||
"\n",
|
||||
"We are going to use [OpenAI Gym](https://gym.openai.com/). OpenAI Gym is a toolkit for developing and comparing RL algorithms. Take a look at their [website](https://gym.openai.com/).\n",
|
||||
"\n",
|
||||
"It implements [algorithm imitation](http://gym.openai.com/envs/#algorithmic), [classic control problems](http://gym.openai.com/envs/#classic_control), [Atari games](http://gym.openai.com/envs/#atari), [Box2D continuous control](http://gym.openai.com/envs/#box2d), [robotics with MuJoCo, Multi-Joint dynamics with Contact](http://gym.openai.com/envs/#mujoco), and [simple text based environments](http://gym.openai.com/envs/#toy_text).\n",
|
||||
"\n",
|
||||
"This notebook is based on [Diving deeper into Reinforcement Learning with Q-Learning](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"\n",
|
||||
"First of all, install the OpenAI Gym library:\n",
|
||||
"\n",
|
||||
"```console\n",
|
||||
"foo@bar:~$ pip install gym\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"If you get the error message 'NotImplementedError: abstract', [execute](https://github.com/openai/gym/issues/775) \n",
|
||||
"```console\n",
|
||||
"foo@bar:~$ pip install pyglet==1.2.4\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"If you want to try the Atari environment, it is better to opt for the full installation from the source. Follow the instructions at [OpenGym](https://github.com/openai/gym#id15).\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Getting started with OpenAI Gym\n",
|
||||
"\n",
|
||||
"First of all, read the [introduction](http://gym.openai.com/docs/#getting-started-with-gym) of OpenAI Gym."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Environments\n",
|
||||
"OpenAI Gym provides a number of problems called *environments*. \n",
|
||||
"\n",
|
||||
"Try 'CartPole-v0' (or 'MountainCar')."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import gym\n",
|
||||
"\n",
|
||||
"env = gym.make(\"CartPole-v1\")\n",
|
||||
"#env = gym.make('MountainCar-v0')\n",
|
||||
"#env = gym.make('Taxi-v2')\n",
|
||||
"\n",
|
||||
"observation = env.reset()\n",
|
||||
"for _ in range(1000):\n",
|
||||
"    env.render()\n",
|
||||
"    action = env.action_space.sample() # your agent here (this takes random actions)\n",
|
||||
"    observation, reward, done, info = env.step(action)\n",
|
||||
"\n",
|
||||
"    if done:\n",
|
||||
"        observation = env.reset()\n",
|
||||
"env.close()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"This will launch an external window with the game. If you cannot close that window, just execute in a code cell:\n",
|
||||
"\n",
|
||||
"```python\n",
|
||||
"env.close()\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"The full list of available environments can be found by printing the environment registry as follows."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gym import envs\n",
|
||||
"print(envs.registry.all())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The environment’s **step** function returns four values. These are:\n",
|
||||
"\n",
|
||||
"* **observation (object):** an environment-specific object representing your observation of the environment. For example, pixel data from a camera, joint angles and joint velocities of a robot, or the board state in a board game.\n",
|
||||
"* **reward (float):** amount of reward achieved by the previous action. The scale varies between environments, but the goal is always to increase your total reward.\n",
|
||||
"* **done (boolean):** whether it’s time to reset the environment again. Most (but not all) tasks are divided up into well-defined episodes, and done being True indicates the episode has terminated. (For example, perhaps the pole tipped too far, or you lost your last life.)\n",
|
||||
"* **info (dict):** diagnostic information useful for debugging. It can sometimes be useful for learning (for example, it might contain the raw probabilities behind the environment’s last state change). However, official evaluations of your agent are not allowed to use this for learning.\n",
|
||||
"\n",
|
||||
"The typical agent loop starts by calling the method *reset*, which provides an initial observation. Then the agent executes an action and receives the reward, the new observation, and a flag indicating whether the episode has finished (done is True).\n",
|
||||
"\n",
|
||||
"For example, analyze this sample agent loop, which runs 20 episodes of at most 100 timesteps each. The details of the previous variables for this game, as described [here](https://github.com/openai/gym/wiki/CartPole-v0), are:\n",
|
||||
"* **observation**: Cart Position, Cart Velocity, Pole Angle, Pole Velocity.\n",
|
||||
"* **action**: 0\t(Push cart to the left), 1\t(Push cart to the right).\n",
|
||||
"* **reward**: 1 for every step taken, including the termination step."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
"import gym\n",
"env = gym.make('CartPole-v0')\n",
"for i_episode in range(20):\n",
"    observation = env.reset()\n",
"    for t in range(100):\n",
"        env.render()\n",
"        print(observation)\n",
"        action = env.action_space.sample()\n",
"        print(\"Action \", action)\n",
"        observation, reward, done, info = env.step(action)\n",
"        print(\"Observation \", observation, \", reward \", reward, \", done \", done, \", info \" , info)\n",
"        if done:\n",
"            print(\"Episode finished after {} timesteps\".format(t+1))\n",
"            break"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# The Frozen Lake scenario\n",
|
||||
"We are going to play the [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/) game.\n",
|
||||
"\n",
|
||||
"The problem is a grid where you should go from the 'start' (S) position to the 'goal' position (G) (the pizza!). You can only walk through the 'frozen tiles' (F). Unfortunately, you can fall in a 'hole' (H).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise. The possible actions are going left, right, up or down. However, the ice is slippery, so you won't always move in the direction you intend.\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"Here you can see several episodes. A full recording is available at [Frozen Lake](http://gym.openai.com/envs/FrozenLake-v0/).\n",
|
||||
"\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Q-Learning with the Frozen Lake scenario\n",
|
||||
"We are now going to apply Q-Learning for the Frozen Lake scenario. This part of the notebook is taken from [here](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Q%20learning/Q%20Learning%20with%20FrozenLake.ipynb).\n",
|
||||
"\n",
|
||||
"First we create the environment and a Q-table initialized with zeros to store the value of each action in a given state. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import numpy as np\n",
|
||||
"import gym\n",
|
||||
"import random\n",
|
||||
"\n",
|
||||
"env = gym.make(\"FrozenLake-v0\")\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"action_size = env.action_space.n\n",
|
||||
"state_size = env.observation_space.n\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"qtable = np.zeros((state_size, action_size))\n",
|
||||
"print(qtable)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Now we define the hyperparameters."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Q-Learning hyperparameters\n",
|
||||
"total_episodes = 10000 # Total episodes\n",
|
||||
"learning_rate = 0.8 # Learning rate\n",
|
||||
"max_steps = 99 # Max steps per episode\n",
|
||||
"gamma = 0.95 # Discounting rate\n",
|
||||
"\n",
|
||||
"# Exploration hyperparameters\n",
|
||||
"epsilon = 1.0 # Exploration rate\n",
|
||||
"max_epsilon = 1.0 # Exploration probability at start\n",
|
||||
"min_epsilon = 0.01 # Minimum exploration probability \n",
|
||||
"decay_rate = 0.01 # Exponential decay rate for exploration prob"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"And now we implement the Q-Learning algorithm.\n",
|
||||
"\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
"# List of rewards\n",
"rewards = []\n",
"\n",
"# 2. For life or until learning is stopped\n",
"for episode in range(total_episodes):\n",
"    # Reset the environment\n",
"    state = env.reset()\n",
"    step = 0\n",
"    done = False\n",
"    total_rewards = 0\n",
"\n",
"    for step in range(max_steps):\n",
"        # 3. Choose an action a in the current world state (s)\n",
"        ## First we randomize a number\n",
"        exp_exp_tradeoff = random.uniform(0, 1)\n",
"\n",
"        ## If this number is greater than epsilon --> exploitation (taking the biggest Q value for this state)\n",
"        if exp_exp_tradeoff > epsilon:\n",
"            action = np.argmax(qtable[state,:])\n",
"\n",
"        # Else doing a random choice --> exploration\n",
"        else:\n",
"            action = env.action_space.sample()\n",
"\n",
"        # Take the action (a) and observe the outcome state(s') and reward (r)\n",
"        new_state, reward, done, info = env.step(action)\n",
"\n",
"        # Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]\n",
"        # qtable[new_state,:] : all the actions we can take from new state\n",
"        qtable[state, action] = qtable[state, action] + learning_rate * (reward + gamma * np.max(qtable[new_state, :]) - qtable[state, action])\n",
"\n",
"        total_rewards += reward\n",
"\n",
"        # The new state becomes the current state\n",
"        state = new_state\n",
"\n",
"        # If done (if we're dead) : finish episode\n",
"        if done:\n",
"            break\n",
"\n",
"    episode += 1\n",
"    # Reduce epsilon (because we need less and less exploration)\n",
"    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*episode)\n",
"    rewards.append(total_rewards)\n",
"\n",
"print (\"Score over time: \" + str(sum(rewards)/total_episodes))\n",
"print(qtable)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Finally, we use the learnt Q-table for playing the Frozen Lake game."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
"env.reset()\n",
"\n",
"for episode in range(5):\n",
"    state = env.reset()\n",
"    step = 0\n",
"    done = False\n",
"    print(\"****************************************************\")\n",
"    print(\"EPISODE \", episode)\n",
"\n",
"    for step in range(max_steps):\n",
"        env.render()\n",
"        # Take the action (index) that has the maximum expected future reward given that state\n",
"        action = np.argmax(qtable[state,:])\n",
"\n",
"        new_state, reward, done, info = env.step(action)\n",
"\n",
"        if done:\n",
"            break\n",
"        state = new_state\n",
"env.close()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises\n",
|
||||
"\n",
|
||||
"## Taxi\n",
|
||||
"Analyze the [Taxi problem](http://gym.openai.com/envs/Taxi-v2/) and solve it applying Q-Learning. You can find a solution as the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym).\n",
|
||||
"\n",
|
||||
"Analyze the impact of not changing the learning rate (alpha or epsilon, depending on the book) or changing it in a different way."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Optional exercises\n",
|
||||
"\n",
|
||||
"## Doom\n",
|
||||
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
|
||||
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
|
||||
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
|
||||
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
|
||||
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© 2018 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.9"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
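The Q-learning cell of the notebook above combines three ingredients: an epsilon-greedy choice, the update `Q(s,a) := Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]`, and an exponentially decaying epsilon. The following is a minimal sketch of the same loop that runs without Gym. The `ChainEnv` class and the `train` function are hypothetical names introduced only for this illustration (a deterministic five-tile corridor, not Frozen Lake), and the hyperparameter defaults simply mirror the values used in the notebook.

```python
import numpy as np


class ChainEnv:
    """Hypothetical 5-state corridor: start at state 0, goal at state 4 (reward 1)."""

    n_states, n_actions = 5, 2  # actions: 0 = left, 1 = right

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        # Clamp the position so the agent cannot leave the corridor
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, float(done), done


def train(env, episodes=500, lr=0.8, gamma=0.95,
          min_eps=0.01, max_eps=1.0, decay=0.01, max_steps=99, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((env.n_states, env.n_actions))
    for episode in range(episodes):
        # Same exponential schedule as in the notebook
        eps = min_eps + (max_eps - min_eps) * np.exp(-decay * episode)
        state = env.reset()
        for _ in range(max_steps):
            if rng.uniform() > eps:
                action = int(np.argmax(q[state]))  # exploitation
            else:
                action = int(rng.integers(env.n_actions))  # exploration
            new_state, reward, done = env.step(action)
            # Q(s,a) := Q(s,a) + lr [r + gamma * max Q(s',a') - Q(s,a)]
            q[state, action] += lr * (reward + gamma * np.max(q[new_state]) - q[state, action])
            state = new_state
            if done:
                break
    return q


qtable = train(ChainEnv())
greedy_policy = np.argmax(qtable, axis=1)
print(greedy_policy[:4])  # greedy action for each non-terminal state
```

Because the corridor is deterministic, the learned greedy policy can be checked by hand, which makes this a convenient sanity test before moving to the slippery Frozen Lake.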
1384
ml5/2_6_1_Q-Learning_Basic.ipynb
Normal file
138
ml5/2_6_1_Q-Learning_Exercises.ipynb
Normal file
@@ -0,0 +1,138 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos Á. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## [Introduction to Machine Learning V](2_6_0_Intro_RL.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Exercises\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"## Taxi\n",
|
||||
"Analyze the [Taxi problem](https://gymnasium.farama.org/environments/toy_text/taxi/) and solve it applying Q-Learning. You can find a solution similar to the one previously presented [here](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym), and the notebook is [here](https://github.com/wagonhelm/Reinforcement-Learning-Introduction/blob/master/Reinforcement%20Learning%20Introduction.ipynb). Take into account that Gymnasium has changed, so you will have to adapt the code.\n",
|
||||
"\n",
|
||||
"Analyze the impact of not changing the learning rate or changing it in a different way. "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Optional exercises\n",
|
||||
"Select one of the following exercises.\n",
|
||||
"\n",
|
||||
"## Blackjack\n",
|
||||
"Analyze how to apply Q-Learning for solving Blackjack.\n",
|
||||
"You can find information in this [article](https://gymnasium.farama.org/tutorials/training_agents/blackjack_tutorial/).\n",
|
||||
"\n",
|
||||
"## Doom\n",
|
||||
"Read this [article](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8) and execute the companion [notebook](https://github.com/simoninithomas/Deep_reinforcement_learning_Course/blob/master/Deep%20Q%20Learning/Doom/Deep%20Q%20learning%20with%20Doom.ipynb). Analyze the results and provide conclusions about DQN.\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## References\n",
|
||||
"* [Gymnasium documentation](https://gymnasium.farama.org/).\n",
|
||||
"* [Diving deeper into Reinforcement Learning with Q-Learning, Thomas Simonini](https://medium.freecodecamp.org/diving-deeper-into-reinforcement-learning-with-q-learning-c18d0db58efe).\n",
|
||||
"* Illustrations by [Thomas Simonini](https://github.com/simoninithomas/Deep_reinforcement_learning_Course) and [Sung Kim](https://www.youtube.com/watch?v=xgoO54qN4lY).\n",
|
||||
"* [Frozen Lake solution with TensorFlow](https://analyticsindiamag.com/openai-gym-frozen-lake-beginners-guide-reinforcement-learning/)\n",
|
||||
"* [Deep Q-Learning for Doom](https://medium.freecodecamp.org/an-introduction-to-deep-q-learning-lets-play-doom-54d02d8017d8)\n",
|
||||
"* [Intro OpenAI Gym with Random Search and the Cart Pole scenario](http://www.pinchofintelligence.com/getting-started-openai-gym/)\n",
|
||||
"* [Q-Learning for the Taxi scenario](https://www.oreilly.com/learning/introduction-to-reinforcement-learning-and-openai-gym)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos Á. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 1
|
||||
}
|
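The Taxi exercise above notes that older Gym code has to be adapted to Gymnasium. Judging from the `qlearning.py` module in this change set, the two differences that matter for a Q-learning loop are that `reset()` returns an `(observation, info)` pair and that `step()` returns five values, with `done` split into `terminated` and `truncated`. The sketch below shows one possible adapter; `legacy_reset` and `legacy_step` are hypothetical helper names, not part of Gymnasium.

```python
def legacy_reset(reset_result):
    """Keep only the observation from a Gymnasium-style (observation, info) pair."""
    observation, _info = reset_result
    return observation


def legacy_step(step_result):
    """Collapse a five-value step result into the old four-value form."""
    observation, reward, terminated, truncated, info = step_result
    done = terminated or truncated  # the episode is over in either case
    return observation, reward, done, info


# Stand-in tuples shaped like Gymnasium return values (no environment needed)
state = legacy_reset((0, {"prob": 1.0}))
new_state, reward, done, info = legacy_step((4, 0.0, False, True, {}))
print(state, new_state, reward, done)
```

With a real environment, the calls would look like `state = legacy_reset(env.reset())` and `legacy_step(env.step(action))`, so the body of an older training loop can stay unchanged.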
368
ml5/2_6_1_Q-Learning_Visualization.ipynb
Normal file
274
ml5/qlearning.py
Normal file
@@ -0,0 +1,274 @@
|
||||
# Class definition of QLearning

from pathlib import Path
from typing import NamedTuple

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

# Params

class Params(NamedTuple):
    total_episodes: int  # Total episodes
    learning_rate: float  # Learning rate
    gamma: float  # Discounting rate
    epsilon: float  # Exploration probability
    map_size: int  # Number of tiles of one side of the squared environment
    seed: int  # Define a seed so that we get reproducible results
    is_slippery: bool  # If true the player will move in intended direction with probability of 1/3 else will move in either perpendicular direction with equal probability of 1/3 in both directions
    n_runs: int  # Number of runs
    action_size: int  # Number of possible actions
    state_size: int  # Number of possible states
    proba_frozen: float  # Probability that a tile is frozen
    savefig_folder: Path  # Root folder where plots are saved


class Qlearning:
    def __init__(self, learning_rate, gamma, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate
        self.gamma = gamma
        self.reset_qtable()

    def update(self, state, action, reward, new_state):
        """Update Q(s,a):= Q(s,a) + lr [R(s,a) + gamma * max Q(s',a') - Q(s,a)]"""
        delta = (
            reward
            + self.gamma * np.max(self.qtable[new_state][:])
            - self.qtable[state][action]
        )
        q_update = self.qtable[state][action] + self.learning_rate * delta
        return q_update

    def reset_qtable(self):
        """Reset the Q-table."""
        self.qtable = np.zeros((self.state_size, self.action_size))


class EpsilonGreedy:
    def __init__(self, epsilon, rng):
        self.epsilon = epsilon
        self.rng = rng

    def choose_action(self, action_space, state, qtable):
        """Choose an action `a` in the current world state (s)."""
        # First we randomize a number
        explor_exploit_tradeoff = self.rng.uniform(0, 1)

        # Exploration
        if explor_exploit_tradeoff < self.epsilon:
            action = action_space.sample()

        # Exploitation (taking the biggest Q-value for this state)
        else:
            # Break ties randomly
            # If all actions are the same for this state we choose a random one
            # (otherwise `np.argmax()` would always take the first one)
            if np.all(qtable[state][:] == qtable[state][0]):
                action = action_space.sample()
            else:
                action = np.argmax(qtable[state][:])
        return action


def run_frozen_maps(maps, params, rng):
    """Run FrozenLake in maps and plot results"""
    map_sizes = maps
    res_all = pd.DataFrame()
    st_all = pd.DataFrame()

    for map_size in map_sizes:
        env = gym.make(
            "FrozenLake-v1",
            is_slippery=params.is_slippery,
            render_mode="rgb_array",
            desc=generate_random_map(
                size=map_size, p=params.proba_frozen, seed=params.seed
            ),
        )

        params = params._replace(action_size=env.action_space.n)
        params = params._replace(state_size=env.observation_space.n)
        env.action_space.seed(
            params.seed
        )  # Set the seed to get reproducible results when sampling the action space
        learner = Qlearning(
            learning_rate=params.learning_rate,
            gamma=params.gamma,
            state_size=params.state_size,
            action_size=params.action_size,
        )
        explorer = EpsilonGreedy(
            epsilon=params.epsilon,
            rng=rng
        )
        print(f"Map size: {map_size}x{map_size}")
        rewards, steps, episodes, qtables, all_states, all_actions = run_env(env, params, learner, explorer)

        # Save the results in dataframes
        res, st = postprocess(episodes, params, rewards, steps, map_size)
        res_all = pd.concat([res_all, res])
        st_all = pd.concat([st_all, st])
        qtable = qtables.mean(axis=0)  # Average the Q-table between runs

        plot_states_actions_distribution(
            states=all_states, actions=all_actions, map_size=map_size, params=params
        )  # Sanity check
        plot_q_values_map(qtable, env, map_size, params)

        env.close()
    return res_all, st_all


def run_env(env, params, learner, explorer):
    """Train the learner for several runs and log rewards, steps and Q-tables."""
    rewards = np.zeros((params.total_episodes, params.n_runs))
    steps = np.zeros((params.total_episodes, params.n_runs))
    episodes = np.arange(params.total_episodes)
    qtables = np.zeros((params.n_runs, params.state_size, params.action_size))
    all_states = []
    all_actions = []

    for run in range(params.n_runs):  # Run several times to account for stochasticity
        learner.reset_qtable()  # Reset the Q-table between runs

        for episode in tqdm(
            episodes, desc=f"Run {run}/{params.n_runs} - Episodes", leave=False
        ):
            state = env.reset(seed=params.seed)[0]  # Reset the environment
            step = 0
            done = False
            total_rewards = 0

            while not done:
                action = explorer.choose_action(
                    action_space=env.action_space, state=state, qtable=learner.qtable
                )

                # Log all states and actions
                all_states.append(state)
                all_actions.append(action)

                # Take the action (a) and observe the outcome state(s') and reward (r)
                new_state, reward, terminated, truncated, info = env.step(action)

                done = terminated or truncated

                learner.qtable[state, action] = learner.update(
                    state, action, reward, new_state
                )

                total_rewards += reward
                step += 1

                # The new state becomes the current state
                state = new_state

            # Log all rewards and steps
            rewards[episode, run] = total_rewards
            steps[episode, run] = step
        qtables[run, :, :] = learner.qtable

    return rewards, steps, episodes, qtables, all_states, all_actions


def postprocess(episodes, params, rewards, steps, map_size):
    """Convert the results of the simulation in dataframes."""
    res = pd.DataFrame(
        data={
            "Episodes": np.tile(episodes, reps=params.n_runs),
            "Rewards": rewards.flatten(),
            "Steps": steps.flatten(),
        }
    )
    res["cum_rewards"] = rewards.cumsum(axis=0).flatten(order="F")
    res["map_size"] = np.repeat(f"{map_size}x{map_size}", res.shape[0])

    st = pd.DataFrame(data={"Episodes": episodes, "Steps": steps.mean(axis=1)})
    st["map_size"] = np.repeat(f"{map_size}x{map_size}", st.shape[0])
    return res, st


def qtable_directions_map(qtable, map_size):
    """Get the best learned action & map it to arrows."""
    qtable_val_max = qtable.max(axis=1).reshape(map_size, map_size)
    qtable_best_action = np.argmax(qtable, axis=1).reshape(map_size, map_size)
    directions = {0: "←", 1: "↓", 2: "→", 3: "↑"}
    qtable_directions = np.empty(qtable_best_action.flatten().shape, dtype=str)
    eps = np.finfo(float).eps  # Machine epsilon: gap between 1.0 and the next representable float
    for idx, val in enumerate(qtable_best_action.flatten()):
        if qtable_val_max.flatten()[idx] > eps:
            # Assign an arrow only if a minimal Q-value has been learned as best action
            # otherwise since 0 is a direction, it also gets mapped on the tiles where
            # it didn't actually learn anything
            qtable_directions[idx] = directions[val]
    qtable_directions = qtable_directions.reshape(map_size, map_size)
    return qtable_val_max, qtable_directions


def plot_q_values_map(qtable, env, map_size, params):
    """Plot the last frame of the simulation and the policy learned."""
    qtable_val_max, qtable_directions = qtable_directions_map(qtable, map_size)

    # Plot the last frame
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    ax[0].imshow(env.render())
    ax[0].axis("off")
    ax[0].set_title("Last frame")

    # Plot the policy
    sns.heatmap(
        qtable_val_max,
        annot=qtable_directions,
        fmt="",
        ax=ax[1],
        cmap=sns.color_palette("Blues", as_cmap=True),
        linewidths=0.7,
        linecolor="black",
        xticklabels=[],
        yticklabels=[],
        annot_kws={"fontsize": "xx-large"},
    ).set(title="Learned Q-values\nArrows represent best action")
    for _, spine in ax[1].spines.items():
        spine.set_visible(True)
        spine.set_linewidth(0.7)
        spine.set_color("black")
    img_title = f"frozenlake_q_values_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_states_actions_distribution(states, actions, map_size, params):
    """Plot the distributions of states and actions."""
    labels = {"LEFT": 0, "DOWN": 1, "RIGHT": 2, "UP": 3}

    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.histplot(data=states, ax=ax[0], kde=True)
    ax[0].set_title("States")
    sns.histplot(data=actions, ax=ax[1])
    ax[1].set_xticks(list(labels.values()), labels=labels.keys())
    ax[1].set_title("Actions")
    fig.tight_layout()
    img_title = f"frozenlake_states_actions_distrib_{map_size}x{map_size}.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()


def plot_steps_and_rewards(rewards_df, steps_df, params):
    """Plot the steps and rewards from dataframes."""
    fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
    sns.lineplot(
        data=rewards_df, x="Episodes", y="cum_rewards", hue="map_size", ax=ax[0]
    )
    ax[0].set(ylabel="Cumulated rewards")

    sns.lineplot(data=steps_df, x="Episodes", y="Steps", hue="map_size", ax=ax[1])
    ax[1].set(ylabel="Averaged steps number")

    for axi in ax:
        axi.legend(title="map size")
    fig.tight_layout()
    img_title = "frozenlake_steps_and_rewards.png"
    fig.savefig(params.savefig_folder / img_title, bbox_inches="tight")
    plt.show()
|
||||
|
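The `EpsilonGreedy.choose_action` method in `qlearning.py` above breaks ties randomly only when every Q-value of the state is identical. A slightly more general variant, sketched here as an alternative rather than as the module's own method, picks uniformly among all actions that share the maximum, so partial ties are also handled. The helper name `argmax_random_ties` is hypothetical; it relies only on NumPy's `flatnonzero` and `Generator.choice`.

```python
import numpy as np


def argmax_random_ties(q_values, rng):
    """Return the index of a maximal entry, chosen uniformly among ties."""
    q_values = np.asarray(q_values)
    best = np.flatnonzero(q_values == q_values.max())  # indices of all maxima
    return int(rng.choice(best))


rng = np.random.default_rng(0)

# An untrained state (all zeros): every action should eventually be picked
picks = {argmax_random_ties([0.0, 0.0, 0.0, 0.0], rng) for _ in range(200)}
print(sorted(picks))

# A partial tie between actions 1 and 2: only those two can be returned
print(argmax_random_ties([0.1, 0.7, 0.7, 0.2], rng))
```

Passing the same seeded generator that `EpsilonGreedy` already receives would keep runs reproducible.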
742
nlp/0_1_LLM.ipynb
Normal file
@@ -89,7 +89,7 @@
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"In this session we are going to learn to process text so that can apply machine learning techniques."
|
||||
"In this session, we are going to learn to process text so that we can apply machine learning techniques."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -101,7 +101,7 @@
|
||||
},
|
||||
"source": [
|
||||
"# NLP Basics\n",
|
||||
"In this notebook we are going to use two popular NLP libraries:\n",
|
||||
"In this notebook, we are going to use two popular NLP libraries:\n",
|
||||
"* NLTK (Natural Language Toolkit, https://www.nltk.org/) \n",
|
||||
"* Spacy (https://spacy.io/)"
|
||||
]
|
||||
@@ -116,7 +116,7 @@
|
||||
"source": [
|
||||
"Main characteristics:\n",
|
||||
"* both are open source and very popular\n",
|
||||
"* NLTK was released in 2001 while Spacy was in 2015\n",
|
||||
"* NLTK was released in 2001, while Spacy was in 2015\n",
|
||||
"* Spacy provides very efficient implementations"
|
||||
]
|
||||
},
|
||||
@@ -130,7 +130,7 @@
|
||||
"source": [
|
||||
"# Spacy installation\n",
|
||||
"\n",
|
||||
"You need to install previously spacy if not installed:\n",
|
||||
"You need to install spacy if not installed:\n",
|
||||
"* `pip install spacy`\n",
|
||||
"* or `conda install -c conda-forge spacy`\n",
|
||||
"\n",
|
||||
@@ -148,7 +148,7 @@
|
||||
"source": [
|
||||
"# Spacy pipelines\n",
|
||||
"\n",
|
||||
"The function **nlp** takes a raw text and perform several operations (tokenization, tagger, NER, ...)\n",
|
||||
"The function **nlp** takes a raw text and performs several operations (tokenization, tagger, NER, ...)\n",
|
||||
""
|
||||
]
|
||||
},
|
||||
@@ -160,7 +160,7 @@
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"From text to doc trough the pipeline"
|
||||
"From text to doc through the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -205,7 +205,7 @@
|
||||
"\n",
|
||||
"* **Tokenizer exception:** Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied.\n",
|
||||
"* **Prefix:** Character(s) at the beginning, e.g. $, (, “, ¿.\n",
|
||||
"* **Suffix:** Character(s) at the end, e.g. km, ), ”, !.\n",
|
||||
"* **Suffix:** Character(s) at the end, e.g. km, ”, !.\n",
|
||||
"* **Infix:** Character(s) in between, e.g. -, --, /, …."
|
||||
]
|
||||
},
|
||||
|
@@ -82,7 +82,7 @@
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"### 1. List the first 10 tokens of the doc"
|
||||
"### 1. List the first 10 tokens of the doc."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -149,7 +149,7 @@
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"### 7. Visualize the dependency grammar analysis of the second sentence"
|
||||
"### 7. Visualize the dependency grammar analysis of the second sentence."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -178,7 +178,7 @@
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"### 9. List frequencies of POS in the document in a table "
|
||||
"### 9. List the frequencies of POS in the document in a table."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -191,7 +191,7 @@
|
||||
"source": [
|
||||
"### 10. Preprocessing\n",
|
||||
"\n",
|
||||
"Remove from the doc stopwords, digits and punctuation.\n",
|
||||
"Remove from the doc stopwords, digits, and punctuation.\n",
|
||||
"\n",
|
||||
"Hint: check the token api https://spacy.io/api/token\n",
|
||||
"\n",
|
||||
@@ -207,7 +207,7 @@
|
||||
},
|
||||
"source": [
|
||||
"### 11. Entities of the document\n",
|
||||
"Print the entities of the document, the type of the entity and what the explanation of the entity in a table with three columns.\n",
|
||||
"Print the entities of the document, the type of the entity, and the explanation of the entity in a table with three columns.\n",
|
||||
"\n",
|
||||
"Example:\n",
|
||||
"\n",
|
||||
@@ -223,7 +223,7 @@
|
||||
},
|
||||
"source": [
|
||||
"### 12. Visualize the entities\n",
|
||||
"Show the entities in a graph."
|
||||
"Show the entities highlighted in the text."
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -236,7 +236,7 @@
|
||||
"source": [
|
||||
"# Movie review\n",
|
||||
"\n",
|
||||
"Classify the rmoview reviews from the following dataset https://data.world/rajeevsharma993/movie-reviews"
|
||||
"Classify the movie reviews from the following dataset https://data.world/rajeevsharma993/movie-reviews"
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -105,9 +105,23 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<style>#sk-container-id-1 {color: black;background-color: white;}#sk-container-id-1 pre{padding: 0;}#sk-container-id-1 div.sk-toggleable {background-color: white;}#sk-container-id-1 label.sk-toggleable__label {cursor: pointer;display: block;width: 100%;margin-bottom: 0;padding: 0.3em;box-sizing: border-box;text-align: center;}#sk-container-id-1 label.sk-toggleable__label-arrow:before {content: \"▸\";float: left;margin-right: 0.25em;color: #696969;}#sk-container-id-1 label.sk-toggleable__label-arrow:hover:before {color: black;}#sk-container-id-1 div.sk-estimator:hover label.sk-toggleable__label-arrow:before {color: black;}#sk-container-id-1 div.sk-toggleable__content {max-height: 0;max-width: 0;overflow: hidden;text-align: left;background-color: #f0f8ff;}#sk-container-id-1 div.sk-toggleable__content pre {margin: 0.2em;color: black;border-radius: 0.25em;background-color: #f0f8ff;}#sk-container-id-1 input.sk-toggleable__control:checked~div.sk-toggleable__content {max-height: 200px;max-width: 100%;overflow: auto;}#sk-container-id-1 input.sk-toggleable__control:checked~label.sk-toggleable__label-arrow:before {content: \"▾\";}#sk-container-id-1 div.sk-estimator input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-label input.sk-toggleable__control:checked~label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 input.sk-hidden--visually {border: 0;clip: rect(1px 1px 1px 1px);clip: rect(1px, 1px, 1px, 1px);height: 1px;margin: -1px;overflow: hidden;padding: 0;position: absolute;width: 1px;}#sk-container-id-1 div.sk-estimator {font-family: monospace;background-color: #f0f8ff;border: 1px dotted black;border-radius: 0.25em;box-sizing: border-box;margin-bottom: 0.5em;}#sk-container-id-1 div.sk-estimator:hover {background-color: #d4ebff;}#sk-container-id-1 div.sk-parallel-item::after {content: \"\";width: 100%;border-bottom: 1px solid gray;flex-grow: 1;}#sk-container-id-1 div.sk-label:hover 
label.sk-toggleable__label {background-color: #d4ebff;}#sk-container-id-1 div.sk-serial::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: 0;}#sk-container-id-1 div.sk-serial {display: flex;flex-direction: column;align-items: center;background-color: white;padding-right: 0.2em;padding-left: 0.2em;position: relative;}#sk-container-id-1 div.sk-item {position: relative;z-index: 1;}#sk-container-id-1 div.sk-parallel {display: flex;align-items: stretch;justify-content: center;background-color: white;position: relative;}#sk-container-id-1 div.sk-item::before, #sk-container-id-1 div.sk-parallel-item::before {content: \"\";position: absolute;border-left: 1px solid gray;box-sizing: border-box;top: 0;bottom: 0;left: 50%;z-index: -1;}#sk-container-id-1 div.sk-parallel-item {display: flex;flex-direction: column;z-index: 1;position: relative;background-color: white;}#sk-container-id-1 div.sk-parallel-item:first-child::after {align-self: flex-end;width: 50%;}#sk-container-id-1 div.sk-parallel-item:last-child::after {align-self: flex-start;width: 50%;}#sk-container-id-1 div.sk-parallel-item:only-child::after {width: 0;}#sk-container-id-1 div.sk-dashed-wrapped {border: 1px dashed gray;margin: 0 0.4em 0.5em 0.4em;box-sizing: border-box;padding-bottom: 0.4em;background-color: white;}#sk-container-id-1 div.sk-label label {font-family: monospace;font-weight: bold;display: inline-block;line-height: 1.2em;}#sk-container-id-1 div.sk-label-container {text-align: center;}#sk-container-id-1 div.sk-container {/* jupyter's `normalize.less` sets `[hidden] { display: none; }` but bootstrap.min.css set `[hidden] { display: none !important; }` so we also need the `!important` here to be able to override the default hidden behavior on the sphinx rendered scikit-learn.org. 
See: https://github.com/scikit-learn/scikit-learn/issues/21755 */display: inline-block !important;position: relative;}#sk-container-id-1 div.sk-text-repr-fallback {display: none;}</style><div id=\"sk-container-id-1\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>CountVectorizer(max_features=5000)</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-1\" type=\"checkbox\" checked><label for=\"sk-estimator-id-1\" class=\"sk-toggleable__label sk-toggleable__label-arrow\">CountVectorizer</label><div class=\"sk-toggleable__content\"><pre>CountVectorizer(max_features=5000)</pre></div></div></div></div></div>"
|
||||
],
|
||||
"text/plain": [
|
||||
"CountVectorizer(max_features=5000)"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
||||
"\n",
|
||||
@@ -128,9 +142,21 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"<3x10 sparse matrix of type '<class 'numpy.int64'>'\n",
|
||||
"\twith 15 stored elements in Compressed Sparse Row format>"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectors"
|
||||
@@ -146,12 +172,24 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[[0 1 1 2 0 0 1 2 0 0]\n",
|
||||
" [1 0 0 0 2 0 0 1 2 1]\n",
|
||||
" [1 0 0 0 2 1 0 0 1 1]]\n",
|
||||
"['and' 'but' 'coming' 'is' 'like' 'sandwiches' 'short' 'summer' 'the'\n",
|
||||
" 'winter']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"print(vectors.toarray())\n",
|
||||
"print(vectorizer.get_feature_names())"
|
||||
"print(vectorizer.get_feature_names_out())"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -164,13 +202,25 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array(['and', 'but', 'coming', 'i', 'is', 'like', 'sandwiches', 'short',\n",
|
||||
" 'summer', 'the', 'winter'], dtype=object)"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words=None, token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
"vectorizer.get_feature_names_out()"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -182,20 +232,47 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stderr",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"/home/cif/anaconda3/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n",
|
||||
" warnings.warn(msg, category=FutureWarning)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['coming', 'like', 'sandwiches', 'short', 'summer', 'winter']"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"vectorizer = CountVectorizer(analyzer=\"word\", stop_words='english', token_pattern='(?u)\\\\b\\\\w+\\\\b') \n",
|
||||
"vectors = vectorizer.fit_transform(documents)\n",
|
||||
"vectorizer.get_feature_names()"
|
||||
"vectorizer.get_feature_names_out()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"frozenset({'or', 'be', 'least', 'ours', 'very', 'noone', 'more', 'can', 'front', 'last', 'co', 'where', 'beyond', 'you', 'was', 'to', 'nine', 'here', 'describe', 'than', 'rather', 'therefore', 'except', 'at', 'again', 'ourselves', 'most', 'anyway', 'thick', 'whither', 'thereupon', 'someone', 'hereupon', 'besides', 'among', 'hasnt', 'across', 'namely', 'because', 'is', 'out', 'same', 'yourself', 'somehow', 'sincere', 'con', 'hereby', 'towards', 'interest', 'much', 'up', 'why', 'myself', 'all', 'nobody', 'though', 'every', 'show', 'not', 'there', 'whether', 'still', 'name', 'when', 'the', 'each', 'six', 'nor', 'and', 'under', 'thereby', 'less', 'either', 'thence', 'into', 'seemed', 'something', 'four', 'sometimes', 'himself', 'those', 'nowhere', 'almost', 'are', 'empty', 'must', 'while', 'afterwards', 'perhaps', 'from', 'detail', 'through', 'any', 'have', 'may', 'he', 'anywhere', 'alone', 'without', 'beforehand', 'had', 'too', 'yourselves', 'our', 'see', 'how', 'please', 'what', 'am', 'do', 'it', 'serious', 'yet', 'down', 'top', 'amount', 'then', 'both', 'fire', 'been', 'wherein', 'done', 'etc', 'whose', 'whereafter', 'who', 'ltd', 'meanwhile', 'further', 'few', 'first', 'behind', 'made', 'yours', 'until', 'toward', 'amoungst', 'anyhow', 'we', 'with', 'give', 'go', 'no', 'back', 'else', 'becomes', 'your', 'fill', 'together', 'another', 'throughout', 'onto', 'de', 'me', 'ten', 'system', 'became', 'per', 'therein', 'everyone', 'often', 'ie', 'put', 'hers', 'herself', 'nevertheless', 'itself', 'eg', 'herein', 'his', 'this', 'cry', 'due', 'bill', 'one', 'on', 'being', 'themselves', 'of', 'some', 'their', 'neither', 'elsewhere', 'since', 'whole', 'eight', 'i', 'a', 'whoever', 'own', 'call', 'them', 'mostly', 'she', 'my', 'cannot', 'us', 'never', 'as', 'thin', 'upon', 'cant', 'un', 'before', 'her', 'otherwise', 'full', 'these', 'next', 'they', 'side', 'somewhere', 'fifty', 'hence', 'so', 'along', 'already', 'three', 'latter', 'anything', 'whom', 'could', 'indeed', 
'nothing', 'whereby', 'which', 'sometime', 'become', 'ever', 'amongst', 'by', 'in', 'five', 'after', 'mine', 'fifteen', 'wherever', 'found', 'thereafter', 'third', 'keep', 'anyone', 'will', 'bottom', 'off', 'seem', 'none', 'an', 'whatever', 'over', 'during', 'also', 'latterly', 'via', 'take', 'former', 'above', 'now', 'becoming', 'hereafter', 'such', 'two', 'only', 'about', 'sixty', 're', 'everything', 'others', 'hundred', 'twelve', 'thus', 'even', 'well', 'always', 'once', 'beside', 'get', 'mill', 'seems', 'if', 'whereupon', 'find', 'forty', 'inc', 'whenever', 'around', 'other', 'should', 'many', 'enough', 'however', 'move', 'against', 'several', 'everywhere', 'has', 'whereas', 'that', 'whence', 'eleven', 'its', 'within', 'twenty', 'part', 'although', 'thru', 'couldnt', 'moreover', 'him', 'formerly', 'might', 'seeming', 'but', 'below', 'would', 'between', 'were', 'for'})\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#stop words in scikit-learn for English\n",
|
||||
"print(vectorizer.get_stop_words())"
|
||||
@@ -442,7 +519,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -456,7 +533,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.1"
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
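Note on the two `token_pattern` outputs in this notebook: the first vocabulary has 10 entries and no `'i'`, while the cell that passes `token_pattern='(?u)\\b\\w+\\b'` yields 11 entries including `'i'`. The sketch below reproduces that difference with the standard `re` module only, so it is an illustration of the tokenization rule rather than scikit-learn's actual implementation. The three documents are an assumption: they are not shown in this diff and were reconstructed from the count matrix and feature names printed above. `DEFAULT_PATTERN` is the default documented for `CountVectorizer`; the helper names `vocabulary` and `count_matrix` are hypothetical.

```python
import re
from collections import Counter

# Assumed documents (not shown in the diff); reconstructed from the printed counts.
documents = [
    "Summer is coming but Summer is short",
    "I like the Summer and I like the Winter",
    "I like sandwiches and I like the Winter",
]

# Documented CountVectorizer default: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern used in the notebook cell: tokens of one or more word characters.
PERMISSIVE_PATTERN = r"(?u)\b\w+\b"


def vocabulary(docs, pattern):
    """Return the sorted list of distinct lowercased tokens matched by `pattern`."""
    token_re = re.compile(pattern)
    return sorted({tok for doc in docs for tok in token_re.findall(doc.lower())})


def count_matrix(docs, pattern):
    """Return (vocabulary, rows), where rows[d][j] counts token j in document d."""
    vocab = vocabulary(docs, pattern)
    token_re = re.compile(pattern)
    rows = []
    for doc in docs:
        counts = Counter(token_re.findall(doc.lower()))
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


vocab, rows = count_matrix(documents, DEFAULT_PATTERN)
print(vocab)
for row in rows:
    print(row)
print(vocabulary(documents, PERMISSIVE_PATTERN))
```

With the assumed documents, the default pattern drops the one-letter token `i`, and the permissive pattern keeps it, which matches the two feature lists recorded in the cell outputs.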
|
@@ -74,9 +74,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.datasets import fetch_20newsgroups\n",
|
||||
"\n",
|
||||
@@ -90,9 +98,17 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"20\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#Number of categories\n",
|
||||
"print(len(newsgroups_train.target_names))"
|
||||
@@ -100,9 +116,26 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Category id 4 comp.sys.mac.hardware\n",
|
||||
"Doc A fair number of brave souls who upgraded their SI clock oscillator have\n",
|
||||
"shared their experiences for this poll. Please send a brief message detailing\n",
|
||||
"your experiences with the procedure. Top speed attained, CPU rated speed,\n",
|
||||
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
|
||||
"functionality with 800 and 1.4 m floppies are especially requested.\n",
|
||||
"\n",
|
||||
"I will be summarizing in the next two days, so please add to the network\n",
|
||||
"knowledge base if you have done the clock upgrade and haven't answered this\n",
|
||||
"poll. Thanks.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Show a document\n",
|
||||
"docid = 1\n",
|
||||
@@ -115,9 +148,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(11314,)"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"#Number of files\n",
|
||||
"newsgroups_train.filenames.shape"
|
||||
@@ -125,9 +169,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(11314, 101322)"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Obtain a vector\n",
|
||||
"\n",
|
||||
@@ -141,9 +196,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"66.802987449178"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# The tf-idf vectors are very sparse, with an average of about 67 non-zero components in 101,322 dimensions (about 0.07%)\n",
|
||||
"vectors_train.nnz / float(vectors_train.shape[0])"
|
||||
@@ -165,9 +231,20 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"0.695453607190013"
|
||||
]
|
||||
},
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.naive_bayes import MultinomialNB\n",
|
||||
"\n",
|
||||
@@ -195,29 +272,44 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"alt.atheism: islam atheists say just religion atheism think don people god\n",
|
||||
"comp.graphics: looking format 3d know program file files thanks image graphics\n",
|
||||
"comp.os.ms-windows.misc: card problem thanks driver drivers use files dos file windows\n",
|
||||
"comp.sys.ibm.pc.hardware: monitor disk thanks pc ide controller bus card scsi drive\n",
|
||||
"comp.sys.mac.hardware: know monitor does quadra simms thanks problem drive apple mac\n",
|
||||
"comp.windows.x: using windows x11r5 use application thanks widget server motif window\n",
|
||||
"misc.forsale: asking email sell price condition new shipping offer 00 sale\n",
|
||||
"rec.autos: don ford new good dealer just engine like cars car\n",
|
||||
"rec.motorcycles: don just helmet riding like motorcycle ride bikes dod bike\n",
|
||||
"rec.sport.baseball: braves players pitching hit runs games game baseball team year\n",
|
||||
"rec.sport.hockey: league year nhl games season players play hockey team game\n",
|
||||
"sci.crypt: people use escrow nsa keys government chip clipper encryption key\n",
|
||||
"sci.electronics: don thanks voltage used know does like circuit power use\n",
|
||||
"sci.med: skepticism cadre dsl banks chastity n3jxp pitt gordon geb msg\n",
|
||||
"sci.space: just lunar earth shuttle like moon launch orbit nasa space\n",
|
||||
"soc.religion.christian: believe faith christian christ bible people christians church jesus god\n",
|
||||
"talk.politics.guns: just law firearms government fbi don weapons people guns gun\n",
|
||||
"talk.politics.mideast: said arabs arab turkish people armenians armenian jews israeli israel\n",
|
||||
"talk.politics.misc: know state clinton president just think tax don government people\n",
|
||||
"talk.religion.misc: think don koresh objective christians bible people christian jesus god\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.utils.extmath import density\n",
|
||||
"\n",
|
||||
"print(\"dimensionality: %d\" % model.coef_.shape[1])\n",
|
||||
"print(\"density: %f\" % density(model.coef_))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# We can review the top features per topic in Bayes (attribute coef_)\n",
|
||||
"# We can review the top features per topic in Bayes (attribute feature_log_prob_)\n",
|
||||
"import numpy as np\n",
|
||||
"\n",
|
||||
"def show_top10(classifier, vectorizer, categories):\n",
|
||||
" feature_names = np.asarray(vectorizer.get_feature_names())\n",
|
||||
" feature_names = np.asarray(vectorizer.get_feature_names_out())\n",
|
||||
" for i, category in enumerate(categories):\n",
|
||||
" top10 = np.argsort(classifier.coef_[i])[-10:]\n",
|
||||
" top10 = np.argsort(classifier.feature_log_prob_[i, :])[-10:]\n",
|
||||
" print(\"%s: %s\" % (category, \" \".join(feature_names[top10])))\n",
|
||||
"\n",
|
||||
" \n",
|
||||
@@ -226,9 +318,18 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[ 2 15]\n",
|
||||
"['comp.os.ms-windows.misc', 'soc.religion.christian']\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# We try the classifier in two new docs\n",
|
||||
"\n",
|
||||
@@ -275,7 +376,7 @@
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -289,7 +390,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.7.1"
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
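Note on the `show_top10` change in this notebook: the diff swaps `classifier.coef_[i]` for `classifier.feature_log_prob_[i, :]`, which in `MultinomialNB` holds the per-class log probabilities of the features, and keeps the `np.argsort(...)[-10:]` idiom. The sketch below shows what that idiom selects, using plain Python lists so it runs without NumPy or a trained model. The function names `top_k_indices` and `show_top_k`, the four feature names, and the log-probability values are all made up for illustration; they are not taken from the notebook.

```python
def top_k_indices(scores, k):
    """Indices of the k largest scores, lowest first (mirrors np.argsort(scores)[-k:])."""
    return sorted(range(len(scores)), key=lambda j: scores[j])[-k:]


def show_top_k(feature_log_prob, feature_names, categories, k=10):
    """Build one 'category: feature ...' line per class from a class-by-feature score table."""
    lines = []
    for i, category in enumerate(categories):
        top = top_k_indices(feature_log_prob[i], k)
        lines.append("%s: %s" % (category, " ".join(feature_names[j] for j in top)))
    return lines


# Toy data (hypothetical): 2 classes x 4 features of log probabilities.
feature_names = ["car", "god", "space", "windows"]
categories = ["rec.autos", "sci.space"]
feature_log_prob = [
    [-0.5, -3.0, -2.0, -2.5],
    [-2.5, -3.0, -0.4, -2.0],
]

for line in show_top_k(feature_log_prob, feature_names, categories, k=2):
    print(line)
# prints:
# rec.autos: space car
# sci.space: windows space
```

Because the indices come back in ascending score order, the most probable feature for each class is printed last, which is why each category line in the notebook output ends with its most characteristic word.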
|
@@ -76,7 +76,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 33,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@@ -85,7 +85,7 @@
|
||||
"(2034, 2807)"
|
||||
]
|
||||
},
|
||||
"execution_count": 33,
|
||||
"execution_count": 1,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -134,15 +134,41 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 34,
|
||||
"execution_count": 24,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Requirement already satisfied: gensim in /home/cif/anaconda3/lib/python3.10/site-packages (4.3.1)\n",
|
||||
"Requirement already satisfied: scipy>=1.7.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (1.10.1)\n",
|
||||
"Requirement already satisfied: smart-open>=1.8.1 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (6.3.0)\n",
|
||||
"Requirement already satisfied: numpy>=1.18.5 in /home/cif/anaconda3/lib/python3.10/site-packages (from gensim) (1.24.2)\n",
|
||||
"Note: you may need to restart the kernel to use updated packages.\n",
|
||||
"Requirement already satisfied: python-Levenshtein in /home/cif/anaconda3/lib/python3.10/site-packages (0.21.0)\n",
|
||||
"Requirement already satisfied: Levenshtein==0.21.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from python-Levenshtein) (0.21.0)\n",
|
||||
"Requirement already satisfied: rapidfuzz<4.0.0,>=2.3.0 in /home/cif/anaconda3/lib/python3.10/site-packages (from Levenshtein==0.21.0->python-Levenshtein) (3.0.0)\n",
|
||||
"Note: you may need to restart the kernel to use updated packages.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"%pip install gensim\n",
|
||||
"%pip install python-Levenshtein"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 23,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from gensim import matutils\n",
|
||||
"\n",
|
||||
"vocab = vectorizer.get_feature_names()\n",
|
||||
"vocab = vectorizer.get_feature_names_out()\n",
|
||||
"\n",
|
||||
"dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names())])\n",
|
||||
"dictionary = dict([(i, s) for i, s in enumerate(vectorizer.get_feature_names_out())])\n",
|
||||
"corpus_tfidf = matutils.Sparse2Corpus(vectors_train)"
|
||||
]
|
||||
},
|
||||
@@ -162,7 +188,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 60,
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -176,23 +202,23 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 61,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.011*\"baptist\" + 0.010*\"koresh\" + 0.009*\"bible\" + 0.006*\"reality\" + 0.006*\"virtual\" + 0.005*\"scarlet\" + 0.005*\"shag\" + 0.004*\"tootsie\" + 0.004*\"kinda\" + 0.004*\"captain\"'),\n",
|
||||
" '0.004*\"central\" + 0.004*\"assumptions\" + 0.004*\"matthew\" + 0.004*\"define\" + 0.004*\"holes\" + 0.003*\"killing\" + 0.003*\"item\" + 0.003*\"curious\" + 0.003*\"going\" + 0.003*\"presentations\"'),\n",
|
||||
" (1,\n",
|
||||
" '0.010*\"targa\" + 0.008*\"thanks\" + 0.008*\"moon\" + 0.007*\"craig\" + 0.007*\"zoroastrians\" + 0.006*\"yayayay\" + 0.005*\"unfortunately\" + 0.005*\"windows\" + 0.005*\"rayshade\" + 0.004*\"tdb\"'),\n",
|
||||
" '0.002*\"mechanism\" + 0.002*\"led\" + 0.002*\"apple\" + 0.002*\"color\" + 0.002*\"mormons\" + 0.002*\"activity\" + 0.002*\"concepts\" + 0.002*\"frank\" + 0.002*\"platform\" + 0.002*\"fault\"'),\n",
|
||||
" (2,\n",
|
||||
" '0.009*\"mary\" + 0.007*\"whatever\" + 0.006*\"god\" + 0.005*\"ns\" + 0.005*\"lucky\" + 0.005*\"joseph\" + 0.005*\"ssrt\" + 0.005*\"samaritan\" + 0.005*\"crusades\" + 0.004*\"phobos\"'),\n",
|
||||
" '0.005*\"objects\" + 0.005*\"obtained\" + 0.003*\"manhattan\" + 0.003*\"capability\" + 0.003*\"education\" + 0.003*\"men\" + 0.003*\"photo\" + 0.003*\"decent\" + 0.003*\"environmental\" + 0.003*\"pain\"'),\n",
|
||||
" (3,\n",
|
||||
" '0.009*\"islam\" + 0.008*\"western\" + 0.008*\"plane\" + 0.008*\"jeff\" + 0.007*\"cheers\" + 0.007*\"kent\" + 0.007*\"joy\" + 0.007*\"khomeini\" + 0.007*\"davidian\" + 0.006*\"basically\"')]"
|
||||
" '0.004*\"car\" + 0.004*\"contain\" + 0.004*\"groups\" + 0.004*\"center\" + 0.004*\"evil\" + 0.004*\"maintain\" + 0.004*\"comets\" + 0.004*\"88\" + 0.004*\"density\" + 0.003*\"company\"')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 61,
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -211,7 +237,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 62,
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -243,14 +269,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 63,
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)\n"
|
||||
"Dictionary<10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...>\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -263,7 +289,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 64,
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -274,7 +300,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 65,
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -286,14 +312,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 71,
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Dictionary(10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...)\n"
|
||||
"Dictionary<10913 unique tokens: ['cel', 'ds', 'hi', 'nothing', 'prj']...>\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -305,7 +331,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 72,
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -315,7 +341,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 73,
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -328,7 +354,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 74,
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@@ -346,7 +372,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 75,
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -359,23 +385,23 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 76,
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.009*\"whatever\" + 0.007*\"plane\" + 0.007*\"ns\" + 0.007*\"joy\" + 0.006*\"happy\" + 0.005*\"bob\" + 0.004*\"phil\" + 0.004*\"nasa\" + 0.003*\"purdue\" + 0.003*\"neie\"'),\n",
|
||||
" '0.011*\"mary\" + 0.007*\"ns\" + 0.006*\"joseph\" + 0.006*\"lucky\" + 0.006*\"ssrt\" + 0.005*\"god\" + 0.005*\"unfortunately\" + 0.004*\"rayshade\" + 0.004*\"phil\" + 0.004*\"nasa\"'),\n",
|
||||
" (1,\n",
|
||||
" '0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"'),\n",
|
||||
" '0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"'),\n",
|
||||
" (2,\n",
|
||||
" '0.010*\"moon\" + 0.007*\"phobos\" + 0.006*\"unfortunately\" + 0.006*\"martian\" + 0.006*\"russian\" + 0.005*\"rayshade\" + 0.005*\"anybody\" + 0.005*\"perturbations\" + 0.005*\"thanks\" + 0.004*\"apollo\"'),\n",
|
||||
" '0.008*\"moon\" + 0.008*\"really\" + 0.008*\"western\" + 0.007*\"plane\" + 0.006*\"samaritan\" + 0.006*\"crusades\" + 0.006*\"baltimore\" + 0.005*\"bob\" + 0.005*\"septuagint\" + 0.005*\"virtual\"'),\n",
|
||||
" (3,\n",
|
||||
" '0.008*\"islam\" + 0.008*\"western\" + 0.007*\"jeff\" + 0.007*\"zoroastrians\" + 0.006*\"davidian\" + 0.006*\"basically\" + 0.005*\"bull\" + 0.005*\"gerald\" + 0.005*\"sorry\" + 0.004*\"kent\"')]"
|
||||
" '0.009*\"koresh\" + 0.008*\"bible\" + 0.008*\"jeff\" + 0.007*\"basically\" + 0.006*\"gerald\" + 0.006*\"bull\" + 0.005*\"pd\" + 0.004*\"also\" + 0.003*\"dam\" + 0.003*\"feiner\"')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 76,
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -387,14 +413,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 77,
|
||||
"execution_count": 15,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.7154438), (1, 0.10569019), (2, 0.09522807), (3, 0.08363795)]\n"
|
||||
"[(0, 0.09161347), (1, 0.1133858), (2, 0.103424065), (3, 0.69157666)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -406,7 +432,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 78,
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@@ -427,14 +453,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 79,
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.06320839), (1, 0.80878526), (2, 0.06274223), (3, 0.065264106)]\n"
|
||||
"[(0, 0.066217005), (1, 0.8084562), (2, 0.062542014), (3, 0.0627848)]\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -446,14 +472,14 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 80,
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"\n"
|
||||
"0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -464,15 +490,15 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 81,
|
||||
"execution_count": 19,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"[(0, 0.10564032), (1, 0.67894983), (2, 0.104482815), (3, 0.11092702)]\n",
|
||||
"0.009*\"god\" + 0.008*\"mary\" + 0.008*\"targa\" + 0.007*\"baptist\" + 0.007*\"thanks\" + 0.007*\"koresh\" + 0.006*\"really\" + 0.006*\"bible\" + 0.005*\"lot\" + 0.005*\"lucky\"\n"
|
||||
"[(0, 0.11006463), (1, 0.6813435), (2, 0.10399808), (3, 0.10459379)]\n",
|
||||
"0.009*\"thanks\" + 0.009*\"targa\" + 0.008*\"whatever\" + 0.008*\"baptist\" + 0.007*\"islam\" + 0.006*\"cheers\" + 0.006*\"kent\" + 0.006*\"zoroastrians\" + 0.006*\"joy\" + 0.006*\"lot\"\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
@@ -492,7 +518,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 82,
|
||||
"execution_count": 20,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
@@ -508,23 +534,23 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 83,
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"[(0,\n",
|
||||
" '0.769*\"god\" + 0.346*\"jesus\" + 0.235*\"bible\" + 0.204*\"christian\" + 0.148*\"christians\" + 0.107*\"christ\" + 0.090*\"well\" + 0.085*\"koresh\" + 0.081*\"kent\" + 0.080*\"christianity\"'),\n",
|
||||
" '-0.769*\"god\" + -0.345*\"jesus\" + -0.235*\"bible\" + -0.203*\"christian\" + -0.149*\"christians\" + -0.107*\"christ\" + -0.089*\"well\" + -0.085*\"koresh\" + -0.082*\"kent\" + -0.081*\"christianity\"'),\n",
|
||||
" (1,\n",
|
||||
" '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.152*\"hi\" + 0.124*\"god\" + -0.111*\"sorry\" + -0.088*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"gif\"'),\n",
|
||||
" '-0.863*\"thanks\" + -0.255*\"please\" + -0.159*\"hello\" + -0.152*\"hi\" + 0.123*\"god\" + -0.112*\"sorry\" + -0.088*\"could\" + -0.074*\"windows\" + -0.067*\"jpeg\" + -0.063*\"gif\"'),\n",
|
||||
" (2,\n",
|
||||
" '-0.780*\"well\" + 0.229*\"god\" + -0.165*\"yes\" + 0.154*\"thanks\" + -0.133*\"ico\" + -0.133*\"tek\" + -0.130*\"queens\" + -0.130*\"bronx\" + -0.130*\"beauchaine\" + -0.130*\"manhattan\"'),\n",
|
||||
" '0.779*\"well\" + -0.229*\"god\" + 0.165*\"yes\" + -0.154*\"thanks\" + 0.135*\"ico\" + 0.134*\"tek\" + 0.131*\"queens\" + 0.131*\"bronx\" + 0.131*\"beauchaine\" + 0.131*\"manhattan\"'),\n",
|
||||
" (3,\n",
|
||||
" '-0.338*\"well\" + 0.336*\"ico\" + 0.334*\"tek\" + 0.328*\"bronx\" + 0.328*\"beauchaine\" + 0.328*\"queens\" + 0.326*\"manhattan\" + 0.305*\"com\" + 0.305*\"bob\" + 0.072*\"god\"')]"
|
||||
" '-0.342*\"well\" + 0.335*\"ico\" + 0.333*\"tek\" + 0.327*\"bronx\" + 0.327*\"queens\" + 0.327*\"beauchaine\" + 0.325*\"manhattan\" + 0.305*\"bob\" + 0.304*\"com\" + 0.073*\"god\"')]"
|
||||
]
|
||||
},
|
||||
"execution_count": 83,
|
||||
"execution_count": 21,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
@@ -536,7 +562,7 @@
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 84,
|
||||
"execution_count": 22,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
@@ -595,7 +621,7 @@
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
@@ -609,7 +635,7 @@
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.8.8"
|
||||
"version": "3.10.10"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
|
33
nlp/spacy/spacy-pipeline.svg
Normal file
@@ -0,0 +1,33 @@
|
||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="1155" height="221" viewBox="0 0 1155 221">
|
||||
<defs>
|
||||
<rect id="a" width="735" height="170" x="210" y="25" rx="30"/>
|
||||
<mask id="b" width="735" height="170" x="0" y="0" fill="#fff" maskContentUnits="userSpaceOnUse" maskUnits="objectBoundingBox">
|
||||
<use xlink:href="#a"/>
|
||||
</mask>
|
||||
</defs>
|
||||
<g fill="none" fill-rule="evenodd" transform="translate(0 26)">
|
||||
<rect width="145" height="80" x="2.5" y="2.5" fill="#D8D8D8" stroke="#6A6A6A" stroke-width="5" rx="10" transform="translate(0 70)"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M55.4 99.7v3.9h-7.6V125H43v-21.4h-7.7v-3.9h20zm10.2 7c1 0 2.1.2 3 .6a6.8 6.8 0 014.1 4.1 9.6 9.6 0 01.6 4.3l-.2.5-.3.3H61.3c0 2 .6 3.3 1.4 4.1.9.9 2 1.3 3.5 1.3a6 6 0 001.8-.2l1.3-.6 1-.5.8-.3c.2 0 .3 0 .5.2l.3.2 1.3 1.6c-.5.6-1 1-1.6 1.4a9 9 0 01-3.9 1.4l-2 .2c-1.2 0-2.3-.2-3.4-.7-1-.4-2-1-2.8-1.8a8.6 8.6 0 01-1.9-3 11.6 11.6 0 010-7.6c.3-1.1.9-2 1.6-2.8a8 8 0 012.7-2 9 9 0 013.7-.6zm0 3.2a4 4 0 00-3 1c-.6.7-1 1.8-1.3 3h8.1c0-.5 0-1-.2-1.5-.1-.5-.4-1-.7-1.3-.3-.4-.7-.7-1.2-1a4 4 0 00-1.7-.2zm15.5 5.8l-5.9-8.7h4.2c.3 0 .5 0 .7.2l.4.4 3.7 6a4.9 4.9 0 01.6-1.2l3-4.7.4-.5.6-.2h4l-6 8.5L93 125h-4.2c-.3 0-.5 0-.7-.2l-.5-.6-3.8-6.3-.4 1.1-3.4 5.2-.5.5a1 1 0 01-.7.3H75l6-9.3zm20.5 9.6c-1.5 0-2.7-.5-3.5-1.3a5 5 0 01-1.3-3.7v-10H95c-.3 0-.5 0-.6-.2-.2-.2-.3-.4-.3-.7v-1.7l2.9-.5 1-5c0-.1 0-.3.2-.5l.7-.2h2.2v5.7h4.7v3h-4.7v9.8c0 .6.2 1 .4 1.3.3.3.7.5 1.2.5l.6-.1a3.7 3.7 0 00.9-.4l.3-.1.3.1.3.3 1.2 2c-.6.6-1.3 1-2.1 1.3a8 8 0 01-2.6.4z"/>
|
||||
<rect width="145" height="80" x="2.5" y="2.5" fill="#D7CCF4" stroke="#8978B5" stroke-width="5" rx="10" transform="translate(1005 70)"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M1050.3 101.5a58.8 58.8 0 016.8-.4c2.2 0 4 .4 5.4 1 1.4.6 2.5 1.5 3.4 2.6a10 10 0 011.7 4 23.2 23.2 0 010 9.6c-.3 1.5-1 2.9-1.8 4-.8 1.3-2 2.2-3.5 3-1.5.7-3.4 1-5.8 1a37.3 37.3 0 01-5-.1l-1.2-.2v-24.5zm7 4a15.6 15.6 0 00-2.3 0V122h.5a158 158 0 001.6.1 6 6 0 003.2-.7c.8-.5 1.4-1.2 1.8-2 .4-.8.7-1.8.8-2.8a27.3 27.3 0 000-5.8 8 8 0 00-.7-2.6c-.4-.8-1-1.5-1.8-2-.7-.5-1.8-.8-3.1-.8zm13.4 11.8c0-1.5.2-2.8.7-4a8 8 0 014.8-4.7c1.1-.4 2.4-.6 3.8-.6 1.5 0 2.8.2 4 .7 1 .4 2 1 2.9 1.8.8.9 1.4 1.8 1.8 3 .4 1.1.6 2.4.6 3.7 0 1.5-.2 2.8-.7 4a8 8 0 01-4.8 4.7c-1.1.4-2.4.6-3.8.6a11 11 0 01-4-.7c-1-.4-2-1-2.9-1.8a7.9 7.9 0 01-1.8-3c-.4-1.1-.6-2.4-.6-3.8zm4.7 0c0 .7.1 1.4.3 2 .2.7.5 1.3 1 1.8a4.1 4.1 0 003.3 1.5c1.4 0 2.5-.4 3.3-1.3.9-.8 1.3-2.2 1.3-4a6 6 0 00-1.2-4c-.8-1-2-1.4-3.4-1.4-.7 0-1.3 0-1.8.3-.6.2-1 .5-1.5 1-.4.4-.7 1-1 1.6-.2.7-.3 1.5-.3 2.4zm34.2 7c-1 .7-2 1.3-3.3 1.6-1.3.4-2.7.6-4 .6-1.6 0-3-.2-4.1-.7-1.2-.4-2.2-1-3-1.8a8 8 0 01-1.8-3 10.9 10.9 0 010-7.7 8.2 8.2 0 015.2-4.7 14.3 14.3 0 017.6-.2l2.6 1v6.1h-3.8v-3.2l-2.2-.3c-.7 0-1.3.1-2 .3a4.8 4.8 0 00-2.9 2.6c-.3.7-.5 1.4-.5 2.3 0 .8.2 1.5.4 2.1a5 5 0 002.8 2.8 8.2 8.2 0 005.6-.2l1.9-1 1.5 3.4z"/>
|
||||
<use stroke="#3AC" stroke-dasharray="5 10" stroke-width="10" mask="url(#b)" xlink:href="#a"/>
|
||||
<g transform="translate(540)">
|
||||
<rect width="95" height="50" x="2.5" y="2.5" fill="#C3E7F1" stroke="#3AC" stroke-width="5" rx="10"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M27.8 24.5h4.4l.3 1.6h.1a5.2 5.2 0 014.2-2c.7 0 1.3.1 1.8.3.6.2 1 .4 1.4.8.4.4.7 1 1 1.6.1.6.3 1.5.3 2.4V37H38v-7.1c0-1-.2-1.8-.7-2.2-.4-.5-1-.7-1.7-.7-.6 0-1.2.2-1.7.6-.5.3-.9.8-1 1.3V37h-3.3v-9.8h-1.8v-2.7zm16.9-5H50v11.6c0 1.2.2 2.1.5 2.6s.8.8 1.5.8c.5 0 1 0 1.3-.2l1-.4 1.2 2.2a15.3 15.3 0 01-1.8 1 6.1 6.1 0 01-2.3.3c-1.5 0-2.7-.4-3.5-1.3-.8-.8-1.1-1.9-1.1-3.4V22.3h-2.1v-2.7zm12.8 5h4.3L62 26h.1c.9-1.2 2.3-1.9 4.2-1.9a6 6 0 012.1.4c.7.3 1.2.6 1.7 1.1.4.6.8 1.2 1 2 .3.8.4 1.7.4 2.8 0 1-.1 2-.4 3-.3.8-.7 1.5-1.2 2.1-.6.6-1.2 1-2 1.4-.7.3-1.6.5-2.6.5-.5 0-1 0-1.5-.2-.5 0-1-.2-1.3-.3V42h-3.2V27.2h-1.9v-2.7zm8 2.4c-.7 0-1.3.2-1.8.5s-.9.8-1 1.4V34c.2.2.5.3 1 .4l1.3.2c.4 0 .9 0 1.3-.2s.7-.4 1-.8c.3-.4.6-.8.7-1.3.2-.6.3-1.2.3-2 0-1-.3-1.9-.8-2.5-.6-.6-1.2-.9-2-.9z"/>
|
||||
</g>
|
||||
<path fill="#3AC" d="M205 112.5L180 125v-25z"/>
|
||||
<path stroke="#3AC" stroke-linecap="square" stroke-width="5" d="M180 112.5h-23.1"/>
|
||||
<path fill="#3AC" d="M1000 112.5L975 125v-25z"/>
|
||||
<path stroke="#3AC" stroke-linecap="square" stroke-width="5" d="M975 112.5h-23.1"/>
|
||||
<path fill="#EAC1CC" stroke="#F03969" stroke-linejoin="round" stroke-width="3.8" d="M230 75h135l23.5 43.4L365 160H230l23.5-41.5z"/>
|
||||
<path fill="#F2D7B2" stroke="#F0A439" stroke-linejoin="round" stroke-width="3.8" d="M395 75h135l23.5 43.4L530 160H395l23.5-41.5z"/>
|
||||
<path fill="#F2E7A6" stroke="#CDB217" stroke-linejoin="round" stroke-width="3.8" d="M515 75h135l23.5 43.4L650 160H515l23.5-41.5z"/>
|
||||
<path fill="#D7E99A" stroke="#B2D73A" stroke-linejoin="round" stroke-width="3.8" d="M640 75h135l23.5 43.4L775 160H640l23.5-41.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" stroke-linejoin="round" stroke-width="3.8" d="M765 75h135l23.5 43.4L900 160H765l23.5-41.5z"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M265.9 125.2c-1.1 0-2-.3-2.6-1-.6-.6-.9-1.4-.9-2.5v-7.2h-1.3c-.2 0-.3 0-.4-.2-.2 0-.2-.2-.2-.5v-1.2l2-.3.7-3.5.2-.4.5-.2h1.6v4h3.4v2.3h-3.4v7c0 .3 0 .6.3.9.2.2.5.3.8.3h.5a2.6 2.6 0 00.6-.3l.2-.1h.2l.2.3 1 1.5-1.6.8-1.8.3zm10.9-13.2c1 0 1.8.1 2.6.4a5.6 5.6 0 013.3 3.4c.3.8.4 1.8.4 2.8 0 1-.1 1.9-.4 2.7a5.5 5.5 0 01-3.3 3.4 7 7 0 01-2.6.5 7 7 0 01-2.6-.5 5.6 5.6 0 01-3.3-3.4 7.8 7.8 0 010-5.5c.3-.8.7-1.5 1.3-2 .5-.6 1.2-1 2-1.4a7 7 0 012.6-.4zm0 10.8c1 0 1.9-.3 2.4-1 .5-.8.7-1.8.7-3.2 0-1.4-.2-2.4-.7-3.2-.5-.7-1.3-1-2.4-1-1 0-1.9.3-2.4 1-.5.8-.8 1.8-.8 3.2 0 1.4.3 2.4.8 3.1.5.8 1.3 1.1 2.4 1.1zm11.9-16.4v10.7h.5l.5-.1.4-.3 3.2-4 .4-.4.7-.1h2.8l-4 4.7-.4.5-.5.4.4.4.4.6 4.3 6.2h-2.8l-.6-.1c-.2-.1-.3-.2-.4-.5l-3.3-4.8a1 1 0 00-.4-.4h-1.2v5.8h-3.1v-18.6h3zm16 5.6c.7 0 1.5.1 2.2.4a4.9 4.9 0 012.9 3 6.9 6.9 0 01.3 3v.3l-.3.2h-8.3c.1 1.4.5 2.4 1.1 3 .6.6 1.4.9 2.4.9.6 0 1 0 1.3-.2a22 22 0 001.7-.8l.6-.1h.3l.3.3.9 1c-.4.5-.8.8-1.2 1a6.4 6.4 0 01-2.7 1c-.5.2-1 .2-1.4.2-1 0-1.7-.2-2.5-.5s-1.4-.7-2-1.3c-.6-.5-1-1.3-1.4-2.1a8.3 8.3 0 010-5.5 5.7 5.7 0 013.2-3.4c.7-.3 1.6-.4 2.5-.4zm0 2.2c-1 0-1.6.2-2.1.8-.5.5-.9 1.2-1 2.1h5.8c0-.4 0-.8-.2-1.1 0-.4-.2-.7-.5-1l-.8-.6-1.2-.2zm8 10.8v-12.8h1.9c.4 0 .6.2.8.5l.2 1a7 7 0 011.7-1.2 4.6 4.6 0 012.2-.5c.7 0 1.4 0 1.9.3l1.4 1 .8 1.6c.2.6.3 1.2.3 2v8.1h-3.1v-8.2c0-.7-.2-1.4-.6-1.8-.3-.4-.9-.6-1.6-.6l-1.5.3c-.5.3-1 .6-1.3 1v9.3h-3.1zm17.5-12.8V125H327v-12.8h3zm.4-3.8l-.1.8a2 2 0 01-1 1 2 2 0 01-2.2-.4 2 2 0 01-.4-.6l-.2-.8a2 2 0 01.6-1.4 2 2 0 011.3-.5l.8.1a2 2 0 011 1l.3.8zm12.3 5v.7l-.3.5-6.2 8h6.4v2.4h-10v-1.3l.2-.5c0-.2.1-.4.3-.5l6.1-8.2h-6.2v-2.3h9.8v1.3zm7.8-1.4c.8 0 1.6.1 2.2.4a4.9 4.9 0 013 3 6.9 6.9 0 01.3 3v.3l-.3.2h-8.3c.1 1.4.5 2.4 1 3 .7.6 1.5.9 2.5.9.5 0 1 0 1.3-.2a22 22 0 001.7-.8l.6-.1h.3l.3.3.8 1c-.3.5-.7.8-1.1 1a6.4 6.4 0 01-2.7 1c-.5.2-1 .2-1.4.2-1 0-1.8-.2-2.5-.5-.8-.3-1.5-.7-2-1.3-.6-.5-1-1.3-1.4-2.1a8.3 8.3 0 010-5.5 5.7 5.7 0 013.2-3.4c.7-.3 1.6-.4 2.5-.4zm0 2.2c-.8 0-1.5.2-2 
.8-.5.5-.9 1.2-1 2.1h5.8c0-.4 0-.8-.2-1.1 0-.4-.2-.7-.5-1l-.8-.6-1.2-.2zm8 10.8v-12.8h1.9l.6.1c.2.2.3.4.3.7l.2 1.5a6 6 0 011.6-1.9c.6-.4 1.3-.7 2-.7s1.2.2 1.6.5l-.3 2.3-.2.3-.3.1h-.6l-.8-.2c-.7 0-1.2.2-1.7.6a4 4 0 00-1.1 1.5v8h-3.1z"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M440.9 125.2c-1.1 0-2-.3-2.6-1-.6-.6-.9-1.4-.9-2.5v-7.2h-1.3c-.2 0-.3 0-.4-.2-.2 0-.2-.2-.2-.5v-1.2l2-.3.7-3.5.2-.4.5-.2h1.6v4h3.4v2.3h-3.4v7c0 .3 0 .6.3.9.2.2.5.3.8.3h.5a2.6 2.6 0 00.6-.3l.2-.1h.2l.2.3 1 1.5-1.6.8-1.8.3zm15.5-.2H455l-.7-.1c-.2-.1-.3-.3-.4-.6l-.3-.9a10.6 10.6 0 01-1.9 1.3 5 5 0 01-1 .4 6.4 6.4 0 01-2.8-.1l-1.2-.7a3 3 0 01-.7-1c-.2-.5-.3-1-.3-1.6 0-.5.1-1 .4-1.4.2-.5.6-.9 1.2-1.3s1.4-.7 2.4-1c1-.2 2.2-.3 3.7-.3v-.8c0-.9-.2-1.5-.6-2-.3-.3-.9-.5-1.6-.5a3.8 3.8 0 00-2 .5l-.8.4c-.2.2-.4.2-.6.2-.3 0-.4 0-.6-.2l-.3-.3-.6-1c1.5-1.4 3.2-2 5.3-2 .8 0 1.4 0 2 .3a4.3 4.3 0 012.5 2.6c.2.6.3 1.3.3 2v8.1zm-6-2h.9a3.3 3.3 0 001.4-.7l.7-.6v-2.2c-1 0-1.7.1-2.3.3a6 6 0 00-1.5.4l-.7.6c-.2.2-.3.5-.3.8 0 .5.2.9.5 1.1.3.3.8.4 1.3.4zm13.5-11l1.5.1 1.3.5h3.7v1.2l-.1.4-.6.2-1.1.3a4 4 0 01.3 1.4 3.8 3.8 0 01-1.5 3c-.4.4-1 .7-1.6.9a6.5 6.5 0 01-3.4.1c-.4.3-.6.5-.6.8 0 .3.2.5.4.6l1 .3h1.3a27.5 27.5 0 013 .3l1.3.5 1 1c.2.3.3.8.3 1.5 0 .5-.2 1-.4 1.6-.3.5-.7 1-1.2 1.4-.6.4-1.2.8-2 1a10.1 10.1 0 01-5.2.1 6 6 0 01-1.7-.7c-.5-.3-.9-.7-1-1.1-.3-.4-.4-.8-.4-1.3 0-.6.1-1 .5-1.5.4-.4.9-.7 1.5-1-.3-.1-.5-.4-.7-.7a2 2 0 01-.3-1.1v-.6l.4-.6.5-.6.8-.5a3.7 3.7 0 01-2-3.5 3.8 3.8 0 011.3-3l1.6-.8c.6-.2 1.3-.3 2-.3zm3.3 13.6c0-.3 0-.5-.2-.6-.1-.2-.3-.3-.6-.4l-1-.2a16.7 16.7 0 00-2.2-.2H462c-.4.1-.6.4-.8.6-.3.3-.4.6-.4 1 0 .2 0 .4.2.6l.5.5 1 .3 1.4.1 1.5-.1c.4-.1.8-.2 1-.4l.7-.5.1-.7zm-3.3-7.3c.3 0 .7 0 1-.2l.7-.4.4-.7.1-.8a2 2 0 00-.5-1.5c-.4-.4-1-.6-1.8-.6-.7 0-1.3.2-1.7.6a2 2 0 00-.5 1.5l.1.8a1.8 1.8 0 001.2 1.1l1 .2zm12.9-6.3l1.5.1 1.4.5h3.7v1.2l-.2.4-.5.2-1.2.3a4 4 0 01.3 1.4 3.8 3.8 0 01-1.4 3c-.5.4-1 .7-1.6.9a6.5 6.5 0 01-3.4.1c-.4.3-.6.5-.6.8 0 .3 0 .5.3.6l1 .3h1.3a27.5 27.5 0 013 .3l1.3.5 1 1c.2.3.3.8.3 1.5 0 .5-.1 1-.4 1.6-.3.5-.7 1-1.2 1.4-.5.4-1.2.8-2 1a10.1 10.1 0 01-5.2.1 6 6 0 01-1.7-.7c-.5-.3-.8-.7-1-1.1-.3-.4-.4-.8-.4-1.3 0-.6.2-1 .5-1.5.4-.4 1-.7 1.6-1-.3-.1-.6-.4-.8-.7a2 2 0 01-.3-1.1l.1-.6.3-.6.6-.6.7-.5a3.7 3.7 0 01-2-3.5 3.8 3.8 
0 011.3-3c.5-.3 1-.6 1.7-.8.6-.2 1.3-.3 2-.3zm3.4 13.6c0-.3-.1-.5-.3-.6-.1-.2-.3-.3-.6-.4l-.9-.2a16.7 16.7 0 00-2.3-.2H475l-.8.6c-.2.3-.3.6-.3 1l.1.6.6.5 1 .3 1.4.1 1.5-.1 1-.4c.3-.1.5-.3.6-.5l.2-.7zm-3.4-7.3c.4 0 .7 0 1-.2.3 0 .5-.2.7-.4l.4-.7.2-.8a2 2 0 00-.6-1.5c-.4-.4-1-.6-1.7-.6-.8 0-1.3.2-1.7.6a2 2 0 00-.6 1.5c0 .3 0 .6.2.8a1.8 1.8 0 001 1.1l1 .2zm13.8-6.3c.8 0 1.5.1 2.2.4a4.9 4.9 0 013 3 6.9 6.9 0 01.3 3l-.1.3-.2.2h-8.3c0 1.4.4 2.4 1 3 .7.6 1.5.9 2.5.9.5 0 1 0 1.3-.2a22 22 0 001.7-.8l.6-.1h.3l.2.3 1 1c-.4.5-.8.8-1.2 1a6.4 6.4 0 01-2.8 1c-.4.2-1 .2-1.4.2-.9 0-1.7-.2-2.4-.5-.8-.3-1.5-.7-2-1.3-.6-.5-1-1.3-1.4-2.1a8.3 8.3 0 010-5.5 5.7 5.7 0 013.2-3.4c.7-.3 1.5-.4 2.5-.4zm0 2.2c-.9 0-1.6.2-2 .8-.6.5-.9 1.2-1 2.1h5.8c0-.4 0-.8-.2-1.1l-.5-1-.8-.6-1.3-.2zm8 10.8v-12.8h1.9l.6.1c.2.2.2.4.3.7l.2 1.5a6 6 0 011.6-1.9c.6-.4 1.3-.7 2-.7s1.2.2 1.6.5l-.4 2.3-.1.3-.4.1h-.5l-.8-.2c-.7 0-1.2.2-1.7.6a4 4 0 00-1.2 1.5v8h-3z"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M556.6 129.2v-17h2l.4.1c.2.1.3.2.3.4l.3 1.2c.5-.6 1-1 1.8-1.4a4.8 4.8 0 014.2-.1c.6.3 1.1.7 1.5 1.2a6 6 0 011 2 10.3 10.3 0 010 5.6c-.3.8-.7 1.5-1.1 2a5.1 5.1 0 01-6 1.7l-1.3-1v5.3h-3zm6-14.8c-.6 0-1.1.1-1.6.4-.4.3-.9.6-1.3 1.1v5.8a3 3 0 002.5 1.1c.5 0 .9 0 1.3-.2l1-.8c.2-.4.4-.8.5-1.4a8.6 8.6 0 000-3.8c0-.5-.2-1-.4-1.3a2 2 0 00-.9-.7c-.3-.2-.6-.2-1-.2zm18.2 10.6h-1.3l-.7-.1c-.2-.1-.3-.3-.4-.6l-.3-.9a10.6 10.6 0 01-2 1.3 5 5 0 01-1 .4 6.4 6.4 0 01-2.7-.1c-.5-.2-.9-.4-1.2-.7a3 3 0 01-.8-1c-.2-.5-.3-1-.3-1.6 0-.5.2-1 .4-1.4.3-.5.7-.9 1.3-1.3.6-.4 1.4-.7 2.4-1 1-.2 2.2-.3 3.6-.3v-.8c0-.9-.2-1.5-.5-2-.4-.3-1-.5-1.6-.5a3.8 3.8 0 00-2.1.5l-.7.4c-.2.2-.4.2-.7.2-.2 0-.4 0-.5-.2-.2 0-.3-.2-.4-.3l-.5-1c1.4-1.4 3.2-2 5.3-2a4.3 4.3 0 014.4 3c.2.5.3 1.2.3 1.9v8.1zm-6-2h1a3.3 3.3 0 001.4-.7l.6-.6v-2.2c-.9 0-1.6.1-2.2.3a6 6 0 00-1.5.4l-.8.6-.2.8c0 .5.2.9.5 1.1.3.3.7.4 1.2.4zm9 2v-12.8h1.9l.6.1c.2.2.3.4.3.7l.2 1.5a6 6 0 011.6-1.9c.6-.4 1.3-.7 2-.7s1.2.2 1.6.5l-.4 2.3-.1.3-.4.1h-.5l-.8-.2c-.7 0-1.2.2-1.7.6a4 4 0 00-1.1 1.5v8h-3.1zm17.9-10.3l-.3.3h-.8a32.9 32.9 0 00-1.4-.7h-1c-.6 0-1 0-1.4.3-.4.3-.5.6-.5 1 0 .3 0 .5.2.7l.7.5 1 .4a33 33 0 012.3.8c.4.2.8.4 1 .7.4.2.6.5.8 1l.2 1.2c0 .7 0 1.2-.3 1.7-.2.6-.6 1-1 1.4-.4.4-1 .7-1.6.9a7 7 0 01-3.5.2 7.6 7.6 0 01-2.3-.8l-.8-.7.7-1.1c0-.2.2-.3.3-.4h1a12 12 0 001.4.8l1.2.1h1l.6-.4c.1-.2.3-.3.3-.5l.1-.6c0-.3 0-.6-.2-.8l-.7-.5-1-.3a33.5 33.5 0 01-2.4-.9 4 4 0 01-1-.7 3 3 0 01-.7-1 3.7 3.7 0 011-4.2c.4-.3.9-.6 1.5-.8.6-.2 1.3-.3 2-.3 1 0 1.8.1 2.5.4.7.3 1.3.7 1.8 1.2l-.7 1zm8.6-2.7c.8 0 1.6.1 2.2.4a4.9 4.9 0 013 3 6.9 6.9 0 01.3 3v.3l-.3.2h-8.3c.1 1.4.5 2.4 1 3 .7.6 1.5.9 2.5.9.5 0 1 0 1.3-.2a22 22 0 001.7-.8l.6-.1h.3l.3.3.9 1c-.4.5-.8.8-1.2 1a6.4 6.4 0 01-2.7 1c-.5.2-1 .2-1.4.2-1 0-1.8-.2-2.5-.5-.8-.3-1.5-.7-2-1.3-.6-.5-1-1.3-1.4-2.1a8.3 8.3 0 010-5.5 5.7 5.7 0 013.2-3.4c.7-.3 1.6-.4 2.5-.4zm0 2.2c-.8 0-1.5.2-2 .8-.5.5-.9 1.2-1 2.1h5.8c0-.4 0-.8-.2-1.1 0-.4-.2-.7-.5-1l-.8-.6-1.2-.2zm8 
10.8v-12.8h1.9l.6.1c.2.2.3.4.3.7l.2 1.5a6 6 0 011.6-1.9c.6-.4 1.3-.7 2-.7s1.2.2 1.6.5l-.4 2.3-.1.3-.4.1h-.5l-.8-.2c-.7 0-1.2.2-1.7.6a4 4 0 00-1.1 1.5v8h-3.1z"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M701.6 125v-12.8h2c.3 0 .6.2.7.5l.2 1a7 7 0 011.8-1.2 4.6 4.6 0 012.2-.5c.7 0 1.3 0 1.9.3.5.3 1 .6 1.3 1 .4.5.7 1 .8 1.6.2.6.3 1.2.3 2v8.1h-3v-8.2c0-.7-.2-1.4-.6-1.8-.4-.4-1-.6-1.6-.6-.6 0-1 .1-1.5.3l-1.4 1v9.3h-3zm19.6-13c.8 0 1.5.1 2.2.4a4.9 4.9 0 012.9 3 6.9 6.9 0 01.4 3l-.1.3-.3.2H718c.2 1.4.5 2.4 1.1 3 .7.6 1.5.9 2.5.9.5 0 1 0 1.3-.2a22 22 0 001.7-.8l.5-.1h.4l.2.3.9 1c-.3.5-.7.8-1.1 1a6.4 6.4 0 01-2.8 1c-.5.2-1 .2-1.4.2-.9 0-1.7-.2-2.5-.5-.7-.3-1.4-.7-2-1.3-.5-.5-1-1.3-1.3-2.1a8.3 8.3 0 010-5.5 5.7 5.7 0 013.2-3.4c.6-.3 1.5-.4 2.5-.4zm0 2.2c-.9 0-1.6.2-2 .8-.6.5-1 1.2-1 2.1h5.7l-.1-1.1-.5-1-.9-.6-1.2-.2zm8 10.8v-12.8h1.8l.7.1.3.7.1 1.5a6 6 0 011.6-1.9c.7-.4 1.4-.7 2.1-.7.7 0 1.2.2 1.6.5l-.4 2.3c0 .1 0 .2-.2.3l-.3.1h-.5l-.9-.2c-.6 0-1.2.2-1.6.6a4 4 0 00-1.2 1.5v8h-3z"/>
|
||||
<path fill="#3D4251" fill-rule="nonzero" d="M831 123.3a2 2 0 01.5-1.3 2 2 0 011.3-.6 1.9 1.9 0 011.4.6 1.9 1.9 0 01.3 2 1.8 1.8 0 01-1 1 2 2 0 01-2-.4c-.2-.1-.3-.3-.4-.6a2 2 0 01-.2-.7zm5.5 0a2 2 0 01.6-1.3 2 2 0 011.3-.6 1.9 1.9 0 011.4.6 1.9 1.9 0 01.4 2 1.8 1.8 0 01-1 1 2 2 0 01-2-.4c-.3-.1-.4-.3-.5-.6a2 2 0 01-.2-.7zm5.7 0a2 2 0 01.5-1.3 2 2 0 011.4-.6 1.9 1.9 0 011.3.6 1.9 1.9 0 01.4 2 1.8 1.8 0 01-1 1 2 2 0 01-2-.4c-.3-.1-.4-.3-.5-.6a2 2 0 01-.1-.7z"/>
|
||||
</g>
|
||||
</svg>
|
Size: 13 KiB
305
nlp/spacy/tokenization.svg
Normal file
@@ -0,0 +1,305 @@
|
||||
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="598" height="386" viewBox="0 0 598 386">
|
||||
<defs>
|
||||
<path id="a" d="M51.3 10.9a4.3 4.3 0 01-.6-2.2c0-.6.2-1.2.5-1.9a6 6 0 011.4-1.6l.6.4.1.2-.1.3a7.5 7.5 0 00-.6 1l-.2.5a2.5 2.5 0 000 1.4l.3.8.1.2c0 .2 0 .3-.3.4l-1.2.5zm3.4 0a4.3 4.3 0 01-.7-2.2c0-.6.2-1.2.5-1.9A6 6 0 0156 5.2l.6.4h.1v.5a7.5 7.5 0 00-.7 1l-.2.5a2.5 2.5 0 000 1.4l.4.8v.2c0 .2 0 .3-.2.4l-1.2.5zm7.4 9.3H69V22h-9V6.2h2.2v14zM75 10.7a5 5 0 011.9.3 4.1 4.1 0 012.4 2.5c.2.7.3 1.4.3 2.2v.6l-.4.1h-7.5c0 .7.1 1.4.3 1.9.2.5.4 1 .7 1.3l1.1.8 1.5.2c.5 0 .9 0 1.2-.2a6 6 0 001.6-.7l.5-.2.3.2.6.7-.9.8-1 .5a6.9 6.9 0 01-4.6 0c-.7-.2-1.2-.6-1.7-1-.5-.6-.8-1.2-1.1-2a7.6 7.6 0 010-4.7 4.7 4.7 0 012.7-3c.6-.2 1.3-.3 2.1-.3zm0 1.4a3 3 0 00-2.2.8c-.5.6-.9 1.3-1 2.3h6l-.1-1.2c-.1-.4-.3-.7-.6-1-.2-.3-.5-.5-.9-.7a3 3 0 00-1.2-.2zm10.5 10c-.9 0-1.6-.2-2-.7-.5-.5-.7-1.2-.7-2v-6.9h-1.4l-.3-.1-.1-.3v-.8l1.8-.2.5-3.5.1-.3h1.3V11H88v1.4h-3.2v6.7c0 .5.1.8.4 1 .2.3.5.4.8.4l.6-.1a2.3 2.3 0 00.6-.4h.2l.3.1.6 1c-.3.3-.8.5-1.2.7l-1.5.3zm6.2-16.7a4.1 4.1 0 01.6 2.1 4 4 0 01-.5 2 6 6 0 01-1.3 1.6l-.6-.4-.2-.1v-.1l.1-.3a5.1 5.1 0 00.7-1l.2-.5a2.5 2.5 0 000-1.4l-.4-.8v-.2c0-.2 0-.3.2-.4l1.2-.5zm9.7 7.3c-.1.2-.3.2-.4.2h-.4a8.6 8.6 0 00-1.2-.6 3.4 3.4 0 00-2 0l-.6.3-.4.5-.2.7c0 .2.1.5.3.7l.6.5 1 .3a49.6 49.6 0 013 1.3l.7.8.2 1.2c0 .5 0 1-.3 1.4-.2.5-.4.8-.8 1.1a4 4 0 01-1.3.8c-.5.2-1.1.3-1.8.3a5.6 5.6 0 01-3.7-1.4l.4-.7.2-.2.4-.1.4.1.5.4.8.3 1 .2c.5 0 .8 0 1-.2.4 0 .6-.2.8-.4l.4-.6.2-.7c0-.3-.1-.5-.3-.7a2 2 0 00-.6-.6l-1-.3a68.4 68.4 0 01-2.1-.8l-1-.5-.6-.9c-.2-.3-.2-.7-.2-1.2a3 3 0 011-2.2c.3-.3.7-.6 1.2-.8a5 5 0 011.7-.2c.8 0 1.4.1 2 .3l1.5 1-.4.7z"/>
|
||||
<path id="b" d="M183.5 10.7l1.4.1 1.1.5h3v.7c0 .3 0 .4-.4.5l-1.3.2c.3.4.4 1 .4 1.6a3.4 3.4 0 01-1.2 2.6c-.3.3-.8.5-1.3.7a5.6 5.6 0 01-3.2 0l-.5.5-.2.5c0 .3.1.5.4.6.2.2.5.3.8.3l1.2.1a36.1 36.1 0 012.7.3c.5 0 .9.2 1.2.4.4.2.7.4.9.7a3 3 0 010 2.6c-.3.5-.6 1-1 1.3-.5.3-1 .6-1.7.8a8.4 8.4 0 01-4.3 0 5 5 0 01-1.6-.6c-.4-.2-.7-.6-.9-1-.2-.3-.3-.6-.3-1 0-.6.2-1 .5-1.4.4-.4.9-.7 1.5-1a2 2 0 01-.8-.5c-.2-.3-.3-.6-.3-1l.1-.5c0-.2.2-.4.3-.5a2.9 2.9 0 011-1c-.5-.2-1-.6-1.2-1.2-.3-.5-.5-1-.5-1.7a3.3 3.3 0 011.2-2.6 4 4 0 011.3-.8l1.7-.2zm3.5 12c0-.4 0-.6-.2-.8l-.7-.4-.9-.2a13.9 13.9 0 00-2.2-.1l-1.2-.1-1 .7a1.5 1.5 0 00-.2 1.7l.6.6c.3.1.6.3 1 .3l1.4.2c.6 0 1 0 1.4-.2.5 0 .8-.2 1.1-.4.3-.1.5-.3.7-.6l.2-.8zm-3.5-6.1l1-.2c.4-.1.6-.3.8-.5l.5-.7.2-.9c0-.7-.2-1.2-.7-1.6-.4-.4-1-.6-1.8-.6s-1.4.2-1.8.6c-.4.4-.6 1-.6 1.6 0 .3 0 .6.2 1a2 2 0 001.2 1c.3.2.6.3 1 .3zm12-6c.8 0 1.6.2 2.2.5a4.7 4.7 0 012.8 3c.2.7.3 1.5.3 2.3 0 .9 0 1.7-.3 2.4s-.6 1.3-1 1.8c-.6.5-1.1.9-1.8 1.2-.6.2-1.4.4-2.2.4-.8 0-1.5-.2-2.2-.4-.6-.3-1.2-.7-1.7-1.2-.4-.5-.8-1.1-1-1.8a7 7 0 01-.4-2.4c0-.8.1-1.6.4-2.3.2-.8.6-1.4 1-1.9.5-.5 1-.8 1.7-1.1.7-.3 1.4-.4 2.2-.4zm0 10c1.1 0 2-.3 2.5-1 .5-.8.8-1.8.8-3.2 0-1.3-.3-2.3-.8-3-.5-.8-1.4-1.2-2.5-1.2-.5 0-1 .1-1.4.3-.4.2-.8.5-1 .8l-.7 1.4-.2 1.7.2 1.8.6 1.3c.3.4.7.6 1 .8l1.5.3z"/>
|
||||
<path id="c" d="M250.4 22.2c-.8 0-1.5-.3-2-.8s-.7-1.2-.7-2v-6.9h-1.3l-.3-.1-.2-.3v-.8l1.9-.2.4-3.5c0-.1 0-.2.2-.3h1.3V11h3.2v1.4h-3.2v6.7c0 .5 0 .8.3 1 .2.3.5.4.9.4l.5-.1a2.3 2.3 0 00.7-.4h.2l.3.1.5 1a4.1 4.1 0 01-2.7 1zm9.4-11.5c.8 0 1.5.1 2.2.4a4.7 4.7 0 012.7 3 7.2 7.2 0 010 4.7c-.2.7-.6 1.3-1 1.8-.5.5-1 .9-1.7 1.2-.7.2-1.4.4-2.2.4-.8 0-1.6-.2-2.2-.4-.7-.3-1.2-.7-1.7-1.2s-.8-1.1-1-1.8a7 7 0 01-.4-2.4c0-.8 0-1.6.3-2.3a4.7 4.7 0 012.7-3c.7-.3 1.5-.4 2.3-.4zm0 10c1 0 2-.4 2.4-1.2.6-.7.9-1.7.9-3 0-1.4-.3-2.4-.9-3.2-.5-.7-1.3-1-2.4-1-.6 0-1 0-1.5.2l-1 .8c-.3.4-.5.8-.6 1.4l-.2 1.7c0 .7 0 1.3.2 1.8.1.5.3 1 .6 1.3.3.4.6.6 1 .8.4.2 1 .3 1.5.3z"/>
|
||||
<path id="d" d="M347.6 6.2l.5.1.3.3 9.1 11.9a7.5 7.5 0 010-1.1V6.2h1.8V22h-1a1 1 0 01-.5 0 1 1 0 01-.3-.4l-9.1-11.9a14.1 14.1 0 010 1V22h-1.9V6.2h1.1zm14.6 14.6a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1c.2 0 .4.2.5.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1zm10-5V22h-2v-6.3l-5.9-9.5h2c.1 0 .3 0 .4.2l.3.3 3.6 6.2a7.6 7.6 0 01.6 1.4 13 13 0 01.6-1.4l3.6-6.2.3-.3.4-.2h2l-5.9 9.5zm5.2 5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1c.2 0 .3.2.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.4-14.6v6.3a27.8 27.8 0 01-.2 4h-1.4a66.4 66.4 0 01-.2-4V6.2h1.8zm-2.3 14.6a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1c.2 0 .3.2.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.1-15.4a4.1 4.1 0 01.6 2.1 4 4 0 01-.5 2 6 6 0 01-1.3 1.6l-.6-.4-.2-.1v-.1l.1-.3a5.1 5.1 0 00.7-1l.2-.5a2.5 2.5 0 000-1.4l-.4-.8v-.2c0-.2 0-.3.2-.4l1.2-.5zm3.4 0a4.1 4.1 0 01.6 2.1 4 4 0 01-.5 2 6 6 0 01-1.4 1.6l-.6-.4-.1-.1v-.1-.3a5.1 5.1 0 00.7-1l.2-.5a2.5 2.5 0 000-1.4l-.4-.8v-.2c0-.2 0-.3.3-.4l1.2-.5z"/>
|
||||
<path id="e" d="M13.5 77.8c-.5-.8-.8-1.6-.8-2.5 0-.7.2-1.4.6-2 .3-.7.9-1.4 1.6-2l.8.6.2.2V72.4l-.1.2a5.7 5.7 0 00-.6.8l-.2.6-.1.7.1.8c0 .3.2.5.4.9l.1.3c0 .2-.1.4-.4.5l-1.6.6zm3.6 0c-.5-.8-.7-1.6-.7-2.5 0-.7.2-1.4.5-2 .4-.7 1-1.4 1.6-2l.9.6.1.2v.5a5.7 5.7 0 00-.7.8l-.2.6v1.5l.5.9v.3c0 .2 0 .4-.3.5l-1.7.6z"/>
|
||||
<path id="f" d="M80.8 86.8h6.8v1.7h-9V72.8h2.2v14zm12.9-9.6a5 5 0 011.8.4A4.1 4.1 0 0198 80c.2.6.3 1.3.3 2.1l-.1.6-.4.2h-7.4c0 .7.1 1.3.3 1.8.2.5.4 1 .7 1.3.3.4.7.6 1.1.8l1.5.3 1.2-.2a6 6 0 001.6-.7l.4-.2c.2 0 .3 0 .4.2l.6.7-1 .7c-.2.3-.6.4-1 .6a6.9 6.9 0 01-4.5 0c-.7-.3-1.2-.6-1.7-1.1-.5-.5-.9-1.2-1.1-1.9a7.6 7.6 0 010-4.7c.2-.7.5-1.3 1-1.8.4-.5 1-.9 1.6-1.2.6-.2 1.4-.4 2.2-.4zm0 1.5a3 3 0 00-2.2.8c-.5.5-.9 1.3-1 2.3h6l-.1-1.3-.6-1c-.2-.2-.5-.5-.9-.6a3 3 0 00-1.2-.2zm10.5 10c-.9 0-1.6-.2-2-.7-.5-.5-.8-1.2-.8-2.1V79h-1.6l-.1-.4v-.8l1.8-.2.5-3.4.1-.3.3-.1h1v3.8h3.2V79h-3.2v6.7c0 .5.1.8.3 1 .3.3.6.4 1 .4h.4a2.3 2.3 0 00.7-.4l.2-.1c.1 0 .2 0 .3.2l.6 1-1.3.7-1.4.2zm6.2-16.7a4.1 4.1 0 01.6 2 4 4 0 01-.5 2 6 6 0 01-1.4 1.6l-.6-.3V77h-.1l.1-.3a5.1 5.1 0 00.6-1l.2-.6a2.5 2.5 0 000-1.3c0-.3-.2-.6-.3-.8l-.1-.3c0-.1 0-.3.3-.4l1.2-.4zm9.6 7.2c0 .2-.2.3-.4.3l-.3-.1a8.6 8.6 0 00-1.3-.6 3.4 3.4 0 00-1.8 0l-.7.4-.5.5-.1.6c0 .3 0 .5.2.7l.7.5 1 .4a49.6 49.6 0 013 1.3c.2.2.5.5.6.8.2.3.3.7.3 1.1l-.3 1.5c-.2.4-.5.8-.8 1a4 4 0 01-1.3.8l-1.8.3a5.6 5.6 0 01-3.8-1.3l.5-.8.2-.2h.8l.5.4.7.4 1.2.1 1-.1.7-.4.4-.6.1-.7c0-.3 0-.6-.2-.8a2 2 0 00-.7-.5l-.9-.4a68.4 68.4 0 01-2.1-.7l-1-.6-.6-.8-.3-1.2a3 3 0 011-2.3l1.3-.7a5 5 0 011.7-.3c.7 0 1.4.1 2 .4.6.2 1 .5 1.5 1l-.5.6z"/>
|
||||
<path id="g" d="M182.2 77.2c.4 0 .9 0 1.3.2.4 0 .8.2 1.2.4h3v.8c0 .2-.2.4-.5.4l-1.2.2c.2.5.3 1 .3 1.6a3.4 3.4 0 01-1.1 2.6c-.4.3-.8.6-1.4.7-.5.2-1 .3-1.6.3-.6 0-1 0-1.5-.2l-.5.5c-.2.2-.2.3-.2.5s0 .4.3.6l.8.3h1.2a36.1 36.1 0 012.8.3l1.2.4.8.8c.2.3.3.7.3 1.2s0 1-.3 1.4c-.3.5-.6.9-1 1.2a7 7 0 01-3.8 1.1c-.9 0-1.6 0-2.2-.2a5 5 0 01-1.5-.6l-1-1-.2-1c0-.6.1-1.1.5-1.5.3-.4.8-.7 1.4-1a2 2 0 01-.7-.5c-.2-.2-.3-.6-.3-1v-.5l.3-.5a2.9 2.9 0 011.1-.9c-.5-.3-1-.7-1.3-1.2-.3-.5-.5-1.1-.5-1.8 0-.5.1-1 .4-1.5l.8-1.1a4 4 0 011.4-.7c.5-.2 1-.3 1.7-.3zm3.4 12c0-.3 0-.6-.2-.7-.1-.2-.4-.3-.6-.4l-1-.2a13.9 13.9 0 00-2.2-.2h-1.1c-.4.1-.8.4-1 .6a1.5 1.5 0 00-.2 1.8c.1.2.3.4.6.5.2.2.6.3 1 .4l1.4.1 1.4-.1c.4-.1.8-.2 1-.4l.7-.6c.2-.3.2-.6.2-.8zm-3.4-6.1c.4 0 .7 0 1-.2.3 0 .6-.2.8-.4l.4-.7.2-1c0-.6-.2-1.2-.6-1.6-.4-.4-1-.6-1.8-.6s-1.4.2-1.8.6a2.5 2.5 0 00-.5 2.5 2 2 0 001.2 1.2l1 .2zm12-5.9c.8 0 1.5.2 2.2.4a4.7 4.7 0 012.7 3c.2.7.4 1.5.4 2.4 0 .8-.2 1.6-.4 2.3-.2.7-.6 1.3-1 1.8-.5.5-1 1-1.7 1.2-.7.3-1.4.4-2.2.4-.8 0-1.6-.1-2.2-.4-.7-.3-1.3-.7-1.7-1.2-.5-.5-.8-1-1-1.8a7 7 0 01-.5-2.3c0-.9.2-1.7.4-2.4.3-.7.6-1.3 1-1.8.5-.5 1.1-.9 1.8-1.2.6-.2 1.4-.4 2.2-.4zm0 10c1 0 1.9-.4 2.4-1.1.6-.8.8-1.8.8-3.1s-.2-2.4-.8-3.1c-.5-.8-1.3-1.1-2.4-1.1-.6 0-1 0-1.5.3-.4.1-.7.4-1 .8-.3.3-.5.8-.6 1.3-.2.5-.2 1.1-.2 1.8 0 .6 0 1.2.2 1.8.1.5.3 1 .6 1.3l1 .8c.4.2 1 .3 1.5.3z"/>
|
||||
<path id="h" d="M249 88.7c-1 0-1.6-.2-2-.7-.6-.5-.8-1.2-.8-2.1V79h-1.6l-.2-.4v-.8l1.9-.2.4-3.4c0-.2 0-.2.2-.3l.3-.1h1v3.8h3.2V79h-3.2v6.7c0 .5 0 .8.3 1 .2.3.5.4.9.4h.5a2.3 2.3 0 00.7-.4l.2-.1c.1 0 .2 0 .3.2l.5 1-1.2.7-1.5.2zm9.3-11.5c.8 0 1.5.2 2.2.4a4.7 4.7 0 012.7 3c.3.7.4 1.5.4 2.4 0 .8-.1 1.6-.4 2.3-.2.7-.6 1.3-1 1.8-.5.5-1 1-1.7 1.2-.7.3-1.4.4-2.2.4-.8 0-1.6-.1-2.2-.4-.7-.3-1.2-.7-1.7-1.2s-.8-1-1-1.8a7 7 0 01-.4-2.3c0-.9 0-1.7.3-2.4s.6-1.3 1.1-1.8c.5-.5 1-.9 1.7-1.2.6-.2 1.4-.4 2.2-.4zm0 10c1 0 2-.4 2.4-1.1.6-.8.9-1.8.9-3.1s-.3-2.4-.9-3.1c-.5-.8-1.3-1.1-2.4-1.1-.6 0-1 0-1.5.3-.4.1-.7.4-1 .8-.3.3-.5.8-.6 1.3L255 83c0 .6 0 1.2.2 1.8.1.5.3 1 .6 1.3l1 .8c.4.2 1 .3 1.5.3z"/>
|
||||
<path id="i" d="M347.2 72.8h.5l.3.3 9.1 12a7.5 7.5 0 010-1.2V72.8h1.8v15.7h-1a1 1 0 01-.5 0 1 1 0 01-.3-.3l-9.1-12a14.1 14.1 0 010 1.1v11.2h-1.9V72.8h1.1zm14.6 14.5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4c.2 0 .4 0 .5.2.2 0 .3.1.5.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1zm10-5v6.2h-2v-6.2l-5.9-9.5h2l.4.1.2.4 3.7 6.1a7.6 7.6 0 01.6 1.4 13 13 0 01.6-1.4l3.6-6.1.3-.4.4-.1h2l-5.9 9.5zm5.2 5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4c.1 0 .3 0 .5.2.2 0 .3.1.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.4-14.5V79a27.8 27.8 0 01-.3 4h-1.3a66.4 66.4 0 01-.3-4v-6.3h2zm-2.3 14.5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4c.1 0 .3 0 .5.2l.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.1-15.3a4.1 4.1 0 01.6 2 4 4 0 01-.5 2 6 6 0 01-1.3 1.6l-.6-.3-.2-.2.1-.3a5.1 5.1 0 00.7-1l.2-.6a2.5 2.5 0 000-1.3l-.4-.8v-.3c0-.1 0-.3.2-.4l1.2-.4zm3.3 0a4.1 4.1 0 01.7 2 4 4 0 01-.5 2 6 6 0 01-1.4 1.6l-.6-.3-.1-.2v-.3a5.1 5.1 0 00.7-1l.2-.6a2.5 2.5 0 000-1.3c0-.3-.2-.6-.4-.8v-.3c0-.1 0-.3.2-.4l1.2-.4z"/>
|
||||
<path id="j" d="M13.5 141c-.5-.7-.8-1.6-.8-2.4 0-.7.2-1.4.6-2.1.3-.7.9-1.3 1.6-1.8l.8.5.2.1v.4l-.1.2a5.7 5.7 0 00-.6.8l-.2.6-.1.6.1.8c0 .3.2.6.4 1l.1.2c0 .3-.1.4-.4.5l-1.6.7zm3.6 0c-.5-.7-.7-1.6-.7-2.4 0-.7.2-1.4.5-2.1.4-.7 1-1.3 1.6-1.8l.9.5.1.1V136a5.7 5.7 0 00-.7.8l-.2.6v1.4l.5 1v.2c0 .3 0 .4-.3.5l-1.7.7z"/>
|
||||
<path id="k" d="M60 149.4h6.3v2.4h-9.4V136h3v13.5zm12.4-9c.7 0 1.4 0 2 .3a4.3 4.3 0 012.5 2.6 6 6 0 01.4 2.7l-.1.3-.2.2h-7.3c0 1.2.4 2 1 2.6a3 3 0 002 .8c.5 0 1 0 1.2-.2l.9-.3.6-.4.5-.1h.3l.2.2.8 1-1 1a5.7 5.7 0 01-2.4.8h-1.3a6 6 0 01-2.1-.3c-.7-.3-1.3-.7-1.8-1.2s-.9-1.1-1.2-1.9a7.3 7.3 0 010-4.7c.2-.7.6-1.3 1-1.8a5 5 0 011.7-1.2c.7-.3 1.5-.4 2.3-.4zm0 1.9c-.7 0-1.3.2-1.8.7-.4.4-.7 1-.8 1.9h5v-1l-.5-.8a2 2 0 00-.8-.6l-1-.2zm10.8 9.7c-1 0-1.7-.3-2.2-.8-.6-.6-.8-1.4-.8-2.3v-6.3H79c-.1 0-.3 0-.4-.2l-.1-.4v-1l1.8-.4.6-3c0-.2 0-.3.2-.4l.4-.1h1.4v3.5h3v2h-3v6c0 .4 0 .7.2 1l.8.2h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.3-.9.6-1.4.7a5 5 0 01-1.6.3z"/>
|
||||
<path id="l" d="M183 140.3c.4 0 .9 0 1.3.2.4 0 .8.2 1.2.4h3.2v1l-.1.4c0 .1-.2.2-.5.2l-1 .2a3.5 3.5 0 01.3 1.3 3.4 3.4 0 01-1.3 2.6c-.4.4-.9.6-1.4.8a5.7 5.7 0 01-3 .1c-.3.2-.5.5-.5.7 0 .3 0 .4.3.5l.8.3h1.2a24.2 24.2 0 012.6.3l1.2.4.8.8c.2.4.3.8.3 1.4 0 .5 0 1-.3 1.4l-1 1.3-1.8.9a8.9 8.9 0 01-4.5 0c-.7-.1-1.2-.3-1.6-.6l-1-1-.2-1c0-.6.1-1 .4-1.4.4-.4.8-.7 1.4-.9-.3-.1-.5-.3-.7-.6-.2-.3-.2-.6-.2-1v-.5l.3-.6.5-.5.6-.4a3.3 3.3 0 01-1.8-3 3.4 3.4 0 011.2-2.7c.4-.3 1-.5 1.5-.7a6 6 0 011.8-.3zm3 12c0-.2-.1-.4-.3-.5 0-.2-.3-.3-.5-.3a14.7 14.7 0 00-2.8-.3l-1-.1c-.4.1-.6.3-.8.6-.2.2-.3.5-.3.8 0 .2 0 .3.2.5 0 .2.2.3.4.5l.9.2 1.2.2c.5 0 1 0 1.3-.2.4 0 .7-.1 1-.3l.5-.5.1-.6zm-3-6.4l.8-.1.7-.4.3-.6.2-.7c0-.6-.2-1-.5-1.4-.4-.3-.9-.5-1.5-.5-.7 0-1.2.2-1.5.5-.4.4-.5.8-.5 1.4v.7a1.6 1.6 0 001 1l1 .1zm12.3-5.5c.8 0 1.6 0 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3 6.2 6.2 0 01-4.6 0 5 5 0 01-2.8-3c-.3-.7-.4-1.6-.4-2.4 0-1 0-1.7.4-2.5.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.1c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.2-.2-2.2-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.6-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.7.5.7 1.2 1 2.2 1z"/>
|
||||
<path id="m" d="M251.3 152c-1 0-1.7-.3-2.2-.8-.6-.6-.8-1.4-.8-2.3v-6.3H247c-.1 0-.2 0-.3-.2l-.2-.4v-1l1.8-.4.6-3c0-.2 0-.3.2-.4l.4-.1h1.4v3.5h3v2h-3v6c0 .4 0 .7.3 1l.7.2h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.3-.9.6-1.4.7a5 5 0 01-1.6.3zm9.7-11.6c.8 0 1.6 0 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3 6.2 6.2 0 01-4.6 0 5 5 0 01-2.8-3c-.3-.7-.4-1.6-.4-2.4 0-1 0-1.7.4-2.5.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.1c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.2-.2-2.2-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.6-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.7.5.7 1.2 1 2.2 1z"/>
|
||||
<path id="n" d="M347.7 136l.5.1.3.3 9.1 11.9a7.5 7.5 0 010-1V136h1.8v15.7h-1a1 1 0 01-.5 0 1 1 0 01-.3-.4l-9.1-11.8a14.1 14.1 0 010 1v11.2h-1.9v-15.7h1.1zm14.6 14.6a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1.5.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1zm10-5v6.2h-2v-6.3l-5.9-9.4h2l.4.1.2.4 3.7 6a7.6 7.6 0 01.6 1.5 13 13 0 01.6-1.4l3.6-6.1c0-.2.2-.3.3-.4l.4-.1h2l-5.9 9.4zm5.2 5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.4-14.5v6.2a27.8 27.8 0 01-.3 4h-1.3a66.4 66.4 0 01-.3-4v-6.2h2zm-2.3 14.5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm8.1-15.4a4.1 4.1 0 01.6 2.1 4 4 0 01-.5 2 6 6 0 01-1.3 1.6l-.6-.4h-.2v-.2l.1-.3a5.1 5.1 0 00.7-1l.2-.5a2.5 2.5 0 000-1.4l-.4-.8v-.2c0-.2 0-.3.2-.4l1.2-.5zm3.3 0a4.1 4.1 0 01.7 2.1 4 4 0 01-.5 2 6 6 0 01-1.4 1.6l-.6-.4h-.1v-.2-.3a5.1 5.1 0 00.7-1l.2-.5a2.5 2.5 0 000-1.4l-.4-.8v-.2c0-.2 0-.3.2-.4l1.2-.5z"/>
|
||||
<path id="o" d="M127 135l.6 1.2a4.3 4.3 0 01-.4 3.3c-.4.7-.9 1.3-1.6 1.9l-.8-.5-.2-.2v-.2l.1-.3a6.5 6.5 0 00.6-.9 2.9 2.9 0 00.3-1.2l-.1-.8c0-.3-.2-.6-.4-.9l-.1-.3c0-.2.1-.4.4-.5l1.6-.6zm9.7 7.7l-.2.3h-.3a29 29 0 00-1.6-.6h-1a2 2 0 00-1.2.3 1 1 0 00-.5.9l.3.6.6.4.9.3a29 29 0 012 .8l.9.5.6.9c.2.3.3.7.3 1.1 0 .6-.1 1-.3 1.5-.2.5-.5.9-.9 1.2a4 4 0 01-1.4.8 6.1 6.1 0 01-3 .2 6.7 6.7 0 01-2.1-.7l-.8-.6.7-1c0-.2.1-.2.3-.3l.4-.1.4.1a10.6 10.6 0 001.3.6l1 .2.8-.1.6-.3.3-.5.1-.5c0-.2 0-.5-.2-.6a2 2 0 00-.6-.5 29.4 29.4 0 01-3-1l-.9-.6-.6-1c-.2-.3-.2-.7-.2-1.2a3.3 3.3 0 011-2.4 4 4 0 011.4-.8c.5-.2 1.1-.2 1.8-.2a4.8 4.8 0 013.7 1.4l-.6 1z"/>
|
||||
<path id="p" d="M13.5 208.7c-.5-.8-.8-1.6-.8-2.5 0-.7.2-1.4.6-2 .3-.7.9-1.3 1.6-2l.8.6.2.2V203.3l-.1.2a5.7 5.7 0 00-.6.9l-.2.5-.1.7.1.8c0 .3.2.6.4.9l.1.3c0 .2-.1.4-.4.5l-1.6.6zm3.6 0c-.5-.8-.7-1.6-.7-2.5 0-.7.2-1.4.5-2 .4-.7 1-1.3 1.6-2l.9.6.1.2v.5a5.7 5.7 0 00-.7.9l-.2.5v1.5l.5.9v.3c0 .2 0 .4-.3.5l-1.7.6z"/>
|
||||
<path id="q" d="M60 217h6.3v2.5h-9.4v-16h3V217zm12.4-9c.7 0 1.4.1 2 .3a4.3 4.3 0 012.5 2.6 6 6 0 01.4 2.7l-.1.3-.2.2h-7.3c0 1.2.4 2 1 2.6a3 3 0 002 .8l1.2-.1.9-.4.6-.3.5-.2h.3l.2.3.8 1-1 .9a5.7 5.7 0 01-2.4.8l-1.3.1a6 6 0 01-2.1-.4c-.7-.2-1.3-.6-1.8-1.1-.5-.5-.9-1.2-1.2-2a7.3 7.3 0 010-4.7c.2-.7.6-1.3 1-1.8a5 5 0 011.7-1.2c.7-.3 1.5-.4 2.3-.4zm0 2c-.7 0-1.3.2-1.8.6-.4.5-.7 1-.8 2h5v-1l-.5-.9a2 2 0 00-.8-.6l-1-.2zm10.8 9.6c-1 0-1.7-.2-2.2-.8-.6-.6-.8-1.3-.8-2.3v-6.3H79l-.4-.1-.1-.5v-1l1.8-.3.6-3.1c0-.2 0-.3.2-.4H82.9v3.5h3v1.9h-3v6.1c0 .4 0 .6.2.8.2.2.5.3.8.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.4-.9.6-1.4.8a5 5 0 01-1.6.2z"/>
|
||||
<path id="r" d="M185.5 208l1.3.1 1.2.5h3.2v1l-.1.4-.5.2-1 .1a3.5 3.5 0 01.3 1.3 3.4 3.4 0 01-1.3 2.7l-1.4.7a5.7 5.7 0 01-3 .2c-.3.2-.5.4-.5.7 0 .2 0 .4.3.5l.8.2h1.2a24.2 24.2 0 012.6.3l1.2.5c.3.2.6.4.8.8.2.3.3.8.3 1.3s0 1-.3 1.5l-1 1.2c-.6.4-1.1.7-1.8.9a8.9 8.9 0 01-4.5 0c-.7 0-1.2-.3-1.6-.6l-1-1-.2-1c0-.6.1-1 .4-1.4.4-.3.8-.6 1.4-.8l-.7-.7c-.2-.2-.2-.6-.2-1v-.5l.3-.5.5-.5.6-.4c-.6-.3-1-.8-1.3-1.3-.4-.5-.5-1-.5-1.8a3.4 3.4 0 011.2-2.6c.4-.4 1-.6 1.5-.8a6 6 0 011.8-.2zm3 12c0-.3-.1-.4-.3-.6l-.5-.3a14.7 14.7 0 00-2.8-.3l-1-.1-.8.6c-.2.2-.3.5-.3.8 0 .2 0 .4.2.5 0 .2.2.4.4.5l.9.3h2.5l1-.3.5-.5.1-.6zm-3-6.5l.8-.1c.3 0 .5-.2.7-.4l.3-.6.2-.7c0-.6-.2-1-.5-1.3-.4-.4-.9-.5-1.5-.5-.7 0-1.2.1-1.5.5-.4.3-.5.7-.5 1.3v.7a1.6 1.6 0 001 1l1 .1zm12.3-5.5c.8 0 1.6.1 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3c-.6.3-1.4.4-2.2.4-.9 0-1.7-.1-2.3-.4a5 5 0 01-3-3c-.2-.7-.3-1.5-.3-2.4 0-.9 0-1.7.4-2.4.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.2c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.5.7-1.5.7-2.7 0-1.2-.2-2.1-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.7-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.8.5.6 1.2 1 2.2 1z"/>
|
||||
<path id="s" d="M252.8 219.6c-1 0-1.7-.2-2.2-.8-.6-.6-.8-1.3-.8-2.3v-6.3h-1.2l-.3-.1-.2-.5v-1l1.8-.3.6-3.1c0-.2 0-.3.2-.4H252.5v3.5h3v1.9h-3v6.1c0 .4 0 .6.3.8.1.2.4.3.7.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.4-.9.6-1.4.8a5 5 0 01-1.6.2zm9.7-11.6c.8 0 1.6.1 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3c-.6.3-1.4.4-2.2.4-.9 0-1.7-.1-2.3-.4a5 5 0 01-3-3c-.2-.7-.3-1.5-.3-2.4 0-.9 0-1.7.4-2.4.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.2c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.5.7-1.5.7-2.7 0-1.2-.2-2.1-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.7-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.8.5.6 1.2 1 2.2 1z"/>
|
||||
<path id="t" d="M331.8 203.7h.4l.3.4 9.2 11.8a7.5 7.5 0 01-.1-1v-11.2h1.9v15.8h-1.1a1 1 0 01-.4-.1 1 1 0 01-.4-.3l-9-11.9a14.1 14.1 0 010 1v11.3h-2v-15.8h1.2zm14.6 14.5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.3h.5l.4.4a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1zm10-5v6.3h-2.1v-6.3l-5.8-9.5h1.9l.4.1.3.4 3.6 6.1a7.6 7.6 0 01.6 1.4 13 13 0 01.7-1.4l3.6-6.1.2-.3c.1-.2.3-.2.5-.2h1.9l-5.8 9.5zm5.1 5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.3h.5l.5.4a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1zm8.5-14.5v6.3a27.8 27.8 0 01-.3 4h-1.3a66.4 66.4 0 01-.3-4v-6.3h1.9zm-2.4 14.5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.3h.5l.5.4a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .3 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1z"/>
|
||||
<path id="u" d="M127 202.6l.6 1.2a4.3 4.3 0 01-.4 3.4c-.4.6-.9 1.3-1.6 1.8l-.8-.5-.2-.2v-.1l.1-.4a6.5 6.5 0 00.6-.8 2.9 2.9 0 00.3-1.3l-.1-.8c0-.3-.2-.6-.4-.9l-.1-.3c0-.2.1-.4.4-.5l1.6-.6zm9.7 7.8l-.2.2h-.3a29 29 0 00-1.6-.6h-1a2 2 0 00-1.2.3 1 1 0 00-.5.9c0 .2.1.5.3.6.1.2.3.3.6.4l.9.4a29 29 0 012 .7l.9.6c.3.2.5.5.6.8.2.3.3.7.3 1.2l-.3 1.5-.9 1.2a4 4 0 01-1.4.8 6.1 6.1 0 01-3 .1 6.7 6.7 0 01-2.1-.7l-.8-.6.7-1 .3-.3h.8a10.6 10.6 0 001.3.7l1 .1h.8l.6-.4.3-.4.1-.5c0-.3 0-.5-.2-.7a2 2 0 00-.6-.4 29.4 29.4 0 01-3-1l-.9-.7-.6-.9c-.2-.3-.2-.8-.2-1.3a3.3 3.3 0 011-2.4 4 4 0 011.4-.7 5.6 5.6 0 014 .1c.6.2 1.1.6 1.5 1l-.6 1z"/>
|
||||
<path id="v" d="M429.6 202.6l.5 1.2a4.3 4.3 0 01-.3 3.4c-.4.6-1 1.3-1.6 1.8l-.9-.5-.1-.2v-.1-.4a7.8 7.8 0 00.7-.8l.2-.6a2.5 2.5 0 000-1.5l-.5-.9v-.3c0-.2 0-.4.3-.5l1.7-.6zm3.6 0l.6 1.2a4.3 4.3 0 01-.4 3.4c-.3.6-.9 1.3-1.6 1.8l-.8-.5-.2-.2v-.1l.1-.4a7.8 7.8 0 00.6-.8l.2-.6a2.5 2.5 0 000-1.5c0-.3-.2-.6-.4-.9l-.1-.3c0-.2.1-.4.4-.5l1.6-.6z"/>
|
||||
<path id="w" d="M13.5 275.3c-.5-.9-.8-1.7-.8-2.5 0-.7.2-1.4.6-2.1.3-.7.9-1.3 1.6-1.9l.8.6.2.1v.4l-.1.1a5.7 5.7 0 00-.6.9l-.2.6-.1.6.1.8c0 .3.2.6.4 1l.1.2c0 .3-.1.4-.4.5l-1.6.7zm3.6 0c-.5-.9-.7-1.7-.7-2.5 0-.7.2-1.4.5-2.1.4-.7 1-1.3 1.6-1.9l.9.6.1.1v.5a5.7 5.7 0 00-.7.9l-.2.6v1.4l.5 1v.2c0 .3 0 .4-.3.5l-1.7.7z"/>
|
||||
<path id="x" d="M58.4 283.6h6.4v2.4h-9.4v-16h3v13.6zm12.5-9c.7 0 1.4 0 2 .3a4.3 4.3 0 012.5 2.6 6 6 0 01.4 2.7l-.1.3-.2.1-.3.1h-7c0 1.2.4 2 1 2.6a3 3 0 002 .8c.5 0 1 0 1.2-.2l.9-.3.6-.4.5-.1h.3l.2.2.8 1c-.3.4-.6.7-1 .9a5.7 5.7 0 01-2.4.9H71a6 6 0 01-2.1-.3c-.7-.3-1.3-.7-1.8-1.2s-.9-1.1-1.2-1.9a7.3 7.3 0 010-4.8c.2-.6.6-1.2 1-1.7a5 5 0 011.7-1.2c.7-.3 1.5-.5 2.3-.5zm0 1.9c-.7 0-1.3.2-1.8.7-.4.4-.7 1-.8 1.9h5v-1l-.5-.9a2 2 0 00-.8-.5l-1-.2zm10.8 9.7c-1 0-1.7-.3-2.2-.9-.6-.5-.8-1.3-.8-2.2v-6.4H77l-.1-.5V275l1.8-.3.6-3c0-.2 0-.3.2-.4l.4-.1h1.4v3.5h3v2h-3v6c0 .4 0 .7.2.9.2.2.5.3.8.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.3-.9.6-1.4.7a5 5 0 01-1.6.3z"/>
|
||||
<path id="y" d="M183 274.5c.4 0 .9 0 1.3.2.4 0 .8.2 1.2.4h3.2v1l-.1.4c0 .1-.2.2-.5.2l-1 .2a3.5 3.5 0 01.3 1.3 3.4 3.4 0 01-1.3 2.6l-1.4.8a5.7 5.7 0 01-3 .1c-.3.2-.5.4-.5.7 0 .2 0 .4.3.5l.8.2 1.2.1a24.2 24.2 0 012.6.3c.5 0 .9.2 1.2.4l.8.8c.2.4.3.8.3 1.3s0 1-.3 1.5l-1 1.3c-.6.3-1.1.6-1.8.8-.7.3-1.4.4-2.3.4-.9 0-1.6-.1-2.2-.3-.7-.1-1.2-.4-1.6-.6l-1-1-.2-1.1c0-.5.1-1 .4-1.3.4-.4.8-.7 1.4-.9l-.7-.6c-.2-.3-.2-.6-.2-1v-.5l.3-.6.5-.5.6-.4a3.3 3.3 0 01-1.8-3 3.4 3.4 0 011.2-2.7 6 6 0 013.2-1zm3 12c0-.2-.1-.4-.3-.5 0-.2-.3-.3-.5-.4a14.7 14.7 0 00-2.8-.3h-1c-.4.1-.6.3-.8.5-.2.3-.3.5-.3.8 0 .2 0 .4.2.6 0 .2.2.3.4.4.2.2.5.3.9.3l1.2.1h1.3l1-.4.5-.5.1-.6zm-3-6.4c.3 0 .6 0 .8-.2.3 0 .5-.2.7-.3l.3-.6.2-.8c0-.5-.2-1-.5-1.3-.4-.3-.9-.5-1.5-.5-.7 0-1.2.2-1.5.5-.4.3-.5.8-.5 1.3v.8a1.6 1.6 0 001 1h1zm12.3-5.6c.8 0 1.6.2 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.5 4.9 4.9 0 01-2.9 3 6.2 6.2 0 01-4.6 0 5 5 0 01-2.8-3c-.3-.8-.4-1.6-.4-2.5 0-.9 0-1.7.4-2.4.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.2c.6-.2 1.4-.4 2.3-.4zm0 9.6c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.3-.2-2.2-.7-2.8-.4-.7-1.1-1-2-1-1 0-1.7.3-2.2 1-.4.6-.6 1.5-.6 2.8 0 1.2.2 2 .6 2.7.5.7 1.2 1 2.2 1z"/>
|
||||
<path id="z" d="M251.3 286.2c-1 0-1.7-.3-2.2-.9-.6-.5-.8-1.3-.8-2.2v-6.4h-1.5l-.2-.5V275l1.8-.3.6-3c0-.2 0-.3.2-.4l.4-.1h1.4v3.5h3v2h-3v6c0 .4 0 .7.3.9.1.2.4.3.7.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.3-.9.6-1.4.7a5 5 0 01-1.6.3zm9.7-11.7c.8 0 1.6.2 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.5 4.9 4.9 0 01-2.9 3 6.2 6.2 0 01-4.6 0 5 5 0 01-2.8-3c-.3-.8-.4-1.6-.4-2.5 0-.9 0-1.7.4-2.4.2-.7.6-1.3 1-1.8a5 5 0 011.9-1.2c.6-.2 1.4-.4 2.3-.4zm0 9.6c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.3-.2-2.2-.7-2.8-.4-.7-1.1-1-2-1-1 0-1.7.3-2.2 1-.4.6-.6 1.5-.6 2.8 0 1.2.2 2 .6 2.7.5.7 1.2 1 2.2 1z"/>
|
||||
<path id="A" d="M310.4 270.2l.4.1.4.3 9 11.9a7.5 7.5 0 010-1.1v-11.2h2V286H321a1 1 0 01-.4 0 1 1 0 01-.3-.4l-9.2-11.9a14.1 14.1 0 010 1V286h-1.8v-15.8h1.1zm14.6 14.6a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1c.2 0 .3.2.5.3a1.3 1.3 0 01.3 1 1.4 1.4 0 01-.3 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.4-1zm10-5v6.2H333v-6.3l-5.8-9.5h1.9c.2 0 .3 0 .4.2.2 0 .2.2.3.3l3.6 6.2a7.6 7.6 0 01.7 1.4 13 13 0 01.6-1.4l3.6-6.2.3-.3.4-.2h2l-5.9 9.5zm5.2 5a1.4 1.4 0 01.4-1 1.3 1.3 0 011-.4l.5.1.4.3a1.3 1.3 0 01.4 1 1.4 1.4 0 01-.4 1 1.3 1.3 0 01-1 .4 1.4 1.4 0 01-1-.4 1.4 1.4 0 01-.3-1z"/>
|
||||
<path id="B" d="M126.5 269.1l.6 1.3a4.3 4.3 0 01-.4 3.3c-.4.7-.9 1.3-1.6 1.9l-.8-.6-.2-.1v-.2l.1-.3a6.5 6.5 0 00.6-.9 2.9 2.9 0 00.3-1.2l-.1-.8c0-.3-.2-.6-.4-1l-.1-.2c0-.3.1-.4.4-.5l1.6-.7zm9.7 7.8l-.2.3h-.3a29 29 0 00-1.6-.6h-1a2 2 0 00-1.2.3 1 1 0 00-.5.9l.3.6.6.4.9.3a29 29 0 012 .8l.9.5c.3.3.5.5.6.9.2.3.3.7.3 1.1 0 .6-.1 1-.3 1.5-.2.5-.5.9-.9 1.2a4 4 0 01-1.4.8 6.1 6.1 0 01-3 .2 6.7 6.7 0 01-2.1-.8l-.8-.5.7-1c0-.2.1-.3.3-.3l.4-.1.4.1a10.6 10.6 0 001.3.6l1 .2.8-.1.6-.3.3-.5.1-.5c0-.3 0-.5-.2-.6a2 2 0 00-.6-.5 29.4 29.4 0 01-3-1l-.9-.7-.6-.8c-.2-.4-.2-.8-.2-1.3a3.3 3.3 0 011-2.4 4 4 0 011.4-.8 5.6 5.6 0 014 .1c.6.3 1.1.6 1.5 1l-.6 1z"/>
|
||||
<path id="C" d="M429.6 269.1l.5 1.3a4.3 4.3 0 01-.3 3.3c-.4.7-1 1.3-1.6 1.9l-.9-.6-.1-.1v-.2-.3a7.8 7.8 0 00.7-.9l.2-.6a2.5 2.5 0 000-1.4l-.5-1v-.2c0-.3 0-.4.3-.5l1.7-.7zm3.6 0l.6 1.3a4.3 4.3 0 01-.4 3.3c-.3.7-.9 1.3-1.6 1.9l-.8-.6-.2-.1v-.2l.1-.3a7.8 7.8 0 00.6-.9l.2-.6a2.5 2.5 0 000-1.4c0-.3-.2-.6-.4-1l-.1-.2c0-.3.1-.4.4-.5l1.6-.7z"/>
|
||||
<path id="D" d="M387.8 270v6.4a19.2 19.2 0 01-.3 4h-1.9a41.8 41.8 0 01-.3-4V270h2.5zm-3 14.5a1.7 1.7 0 01.5-1.2 1.7 1.7 0 011.2-.5 1.6 1.6 0 011.2.5 1.7 1.7 0 01.3 1.9c0 .2-.2.3-.3.5l-.5.3-.7.2a1.7 1.7 0 01-1.2-.5l-.3-.5-.2-.7z"/>
|
||||
<path id="E" d="M13.5 341.8c-.5-.8-.8-1.6-.8-2.5 0-.7.2-1.4.6-2 .3-.7.9-1.4 1.6-2l.8.6.2.2V336.4l-.1.2a5.7 5.7 0 00-.6.8l-.2.6-.1.7.1.8c0 .3.2.5.4.9l.1.3c0 .2-.1.4-.4.5l-1.6.6zm3.6 0c-.5-.8-.7-1.6-.7-2.5 0-.7.2-1.4.5-2 .4-.7 1-1.4 1.6-2l.9.6.1.2v.5a5.7 5.7 0 00-.7.8l-.2.6v1.5l.5.9v.3c0 .2 0 .4-.3.5l-1.7.6z"/>
|
||||
<path id="F" d="M60 350.1h6.3v2.4h-9.4v-15.9h3v13.5zm12.4-9c.7 0 1.4.1 2 .3a4.3 4.3 0 012.5 2.6 6 6 0 01.4 2.7l-.1.3-.2.2h-7.3c0 1.2.4 2 1 2.6a3 3 0 002 .8l1.2-.1.9-.4.6-.3.5-.2h.3l.2.3.8 1-1 .8a5.7 5.7 0 01-2.4 1h-1.3a6 6 0 01-2.1-.4c-.7-.2-1.3-.6-1.8-1.1-.5-.5-.9-1.2-1.2-2a7.3 7.3 0 010-4.7c.2-.7.6-1.3 1-1.8a5 5 0 011.7-1.2c.7-.3 1.5-.4 2.3-.4zm0 2c-.7 0-1.3.2-1.8.6-.4.4-.7 1-.8 1.9h5v-1l-.5-.8a2 2 0 00-.8-.6l-1-.2zm10.8 9.6c-1 0-1.7-.3-2.2-.8-.6-.6-.8-1.3-.8-2.3v-6.3H79l-.4-.1-.1-.5v-1l1.8-.4.6-3c0-.2 0-.3.2-.4H82.9v3.5h3v1.9h-3v6.1c0 .4 0 .6.2.8.2.2.5.3.8.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.4-.9.6-1.4.8a5 5 0 01-1.6.2z"/>
|
||||
<path id="G" d="M184 341l1.3.2 1.2.4h3.2v1l-.1.5-.5.2-1 .1a3.5 3.5 0 01.3 1.3 3.4 3.4 0 01-1.3 2.7l-1.4.7a5.7 5.7 0 01-3 .1c-.3.3-.5.5-.5.8 0 .2 0 .4.3.5l.8.2h1.2a24.2 24.2 0 012.6.3l1.2.5c.3.2.6.4.8.8.2.3.3.8.3 1.3s0 1-.3 1.4c-.3.5-.6 1-1 1.3-.6.4-1.1.7-1.8.9a8.9 8.9 0 01-4.5 0c-.7-.1-1.2-.3-1.6-.6l-1-1-.2-1c0-.6.1-1 .4-1.4.4-.4.8-.6 1.4-.9-.3-.1-.5-.3-.7-.6-.2-.2-.2-.6-.2-1v-.5l.3-.5.5-.5.6-.5a3.3 3.3 0 01-1.8-3 3.4 3.4 0 011.2-2.7c.4-.3 1-.5 1.5-.7a6 6 0 011.8-.2zm3 12c0-.2-.1-.3-.3-.5l-.5-.3a14.7 14.7 0 00-2.8-.3l-1-.1-.8.6c-.2.2-.3.5-.3.8 0 .2 0 .4.2.5 0 .2.2.4.4.5l.9.3h2.5l1-.4c.2 0 .4-.3.5-.4l.1-.6zm-3-6.4l.8-.1.7-.4.3-.6.2-.7c0-.6-.2-1-.5-1.3-.4-.4-.9-.5-1.5-.5-.7 0-1.2.1-1.5.5-.4.3-.5.7-.5 1.3v.7a1.6 1.6 0 001 1l1 .1zm12.3-5.5c.8 0 1.6.1 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3c-.6.3-1.4.4-2.2.4-.9 0-1.7-.1-2.3-.4a5 5 0 01-3-3c-.2-.7-.3-1.5-.3-2.4 0-1 0-1.7.4-2.4.2-.7.6-1.4 1-1.9a5 5 0 011.9-1.1c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.2-.2-2.1-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.7-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.8.5.6 1.2 1 2.2 1z"/>
|
||||
<path id="H" d="M249.8 352.7c-1 0-1.7-.3-2.2-.8-.6-.6-.8-1.3-.8-2.3v-6.3h-1.2l-.3-.1-.2-.5v-1l1.8-.4.6-3c0-.2 0-.3.2-.4H249.5v3.5h3v1.9h-3v6.1c0 .4 0 .6.3.8.1.2.4.3.7.3h.4a2.3 2.3 0 00.5-.3h.4l.2.2.8 1.3c-.4.4-.9.6-1.4.8a5 5 0 01-1.6.2zm9.7-11.6c.8 0 1.6.1 2.2.4a5 5 0 013 3c.2.7.3 1.5.3 2.4a7 7 0 01-.4 2.4 4.9 4.9 0 01-2.9 3c-.6.3-1.4.4-2.2.4-.9 0-1.7-.1-2.3-.4a5 5 0 01-3-3c-.2-.7-.3-1.5-.3-2.4 0-1 0-1.7.4-2.4.2-.7.6-1.4 1-1.9a5 5 0 011.9-1.1c.6-.3 1.4-.4 2.3-.4zm0 9.5c.9 0 1.6-.3 2-1 .5-.6.7-1.5.7-2.7 0-1.2-.2-2.1-.7-2.8-.4-.6-1.1-1-2-1-1 0-1.7.4-2.2 1-.4.7-.6 1.6-.6 2.8 0 1.2.2 2.1.6 2.8.5.6 1.2 1 2.2 1z"/>
|
||||
<path id="I" d="M312.3 336.6h.4l.2.1.2.2.2.2 8.4 10.6a11 11 0 010-1.4v-9.7h2.5v16h-1.5c-.2 0-.4 0-.6-.2-.2 0-.3-.2-.4-.4l-8.4-10.6a15.3 15.3 0 01.1 1.4v9.7h-2.6v-15.9h1.5zm14.3 14.4a1.7 1.7 0 01.5-1.1 1.7 1.7 0 011.2-.5 1.6 1.6 0 011.2.5 1.7 1.7 0 01.3 1.8c0 .2-.2.4-.3.5l-.6.4-.6.1a1.7 1.7 0 01-1.2-.5c-.1-.1-.3-.3-.3-.5l-.2-.7zm11-4.6v6.1h-3v-6.1l-5.7-9.8h2.6l.6.2.4.5 2.9 5.3a13.3 13.3 0 01.8 1.7 12 12 0 01.7-1.7l2.9-5.3c0-.2.2-.3.4-.5l.6-.2h2.6l-5.8 9.8zm4.7 4.6a1.7 1.7 0 01.5-1.1 1.7 1.7 0 011.2-.5 1.6 1.6 0 011.1.5 1.7 1.7 0 01.4 1.8c0 .2-.2.4-.4.5l-.5.4-.6.1a1.7 1.7 0 01-1.2-.5l-.4-.5-.1-.7z"/>
|
||||
<path id="J" d="M128 335.7l.6 1.2a4.3 4.3 0 01-.4 3.4c-.4.6-.9 1.2-1.6 1.8l-.8-.5-.2-.2v-.2l.1-.3a6.5 6.5 0 00.6-.9 2.9 2.9 0 00.3-1.2l-.1-.8c0-.3-.2-.6-.4-.9l-.1-.3c0-.2.1-.4.4-.5l1.6-.6zm9.7 7.8l-.2.2h-.3a29 29 0 00-1.6-.6h-1a2 2 0 00-1.2.3 1 1 0 00-.5.9l.3.6c.1.2.3.3.6.4l.9.4a29 29 0 012 .7l.9.6c.3.2.5.5.6.8.2.3.3.7.3 1.2l-.3 1.5-.9 1.2a4 4 0 01-1.4.7 6.1 6.1 0 01-3 .2 6.7 6.7 0 01-2.1-.7l-.8-.6.7-1 .3-.3h.8a10.6 10.6 0 001.3.7l1 .1.8-.1.6-.3.3-.4.1-.5c0-.3 0-.5-.2-.7a2 2 0 00-.6-.4 29.4 29.4 0 01-3-1l-.9-.7-.6-.9c-.2-.3-.2-.8-.2-1.3a3.3 3.3 0 011-2.4 4 4 0 011.4-.7 5.6 5.6 0 014 .1c.6.2 1.1.6 1.5 1l-.6 1z"/>
|
||||
<path id="K" d="M429.6 335.7l.5 1.2a4.3 4.3 0 01-.3 3.4l-1.6 1.8-.9-.5-.1-.2v-.2-.3a7.8 7.8 0 00.7-.9l.2-.5a2.5 2.5 0 000-1.5l-.5-.9v-.3c0-.2 0-.4.3-.5l1.7-.6zm3.6 0l.6 1.2a4.3 4.3 0 01-.4 3.4c-.3.6-.9 1.2-1.6 1.8l-.8-.5-.2-.2v-.2l.1-.3a7.8 7.8 0 00.6-.9l.2-.5a2.5 2.5 0 000-1.5c0-.3-.2-.6-.4-.9l-.1-.3c0-.2.1-.4.4-.5l1.6-.6z"/>
|
||||
<path id="L" d="M387.8 336.6v6.3a19.2 19.2 0 01-.3 4h-1.9a41.8 41.8 0 01-.3-4v-6.3h2.5zm-3 14.4a1.7 1.7 0 01.5-1.1 1.7 1.7 0 011.2-.5 1.6 1.6 0 011.2.5 1.7 1.7 0 01.3 1.8c0 .2-.2.4-.3.5l-.5.4-.7.1a1.7 1.7 0 01-1.2-.5l-.3-.5-.2-.7z"/>
|
||||
<path id="M" d="M16.4 11.3V15H14V4h3.9c.7 0 1.4.2 2 .3l1.3.8c.4.3.6.7.8 1.1l.3 1.4c0 .6-.1 1-.3 1.5a3 3 0 01-.8 1.2c-.4.3-.8.6-1.4.8-.5.2-1.2.2-2 .2h-1.3zm0-1.9h1.4c.6 0 1-.1 1.4-.4.3-.4.4-.8.4-1.4l-.1-.6a1.4 1.4 0 00-1-1H16.5v3.4zM26 11v4h-2.5V4H27c.8 0 1.5.2 2 .3l1.4.7c.4.3.6.6.8 1l.2 1.3-.1 1a3 3 0 01-1.1 1.6l-1 .5.5.3.4.5 2.3 3.8h-2.3c-.4 0-.7-.2-.9-.5l-1.8-3.2-.3-.3a1 1 0 00-.5 0H26zm0-1.8h1l1-.1.5-.4c.2-.1.3-.3.3-.5l.1-.7c0-.5-.1-.9-.4-1.1-.3-.3-.8-.4-1.5-.4h-1v3.2zm14.5-5.1v2H36v2.5h3.4v1.8H36v2.7h4.5V15h-7V4h7zM49 4v2h-4.5v2.7h3.7v2h-3.7V15h-2.6V4h7zM53 15h-2.5V4H53v11zm4.8-5.6L54.4 4h2.9l.2.3L59.7 8v-.2l.2-.1 1.9-3.3c0-.2.3-.3.5-.3h2.4l-3.4 5.2 3.5 5.7h-2.6l-.4-.1a1 1 0 01-.2-.3l-2.2-3.8-.1.3-2 3.5-.3.3-.4.1h-2.4l3.6-5.6z"/>
|
||||
<path id="N" d="M20 136.3l-.2.3h-.4-.3a67.9 67.9 0 00-1-.5l-.8-.2c-.5 0-.8.1-1 .4a1 1 0 00-.4.8c0 .2 0 .4.2.5.1.2.3.3.6.4l.7.3a19.6 19.6 0 011.8.7l.8.5a2.6 2.6 0 01.8 2c0 .5-.1 1-.3 1.4a3.3 3.3 0 01-2 2l-1.6.2a5.3 5.3 0 01-2.1-.4 6 6 0 01-1-.4 4 4 0 01-.7-.6l.8-1.2s0-.2.2-.2l.3-.1.5.1a32.8 32.8 0 001.1.7l1 .1c.4 0 .7 0 1-.3.3-.2.4-.5.4-1 0-.2 0-.4-.2-.6l-.6-.4a23.2 23.2 0 01-2.6-.9l-.7-.5-.6-1a3.5 3.5 0 010-2.4l.8-1c.3-.3.7-.6 1.2-.8.4-.2 1-.2 1.6-.2a6 6 0 011.9.3 5 5 0 011.4.8l-.6 1.2zm6.6 6.7c.3 0 .6 0 .9-.2.3 0 .5-.2.7-.5l.4-.7.1-1V134h2.6v6.4a5 5 0 01-.4 1.9 4.1 4.1 0 01-2.4 2.4c-.5.2-1.2.3-2 .3-.7 0-1.3 0-1.9-.3-.6-.2-1-.6-1.5-1-.4-.4-.7-.8-.9-1.4-.2-.6-.3-1.2-.3-1.9v-6.4h2.5v6.4c0 .4 0 .8.2 1 0 .4.2.6.4.8.2.3.4.4.7.5l.9.2zm13.4-9v2h-4.5v2.8h3.7v2h-3.7v4.2h-2.6v-11h7zm8.3 0v2h-4.5v2.8h3.8v2h-3.8v4.2h-2.5v-11h7zm4.1 11H50v-11h2.5v11zm4.7-5.6l-3.4-5.3h2.9l.2.3L59 138l.1-.2.1-.1 2-3.3c0-.2.2-.3.4-.3h2.5l-3.5 5.2 3.5 5.7h-2.5l-.4-.1a1 1 0 01-.2-.3l-2.2-3.8-.2.3-2 3.5-.3.3-.3.1h-2.4l3.5-5.6z"/>
|
||||
<path id="O" d="M20 201.3l-.2.3h-.4-.3a67.9 67.9 0 00-1-.5l-.8-.2c-.5 0-.8.1-1 .4a1 1 0 00-.4.8c0 .2 0 .4.2.5.1.2.3.3.6.4l.7.3a19.6 19.6 0 011.8.7l.8.5a2.6 2.6 0 01.8 2c0 .5-.1 1-.3 1.4a3.3 3.3 0 01-2 2l-1.6.2a5.3 5.3 0 01-2.1-.4 6 6 0 01-1-.4 4 4 0 01-.7-.6l.8-1.2s0-.2.2-.2l.3-.1.5.1a32.8 32.8 0 001.1.7l1 .1c.4 0 .7 0 1-.3.3-.2.4-.5.4-1 0-.2 0-.4-.2-.6l-.6-.4a23.2 23.2 0 01-2.6-.9l-.7-.5-.6-1a3.5 3.5 0 010-2.4l.8-1c.3-.3.7-.6 1.2-.8.4-.2 1-.2 1.6-.2a6 6 0 011.9.3 5 5 0 011.4.8l-.6 1.2zm6.6 6.7c.3 0 .6 0 .9-.2.3 0 .5-.2.7-.5l.4-.7.1-1V199h2.6v6.4a5 5 0 01-.4 1.9 4.1 4.1 0 01-2.4 2.4c-.5.2-1.2.3-2 .3-.7 0-1.3 0-1.9-.3-.6-.2-1-.6-1.5-1-.4-.4-.7-.8-.9-1.4-.2-.6-.3-1.2-.3-1.9v-6.4h2.5v6.4c0 .4 0 .8.2 1 0 .4.2.6.4.8.2.3.4.4.7.5l.9.2zm13.4-9v2h-4.5v2.8h3.7v2h-3.7v4.2h-2.6v-11h7zm8.3 0v2h-4.5v2.8h3.8v2h-3.8v4.2h-2.5v-11h7zm4.1 11H50v-11h2.5v11zm4.7-5.6l-3.4-5.3h2.9l.2.3L59 203l.1-.2.1-.1 2-3.3c0-.2.2-.3.4-.3h2.5l-3.5 5.2 3.5 5.7h-2.5l-.4-.1a1 1 0 01-.2-.3l-2.2-3.8-.2.3-2 3.5-.3.3-.3.1h-2.4l3.5-5.6z"/>
|
||||
<path id="P" d="M8 264v2H3.4v2.6h3.4v1.8H3.4v2.7H8v1.9H1v-11h7zm4 5.4L8.8 264h2.9l.2.3L14 268v-.2l.2-.1 1.9-3.3c0-.2.3-.3.5-.3H19l-3.4 5.2L19 275h-2.6l-.4-.1a1 1 0 01-.2-.3l-2.2-3.8-.1.3-2 3.5-.3.3-.4.1H8.6l3.5-5.6zm15.2 2.8h.2l.1.1 1 1c-.4.7-1 1-1.6 1.4-.7.3-1.5.4-2.4.4-.8 0-1.5-.1-2.2-.4a4.8 4.8 0 01-2.7-3 6.5 6.5 0 010-4.4 5.2 5.2 0 013-3 6.3 6.3 0 014.5 0 4.8 4.8 0 011.6 1.1l-.9 1.2-.2.1-.3.1H27l-.2-.2a45 45 0 00-1.2-.5 3.5 3.5 0 00-2 .2l-1 .7-.6 1-.2 1.5c0 .6 0 1 .2 1.5l.6 1.1a2.6 2.6 0 002 1l.7-.1a2.6 2.6 0 001.5-.7h.2l.2-.1zm9.5-8.1v2h-4.5v2.5h3.5v1.8h-3.5v2.7h4.5v1.9h-7v-11h7zm4 7.2v3.7h-2.5v-11H42c.8 0 1.4.2 2 .3.6.2 1 .5 1.4.8l.8 1.1.2 1.4c0 .6 0 1-.3 1.5a3 3 0 01-.8 1.2l-1.3.8c-.6.2-1.2.2-2 .2h-1.3zm0-1.9H42c.7 0 1.1-.1 1.4-.4.3-.4.5-.8.5-1.4l-.1-.6a1.4 1.4 0 00-1-1h-2.1v3.4zm15-5.3v2h-3.1v8.9H50v-9H47v-2h8.7zm3.8 10.9h-2.6v-11h2.6v11zm12.8-5.5c0 .8-.1 1.6-.4 2.2a5.3 5.3 0 01-5.3 3.4c-.8 0-1.6-.1-2.3-.4a5.3 5.3 0 01-3.4-5.2c0-.8.2-1.5.5-2.2a5.2 5.2 0 013-3c.6-.2 1.4-.3 2.2-.3a6 6 0 012.4.4 5.4 5.4 0 013.3 5.1zm-2.6 0c0-.5 0-1-.2-1.4a3 3 0 00-.6-1.1c-.3-.3-.6-.6-1-.7a3.4 3.4 0 00-2.6 0c-.4.1-.7.4-1 .7a3 3 0 00-.5 1l-.3 1.5c0 .6.1 1 .3 1.5 0 .4.3.8.6 1.1.2.3.5.5 1 .7l1.2.2c.5 0 1 0 1.3-.2l1-.7c.3-.3.5-.7.6-1.1l.2-1.5zm5.1-5.4h.5l.2.2.2.2 5.2 6.5a13.8 13.8 0 010-1.1V264H83V275h-1.9a1 1 0 01-.4-.4l-5.1-6.5a23.3 23.3 0 010 1v5.9h-2.2v-11h1.3z"/>
|
||||
<path id="Q" d="M31.8 334.5c0 .8-.1 1.6-.4 2.2a5.1 5.1 0 01-3 2.9c-.6.3-1.4.4-2.3.4H22v-11h4.2c.9 0 1.7.2 2.4.5s1.3.6 1.8 1.1c.5.5.8 1 1.1 1.8.3.6.4 1.3.4 2.1zm-2.6 0c0-.5 0-1-.2-1.4-.1-.5-.3-.8-.6-1.1-.3-.3-.6-.6-1-.7-.3-.2-.8-.3-1.3-.3h-1.7v7h1.7c.5 0 1 0 1.3-.2l1-.7c.3-.3.5-.7.6-1.1.2-.4.2-1 .2-1.5zm14.6 0c0 .8-.1 1.6-.4 2.2a5.3 5.3 0 01-5.3 3.4c-.8 0-1.6-.1-2.3-.4a5.3 5.3 0 01-3.3-5.2c0-.8.1-1.5.4-2.2a5.2 5.2 0 013-3c.6-.2 1.4-.3 2.2-.3a6 6 0 012.4.4 5.4 5.4 0 013.3 5.1zm-2.6 0c0-.5 0-1-.2-1.4a3 3 0 00-.6-1.1c-.3-.3-.6-.6-1-.7a3.4 3.4 0 00-2.6 0l-1 .7a3 3 0 00-.5 1c-.2.5-.2 1-.2 1.5 0 .6 0 1 .2 1.5.1.4.3.8.6 1.1.2.3.6.5 1 .7l1.2.2c.5 0 1 0 1.3-.2l1-.7c.3-.3.5-.7.6-1.1.2-.4.2-1 .2-1.5zm5.2-5.4h.4l.2.2.2.2 5.2 6.5a13.8 13.8 0 010-1.1V329h2.2V340H52.8a1 1 0 01-.4-.4l-5.2-6.5a23.3 23.3 0 010 1v5.9H45v-11h1.4zm17 0v2H59v2.5h3.5v1.8h-3.5v2.7h4.5v1.9h-7v-11h7z"/>
|
||||
<path id="R" d="M8 69v2H3.4v2.6h3.4v1.8H3.4V78H8v2H1V69h7zm4 5.4L8.8 69h2.9l.2.3L14 73v-.2l.2-.1 1.9-3.3c0-.2.3-.3.5-.3H19l-3.4 5.2L19 80h-2.6l-.4-.1a1 1 0 01-.2-.3l-2.2-3.8-.1.3-2 3.5-.3.3-.4.1H8.6l3.5-5.6zm15.2 2.8h.2l.1.1 1 1c-.4.7-1 1-1.6 1.4-.7.3-1.5.4-2.4.4-.8 0-1.5-.1-2.2-.4a4.8 4.8 0 01-2.7-3 6.5 6.5 0 010-4.4 5.2 5.2 0 013-3 6.3 6.3 0 014.5 0 4.8 4.8 0 011.6 1.1l-.9 1.2-.2.1-.3.1H27l-.2-.2a45 45 0 00-1.2-.5 3.5 3.5 0 00-2 .2l-1 .7-.6 1-.2 1.5c0 .6 0 1 .2 1.5l.6 1.1a2.6 2.6 0 002 1l.7-.1a2.6 2.6 0 001.5-.7h.2l.2-.1zm9.5-8.1v2h-4.5v2.5h3.5v1.8h-3.5V78h4.5v2h-7V69h7zm4 7.2V80h-2.5V69H42c.8 0 1.4.2 2 .3.6.2 1 .5 1.4.8l.8 1.1.2 1.4c0 .6 0 1-.3 1.5a3 3 0 01-.8 1.2l-1.3.8c-.6.2-1.2.2-2 .2h-1.3zm0-1.9H42c.7 0 1.1-.1 1.4-.4.3-.4.5-.8.5-1.4l-.1-.6a1.4 1.4 0 00-1-1h-2.1v3.4zm15-5.3v2h-3.1V80H50v-9H47v-2h8.7zM59.5 80h-2.6V69h2.6v11zm12.8-5.5c0 .8-.1 1.6-.4 2.2a5.3 5.3 0 01-5.3 3.4c-.8 0-1.6-.1-2.3-.4a5.3 5.3 0 01-3.4-5.2c0-.8.2-1.5.5-2.2a5.2 5.2 0 013-3c.6-.2 1.4-.3 2.2-.3a6 6 0 012.4.4 5.4 5.4 0 013.3 5.1zm-2.6 0c0-.5 0-1-.2-1.4a3 3 0 00-.6-1.1c-.3-.3-.6-.6-1-.7a3.4 3.4 0 00-2.6 0c-.4.1-.7.4-1 .7a3 3 0 00-.5 1l-.3 1.5c0 .6.1 1 .3 1.5 0 .4.3.8.6 1.1.2.3.5.5 1 .7l1.2.2c.5 0 1 0 1.3-.2l1-.7c.3-.3.5-.7.6-1.1l.2-1.5zm5.1-5.4h.5l.2.2.2.2 5.2 6.5a13.8 13.8 0 010-1.1V69H83V80h-1.9a1 1 0 01-.4-.4L75.8 73a23.3 23.3 0 010 1V80h-2.2V69h1.3z"/>
|
||||
</defs>
|
||||
<g fill="none" fill-rule="evenodd">
|
||||
<g stroke-linejoin="round" stroke-width="3.8">
|
||||
<path stroke="#3AC" d="M82.4 46.5v13h-60v12m60-25v13h21.8v12"/>
|
||||
<path fill="#C3E7F1" stroke="#3AC" d="M6 5h152.7v41.7H6z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M195.8 46.5v25"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M168.5 5h54.6v41.7h-54.6z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M261.3 46.5v25"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M234 5h54.5v41.7H234z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M377 46.5v25"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M299.5 5h153.8v41.7H299.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M22.4 113v21.8"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M6 71.5h32.7v41.7H6z"/>
|
||||
<path stroke="#3AC" d="M104.2 113v12H76.9v9.8m27.3-21.8v12h31.6v9.8"/>
|
||||
<path fill="#C3E7F1" stroke="#3AC" d="M49.6 71.5h109.1v41.7H49.6z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M195.8 113v21.8"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M168.5 71.5h54.6v41.7h-54.6z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M261.3 113v21.8"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M234 71.5h54.5v41.7H234z"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M377 113v21.8"/>
|
||||
<path fill="#F5F5F5" stroke="#B7B7B7" d="M299.5 71.5h153.8v41.7H299.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M6 134.8h32.7v41.5H6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M77 176.3v26.2"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M49.6 134.8h54.6v41.5H49.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M195.8 176.3v26.2"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M168.5 134.8h54.6v41.5h-54.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M261.3 176.3v26.2"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M234 134.8h54.5v41.5H234z"/>
|
||||
<path stroke="#3AC" d="M377 176.3v14.2h-22v12m22-26.2v14.2h60v12"/>
|
||||
<path fill="#C3E7F1" stroke="#3AC" d="M299.5 134.8h153.8v41.5H299.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M135.8 176.3v26.2"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M114 134.8h43.6v41.5H114z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M22.4 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M6 202.2h32.7v41.7H6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M77 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M49.6 202.2h54.6v41.7H49.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M195.8 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M168.5 202.2h54.6v41.7h-54.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M261.3 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M234 202.2h54.5v41.7H234z"/>
|
||||
<path stroke="#3AC" d="M355 244v12h-21.7v13m21.8-25v12h37v13"/>
|
||||
<path fill="#C3E7F1" stroke="#3AC" d="M299.5 202.2h110.1v41.7H299.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M135.8 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M114 202.2h43.6v41.7H114z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M437 244v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M420.5 202.2h32.8v41.7h-32.8z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M22.4 310.5v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M6 268.7h32.7v41.8H6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M77 310.5v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M49.6 268.7h54.6v41.8H49.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M195.8 310.5v21.8-18.6 21.8"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M168.5 268.7h54.6v41.8h-54.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M234 268.7h54.5v41.8H234z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M333.3 310.5v25"/>
|
||||
<path fill="#C3E7F1" stroke="#3AC" d="M299.5 268.7H366v41.8h-66.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M135.8 310.5v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M114 268.7h43.6v41.8H114z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M437 310.5v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M420.5 268.7h32.8v41.8h-32.8z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M392.2 310.5v25"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M375.8 268.7h32.7v41.8h-32.7z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M8 335.5h28.7a2 2 0 012 2V375a2 2 0 01-2 2H8a2 2 0 01-2-2v-37.5c0-1 .9-2 2-2z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M49.6 335.5h54.6V377H49.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M168.5 335.5h54.6V377h-54.6z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M234 335.5h54.5V377H234z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M299.5 335.5H366V377h-66.5z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M114 335.5h43.6V377H114z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M420.5 335.5h32.8V377h-32.8z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M375.8 335.5h32.7V377h-32.7z"/>
|
||||
<path fill="#B5F3D4" stroke="#3AD787" d="M261.3 310.5v25"/>
|
||||
</g>
|
||||
<g fill-rule="nonzero">
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#a"/>
|
||||
<use fill="#1A1E23" xlink:href="#a"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#b"/>
|
||||
<use fill="#1A1E23" xlink:href="#b"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#c"/>
|
||||
<use fill="#1A1E23" xlink:href="#c"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#d"/>
|
||||
<use fill="#1A1E23" xlink:href="#d"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#e"/>
|
||||
<use fill="#1A1E23" xlink:href="#e"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#f"/>
|
||||
<use fill="#1A1E23" xlink:href="#f"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#g"/>
|
||||
<use fill="#1A1E23" xlink:href="#g"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#h"/>
|
||||
<use fill="#1A1E23" xlink:href="#h"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#i"/>
|
||||
<use fill="#1A1E23" xlink:href="#i"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#j"/>
|
||||
<use fill="#1A1E23" xlink:href="#j"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#k"/>
|
||||
<use fill="#1A1E23" xlink:href="#k"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#l"/>
|
||||
<use fill="#1A1E23" xlink:href="#l"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#m"/>
|
||||
<use fill="#1A1E23" xlink:href="#m"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#n"/>
|
||||
<use fill="#1A1E23" xlink:href="#n"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#o"/>
|
||||
<use fill="#1A1E23" xlink:href="#o"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#p"/>
|
||||
<use fill="#1A1E23" xlink:href="#p"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#q"/>
|
||||
<use fill="#1A1E23" xlink:href="#q"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#r"/>
|
||||
<use fill="#1A1E23" xlink:href="#r"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#s"/>
|
||||
<use fill="#1A1E23" xlink:href="#s"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#t"/>
|
||||
<use fill="#1A1E23" xlink:href="#t"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#u"/>
|
||||
<use fill="#1A1E23" xlink:href="#u"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#v"/>
|
||||
<use fill="#1A1E23" xlink:href="#v"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#w"/>
|
||||
<use fill="#1A1E23" xlink:href="#w"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#x"/>
|
||||
<use fill="#1A1E23" xlink:href="#x"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#y"/>
|
||||
<use fill="#1A1E23" xlink:href="#y"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#z"/>
|
||||
<use fill="#1A1E23" xlink:href="#z"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#A"/>
|
||||
<use fill="#1A1E23" xlink:href="#A"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#B"/>
|
||||
<use fill="#1A1E23" xlink:href="#B"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#C"/>
|
||||
<use fill="#1A1E23" xlink:href="#C"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#D"/>
|
||||
<use fill="#1A1E23" xlink:href="#D"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#E"/>
|
||||
<use fill="#1A1E23" xlink:href="#E"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#F"/>
|
||||
<use fill="#1A1E23" xlink:href="#F"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#G"/>
|
||||
<use fill="#1A1E23" xlink:href="#G"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#H"/>
|
||||
<use fill="#1A1E23" xlink:href="#H"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#I"/>
|
||||
<use fill="#1A1E23" xlink:href="#I"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#J"/>
|
||||
<use fill="#1A1E23" xlink:href="#J"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#K"/>
|
||||
<use fill="#1A1E23" xlink:href="#K"/>
|
||||
</g>
|
||||
<g transform="translate(6 11)">
|
||||
<use fill="#3D4251" xlink:href="#L"/>
|
||||
<use fill="#1A1E23" xlink:href="#L"/>
|
||||
</g>
|
||||
</g>
|
||||
<rect width="101" height="20" x="483" y="16" fill="#3AC" fill-rule="nonzero" stroke="#3AC" stroke-width="2.2" rx="10"/>
|
||||
<rect width="101" height="20" x="483" y="146" fill="#3AC" fill-rule="nonzero" stroke="#3AC" stroke-width="2.2" rx="10"/>
|
||||
<rect width="101" height="20" x="483" y="211" fill="#3AC" fill-rule="nonzero" stroke="#3AC" stroke-width="2.2" rx="10"/>
|
||||
<rect width="101" height="20" x="483" y="276" fill="#3AC" fill-rule="nonzero" stroke="#3AC" stroke-width="2.2" rx="10"/>
|
||||
<rect width="101" height="20" x="483" y="341" fill="#3AD787" fill-rule="nonzero" stroke="#3AD787" stroke-width="2.2" rx="10"/>
|
||||
<rect width="101" height="20" x="483" y="81" fill="#3AC" fill-rule="nonzero" stroke="#3AC" stroke-width="2.2" rx="10"/>
|
||||
<g fill-rule="nonzero">
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#M"/>
|
||||
<use fill="#FFF" xlink:href="#M"/>
|
||||
</g>
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#N"/>
|
||||
<use fill="#FFF" xlink:href="#N"/>
|
||||
</g>
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#O"/>
|
||||
<use fill="#FFF" xlink:href="#O"/>
|
||||
</g>
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#P"/>
|
||||
<use fill="#FFF" xlink:href="#P"/>
|
||||
</g>
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#Q"/>
|
||||
<use fill="#FFF" xlink:href="#Q"/>
|
||||
</g>
|
||||
<g transform="translate(493 16)">
|
||||
<use fill="#000" xlink:href="#R"/>
|
||||
<use fill="#FFF" xlink:href="#R"/>
|
||||
</g>
|
||||
</g>
|
||||
</g>
|
||||
</svg>
|
After Width: | Height: | Size: 45 KiB |
@@ -74,7 +74,6 @@
|
||||
"* [The Python tutorial](https://docs.python.org/3/tutorial/)\n",
|
||||
"* [Object-Oriented Programming in Python](http://python-textbok.readthedocs.org/en/latest/index.html)\n",
|
||||
"* [Python3 tutorial](http://www.python-course.eu/python3_course.php)\n",
|
||||
"* [Python for the Busy Java Developer, Deepak Sarda, 2014](http://antrix.net/static/pages/python-for-java/online/)\n",
|
||||
"* [Style Guide for Python Code (PEP-0008)](https://www.python.org/dev/peps/pep-0008/)\n",
|
||||
"* [Python Slides](http://tdc-www.harvard.edu/Python.pdf)\n",
|
||||
"* [Python for Programmers - 1 day course](http://www.ucs.cam.ac.uk/docs/course-notes/unix-courses/archived/archived-python-courses/PythonProgIntro/files/notes.pdf)\n",
|
||||
|
@@ -85,7 +85,7 @@
|
||||
"In Python3, there are the following [numeric types](https://docs.python.org/3/library/stdtypes.html#typesnumeric):\n",
|
||||
"* integers (int): 1, -1, ...\n",
|
||||
"* floating point numbers (float): 0.1, 1E2\n",
|
||||
"* complex numbers (complex): 2 + 3j\n",
|
||||
"* complex numbers (complex): 2 + 3j\n.",
|
||||
"Let's play a bit"
|
||||
]
|
||||
},
|
||||
|
@@ -377,7 +377,7 @@
|
||||
"\n",
|
||||
"Tuples are faster than lists. Its main usage is when the collection is constant, or you do not want it can be changed (write protected). \n",
|
||||
"\n",
|
||||
"Tuples can be converted into lists and vice-versa, with the methods list() and tuple()."
|
||||
"Tuples can be converted into lists and vice-versa, with the methods *list()* and *tuple()*."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -37,7 +37,7 @@
|
||||
"\n",
|
||||
"A set object is an unordered collection of distinct objects. There are two built-in set types: **set** (mutable) and **frozenset** (inmutable).\n",
|
||||
"\n",
|
||||
"A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is only one bultin mapping type: **dictionary**."
|
||||
"A mapping object maps hashable values to arbitrary objects. Mappings are mutable objects. There is only one builtin mapping type: **dictionary**."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
@@ -65,7 +65,7 @@
|
||||
"Python is a **strongly typed** language and **dynamically typed** language.\n",
|
||||
"\n",
|
||||
"This means:\n",
|
||||
"* ** dynamically typed**: variables do not declare a static type (as in Java int a = 2;). Variables have no type themselves, they are just names that hold a reference to some object. The type of the variable is changed dynamically when you change the type of the assigned data object. \n",
|
||||
"* **dynamically typed**: variables do not declare a static type (as in Java int a = 2;). Variables have no type themselves, they are just names that hold a reference to some object. The type of the variable is changed dynamically when you change the type of the assigned data object. \n",
|
||||
"* **strongly typed**: the interpreter tracks variable types. There is no implicit type conversion. This means that all the type variables should be converted manually, preventing from unexpected behaviour. "
|
||||
]
|
||||
},
|
||||
|
@@ -41,7 +41,7 @@
|
||||
"The first argument of instance class method is self, that refers to the current instance of the class.\n",
|
||||
"There is a special method, __init__ that initializes the object. It is like a constructor, but the object is already created when __init__ is called.\n",
|
||||
"\n",
|
||||
"Instance attributes are define as self.variables. (self is the same than this in Java)."
|
||||
"Instance attributes are define as *self.variables*. (self is the same than this in Java)."
|
||||
]
|
||||
},
|
||||
{
|
||||
|
154
sna/0_Intro_Network_Analysis.ipynb
Normal file
@@ -0,0 +1,154 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
""
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Introduction to Network Analysis\n",
|
||||
" \n",
|
||||
"In this session, we are going to get more insight regarding how to analyze and visualize social networks.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Objectives\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understanding why networks are important in data science\n",
|
||||
"* Experimenting with network analysis with networkx."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"1. [Home](0_Intro_Network_Analysis.ipynb)\n",
|
||||
"2. [First Steps](1_First_Steps.ipynb)\n",
|
||||
"3. [Working_with_Graphs](2_Working_with_Graphs.ipynb)\n",
|
||||
"4. [Network Analysis](3_Network_Analysis.ipynb)\n",
|
||||
"5. [Social Networks](4_Social_Networks.ipynb)\n",
|
||||
"6. [Pandas integration](5_Pandas.ipynb)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|