mirror of https://github.com/gsi-upm/sitc
Compare commits
7 Commits
0150ce7cf7
...
21819abeae
Author | SHA1 | Date |
---|---|---|
Carlos A. Iglesias | 21819abeae | 1 month ago |
Carlos A. Iglesias | 0d4c0c706d | 1 month ago |
Carlos A. Iglesias | 8de629b495 | 1 month ago |
Carlos A. Iglesias | 86114b4a56 | 1 month ago |
Carlos A. Iglesias | 1a3f618995 | 1 month ago |
Carlos A. Iglesias | a1121c03a5 | 1 month ago |
Carlos A. Iglesias | 715d0cb77f | 1 month ago |
@ -0,0 +1 @@
|
||||
|
@ -0,0 +1 @@
|
||||
|
@ -0,0 +1,157 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Introduction to Preprocessing\n",
|
||||
"In this session, we will get more insight regarding how to preprocess data.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Objectives\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understanding the need for preprocessing\n",
|
||||
"* Understanding different preprocessing techniques\n",
|
||||
"* Experimenting with several environments for preprocessing"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"1. [Home](00_Intro_Preprocessing.ipynb)\n",
|
||||
"3. [Initial Check](02_Initial_Check.ipynb)\n",
|
||||
"4. [Filter Data](03_Filter_Data.ipynb)\n",
|
||||
"5. [Unknown values](04_Unknown_Values.ipynb)\n",
|
||||
"6. [Duplicated values](05_Duplicated_Values.ipynb)\n",
|
||||
"7. [Rescaling Data](06_Rescaling_Data.ipynb)\n",
|
||||
"8. [Binarize Data](07_Binarize_Data.ipynb)\n",
|
||||
"9. [Categorial features](08_Categorical.ipynb)\n",
|
||||
"10. [String Data](09_String_Data.ipynb)\n",
|
||||
"12. [Handy libraries for preprocessing](11_0_Handy.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,714 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Initial Check with Pandas\n",
|
||||
"\n",
|
||||
"We can start with a quick quality check."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Load and check data\n",
|
||||
"Check which data you are loading."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>PassengerId</th>\n",
|
||||
" <th>Survived</th>\n",
|
||||
" <th>Pclass</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>SibSp</th>\n",
|
||||
" <th>Parch</th>\n",
|
||||
" <th>Ticket</th>\n",
|
||||
" <th>Fare</th>\n",
|
||||
" <th>Cabin</th>\n",
|
||||
" <th>Embarked</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Braund, Mr. Owen Harris</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>22.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>A/5 21171</td>\n",
|
||||
" <td>7.2500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>38.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>PC 17599</td>\n",
|
||||
" <td>71.2833</td>\n",
|
||||
" <td>C85</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Heikkinen, Miss. Laina</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>26.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>STON/O2. 3101282</td>\n",
|
||||
" <td>7.9250</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>4</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>113803</td>\n",
|
||||
" <td>53.1000</td>\n",
|
||||
" <td>C123</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>5</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Allen, Mr. William Henry</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>35.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>373450</td>\n",
|
||||
" <td>8.0500</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>6</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Moran, Mr. James</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>330877</td>\n",
|
||||
" <td>8.4583</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>Q</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>7</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>McCarthy, Mr. Timothy J</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>54.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>17463</td>\n",
|
||||
" <td>51.8625</td>\n",
|
||||
" <td>E46</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>7</th>\n",
|
||||
" <td>8</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Palsson, Master. Gosta Leonard</td>\n",
|
||||
" <td>male</td>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>349909</td>\n",
|
||||
" <td>21.0750</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>8</th>\n",
|
||||
" <td>9</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>27.0</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>347742</td>\n",
|
||||
" <td>11.1333</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>S</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>9</th>\n",
|
||||
" <td>10</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>Nasser, Mrs. Nicholas (Adele Achem)</td>\n",
|
||||
" <td>female</td>\n",
|
||||
" <td>14.0</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>237736</td>\n",
|
||||
" <td>30.0708</td>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" <td>C</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" PassengerId Survived Pclass \\\n",
|
||||
"0 1 0 3 \n",
|
||||
"1 2 1 1 \n",
|
||||
"2 3 1 3 \n",
|
||||
"3 4 1 1 \n",
|
||||
"4 5 0 3 \n",
|
||||
"5 6 0 3 \n",
|
||||
"6 7 0 1 \n",
|
||||
"7 8 0 3 \n",
|
||||
"8 9 1 3 \n",
|
||||
"9 10 1 2 \n",
|
||||
"\n",
|
||||
" Name Sex Age SibSp \\\n",
|
||||
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
||||
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
||||
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
||||
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
||||
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
||||
"5 Moran, Mr. James male NaN 0 \n",
|
||||
"6 McCarthy, Mr. Timothy J male 54.0 0 \n",
|
||||
"7 Palsson, Master. Gosta Leonard male 2.0 3 \n",
|
||||
"8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 \n",
|
||||
"9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 \n",
|
||||
"\n",
|
||||
" Parch Ticket Fare Cabin Embarked \n",
|
||||
"0 0 A/5 21171 7.2500 NaN S \n",
|
||||
"1 0 PC 17599 71.2833 C85 C \n",
|
||||
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
||||
"3 0 113803 53.1000 C123 S \n",
|
||||
"4 0 373450 8.0500 NaN S \n",
|
||||
"5 0 330877 8.4583 NaN Q \n",
|
||||
"6 0 17463 51.8625 E46 S \n",
|
||||
"7 1 349909 21.0750 NaN S \n",
|
||||
"8 2 347742 11.1333 NaN S \n",
|
||||
"9 0 237736 30.0708 NaN C "
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"df = pd.read_csv('https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv')\n",
|
||||
"df.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Check number of columns and rows"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"(891, 12)"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check names and types of columns\n",
|
||||
"Check the data and type, for example if dates are of strings or what."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',\n",
|
||||
" 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],\n",
|
||||
" dtype='object')\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"PassengerId int64\n",
|
||||
"Survived int64\n",
|
||||
"Pclass int64\n",
|
||||
"Name object\n",
|
||||
"Sex object\n",
|
||||
"Age float64\n",
|
||||
"SibSp int64\n",
|
||||
"Parch int64\n",
|
||||
"Ticket object\n",
|
||||
"Fare float64\n",
|
||||
"Cabin object\n",
|
||||
"Embarked object\n",
|
||||
"dtype: object"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Get column names\n",
|
||||
"print(df.columns)\n",
|
||||
"# Get column data types\n",
|
||||
"df.dtypes"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check if the column is unique"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"PassengerId is unique: True\n",
|
||||
"Survived is unique: False\n",
|
||||
"Pclass is unique: False\n",
|
||||
"Name is unique: True\n",
|
||||
"Sex is unique: False\n",
|
||||
"Age is unique: False\n",
|
||||
"SibSp is unique: False\n",
|
||||
"Parch is unique: False\n",
|
||||
"Ticket is unique: False\n",
|
||||
"Fare is unique: False\n",
|
||||
"Cabin is unique: False\n",
|
||||
"Embarked is unique: False\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"for i in column_names:\n",
|
||||
" print('{} is unique: {}'.format(i, df[i].is_unique))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Check if the dataframe has an index\n",
|
||||
"We will need it to do joins or merges."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"RangeIndex(start=0, stop=891, step=1)"
|
||||
]
|
||||
},
|
||||
"execution_count": 4,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# check if there is an index. If not, you will get 'AtributeError: function object has no atribute index'\n",
|
||||
"df.index"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,\n",
|
||||
" 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,\n",
|
||||
" 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,\n",
|
||||
" 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,\n",
|
||||
" 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,\n",
|
||||
" 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,\n",
|
||||
" 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,\n",
|
||||
" 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,\n",
|
||||
" 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,\n",
|
||||
" 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,\n",
|
||||
" 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,\n",
|
||||
" 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,\n",
|
||||
" 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,\n",
|
||||
" 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,\n",
|
||||
" 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194,\n",
|
||||
" 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207,\n",
|
||||
" 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220,\n",
|
||||
" 221, 222, 223, 224, 225, 226, 227, 228, 229, 230, 231, 232, 233,\n",
|
||||
" 234, 235, 236, 237, 238, 239, 240, 241, 242, 243, 244, 245, 246,\n",
|
||||
" 247, 248, 249, 250, 251, 252, 253, 254, 255, 256, 257, 258, 259,\n",
|
||||
" 260, 261, 262, 263, 264, 265, 266, 267, 268, 269, 270, 271, 272,\n",
|
||||
" 273, 274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285,\n",
|
||||
" 286, 287, 288, 289, 290, 291, 292, 293, 294, 295, 296, 297, 298,\n",
|
||||
" 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311,\n",
|
||||
" 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324,\n",
|
||||
" 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337,\n",
|
||||
" 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350,\n",
|
||||
" 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363,\n",
|
||||
" 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376,\n",
|
||||
" 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389,\n",
|
||||
" 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400, 401, 402,\n",
|
||||
" 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415,\n",
|
||||
" 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428,\n",
|
||||
" 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441,\n",
|
||||
" 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454,\n",
|
||||
" 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467,\n",
|
||||
" 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480,\n",
|
||||
" 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493,\n",
|
||||
" 494, 495, 496, 497, 498, 499, 500, 501, 502, 503, 504, 505, 506,\n",
|
||||
" 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519,\n",
|
||||
" 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530, 531, 532,\n",
|
||||
" 533, 534, 535, 536, 537, 538, 539, 540, 541, 542, 543, 544, 545,\n",
|
||||
" 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556, 557, 558,\n",
|
||||
" 559, 560, 561, 562, 563, 564, 565, 566, 567, 568, 569, 570, 571,\n",
|
||||
" 572, 573, 574, 575, 576, 577, 578, 579, 580, 581, 582, 583, 584,\n",
|
||||
" 585, 586, 587, 588, 589, 590, 591, 592, 593, 594, 595, 596, 597,\n",
|
||||
" 598, 599, 600, 601, 602, 603, 604, 605, 606, 607, 608, 609, 610,\n",
|
||||
" 611, 612, 613, 614, 615, 616, 617, 618, 619, 620, 621, 622, 623,\n",
|
||||
" 624, 625, 626, 627, 628, 629, 630, 631, 632, 633, 634, 635, 636,\n",
|
||||
" 637, 638, 639, 640, 641, 642, 643, 644, 645, 646, 647, 648, 649,\n",
|
||||
" 650, 651, 652, 653, 654, 655, 656, 657, 658, 659, 660, 661, 662,\n",
|
||||
" 663, 664, 665, 666, 667, 668, 669, 670, 671, 672, 673, 674, 675,\n",
|
||||
" 676, 677, 678, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688,\n",
|
||||
" 689, 690, 691, 692, 693, 694, 695, 696, 697, 698, 699, 700, 701,\n",
|
||||
" 702, 703, 704, 705, 706, 707, 708, 709, 710, 711, 712, 713, 714,\n",
|
||||
" 715, 716, 717, 718, 719, 720, 721, 722, 723, 724, 725, 726, 727,\n",
|
||||
" 728, 729, 730, 731, 732, 733, 734, 735, 736, 737, 738, 739, 740,\n",
|
||||
" 741, 742, 743, 744, 745, 746, 747, 748, 749, 750, 751, 752, 753,\n",
|
||||
" 754, 755, 756, 757, 758, 759, 760, 761, 762, 763, 764, 765, 766,\n",
|
||||
" 767, 768, 769, 770, 771, 772, 773, 774, 775, 776, 777, 778, 779,\n",
|
||||
" 780, 781, 782, 783, 784, 785, 786, 787, 788, 789, 790, 791, 792,\n",
|
||||
" 793, 794, 795, 796, 797, 798, 799, 800, 801, 802, 803, 804, 805,\n",
|
||||
" 806, 807, 808, 809, 810, 811, 812, 813, 814, 815, 816, 817, 818,\n",
|
||||
" 819, 820, 821, 822, 823, 824, 825, 826, 827, 828, 829, 830, 831,\n",
|
||||
" 832, 833, 834, 835, 836, 837, 838, 839, 840, 841, 842, 843, 844,\n",
|
||||
" 845, 846, 847, 848, 849, 850, 851, 852, 853, 854, 855, 856, 857,\n",
|
||||
" 858, 859, 860, 861, 862, 863, 864, 865, 866, 867, 868, 869, 870,\n",
|
||||
" 871, 872, 873, 874, 875, 876, 877, 878, 879, 880, 881, 882, 883,\n",
|
||||
" 884, 885, 886, 887, 888, 889, 890])"
|
||||
]
|
||||
},
|
||||
"execution_count": 5,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# # Check the index values\n",
|
||||
"df.index.values"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# If index does not exist\n",
|
||||
"df.set_index('column_name_to_use', inplace=True)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"PassengerId 0\n",
|
||||
"Survived 0\n",
|
||||
"Pclass 0\n",
|
||||
"Name 0\n",
|
||||
"Sex 0\n",
|
||||
"Age 177\n",
|
||||
"SibSp 0\n",
|
||||
"Parch 0\n",
|
||||
"Ticket 0\n",
|
||||
"Fare 0\n",
|
||||
"Cabin 687\n",
|
||||
"Embarked 2\n",
|
||||
"dtype: int64"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Count missing vales per column\n",
|
||||
"df.isnull().sum()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,150 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Filter Data\n",
|
||||
"\n",
|
||||
"Select the columns you want and delete the others."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Create list comprehension of the columns you want to lose\n",
|
||||
"columns_to_drop = [column_names[i] for i in [1, 3, 5]]\n",
|
||||
"# Drop unwanted columns \n",
|
||||
"df.drop(columns_to_drop, inplace=True, axis=1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,591 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Unknown values\n",
|
||||
"\n",
|
||||
"Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Filling NaN values\n",
|
||||
"If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
|
||||
"\n",
|
||||
"* For **string** fields, we can fill NaN with **' '**.\n",
|
||||
"\n",
|
||||
"* For **numbers**, we can fill with the **mean** or **median** value. \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Fill NaN with ' '\n",
|
||||
"df['col'] = df['col'].fillna(' ')\n",
|
||||
"# Fill NaN with 99\n",
|
||||
"df['col'] = df['col'].fillna(99)\n",
|
||||
"# Fill NaN with the mean of the column\n",
|
||||
"df['col'] = df['col'].fillna(df['col'].mean())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Propagate non-null values forward or backward\n",
|
||||
"You can also **propagate** non-null values with these methods:\n",
|
||||
"\n",
|
||||
"* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
|
||||
"* **bfill**: Fill values using the following valid observation to fill the gap.\n",
|
||||
"* **interpolate**: Fill NaN values using interpolation.\n",
|
||||
"\n",
|
||||
"It will fill the next value in the dataframe with the previous non-NaN value. \n",
|
||||
"\n",
|
||||
"You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We fill forward the value 4.0 and fill the next one (limit = 1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 NaN\n",
|
||||
"1 NaN\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 4.0\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
" df.ffill(limit = 1)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df.ffill()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>NaN</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"0 2.0\n",
|
||||
"1 2.0\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0\n",
|
||||
"5 NaN\n",
|
||||
"6 NaN"
|
||||
]
|
||||
},
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df.bfill()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Removing NaN values\n",
|
||||
"We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 26,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>col1</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>2.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>3.0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>4.0</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" col1\n",
|
||||
"2 2.0\n",
|
||||
"3 3.0\n",
|
||||
"4 4.0"
|
||||
]
|
||||
},
|
||||
"execution_count": 26,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Drop any rows which have any nans\n",
|
||||
"df1 = df.dropna()\n",
|
||||
"# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
|
||||
"df2 = df.dropna(axis=1)\n",
|
||||
"# Only drop columns which have at least 90% non-NaNs \n",
|
||||
"df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
|
||||
"df1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
File diff suppressed because it is too large
Load Diff
File diff suppressed because one or more lines are too long
@ -0,0 +1,198 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Binarize Data\n",
|
||||
"* We can transform our data using a binary threshold. All values above the threshold are marked 1, and all values equal to or below are marked 0.\n",
|
||||
"* This is called binarizing your data or thresholding your data. \n",
|
||||
"\n",
|
||||
"* It can be helpful when you have probabilities that you want to make crisp values."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Binarize Data with Scikit-Learn\n",
|
||||
"We can create new binary attributes in Python using Scikit-learn with the Binarizer class.\n",
|
||||
"I"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import Binarizer\n",
|
||||
"\n",
|
||||
"X = [[ 1., -1., 2.],\n",
|
||||
" [ 2., 0., 0.],\n",
|
||||
" [ 0., 1.1, -1.]]"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 5,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"transformer = Binarizer(threshold=1.0).fit(X) # threshold 1.0"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"array([[0., 0., 1.],\n",
|
||||
" [1., 0., 0.],\n",
|
||||
" [0., 1., 0.]])"
|
||||
]
|
||||
},
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"transformer.transform(X)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,812 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Categorical Data\n",
|
||||
"\n",
|
||||
"For many ML algorithms, we need to transform categorical data into numbers.\n",
|
||||
"\n",
|
||||
"For example:\n",
|
||||
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*. \n",
|
||||
"* **'Position'** with values 'phD', *'Professor'*, *'TA'*, *'graduate'*.\n",
|
||||
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
|
||||
"\n",
|
||||
"There are two main approaches:\n",
|
||||
"* Integer encoding\n",
|
||||
"* One hot encoding"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Integer Encoding\n",
|
||||
"We assign a number to every value:\n",
|
||||
"\n",
|
||||
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
|
||||
"\n",
|
||||
"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
|
||||
"\n",
|
||||
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
|
||||
"\n",
|
||||
"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",
|
||||
"\n",
|
||||
"In our examples, this representation can be suitable for **temperature**, but not for the other two."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## One Hot Encoding\n",
|
||||
"A binary column is created for each value of the categorical variable."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "raw",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Sex M F U\n",
|
||||
"----- ---------\n",
|
||||
"M 1 0 0\n",
|
||||
"F is transformed into 0 1 0\n",
|
||||
"Unknown 0 0 1\n",
|
||||
"M 1 0 0 "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Transforming categorical data with Scikit-Learn\n",
|
||||
"\n",
|
||||
"We can use:\n",
|
||||
"* **get_dummies()** (one hot encoding)\n",
|
||||
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding). \n",
|
||||
"\n",
|
||||
"We are going to learn the first approach."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### One Hot Encoding\n",
|
||||
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
" Name Age Sex Position\n",
|
||||
"0 Marius 18 Male graduate\n",
|
||||
"1 Maria 19 Female professor\n",
|
||||
"2 John 20 Male TA\n",
|
||||
"3 Carla 30 Female phD\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
|
||||
" \"Age\": [18, 19, 20, 30],\n",
|
||||
"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
|
||||
" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
|
||||
" }\n",
|
||||
"df = pd.DataFrame(data)\n",
|
||||
"print(df)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 18,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" <th>Sex_Female</th>\n",
|
||||
" <th>Sex_Male</th>\n",
|
||||
" <th>Position_TA</th>\n",
|
||||
" <th>Position_graduate</th>\n",
|
||||
" <th>Position_phD</th>\n",
|
||||
" <th>Position_professor</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" <td>True</td>\n",
|
||||
" <td>False</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
|
||||
"0 Marius 18 1 1 False True \n",
|
||||
"1 Maria 19 0 3 True False \n",
|
||||
"2 John 20 1 0 False True \n",
|
||||
"3 Carla 30 0 2 True False \n",
|
||||
"\n",
|
||||
" Position_TA Position_graduate Position_phD Position_professor \n",
|
||||
"0 False True False False \n",
|
||||
"1 False False False True \n",
|
||||
"2 True False False False \n",
|
||||
"3 False False True False "
|
||||
]
|
||||
},
|
||||
"execution_count": 18,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
},
|
||||
{
|
||||
"name": "stdout",
|
||||
"output_type": "stream",
|
||||
"text": [
|
||||
"The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.\n"
|
||||
]
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
|
||||
"df_onehot"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"We can also use *OneHotEncoder* from Scikit."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Sex_Female</th>\n",
|
||||
" <th>Sex_Male</th>\n",
|
||||
" <th>Position_TA</th>\n",
|
||||
" <th>Position_graduate</th>\n",
|
||||
" <th>Position_phD</th>\n",
|
||||
" <th>Position_professor</th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>1.0</td>\n",
|
||||
" <td>0.0</td>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
|
||||
"0 0.0 1.0 0.0 1.0 0.0 \n",
|
||||
"1 1.0 0.0 0.0 0.0 0.0 \n",
|
||||
"2 0.0 1.0 1.0 0.0 0.0 \n",
|
||||
"3 1.0 0.0 0.0 0.0 1.0 \n",
|
||||
"\n",
|
||||
" Position_professor Name Age sex_encoded position_encoded \n",
|
||||
"0 0.0 Marius 18 1 1 \n",
|
||||
"1 1.0 Maria 19 0 3 \n",
|
||||
"2 0.0 John 20 1 0 \n",
|
||||
"3 0.0 Carla 30 0 2 "
|
||||
]
|
||||
},
|
||||
"execution_count": 27,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import OneHotEncoder\n",
|
||||
"from sklearn.compose import make_column_transformer\n",
|
||||
"\n",
|
||||
"df_onehotencoder = df\n",
|
||||
"# create OneHotEncoder object\n",
|
||||
"encoder = OneHotEncoder()\n",
|
||||
"\n",
|
||||
"# Transformer for several columns\n",
|
||||
"transformer = make_column_transformer(\n",
|
||||
" (OneHotEncoder(), ['Sex', 'Position']),\n",
|
||||
" remainder='passthrough',\n",
|
||||
" verbose_feature_names_out=False)\n",
|
||||
"\n",
|
||||
"# transform\n",
|
||||
"transformed = transformer.fit_transform(df_onehotencoder)\n",
|
||||
"\n",
|
||||
"df_onehotencoder = pd.DataFrame(\n",
|
||||
" transformed,\n",
|
||||
" columns=transformer.get_feature_names_out())\n",
|
||||
"df_onehotencoder"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"Pandas' get_dummy is easier for transforming DataFrames. OneHotEncoder is more efficient and can be good for integrating the step in a machine learning pipeline."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Integer encoding\n",
|
||||
"We will use **LabelEncoder**. It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position\n",
|
||||
"0 Marius 18 Male graduate\n",
|
||||
"1 Maria 19 Female professor\n",
|
||||
"2 John 20 Male TA\n",
|
||||
"3 Carla 30 Female phD"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"from sklearn.preprocessing import LabelEncoder\n",
|
||||
"# creating instance of labelencoder\n",
|
||||
"labelencoder = LabelEncoder()\n",
|
||||
"df_encoded = df\n",
|
||||
"# Assigning numerical values and storing in another column\n",
|
||||
"sex_values = ('Male', 'Female')\n",
|
||||
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position sex_encoded\n",
|
||||
"0 Marius 18 Male graduate 1\n",
|
||||
"1 Maria 19 Female professor 0\n",
|
||||
"2 John 20 Male TA 1\n",
|
||||
"3 Carla 30 Female phD 0"
|
||||
]
|
||||
},
|
||||
"execution_count": 16,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>Name</th>\n",
|
||||
" <th>Age</th>\n",
|
||||
" <th>Sex</th>\n",
|
||||
" <th>Position</th>\n",
|
||||
" <th>sex_encoded</th>\n",
|
||||
" <th>position_encoded</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>Marius</td>\n",
|
||||
" <td>18</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>graduate</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>Maria</td>\n",
|
||||
" <td>19</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>professor</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>John</td>\n",
|
||||
" <td>20</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>TA</td>\n",
|
||||
" <td>1</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>Carla</td>\n",
|
||||
" <td>30</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>phD</td>\n",
|
||||
" <td>0</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" Name Age Sex Position sex_encoded position_encoded\n",
|
||||
"0 Marius 18 Male graduate 1 1\n",
|
||||
"1 Maria 19 Female professor 0 3\n",
|
||||
"2 John 20 Male TA 1 0\n",
|
||||
"3 Carla 30 Female phD 0 2"
|
||||
]
|
||||
},
|
||||
"execution_count": 17,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
|
||||
"df_encoded"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,139 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Handy libraries\n",
|
||||
"Libraries that help in several preprocessing tasks.\n",
|
||||
"\n",
|
||||
"* [datacleaner](11_1_datacleaner.ipynb)\n",
|
||||
"* [autoclean](11_3_autoclean.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
|
||||
"* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n",
|
||||
"* [Handy Python Libraries for Formatting and Cleaning Data](https://mode.com/blog/python-data-cleaning-libraries), M. Bierly, 2016\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"celltoolbar": "Slideshow",
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
Binary file not shown.
After Width: | Height: | Size: 3.1 KiB |
Binary file not shown.
After Width: | Height: | Size: 152 KiB |
@ -0,0 +1 @@
|
||||
|
@ -0,0 +1,185 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Introduction to Visualization\n",
|
||||
" \n",
|
||||
"In this session, we will get more insight regarding how to visualize data.\n",
|
||||
"\n",
|
||||
"# Objectives\n",
|
||||
"\n",
|
||||
"The main objectives of this session are:\n",
|
||||
"* Understanding how to visualize data\n",
|
||||
"* Understanding the purpose of different charts \n",
|
||||
"* Experimenting with several environments for visualizing data\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Seaborn\n",
|
||||
"\n",
|
||||
"Seaborn is a library that visualizes data in Python. The main characteristics are:\n",
|
||||
"\n",
|
||||
"* A dataset-oriented API for examining relationships between multiple variables\n",
|
||||
"* Specialized support for using categorical variables to show observations or aggregate statistics\n",
|
||||
"* Options for visualizing univariate or bivariate distributions and for comparing them between subsets of data\n",
|
||||
"* Automatic estimation and plotting of linear regression models for different kinds of dependent variables\n",
|
||||
"* Convenient views of the overall structure of complex datasets\n",
|
||||
"* High-level abstractions for structuring multi-plot grids that let you quickly build complex visualizations\n",
|
||||
"* Concise control over matplotlib figure styling with several built-in themes\n",
|
||||
"* Tools for choosing color palettes that faithfully reveal patterns in your data\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Install\n",
|
||||
"Use:\n",
|
||||
"\n",
|
||||
"**conda install seaborn**\n",
|
||||
"\n",
|
||||
"or \n",
|
||||
"\n",
|
||||
"**pip install seaborn**"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "slide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Table of Contents"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"1. [Home](00_Intro_Visualization.ipynb)\n",
|
||||
"2. [Dataset](01_Dataset.ipynb)\n",
|
||||
"3. [Comparison Charts](02_Comparison_Charts.ipynb)\n",
|
||||
" 1. [More Comparison Charts](02_01_More_Comparison_Charts.ipynb)\n",
|
||||
"4. [Distribution Charts](03_Distribution_Charts.ipynb)\n",
|
||||
"5. [Hierarchical charts](04_Hierarchical_Charts.ipynb)\n",
|
||||
"6. [Relational charts](05_Relational_Charts.ipynb)\n",
|
||||
"7. [Spatial charts](06_Spatial_Charts.ipynb)\n",
|
||||
"8. [Temporal charts](07_Temporal_Charts.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.11.7"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
@ -0,0 +1,363 @@
|
||||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Course Notes for Learning Intelligent Systems"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## [Introduction to Visualization](00_Intro_Visualization.ipynb)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# Dataset\n",
|
||||
"Seaborn includes several datasets. We can consult the available datasets and load them. \n",
|
||||
"\n",
|
||||
"The datasets are also available at https://github.com/mwaskom/seaborn-data."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "fragment"
|
||||
}
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"from matplotlib import pyplot as plt\n",
|
||||
"import seaborn as sns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 2,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"['anagrams',\n",
|
||||
" 'anscombe',\n",
|
||||
" 'attention',\n",
|
||||
" 'brain_networks',\n",
|
||||
" 'car_crashes',\n",
|
||||
" 'diamonds',\n",
|
||||
" 'dots',\n",
|
||||
" 'dowjones',\n",
|
||||
" 'exercise',\n",
|
||||
" 'flights',\n",
|
||||
" 'fmri',\n",
|
||||
" 'geyser',\n",
|
||||
" 'glue',\n",
|
||||
" 'healthexp',\n",
|
||||
" 'iris',\n",
|
||||
" 'mpg',\n",
|
||||
" 'penguins',\n",
|
||||
" 'planets',\n",
|
||||
" 'seaice',\n",
|
||||
" 'taxis',\n",
|
||||
" 'tips',\n",
|
||||
" 'titanic']"
|
||||
]
|
||||
},
|
||||
"execution_count": 2,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"sns.get_dataset_names()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 3,
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "subslide"
|
||||
}
|
||||
},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<div>\n",
|
||||
"<style scoped>\n",
|
||||
" .dataframe tbody tr th:only-of-type {\n",
|
||||
" vertical-align: middle;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe tbody tr th {\n",
|
||||
" vertical-align: top;\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
" .dataframe thead th {\n",
|
||||
" text-align: right;\n",
|
||||
" }\n",
|
||||
"</style>\n",
|
||||
"<table border=\"1\" class=\"dataframe\">\n",
|
||||
" <thead>\n",
|
||||
" <tr style=\"text-align: right;\">\n",
|
||||
" <th></th>\n",
|
||||
" <th>total_bill</th>\n",
|
||||
" <th>tip</th>\n",
|
||||
" <th>sex</th>\n",
|
||||
" <th>smoker</th>\n",
|
||||
" <th>day</th>\n",
|
||||
" <th>time</th>\n",
|
||||
" <th>size</th>\n",
|
||||
" </tr>\n",
|
||||
" </thead>\n",
|
||||
" <tbody>\n",
|
||||
" <tr>\n",
|
||||
" <th>0</th>\n",
|
||||
" <td>16.99</td>\n",
|
||||
" <td>1.01</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>1</th>\n",
|
||||
" <td>10.34</td>\n",
|
||||
" <td>1.66</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>2</th>\n",
|
||||
" <td>21.01</td>\n",
|
||||
" <td>3.50</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>3</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>3</th>\n",
|
||||
" <td>23.68</td>\n",
|
||||
" <td>3.31</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>4</th>\n",
|
||||
" <td>24.59</td>\n",
|
||||
" <td>3.61</td>\n",
|
||||
" <td>Female</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>5</th>\n",
|
||||
" <td>25.29</td>\n",
|
||||
" <td>4.71</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>6</th>\n",
|
||||
" <td>8.77</td>\n",
|
||||
" <td>2.00</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>7</th>\n",
|
||||
" <td>26.88</td>\n",
|
||||
" <td>3.12</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>4</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>8</th>\n",
|
||||
" <td>15.04</td>\n",
|
||||
" <td>1.96</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" <tr>\n",
|
||||
" <th>9</th>\n",
|
||||
" <td>14.78</td>\n",
|
||||
" <td>3.23</td>\n",
|
||||
" <td>Male</td>\n",
|
||||
" <td>No</td>\n",
|
||||
" <td>Sun</td>\n",
|
||||
" <td>Dinner</td>\n",
|
||||
" <td>2</td>\n",
|
||||
" </tr>\n",
|
||||
" </tbody>\n",
|
||||
"</table>\n",
|
||||
"</div>"
|
||||
],
|
||||
"text/plain": [
|
||||
" total_bill tip sex smoker day time size\n",
|
||||
"0 16.99 1.01 Female No Sun Dinner 2\n",
|
||||
"1 10.34 1.66 Male No Sun Dinner 3\n",
|
||||
"2 21.01 3.50 Male No Sun Dinner 3\n",
|
||||
"3 23.68 3.31 Male No Sun Dinner 2\n",
|
||||
"4 24.59 3.61 Female No Sun Dinner 4\n",
|
||||
"5 25.29 4.71 Male No Sun Dinner 4\n",
|
||||
"6 8.77 2.00 Male No Sun Dinner 2\n",
|
||||
"7 26.88 3.12 Male No Sun Dinner 4\n",
|
||||
"8 15.04 1.96 Male No Sun Dinner 2\n",
|
||||
"9 14.78 3.23 Male No Sun Dinner 2"
|
||||
]
|
||||
},
|
||||
"execution_count": 3,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"df = sns.load_dataset('tips')\n",
|
||||
"df.head(10)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"# References\n",
|
||||
"* [Seaborn](http://seaborn.pydata.org/index.html) documentation"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"slideshow": {
|
||||
"slide_type": "skip"
|
||||
}
|
||||
},
|
||||
"source": [
|
||||
"## Licence\n",
|
||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||
"\n",
|
||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"datacleaner": {
|
||||
"position": {
|
||||
"top": "50px"
|
||||
},
|
||||
"python": {
|
||||
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
||||
},
|
||||
"window_display": false
|
||||
},
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.13"
|
||||
},
|
||||
"latex_envs": {
|
||||
"LaTeX_envs_menu_present": true,
|
||||
"autocomplete": true,
|
||||
"bibliofile": "biblio.bib",
|
||||
"cite_by": "apalike",
|
||||
"current_citInitial": 1,
|
||||
"eqLabelWithNumbers": true,
|
||||
"eqNumInitial": 1,
|
||||
"hotkeys": {
|
||||
"equation": "Ctrl-E",
|
||||
"itemize": "Ctrl-I"
|
||||
},
|
||||
"labels_anchors": false,
|
||||
"latex_user_defs": false,
|
||||
"report_style_numbering": false,
|
||||
"user_envs_cfg": false
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
|
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
Binary file not shown.
After Width: | Height: | Size: 3.1 KiB |
Loading…
Reference in New Issue