mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-16 19:42:28 +00:00
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Categorical Data\n",
    "\n",
    "For many ML algorithms, we need to transform categorical data into numbers.\n",
    "\n",
    "For example:\n",
    "* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*.\n",
    "* **'Position'** with values *'phD'*, *'Professor'*, *'TA'*, *'graduate'*.\n",
    "* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
    "\n",
    "There are two main approaches:\n",
    "* Integer encoding\n",
    "* One hot encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Integer Encoding\n",
    "We assign a number to every value:\n",
    "\n",
    "['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
    "\n",
    "['phD', 'Professor', 'TA', 'graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
    "\n",
    "['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
    "\n",
    "The main problem with this representation is that integers have a natural order, so some ML algorithms may interpret the codes as a ranking that the categories do not actually have.\n",
    "\n",
    "In our examples, this representation can be suitable for **Temperature**, whose values are ordered (low, medium, high), but not for the other two."
   ]
  },
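To make the ordering caveat concrete, here is a minimal sketch that is not part of the original notebook. It encodes the **Temperature** example with an explicitly ordered pandas `Categorical`, so the integer codes follow low, medium, high. The variable names are illustrative assumptions.

```python
import pandas as pd

# Example values taken from the Integer Encoding slide
temperature = ['low', 'medium', 'high', 'low']

# Declare the order explicitly: each code is the position in `categories`
ordered = pd.Categorical(temperature,
                         categories=['low', 'medium', 'high'],
                         ordered=True)
temperature_codes = ordered.codes.tolist()
print(temperature_codes)

# Without an explicit order, the categories are sorted alphabetically
# (high, low, medium), so the codes no longer follow low < medium < high
alphabetical_codes = pd.Categorical(temperature).codes.tolist()
print(alphabetical_codes)
```

The second result shows why an integer encoding should only be trusted as an ordering when the order has been stated on purpose.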
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## One Hot Encoding\n",
    "A binary column is created for each value of the categorical variable."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Sex                            M  F  U\n",
    "-------                        -------\n",
    "M                              1  0  0\n",
    "F        is transformed into   0  1  0\n",
    "Unknown                        0  0  1\n",
    "M                              1  0  0"
   ]
  },
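The table above can be reproduced with a few lines of pandas. This is a minimal sketch, not part of the original notebook: it uses `pd.get_dummies`, which the notebook introduces later, and the variable names are illustrative assumptions.

```python
import pandas as pd

# The same four values as in the table above
sex = pd.Series(['M', 'F', 'Unknown', 'M'], name='Sex')

# One binary column per distinct value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(sex, dtype=int)

# The default column order is alphabetical; reorder it to match the table
one_hot = one_hot[['M', 'F', 'Unknown']]
print(one_hot)
```

Each row contains exactly one 1, which is where the name "one hot" comes from.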
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Transforming categorical data with Pandas and Scikit-Learn\n",
    "\n",
    "We can use:\n",
    "* Pandas: **get_dummies()** (one hot encoding)\n",
    "* Scikit-Learn: **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding).\n",
    "\n",
    "We are going to see both approaches."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One Hot Encoding\n",
    "We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     Name  Age     Sex   Position\n",
      "0  Marius   18    Male   graduate\n",
      "1   Maria   19  Female  professor\n",
      "2    John   20    Male         TA\n",
      "3   Carla   30  Female        phD\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Small example dataset with two categorical columns: Sex and Position\n",
    "data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
    "        \"Age\": [18, 19, 20, 30],\n",
    "        \"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
    "        \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
    "        }\n",
    "df = pd.DataFrame(data)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>sex_encoded</th>\n",
       "      <th>position_encoded</th>\n",
       "      <th>Sex_Female</th>\n",
       "      <th>Sex_Male</th>\n",
       "      <th>Position_TA</th>\n",
       "      <th>Position_graduate</th>\n",
       "      <th>Position_phD</th>\n",
       "      <th>Position_professor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age  sex_encoded  position_encoded  Sex_Female  Sex_Male  \\\n",
       "0  Marius   18            1                 1       False      True   \n",
       "1   Maria   19            0                 3        True     False   \n",
       "2    John   20            1                 0       False      True   \n",
       "3   Carla   30            0                 2        True     False   \n",
       "\n",
       "   Position_TA  Position_graduate  Position_phD  Position_professor  \n",
       "0        False               True         False               False  \n",
       "1        False              False         False                True  \n",
       "2         True              False         False               False  \n",
       "3        False              False          True               False  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
    "df_onehot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use *OneHotEncoder* from Scikit-Learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sex_Female</th>\n",
       "      <th>Sex_Male</th>\n",
       "      <th>Position_TA</th>\n",
       "      <th>Position_graduate</th>\n",
       "      <th>Position_phD</th>\n",
       "      <th>Position_professor</th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>sex_encoded</th>\n",
       "      <th>position_encoded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Sex_Female  Sex_Male  Position_TA  Position_graduate  Position_phD  \\\n",
       "0         0.0       1.0          0.0                1.0           0.0   \n",
       "1         1.0       0.0          0.0                0.0           0.0   \n",
       "2         0.0       1.0          1.0                0.0           0.0   \n",
       "3         1.0       0.0          0.0                0.0           1.0   \n",
       "\n",
       "   Position_professor    Name  Age  sex_encoded  position_encoded  \n",
       "0                 0.0  Marius   18            1                 1  \n",
       "1                 1.0   Maria   19            0                 3  \n",
       "2                 0.0    John   20            1                 0  \n",
       "3                 0.0   Carla   30            0                 2  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.compose import make_column_transformer\n",
    "\n",
    "df_onehotencoder = df\n",
    "# create OneHotEncoder object\n",
    "encoder = OneHotEncoder()\n",
    "\n",
    "# Transformer for several columns: one-hot encode 'Sex' and 'Position',\n",
    "# and pass the remaining columns through unchanged\n",
    "transformer = make_column_transformer(\n",
    "    (encoder, ['Sex', 'Position']),\n",
    "    remainder='passthrough',\n",
    "    verbose_feature_names_out=False)\n",
    "\n",
    "# fit the encoder and transform the data\n",
    "transformed = transformer.fit_transform(df_onehotencoder)\n",
    "\n",
    "# rebuild a DataFrame with the generated column names\n",
    "df_onehotencoder = pd.DataFrame(\n",
    "    transformed,\n",
    "    columns=transformer.get_feature_names_out())\n",
    "df_onehotencoder"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas' *get_dummies* is easier for transforming DataFrames. *OneHotEncoder* returns a sparse matrix by default, which can be more memory-efficient, and it remembers the categories seen during `fit`, so the same encoding can be applied to new data. This makes it a good choice for integrating the step in a machine learning pipeline."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Integer encoding\n",
    "We will use **LabelEncoder**, which numbers the labels in sorted order. It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)."
   ]
  },
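As a minimal sketch of the round trip mentioned above (not part of the original notebook, and with illustrative variable names), the snippet below encodes the **Position** values and recovers them with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

positions = ['graduate', 'professor', 'TA', 'phD']

encoder = LabelEncoder()
codes = encoder.fit_transform(positions)

# classes_ holds the learned labels in sorted order;
# the index of each label in classes_ is its integer code
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Uppercase letters sort before lowercase ones, which is why *'TA'* receives code 0 here.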
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position\n",
       "0  Marius   18    Male   graduate\n",
       "1   Maria   19  Female  professor\n",
       "2    John   20    Male         TA\n",
       "3   Carla   30  Female        phD"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "# creating instance of labelencoder\n",
    "labelencoder = LabelEncoder()\n",
    "# df_encoded is another name for df (no copy is made),\n",
    "# so columns added to df_encoded are also added to df\n",
    "df_encoded = df\n",
    "# Values of the categorical columns (listed for reference; not used below)\n",
    "sex_values = ('Male', 'Female')\n",
    "position_values = ('graduate', 'professor', 'TA', 'phD')\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "      <th>sex_encoded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position  sex_encoded\n",
       "0  Marius   18    Male   graduate            1\n",
       "1   Maria   19  Female  professor            0\n",
       "2    John   20    Male         TA            1\n",
       "3   Carla   30  Female        phD            0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Assigning numerical values and storing them in another column\n",
    "df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "      <th>sex_encoded</th>\n",
       "      <th>position_encoded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position  sex_encoded  position_encoded\n",
       "0  Marius   18    Male   graduate            1                 1\n",
       "1   Maria   19  Female  professor            0                 3\n",
       "2    John   20    Male         TA            1                 0\n",
       "3   Carla   30  Female        phD            0                 2"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The labels are numbered in sorted order: TA=0, graduate=1, phD=2, professor=3\n",
    "df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# References\n",
    "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
    "* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit-Learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}