{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Categorical Data\n",
"\n",
"For many ML algorithms, we need to transform categorical data into numbers.\n",
"\n",
"For example:\n",
"* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*.\n",
"* **'Position'** with values *'phD'*, *'Professor'*, *'TA'*, *'graduate'*.\n",
"* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
"\n",
"There are two main approaches:\n",
"* Integer encoding\n",
"* One hot encoding"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Integer Encoding\n",
"We assign a number to every value:\n",
"\n",
"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
"\n",
"['phD', 'Professor', 'TA', 'graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
"\n",
"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
"\n",
"The main problem with this representation is that integers have a natural order, so some ML algorithms may treat the codes as a ranking or a magnitude that the categories do not have.\n",
"\n",
"In our examples, this representation can be suitable for **temperature**, whose values are genuinely ordered (*low* < *medium* < *high*), but not for the other two."
]
},
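{
"cell_type": "markdown",
"metadata": {},
"source": [
"The integer encoding of an ordered variable can be sketched as follows. This is a minimal example that assumes we declare the order ourselves with a Pandas *Categorical*; the variable names are illustrative and not part of the course data.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Illustrative values; the category order below is an assumption we state explicitly\n",
"temperature = pd.Series(['low', 'medium', 'high', 'low'])\n",
"ordered = pd.Categorical(temperature,\n",
"                         categories=['low', 'medium', 'high'],\n",
"                         ordered=True)\n",
"\n",
"# .codes holds the position of each value in the declared order\n",
"codes = ordered.codes.tolist()\n",
"print(codes)  # [0, 1, 2, 0]\n",
"```\n",
"\n",
"Declaring the order keeps *low* < *medium* < *high*; an encoder that sorts the values alphabetically would place *high* first instead."
]
},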
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## One Hot Encoding\n",
"A binary column is created for each value of the categorical variable."
]
},
{
"cell_type": "raw",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Sex                           M  F  U\n",
"-------                       -------\n",
"M                             1  0  0\n",
"F        is transformed into  0  1  0\n",
"Unknown                       0  0  1\n",
"M                             1  0  0"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Transforming categorical data with Pandas and Scikit-Learn\n",
"\n",
"We can use:\n",
"* **get_dummies()** from Pandas (one hot encoding)\n",
"* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding) from Scikit-Learn.\n",
"\n",
"We are going to see both approaches, starting with one hot encoding."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### One Hot Encoding\n",
"We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD\n"
]
}
],
"source": [
"import pandas as pd\n",
"\n",
"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
"        \"Age\": [18, 19, 20, 30],\n",
"        \"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
"        \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
"        }\n",
"df = pd.DataFrame(data)\n",
"print(df)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>False</td>\n",
" <td>True</td>\n",
" <td>False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",
"0 Marius 18 1 1 False True \n",
"1 Maria 19 0 3 True False \n",
"2 John 20 1 0 False True \n",
"3 Carla 30 0 2 True False \n",
"\n",
" Position_TA Position_graduate Position_phD Position_professor \n",
"0 False True False False \n",
"1 False False False True \n",
"2 True False False False \n",
"3 False False True False "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
"df_onehot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use *OneHotEncoder* from Scikit-Learn."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Sex_Female</th>\n",
" <th>Sex_Male</th>\n",
" <th>Position_TA</th>\n",
" <th>Position_graduate</th>\n",
" <th>Position_phD</th>\n",
" <th>Position_professor</th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",
"0 0.0 1.0 0.0 1.0 0.0 \n",
"1 1.0 0.0 0.0 0.0 0.0 \n",
"2 0.0 1.0 1.0 0.0 0.0 \n",
"3 1.0 0.0 0.0 0.0 1.0 \n",
"\n",
" Position_professor Name Age sex_encoded position_encoded \n",
"0 0.0 Marius 18 1 1 \n",
"1 1.0 Maria 19 0 3 \n",
"2 0.0 John 20 1 0 \n",
"3 0.0 Carla 30 0 2 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import OneHotEncoder\n",
"from sklearn.compose import make_column_transformer\n",
"\n",
"df_onehotencoder = df\n",
"\n",
"# One hot encode the categorical columns and pass the other columns through unchanged\n",
"transformer = make_column_transformer(\n",
" (OneHotEncoder(), ['Sex', 'Position']),\n",
" remainder='passthrough',\n",
" verbose_feature_names_out=False)\n",
"\n",
"# transform\n",
"transformed = transformer.fit_transform(df_onehotencoder)\n",
"\n",
"df_onehotencoder = pd.DataFrame(\n",
" transformed,\n",
" columns=transformer.get_feature_names_out())\n",
"df_onehotencoder"
]
},
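{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pandas' *get_dummies* is easier to use for transforming DataFrames. *OneHotEncoder* remembers the categories learned in *fit*, so it produces the same columns for new data, which makes it a good choice for integrating the step in a machine learning pipeline. It also returns a sparse matrix by default, which saves memory when there are many categories."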
]
},
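{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference on new data can be sketched as follows. This is a minimal example, assuming a small illustrative training set and one new row with a category that was not seen during *fit*; the DataFrames and the *postdoc* value are made up for the illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.compose import make_column_transformer\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Illustrative training data and new data (not part of the course dataset)\n",
"train = pd.DataFrame({'Age': [18, 19, 20],\n",
"                      'Position': ['graduate', 'professor', 'TA']})\n",
"new = pd.DataFrame({'Age': [40], 'Position': ['postdoc']})\n",
"\n",
"# handle_unknown='ignore' encodes categories unseen in fit as all zeros\n",
"transformer = make_column_transformer(\n",
"    (OneHotEncoder(handle_unknown='ignore'), ['Position']),\n",
"    remainder='passthrough',\n",
"    verbose_feature_names_out=False)\n",
"\n",
"transformer.fit(train)\n",
"columns = list(transformer.get_feature_names_out())\n",
"encoded_new = transformer.transform(new)\n",
"\n",
"print(columns)\n",
"print(encoded_new)\n",
"```\n",
"\n",
"The columns are fixed by the training data, so the new row is encoded with the same layout. Calling *get_dummies* on the new data alone would instead create a single *Position_postdoc* column."
]
},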
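{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Integer encoding\n",
"We will use **LabelEncoder**, which assigns the integers following the sorted order of the values and not their order of appearance (so *'TA'* gets 0, because uppercase letters sort before lowercase ones). It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html). Note that Scikit-Learn documents *LabelEncoder* as intended for target labels; *OrdinalEncoder* is its counterpart for input features."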
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position\n",
"0 Marius 18 Male graduate\n",
"1 Maria 19 Female professor\n",
"2 John 20 Male TA\n",
"3 Carla 30 Female phD"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Create a LabelEncoder instance\n",
"labelencoder = LabelEncoder()\n",
"# Work on a copy so that the encoded columns are not added to the original df\n",
"df_encoded = df.copy()\n",
"# Values present in each categorical column (for reference only, not used below)\n",
"sex_values = ('Male', 'Female')\n",
"position_values = ('graduate', 'professor', 'TA', 'phD')\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded\n",
"0 Marius 18 Male graduate 1\n",
"1 Maria 19 Female professor 0\n",
"2 John 20 Male TA 1\n",
"3 Carla 30 Female phD 0"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
"df_encoded"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Age</th>\n",
" <th>Sex</th>\n",
" <th>Position</th>\n",
" <th>sex_encoded</th>\n",
" <th>position_encoded</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Marius</td>\n",
" <td>18</td>\n",
" <td>Male</td>\n",
" <td>graduate</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Maria</td>\n",
" <td>19</td>\n",
" <td>Female</td>\n",
" <td>professor</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John</td>\n",
" <td>20</td>\n",
" <td>Male</td>\n",
" <td>TA</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Carla</td>\n",
" <td>30</td>\n",
" <td>Female</td>\n",
" <td>phD</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Age Sex Position sex_encoded position_encoded\n",
"0 Marius 18 Male graduate 1 1\n",
"1 Maria 19 Female professor 0 3\n",
"2 John 20 Male TA 1 0\n",
"3 Carla 30 Female phD 0 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
"df_encoded"
]
},
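{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recovering the original values with *inverse_transform* can be sketched as follows. This is a minimal example that assumes a separate encoder for the column (here called *position_encoder*, an illustrative name). The cells above reuse a single *labelencoder*, so after the second *fit_transform* it can only decode **Position**.\n",
"\n",
"```python\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"# Illustrative values matching the Position column used above\n",
"positions = ['graduate', 'professor', 'TA', 'phD']\n",
"\n",
"position_encoder = LabelEncoder()\n",
"codes = position_encoder.fit_transform(positions).tolist()\n",
"# classes_ lists the learned values; the index of each value is its code\n",
"classes = position_encoder.classes_.tolist()\n",
"# inverse_transform maps the integer codes back to the original values\n",
"decoded = position_encoder.inverse_transform(codes).tolist()\n",
"\n",
"print(codes)    # [1, 3, 0, 2]\n",
"print(classes)  # ['TA', 'graduate', 'phD', 'professor']\n",
"print(decoded)  # ['graduate', 'professor', 'TA', 'phD']\n",
"```\n",
"\n",
"Keeping one fitted encoder per column is what allows each column to be decoded later."
]
},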
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"# References\n",
"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.13"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 4
}