sitc/ml21/preprocessing/08_Categorical.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Categorical Data\n",
    "\n",
    "For many ML algorithms, we need to transform categorical data into numbers.\n",
    "\n",
    "For example:\n",
    "* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*.\n",
    "* **'Position'** with values *'phD'*, *'Professor'*, *'TA'*, *'graduate'*.\n",
    "* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n",
    "\n",
    "There are two main approaches:\n",
    "* Integer encoding\n",
    "* One hot encoding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Integer Encoding\n",
    "We assign a number to every value:\n",
    "\n",
    "['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",
    "\n",
    "['phD', 'Professor', 'TA', 'graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",
    "\n",
    "['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",
    "\n",
    "The main problem with this representation is that integers have a natural order, so some ML algorithms may treat the codes as ranked or as distances between categories.\n",
    "\n",
    "In our examples, this representation can be suitable for **temperature**, whose values are naturally ordered, but not for the other two."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## One Hot Encoding\n",
    "A binary column is created for each value of the categorical variable."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Sex                                               M  F  U\n",
    "-----                                            ---------\n",
    "M                                                 1  0  0\n",
    "F                     is transformed into         0  1  0\n",
    "Unknown                                           0  0  1\n",
    "M                                                 1  0  0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Transforming categorical data with Pandas and Scikit-Learn\n",
    "\n",
    "We can use:\n",
    "* **get_dummies()** from Pandas (one hot encoding)\n",
    "* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding) from Scikit-Learn.\n",
    "\n",
    "We are going to see both approaches, starting with one hot encoding."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### One Hot Encoding\n",
    "We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     Name  Age     Sex   Position\n",
      "0  Marius   18    Male   graduate\n",
      "1   Maria   19  Female  professor\n",
      "2    John   20    Male         TA\n",
      "3   Carla   30  Female        phD\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",
    "        \"Age\": [18, 19, 20, 30],\n",
    "        \"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",
    "        \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",
    "       }\n",
    "df = pd.DataFrame(data)\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex_Female</th>\n",
       "      <th>Sex_Male</th>\n",
       "      <th>Position_TA</th>\n",
       "      <th>Position_graduate</th>\n",
       "      <th>Position_phD</th>\n",
       "      <th>Position_professor</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age  Sex_Female  Sex_Male  Position_TA  Position_graduate  \\\n",
       "0  Marius   18       False      True        False               True   \n",
       "1   Maria   19        True     False        False              False   \n",
       "2    John   20       False      True         True              False   \n",
       "3   Carla   30        True     False        False              False   \n",
       "\n",
       "   Position_phD  Position_professor  \n",
       "0         False               False  \n",
       "1         False                True  \n",
       "2         False               False  \n",
       "3          True               False  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",
    "df_onehot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also use *OneHotEncoder* from Scikit-Learn."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Sex_Female</th>\n",
       "      <th>Sex_Male</th>\n",
       "      <th>Position_TA</th>\n",
       "      <th>Position_graduate</th>\n",
       "      <th>Position_phD</th>\n",
       "      <th>Position_professor</th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  Sex_Female Sex_Male Position_TA Position_graduate Position_phD  \\\n",
       "0        0.0      1.0         0.0               1.0          0.0   \n",
       "1        1.0      0.0         0.0               0.0          0.0   \n",
       "2        0.0      1.0         1.0               0.0          0.0   \n",
       "3        1.0      0.0         0.0               0.0          1.0   \n",
       "\n",
       "  Position_professor    Name Age  \n",
       "0                0.0  Marius  18  \n",
       "1                1.0   Maria  19  \n",
       "2                0.0    John  20  \n",
       "3                0.0   Carla  30  "
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import OneHotEncoder\n",
    "from sklearn.compose import make_column_transformer\n",
    "\n",
    "# One hot encode 'Sex' and 'Position'; pass the other columns through unchanged\n",
    "transformer = make_column_transformer(\n",
    "    (OneHotEncoder(), ['Sex', 'Position']),\n",
    "    remainder='passthrough',\n",
    "    verbose_feature_names_out=False)\n",
    "\n",
    "# Fit the encoder and transform the DataFrame (returns an array)\n",
    "transformed = transformer.fit_transform(df)\n",
    "\n",
    "# Rebuild a DataFrame with the generated column names\n",
    "df_onehotencoder = pd.DataFrame(\n",
    "    transformed,\n",
    "    columns=transformer.get_feature_names_out())\n",
    "df_onehotencoder"
   ]
  },
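  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas' *get_dummies* is easier for transforming DataFrames. *OneHotEncoder* remembers the categories learned in *fit*, so it encodes new data consistently and integrates well as a step in a machine learning pipeline. It also returns a sparse matrix by default, which can save memory when there are many categories."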
   ]
  },
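  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Integer encoding\n",
    "We will use **LabelEncoder**, which assigns the integer codes following the sorted order of the values. It is possible to get the original values with *inverse_transform*. Scikit-Learn intends *LabelEncoder* for target labels; for input features, *OrdinalEncoder* is the equivalent transformer. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)"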
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position\n",
       "0  Marius   18    Male   graduate\n",
       "1   Maria   19  Female  professor\n",
       "2    John   20    Male         TA\n",
       "3   Carla   30  Female        phD"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.preprocessing import LabelEncoder\n",
    "\n",
    "# Create an instance of LabelEncoder\n",
    "labelencoder = LabelEncoder()\n",
    "# Work on a copy so that the encoded columns are not added to df\n",
    "df_encoded = df.copy()\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "      <th>sex_encoded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position  sex_encoded\n",
       "0  Marius   18    Male   graduate            1\n",
       "1   Maria   19  Female  professor            0\n",
       "2    John   20    Male         TA            1\n",
       "3   Carla   30  Female        phD            0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Position</th>\n",
       "      <th>sex_encoded</th>\n",
       "      <th>position_encoded</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Marius</td>\n",
       "      <td>18</td>\n",
       "      <td>Male</td>\n",
       "      <td>graduate</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Maria</td>\n",
       "      <td>19</td>\n",
       "      <td>Female</td>\n",
       "      <td>professor</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John</td>\n",
       "      <td>20</td>\n",
       "      <td>Male</td>\n",
       "      <td>TA</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Carla</td>\n",
       "      <td>30</td>\n",
       "      <td>Female</td>\n",
       "      <td>phD</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Name  Age     Sex   Position  sex_encoded  position_encoded\n",
       "0  Marius   18    Male   graduate            1                 1\n",
       "1   Maria   19  Female  professor            0                 3\n",
       "2    John   20    Male         TA            1                 0\n",
       "3   Carla   30  Female        phD            0                 2"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",
    "df_encoded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# References\n",
    "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",
    "* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit-Learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
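
The two encodings described in the notebook above can be sketched without any library, to show what the encoders compute. This is a minimal illustration, not how Pandas or Scikit-Learn implement them: the helper names `integer_encode` and `one_hot_encode` are hypothetical, and categories are sorted alphabetically unless an explicit order is passed.

```python
def integer_encode(values, order=None):
    """Map each category to an integer code.

    If `order` is given, codes follow that order (useful for ordinal data);
    otherwise the distinct values are sorted alphabetically.
    """
    categories = list(order) if order is not None else sorted(set(values))
    index = {category: code for code, category in enumerate(categories)}
    return [index[value] for value in values], categories


def one_hot_encode(values):
    """Create one binary column per category (columns sorted alphabetically)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


temperature = ["low", "medium", "high", "low"]

# Alphabetical codes lose the natural order: high=0, low=1, medium=2.
alphabetical_codes, _ = integer_encode(temperature)

# An explicit order keeps low < medium < high.
ordered_codes, _ = integer_encode(temperature, order=["low", "medium", "high"])

sex = ["M", "F", "Unknown", "M"]
sex_rows, sex_columns = one_hot_encode(sex)

print(alphabetical_codes)  # [1, 2, 0, 1]
print(ordered_codes)       # [0, 1, 2, 0]
print(sex_columns)         # ['F', 'M', 'Unknown']
print(sex_rows)            # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Passing an explicit order is what makes integer codes meaningful for an ordinal variable such as temperature; with the alphabetical default, 'high' would receive the smallest code.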
Added preprocessing notebooks 2024-04-03 20:50:36 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"![](images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "slide"`
			`}`
			`},`
			`"source": [`
			`"# Categorical Data\n",`
			`"\n",`
			`"For many ML algorithms, we need to transform categorical data into numbers.\n",`
			`"\n",`
			`"For example:\n",`
			`"* 'Sex' with values 'M', 'F', 'Unknown'. \n",`
			`"* 'Position' with values 'phD', 'Professor', 'TA', 'graduate'.\n",`
			`"* 'Temperature' with values 'low', 'medium', 'high'.\n",`
			`"\n",`
			`"There are two main approaches:\n",`
			`"* Integer encoding\n",`
			`"* One hot encoding"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "slide"`
			`}`
			`},`
			`"source": [`
			`"## Integer Encoding\n",`
			`"We assign a number to every value:\n",`
			`"\n",`
			`"['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n",`
			`"\n",`
			`"['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]\n",`
			`"\n",`
			`"['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n",`
			`"\n",`
			`"The main problem with this representation is integers have a natural order, and some ML algorithms can be confused. \n",`
			`"\n",`
			`"In our examples, this representation can be suitable for temperature, but not for the other two."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "slide"`
			`}`
			`},`
			`"source": [`
			`"## One Hot Encoding\n",`
			`"A binary column is created for each value of the categorical variable."`
			`]`
			`},`
			`{`
			`"cell_type": "raw",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "fragment"`
			`}`
			`},`
			`"source": [`
			`"Sex M F U\n",`
			`"----- ---------\n",`
			`"M 1 0 0\n",`
			`"F is transformed into 0 1 0\n",`
			`"Unknown 0 0 1\n",`
			`"M 1 0 0 "`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "slide"`
			`}`
			`},`
			`"source": [`
			`"## Transforming categorical data with Scikit-Learn\n",`
			`"\n",`
			`"We can use:\n",`
			`"* get_dummies() (one hot encoding)\n",`
			`"* LabelEncoder (integer encoding) and OneHotEncoder (one hot encoding). \n",`
			`"\n",`
			`"We are going to learn the first approach."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### One Hot Encoding\n",`
			`"We can use Pandas (get_dummies) or Scikit-Learn (OneHotEncoder)."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "fragment"`
			`}`
			`},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`" Name Age Sex Position\n",`
			`"0 Marius 18 Male graduate\n",`
			`"1 Maria 19 Female professor\n",`
			`"2 John 20 Male TA\n",`
			`"3 Carla 30 Female phD\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"import pandas as pd\n",`
			`"\n",`
			`"data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n",`
			`" \"Age\": [18, 19, 20, 30],\n",`
			`"\t\t\"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n",`
			`" \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n",`
			`" }\n",`
			`"df = pd.DataFrame(data)\n",`
			`"print(df)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 18,`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "subslide"`
			`}`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<style scoped>\n",`
			`" .dataframe tbody tr th:only-of-type {\n",`
			`" vertical-align: middle;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe tbody tr th {\n",`
			`" vertical-align: top;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe thead th {\n",`
			`" text-align: right;\n",`
			`" }\n",`
			`"</style>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>Name</th>\n",`
			`" <th>Age</th>\n",`
			`" <th>sex_encoded</th>\n",`
			`" <th>position_encoded</th>\n",`
			`" <th>Sex_Female</th>\n",`
			`" <th>Sex_Male</th>\n",`
			`" <th>Position_TA</th>\n",`
			`" <th>Position_graduate</th>\n",`
			`" <th>Position_phD</th>\n",`
			`" <th>Position_professor</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>0</th>\n",`
			`" <td>Marius</td>\n",`
			`" <td>18</td>\n",`
			`" <td>1</td>\n",`
			`" <td>1</td>\n",`
			`" <td>False</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>1</th>\n",`
			`" <td>Maria</td>\n",`
			`" <td>19</td>\n",`
			`" <td>0</td>\n",`
			`" <td>3</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>True</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>2</th>\n",`
			`" <td>John</td>\n",`
			`" <td>20</td>\n",`
			`" <td>1</td>\n",`
			`" <td>0</td>\n",`
			`" <td>False</td>\n",`
			`" <td>True</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>3</th>\n",`
			`" <td>Carla</td>\n",`
			`" <td>30</td>\n",`
			`" <td>0</td>\n",`
			`" <td>2</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>False</td>\n",`
			`" <td>True</td>\n",`
			`" <td>False</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" Name Age sex_encoded position_encoded Sex_Female Sex_Male \\\n",`
			`"0 Marius 18 1 1 False True \n",`
			`"1 Maria 19 0 3 True False \n",`
			`"2 John 20 1 0 False True \n",`
			`"3 Carla 30 0 2 True False \n",`
			`"\n",`
			`" Position_TA Position_graduate Position_phD Position_professor \n",`
			`"0 False True False False \n",`
			`"1 False False False True \n",`
			`"2 True False False False \n",`
			`"3 False False True False "`
			`]`
			`},`
			`"execution_count": 18,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`},`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"The history saving thread hit an unexpected error (OperationalError('attempt to write a readonly database')).History will not be written to the database.\n"`
			`]`
			`}`
			`],`
			`"source": [`
			`"df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n",`
			`"df_onehot"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We can also use OneHotEncoder from Scikit."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 27,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<style scoped>\n",`
			`" .dataframe tbody tr th:only-of-type {\n",`
			`" vertical-align: middle;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe tbody tr th {\n",`
			`" vertical-align: top;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe thead th {\n",`
			`" text-align: right;\n",`
			`" }\n",`
			`"</style>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>Sex_Female</th>\n",`
			`" <th>Sex_Male</th>\n",`
			`" <th>Position_TA</th>\n",`
			`" <th>Position_graduate</th>\n",`
			`" <th>Position_phD</th>\n",`
			`" <th>Position_professor</th>\n",`
			`" <th>Name</th>\n",`
			`" <th>Age</th>\n",`
			`" <th>sex_encoded</th>\n",`
			`" <th>position_encoded</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>0</th>\n",`
			`" <td>0.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>Marius</td>\n",`
			`" <td>18</td>\n",`
			`" <td>1</td>\n",`
			`" <td>1</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>1</th>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>Maria</td>\n",`
			`" <td>19</td>\n",`
			`" <td>0</td>\n",`
			`" <td>3</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>2</th>\n",`
			`" <td>0.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>John</td>\n",`
			`" <td>20</td>\n",`
			`" <td>1</td>\n",`
			`" <td>0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>3</th>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" <td>0.0</td>\n",`
			`" <td>Carla</td>\n",`
			`" <td>30</td>\n",`
			`" <td>0</td>\n",`
			`" <td>2</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" Sex_Female Sex_Male Position_TA Position_graduate Position_phD \\\n",`
			`"0 0.0 1.0 0.0 1.0 0.0 \n",`
			`"1 1.0 0.0 0.0 0.0 0.0 \n",`
			`"2 0.0 1.0 1.0 0.0 0.0 \n",`
			`"3 1.0 0.0 0.0 0.0 1.0 \n",`
			`"\n",`
			`" Position_professor Name Age sex_encoded position_encoded \n",`
			`"0 0.0 Marius 18 1 1 \n",`
			`"1 1.0 Maria 19 0 3 \n",`
			`"2 0.0 John 20 1 0 \n",`
			`"3 0.0 Carla 30 0 2 "`
			`]`
			`},`
			`"execution_count": 27,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"from sklearn.preprocessing import OneHotEncoder\n",`
			`"from sklearn.compose import make_column_transformer\n",`
			`"\n",`
			`"df_onehotencoder = df\n",`
			`"# create the OneHotEncoder object\n",`
			`"encoder = OneHotEncoder()\n",`
			`"\n",`
			`"# Column transformer: one-hot encode 'Sex' and 'Position', pass the other columns through\n",`
			`"transformer = make_column_transformer(\n",`
			`" (encoder, ['Sex', 'Position']),\n",`
			`" remainder='passthrough',\n",`
			`" verbose_feature_names_out=False)\n",`
			`"\n",`
			`"# transform\n",`
			`"transformed = transformer.fit_transform(df_onehotencoder)\n",`
			`"\n",`
			`"df_onehotencoder = pd.DataFrame(\n",`
			`" transformed,\n",`
			`" columns=transformer.get_feature_names_out())\n",`
			`"df_onehotencoder"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Pandas' `get_dummies` is the easier option for transforming a DataFrame directly. `OneHotEncoder` remembers the categories learned during `fit`, so it can apply the same encoding to new data and can be integrated as a step in a machine learning pipeline."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Integer encoding\n",`
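			`"We will use `LabelEncoder`, which assigns integers following the sorted order of the values (for example, 'Female' is 0 and 'Male' is 1). The original values can be recovered with `inverse_transform`. Note that scikit-learn intends `LabelEncoder` for target labels; for input features, `OrdinalEncoder` provides the same kind of integer encoding. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)."`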
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 14,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<style scoped>\n",`
			`" .dataframe tbody tr th:only-of-type {\n",`
			`" vertical-align: middle;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe tbody tr th {\n",`
			`" vertical-align: top;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe thead th {\n",`
			`" text-align: right;\n",`
			`" }\n",`
			`"</style>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>Name</th>\n",`
			`" <th>Age</th>\n",`
			`" <th>Sex</th>\n",`
			`" <th>Position</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>0</th>\n",`
			`" <td>Marius</td>\n",`
			`" <td>18</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>graduate</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>1</th>\n",`
			`" <td>Maria</td>\n",`
			`" <td>19</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>professor</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>2</th>\n",`
			`" <td>John</td>\n",`
			`" <td>20</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>TA</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>3</th>\n",`
			`" <td>Carla</td>\n",`
			`" <td>30</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>phD</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" Name Age Sex Position\n",`
			`"0 Marius 18 Male graduate\n",`
			`"1 Maria 19 Female professor\n",`
			`"2 John 20 Male TA\n",`
			`"3 Carla 30 Female phD"`
			`]`
			`},`
			`"execution_count": 14,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"from sklearn.preprocessing import LabelEncoder\n",`
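			`"# create a LabelEncoder instance\n",`
			`"labelencoder = LabelEncoder()\n",`
			`"# df_encoded refers to the same object as df (it is not a copy),\n",`
			`"# so the encoded columns added below also appear in df\n",`
			`"df_encoded = df\n",`
			`"# values present in each categorical column (for reference)\n",`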
			`"sex_values = ('Male', 'Female')\n",`
			`"position_values = ('graduate', 'professor', 'TA', 'phD')\n",`
			`"df_encoded"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 16,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<style scoped>\n",`
			`" .dataframe tbody tr th:only-of-type {\n",`
			`" vertical-align: middle;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe tbody tr th {\n",`
			`" vertical-align: top;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe thead th {\n",`
			`" text-align: right;\n",`
			`" }\n",`
			`"</style>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>Name</th>\n",`
			`" <th>Age</th>\n",`
			`" <th>Sex</th>\n",`
			`" <th>Position</th>\n",`
			`" <th>sex_encoded</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>0</th>\n",`
			`" <td>Marius</td>\n",`
			`" <td>18</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>graduate</td>\n",`
			`" <td>1</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>1</th>\n",`
			`" <td>Maria</td>\n",`
			`" <td>19</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>professor</td>\n",`
			`" <td>0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>2</th>\n",`
			`" <td>John</td>\n",`
			`" <td>20</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>TA</td>\n",`
			`" <td>1</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>3</th>\n",`
			`" <td>Carla</td>\n",`
			`" <td>30</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>phD</td>\n",`
			`" <td>0</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" Name Age Sex Position sex_encoded\n",`
			`"0 Marius 18 Male graduate 1\n",`
			`"1 Maria 19 Female professor 0\n",`
			`"2 John 20 Male TA 1\n",`
			`"3 Carla 30 Female phD 0"`
			`]`
			`},`
			`"execution_count": 16,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n",`
			`"df_encoded"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 17,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<style scoped>\n",`
			`" .dataframe tbody tr th:only-of-type {\n",`
			`" vertical-align: middle;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe tbody tr th {\n",`
			`" vertical-align: top;\n",`
			`" }\n",`
			`"\n",`
			`" .dataframe thead th {\n",`
			`" text-align: right;\n",`
			`" }\n",`
			`"</style>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>Name</th>\n",`
			`" <th>Age</th>\n",`
			`" <th>Sex</th>\n",`
			`" <th>Position</th>\n",`
			`" <th>sex_encoded</th>\n",`
			`" <th>position_encoded</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>0</th>\n",`
			`" <td>Marius</td>\n",`
			`" <td>18</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>graduate</td>\n",`
			`" <td>1</td>\n",`
			`" <td>1</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>1</th>\n",`
			`" <td>Maria</td>\n",`
			`" <td>19</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>professor</td>\n",`
			`" <td>0</td>\n",`
			`" <td>3</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>2</th>\n",`
			`" <td>John</td>\n",`
			`" <td>20</td>\n",`
			`" <td>Male</td>\n",`
			`" <td>TA</td>\n",`
			`" <td>1</td>\n",`
			`" <td>0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>3</th>\n",`
			`" <td>Carla</td>\n",`
			`" <td>30</td>\n",`
			`" <td>Female</td>\n",`
			`" <td>phD</td>\n",`
			`" <td>0</td>\n",`
			`" <td>2</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" Name Age Sex Position sex_encoded position_encoded\n",`
			`"0 Marius 18 Male graduate 1 1\n",`
			`"1 Maria 19 Female professor 0 3\n",`
			`"2 John 20 Male TA 1 0\n",`
			`"3 Carla 30 Female phD 0 2"`
			`]`
			`},`
			`"execution_count": 17,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n",`
			`"df_encoded"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"# References\n",`
			`"* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n",`
			`"* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {`
			`"slideshow": {`
			`"slide_type": "skip"`
			`}`
			`},`
			`"source": [`
			`"## Licence\n",`
			`"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
			`"© Carlos A. Iglesias, Universidad Politécnica de Madrid."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"celltoolbar": "Slideshow",`
			`"datacleaner": {`
			`"position": {`
			`"top": "50px"`
			`},`
			`"python": {`
			`"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"`
			`},`
			`"window_display": false`
			`},`
			`"kernelspec": {`
			`"display_name": "Python 3 (ipykernel)",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.10.13"`
			`},`
			`"latex_envs": {`
			`"LaTeX_envs_menu_present": true,`
			`"autocomplete": true,`
			`"bibliofile": "biblio.bib",`
			`"cite_by": "apalike",`
			`"current_citInitial": 1,`
			`"eqLabelWithNumbers": true,`
			`"eqNumInitial": 1,`
			`"hotkeys": {`
			`"equation": "Ctrl-E",`
			`"itemize": "Ctrl-I"`
			`},`
			`"labels_anchors": false,`
			`"latex_user_defs": false,`
			`"report_style_numbering": false,`
			`"user_envs_cfg": false`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 4`
			`}`