![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)

# Categorical Data

For many ML algorithms, we need to transform categorical data into numbers.

For example:
* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*.
* **'Position'** with values *'phD'*, *'Professor'*, *'TA'*, *'graduate'*.
* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.

There are two main approaches:
* Integer encoding
* One hot encoding

## Integer Encoding
We assign a number to every value:

['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]

['phD', 'Professor', 'TA','graduate', 'phD'] --> [0, 1, 2, 3, 0]

['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]

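The main problem with this representation is that integers have a natural order, so some ML algorithms may treat the codes as magnitudes and infer, for example, that *'TA'* (2) is greater than *'Professor'* (1).

In our examples, this representation can be suitable for **temperature**, whose values are genuinely ordered, but not for the other two.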
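When the order matters, as with temperature, it can help to state that order explicitly instead of relying on the order an encoder happens to choose. The following is a minimal sketch with pandas; the variable names are illustrative and not part of the original notes.

```python
import pandas as pd

temperature = pd.Series(["low", "medium", "high", "low"])

# Declare the category order explicitly so that low < medium < high
ordered = pd.Categorical(
    temperature, categories=["low", "medium", "high"], ordered=True
)

# .codes holds the integer assigned to each value
codes = ordered.codes.tolist()
print(codes)  # [0, 1, 2, 0]
```

Without the explicit `categories` argument, the categories would be sorted alphabetically (*high*, *low*, *medium*), and the codes would no longer reflect the real order.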

## One Hot Encoding
A binary column is created for each value of the categorical variable.
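To make the idea concrete, here is a small hand-written sketch (not the library implementation) that builds the binary columns for the *'Sex'* example with NumPy. The categories are taken in order of first appearance, which is an assumption made only for this illustration.

```python
import numpy as np

values = ["M", "F", "Unknown", "M"]

# Categories in order of first appearance: ['M', 'F', 'Unknown']
categories = list(dict.fromkeys(values))

# One row per sample, one binary column per category
one_hot = np.array([[int(v == c) for c in categories] for v in values])

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of the category that the sample belongs to, so no order between categories is implied.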

## Transforming categorical data with Pandas and Scikit-Learn

We can use:
* **get_dummies()** from Pandas (one hot encoding)
* **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding) from Scikit-Learn.

We are going to see both approaches.

### One Hot Encoding
We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*).

In [11]:
import pandas as pd

data = {"Name": ["Marius", "Maria", "John", "Carla"],
        "Age": [18, 19, 20, 30],
        "Sex": ["Male", "Female", "Male", "Female"],
        "Position": ["graduate", "professor", "TA", "phD"]
       }
df = pd.DataFrame(data)
print(df)

     Name  Age     Sex   Position
0  Marius   18    Male   graduate
1   Maria   19  Female  professor
2    John   20    Male         TA
3   Carla   30  Female        phD


In [18]:
df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])
df_onehot

Unnamed: 0,Name,Age,Sex_Female,Sex_Male,Position_TA,Position_graduate,Position_phD,Position_professor
0,Marius,18,False,True,False,True,False,False
1,Maria,19,True,False,False,False,False,True
2,John,20,False,True,True,False,False,False
3,Carla,30,True,False,False,False,True,False
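If 0/1 integers are preferred over booleans in the new columns, *get_dummies* accepts a `dtype` argument. A short sketch on the same data, where the name `df_onehot_int` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Marius", "Maria", "John", "Carla"],
    "Age": [18, 19, 20, 30],
    "Sex": ["Male", "Female", "Male", "Female"],
    "Position": ["graduate", "professor", "TA", "phD"],
})

# dtype=int produces 0/1 columns instead of True/False
df_onehot_int = pd.get_dummies(df, columns=["Sex", "Position"], dtype=int)
print(df_onehot_int)
```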




We can also use *OneHotEncoder* from Scikit-Learn.

In [27]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer

# work on a copy of the original data
df_onehotencoder = df.copy()

# Transformer for several columns
transformer = make_column_transformer(
  (OneHotEncoder(), ['Sex', 'Position']),
  remainder='passthrough',
  verbose_feature_names_out=False)

# transform
transformed = transformer.fit_transform(df_onehotencoder)

df_onehotencoder = pd.DataFrame(
  transformed,
  columns=transformer.get_feature_names_out())
df_onehotencoder

Unnamed: 0,Sex_Female,Sex_Male,Position_TA,Position_graduate,Position_phD,Position_professor,Name,Age
0,0.0,1.0,0.0,1.0,0.0,0.0,Marius,18
1,1.0,0.0,0.0,0.0,0.0,1.0,Maria,19
2,0.0,1.0,1.0,0.0,0.0,0.0,John,20
3,1.0,0.0,0.0,0.0,1.0,0.0,Carla,30


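Pandas' *get_dummies* is the simplest option for transforming a DataFrame. *OneHotEncoder* stores the categories learned during fitting, so it can apply the same encoding to new data and can be integrated as a step in a machine learning pipeline. By default it also returns a sparse matrix, which saves memory when there are many categories.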

### Integer encoding
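We will use **LabelEncoder**, which assigns an integer to each value following the sorted order of the values. The original values can be recovered with *inverse_transform*. Note that the Scikit-Learn documentation describes *LabelEncoder* as intended for target labels; for input features it provides *OrdinalEncoder*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).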

In [14]:
from sklearn.preprocessing import LabelEncoder

# create an instance of LabelEncoder
labelencoder = LabelEncoder()
# work on a copy so that the original DataFrame is not modified
df_encoded = df.copy()
df_encoded

Unnamed: 0,Name,Age,Sex,Position
0,Marius,18,Male,graduate
1,Maria,19,Female,professor
2,John,20,Male,TA
3,Carla,30,Female,phD


In [16]:
# assign numerical values and store them in a new column
df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])
df_encoded

Unnamed: 0,Name,Age,Sex,Position,sex_encoded
0,Marius,18,Male,graduate,1
1,Maria,19,Female,professor,0
2,John,20,Male,TA,1
3,Carla,30,Female,phD,0


In [17]:
# fitting the same encoder again replaces the mapping learned for 'Sex'
df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])
df_encoded

Unnamed: 0,Name,Age,Sex,Position,sex_encoded,position_encoded
0,Marius,18,Male,graduate,1,1
1,Maria,19,Female,professor,0,3
2,John,20,Male,TA,1,0
3,Carla,30,Female,phD,0,2
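In the cells above a single `labelencoder` is fitted first on *'Sex'* and then on *'Position'*, so afterwards it only remembers the *'Position'* mapping. If the original values need to be recovered with *inverse_transform*, one option is to keep one fitted encoder per column. A minimal sketch, where the `encoders` dictionary is an illustrative choice:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Sex": ["Male", "Female", "Male", "Female"],
    "Position": ["graduate", "professor", "TA", "phD"],
})

# One fitted encoder per column, so every mapping stays available
encoders = {col: LabelEncoder().fit(df[col]) for col in ["Sex", "Position"]}

sex_codes = encoders["Sex"].transform(df["Sex"])
sex_classes = encoders["Sex"].classes_.tolist()
sex_decoded = encoders["Sex"].inverse_transform(sex_codes).tolist()

print(sex_codes.tolist())  # [1, 0, 1, 0]
print(sex_classes)         # ['Female', 'Male']
print(sex_decoded)         # ['Male', 'Female', 'Male', 'Female']
```

The `classes_` attribute lists the values in sorted order; the position of each value in that list is its integer code.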


# References
* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019
* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit Learn

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.