![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to  Preprocessing](00_Intro_Preprocessing.ipynb)

# Unknown values

There are two possible approaches: **remove** the rows that contain unknown values, or **fill** them. The right choice depends on the case.

In [2]:
import pandas as pd
import numpy as np

## Filling NaN values
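If we need to fill errors or blanks, we can use the method **fillna()**. Rows or columns with missing values can instead be removed with **dropna()**, covered in the section on removing NaN values.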

* For **string** fields, we can fill NaN with **' '**.

* For **numbers**, we can fill with the **mean** or **median** value. 
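The two strategies above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the DataFrame `people` and its columns `name` and `age` are hypothetical, chosen only to have one string field and one numeric field.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one string column and one numeric column
people = pd.DataFrame({
    'name': ['Ana', np.nan, 'Luis', 'Eva'],
    'age': [20.0, np.nan, 40.0, 90.0],
})

# fillna accepts a dict, so each column gets its own replacement value:
# a blank string for the text field, the column mean for the numeric one
filled = people.fillna({'name': ' ', 'age': people['age'].mean()})

# Alternative for the numeric column: use the median instead of the mean
age_median_filled = people['age'].fillna(people['age'].median())
```

The median is usually less sensitive to outliers than the mean (here the value 90.0 pulls the mean up to 50.0, while the median is 40.0), so which one to use depends on the distribution of the column.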


## Propagate non-null values forward or backward
You can also **propagate** non-null values with these methods:

* **ffill**: Fill values by propagating the last valid observation to the next valid.
* **bfill**:  Fill values using the following valid observation to fill the gap.
* **interpolate**:  Fill NaN values using interpolation.

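For example, **ffill** fills each NaN value in the DataFrame with the previous non-NaN value.

By default, all consecutive NaN values are filled. Use **limit** to cap how many consecutive NaN values are filled (for example, **limit=1** fills only one). You can also pass **inplace=True** to modify the DataFrame in place.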

In [17]:
df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})

In [11]:
df

|   | col1 |
|---|------|
| 0 | NaN  |
| 1 | NaN  |
| 2 | 2.0  |
| 3 | 3.0  |
| 4 | 4.0  |
| 5 | NaN  |
| 6 | NaN  |


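We fill forward the value 4.0 into the next row only (limit=1). The NaN values in rows 0 and 1 stay unchanged because there is no previous valid value to propagate.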

In [12]:
df.ffill(limit=1)

|   | col1 |
|---|------|
| 0 | NaN  |
| 1 | NaN  |
| 2 | 2.0  |
| 3 | 3.0  |
| 4 | 4.0  |
| 5 | 4.0  |
| 6 | NaN  |


In [None]:
df.ffill()

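We can also backfill with **bfill**. Since we do not include *limit*, every NaN value that has a later valid observation is filled. The NaN values in rows 5 and 6 remain because no valid value follows them.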

In [13]:
df.bfill()

|   | col1 |
|---|------|
| 0 | 2.0  |
| 1 | 2.0  |
| 2 | 2.0  |
| 3 | 3.0  |
| 4 | 4.0  |
| 5 | NaN  |
| 6 | NaN  |
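The third method listed earlier, **interpolate**, is not demonstrated in the cells above. A minimal sketch follows, assuming the default linear method. The DataFrame `gaps` is a hypothetical example with NaN values between two valid observations, since the `df` used above only has NaN values at its edges.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a gap between two valid observations
gaps = pd.DataFrame(data={'col1': [1.0, np.nan, np.nan, 4.0]})

# The default method is linear: the missing values are placed on a
# straight line between the surrounding valid values (1.0 and 4.0)
interpolated = gaps.interpolate()
```

Unlike **ffill** and **bfill**, which copy an existing value, interpolation estimates intermediate values, so it only makes sense when the rows have a meaningful order (for example, a time series).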


## Removing NaN values
We can remove them by row or column (use inplace=True if you want to modify the DataFrame).

In [26]:
# Drop any rows which have any nans
df1 = df.dropna()
# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)
df2 = df.dropna(axis=1)
# Keep only the columns with at least 90% non-NaN values (thresh = minimum number of non-NaN values required)
df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)
df1

|   | col1 |
|---|------|
| 2 | 2.0  |
| 3 | 3.0  |
| 4 | 4.0  |
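Because the example DataFrame has a single column, the effect of `axis=1` and `thresh` is hard to see. The sketch below uses a hypothetical DataFrame `data` with three columns of different completeness (the column names are made up for illustration) to show which columns survive.

```python
import numpy as np
import pandas as pd

# Hypothetical data with 10 rows:
# 'full' has 10 valid values, 'mostly_full' has 9, 'sparse' has 3
data = pd.DataFrame({
    'full': list(range(10)),
    'mostly_full': [np.nan] + list(range(9)),
    'sparse': [1, 2, 3] + [np.nan] * 7,
})

# thresh is the minimum number of non-NaN values a column needs to be kept
min_non_nan = int(data.shape[0] * 0.9)  # 9 of 10 rows

# Columns with fewer than 9 valid values are dropped
kept = data.dropna(thresh=min_non_nan, axis=1)

# Row-wise removal: keep only the rows without any NaN
rows_complete = data.dropna()
```

Here `kept` retains `full` and `mostly_full` and drops `sparse`, while `rows_complete` retains only rows 1 and 2, the only rows with a valid value in all three columns.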


# References
* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019
* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.