![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias

## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)

# Table of Contents
* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)
* [Examining a DataFrame](#Examining-a-DataFrame)
* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)
* [Grouping](#Grouping)
* [Pivot tables](#Pivot-tables)
* [Null and missing values](#Null-and-missing-values)
* [Analysing non numerical columns](#Analysing-non-numerical-columns)
* [Encoding categorical values](#Encoding-categorical-values)

# Data munging with Pandas and Scikit-learn

This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset.

[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling), or data wrangling, is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

*Scikit-learn* estimators assume that all values are numerical, which is common in many machine learning libraries. We therefore need to preprocess our raw dataset.
Some of the most common tasks are:
* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)
* Encode categorical variables as integers
* Combine  datasets
* Rename variables and convert types
* Transform / scale variables
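
Several of the tasks above can be sketched on a tiny example. The two frames below are hypothetical toy data (not the Titanic file), and min-max scaling is just one possible way of scaling a variable.

```python
import pandas as pd

# Hypothetical toy data: two small frames with the age stored as text
first = pd.DataFrame({'id': [1, 2], 'age': ['22', '38']})
second = pd.DataFrame({'id': [3], 'age': ['26']})

# Combine datasets: stack the rows and rebuild the index
combined = pd.concat([first, second], ignore_index=True)

# Rename variables and convert types
combined = combined.rename(columns={'age': 'Age'})
combined['Age'] = combined['Age'].astype(float)

# Transform / scale variables: min-max scaling to the range [0, 1]
age_range = combined['Age'].max() - combined['Age'].min()
combined['AgeScaled'] = (combined['Age'] - combined['Age'].min()) / age_range

print(combined)
```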

We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.

First we load the dataset and we get a dataframe.

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = pd.read_csv('data-titanic/train.csv')

# Show the first 5 rows
df[:5]

## Examining a DataFrame

We can  examine properties of the dataset.

In [None]:
# Information about columns and their types
df.info()

We see that some features have a numerical type (int64 and float64), and others have the type *object*. In Pandas, the object type is typically used for strings. We observe that most features are numerical, except for Name, Sex, Ticket, Cabin and Embarked.
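
As a minimal sketch of how numerical and non-numerical columns can be separated, the example below uses `select_dtypes` on a hypothetical toy frame (the column values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cid'],
    'Age': [22.0, 38.0, None],
    'Pclass': [3, 1, 2],
})

# Columns holding numbers (int64, float64) versus everything else
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
text_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(text_cols)
```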

In [None]:
# We can list non numerical properties, with a boolean indexing of the Series df.dtypes
df.dtypes[df.dtypes == object]

Let's explore the DataFrame.

In [None]:
# Number of samples and features
df.shape

In [None]:
# Basic statistics of the dataset in all the numeric columns
df.describe()

Observe that some of the statistics do not make sense for some columns (PassengerId or Pclass). We could have selected only the interesting columns.

In [None]:
# Describe statistics of relevant columns. We pass a list of columns
df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()

## Selecting rows in a DataFrame

In [None]:
# Select the first 5 rows
df.head(5)

In [None]:
# Select the last 5 rows
df.tail(5)

In [None]:
# Select several rows
df[2:5]

In [None]:
# Select the first 5 values of a column by name
df['Survived'][:5]

In [None]:
# Select several columns. Observe that the first parameter is a list
df[['Survived', 'Sex', 'Age']][:5]

In [None]:
# Boolean mask of passengers older than 20. Observe that DataFrame columns can be accessed like attributes.
df.Age > 20

In [None]:
# Select passengers older than 20 (only the last 5). We use boolean indexing
df[df.Age > 20][-5:]

In [None]:
# Select passengers older than 20 that survived (only the last 5)
df[(df.Age > 20) & (df.Survived == 1)][-5:]

In [None]:
# Alternative syntax: DataFrame.query() instead of standard Python boolean indexing
# In large DataFrames, DataFrame.query() using numexpr can be considerably faster; look at the references
df.query('Age > 20 and Survived == 1')[-5:]

DataFrames provide a set of functions for selection that we will need later


|Operation                      | Syntax        | Result    |
|-------------------------------|---------------|-----------|
|Select column                  | df[col]       | Series    |
|Select row by label            | df.loc[label] | Series    |
|Select row by integer location | df.iloc[loc]  | Series    |
|Slice rows                     | df[5:10]      | DataFrame |
|Select rows by boolean vector  | df[bool_vec]  | DataFrame |
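
The result types in the table can be checked with a short sketch. The frame and its row labels `'a'` to `'d'` are hypothetical toy data chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with explicit row labels
toy = pd.DataFrame(
    {'Age': [22.0, 38.0, 26.0, 35.0], 'Survived': [0, 1, 1, 1]},
    index=['a', 'b', 'c', 'd'],
)

column = toy['Age']               # select column -> Series
row_by_label = toy.loc['b']       # select row by label -> Series
row_by_pos = toy.iloc[1]          # select row by integer location -> Series
rows_slice = toy[1:3]             # slice rows -> DataFrame
rows_mask = toy[toy['Age'] > 30]  # select rows by boolean vector -> DataFrame

for result in (column, row_by_label, row_by_pos, rows_slice, rows_mask):
    print(type(result).__name__)
```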

In [None]:
# Select column and show last 4
df['Age'][-4:]

In [None]:
# Select a column by label. With loc we select with [row-labels, column-labels]; we show the last 4 values
df.loc[:, 'Age'][-4:]

In [None]:
# Select a column by integer position (Age is column 5, counting from 0), and show the last 4 values
df.iloc[:, 5][-4:]

In [None]:
# Slice rows - last 5 rows
df[-5:]

In [None]:
# Select based on a boolean vector and show the last 5 rows
df[df.Age > 20][-5:]

## Grouping

Rows can be grouped by one or more columns, and aggregation operators can then be applied to the resulting GroupBy object.
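
A minimal sketch of the idea on hypothetical toy data (the values are made up): we group, then aggregate. The result names `mean_age` and `passengers` are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3],
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [30.0, 40.0, 20.0, 24.0],
})

# Number of rows in each group
sizes = toy.groupby('Sex').size()

# One aggregation on one column
mean_age = toy.groupby('Pclass')['Age'].mean()

# Several named aggregations at once
summary = toy.groupby('Pclass').agg(
    mean_age=('Age', 'mean'),
    passengers=('Age', 'size'),
)

print(summary)
```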

In [None]:
# Number of passengers per sex (SQL like)
df.groupby('Sex').size()

In [None]:
#Mean age of  passengers per Passenger class

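#First we calculate the mean of every numeric column
# (numeric_only=True skips text columns, which recent pandas versions refuse to average)
df.groupby('Pclass').mean(numeric_only=True)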

In [None]:
#And now we answer the initial query (only mean age)
df.groupby('Pclass')['Age'].mean()

In [None]:
# Alternative syntax
df.groupby('Pclass').Age.mean()

In [None]:
#Mean Age and SibSp of passengers grouped by passenger class and sex
# Several columns are selected with a list, hence the double brackets
df.groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()

In [None]:
#Show mean  Age and  SibSp for passengers  older than 25 grouped by Passenger Class and Sex
df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()

In [None]:
# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex 
# Each condition needs its own parentheses, because & binds tighter than >
df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()

In [None]:
# We can also decide which function to apply in each column

#Show mean Age, mean SibSp, and number of passengers older than 25 that survived,  grouped by Passenger Class and Sex
df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': 'mean',
                                                                         'SibSp': 'mean', 'Survived': 'sum'})

## Pivot tables

Pivot tables are an intuitive way to analyze data, and an alternative to grouping columns.
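
To illustrate the relation with grouping, the sketch below builds the same table in both ways on hypothetical toy data: a pivot table, and a `groupby` followed by `unstack`.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['female', 'male', 'male', 'female', 'female', 'male'],
    'Age': [30.0, 40.0, 50.0, 20.0, 22.0, 24.0],
})

# Pivot table: one row per class, one column per sex, mean age in each cell
pivot = pd.pivot_table(toy, index='Pclass', columns='Sex', values='Age', aggfunc='mean')

# The same numbers with groupby: group by both keys, then move Sex to the columns
grouped = toy.groupby(['Pclass', 'Sex'])['Age'].mean().unstack('Sex')

print(pivot)
```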

In [None]:
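# Mean of the numeric columns per sex. We list them explicitly,
# since recent pandas versions raise an error when averaging text columns
pd.pivot_table(df, index='Sex', values=['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'])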

In [None]:
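# Same idea with two index levels (Pclass is now part of the index)
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Survived', 'Age', 'SibSp', 'Parch', 'Fare'])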

In [None]:
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])

In [None]:
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)

In [None]:
# Try np.sum, np.size, len
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])

In [None]:
# Try np.sum, np.size, len
table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],
                       columns=['Embarked'])
table

In [None]:
table.query('Survived == 1')

## Duplicates

In [None]:
df.duplicated().any()

In this case there are no duplicates. If there had been any, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns).
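
A minimal sketch of duplicate removal restricted to some columns, on hypothetical toy data. The choice of `Name` and `Ticket` as identifying columns is only an assumption for this example.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 share Name and Ticket but differ in Fare
toy = pd.DataFrame({
    'Name': ['Ann', 'Ann', 'Bob'],
    'Ticket': ['A1', 'A1', 'B7'],
    'Fare': [7.25, 8.05, 7.25],
})

# Considering all the columns, no row is a duplicate
has_full_duplicates = bool(toy.duplicated().any())

# Considering only Name and Ticket, the second row repeats the first one
partial_mask = toy.duplicated(subset=['Name', 'Ticket'])

# Keep the first occurrence of each (Name, Ticket) pair
deduped = toy.drop_duplicates(subset=['Name', 'Ticket'])

print(deduped)
```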

## Null and missing values

Here we check how many null values there are.

We use sum() instead of count(), since count() would give us the total number of records. Notice that we do not use size() now, either. You can print 'df.isnull()' and you will see a DataFrame with boolean values.
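
The difference between sum() and count() on the boolean DataFrame can be seen in a small sketch with hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, np.nan, 'C85'],
})

# True counts as 1, so sum() gives the number of missing values per column
missing = toy.isnull().sum()

# count() counts the non-null cells of the boolean frame, which is every record
cells = toy.isnull().count()

print(missing)
print(cells)
```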

In [None]:
df.isnull().sum()

In [None]:
# Drop records with missing values
df_original = df.copy()
df_clean = df.dropna()
print("Original", df.shape)
print("Cleaned", df_clean.shape)

Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how='all'*, which deletes a sample only if all its values are missing, instead of the default *how='any'*.
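
A short sketch of the two `how` options on hypothetical toy data. The `subset` argument shown at the end is an additional option of `dropna` that restricts the check to some columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is completely empty, row 2 only lacks Age
toy = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan],
    'Cabin': ['C85', np.nan, 'E46'],
})

any_dropped = toy.dropna()                   # default how='any': drops rows 1 and 2
all_dropped = toy.dropna(how='all')          # drops only row 1
subset_dropped = toy.dropna(subset=['Age'])  # only looks at the Age column

print(len(any_dropped), len(all_dropped), len(subset_dropped))
```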

In [None]:
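# Fill missing values with the median of each numeric column
# (numeric_only=True skips text columns such as Name or Cabin)
df_filled = df.fillna(df.median(numeric_only=True))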
df_filled[-5:]

In [None]:
#The original df has not been modified
df[-5:]

Observe that the passenger with PassengerId 889 now has an Age of 28 (the median) instead of NaN.

Regarding the column *Cabin*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.

In addition, we could drop rows with any or all null values (method *dropna()*).

If we want to modify the *df* object directly, we should add the parameter *inplace* with value *True*.

In [None]:
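# We call fillna on df itself with a column-to-value mapping; calling it on df['Age']
# with inplace=True may act on a temporary object in recent pandas versions
df.fillna({'Age': df['Age'].mean()}, inplace=True)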
df[-5:]

In [None]:
#Another possibility is to assign the modified dataframe
# First we get the df with NaN values
df = df_original.copy()
#Fill NaN and assign to the column
df['Age'] = df['Age'].fillna(df['Age'].median())
df[-5:]

Now we are going to see how to change the Sex value of the row with index 889, and then replace the missing values of Sex. It is just an example for practicing.

In [None]:
# There are no labels for rows, so we use the numeric index
df.iloc[889]

In [None]:
#We access row and column
df.iloc[889]['Sex']

In [None]:
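# But chained indexing works on a temporary copy, so this assignment
# does not reliably change df (recent pandas versions warn about it)
df.iloc[889]['Sex'] = np.nan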

In [None]:
# If we want to change a value, we should not chain selections
# The selection can be done with the column name
df.loc[889, 'Sex']

In [None]:
# Or with the index of the column
df.iloc[889, 4]

In [None]:
# This indexing works for changing values
df.loc[889, 'Sex'] = np.nan
df[-5:]

In [None]:
# Fill the missing value and assign the result back to the column
df['Sex'] = df['Sex'].fillna('male')
df[-5:]

There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=ffill**) or the next valid value (**method=bfill**). For example, with time series, it is frequent to use the last valid value (ffill). Recent pandas versions deprecate the *method* argument in favour of the equivalent methods **ffill()** and **bfill()**. Another alternative is to use the method **interpolate()**.

Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.
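
The three filling strategies can be compared on a small hypothetical Series (the values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of two missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()       # propagate the previous valid value forward
backward = s.bfill()      # use the next valid value
linear = s.interpolate()  # linear interpolation between valid neighbours

print(forward.tolist())
print(backward.tolist())
print(linear.tolist())
```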



**Scikit-learn** also provides a preprocessing facility for managing null values: the [**SimpleImputer**](http://scikit-learn.org/stable/modules/impute.html) class (called *Imputer* in older versions). We can include *SimpleImputer* as a step in a *Pipeline*.
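
A minimal sketch of imputation as a pipeline step, assuming a recent scikit-learn in which the imputer class is `SimpleImputer` in `sklearn.impute`. The data, the median strategy, the step names and the decision tree classifier are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: two numeric features with missing values, and a label
X = np.array([[22.0, 7.25],
              [np.nan, 71.28],
              [26.0, np.nan],
              [35.0, 53.10]])
y = np.array([0, 1, 1, 1])

# Stand-alone use: replace each NaN with the median of its column
imputer = SimpleImputer(strategy='median')
X_filled = imputer.fit_transform(X)

# As a pipeline step: imputation runs before the classifier sees the data
model = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
model.fit(X, y)
predictions = model.predict(X)

print(X_filled)
```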

# Analysing non numerical columns

As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.

**Name** and **Ticket** do not seem informative.

Regarding **Cabin**, most values were missing, so we can ignore it. 

**Sex** and **Embarked** are categorical features, so we will encode them as integers.

In [None]:
# We remove Cabin and Ticket. We should specify the axis
# Use axis 0 for dropping rows and axis 1 for dropping columns
df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
df[-5:]

# Encoding categorical values

*Sex* is a categorical feature stored as strings. Scikit-learn estimators expect numerical input, so we have to encode it as numbers. Keep in mind that estimators interpret numerical values as continuous and ordered, which is not necessarily the case for categories.

In [None]:
#First we check if there are any null values. Observe the use of any()
df['Sex'].isnull().any()

In [None]:
#Now we check the values of Sex
df['Sex'].unique()

Now we are going to encode the values with our pandas knowledge.

In [None]:
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1
df[-5:]

In [None]:
#An alternative is to create a new column with the encoded values and define a mapping
df = df_original.copy()
df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)
df.head()

In [None]:
#Check nulls
df['Embarked'].isnull().any()

In [None]:
#Check how many nulls

df['Embarked'].isnull().sum()

In [None]:
#Check values
df['Embarked'].unique()

In [None]:
#Check distribution of Embarked
df.groupby('Embarked').size()

In [None]:
#Replace nulls with the most common value
df['Embarked'] = df['Embarked'].fillna('S')
df['Embarked'].isnull().any()

In [None]:
# Now we replace the categories with integers, as we did previously
df.loc[df["Embarked"] == "S", "Embarked"] = 0
df.loc[df["Embarked"] == "C", "Embarked"] = 1
df.loc[df["Embarked"] == "Q", "Embarked"] = 2
df[-5:]

Although this transformation can be ok, we are introducing *an error*. Some classifiers could think that there is an order in S, C, Q, and that Q is higher than S.

To avoid this error, Scikit-learn provides a facility for transforming each categorical feature into a set of binary ones: it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.

We will learn how to do this in the next notebook.  More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html).
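
As a preview, a minimal sketch of the dummy-feature idea with scikit-learn's `OneHotEncoder`, on hypothetical toy data. The next notebook may use a different approach; this is only one possible illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with the three ports of embarkation
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# The encoder expects a 2D input, hence the double brackets
encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[['Embarked']]).toarray()

# One binary column per category, in sorted order of the categories
categories = encoder.categories_[0].tolist()

print(categories)
print(encoded)
```

Each row contains a single 1 in the column of its category, so no artificial order between S, C and Q is introduced.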

# References

* [Pandas](http://pandas.pydata.org/)
* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)
* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)
* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)
* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)
* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

©  Carlos A. Iglesias, Universidad Politécnica de Madrid.