![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias

## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)

# Table of Contents
* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)
* [Examining a DataFrame](#Examining-a-DataFrame)
* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)
* [Grouping](#Grouping)
* [Pivot tables](#Pivot-tables)
* [Duplicates](#Duplicates)
* [Null and missing values](#Null-and-missing-values)
* [Analysing non numerical columns](#Analysing-non-numerical-columns)
* [Encoding categorical values](#Encoding-categorical-values)

# Data munging with Pandas and Scikit-learn

This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset.

[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling), or data wrangling, is loosely the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.

*Scikit-learn* estimators assume that all values are numerical. This is common in many machine learning libraries, so we need to preprocess our raw dataset.
Some of the most common tasks are:
* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)
* Encode categorical variables as integers
* Combine datasets
* Rename variables and convert types
* Transform / scale variables
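
As a rough illustration of the tasks in this list that are not revisited later (combining, renaming, converting types and scaling), here is a minimal sketch on a made-up dataset. The column names and values are invented for the example, and min-max scaling is only one of several possible transformations.

```python
import pandas as pd

# Two hypothetical partial datasets with the same columns (invented values)
part1 = pd.DataFrame({'age': ['22', '38'], 'fare': [7.25, 71.28]})
part2 = pd.DataFrame({'age': ['26', '35'], 'fare': [7.93, 53.10]})

# Combine datasets: stack the rows and rebuild the index
toy = pd.concat([part1, part2], ignore_index=True)

# Rename variables and convert types (age was stored as text)
toy = toy.rename(columns={'age': 'Age', 'fare': 'Fare'})
toy['Age'] = toy['Age'].astype(int)

# Transform / scale a variable to the range [0, 1] (min-max scaling)
fare_range = toy['Fare'].max() - toy['Fare'].min()
toy['FareScaled'] = (toy['Fare'] - toy['Fare'].min()) / fare_range
print(toy)
```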

We are going to play again with the Titanic dataset to practice with Pandas DataFrames and introduce a number of preprocessing facilities of scikit-learn.

First we load the dataset into a DataFrame.

In [1]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = pd.read_csv('data-titanic/train.csv')

# Show the first 5 rows
df[:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Examining a DataFrame

We can examine properties of the dataset.

In [2]:
# Information about columns and their types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


We see that some features have a numerical type (int64 and float64), and others have the type *object*, which is how Pandas stores strings. We observe that most features are numerical, except for Name, Sex, Ticket, Cabin and Embarked.

In [3]:
# We can list non numerical properties, with a boolean indexing of the Series df.dtypes
df.dtypes[df.dtypes == object]

Name        object
Sex         object
Ticket      object
Cabin       object
Embarked    object
dtype: object
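
An alternative that may be handy here is `select_dtypes`, which filters columns by type. The sketch below uses a small invented frame rather than the Titanic data.

```python
import pandas as pd

# Toy frame with invented values, mixing text and numeric columns
toy = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Age': [22.0, 38.0], 'Pclass': [3, 1]})

# Split the column names into numeric and non-numeric ones
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()
print(numeric_cols)
print(other_cols)
```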

Let's explore the DataFrame.

In [4]:
# Number of samples and features
df.shape

(891, 12)

In [5]:
# Basic statistics of the dataset in all the numeric columns
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Observe that some of the statistics do not make sense for some columns (PassengerId or Pclass). We could have selected only the interesting columns.

In [6]:
# Describe statistics of relevant columns. We pass a list of columns
df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare
count,891.0,714.0,891.0,891.0,891.0
mean,0.383838,29.699118,0.523008,0.381594,32.204208
std,0.486592,14.526497,1.102743,0.806057,49.693429
min,0.0,0.42,0.0,0.0,0.0
25%,0.0,20.125,0.0,0.0,7.9104
50%,0.0,28.0,0.0,0.0,14.4542
75%,1.0,38.0,1.0,0.0,31.0
max,1.0,80.0,8.0,6.0,512.3292


## Selecting rows in a DataFrame

In [7]:
# Select the first 5 rows
df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [8]:
# Select the last 5 rows
df.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [9]:
# Select several rows
df[2:5]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
# Select the first 5 values of a column by name
df['Survived'][:5]

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

In [11]:
# Select several columns. Observe that the first parameter is a list
df[['Survived', 'Sex', 'Age']][:5]

Unnamed: 0,Survived,Sex,Age
0,0,male,22.0
1,1,female,38.0
2,1,female,26.0
3,1,female,35.0
4,0,male,35.0


In [12]:
# Passengers older than 30. Observe that DataFrame columns can be accessed like attributes.
df.Age > 30

0      False
1       True
2      False
3       True
4       True
5      False
6       True
7      False
8      False
9      False
10     False
11      True
12     False
13      True
14     False
15      True
16     False
17     False
18      True
19     False
20      True
21      True
22     False
23     False
24     False
25      True
26     False
27     False
28     False
29     False
       ...  
861    False
862     True
863    False
864    False
865     True
866    False
867     True
868    False
869    False
870    False
871     True
872     True
873     True
874    False
875    False
876    False
877    False
878    False
879     True
880    False
881     True
882    False
883    False
884    False
885     True
886    False
887    False
888    False
889    False
890     True
Name: Age, dtype: bool

In [13]:
# Select passengers older than 20 (only the last 5). We use boolean indexing
df[df.Age > 20][-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [14]:
# Select passengers older than 20 that survived (only the last 5)
df[(df.Age > 20) & (df.Survived == 1)][-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


In [15]:
# Alternative to the standard Python syntax, using query
# In large DataFrames, DataFrame.query() with numexpr can be considerably faster; see the references
df.query('Age > 20 and Survived == 1')[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
874,875,1,2,"Abelson, Mrs. Samuel (Hannah Wizosky)",female,28.0,1,0,P/PP 3381,24.0,,C
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
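
A query string can also refer to ordinary Python variables by prefixing them with `@`. The following sketch, on invented data, checks that the query and the boolean mask select the same rows; `min_age` is a name made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [18, 25, 40, 52], 'Survived': [1, 0, 1, 1]})

min_age = 20  # ordinary Python variable, referenced in the query as @min_age
by_query = toy.query('Age > @min_age and Survived == 1')
by_mask = toy[(toy.Age > min_age) & (toy.Survived == 1)]

print(by_query)
print(by_query.equals(by_mask))
```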


DataFrames provide a set of selection operations that we will need later:


|Operation | Syntax | Result |
|----------|--------|--------|
|Select column                  | df[col]       | Series |
|Select row by label            | df.loc[label] | Series |
|Select row by integer location | df.iloc[loc]  | Series |
|Slice rows                     | df[5:10]      | DataFrame |
|Select rows by boolean vector  | df[bool_vec]  | DataFrame |
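
The result types in the table can be checked with a short sketch. The frame and its labels (`'a'`, `'b'`, `'c'`) are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0], 'Sex': ['male', 'female', 'female']},
                   index=['a', 'b', 'c'])

col = toy['Age']              # select column                   -> Series
row_label = toy.loc['b']      # select row by label             -> Series
row_pos = toy.iloc[1]         # select row by integer location  -> Series
rows = toy[0:2]               # slice rows                      -> DataFrame
masked = toy[toy['Age'] > 25] # select rows by boolean vector   -> DataFrame

for name, obj in [('col', col), ('row_label', row_label), ('row_pos', row_pos),
                  ('rows', rows), ('masked', masked)]:
    print(name, type(obj).__name__)
```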

In [16]:
# Select column and show last 4
df['Age'][-4:]

887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

In [17]:
# Select by label with .loc[row-labels, column-labels]: all rows of column 'Age', showing the last 4
df.loc[:, 'Age'][-4:]

887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

In [18]:
# Select a column by integer position (Age is column 5), and show the last 4 values
df.iloc[:, 5][-4:]

887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, dtype: float64

In [19]:
# Slice rows - last 5 rows
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [20]:
# Select based on a boolean vector and show the last 5 rows
df[df.Age > 20][-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.05,,S
885,886,0,3,"Rice, Mrs. William (Margaret Norton)",female,39.0,0,5,382652,29.125,,Q
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


## Grouping

Rows can be grouped by one or more columns, and aggregation functions can then be applied to the resulting GroupBy object.

In [21]:
# Number of passengers per sex (SQL like)
df.groupby('Sex').size()

Sex
female    314
male      577
dtype: int64

In [22]:
#Mean age of passengers per passenger class

#First we calculate the mean of every numeric column (recent pandas versions need numeric_only=True here)
df.groupby('Pclass').mean(numeric_only=True)

Unnamed: 0_level_0,PassengerId,Survived,Age,SibSp,Parch,Fare
Pclass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555


In [23]:
#And now we answer the initial query (only mean age)
df.groupby('Pclass')['Age'].mean()

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [24]:
# Alternative syntax
df.groupby('Pclass').Age.mean()

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

In [25]:
#Mean Age and SibSp of passengers grouped by passenger class and sex
#Several columns are selected with a list (double brackets)
df.groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,SibSp
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
1,female,34.611765,0.553191
1,male,41.281386,0.311475
2,female,28.722973,0.486842
2,male,30.740707,0.342593
3,female,21.75,0.895833
3,male,26.507589,0.498559


In [26]:
#Show mean Age and SibSp for passengers older than 25, grouped by passenger class and sex
df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,SibSp
Pclass,Sex,Unnamed: 2_level_1,Unnamed: 3_level_1
1,female,42.052632,0.473684
1,male,45.017241,0.333333
2,female,36.566667,0.444444
2,male,38.809524,0.301587
3,female,34.959459,0.513514
3,male,35.778226,0.185484


In [27]:
# Mean Age, SibSp and Survived of passengers older than 25 who survived, grouped by passenger class and sex
# Each condition needs its own parentheses, because & binds more tightly than >
# The Survived column of the result is 1.0 in every group, since only survivors are kept
df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()
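
A note on combining conditions in a filter: in Python, `&` is evaluated before comparison operators such as `>`, so every comparison in a boolean mask should be wrapped in its own parentheses. The sketch below shows the rule with plain scalars and then applies it to an invented toy frame.

```python
import pandas as pd

# Plain Python scalars show the precedence rule: & is applied before >
age, survived = 10, True
print(age > 25 & survived)     # parsed as age > (25 & survived)
print((age > 25) & survived)   # the intended test

# The same rule applies to pandas masks, so each comparison gets its own parentheses
toy = pd.DataFrame({'Age': [10, 30, 40], 'Survived': [1, 0, 1]})
older_survivors = toy[(toy.Age > 25) & (toy.Survived == 1)]
print(older_survivors)
```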


In [28]:
# We can also decide which function to apply to each column

#Show mean Age, mean SibSp and number of passengers older than 25 that survived, grouped by passenger class and sex
#Each condition needs its own parentheses, because & binds more tightly than >
df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': 'mean',
                                                                         'SibSp': 'mean', 'Survived': 'size'})
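
Per-column aggregation can also be written with named aggregation, which lets us choose the names of the output columns. This is a minimal sketch on invented data; the output names (`mean_age`, `survivors`, `passengers`) are made up for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 2],
    'Age': [40.0, 30.0, 20.0, 30.0, 40.0],
    'Survived': [1, 0, 1, 1, 0],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy.groupby('Pclass').agg(
    mean_age=('Age', 'mean'),
    survivors=('Survived', 'sum'),
    passengers=('Survived', 'count'),
)
print(summary)
```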


## Pivot tables

Pivot tables are an intuitive way to analyze data, and an alternative to grouping by columns.

In [29]:
pd.pivot_table(df, index='Sex')

Unnamed: 0_level_0,Age,Fare,Parch,PassengerId,Pclass,SibSp,Survived
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,27.915709,44.479818,0.649682,431.028662,2.159236,0.694268,0.742038
male,30.726645,25.523893,0.235702,454.147314,2.389948,0.429809,0.188908


In [30]:
pd.pivot_table(df, index=['Sex', 'Pclass'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,Fare,Parch,PassengerId,SibSp,Survived
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
female,1,34.611765,106.125798,0.457447,469.212766,0.553191,0.968085
female,2,28.722973,21.970121,0.605263,443.105263,0.486842,0.921053
female,3,21.75,16.11881,0.798611,399.729167,0.895833,0.5
male,1,41.281386,67.226127,0.278689,455.729508,0.311475,0.368852
male,2,30.740707,19.741782,0.222222,447.962963,0.342593,0.157407
male,3,26.507589,12.661633,0.224784,455.51585,0.498559,0.135447


In [31]:
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,SibSp
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1,34.611765,0.553191
female,2,28.722973,0.486842
female,3,21.75,0.895833
male,1,41.281386,0.311475
male,2,30.740707,0.342593
male,3,26.507589,0.498559


In [32]:
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)

Unnamed: 0_level_0,Unnamed: 1_level_0,Age,SibSp
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1,34.611765,0.553191
female,2,28.722973,0.486842
female,3,21.75,0.895833
male,1,41.281386,0.311475
male,2,30.740707,0.342593
male,3,26.507589,0.498559
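
To make the relation with grouping concrete, the following sketch compares a pivot table with the equivalent `groupby` on a small invented frame. With the mean as aggregation function, both should produce the same numbers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'Pclass': [1, 1, 2, 1],
    'Age': [30.0, 40.0, 20.0, 50.0],
    'SibSp': [1, 0, 0, 1],
})

pivot = pd.pivot_table(toy, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc='mean')
grouped = toy.groupby(['Sex', 'Pclass'])[['Age', 'SibSp']].mean()

print(pivot)
print(np.allclose(pivot.values, grouped.values))
```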


In [33]:
# Try np.sum, np.size, len
pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,mean,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Age,SibSp,Age,SibSp
Sex,Pclass,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
female,1,34.611765,0.553191,2942.0,52
female,2,28.722973,0.486842,2125.5,37
female,3,21.75,0.895833,2218.5,129
male,1,41.281386,0.311475,4169.42,38
male,2,30.740707,0.342593,3043.33,37
male,3,26.507589,0.498559,6706.42,173


In [34]:
# Try np.sum, np.size, len
table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],
                       columns=['Embarked'])
table

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mean,mean,mean,mean,mean,mean,sum,sum,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Age,Age,Age,SibSp,SibSp,SibSp,Age,Age,Age,SibSp,SibSp,SibSp
Unnamed: 0_level_2,Unnamed: 1_level_2,Embarked,C,Q,S,C,Q,S,C,Q,S,C,Q,S
Sex,Pclass,Survived,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3
female,1,0,50.0,,13.5,0.0,,1.0,50.0,,27.0,0.0,,2.0
female,1,1,35.675676,33.0,33.619048,0.52381,1.0,0.586957,1320.0,33.0,1412.0,22.0,1.0,27.0
female,2,0,,,36.0,,,0.5,,,216.0,,,3.0
female,2,1,19.142857,30.0,29.091667,0.714286,0.0,0.47541,134.0,30.0,1745.5,5.0,0.0,29.0
female,3,0,20.7,28.1,23.688889,0.5,0.111111,1.6,103.5,140.5,1066.0,4.0,1.0,88.0
female,3,1,11.045455,17.6,22.548387,0.6,0.25,0.636364,121.5,88.0,699.0,9.0,6.0,21.0
male,1,0,43.05,44.0,45.3625,0.16,2.0,0.294118,861.0,44.0,1814.5,4.0,2.0,15.0
male,1,1,36.4375,,36.121667,0.352941,,0.392857,583.0,,866.92,6.0,,11.0
male,2,0,29.5,57.0,33.414474,0.625,0.0,0.280488,206.5,57.0,2539.5,5.0,0.0,23.0
male,2,1,1.0,,17.095,0.0,,0.6,1.0,,239.33,0.0,,9.0


In [35]:
table.query('Survived == 1')

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mean,mean,mean,mean,mean,mean,sum,sum,sum,sum,sum,sum
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Age,Age,Age,SibSp,SibSp,SibSp,Age,Age,Age,SibSp,SibSp,SibSp
Unnamed: 0_level_2,Unnamed: 1_level_2,Embarked,C,Q,S,C,Q,S,C,Q,S,C,Q,S
Sex,Pclass,Survived,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3
female,1,1,35.675676,33.0,33.619048,0.52381,1.0,0.586957,1320.0,33.0,1412.0,22.0,1.0,27.0
female,2,1,19.142857,30.0,29.091667,0.714286,0.0,0.47541,134.0,30.0,1745.5,5.0,0.0,29.0
female,3,1,11.045455,17.6,22.548387,0.6,0.25,0.636364,121.5,88.0,699.0,9.0,6.0,21.0
male,1,1,36.4375,,36.121667,0.352941,,0.392857,583.0,,866.92,6.0,,11.0
male,2,1,1.0,,17.095,0.0,,0.6,1.0,,239.33,0.0,,9.0
male,3,1,18.488571,29.0,22.933333,0.4,0.666667,0.294118,129.42,29.0,688.0,4.0,2.0,10.0


## Duplicates

In [36]:
df.duplicated().any()

False

In this case there are no duplicates. If there were any, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns).
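
Since the Titanic data has no duplicates, here is a small sketch on invented rows showing how the list of columns (the `subset` argument) changes what counts as a duplicate.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob'], 'Ticket': ['A1', 'A2', 'B1']})

# No two rows are identical in all columns...
print(toy.duplicated().any())
# ...but the Name column alone contains a repeated value
print(toy.duplicated(subset=['Name']).any())

# Keep the first row of each Name (the default is keep='first')
deduped = toy.drop_duplicates(subset=['Name'])
print(deduped)
```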

## Null and missing values

Here we check how many null values there are.

We use sum() instead of count(); otherwise we would get the total number of records. Notice that we do not use size() either. You can print `df.isnull()` to see a DataFrame of boolean values.

In [37]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [38]:
# Drop records with missing values
df_original = df.copy()
df_clean = df.dropna()
print("Original", df.shape)
print("Cleaned", df_clean.shape)

Original (891, 12)
Cleaned (183, 12)


Most of the samples have been deleted, because by default (*how='any'*) a sample is dropped if any of its values is missing. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how='all'*, which deletes a sample only if all its values are missing.
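
The difference between the two settings can be seen in a small sketch with invented values. It also uses `subset`, which restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Age': [22.0, np.nan, np.nan],
    'Cabin': ['C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan],
})

n_any = len(toy.dropna())                         # drop rows with at least one NaN
n_all = len(toy.dropna(how='all'))                # drop rows where every value is NaN
n_subset = len(toy.dropna(subset=['Embarked']))   # only look at the Embarked column
print(n_any, n_all, n_subset)
```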

In [39]:
# Fill missing values with the median of each numeric column
df_filled = df.fillna(df.median(numeric_only=True))
df_filled[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,28.0,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [40]:
#The original df has not been modified
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Observe that in *df_filled* the passenger at index 888 (PassengerId 889) now has an Age of 28 (the median) instead of NaN.

Regarding the *Cabin* column, there are still NaN values, since it is not numeric and therefore has no median. We will see later how to handle it.

In addition, we could drop rows with any or all null values (method *dropna()*).

If we want to modify the *df* object directly, we should add the parameter *inplace* with value *True*.

In [41]:
# Calling fillna on the DataFrame itself (with a column -> value mapping) modifies df in place
df.fillna({'Age': df['Age'].mean()}, inplace=True)
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [42]:
#Another possibility is to assign the result back to the column
# First we restore the df with NaN values
df = df_original.copy()
#Fill NaN with the mean and assign to the column
df['Age'] = df['Age'].fillna(df['Age'].mean())
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


Now we are going to see how to change the Sex value of the passenger at index 889 (PassengerId 890), and then replace the missing values of Sex. It is just an example for practicing.

In [43]:
# Rows have no custom labels, so we use the numeric position
df.iloc[889]

PassengerId                      890
Survived                           1
Pclass                             1
Name           Behr, Mr. Karl Howell
Sex                             male
Age                               26
SibSp                              0
Parch                              0
Ticket                        111369
Fare                              30
Cabin                           C148
Embarked                           C
Name: 889, dtype: object

In [44]:
#We access row and column
df.iloc[889]['Sex']

'male'

In [45]:
# But with chained indexing we assign to a temporary copy, so df is not modified and pandas warns us
df.iloc[889]['Sex'] = np.nan

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app


In [46]:
# If we want to change, we should not chain selections
# The selection can be done with the column name
df.loc[889, 'Sex']

'male'

In [47]:
# Or with the index of the column
df.iloc[889, 4]

'male'

In [48]:
# This indexing works for changing values
df.loc[889, 'Sex'] = np.nan
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [49]:
# Assign the filled column back, which avoids calling inplace on a selection
df['Sex'] = df['Sex'].fillna('male')
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


There are other interesting possibilities for **fillna**. We can fill with the previous valid value (**method='ffill'**, also available as the method **ffill()**) or with the next valid value (**method='bfill'**, also available as **bfill()**). For example, with time series, it is common to carry forward the last valid value (ffill). Another alternative is to use the method **interpolate()**.

Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.
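
The three strategies can be compared on a short series with a gap. This is a sketch with invented values; it uses the `ffill()` and `bfill()` methods, and `interpolate()` with its default linear behaviour.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s.ffill()        # propagate the previous valid value forward
backward = s.bfill()       # use the next valid value
linear = s.interpolate()   # linear interpolation between valid neighbours

print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())   # [1.0, 4.0, 4.0, 4.0]
print(linear.tolist())     # [1.0, 2.0, 3.0, 4.0]
```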



**Scikit-learn** also provides a preprocessing facility for managing null values: the [**SimpleImputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class (named *Imputer* in old releases). We can include the imputer as a step in a *Pipeline*.
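
A minimal sketch of an imputation step inside a pipeline follows. It assumes a recent scikit-learn, where the class is `SimpleImputer` in `sklearn.impute` (older releases called it `Imputer`). The data and the choice of `LogisticRegression` as final step are invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data: one feature (an age) with a missing value, and a binary target
X = np.array([[22.0], [38.0], [np.nan], [35.0], [54.0], [2.0]])
y = np.array([0, 1, 1, 1, 0, 1])

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # NaN -> median of the column
    ('clf', LogisticRegression()),
])
pipe.fit(X, y)

# The fitted imputer stores the value used to fill each column
fill_values = pipe.named_steps['impute'].statistics_
print(fill_values)
```

Because the imputer is part of the pipeline, the fill value is learned during `fit` and reused whenever the pipeline predicts on new data.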

## Analysing non numerical columns

As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.

**Name** and **Ticket** do not seem informative.

Regarding **Cabin**, most values were missing, so we can ignore it. 

**Sex** and **Embarked** are categorical features, so we will encode them as integers.

In [50]:
# We remove Cabin and Ticket. We should specify the axis
# Use axis 0 for dropping rows and axis 1 for dropping columns
df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,23.45,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,7.75,Q


## Encoding categorical values

*Sex* is a categorical feature stored as strings. We need to encode it numerically, since scikit-learn estimators expect numerical input. Keep in mind that estimators treat integer codes as ordered values, which is not the case for categories.

In [51]:
#First we check if there are any null values. Observe the use of any()
df['Sex'].isnull().any()

False

In [52]:
#Now we check the values of Sex
df['Sex'].unique()

array(['male', 'female'], dtype=object)

Now we are going to encode the values with our pandas knowledge.

In [53]:
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
886,887,0,2,"Montvila, Rev. Juozas",0,27.0,0,0,13.0,S
887,888,1,1,"Graham, Miss. Margaret Edith",1,19.0,0,0,30.0,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,29.699118,1,2,23.45,S
889,890,1,1,"Behr, Mr. Karl Howell",0,26.0,0,0,30.0,C
890,891,0,3,"Dooley, Mr. Patrick",0,32.0,0,0,7.75,Q


In [8]:
#An alternative is to define a mapping and create a new column with the encoded values
df = df_original.copy()
df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Gender
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


In [51]:
#Check nulls
df['Embarked'].isnull().any()

True

In [110]:
#Check how many nulls

df['Embarked'].isnull().sum()

2

In [111]:
#Check values
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [112]:
#Check distribution of Embarked
df.groupby('Embarked').size()

Embarked
C    168
Q     77
S    644
dtype: int64

In [113]:
#Replace nulls with the most common value
df['Embarked'].fillna('S', inplace=True)
df['Embarked'].isnull().any()

False

In [114]:
# Now we replace the categories with integers, as we did previously
df.loc[df["Embarked"] == "S", "Embarked"] = 0
df.loc[df["Embarked"] == "C", "Embarked"] = 1
df.loc[df["Embarked"] == "Q", "Embarked"] = 2
df[-5:]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,29.699118,1,2,23.45,0
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,1
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,7.75,2


Although this transformation can be acceptable, we are introducing *an error*: some classifiers could assume that there is an order among S, C and Q, and that Q is higher than S.

To avoid this error, scikit-learn provides a facility for transforming categorical features into binary ones: it creates a new dummy binary feature per category. In this case, Embarked=S would be represented as S=1, C=0 and Q=0.

We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html).
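
As a preview of the idea, the dummy encoding can be sketched with pandas alone. Note that this uses `pd.get_dummies`, not the scikit-learn facility mentioned above, and that the data is invented for the example.

```python
import pandas as pd

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One binary column per category; exactly one of them is 1 in each row
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)
print(dummies)
```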

# References

* [Pandas](http://pandas.pydata.org/)
* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)
* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)
* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)
* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)
* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid.