{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)\n",
"* [Examining a DataFrame](#Examining-a-DataFrame)\n",
"* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)\n",
"* [Grouping](#Grouping)\n",
"* [Pivot tables](#Pivot-tables)\n",
"* [Null and missing values](#Null-and-missing-values)\n",
"* [Analysing non numerical columns](#Analysing-non-numerical-columns)\n",
"* [Encoding categorical values](#Encoding-categorical-values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data munging with Pandas and Scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling), or data wrangling, is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
"\n",
"*Scikit-learn* estimators assume that all values are numerical. This is common in many machine learning libraries, so we need to preprocess our raw dataset.\n",
"Some of the most common tasks are:\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
"* Encode categorical variables as integers\n",
"* Combine datasets\n",
"* Rename variables and convert types\n",
"* Transform / scale variables\n",
"\n",
"We are going to play again with the Titanic dataset to practice with Pandas DataFrames and introduce a number of preprocessing facilities of scikit-learn.\n",
"\n",
"First we load the dataset and we get a dataframe."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame\n",
"\n",
"df = pd.read_csv('data-titanic/train.csv')\n",
"\n",
"# Show the first 5 rows\n",
"df[:5]"
]
},
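{
"cell_type": "markdown",
"metadata": {},
"source": [
"The preprocessing tasks listed at the start of this notebook can be sketched on a tiny, made-up DataFrame. This is only an illustration under stated assumptions: the `toy` frame, its column names and its values are hypothetical and are not taken from the Titanic file.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data (not the Titanic file) with one missing age\n",
"toy = pd.DataFrame({'age': [22.0, None, 36.0],\n",
"                    'sex': ['male', 'female', 'female']})\n",
"\n",
"# Replace the missing value with the median of the known ages\n",
"toy['age'] = toy['age'].fillna(toy['age'].median())\n",
"\n",
"# Encode a categorical variable as integers\n",
"toy['sex_code'] = toy['sex'].map({'male': 0, 'female': 1})\n",
"\n",
"# Rename a variable and convert its type\n",
"toy = toy.rename(columns={'age': 'age_years'}).astype({'age_years': 'int64'})\n",
"```\n",
"\n",
"Filling with the median is one choice among several; the mean or an interpolated value may suit other datasets better."
]
},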
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examining a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine properties of the dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 891 entries, 0 to 890\n",
"Data columns (total 12 columns):\n",
"PassengerId 891 non-null int64\n",
"Survived 891 non-null int64\n",
"Pclass 891 non-null int64\n",
"Name 891 non-null object\n",
"Sex 891 non-null object\n",
"Age 714 non-null float64\n",
"SibSp 891 non-null int64\n",
"Parch 891 non-null int64\n",
"Ticket 891 non-null object\n",
"Fare 891 non-null float64\n",
"Cabin 204 non-null object\n",
"Embarked 889 non-null object\n",
"dtypes: float64(2), int64(5), object(5)\n",
"memory usage: 83.6+ KB\n"
]
}
],
"source": [
"# Information about columns and their types\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that some features have a numerical type (int64 or float64), while others have the type *object*. In Pandas, the object type is typically used for columns that hold strings. We observe that most features are numerical, except for Name, Sex, Ticket, Cabin and Embarked."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Name object\n",
"Sex object\n",
"Ticket object\n",
"Cabin object\n",
"Embarked object\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# List the non-numerical columns using boolean indexing on the Series df.dtypes\n",
"df.dtypes[df.dtypes == object]"
]
},
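{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a possible alternative to boolean indexing on `df.dtypes`, `DataFrame.select_dtypes` can split the columns by type. The sketch below uses a made-up frame; `toy` and its columns are hypothetical names chosen for illustration.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data mixing text and numeric columns\n",
"toy = pd.DataFrame({'name': ['Ann', 'Bob'],\n",
"                    'age': [30.0, 41.0],\n",
"                    'pclass': [1, 3]})\n",
"\n",
"# Columns with a numeric dtype (integers and floats)\n",
"numeric_cols = toy.select_dtypes(include='number').columns.tolist()\n",
"\n",
"# Every other column, here the text column\n",
"other_cols = toy.select_dtypes(exclude='number').columns.tolist()\n",
"```\n",
"\n",
"Excluding `'number'` avoids depending on how a given Pandas version labels string columns."
]
},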
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"(891, 12)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of samples and features\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>714.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>446.000000</td>\n",
" <td>0.383838</td>\n",
" <td>2.308642</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>257.353842</td>\n",
" <td>0.486592</td>\n",
" <td>0.836071</td>\n",
" <td>14.526497</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>223.500000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" <td>20.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>446.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>28.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>668.500000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>38.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>891.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Age SibSp \\\n",
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
"\n",
" Parch Fare \n",
"count 891.000000 891.000000 \n",
"mean 0.381594 32.204208 \n",
"std 0.806057 49.693429 \n",
"min 0.000000 0.000000 \n",
"25% 0.000000 7.910400 \n",
"50% 0.000000 14.454200 \n",
"75% 0.000000 31.000000 \n",
"max 6.000000 512.329200 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Basic statistics of the dataset in all the numeric columns\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that some of the statistics do not make sense for some columns (PassengerId or Pclass). We could have selected only the interesting columns."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>891.000000</td>\n",
" <td>714.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" <td>891.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>0.383838</td>\n",
" <td>29.699118</td>\n",
" <td>0.523008</td>\n",
" <td>0.381594</td>\n",
" <td>32.204208</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.486592</td>\n",
" <td>14.526497</td>\n",
" <td>1.102743</td>\n",
" <td>0.806057</td>\n",
" <td>49.693429</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>0.000000</td>\n",
" <td>0.420000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>0.000000</td>\n",
" <td>20.125000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>7.910400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>0.000000</td>\n",
" <td>28.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>14.454200</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>1.000000</td>\n",
" <td>38.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>31.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>1.000000</td>\n",
" <td>80.000000</td>\n",
" <td>8.000000</td>\n",
" <td>6.000000</td>\n",
" <td>512.329200</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Age SibSp Parch Fare\n",
"count 891.000000 714.000000 891.000000 891.000000 891.000000\n",
"mean 0.383838 29.699118 0.523008 0.381594 32.204208\n",
"std 0.486592 14.526497 1.102743 0.806057 49.693429\n",
"min 0.000000 0.420000 0.000000 0.000000 0.000000\n",
"25% 0.000000 20.125000 0.000000 0.000000 7.910400\n",
"50% 0.000000 28.000000 0.000000 0.000000 14.454200\n",
"75% 1.000000 38.000000 1.000000 0.000000 31.000000\n",
"max 1.000000 80.000000 8.000000 6.000000 512.329200"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Describe statistics of relevant columns. We pass a list of columns\n",
"df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting rows in a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"0 0 A/5 21171 7.2500 NaN S \n",
"1 0 PC 17599 71.2833 C85 C \n",
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
"3 0 113803 53.1000 C123 S \n",
"4 0 373450 8.0500 NaN S "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the first 5 rows\n",
"df.head(5)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.0 0 0 211536 13.00 NaN S \n",
"887 female 19.0 0 0 112053 30.00 B42 S \n",
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.0 0 0 111369 30.00 C148 C \n",
"890 male 32.0 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the last 5 rows\n",
"df.tail(5)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.925</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.100</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.050</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp Parch \\\n",
"2 Heikkinen, Miss. Laina female 26.0 0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 \n",
"4 Allen, Mr. William Henry male 35.0 0 0 \n",
"\n",
" Ticket Fare Cabin Embarked \n",
"2 STON/O2. 3101282 7.925 NaN S \n",
"3 113803 53.100 C123 S \n",
"4 373450 8.050 NaN S "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select several rows\n",
"df[2:5]"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 1\n",
"3 1\n",
"4 0\n",
"Name: Survived, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the first 5 values of a column by name\n",
"df['Survived'][:5]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Survived</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>1</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Sex Age\n",
"0 0 male 22.0\n",
"1 1 female 38.0\n",
"2 1 female 26.0\n",
"3 1 female 35.0\n",
"4 0 male 35.0"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select several columns. Observe that the first parameter is a list\n",
"df[['Survived', 'Sex', 'Age']][:5]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 True\n",
"2 False\n",
"3 True\n",
"4 True\n",
"5 False\n",
"6 True\n",
"7 False\n",
"8 False\n",
"9 False\n",
"10 False\n",
"11 True\n",
"12 False\n",
"13 True\n",
"14 False\n",
"15 True\n",
"16 False\n",
"17 False\n",
"18 True\n",
"19 False\n",
"20 True\n",
"21 True\n",
"22 False\n",
"23 False\n",
"24 False\n",
"25 True\n",
"26 False\n",
"27 False\n",
"28 False\n",
"29 False\n",
" ... \n",
"861 False\n",
"862 True\n",
"863 False\n",
"864 False\n",
"865 True\n",
"866 False\n",
"867 True\n",
"868 False\n",
"869 False\n",
"870 False\n",
"871 True\n",
"872 True\n",
"873 True\n",
"874 False\n",
"875 False\n",
"876 False\n",
"877 False\n",
"878 False\n",
"879 True\n",
"880 False\n",
"881 True\n",
"882 False\n",
"883 False\n",
"884 False\n",
"885 True\n",
"886 False\n",
"887 False\n",
"888 False\n",
"889 False\n",
"890 True\n",
"Name: Age, dtype: bool"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Passengers older than 30. Observe that DataFrame columns can be accessed like attributes.\n",
"df.Age > 30"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>884</th>\n",
" <td>885</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Sutehall, Mr. Henry Jr</td>\n",
" <td>male</td>\n",
" <td>25.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SOTON/OQ 392076</td>\n",
" <td>7.050</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>885</th>\n",
" <td>886</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Rice, Mrs. William (Margaret Norton)</td>\n",
" <td>female</td>\n",
" <td>39.0</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>382652</td>\n",
" <td>29.125</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.750</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"884 885 0 3 Sutehall, Mr. Henry Jr \n",
"885 886 0 3 Rice, Mrs. William (Margaret Norton) \n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n",
"885 female 39.0 0 5 382652 29.125 NaN Q \n",
"886 male 27.0 0 0 211536 13.000 NaN S \n",
"889 male 26.0 0 0 111369 30.000 C148 C \n",
"890 male 32.0 0 0 370376 7.750 NaN Q "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select passengers older than 20 (only the last 5). We use boolean indexing\n",
"df[df.Age > 20][-5:]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>871</th>\n",
" <td>872</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Beckwith, Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
" <td>female</td>\n",
" <td>47.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11751</td>\n",
" <td>52.5542</td>\n",
" <td>D35</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>874</th>\n",
" <td>875</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Abelson, Mrs. Samuel (Hannah Wizosky)</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>P/PP 3381</td>\n",
" <td>24.0000</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>879</th>\n",
" <td>880</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
" <td>female</td>\n",
" <td>56.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11767</td>\n",
" <td>83.1583</td>\n",
" <td>C50</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>881</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Shelley, Mrs. William (Imanita Parrish Hall)</td>\n",
" <td>female</td>\n",
" <td>25.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>230433</td>\n",
" <td>26.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"871 872 1 1 \n",
"874 875 1 2 \n",
"879 880 1 1 \n",
"880 881 1 2 \n",
"889 890 1 1 \n",
"\n",
" Name Sex Age SibSp \\\n",
"871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n",
"874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n",
"879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n",
"880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"871 1 11751 52.5542 D35 S \n",
"874 0 P/PP 3381 24.0000 NaN C \n",
"879 1 11767 83.1583 C50 C \n",
"880 1 230433 26.0000 NaN S \n",
"889 0 111369 30.0000 C148 C "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select passengers older than 20 that survived (only the last 5)\n",
"df[(df.Age > 20) & (df.Survived == 1)][-5:]"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>871</th>\n",
" <td>872</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Beckwith, Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
" <td>female</td>\n",
" <td>47.0</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>11751</td>\n",
" <td>52.5542</td>\n",
" <td>D35</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>874</th>\n",
" <td>875</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Abelson, Mrs. Samuel (Hannah Wizosky)</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>P/PP 3381</td>\n",
" <td>24.0000</td>\n",
" <td>NaN</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>879</th>\n",
" <td>880</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
" <td>female</td>\n",
" <td>56.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>11767</td>\n",
" <td>83.1583</td>\n",
" <td>C50</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>880</th>\n",
" <td>881</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>Shelley, Mrs. William (Imanita Parrish Hall)</td>\n",
" <td>female</td>\n",
" <td>25.0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>230433</td>\n",
" <td>26.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"871 872 1 1 \n",
"874 875 1 2 \n",
"879 880 1 1 \n",
"880 881 1 2 \n",
"889 890 1 1 \n",
"\n",
" Name Sex Age SibSp \\\n",
"871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n",
"874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n",
"879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n",
"880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked \n",
"871 1 11751 52.5542 D35 S \n",
"874 0 P/PP 3381 24.0000 NaN C \n",
"879 1 11767 83.1583 C50 C \n",
"880 1 230433 26.0000 NaN S \n",
"889 0 111369 30.0000 C148 C "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Alternative syntax: DataFrame.query() takes the condition as a string instead of a boolean mask\n",
"# On large DataFrames, query() can be considerably faster when numexpr is installed (see the references)\n",
"df.query('Age > 20 and Survived == 1')[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrames provide a set of functions for selection that we will need later:\n",
"\n",
"|Operation | Syntax | Result |\n",
"|----------|--------|--------|\n",
"|Select column | df[col] | Series |\n",
"|Select row by label | df.loc[label] | Series |\n",
"|Select row by integer location | df.iloc[loc] | Series |\n",
"|Slice rows | df[5:10] | DataFrame |\n",
"|Select rows by boolean vector | df[bool_vec] | DataFrame |"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"887 19.0\n",
"888 NaN\n",
"889 26.0\n",
"890 32.0\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select column and show last 4\n",
"df['Age'][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"887 19.0\n",
"888 NaN\n",
"889 26.0\n",
"890 32.0\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select a column by label with .loc[row-labels, column-labels]; ':' keeps every row. Show the last 4\n",
"df.loc[:, 'Age'][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"887 19.0\n",
"888 NaN\n",
"889 26.0\n",
"890 32.0\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select a column by integer position with .iloc (Age is at position 5, counting from 0), and show the last 4\n",
"df.iloc[:, 5][-4:]"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.0 0 0 211536 13.00 NaN S \n",
"887 female 19.0 0 0 112053 30.00 B42 S \n",
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.0 0 0 111369 30.00 C148 C \n",
"890 male 32.0 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Slice rows: last 5 rows\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>884</th>\n",
" <td>885</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Sutehall, Mr. Henry Jr</td>\n",
" <td>male</td>\n",
" <td>25.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>SOTON/OQ 392076</td>\n",
" <td>7.050</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>885</th>\n",
" <td>886</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Rice, Mrs. William (Margaret Norton)</td>\n",
" <td>female</td>\n",
" <td>39.0</td>\n",
" <td>0</td>\n",
" <td>5</td>\n",
" <td>382652</td>\n",
" <td>29.125</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.750</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"884 885 0 3 Sutehall, Mr. Henry Jr \n",
"885 886 0 3 Rice, Mrs. William (Margaret Norton) \n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n",
"885 female 39.0 0 5 382652 29.125 NaN Q \n",
"886 male 27.0 0 0 211536 13.000 NaN S \n",
"889 male 26.0 0 0 111369 30.000 C148 C \n",
"890 male 32.0 0 0 370376 7.750 NaN Q "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select rows with a boolean vector and show the last 5 rows\n",
"df[df.Age > 20][-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows can be grouped by one or more columns, and aggregation functions can then be applied to the resulting GroupBy object."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Sex\n",
"female 314\n",
"male 577\n",
"dtype: int64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Number of passengers per sex (SQL-like)\n",
"df.groupby('Sex').size()"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>461.597222</td>\n",
" <td>0.629630</td>\n",
" <td>38.233441</td>\n",
" <td>0.416667</td>\n",
" <td>0.356481</td>\n",
" <td>84.154687</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>445.956522</td>\n",
" <td>0.472826</td>\n",
" <td>29.877630</td>\n",
" <td>0.402174</td>\n",
" <td>0.380435</td>\n",
" <td>20.662183</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>439.154786</td>\n",
" <td>0.242363</td>\n",
" <td>25.140620</td>\n",
" <td>0.615071</td>\n",
" <td>0.393075</td>\n",
" <td>13.675550</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Age SibSp Parch Fare\n",
"Pclass \n",
"1 461.597222 0.629630 38.233441 0.416667 0.356481 84.154687\n",
"2 445.956522 0.472826 29.877630 0.402174 0.380435 20.662183\n",
"3 439.154786 0.242363 25.140620 0.615071 0.393075 13.675550"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Mean age of passengers per passenger class\n",
"\n",
"# First we calculate the mean of every numeric column\n",
"df.groupby('Pclass').mean()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pclass\n",
"1 38.233441\n",
"2 29.877630\n",
"3 25.140620\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And now we answer the initial question (mean age only)\n",
"df.groupby('Pclass')['Age'].mean()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Pclass\n",
"1 38.233441\n",
"2 29.877630\n",
"3 25.140620\n",
"Name: Age, dtype: float64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Alternative syntax\n",
"df.groupby('Pclass').Age.mean()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
" <th>female</th>\n",
" <td>34.611765</td>\n",
" <td>0.553191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>41.281386</td>\n",
" <td>0.311475</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
" <th>female</th>\n",
" <td>28.722973</td>\n",
" <td>0.486842</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>30.740707</td>\n",
" <td>0.342593</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
" <th>female</th>\n",
" <td>21.750000</td>\n",
" <td>0.895833</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>26.507589</td>\n",
" <td>0.498559</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age SibSp\n",
"Pclass Sex \n",
"1 female 34.611765 0.553191\n",
" male 41.281386 0.311475\n",
"2 female 28.722973 0.486842\n",
" male 30.740707 0.342593\n",
"3 female 21.750000 0.895833\n",
" male 26.507589 0.498559"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Mean Age and SibSp of passengers grouped by passenger class and sex\n",
"df.groupby(['Pclass', 'Sex'])[['Age', 'SibSp']].mean()"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Pclass</th>\n",
" <th>Sex</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
" <th>female</th>\n",
" <td>42.052632</td>\n",
" <td>0.473684</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>45.017241</td>\n",
" <td>0.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
" <th>female</th>\n",
" <td>36.566667</td>\n",
" <td>0.444444</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>38.809524</td>\n",
" <td>0.301587</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
" <th>female</th>\n",
" <td>34.959459</td>\n",
" <td>0.513514</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>35.778226</td>\n",
" <td>0.185484</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age SibSp\n",
"Pclass Sex \n",
"1 female 42.052632 0.473684\n",
" male 45.017241 0.333333\n",
"2 female 36.566667 0.444444\n",
" male 38.809524 0.301587\n",
"3 female 34.959459 0.513514\n",
" male 35.778226 0.185484"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Mean Age and SibSp for passengers older than 25, grouped by passenger class and sex\n",
"df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age', 'SibSp']].mean()"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Mean Age, SibSp and Survived of passengers older than 25 who survived, grouped by passenger class and sex\n",
"# Each condition needs its own parentheses, because & binds more tightly than > and ==\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].mean()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# We can also decide which function to apply to each column\n",
"\n",
"# Mean Age, mean SibSp, and number of passengers older than 25 who survived, grouped by passenger class and sex\n",
"# Each condition needs its own parentheses, because & binds more tightly than > and ==\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].agg({'Age': np.mean, \n",
"                                                 'SibSp': np.mean, 'Survived': np.size})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pivot tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pivot tables are an intuitive way to analyze data, and an alternative to grouping with groupby."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Parch</th>\n",
" <th>PassengerId</th>\n",
" <th>Pclass</th>\n",
" <th>SibSp</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>female</th>\n",
" <td>27.915709</td>\n",
" <td>44.479818</td>\n",
" <td>0.649682</td>\n",
" <td>431.028662</td>\n",
" <td>2.159236</td>\n",
" <td>0.694268</td>\n",
" <td>0.742038</td>\n",
" </tr>\n",
" <tr>\n",
" <th>male</th>\n",
" <td>30.726645</td>\n",
" <td>25.523893</td>\n",
" <td>0.235702</td>\n",
" <td>454.147314</td>\n",
" <td>2.389948</td>\n",
" <td>0.429809</td>\n",
" <td>0.188908</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age Fare Parch PassengerId Pclass SibSp \\\n",
"Sex \n",
"female 27.915709 44.479818 0.649682 431.028662 2.159236 0.694268 \n",
"male 30.726645 25.523893 0.235702 454.147314 2.389948 0.429809 \n",
"\n",
" Survived \n",
"Sex \n",
"female 0.742038 \n",
"male 0.188908 "
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.pivot_table(df, index='Sex')"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>Fare</th>\n",
" <th>Parch</th>\n",
" <th>PassengerId</th>\n",
" <th>SibSp</th>\n",
" <th>Survived</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
" <th>1</th>\n",
" <td>34.611765</td>\n",
" <td>106.125798</td>\n",
" <td>0.457447</td>\n",
" <td>469.212766</td>\n",
" <td>0.553191</td>\n",
" <td>0.968085</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>28.722973</td>\n",
" <td>21.970121</td>\n",
" <td>0.605263</td>\n",
" <td>443.105263</td>\n",
" <td>0.486842</td>\n",
" <td>0.921053</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21.750000</td>\n",
" <td>16.118810</td>\n",
" <td>0.798611</td>\n",
" <td>399.729167</td>\n",
" <td>0.895833</td>\n",
" <td>0.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
" <th>1</th>\n",
" <td>41.281386</td>\n",
" <td>67.226127</td>\n",
" <td>0.278689</td>\n",
" <td>455.729508</td>\n",
" <td>0.311475</td>\n",
" <td>0.368852</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30.740707</td>\n",
" <td>19.741782</td>\n",
" <td>0.222222</td>\n",
" <td>447.962963</td>\n",
" <td>0.342593</td>\n",
" <td>0.157407</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>26.507589</td>\n",
" <td>12.661633</td>\n",
" <td>0.224784</td>\n",
" <td>455.515850</td>\n",
" <td>0.498559</td>\n",
" <td>0.135447</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age Fare Parch PassengerId SibSp \\\n",
"Sex Pclass \n",
"female 1 34.611765 106.125798 0.457447 469.212766 0.553191 \n",
" 2 28.722973 21.970121 0.605263 443.105263 0.486842 \n",
" 3 21.750000 16.118810 0.798611 399.729167 0.895833 \n",
"male 1 41.281386 67.226127 0.278689 455.729508 0.311475 \n",
" 2 30.740707 19.741782 0.222222 447.962963 0.342593 \n",
" 3 26.507589 12.661633 0.224784 455.515850 0.498559 \n",
"\n",
" Survived \n",
"Sex Pclass \n",
"female 1 0.968085 \n",
" 2 0.921053 \n",
" 3 0.500000 \n",
"male 1 0.368852 \n",
" 2 0.157407 \n",
" 3 0.135447 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'])"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
" <th>1</th>\n",
" <td>34.611765</td>\n",
" <td>0.553191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>28.722973</td>\n",
" <td>0.486842</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21.750000</td>\n",
" <td>0.895833</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
" <th>1</th>\n",
" <td>41.281386</td>\n",
" <td>0.311475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30.740707</td>\n",
" <td>0.342593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>26.507589</td>\n",
" <td>0.498559</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age SibSp\n",
"Sex Pclass \n",
"female 1 34.611765 0.553191\n",
" 2 28.722973 0.486842\n",
" 3 21.750000 0.895833\n",
"male 1 41.281386 0.311475\n",
" 2 30.740707 0.342593\n",
" 3 26.507589 0.498559"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
" <th>1</th>\n",
" <td>34.611765</td>\n",
" <td>0.553191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>28.722973</td>\n",
" <td>0.486842</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21.750000</td>\n",
" <td>0.895833</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
" <th>1</th>\n",
" <td>41.281386</td>\n",
" <td>0.311475</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30.740707</td>\n",
" <td>0.342593</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>26.507589</td>\n",
" <td>0.498559</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Age SibSp\n",
"Sex Pclass \n",
"female 1 34.611765 0.553191\n",
" 2 28.722973 0.486842\n",
" 3 21.750000 0.895833\n",
"male 1 41.281386 0.311475\n",
" 2 30.740707 0.342593\n",
" 3 26.507589 0.498559"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"2\" halign=\"left\">mean</th>\n",
" <th colspan=\"2\" halign=\"left\">sum</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
" <th>1</th>\n",
" <td>34.611765</td>\n",
" <td>0.553191</td>\n",
" <td>2942.00</td>\n",
" <td>52</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>28.722973</td>\n",
" <td>0.486842</td>\n",
" <td>2125.50</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>21.750000</td>\n",
" <td>0.895833</td>\n",
" <td>2218.50</td>\n",
" <td>129</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
" <th>1</th>\n",
" <td>41.281386</td>\n",
" <td>0.311475</td>\n",
" <td>4169.42</td>\n",
" <td>38</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>30.740707</td>\n",
" <td>0.342593</td>\n",
" <td>3043.33</td>\n",
" <td>37</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>26.507589</td>\n",
" <td>0.498559</td>\n",
" <td>6706.42</td>\n",
" <td>173</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mean sum \n",
" Age SibSp Age SibSp\n",
"Sex Pclass \n",
"female 1 34.611765 0.553191 2942.00 52\n",
" 2 28.722973 0.486842 2125.50 37\n",
" 3 21.750000 0.895833 2218.50 129\n",
"male 1 41.281386 0.311475 4169.42 38\n",
" 2 30.740707 0.342593 3043.33 37\n",
" 3 26.507589 0.498559 6706.42 173"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# aggfunc also accepts a list of functions; try np.sum, np.size or len\n",
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"6\" halign=\"left\">mean</th>\n",
" <th colspan=\"6\" halign=\"left\">sum</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th>Embarked</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th>Survived</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"6\" valign=\"top\">female</th>\n",
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
" <th>0</th>\n",
" <td>50.000000</td>\n",
" <td>NaN</td>\n",
" <td>13.500000</td>\n",
" <td>0.000000</td>\n",
" <td>NaN</td>\n",
" <td>1.000000</td>\n",
" <td>50.00</td>\n",
" <td>NaN</td>\n",
" <td>27.00</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>35.675676</td>\n",
" <td>33.000000</td>\n",
" <td>33.619048</td>\n",
" <td>0.523810</td>\n",
" <td>1.000000</td>\n",
" <td>0.586957</td>\n",
" <td>1320.00</td>\n",
" <td>33.0</td>\n",
" <td>1412.00</td>\n",
" <td>22.0</td>\n",
" <td>1.0</td>\n",
" <td>27.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>36.000000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>0.500000</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>216.00</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>19.142857</td>\n",
" <td>30.000000</td>\n",
" <td>29.091667</td>\n",
" <td>0.714286</td>\n",
" <td>0.000000</td>\n",
" <td>0.475410</td>\n",
" <td>134.00</td>\n",
" <td>30.0</td>\n",
" <td>1745.50</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>29.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
" <th>0</th>\n",
" <td>20.700000</td>\n",
" <td>28.100000</td>\n",
" <td>23.688889</td>\n",
" <td>0.500000</td>\n",
" <td>0.111111</td>\n",
" <td>1.600000</td>\n",
" <td>103.50</td>\n",
" <td>140.5</td>\n",
" <td>1066.00</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" <td>88.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>11.045455</td>\n",
" <td>17.600000</td>\n",
" <td>22.548387</td>\n",
" <td>0.600000</td>\n",
" <td>0.250000</td>\n",
" <td>0.636364</td>\n",
" <td>121.50</td>\n",
" <td>88.0</td>\n",
" <td>699.00</td>\n",
" <td>9.0</td>\n",
" <td>6.0</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"6\" valign=\"top\">male</th>\n",
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
" <th>0</th>\n",
" <td>43.050000</td>\n",
" <td>44.000000</td>\n",
" <td>45.362500</td>\n",
" <td>0.160000</td>\n",
" <td>2.000000</td>\n",
" <td>0.294118</td>\n",
" <td>861.00</td>\n",
" <td>44.0</td>\n",
" <td>1814.50</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" <td>15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>36.437500</td>\n",
" <td>NaN</td>\n",
" <td>36.121667</td>\n",
" <td>0.352941</td>\n",
" <td>NaN</td>\n",
" <td>0.392857</td>\n",
" <td>583.00</td>\n",
" <td>NaN</td>\n",
" <td>866.92</td>\n",
" <td>6.0</td>\n",
" <td>NaN</td>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
" <th>0</th>\n",
" <td>29.500000</td>\n",
" <td>57.000000</td>\n",
" <td>33.414474</td>\n",
" <td>0.625000</td>\n",
" <td>0.000000</td>\n",
" <td>0.280488</td>\n",
" <td>206.50</td>\n",
" <td>57.0</td>\n",
" <td>2539.50</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" <td>17.095000</td>\n",
" <td>0.000000</td>\n",
" <td>NaN</td>\n",
" <td>0.600000</td>\n",
" <td>1.00</td>\n",
" <td>NaN</td>\n",
" <td>239.33</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
" <th>0</th>\n",
" <td>27.555556</td>\n",
" <td>28.076923</td>\n",
" <td>27.168478</td>\n",
" <td>0.181818</td>\n",
" <td>0.583333</td>\n",
" <td>0.562771</td>\n",
" <td>496.00</td>\n",
" <td>365.0</td>\n",
" <td>4999.00</td>\n",
" <td>6.0</td>\n",
" <td>21.0</td>\n",
" <td>130.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>18.488571</td>\n",
" <td>29.000000</td>\n",
" <td>22.933333</td>\n",
" <td>0.400000</td>\n",
" <td>0.666667</td>\n",
" <td>0.294118</td>\n",
" <td>129.42</td>\n",
" <td>29.0</td>\n",
" <td>688.00</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mean \\\n",
" Age SibSp \n",
"Embarked C Q S C Q \n",
"Sex Pclass Survived \n",
"female 1 0 50.000000 NaN 13.500000 0.000000 NaN \n",
" 1 35.675676 33.000000 33.619048 0.523810 1.000000 \n",
" 2 0 NaN NaN 36.000000 NaN NaN \n",
" 1 19.142857 30.000000 29.091667 0.714286 0.000000 \n",
" 3 0 20.700000 28.100000 23.688889 0.500000 0.111111 \n",
" 1 11.045455 17.600000 22.548387 0.600000 0.250000 \n",
"male 1 0 43.050000 44.000000 45.362500 0.160000 2.000000 \n",
" 1 36.437500 NaN 36.121667 0.352941 NaN \n",
" 2 0 29.500000 57.000000 33.414474 0.625000 0.000000 \n",
" 1 1.000000 NaN 17.095000 0.000000 NaN \n",
" 3 0 27.555556 28.076923 27.168478 0.181818 0.583333 \n",
" 1 18.488571 29.000000 22.933333 0.400000 0.666667 \n",
"\n",
" sum \n",
" Age SibSp \n",
"Embarked S C Q S C Q S \n",
"Sex Pclass Survived \n",
"female 1 0 1.000000 50.00 NaN 27.00 0.0 NaN 2.0 \n",
" 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n",
" 2 0 0.500000 NaN NaN 216.00 NaN NaN 3.0 \n",
" 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n",
" 3 0 1.600000 103.50 140.5 1066.00 4.0 1.0 88.0 \n",
" 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n",
"male 1 0 0.294118 861.00 44.0 1814.50 4.0 2.0 15.0 \n",
" 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n",
" 2 0 0.280488 206.50 57.0 2539.50 5.0 0.0 23.0 \n",
" 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n",
" 3 0 0.562771 496.00 365.0 4999.00 6.0 21.0 130.0 \n",
" 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 "
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Try np.sum, np.size, len\n",
"table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],\n",
" columns=['Embarked'])\n",
"table"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"6\" halign=\"left\">mean</th>\n",
" <th colspan=\"6\" halign=\"left\">sum</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th>Embarked</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" <th>C</th>\n",
" <th>Q</th>\n",
" <th>S</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sex</th>\n",
" <th>Pclass</th>\n",
" <th>Survived</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
" <th>1</th>\n",
" <th>1</th>\n",
" <td>35.675676</td>\n",
" <td>33.0</td>\n",
" <td>33.619048</td>\n",
" <td>0.523810</td>\n",
" <td>1.000000</td>\n",
" <td>0.586957</td>\n",
" <td>1320.00</td>\n",
" <td>33.0</td>\n",
" <td>1412.00</td>\n",
" <td>22.0</td>\n",
" <td>1.0</td>\n",
" <td>27.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>1</th>\n",
" <td>19.142857</td>\n",
" <td>30.0</td>\n",
" <td>29.091667</td>\n",
" <td>0.714286</td>\n",
" <td>0.000000</td>\n",
" <td>0.475410</td>\n",
" <td>134.00</td>\n",
" <td>30.0</td>\n",
" <td>1745.50</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>29.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>1</th>\n",
" <td>11.045455</td>\n",
" <td>17.6</td>\n",
" <td>22.548387</td>\n",
" <td>0.600000</td>\n",
" <td>0.250000</td>\n",
" <td>0.636364</td>\n",
" <td>121.50</td>\n",
" <td>88.0</td>\n",
" <td>699.00</td>\n",
" <td>9.0</td>\n",
" <td>6.0</td>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
" <th>1</th>\n",
" <th>1</th>\n",
" <td>36.437500</td>\n",
" <td>NaN</td>\n",
" <td>36.121667</td>\n",
" <td>0.352941</td>\n",
" <td>NaN</td>\n",
" <td>0.392857</td>\n",
" <td>583.00</td>\n",
" <td>NaN</td>\n",
" <td>866.92</td>\n",
" <td>6.0</td>\n",
" <td>NaN</td>\n",
" <td>11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <th>1</th>\n",
" <td>1.000000</td>\n",
" <td>NaN</td>\n",
" <td>17.095000</td>\n",
" <td>0.000000</td>\n",
" <td>NaN</td>\n",
" <td>0.600000</td>\n",
" <td>1.00</td>\n",
" <td>NaN</td>\n",
" <td>239.33</td>\n",
" <td>0.0</td>\n",
" <td>NaN</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <th>1</th>\n",
" <td>18.488571</td>\n",
" <td>29.0</td>\n",
" <td>22.933333</td>\n",
" <td>0.400000</td>\n",
" <td>0.666667</td>\n",
" <td>0.294118</td>\n",
" <td>129.42</td>\n",
" <td>29.0</td>\n",
" <td>688.00</td>\n",
" <td>4.0</td>\n",
" <td>2.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mean \\\n",
" Age SibSp \n",
"Embarked C Q S C Q \n",
"Sex Pclass Survived \n",
"female 1 1 35.675676 33.0 33.619048 0.523810 1.000000 \n",
" 2 1 19.142857 30.0 29.091667 0.714286 0.000000 \n",
" 3 1 11.045455 17.6 22.548387 0.600000 0.250000 \n",
"male 1 1 36.437500 NaN 36.121667 0.352941 NaN \n",
" 2 1 1.000000 NaN 17.095000 0.000000 NaN \n",
" 3 1 18.488571 29.0 22.933333 0.400000 0.666667 \n",
"\n",
" sum \n",
" Age SibSp \n",
"Embarked S C Q S C Q S \n",
"Sex Pclass Survived \n",
"female 1 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n",
" 2 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n",
" 3 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n",
"male 1 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n",
" 2 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n",
" 3 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 "
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"table.query('Survived == 1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Duplicates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.duplicated().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Null and missing values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we check how many null values there are.\n",
"\n",
"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 0\n",
"Survived 0\n",
"Pclass 0\n",
"Name 0\n",
"Sex 0\n",
"Age 177\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 0\n",
"Fare 0\n",
"Cabin 687\n",
"Embarked 2\n",
"dtype: int64"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Original (891, 10)\n",
"Cleaned (889, 10)\n"
]
}
],
"source": [
"# Drop records with missing values\n",
"df_original = df.copy()\n",
"df_clean = df.dropna()\n",
"print(\"Original\", df.shape)\n",
"print(\"Cleaned\", df_clean.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>28.0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.0 0 0 211536 13.00 NaN S \n",
"887 female 19.0 0 0 112053 30.00 B42 S \n",
"888 female 28.0 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.0 0 0 111369 30.00 C148 C \n",
"890 male 32.0 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Fill missing values with the median\n",
"df_filled = df.fillna(df.median())\n",
"df_filled[-5:]"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.0 0 0 211536 13.00 NaN S \n",
"887 female 19.0 0 0 112053 30.00 B42 S \n",
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.0 0 0 111369 30.00 C148 C \n",
"890 male 32.0 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#The original df has not been modified\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
"\n",
"Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
"\n",
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
"\n",
"If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
"890 male 32.000000 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
"890 male 32.000000 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Another possibility is to assign the modified dataframe\n",
"# First we get the df with NaN values\n",
"df = df_original.copy()\n",
"#Fill NaN and assign to the column\n",
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to see how to change the Sex value of PassengerId 889, and then replace the missing values of Sex. It is just an example for practicing."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"PassengerId 890\n",
"Survived 1\n",
"Pclass 1\n",
"Name Behr, Mr. Karl Howell\n",
"Sex male\n",
"Age 26\n",
"SibSp 0\n",
"Parch 0\n",
"Ticket 111369\n",
"Fare 30\n",
"Cabin C148\n",
"Embarked C\n",
"Name: 889, dtype: object"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# There are not labels for rows, so we use the numeric index\n",
"df.iloc[889]"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'male'"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#We access row and column\n",
"df.iloc[889]['Sex']"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/cif/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: \n",
"A value is trying to be set on a copy of a slice from a DataFrame\n",
"\n",
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
" from ipykernel import kernelapp as app\n"
]
}
],
"source": [
"# But we are working on a copy \n",
"df.iloc[889]['Sex'] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'male'"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If we want to change, we should not chain selections\n",
"# The selection can be done with the column name\n",
"df.loc[889, 'Sex']"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'male'"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Or with the index of the column\n",
"df.iloc[889, 4]"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>NaN</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
"889 NaN 26.000000 0 0 111369 30.00 C148 C \n",
"890 male 32.000000 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This indexing works for changing values\n",
"df.loc[889, 'Sex'] = np.nan\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.00</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.00</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.45</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.00</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.75</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
"890 male 32.000000 0 0 370376 7.75 NaN Q "
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['Sex'].fillna('male', inplace=True)\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=bfill**) or the next valid value (**method=ffill**). For example, with time series, it is frequent to use the last valid value (bfill). Another alternative is to use the method **interpolate()**.\n",
"\n",
"Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Scikit-learn** provides also a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Analysing non numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
"\n",
"**Name** and **Ticket** do not seem informative.\n",
"\n",
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
"\n",
"**Sex** and **Embarked** are categorical features, so we will encode as integers."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13.00</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>23.45</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7.75</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Fare Embarked \n",
"886 male 27.000000 0 0 13.00 S \n",
"887 female 19.000000 0 0 30.00 S \n",
"888 female 29.699118 1 2 23.45 S \n",
"889 male 26.000000 0 0 30.00 C \n",
"890 male 32.000000 0 0 7.75 Q "
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We remove Cabin and Ticket. We should specify the axis\n",
"# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Encoding categorical values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "*Sex* is a categorical feature stored as text. Scikit-learn estimators expect numerical input, so we have to encode it as numbers. Keep in mind that estimators treat numerical input as continuous, so they would interpret the encoded categories as being ordered, which is not the case. "
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "# First we check whether there are any null values. Observe the use of any()\n",
"df['Sex'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['male', 'female'], dtype=object)"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Now we check the values of Sex\n",
"df['Sex'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to encode the values with our pandas knowledge."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>0</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13.00</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>1</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>1</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>23.45</td>\n",
" <td>S</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>0</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>0</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7.75</td>\n",
" <td>Q</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Fare Embarked \n",
"886 0 27.000000 0 0 13.00 S \n",
"887 1 19.000000 0 0 30.00 S \n",
"888 1 29.699118 1 2 23.45 S \n",
"889 0 26.000000 0 0 30.00 C \n",
"890 0 32.000000 0 0 7.75 Q "
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>Gender</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked Gender \n",
"0 0 A/5 21171 7.2500 NaN S 0 \n",
"1 0 PC 17599 71.2833 C85 C 1 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 1 \n",
"3 0 113803 53.1000 C123 S 1 \n",
"4 0 373450 8.0500 NaN S 0 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "# An alternative is to create a new column with the encoded values by defining a mapping\n",
"df = df_original.copy()\n",
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 51,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check nulls\n",
"df['Embarked'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check how many nulls\n",
"\n",
"df['Embarked'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array(['S', 'C', 'Q', nan], dtype=object)"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check values\n",
"df['Embarked'].unique()"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"Embarked\n",
"C 168\n",
"Q 77\n",
"S 644\n",
"dtype: int64"
]
},
"execution_count": 112,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Check distribution of Embarked\n",
"df.groupby('Embarked').size()"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "# Replace nulls with the most common value ('S').\n",
    "# Assigning the result back avoids relying on inplace=True on a selected column.\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
"df['Embarked'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Fare</th>\n",
" <th>Embarked</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>13.00</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>29.699118</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>23.45</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.00</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.000000</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>7.75</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass Name \\\n",
"886 887 0 2 Montvila, Rev. Juozas \n",
"887 888 1 1 Graham, Miss. Margaret Edith \n",
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
"889 890 1 1 Behr, Mr. Karl Howell \n",
"890 891 0 3 Dooley, Mr. Patrick \n",
"\n",
" Sex Age SibSp Parch Fare Embarked \n",
"886 male 27.000000 0 0 13.00 0 \n",
"887 female 19.000000 0 0 30.00 0 \n",
"888 female 29.699118 1 2 23.45 0 \n",
"889 male 26.000000 0 0 30.00 1 \n",
"890 male 32.000000 0 0 7.75 2 "
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
    "# Now, as before, we replace the categories with integers\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Although this transformation works, we are introducing *an error*. Some classifiers could interpret the integer codes as ordered, that is, assume that Q (2) is higher than S (0). \n",
    "\n",
    "To avoid this error, scikit-learn provides a facility for *one-hot encoding* categorical features: it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
"\n",
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
]
},
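  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick preview of the idea, the following sketch builds the dummy binary features with the pandas function `get_dummies` instead of the scikit-learn facility, which is covered in the next notebook. The toy DataFrame is hypothetical and only serves to show one new column per category."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Toy column with the three ports of embarkation (illustrative data only)\n",
    "toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})\n",
    "\n",
    "# One indicator column per category: Embarked_C, Embarked_Q, Embarked_S\n",
    "pd.get_dummies(toy['Embarked'], prefix='Embarked')"
   ]
  },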
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}