mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-17 20:12:28 +00:00
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)\n",
|
|
"* [Examining a DataFrame](#Examining-a-DataFrame)\n",
|
|
"* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)\n",
|
|
"* [Grouping](#Grouping)\n",
|
|
"* [Pivot tables](#Pivot-tables)\n",
|
|
"* [Null and missing values](#Null-and-missing-values)\n",
|
|
"* [Analysing non numerical columns](#Analysing-non-numerical-columns)\n",
|
|
"* [Encoding categorical values](#Encoding-categorical-values)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data munging with Pandas and Scikit-learn"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
|
|
"\n",
|
|
"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
|
|
"Some of the most common tasks are:\n",
|
|
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
|
|
"* Encode categorical variables as integers\n",
|
|
"* Combine datasets\n",
|
|
"* Rename variables and convert types\n",
|
|
"* Transform / scale variables\n",
|
|
"\n",
|
|
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
|
|
"\n",
|
|
"First we load the dataset and we get a dataframe."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"scrolled": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Braund, Mr. Owen Harris</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>A/5 21171</td>\n",
|
|
" <td>7.2500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>38.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>PC 17599</td>\n",
|
|
" <td>71.2833</td>\n",
|
|
" <td>C85</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Heikkinen, Miss. Laina</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>STON/O2. 3101282</td>\n",
|
|
" <td>7.9250</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>113803</td>\n",
|
|
" <td>53.1000</td>\n",
|
|
" <td>C123</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Allen, Mr. William Henry</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>373450</td>\n",
|
|
" <td>8.0500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"0 1 0 3 \n",
|
|
"1 2 1 1 \n",
|
|
"2 3 1 3 \n",
|
|
"3 4 1 1 \n",
|
|
"4 5 0 3 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp \\\n",
|
|
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
|
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
|
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
|
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
|
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
|
"\n",
|
|
" Parch Ticket Fare Cabin Embarked \n",
|
|
"0 0 A/5 21171 7.2500 NaN S \n",
|
|
"1 0 PC 17599 71.2833 C85 C \n",
|
|
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
|
"3 0 113803 53.1000 C123 S \n",
|
|
"4 0 373450 8.0500 NaN S "
|
|
]
|
|
},
|
|
"execution_count": 1,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from pandas import Series, DataFrame\n",
|
|
"\n",
|
|
"df = pd.read_csv('data-titanic/train.csv')\n",
|
|
"\n",
|
|
"# Show the first 5 rows\n",
|
|
"df[:5]"
|
|
]
|
|
},
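{
"cell_type": "markdown",
"metadata": {},
"source": [
"The common preprocessing tasks listed earlier can be sketched on a small, made-up DataFrame. This is only an illustrative sketch: the `toy` frame and its values are invented for the example, and the specific choices (median imputation, an integer mapping for `Sex`) are assumptions, not necessarily the approach followed in the rest of this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, not taken from the Titanic files\n",
"toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],\n",
"                    'Age': [22.0, np.nan, 26.0],\n",
"                    'Fare': ['7.25', '71.28', '7.92']})\n",
"\n",
"# Replace missing ages with the median of the known ages\n",
"toy['Age'] = toy['Age'].fillna(toy['Age'].median())\n",
"\n",
"# Encode a categorical variable as integers\n",
"toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})\n",
"\n",
"# Convert a column stored as strings to float, then rename it\n",
"toy['Fare'] = toy['Fare'].astype(float)\n",
"toy = toy.rename(columns={'Fare': 'TicketFare'})\n",
"\n",
"# Every column is now numerical\n",
"toy.dtypes"
]
},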
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Examining a DataFrame"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can examine properties of the dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"<class 'pandas.core.frame.DataFrame'>\n",
|
|
"RangeIndex: 891 entries, 0 to 890\n",
|
|
"Data columns (total 12 columns):\n",
|
|
"PassengerId 891 non-null int64\n",
|
|
"Survived 891 non-null int64\n",
|
|
"Pclass 891 non-null int64\n",
|
|
"Name 891 non-null object\n",
|
|
"Sex 891 non-null object\n",
|
|
"Age 714 non-null float64\n",
|
|
"SibSp 891 non-null int64\n",
|
|
"Parch 891 non-null int64\n",
|
|
"Ticket 891 non-null object\n",
|
|
"Fare 891 non-null float64\n",
|
|
"Cabin 204 non-null object\n",
|
|
"Embarked 889 non-null object\n",
|
|
"dtypes: float64(2), int64(5), object(5)\n",
|
|
"memory usage: 83.6+ KB\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Information about columns and their types\n",
|
|
"df.info()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We see some features have a numerical type (int64 and float64), and others has a type *object*. The object type is a String in Pandas. We observe that most features are integers, except for Name, Sex, Ticket, Cabin and Embarked."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Name object\n",
|
|
"Sex object\n",
|
|
"Ticket object\n",
|
|
"Cabin object\n",
|
|
"Embarked object\n",
|
|
"dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
|
|
"df.dtypes[df.dtypes == object]"
|
|
]
|
|
},
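{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a complement to indexing `df.dtypes`, `DataFrame.select_dtypes` selects columns by type directly. The sketch below uses a small, made-up frame (`toy` is a hypothetical name) so that it runs on its own; the same calls should also work on `df`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with integer, float and string columns\n",
"toy = pd.DataFrame({'Survived': [0, 1],\n",
"                    'Age': [22.0, np.nan],\n",
"                    'Sex': ['male', 'female']})\n",
"\n",
"# Names of the columns with a numerical dtype (int64 and float64 here)\n",
"numeric_cols = toy.select_dtypes(include='number').columns.tolist()\n",
"\n",
"# Names of the remaining columns (strings in this example)\n",
"other_cols = toy.select_dtypes(exclude='number').columns.tolist()\n",
"\n",
"numeric_cols, other_cols"
]
},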
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's explore the DataFrame."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(891, 12)"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Number of samples and features\n",
|
|
"df.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"scrolled": true
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>count</th>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>714.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>mean</th>\n",
|
|
" <td>446.000000</td>\n",
|
|
" <td>0.383838</td>\n",
|
|
" <td>2.308642</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>0.523008</td>\n",
|
|
" <td>0.381594</td>\n",
|
|
" <td>32.204208</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>std</th>\n",
|
|
" <td>257.353842</td>\n",
|
|
" <td>0.486592</td>\n",
|
|
" <td>0.836071</td>\n",
|
|
" <td>14.526497</td>\n",
|
|
" <td>1.102743</td>\n",
|
|
" <td>0.806057</td>\n",
|
|
" <td>49.693429</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>min</th>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.420000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>25%</th>\n",
|
|
" <td>223.500000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>2.000000</td>\n",
|
|
" <td>20.125000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>7.910400</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>50%</th>\n",
|
|
" <td>446.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>3.000000</td>\n",
|
|
" <td>28.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>14.454200</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>75%</th>\n",
|
|
" <td>668.500000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>3.000000</td>\n",
|
|
" <td>38.000000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>31.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>max</th>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>3.000000</td>\n",
|
|
" <td>80.000000</td>\n",
|
|
" <td>8.000000</td>\n",
|
|
" <td>6.000000</td>\n",
|
|
" <td>512.329200</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Age SibSp \\\n",
|
|
"count 891.000000 891.000000 891.000000 714.000000 891.000000 \n",
|
|
"mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n",
|
|
"std 257.353842 0.486592 0.836071 14.526497 1.102743 \n",
|
|
"min 1.000000 0.000000 1.000000 0.420000 0.000000 \n",
|
|
"25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n",
|
|
"50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n",
|
|
"75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n",
|
|
"max 891.000000 1.000000 3.000000 80.000000 8.000000 \n",
|
|
"\n",
|
|
" Parch Fare \n",
|
|
"count 891.000000 891.000000 \n",
|
|
"mean 0.381594 32.204208 \n",
|
|
"std 0.806057 49.693429 \n",
|
|
"min 0.000000 0.000000 \n",
|
|
"25% 0.000000 7.910400 \n",
|
|
"50% 0.000000 14.454200 \n",
|
|
"75% 0.000000 31.000000 \n",
|
|
"max 6.000000 512.329200 "
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Basic statistics of the dataset in all the numeric columns\n",
|
|
"df.describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass), we could have selected only the interesting columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>count</th>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>714.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" <td>891.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>mean</th>\n",
|
|
" <td>0.383838</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>0.523008</td>\n",
|
|
" <td>0.381594</td>\n",
|
|
" <td>32.204208</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>std</th>\n",
|
|
" <td>0.486592</td>\n",
|
|
" <td>14.526497</td>\n",
|
|
" <td>1.102743</td>\n",
|
|
" <td>0.806057</td>\n",
|
|
" <td>49.693429</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>min</th>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.420000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>25%</th>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>20.125000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>7.910400</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>50%</th>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>28.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>14.454200</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>75%</th>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>38.000000</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>31.000000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>max</th>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>80.000000</td>\n",
|
|
" <td>8.000000</td>\n",
|
|
" <td>6.000000</td>\n",
|
|
" <td>512.329200</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Survived Age SibSp Parch Fare\n",
|
|
"count 891.000000 714.000000 891.000000 891.000000 891.000000\n",
|
|
"mean 0.383838 29.699118 0.523008 0.381594 32.204208\n",
|
|
"std 0.486592 14.526497 1.102743 0.806057 49.693429\n",
|
|
"min 0.000000 0.420000 0.000000 0.000000 0.000000\n",
|
|
"25% 0.000000 20.125000 0.000000 0.000000 7.910400\n",
|
|
"50% 0.000000 28.000000 0.000000 0.000000 14.454200\n",
|
|
"75% 1.000000 38.000000 1.000000 0.000000 31.000000\n",
|
|
"max 1.000000 80.000000 8.000000 6.000000 512.329200"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Describe statistics of relevant columns. We pass a list of columns\n",
|
|
"df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Selecting rows in a DataFrame"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Braund, Mr. Owen Harris</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>A/5 21171</td>\n",
|
|
" <td>7.2500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>38.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>PC 17599</td>\n",
|
|
" <td>71.2833</td>\n",
|
|
" <td>C85</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Heikkinen, Miss. Laina</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>STON/O2. 3101282</td>\n",
|
|
" <td>7.9250</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>113803</td>\n",
|
|
" <td>53.1000</td>\n",
|
|
" <td>C123</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Allen, Mr. William Henry</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>373450</td>\n",
|
|
" <td>8.0500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"0 1 0 3 \n",
|
|
"1 2 1 1 \n",
|
|
"2 3 1 3 \n",
|
|
"3 4 1 1 \n",
|
|
"4 5 0 3 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp \\\n",
|
|
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
|
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
|
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
|
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
|
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
|
"\n",
|
|
" Parch Ticket Fare Cabin Embarked \n",
|
|
"0 0 A/5 21171 7.2500 NaN S \n",
|
|
"1 0 PC 17599 71.2833 C85 C \n",
|
|
"2 0 STON/O2. 3101282 7.9250 NaN S \n",
|
|
"3 0 113803 53.1000 C123 S \n",
|
|
"4 0 373450 8.0500 NaN S "
|
|
]
|
|
},
|
|
"execution_count": 7,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select the first 5 rows\n",
|
|
"df.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.0 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.0 0 0 112053 30.00 B42 S \n",
|
|
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select the last 5 rows\n",
|
|
"df.tail(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"scrolled": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Heikkinen, Miss. Laina</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>STON/O2. 3101282</td>\n",
|
|
" <td>7.925</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>113803</td>\n",
|
|
" <td>53.100</td>\n",
|
|
" <td>C123</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Allen, Mr. William Henry</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>373450</td>\n",
|
|
" <td>8.050</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"2 3 1 3 \n",
|
|
"3 4 1 1 \n",
|
|
"4 5 0 3 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp Parch \\\n",
|
|
"2 Heikkinen, Miss. Laina female 26.0 0 0 \n",
|
|
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 \n",
|
|
"4 Allen, Mr. William Henry male 35.0 0 0 \n",
|
|
"\n",
|
|
" Ticket Fare Cabin Embarked \n",
|
|
"2 STON/O2. 3101282 7.925 NaN S \n",
|
|
"3 113803 53.100 C123 S \n",
|
|
"4 373450 8.050 NaN S "
|
|
]
|
|
},
|
|
"execution_count": 9,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select several rows\n",
|
|
"df[2:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 0\n",
|
|
"1 1\n",
|
|
"2 1\n",
|
|
"3 1\n",
|
|
"4 0\n",
|
|
"Name: Survived, dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 10,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select the first 5 values of a column by name\n",
|
|
"df['Survived'][:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>38.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>0</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Survived Sex Age\n",
|
|
"0 0 male 22.0\n",
|
|
"1 1 female 38.0\n",
|
|
"2 1 female 26.0\n",
|
|
"3 1 female 35.0\n",
|
|
"4 0 male 35.0"
|
|
]
|
|
},
|
|
"execution_count": 11,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select several columns. Observe that the first parameter is a list\n",
|
|
"df[['Survived', 'Sex', 'Age']][:5]"
|
|
]
|
|
},
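{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows and columns can also be selected in a single step with `DataFrame.loc` (label-based) and `DataFrame.iloc` (position-based). The following is a minimal sketch on a small frame (`toy` is a hypothetical name) whose values are copied from the first five rows shown above. Note that `loc` slices include the end label, whereas `iloc` slices exclude the end position."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Small frame so that the example runs on its own\n",
"toy = pd.DataFrame({'Survived': [0, 1, 1, 1, 0],\n",
"                    'Sex': ['male', 'female', 'female', 'female', 'male'],\n",
"                    'Age': [22.0, 38.0, 26.0, 35.0, 35.0]})\n",
"\n",
"# Label-based: rows with index labels 2 to 4 (inclusive), two columns by name\n",
"by_label = toy.loc[2:4, ['Survived', 'Age']]\n",
"\n",
"# Position-based: rows at positions 2, 3 and 4, columns at positions 0 and 2\n",
"by_position = toy.iloc[2:5, [0, 2]]\n",
"\n",
"# Check whether both selections contain the same rows and columns\n",
"by_label.equals(by_position)"
]
},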
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"0 False\n",
|
|
"1 True\n",
|
|
"2 False\n",
|
|
"3 True\n",
|
|
"4 True\n",
|
|
"5 False\n",
|
|
"6 True\n",
|
|
"7 False\n",
|
|
"8 False\n",
|
|
"9 False\n",
|
|
"10 False\n",
|
|
"11 True\n",
|
|
"12 False\n",
|
|
"13 True\n",
|
|
"14 False\n",
|
|
"15 True\n",
|
|
"16 False\n",
|
|
"17 False\n",
|
|
"18 True\n",
|
|
"19 False\n",
|
|
"20 True\n",
|
|
"21 True\n",
|
|
"22 False\n",
|
|
"23 False\n",
|
|
"24 False\n",
|
|
"25 True\n",
|
|
"26 False\n",
|
|
"27 False\n",
|
|
"28 False\n",
|
|
"29 False\n",
|
|
" ... \n",
|
|
"861 False\n",
|
|
"862 True\n",
|
|
"863 False\n",
|
|
"864 False\n",
|
|
"865 True\n",
|
|
"866 False\n",
|
|
"867 True\n",
|
|
"868 False\n",
|
|
"869 False\n",
|
|
"870 False\n",
|
|
"871 True\n",
|
|
"872 True\n",
|
|
"873 True\n",
|
|
"874 False\n",
|
|
"875 False\n",
|
|
"876 False\n",
|
|
"877 False\n",
|
|
"878 False\n",
|
|
"879 True\n",
|
|
"880 False\n",
|
|
"881 True\n",
|
|
"882 False\n",
|
|
"883 False\n",
|
|
"884 False\n",
|
|
"885 True\n",
|
|
"886 False\n",
|
|
"887 False\n",
|
|
"888 False\n",
|
|
"889 False\n",
|
|
"890 True\n",
|
|
"Name: Age, dtype: bool"
|
|
]
|
|
},
|
|
"execution_count": 12,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Passengers older than 20. Observe dataframe columns can be accessed like attributes.\n",
|
|
"df.Age > 30"
|
|
]
|
|
},
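{
"cell_type": "markdown",
"metadata": {},
"source": [
"A comparison such as `df.Age > 30` evaluates to `False` wherever `Age` is missing, so passengers with an unknown age never match. This matters here because `Age` has only 714 non-null values out of 891. A minimal sketch on made-up values (`ages` is a hypothetical Series):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical ages, one of them missing\n",
"ages = pd.Series([22.0, 38.0, np.nan, 35.0], name='Age')\n",
"\n",
"# The comparison is False for the missing value\n",
"older = ages > 30\n",
"\n",
"# True only where the age is missing\n",
"unknown = ages.isnull()\n",
"\n",
"# Number of matches, and number of values the comparison could not decide\n",
"older.sum(), unknown.sum()"
]
},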
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>884</th>\n",
|
|
" <td>885</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Sutehall, Mr. Henry Jr</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>25.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>SOTON/OQ 392076</td>\n",
|
|
" <td>7.050</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>885</th>\n",
|
|
" <td>886</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Rice, Mrs. William (Margaret Norton)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>39.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>382652</td>\n",
|
|
" <td>29.125</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.000</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.750</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"884 885 0 3 Sutehall, Mr. Henry Jr \n",
|
|
"885 886 0 3 Rice, Mrs. William (Margaret Norton) \n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n",
|
|
"885 female 39.0 0 5 382652 29.125 NaN Q \n",
|
|
"886 male 27.0 0 0 211536 13.000 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.000 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.750 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select passengers older than 20 (only the last 5). We use boolean indexing\n",
|
|
"df[df.Age > 20][-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>871</th>\n",
|
|
" <td>872</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Beckwith, Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>47.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>11751</td>\n",
|
|
" <td>52.5542</td>\n",
|
|
" <td>D35</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>874</th>\n",
|
|
" <td>875</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Abelson, Mrs. Samuel (Hannah Wizosky)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>28.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>P/PP 3381</td>\n",
|
|
" <td>24.0000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>879</th>\n",
|
|
" <td>880</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>56.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>11767</td>\n",
|
|
" <td>83.1583</td>\n",
|
|
" <td>C50</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>880</th>\n",
|
|
" <td>881</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Shelley, Mrs. William (Imanita Parrish Hall)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>25.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>230433</td>\n",
|
|
" <td>26.0000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.0000</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"871 872 1 1 \n",
|
|
"874 875 1 2 \n",
|
|
"879 880 1 1 \n",
|
|
"880 881 1 2 \n",
|
|
"889 890 1 1 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp \\\n",
|
|
"871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n",
|
|
"874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n",
|
|
"879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n",
|
|
"880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n",
|
|
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
|
|
"\n",
|
|
" Parch Ticket Fare Cabin Embarked \n",
|
|
"871 1 11751 52.5542 D35 S \n",
|
|
"874 0 P/PP 3381 24.0000 NaN C \n",
|
|
"879 1 11767 83.1583 C50 C \n",
|
|
"880 1 230433 26.0000 NaN S \n",
|
|
"889 0 111369 30.0000 C148 C "
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select passengers older than 20 that survived (only the last 5)\n",
|
|
"df[(df.Age > 20) & (df.Survived == 1)][-5:]"
|
|
]
|
|
},
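{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each condition in a combined mask needs its own parentheses, because `&` binds more tightly than comparison operators such as `>` and `==`. The next cell is a minimal sketch of that rule, first with plain Python values and then with a small made-up frame (`toy`, `age` and `survived` are hypothetical names, not part of the Titanic data)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Plain Python values show the precedence rule\n",
"age, survived = 30, 0\n",
"print((age > 25) & (survived == 1))  # False: both comparisons are evaluated first\n",
"print(age > 25 & (survived == 1))    # True: parsed as age > (25 & False), that is, age > 0\n",
"\n",
"# Hypothetical toy data: the same rule applies to boolean masks\n",
"toy = pd.DataFrame({'Age': [22.0, 38.0, 30.0], 'Survived': [0, 1, 0]})\n",
"toy[(toy.Age > 25) & (toy.Survived == 1)]"
]
},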
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>871</th>\n",
|
|
" <td>872</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Beckwith, Mrs. Richard Leonard (Sallie Monypeny)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>47.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>11751</td>\n",
|
|
" <td>52.5542</td>\n",
|
|
" <td>D35</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>874</th>\n",
|
|
" <td>875</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Abelson, Mrs. Samuel (Hannah Wizosky)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>28.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>P/PP 3381</td>\n",
|
|
" <td>24.0000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>879</th>\n",
|
|
" <td>880</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>56.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>11767</td>\n",
|
|
" <td>83.1583</td>\n",
|
|
" <td>C50</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>880</th>\n",
|
|
" <td>881</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Shelley, Mrs. William (Imanita Parrish Hall)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>25.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>230433</td>\n",
|
|
" <td>26.0000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.0000</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"871 872 1 1 \n",
|
|
"874 875 1 2 \n",
|
|
"879 880 1 1 \n",
|
|
"880 881 1 2 \n",
|
|
"889 890 1 1 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp \\\n",
|
|
"871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n",
|
|
"874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n",
|
|
"879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n",
|
|
"880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n",
|
|
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
|
|
"\n",
|
|
" Parch Ticket Fare Cabin Embarked \n",
|
|
"871 1 11751 52.5542 D35 S \n",
|
|
"874 0 P/PP 3381 24.0000 NaN C \n",
|
|
"879 1 11767 83.1583 C50 C \n",
|
|
"880 1 230433 26.0000 NaN S \n",
|
|
"889 0 111369 30.0000 C148 C "
|
|
]
|
|
},
|
|
"execution_count": 15,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Alternative syntax with query to the standard Python \n",
|
|
"# In large dataframes, the perfomance of DataFrame.query() using numexpr is considerable faster, look at the references\n",
|
|
"df.query('Age > 20 and Survived == 1')[-5:]"
|
|
]
|
|
},
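{
"cell_type": "markdown",
"metadata": {},
"source": [
"`DataFrame.query()` takes the filter as a string, and local Python variables can be referenced with the `@` prefix. The next cell is a small sketch on made-up data (`toy` and `min_age` are hypothetical names) that checks whether both syntaxes select the same rows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one missing age\n",
"toy = pd.DataFrame({'Age': [22.0, 38.0, 30.0, None], 'Survived': [0, 1, 0, 1]})\n",
"min_age = 25\n",
"\n",
"by_mask = toy[(toy.Age > min_age) & (toy.Survived == 1)]\n",
"by_query = toy.query('Age > @min_age and Survived == 1')\n",
"\n",
"# Compare the two selections row by row\n",
"by_mask.equals(by_query)"
]
},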
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"DataFrames provide a set of functions for selection that we will need later\n",
|
|
"\n",
|
|
"\n",
|
|
"|Operation | Syntax | Result |\n",
|
|
"|-----------------------------|\n",
|
|
"|Select column | df[col] | Series |\n",
|
|
"|Select row by label | df.loc[label] | Series |\n",
|
|
"|Select row by integer location | df.iloc[loc] | Series |\n",
|
|
"|Slice rows\t | df[5:10]\t | DataFrame |\n",
|
|
"|Select rows by boolean vector | df[bool_vec] | DataFrame |"
|
|
]
|
|
},
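{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result types listed in the table can be checked on a small made-up frame before applying the same operations to the Titanic data. This is only an illustrative sketch: `toy` and its index labels are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with string row labels\n",
"toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0], 'Survived': [0, 1, 1]},\n",
"                   index=['a', 'b', 'c'])\n",
"\n",
"print(type(toy['Age']).__name__)         # select column -> Series\n",
"print(type(toy.loc['b']).__name__)       # select row by label -> Series\n",
"print(type(toy.iloc[0]).__name__)        # select row by integer location -> Series\n",
"print(type(toy[0:2]).__name__)           # slice rows -> DataFrame\n",
"print(type(toy[toy.Age > 25]).__name__)  # select rows by boolean vector -> DataFrame"
]
},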
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 16,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"887 19.0\n",
|
|
"888 NaN\n",
|
|
"889 26.0\n",
|
|
"890 32.0\n",
|
|
"Name: Age, dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 16,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select column and show last 4\n",
|
|
"df['Age'][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"887 19.0\n",
|
|
"888 NaN\n",
|
|
"889 26.0\n",
|
|
"890 32.0\n",
|
|
"Name: Age, dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 17,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select row by label. We select with [index-labels, column-labels], and show last 4\n",
|
|
"df.loc[:, 'Age'][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 18,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"887 19.0\n",
|
|
"888 NaN\n",
|
|
"889 26.0\n",
|
|
"890 32.0\n",
|
|
"Name: Age, dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 18,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Select row by column index (Age is the column 5), and show last 4\n",
|
|
"df.iloc[:, 5][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 19,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"scrolled": true
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.0 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.0 0 0 112053 30.00 B42 S \n",
|
|
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 19,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Slice rows - last 5 columns\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>884</th>\n",
|
|
" <td>885</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Sutehall, Mr. Henry Jr</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>25.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>SOTON/OQ 392076</td>\n",
|
|
" <td>7.050</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>885</th>\n",
|
|
" <td>886</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Rice, Mrs. William (Margaret Norton)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>39.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>5</td>\n",
|
|
" <td>382652</td>\n",
|
|
" <td>29.125</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.000</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.750</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"884 885 0 3 Sutehall, Mr. Henry Jr \n",
|
|
"885 886 0 3 Rice, Mrs. William (Margaret Norton) \n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n",
|
|
"885 female 39.0 0 5 382652 29.125 NaN Q \n",
|
|
"886 male 27.0 0 0 211536 13.000 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.000 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.750 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 20,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Select based on boolean vector and show last 5 columns\n",
|
|
"df[df.Age > 20][-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Grouping"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Rows can be grouped by one or more columns, and apply aggregated operators on the GroupBy object."
|
|
]
|
|
},
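{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping follows a split, apply, combine pattern: the rows are split by key, a function is applied to each group, and the results are combined into one object. The next cell sketches this on made-up data (`toy` is a hypothetical frame). The last statement uses named aggregation, which assumes pandas 0.25 or later."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"toy = pd.DataFrame({\n",
"    'Sex': ['female', 'male', 'female', 'male'],\n",
"    'Pclass': [1, 1, 3, 3],\n",
"    'Age': [38.0, 54.0, 26.0, 35.0],\n",
"})\n",
"\n",
"grouped = toy.groupby('Sex')   # split: one group per value of Sex\n",
"print(grouped.size())          # number of rows in each group\n",
"print(grouped['Age'].mean())   # mean of a single column per group\n",
"\n",
"# Several aggregations at once, each with its own output column name\n",
"grouped.agg(mean_age=('Age', 'mean'), passengers=('Age', 'size'))"
]
},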
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 21,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Sex\n",
|
|
"female 314\n",
|
|
"male 577\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 21,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Number of users per sex (SQL like)\n",
|
|
"df.groupby('Sex').size()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 22,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>461.597222</td>\n",
|
|
" <td>0.629630</td>\n",
|
|
" <td>38.233441</td>\n",
|
|
" <td>0.416667</td>\n",
|
|
" <td>0.356481</td>\n",
|
|
" <td>84.154687</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>445.956522</td>\n",
|
|
" <td>0.472826</td>\n",
|
|
" <td>29.877630</td>\n",
|
|
" <td>0.402174</td>\n",
|
|
" <td>0.380435</td>\n",
|
|
" <td>20.662183</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>439.154786</td>\n",
|
|
" <td>0.242363</td>\n",
|
|
" <td>25.140620</td>\n",
|
|
" <td>0.615071</td>\n",
|
|
" <td>0.393075</td>\n",
|
|
" <td>13.675550</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Age SibSp Parch Fare\n",
|
|
"Pclass \n",
|
|
"1 461.597222 0.629630 38.233441 0.416667 0.356481 84.154687\n",
|
|
"2 445.956522 0.472826 29.877630 0.402174 0.380435 20.662183\n",
|
|
"3 439.154786 0.242363 25.140620 0.615071 0.393075 13.675550"
|
|
]
|
|
},
|
|
"execution_count": 22,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Mean age of passengers per Passenger class\n",
|
|
"\n",
|
|
"#First we calculate the mean\n",
|
|
"df.groupby('Pclass').mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 23,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Pclass\n",
|
|
"1 38.233441\n",
|
|
"2 29.877630\n",
|
|
"3 25.140620\n",
|
|
"Name: Age, dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 23,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#And now we answer the initial query (only mean age)\n",
|
|
"df.groupby('Pclass')['Age'].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 24,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Pclass\n",
|
|
"1 38.233441\n",
|
|
"2 29.877630\n",
|
|
"3 25.140620\n",
|
|
"Name: Age, dtype: float64"
|
|
]
|
|
},
|
|
"execution_count": 24,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Alternative syntax\n",
|
|
"df.groupby('Pclass').Age.mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 25,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.553191</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>41.281386</td>\n",
|
|
" <td>0.311475</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.486842</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>30.740707</td>\n",
|
|
" <td>0.342593</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>21.750000</td>\n",
|
|
" <td>0.895833</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>26.507589</td>\n",
|
|
" <td>0.498559</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp\n",
|
|
"Pclass Sex \n",
|
|
"1 female 34.611765 0.553191\n",
|
|
" male 41.281386 0.311475\n",
|
|
"2 female 28.722973 0.486842\n",
|
|
" male 30.740707 0.342593\n",
|
|
"3 female 21.750000 0.895833\n",
|
|
" male 26.507589 0.498559"
|
|
]
|
|
},
|
|
"execution_count": 25,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Mean Age and SibSp of passengers grouped by passenger class and sex\n",
|
|
"df.groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 26,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>42.052632</td>\n",
|
|
" <td>0.473684</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>45.017241</td>\n",
|
|
" <td>0.333333</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>36.566667</td>\n",
|
|
" <td>0.444444</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>38.809524</td>\n",
|
|
" <td>0.301587</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>34.959459</td>\n",
|
|
" <td>0.513514</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>35.778226</td>\n",
|
|
" <td>0.185484</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp\n",
|
|
"Pclass Sex \n",
|
|
"1 female 42.052632 0.473684\n",
|
|
" male 45.017241 0.333333\n",
|
|
"2 female 36.566667 0.444444\n",
|
|
" male 38.809524 0.301587\n",
|
|
"3 female 34.959459 0.513514\n",
|
|
" male 35.778226 0.185484"
|
|
]
|
|
},
|
|
"execution_count": 26,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n",
|
|
"df[df.Age > 25].groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 27,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.541176</td>\n",
|
|
" <td>0.964706</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>41.685000</td>\n",
|
|
" <td>0.370000</td>\n",
|
|
" <td>0.390000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.500000</td>\n",
|
|
" <td>0.918919</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>32.329787</td>\n",
|
|
" <td>0.351064</td>\n",
|
|
" <td>0.106383</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>22.602041</td>\n",
|
|
" <td>0.806122</td>\n",
|
|
" <td>0.438776</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>26.713147</td>\n",
|
|
" <td>0.490040</td>\n",
|
|
" <td>0.143426</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp Survived\n",
|
|
"Pclass Sex \n",
|
|
"1 female 34.611765 0.541176 0.964706\n",
|
|
" male 41.685000 0.370000 0.390000\n",
|
|
"2 female 28.722973 0.500000 0.918919\n",
|
|
" male 32.329787 0.351064 0.106383\n",
|
|
"3 female 22.602041 0.806122 0.438776\n",
|
|
" male 26.713147 0.490040 0.143426"
|
|
]
|
|
},
|
|
"execution_count": 27,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
|
|
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 28,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.541176</td>\n",
|
|
" <td>85</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>41.685000</td>\n",
|
|
" <td>0.370000</td>\n",
|
|
" <td>100</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.500000</td>\n",
|
|
" <td>74</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>32.329787</td>\n",
|
|
" <td>0.351064</td>\n",
|
|
" <td>94</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>female</th>\n",
|
|
" <td>22.602041</td>\n",
|
|
" <td>0.806122</td>\n",
|
|
" <td>98</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>26.713147</td>\n",
|
|
" <td>0.490040</td>\n",
|
|
" <td>251</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp Survived\n",
|
|
"Pclass Sex \n",
|
|
"1 female 34.611765 0.541176 85\n",
|
|
" male 41.685000 0.370000 100\n",
|
|
"2 female 28.722973 0.500000 74\n",
|
|
" male 32.329787 0.351064 94\n",
|
|
"3 female 22.602041 0.806122 98\n",
|
|
" male 26.713147 0.490040 251"
|
|
]
|
|
},
|
|
"execution_count": 28,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# We can also decide which function apply in each column\n",
|
|
"\n",
|
|
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
|
|
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].agg({'Age': np.mean, \n",
|
|
" 'SibSp': np.mean, 'Survived': np.size})"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Pivot tables"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Pivot tables are an intuitive way to analyze data, and alternative to group columns."
|
|
]
|
|
},
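{
"cell_type": "markdown",
"metadata": {},
"source": [
"A pivot table with one index key and the mean aggregation computes the same values as the corresponding `groupby`. The next cell is a sketch on made-up data (`toy` is a hypothetical frame). The `values` argument is passed explicitly so that only the numeric `Age` column is aggregated."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"toy = pd.DataFrame({\n",
"    'Sex': ['female', 'male', 'female', 'male'],\n",
"    'Pclass': [1, 1, 3, 3],\n",
"    'Age': [38.0, 54.0, 26.0, 35.0],\n",
"})\n",
"\n",
"by_pivot = pd.pivot_table(toy, index='Sex', values='Age', aggfunc='mean')\n",
"by_group = toy.groupby('Sex')[['Age']].mean()\n",
"\n",
"# Both results hold the mean Age per Sex\n",
"print(by_pivot)\n",
"print(by_group)\n",
"\n",
"# The columns argument spreads a second key across the columns\n",
"pd.pivot_table(toy, index='Sex', columns='Pclass', values='Age', aggfunc='mean')"
]
},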
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 29,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>female</th>\n",
|
|
" <td>27.915709</td>\n",
|
|
" <td>44.479818</td>\n",
|
|
" <td>0.649682</td>\n",
|
|
" <td>431.028662</td>\n",
|
|
" <td>2.159236</td>\n",
|
|
" <td>0.694268</td>\n",
|
|
" <td>0.742038</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>male</th>\n",
|
|
" <td>30.726645</td>\n",
|
|
" <td>25.523893</td>\n",
|
|
" <td>0.235702</td>\n",
|
|
" <td>454.147314</td>\n",
|
|
" <td>2.389948</td>\n",
|
|
" <td>0.429809</td>\n",
|
|
" <td>0.188908</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age Fare Parch PassengerId Pclass SibSp \\\n",
|
|
"Sex \n",
|
|
"female 27.915709 44.479818 0.649682 431.028662 2.159236 0.694268 \n",
|
|
"male 30.726645 25.523893 0.235702 454.147314 2.389948 0.429809 \n",
|
|
"\n",
|
|
" Survived \n",
|
|
"Sex \n",
|
|
"female 0.742038 \n",
|
|
"male 0.188908 "
|
|
]
|
|
},
|
|
"execution_count": 29,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.pivot_table(df, index='Sex')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 30,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>106.125798</td>\n",
|
|
" <td>0.457447</td>\n",
|
|
" <td>469.212766</td>\n",
|
|
" <td>0.553191</td>\n",
|
|
" <td>0.968085</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>21.970121</td>\n",
|
|
" <td>0.605263</td>\n",
|
|
" <td>443.105263</td>\n",
|
|
" <td>0.486842</td>\n",
|
|
" <td>0.921053</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>21.750000</td>\n",
|
|
" <td>16.118810</td>\n",
|
|
" <td>0.798611</td>\n",
|
|
" <td>399.729167</td>\n",
|
|
" <td>0.895833</td>\n",
|
|
" <td>0.500000</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>41.281386</td>\n",
|
|
" <td>67.226127</td>\n",
|
|
" <td>0.278689</td>\n",
|
|
" <td>455.729508</td>\n",
|
|
" <td>0.311475</td>\n",
|
|
" <td>0.368852</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30.740707</td>\n",
|
|
" <td>19.741782</td>\n",
|
|
" <td>0.222222</td>\n",
|
|
" <td>447.962963</td>\n",
|
|
" <td>0.342593</td>\n",
|
|
" <td>0.157407</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>26.507589</td>\n",
|
|
" <td>12.661633</td>\n",
|
|
" <td>0.224784</td>\n",
|
|
" <td>455.515850</td>\n",
|
|
" <td>0.498559</td>\n",
|
|
" <td>0.135447</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age Fare Parch PassengerId SibSp \\\n",
|
|
"Sex Pclass \n",
|
|
"female 1 34.611765 106.125798 0.457447 469.212766 0.553191 \n",
|
|
" 2 28.722973 21.970121 0.605263 443.105263 0.486842 \n",
|
|
" 3 21.750000 16.118810 0.798611 399.729167 0.895833 \n",
|
|
"male 1 41.281386 67.226127 0.278689 455.729508 0.311475 \n",
|
|
" 2 30.740707 19.741782 0.222222 447.962963 0.342593 \n",
|
|
" 3 26.507589 12.661633 0.224784 455.515850 0.498559 \n",
|
|
"\n",
|
|
" Survived \n",
|
|
"Sex Pclass \n",
|
|
"female 1 0.968085 \n",
|
|
" 2 0.921053 \n",
|
|
" 3 0.500000 \n",
|
|
"male 1 0.368852 \n",
|
|
" 2 0.157407 \n",
|
|
" 3 0.135447 "
|
|
]
|
|
},
|
|
"execution_count": 30,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 31,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.553191</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.486842</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>21.750000</td>\n",
|
|
" <td>0.895833</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>41.281386</td>\n",
|
|
" <td>0.311475</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30.740707</td>\n",
|
|
" <td>0.342593</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>26.507589</td>\n",
|
|
" <td>0.498559</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp\n",
|
|
"Sex Pclass \n",
|
|
"female 1 34.611765 0.553191\n",
|
|
" 2 28.722973 0.486842\n",
|
|
" 3 21.750000 0.895833\n",
|
|
"male 1 41.281386 0.311475\n",
|
|
" 2 30.740707 0.342593\n",
|
|
" 3 26.507589 0.498559"
|
|
]
|
|
},
|
|
"execution_count": 31,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.553191</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.486842</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>21.750000</td>\n",
|
|
" <td>0.895833</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>41.281386</td>\n",
|
|
" <td>0.311475</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30.740707</td>\n",
|
|
" <td>0.342593</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>26.507589</td>\n",
|
|
" <td>0.498559</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Age SibSp\n",
|
|
"Sex Pclass \n",
|
|
"female 1 34.611765 0.553191\n",
|
|
" 2 28.722973 0.486842\n",
|
|
" 3 21.750000 0.895833\n",
|
|
"male 1 41.281386 0.311475\n",
|
|
" 2 30.740707 0.342593\n",
|
|
" 3 26.507589 0.498559"
|
|
]
|
|
},
|
|
"execution_count": 32,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 33,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th colspan=\"2\" halign=\"left\">mean</th>\n",
|
|
" <th colspan=\"2\" halign=\"left\">sum</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>34.611765</td>\n",
|
|
" <td>0.553191</td>\n",
|
|
" <td>2942.00</td>\n",
|
|
" <td>52</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>28.722973</td>\n",
|
|
" <td>0.486842</td>\n",
|
|
" <td>2125.50</td>\n",
|
|
" <td>37</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>21.750000</td>\n",
|
|
" <td>0.895833</td>\n",
|
|
" <td>2218.50</td>\n",
|
|
" <td>129</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>41.281386</td>\n",
|
|
" <td>0.311475</td>\n",
|
|
" <td>4169.42</td>\n",
|
|
" <td>38</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>30.740707</td>\n",
|
|
" <td>0.342593</td>\n",
|
|
" <td>3043.33</td>\n",
|
|
" <td>37</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>26.507589</td>\n",
|
|
" <td>0.498559</td>\n",
|
|
" <td>6706.42</td>\n",
|
|
" <td>173</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" mean sum \n",
|
|
" Age SibSp Age SibSp\n",
|
|
"Sex Pclass \n",
|
|
"female 1 34.611765 0.553191 2942.00 52\n",
|
|
" 2 28.722973 0.486842 2125.50 37\n",
|
|
" 3 21.750000 0.895833 2218.50 129\n",
|
|
"male 1 41.281386 0.311475 4169.42 38\n",
|
|
" 2 30.740707 0.342593 3043.33 37\n",
|
|
" 3 26.507589 0.498559 6706.42 173"
|
|
]
|
|
},
|
|
"execution_count": 33,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Try np.sum, np.size, len\n",
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 34,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th colspan=\"6\" halign=\"left\">mean</th>\n",
|
|
" <th colspan=\"6\" halign=\"left\">sum</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"6\" valign=\"top\">female</th>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>50.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>13.500000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>50.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>27.00</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>35.675676</td>\n",
|
|
" <td>33.000000</td>\n",
|
|
" <td>33.619048</td>\n",
|
|
" <td>0.523810</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.586957</td>\n",
|
|
" <td>1320.00</td>\n",
|
|
" <td>33.0</td>\n",
|
|
" <td>1412.00</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>36.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>0.500000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>216.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>3.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>19.142857</td>\n",
|
|
" <td>30.000000</td>\n",
|
|
" <td>29.091667</td>\n",
|
|
" <td>0.714286</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.475410</td>\n",
|
|
" <td>134.00</td>\n",
|
|
" <td>30.0</td>\n",
|
|
" <td>1745.50</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>29.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>20.700000</td>\n",
|
|
" <td>28.100000</td>\n",
|
|
" <td>23.688889</td>\n",
|
|
" <td>0.500000</td>\n",
|
|
" <td>0.111111</td>\n",
|
|
" <td>1.600000</td>\n",
|
|
" <td>103.50</td>\n",
|
|
" <td>140.5</td>\n",
|
|
" <td>1066.00</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>88.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>11.045455</td>\n",
|
|
" <td>17.600000</td>\n",
|
|
" <td>22.548387</td>\n",
|
|
" <td>0.600000</td>\n",
|
|
" <td>0.250000</td>\n",
|
|
" <td>0.636364</td>\n",
|
|
" <td>121.50</td>\n",
|
|
" <td>88.0</td>\n",
|
|
" <td>699.00</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>21.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"6\" valign=\"top\">male</th>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">1</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>43.050000</td>\n",
|
|
" <td>44.000000</td>\n",
|
|
" <td>45.362500</td>\n",
|
|
" <td>0.160000</td>\n",
|
|
" <td>2.000000</td>\n",
|
|
" <td>0.294118</td>\n",
|
|
" <td>861.00</td>\n",
|
|
" <td>44.0</td>\n",
|
|
" <td>1814.50</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>15.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>36.437500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>36.121667</td>\n",
|
|
" <td>0.352941</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>0.392857</td>\n",
|
|
" <td>583.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>866.92</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>11.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">2</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>29.500000</td>\n",
|
|
" <td>57.000000</td>\n",
|
|
" <td>33.414474</td>\n",
|
|
" <td>0.625000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.280488</td>\n",
|
|
" <td>206.50</td>\n",
|
|
" <td>57.0</td>\n",
|
|
" <td>2539.50</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>23.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>17.095000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>0.600000</td>\n",
|
|
" <td>1.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>239.33</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"2\" valign=\"top\">3</th>\n",
|
|
" <th>0</th>\n",
|
|
" <td>27.555556</td>\n",
|
|
" <td>28.076923</td>\n",
|
|
" <td>27.168478</td>\n",
|
|
" <td>0.181818</td>\n",
|
|
" <td>0.583333</td>\n",
|
|
" <td>0.562771</td>\n",
|
|
" <td>496.00</td>\n",
|
|
" <td>365.0</td>\n",
|
|
" <td>4999.00</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>21.0</td>\n",
|
|
" <td>130.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>18.488571</td>\n",
|
|
" <td>29.000000</td>\n",
|
|
" <td>22.933333</td>\n",
|
|
" <td>0.400000</td>\n",
|
|
" <td>0.666667</td>\n",
|
|
" <td>0.294118</td>\n",
|
|
" <td>129.42</td>\n",
|
|
" <td>29.0</td>\n",
|
|
" <td>688.00</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>10.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" mean \\\n",
|
|
" Age SibSp \n",
|
|
"Embarked C Q S C Q \n",
|
|
"Sex Pclass Survived \n",
|
|
"female 1 0 50.000000 NaN 13.500000 0.000000 NaN \n",
|
|
" 1 35.675676 33.000000 33.619048 0.523810 1.000000 \n",
|
|
" 2 0 NaN NaN 36.000000 NaN NaN \n",
|
|
" 1 19.142857 30.000000 29.091667 0.714286 0.000000 \n",
|
|
" 3 0 20.700000 28.100000 23.688889 0.500000 0.111111 \n",
|
|
" 1 11.045455 17.600000 22.548387 0.600000 0.250000 \n",
|
|
"male 1 0 43.050000 44.000000 45.362500 0.160000 2.000000 \n",
|
|
" 1 36.437500 NaN 36.121667 0.352941 NaN \n",
|
|
" 2 0 29.500000 57.000000 33.414474 0.625000 0.000000 \n",
|
|
" 1 1.000000 NaN 17.095000 0.000000 NaN \n",
|
|
" 3 0 27.555556 28.076923 27.168478 0.181818 0.583333 \n",
|
|
" 1 18.488571 29.000000 22.933333 0.400000 0.666667 \n",
|
|
"\n",
|
|
" sum \n",
|
|
" Age SibSp \n",
|
|
"Embarked S C Q S C Q S \n",
|
|
"Sex Pclass Survived \n",
|
|
"female 1 0 1.000000 50.00 NaN 27.00 0.0 NaN 2.0 \n",
|
|
" 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n",
|
|
" 2 0 0.500000 NaN NaN 216.00 NaN NaN 3.0 \n",
|
|
" 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n",
|
|
" 3 0 1.600000 103.50 140.5 1066.00 4.0 1.0 88.0 \n",
|
|
" 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n",
|
|
"male 1 0 0.294118 861.00 44.0 1814.50 4.0 2.0 15.0 \n",
|
|
" 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n",
|
|
" 2 0 0.280488 206.50 57.0 2539.50 5.0 0.0 23.0 \n",
|
|
" 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n",
|
|
" 3 0 0.562771 496.00 365.0 4999.00 6.0 21.0 130.0 \n",
|
|
" 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 "
|
|
]
|
|
},
|
|
"execution_count": 34,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Try np.sum, np.size, len\n",
|
|
"table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],\n",
|
|
" columns=['Embarked'])\n",
|
|
"table"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 35,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th colspan=\"6\" halign=\"left\">mean</th>\n",
|
|
" <th colspan=\"6\" halign=\"left\">sum</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">Age</th>\n",
|
|
" <th colspan=\"3\" halign=\"left\">SibSp</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" <th>C</th>\n",
|
|
" <th>Q</th>\n",
|
|
" <th>S</th>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" <th></th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">female</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>35.675676</td>\n",
|
|
" <td>33.0</td>\n",
|
|
" <td>33.619048</td>\n",
|
|
" <td>0.523810</td>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>0.586957</td>\n",
|
|
" <td>1320.00</td>\n",
|
|
" <td>33.0</td>\n",
|
|
" <td>1412.00</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" <td>1.0</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>19.142857</td>\n",
|
|
" <td>30.0</td>\n",
|
|
" <td>29.091667</td>\n",
|
|
" <td>0.714286</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>0.475410</td>\n",
|
|
" <td>134.00</td>\n",
|
|
" <td>30.0</td>\n",
|
|
" <td>1745.50</td>\n",
|
|
" <td>5.0</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>29.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>11.045455</td>\n",
|
|
" <td>17.6</td>\n",
|
|
" <td>22.548387</td>\n",
|
|
" <td>0.600000</td>\n",
|
|
" <td>0.250000</td>\n",
|
|
" <td>0.636364</td>\n",
|
|
" <td>121.50</td>\n",
|
|
" <td>88.0</td>\n",
|
|
" <td>699.00</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>21.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th rowspan=\"3\" valign=\"top\">male</th>\n",
|
|
" <th>1</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>36.437500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>36.121667</td>\n",
|
|
" <td>0.352941</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>0.392857</td>\n",
|
|
" <td>583.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>866.92</td>\n",
|
|
" <td>6.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>11.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>1.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>17.095000</td>\n",
|
|
" <td>0.000000</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>0.600000</td>\n",
|
|
" <td>1.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>239.33</td>\n",
|
|
" <td>0.0</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>9.0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <th>1</th>\n",
|
|
" <td>18.488571</td>\n",
|
|
" <td>29.0</td>\n",
|
|
" <td>22.933333</td>\n",
|
|
" <td>0.400000</td>\n",
|
|
" <td>0.666667</td>\n",
|
|
" <td>0.294118</td>\n",
|
|
" <td>129.42</td>\n",
|
|
" <td>29.0</td>\n",
|
|
" <td>688.00</td>\n",
|
|
" <td>4.0</td>\n",
|
|
" <td>2.0</td>\n",
|
|
" <td>10.0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" mean \\\n",
|
|
" Age SibSp \n",
|
|
"Embarked C Q S C Q \n",
|
|
"Sex Pclass Survived \n",
|
|
"female 1 1 35.675676 33.0 33.619048 0.523810 1.000000 \n",
|
|
" 2 1 19.142857 30.0 29.091667 0.714286 0.000000 \n",
|
|
" 3 1 11.045455 17.6 22.548387 0.600000 0.250000 \n",
|
|
"male 1 1 36.437500 NaN 36.121667 0.352941 NaN \n",
|
|
" 2 1 1.000000 NaN 17.095000 0.000000 NaN \n",
|
|
" 3 1 18.488571 29.0 22.933333 0.400000 0.666667 \n",
|
|
"\n",
|
|
" sum \n",
|
|
" Age SibSp \n",
|
|
"Embarked S C Q S C Q S \n",
|
|
"Sex Pclass Survived \n",
|
|
"female 1 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n",
|
|
" 2 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n",
|
|
" 3 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n",
|
|
"male 1 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n",
|
|
" 2 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n",
|
|
" 3 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 "
|
|
]
|
|
},
|
|
"execution_count": 35,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"table.query('Survived == 1')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Duplicates"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"False"
|
|
]
|
|
},
|
|
"execution_count": 36,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.duplicated().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this case there are no duplicates. If there were, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered when identifying duplicates (otherwise, it uses all the columns)."
|
|
]
|
|
},
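{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of restricting the comparison to some columns, the following uses a small hypothetical DataFrame (`toy`, with made-up values, not the Titanic data):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: rows 0 and 2 are identical; rows 1 and 3 share only 'Ticket'\n",
"toy = pd.DataFrame({'Name': ['Ann', 'Bob', 'Ann', 'Cid'],\n",
"                    'Ticket': ['A1', 'B2', 'A1', 'B2']})\n",
"\n",
"toy.duplicated().any()                  # True if any row repeats an earlier one\n",
"toy.drop_duplicates()                   # compares all the columns\n",
"toy.drop_duplicates(subset=['Ticket'])  # compares only the 'Ticket' column\n",
"```\n",
"\n",
"Both calls return a new DataFrame and, by default, keep the first occurrence of each duplicate."
]
},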
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Null and missing values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Here we check how many null values there are.\n",
|
|
"\n",
|
|
"We use sum() instead of count(); otherwise we would get the total number of records. Notice that we do not use size() here either. If you print 'df.isnull()', you will see a DataFrame of boolean values; summing them counts the True entries in each column."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"PassengerId 0\n",
|
|
"Survived 0\n",
|
|
"Pclass 0\n",
|
|
"Name 0\n",
|
|
"Sex 0\n",
|
|
"Age 177\n",
|
|
"SibSp 0\n",
|
|
"Parch 0\n",
|
|
"Ticket 0\n",
|
|
"Fare 0\n",
|
|
"Cabin 687\n",
|
|
"Embarked 2\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 37,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 54,
|
|
"metadata": {
|
|
"collapsed": false,
|
|
"scrolled": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Original (891, 10)\n",
|
|
"Cleaned (889, 10)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Drop records with missing values\n",
|
|
"df_original = df.copy()\n",
|
|
"df_clean = df.dropna()\n",
|
|
"print(\"Original\", df.shape)\n",
|
|
"print(\"Cleaned\", df_clean.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how='all'*, which deletes a sample only if all of its values are missing, instead of the default *how='any'*, which deletes a sample as soon as one of its values is missing."
|
|
]
|
|
},
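{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between the two modes of *dropna* can be sketched on a small hypothetical DataFrame (`toy`, with made-up values, not the Titanic data). The *subset* argument in the last line is an extra option that is not used elsewhere in this notebook:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: row 0 is complete, row 1 is partly missing, row 2 is empty\n",
"toy = pd.DataFrame({'Age': [22.0, np.nan, np.nan],\n",
"                    'Cabin': ['C85', np.nan, np.nan],\n",
"                    'Embarked': ['S', 'Q', np.nan]})\n",
"\n",
"toy.dropna()                     # how='any' (default): drops rows 1 and 2\n",
"toy.dropna(how='all')            # drops only row 2, where every value is missing\n",
"toy.dropna(subset=['Embarked'])  # only looks at 'Embarked': drops row 2\n",
"```\n",
"\n",
"None of these calls modifies `toy`; each one returns a new DataFrame."
]
},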
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 39,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>28.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.0 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.0 0 0 112053 30.00 B42 S \n",
|
|
"888 female 28.0 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 39,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Fill missing values with the median\n",
|
|
"df_filled = df.fillna(df.median(numeric_only=True))\n",
|
|
"df_filled[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 40,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.0 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.0 0 0 112053 30.00 B42 S \n",
|
|
"888 female NaN 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.0 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.0 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 40,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# The original df has not been modified\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Observe that the passenger with *PassengerId* 889 now has an *Age* of 28 (the median) instead of NaN.\n",
|
|
"\n",
|
|
"The *Cabin* column still contains NaN values because it is not numeric, so it has no median. We will see later how to change it.\n",
|
|
"\n",
|
|
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
|
|
"\n",
|
|
"If we want to modify the *df* object directly, we should add the parameter *inplace* with the value *True*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 41,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
|
|
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.000000 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 41,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 42,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
|
|
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.000000 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 42,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Another possibility is to assign the modified dataframe\n",
|
|
"# First we get the df with NaN values\n",
|
|
"df = df_original.copy()\n",
|
|
"#Fill NaN and assign to the column\n",
|
|
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to see how to change the Sex value of the passenger at row index 889 (PassengerId 890), and then replace the missing values of Sex. It is just an example for practice."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 43,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"PassengerId 890\n",
|
|
"Survived 1\n",
|
|
"Pclass 1\n",
|
|
"Name Behr, Mr. Karl Howell\n",
|
|
"Sex male\n",
|
|
"Age 26\n",
|
|
"SibSp 0\n",
|
|
"Parch 0\n",
|
|
"Ticket 111369\n",
|
|
"Fare 30\n",
|
|
"Cabin C148\n",
|
|
"Embarked C\n",
|
|
"Name: 889, dtype: object"
|
|
]
|
|
},
|
|
"execution_count": 43,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# The rows have no labels, so we use the numeric (positional) index\n",
|
|
"df.iloc[889]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 44,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'male'"
|
|
]
|
|
},
|
|
"execution_count": 44,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#We access row and column\n",
|
|
"df.iloc[889]['Sex']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 45,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/home/cif/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: \n",
|
|
"A value is trying to be set on a copy of a slice from a DataFrame\n",
|
|
"\n",
|
|
"See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n",
|
|
" from ipykernel import kernelapp as app\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Chained indexing assigns to a temporary copy, so df is not modified\n",
|
|
"df.iloc[889]['Sex'] = np.nan"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 46,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'male'"
|
|
]
|
|
},
|
|
"execution_count": 46,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# To change a value, we should not chain selections\n",
|
|
"# The selection can be done with the column name\n",
|
|
"df.loc[889, 'Sex']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"'male'"
|
|
]
|
|
},
|
|
"execution_count": 47,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Or with the index of the column\n",
|
|
"df.iloc[889, 4]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 48,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
|
|
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 NaN 26.000000 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.000000 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 48,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# This indexing works for changing values\n",
|
|
"df.loc[889, 'Sex'] = np.nan\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 49,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>211536</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>112053</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>B42</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>W./C. 6607</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>111369</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C148</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>370376</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Ticket Fare Cabin Embarked \n",
|
|
"886 male 27.000000 0 0 211536 13.00 NaN S \n",
|
|
"887 female 19.000000 0 0 112053 30.00 B42 S \n",
|
|
"888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n",
|
|
"889 male 26.000000 0 0 111369 30.00 C148 C \n",
|
|
"890 male 32.000000 0 0 370376 7.75 NaN Q "
|
|
]
|
|
},
|
|
"execution_count": 49,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df['Sex'].fillna('male', inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"source": [
|
|
"There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=ffill**) or the next valid value (**method=bfill**). For example, with time series, it is common to carry forward the last valid value (ffill). Another alternative is to use the method **interpolate()**.\n",
|
|
"\n",
|
|
"Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.\n"
|
|
]
|
|
},
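As a minimal sketch of these fill strategies, the following uses a made-up Series `s` (not taken from the Titanic data). It calls the dedicated *ffill()* and *bfill()* methods, which behave like the *method* argument of *fillna* and are the form preferred by recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap in the middle
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: propagate the previous valid value
forward = s.ffill()        # 1.0, 1.0, 1.0, 4.0

# Backward fill: use the next valid value
backward = s.bfill()       # 1.0, 4.0, 4.0, 4.0

# Linear interpolation between the surrounding valid values
linear = s.interpolate()   # 1.0, 2.0, 3.0, 4.0

print(forward.tolist(), backward.tolist(), linear.tolist())
```

Forward fill is usually the natural choice for time series, because it only uses information that was already available at each point in time.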
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"**Scikit-learn** also provides a preprocessing facility for managing null values: the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class (replaced by *SimpleImputer* in the *sklearn.impute* module in recent versions). We can include the imputer as a step in a *Pipeline*."
|
|
]
|
|
},
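A possible sketch of imputation as a pipeline step is shown below. It assumes a recent scikit-learn version, where the imputer is available as `SimpleImputer` in `sklearn.impute`. The arrays `X` and `y` are made-up toy data, and the decision tree is only a placeholder estimator chosen for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features (Age, Fare) with missing values, and toy labels
X = np.array([[22.0, 7.25], [np.nan, 71.28], [26.0, 7.92], [35.0, np.nan]])
y = np.array([0, 1, 1, 0])

# The imputer fills each column with its median before the classifier is trained
pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("classifier", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)

# Inspect what the fitted imputer produces for the same data
X_filled = pipeline.named_steps["imputer"].transform(X)
print(X_filled)
```

Keeping the imputer inside the pipeline means the medians are learned from the training data only and then reused when predicting on new data.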
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Analysing non numerical columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
|
|
"\n",
|
|
"**Name** and **Ticket** do not seem informative.\n",
|
|
"\n",
|
|
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
|
|
"\n",
|
|
"**Sex** and **Embarked** are categorical features, so we will encode them as integers."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 50,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Fare Embarked \n",
|
|
"886 male 27.000000 0 0 13.00 S \n",
|
|
"887 female 19.000000 0 0 30.00 S \n",
|
|
"888 female 29.699118 1 2 23.45 S \n",
|
|
"889 male 26.000000 0 0 30.00 C \n",
|
|
"890 male 32.000000 0 0 7.75 Q "
|
|
]
|
|
},
|
|
"execution_count": 50,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# We remove Cabin and Ticket. We should specify the axis\n",
|
|
"# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
|
|
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Encoding categorical values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"*Sex* is a categorical feature stored as text. scikit-learn estimators expect numeric input, so we need to encode it as numbers. Note that estimators treat integer codes as continuous, ordered values, which does not reflect the nature of unordered categories. "
|
|
]
|
|
},
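To illustrate the ordering concern, the sketch below compares integer codes with one-hot encoding on a made-up `toy` frame. One-hot encoding through `pd.get_dummies` is an alternative brought in here for illustration. It is not the approach this notebook follows, which uses integer codes.

```python
import pandas as pd

# Hypothetical column with the three embarkation ports
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Integer codes: compact, but 0 < 1 < 2 suggests an order that does not exist
codes = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# One-hot encoding: one indicator column per category, with no implied order
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked", dtype=int)

print(codes.tolist())
print(one_hot)
```

For a binary feature such as *Sex*, a single 0/1 column carries the same information as its one-hot version, so the ordering issue mainly matters for features with three or more categories, such as *Embarked*.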
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 51,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"False"
|
|
]
|
|
},
|
|
"execution_count": 51,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# First we check if there are any null values. Observe the use of any()\n",
|
|
"df['Sex'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 52,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array(['male', 'female'], dtype=object)"
|
|
]
|
|
},
|
|
"execution_count": 52,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Now we check the values of Sex\n",
|
|
"df['Sex'].unique()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to encode the values with our pandas knowledge."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 53,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>S</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>C</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>Q</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Fare Embarked \n",
|
|
"886 0 27.000000 0 0 13.00 S \n",
|
|
"887 1 19.000000 0 0 30.00 S \n",
|
|
"888 1 29.699118 1 2 23.45 S \n",
|
|
"889 0 26.000000 0 0 30.00 C \n",
|
|
"890 0 32.000000 0 0 7.75 Q "
|
|
]
|
|
},
|
|
"execution_count": 53,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
|
|
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Ticket</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Cabin</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" <th>Gender</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Braund, Mr. Owen Harris</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>22.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>A/5 21171</td>\n",
|
|
" <td>7.2500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>2</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>38.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>PC 17599</td>\n",
|
|
" <td>71.2833</td>\n",
|
|
" <td>C85</td>\n",
|
|
" <td>C</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>3</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Heikkinen, Miss. Laina</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>26.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>STON/O2. 3101282</td>\n",
|
|
" <td>7.9250</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>4</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>113803</td>\n",
|
|
" <td>53.1000</td>\n",
|
|
" <td>C123</td>\n",
|
|
" <td>S</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>5</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Allen, Mr. William Henry</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>35.0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>373450</td>\n",
|
|
" <td>8.0500</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>S</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass \\\n",
|
|
"0 1 0 3 \n",
|
|
"1 2 1 1 \n",
|
|
"2 3 1 3 \n",
|
|
"3 4 1 1 \n",
|
|
"4 5 0 3 \n",
|
|
"\n",
|
|
" Name Sex Age SibSp \\\n",
|
|
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
|
|
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
|
|
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
|
|
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
|
|
"4 Allen, Mr. William Henry male 35.0 0 \n",
|
|
"\n",
|
|
" Parch Ticket Fare Cabin Embarked Gender \n",
|
|
"0 0 A/5 21171 7.2500 NaN S 0 \n",
|
|
"1 0 PC 17599 71.2833 C85 C 1 \n",
|
|
"2 0 STON/O2. 3101282 7.9250 NaN S 1 \n",
|
|
"3 0 113803 53.1000 C123 S 1 \n",
|
|
"4 0 373450 8.0500 NaN S 0 "
|
|
]
|
|
},
|
|
"execution_count": 8,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# An alternative is to create a new column with the encoded values by defining a mapping\n",
|
|
"df = df_original.copy()\n",
|
|
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 51,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"True"
|
|
]
|
|
},
|
|
"execution_count": 51,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Check nulls\n",
|
|
"df['Embarked'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 110,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"2"
|
|
]
|
|
},
|
|
"execution_count": 110,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Check how many nulls\n",
|
|
"\n",
|
|
"df['Embarked'].isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 111,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"array(['S', 'C', 'Q', nan], dtype=object)"
|
|
]
|
|
},
|
|
"execution_count": 111,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Check values\n",
|
|
"df['Embarked'].unique()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 112,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Embarked\n",
|
|
"C 168\n",
|
|
"Q 77\n",
|
|
"S 644\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 112,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Check distribution of Embarked\n",
|
|
"df.groupby('Embarked').size()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 113,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"False"
|
|
]
|
|
},
|
|
"execution_count": 113,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"#Replace nulls with the most common value\n",
|
|
"df['Embarked'].fillna('S', inplace=True)\n",
|
|
"df['Embarked'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 114,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>PassengerId</th>\n",
|
|
" <th>Survived</th>\n",
|
|
" <th>Pclass</th>\n",
|
|
" <th>Name</th>\n",
|
|
" <th>Sex</th>\n",
|
|
" <th>Age</th>\n",
|
|
" <th>SibSp</th>\n",
|
|
" <th>Parch</th>\n",
|
|
" <th>Fare</th>\n",
|
|
" <th>Embarked</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>886</th>\n",
|
|
" <td>887</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>Montvila, Rev. Juozas</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>27.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>13.00</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>887</th>\n",
|
|
" <td>888</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Graham, Miss. Margaret Edith</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>19.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>888</th>\n",
|
|
" <td>889</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
|
|
" <td>female</td>\n",
|
|
" <td>29.699118</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>2</td>\n",
|
|
" <td>23.45</td>\n",
|
|
" <td>0</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>889</th>\n",
|
|
" <td>890</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>1</td>\n",
|
|
" <td>Behr, Mr. Karl Howell</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>26.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>30.00</td>\n",
|
|
" <td>1</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>890</th>\n",
|
|
" <td>891</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>3</td>\n",
|
|
" <td>Dooley, Mr. Patrick</td>\n",
|
|
" <td>male</td>\n",
|
|
" <td>32.000000</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>0</td>\n",
|
|
" <td>7.75</td>\n",
|
|
" <td>2</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" PassengerId Survived Pclass Name \\\n",
|
|
"886 887 0 2 Montvila, Rev. Juozas \n",
|
|
"887 888 1 1 Graham, Miss. Margaret Edith \n",
|
|
"888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n",
|
|
"889 890 1 1 Behr, Mr. Karl Howell \n",
|
|
"890 891 0 3 Dooley, Mr. Patrick \n",
|
|
"\n",
|
|
" Sex Age SibSp Parch Fare Embarked \n",
|
|
"886 male 27.000000 0 0 13.00 0 \n",
|
|
"887 female 19.000000 0 0 30.00 0 \n",
|
|
"888 female 29.699118 1 2 23.45 0 \n",
|
|
"889 male 26.000000 0 0 30.00 1 \n",
|
|
"890 male 32.000000 0 0 7.75 2 "
|
|
]
|
|
},
|
|
"execution_count": 114,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"# Now we replace as previosly the categories with integers\n",
|
|
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
|
|
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
|
|
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Although this transformation can be ok, we are introducing *an error*. Some classifiers could think that there is an order in S, C, Q, and that Q is higher than S. \n",
|
|
"\n",
|
|
"To avoid this error, Scikit learn provides a facility for transforming all the categorical features into integer ones. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
|
|
"\n",
|
|
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
|
|
]
|
|
},
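{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a preview, the idea of one dummy binary feature per category can be sketched on a toy column. This is only a minimal sketch, assuming scikit-learn 0.20 or later (where `OneHotEncoder` accepts string categories); the `toy` DataFrame and the `Embarked_` column prefix are illustrative names, not part of the Titanic dataset. Each row ends up with a single 1, so no order between S, C and Q is implied."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Toy stand-in for the Embarked column (hypothetical data, not the Titanic file)\n",
"toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})\n",
"\n",
"# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense\n",
"encoder = OneHotEncoder()\n",
"onehot = encoder.fit_transform(toy[['Embarked']]).toarray()\n",
"\n",
"# One binary column per category found during fit\n",
"columns = ['Embarked_' + c for c in encoder.categories_[0]]\n",
"pd.DataFrame(onehot, columns=columns, index=toy.index)"
]
},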
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [Pandas](http://pandas.pydata.org/)\n",
|
|
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n",
|
|
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n",
|
|
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
|
|
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
|
|
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.2"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|