{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)\n", "* [Examining a DataFrame](#Examining-a-DataFrame)\n", "* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)\n", "* [Grouping](#Grouping)\n", "* [Pivot tables](#Pivot-tables)\n", "* [Null and missing values](#Null-and-missing-values)\n", "* [Analysing non numerical columns](#Analysing-non-numerical-columns)\n", "* [Encoding categorical values](#Encoding-categorical-values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data munging with Pandas and Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n", "\n", "*Scikit-learn* estimators assume that all values are numerical. This is common in many machine learning libraries, so we need to preprocess our raw dataset. 
\n", "Some of the most common tasks are:\n", "* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n", "* Encode categorical variables as integers\n", "* Combine datasets\n", "* Rename variables and convert types\n", "* Transform / scale variables\n", "\n", "We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n", "\n", "First we load the dataset and we get a dataframe." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pandas import Series, DataFrame\n", "\n", "df = pd.read_csv('data-titanic/train.csv')\n", "\n", "# Show the first 5 rows\n", "df[:5]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Examining a DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can examine properties of the dataset." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 891 entries, 0 to 890\n", "Data columns (total 12 columns):\n", "PassengerId 891 non-null int64\n", "Survived 891 non-null int64\n", "Pclass 891 non-null int64\n", "Name 891 non-null object\n", "Sex 891 non-null object\n", "Age 714 non-null float64\n", "SibSp 891 non-null int64\n", "Parch 891 non-null int64\n", "Ticket 891 non-null object\n", "Fare 891 non-null float64\n", "Cabin 204 non-null object\n", "Embarked 889 non-null object\n", "dtypes: float64(2), int64(5), object(5)\n", "memory usage: 83.6+ KB\n" ] } ], "source": [ "# Information about columns and their types\n", "df.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that some features have a numerical type (int64 and float64), while others have the type *object*. In Pandas, the *object* type is typically used for strings. We observe that most features are numerical, except for Name, Sex, Ticket, Cabin and Embarked." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Name object\n", "Sex object\n", "Ticket object\n", "Cabin object\n", "Embarked object\n", "dtype: object" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can list the non-numerical columns with boolean indexing on the Series df.dtypes\n", "df.dtypes[df.dtypes == object]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's explore the DataFrame." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(891, 12)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of samples and features\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassAgeSibSpParchFare
count891.000000891.000000891.000000714.000000891.000000891.000000891.000000
mean446.0000000.3838382.30864229.6991180.5230080.38159432.204208
std257.3538420.4865920.83607114.5264971.1027430.80605749.693429
min1.0000000.0000001.0000000.4200000.0000000.0000000.000000
25%223.5000000.0000002.00000020.1250000.0000000.0000007.910400
50%446.0000000.0000003.00000028.0000000.0000000.00000014.454200
75%668.5000001.0000003.00000038.0000001.0000000.00000031.000000
max891.0000001.0000003.00000080.0000008.0000006.000000512.329200
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Age SibSp \\\n", "count 891.000000 891.000000 891.000000 714.000000 891.000000 \n", "mean 446.000000 0.383838 2.308642 29.699118 0.523008 \n", "std 257.353842 0.486592 0.836071 14.526497 1.102743 \n", "min 1.000000 0.000000 1.000000 0.420000 0.000000 \n", "25% 223.500000 0.000000 2.000000 20.125000 0.000000 \n", "50% 446.000000 0.000000 3.000000 28.000000 0.000000 \n", "75% 668.500000 1.000000 3.000000 38.000000 1.000000 \n", "max 891.000000 1.000000 3.000000 80.000000 8.000000 \n", "\n", " Parch Fare \n", "count 891.000000 891.000000 \n", "mean 0.381594 32.204208 \n", "std 0.806057 49.693429 \n", "min 0.000000 0.000000 \n", "25% 0.000000 7.910400 \n", "50% 0.000000 14.454200 \n", "75% 0.000000 31.000000 \n", "max 6.000000 512.329200 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Basic statistics of the dataset in all the numeric columns\n", "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass), we could have selected only the interesting columns." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedAgeSibSpParchFare
count891.000000714.000000891.000000891.000000891.000000
mean0.38383829.6991180.5230080.38159432.204208
std0.48659214.5264971.1027430.80605749.693429
min0.0000000.4200000.0000000.0000000.000000
25%0.00000020.1250000.0000000.0000007.910400
50%0.00000028.0000000.0000000.00000014.454200
75%1.00000038.0000001.0000000.00000031.000000
max1.00000080.0000008.0000006.000000512.329200
\n", "
" ], "text/plain": [ " Survived Age SibSp Parch Fare\n", "count 891.000000 714.000000 891.000000 891.000000 891.000000\n", "mean 0.383838 29.699118 0.523008 0.381594 32.204208\n", "std 0.486592 14.526497 1.102743 0.806057 49.693429\n", "min 0.000000 0.420000 0.000000 0.000000 0.000000\n", "25% 0.000000 20.125000 0.000000 0.000000 7.910400\n", "50% 0.000000 28.000000 0.000000 0.000000 14.454200\n", "75% 1.000000 38.000000 1.000000 0.000000 31.000000\n", "max 1.000000 80.000000 8.000000 6.000000 512.329200" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Describe statistics of relevant columns. We pass a list of columns\n", "df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Selecting rows in a DataFrame" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the first 5 rows\n", "df.head(5)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.00021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"femaleNaN12W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.0 0 0 211536 13.00 NaN S \n", "887 female 19.0 0 0 112053 30.00 B42 S \n", "888 female NaN 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.0 0 0 111369 30.00 C148 C \n", "890 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the last 5 rows\n", "df.tail(5)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.925NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.100C123S
4503Allen, Mr. William Henrymale35.0003734508.050NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp Parch \\\n", "2 Heikkinen, Miss. Laina female 26.0 0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 \n", "4 Allen, Mr. William Henry male 35.0 0 0 \n", "\n", " Ticket Fare Cabin Embarked \n", "2 STON/O2. 3101282 7.925 NaN S \n", "3 113803 53.100 C123 S \n", "4 373450 8.050 NaN S " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select several rows\n", "df[2:5]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 0\n", "1 1\n", "2 1\n", "3 1\n", "4 0\n", "Name: Survived, dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select the first 5 values of a column by name\n", "df['Survived'][:5]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SurvivedSexAge
00male22.0
11female38.0
21female26.0
31female35.0
40male35.0
\n", "
" ], "text/plain": [ " Survived Sex Age\n", "0 0 male 22.0\n", "1 1 female 38.0\n", "2 1 female 26.0\n", "3 1 female 35.0\n", "4 0 male 35.0" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select several columns. Observe that the first parameter is a list\n", "df[['Survived', 'Sex', 'Age']][:5]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 True\n", "2 False\n", "3 True\n", "4 True\n", "5 False\n", "6 True\n", "7 False\n", "8 False\n", "9 False\n", "10 False\n", "11 True\n", "12 False\n", "13 True\n", "14 False\n", "15 True\n", "16 False\n", "17 False\n", "18 True\n", "19 False\n", "20 True\n", "21 True\n", "22 False\n", "23 False\n", "24 False\n", "25 True\n", "26 False\n", "27 False\n", "28 False\n", "29 False\n", " ... \n", "861 False\n", "862 True\n", "863 False\n", "864 False\n", "865 True\n", "866 False\n", "867 True\n", "868 False\n", "869 False\n", "870 False\n", "871 True\n", "872 True\n", "873 True\n", "874 False\n", "875 False\n", "876 False\n", "877 False\n", "878 False\n", "879 True\n", "880 False\n", "881 True\n", "882 False\n", "883 False\n", "884 False\n", "885 True\n", "886 False\n", "887 False\n", "888 False\n", "889 False\n", "890 True\n", "Name: Age, dtype: bool" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Passengers older than 20. Observe dataframe columns can be accessed like attributes.\n", "df.Age > 30" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88488503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.050NaNS
88588603Rice, Mrs. William (Margaret Norton)female39.00538265229.125NaNQ
88688702Montvila, Rev. Juozasmale27.00021153613.000NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.750NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "884 885 0 3 Sutehall, Mr. Henry Jr \n", "885 886 0 3 Rice, Mrs. William (Margaret Norton) \n", "886 887 0 2 Montvila, Rev. Juozas \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n", "885 female 39.0 0 5 382652 29.125 NaN Q \n", "886 male 27.0 0 0 211536 13.000 NaN S \n", "889 male 26.0 0 0 111369 30.000 C148 C \n", "890 male 32.0 0 0 370376 7.750 NaN Q " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select passengers older than 20 (only the last 5). We use boolean indexing\n", "df[df.Age > 20][-5:]" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female28.010P/PP 338124.0000NaNC
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88088112Shelley, Mrs. William (Imanita Parrish Hall)female25.00123043326.0000NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "871 872 1 1 \n", "874 875 1 2 \n", "879 880 1 1 \n", "880 881 1 2 \n", "889 890 1 1 \n", "\n", " Name Sex Age SibSp \\\n", "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n", "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n", "879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n", "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "871 1 11751 52.5542 D35 S \n", "874 0 P/PP 3381 24.0000 NaN C \n", "879 1 11767 83.1583 C50 C \n", "880 1 230433 26.0000 NaN S \n", "889 0 111369 30.0000 C148 C " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select passengers older than 20 that survived (only the last 5)\n", "df[(df.Age > 20) & (df.Survived == 1)][-5:]" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
87187211Beckwith, Mrs. Richard Leonard (Sallie Monypeny)female47.0111175152.5542D35S
87487512Abelson, Mrs. Samuel (Hannah Wizosky)female28.010P/PP 338124.0000NaNC
87988011Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)female56.0011176783.1583C50C
88088112Shelley, Mrs. William (Imanita Parrish Hall)female25.00123043326.0000NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.0000C148C
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "871 872 1 1 \n", "874 875 1 2 \n", "879 880 1 1 \n", "880 881 1 2 \n", "889 890 1 1 \n", "\n", " Name Sex Age SibSp \\\n", "871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 \n", "874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 \n", "879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 \n", "880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 \n", "889 Behr, Mr. Karl Howell male 26.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "871 1 11751 52.5542 D35 S \n", "874 0 P/PP 3381 24.0000 NaN C \n", "879 1 11767 83.1583 C50 C \n", "880 1 230433 26.0000 NaN S \n", "889 0 111369 30.0000 C148 C " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternative syntax with query to the standard Python \n", "# In large dataframes, the perfomance of DataFrame.query() using numexpr is considerable faster, look at the references\n", "df.query('Age > 20 and Survived == 1')[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "DataFrames provide a set of functions for selection that we will need later\n", "\n", "\n", "|Operation | Syntax | Result |\n", "|-----------------------------|\n", "|Select column | df[col] | Series |\n", "|Select row by label | df.loc[label] | Series |\n", "|Select row by integer location | df.iloc[loc] | Series |\n", "|Slice rows\t | df[5:10]\t | DataFrame |\n", "|Select rows by boolean vector | df[bool_vec] | DataFrame |" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "887 19.0\n", "888 NaN\n", "889 26.0\n", "890 32.0\n", "Name: Age, dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select column and show last 4\n", "df['Age'][-4:]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "887 19.0\n", "888 NaN\n", "889 26.0\n", "890 32.0\n", "Name: Age, dtype: float64" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select row by label. We select with [index-labels, column-labels], and show last 4\n", "df.loc[:, 'Age'][-4:]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "887 19.0\n", "888 NaN\n", "889 26.0\n", "890 32.0\n", "Name: Age, dtype: float64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Select row by column index (Age is the column 5), and show last 4\n", "df.iloc[:, 5][-4:]" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.00021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"femaleNaN12W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.0 0 0 211536 13.00 NaN S \n", "887 female 19.0 0 0 112053 30.00 B42 S \n", "888 female NaN 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.0 0 0 111369 30.00 C148 C \n", "890 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Slice rows - last 5 columns\n", "df[-5:]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88488503Sutehall, Mr. Henry Jrmale25.000SOTON/OQ 3920767.050NaNS
88588603Rice, Mrs. William (Margaret Norton)female39.00538265229.125NaNQ
88688702Montvila, Rev. Juozasmale27.00021153613.000NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.000C148C
89089103Dooley, Mr. Patrickmale32.0003703767.750NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "884 885 0 3 Sutehall, Mr. Henry Jr \n", "885 886 0 3 Rice, Mrs. William (Margaret Norton) \n", "886 887 0 2 Montvila, Rev. Juozas \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "884 male 25.0 0 0 SOTON/OQ 392076 7.050 NaN S \n", "885 female 39.0 0 5 382652 29.125 NaN Q \n", "886 male 27.0 0 0 211536 13.000 NaN S \n", "889 male 26.0 0 0 111369 30.000 C148 C \n", "890 male 32.0 0 0 370376 7.750 NaN Q " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select based on boolean vector and show last 5 columns\n", "df[df.Age > 20][-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grouping" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rows can be grouped by one or more columns, and apply aggregated operators on the GroupBy object." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Sex\n", "female 314\n", "male 577\n", "dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Number of users per sex (SQL like)\n", "df.groupby('Sex').size()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedAgeSibSpParchFare
Pclass
1461.5972220.62963038.2334410.4166670.35648184.154687
2445.9565220.47282629.8776300.4021740.38043520.662183
3439.1547860.24236325.1406200.6150710.39307513.675550
\n", "
" ], "text/plain": [ " PassengerId Survived Age SibSp Parch Fare\n", "Pclass \n", "1 461.597222 0.629630 38.233441 0.416667 0.356481 84.154687\n", "2 445.956522 0.472826 29.877630 0.402174 0.380435 20.662183\n", "3 439.154786 0.242363 25.140620 0.615071 0.393075 13.675550" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Mean age of passengers per Passenger class\n", "\n", "#First we calculate the mean\n", "df.groupby('Pclass').mean()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Pclass\n", "1 38.233441\n", "2 29.877630\n", "3 25.140620\n", "Name: Age, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#And now we answer the initial query (only mean age)\n", "df.groupby('Pclass')['Age'].mean()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Pclass\n", "1 38.233441\n", "2 29.877630\n", "3 25.140620\n", "Name: Age, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Alternative syntax\n", "df.groupby('Pclass').Age.mean()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSp
PclassSex
1female34.6117650.553191
male41.2813860.311475
2female28.7229730.486842
male30.7407070.342593
3female21.7500000.895833
male26.5075890.498559
\n", "
" ], "text/plain": [ " Age SibSp\n", "Pclass Sex \n", "1 female 34.611765 0.553191\n", " male 41.281386 0.311475\n", "2 female 28.722973 0.486842\n", " male 30.740707 0.342593\n", "3 female 21.750000 0.895833\n", " male 26.507589 0.498559" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Mean Age and SibSp of passengers grouped by passenger class and sex\n", "df.groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSp
PclassSex
1female42.0526320.473684
male45.0172410.333333
2female36.5666670.444444
male38.8095240.301587
3female34.9594590.513514
male35.7782260.185484
\n", "
" ], "text/plain": [ " Age SibSp\n", "Pclass Sex \n", "1 female 42.052632 0.473684\n", " male 45.017241 0.333333\n", "2 female 36.566667 0.444444\n", " male 38.809524 0.301587\n", "3 female 34.959459 0.513514\n", " male 35.778226 0.185484" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n", "df[df.Age > 25].groupby(['Pclass', 'Sex'])['Age','SibSp'].mean()" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSpSurvived
PclassSex
1female34.6117650.5411760.964706
male41.6850000.3700000.390000
2female28.7229730.5000000.918919
male32.3297870.3510640.106383
3female22.6020410.8061220.438776
male26.7131470.4900400.143426
\n", "
" ], "text/plain": [ " Age SibSp Survived\n", "Pclass Sex \n", "1 female 34.611765 0.541176 0.964706\n", " male 41.685000 0.370000 0.390000\n", "2 female 28.722973 0.500000 0.918919\n", " male 32.329787 0.351064 0.106383\n", "3 female 22.602041 0.806122 0.438776\n", " male 26.713147 0.490040 0.143426" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n", "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].mean()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSpSurvived
PclassSex
1female34.6117650.54117685
male41.6850000.370000100
2female28.7229730.50000074
male32.3297870.35106494
3female22.6020410.80612298
male26.7131470.490040251
\n", "
" ], "text/plain": [ " Age SibSp Survived\n", "Pclass Sex \n", "1 female 34.611765 0.541176 85\n", " male 41.685000 0.370000 100\n", "2 female 28.722973 0.500000 74\n", " male 32.329787 0.351064 94\n", "3 female 22.602041 0.806122 98\n", " male 26.713147 0.490040 251" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can also decide which function apply in each column\n", "\n", "#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n", "df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])['Age','SibSp','Survived'].agg({'Age': np.mean, \n", " 'SibSp': np.mean, 'Survived': np.size})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pivot tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pivot tables are an intuitive way to analyze data, and alternative to group columns." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeFareParchPassengerIdPclassSibSpSurvived
Sex
female27.91570944.4798180.649682431.0286622.1592360.6942680.742038
male30.72664525.5238930.235702454.1473142.3899480.4298090.188908
\n", "
" ], "text/plain": [ " Age Fare Parch PassengerId Pclass SibSp \\\n", "Sex \n", "female 27.915709 44.479818 0.649682 431.028662 2.159236 0.694268 \n", "male 30.726645 25.523893 0.235702 454.147314 2.389948 0.429809 \n", "\n", " Survived \n", "Sex \n", "female 0.742038 \n", "male 0.188908 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, index='Sex')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeFareParchPassengerIdSibSpSurvived
SexPclass
female134.611765106.1257980.457447469.2127660.5531910.968085
228.72297321.9701210.605263443.1052630.4868420.921053
321.75000016.1188100.798611399.7291670.8958330.500000
male141.28138667.2261270.278689455.7295080.3114750.368852
230.74070719.7417820.222222447.9629630.3425930.157407
326.50758912.6616330.224784455.5158500.4985590.135447
\n", "
" ], "text/plain": [ " Age Fare Parch PassengerId SibSp \\\n", "Sex Pclass \n", "female 1 34.611765 106.125798 0.457447 469.212766 0.553191 \n", " 2 28.722973 21.970121 0.605263 443.105263 0.486842 \n", " 3 21.750000 16.118810 0.798611 399.729167 0.895833 \n", "male 1 41.281386 67.226127 0.278689 455.729508 0.311475 \n", " 2 30.740707 19.741782 0.222222 447.962963 0.342593 \n", " 3 26.507589 12.661633 0.224784 455.515850 0.498559 \n", "\n", " Survived \n", "Sex Pclass \n", "female 1 0.968085 \n", " 2 0.921053 \n", " 3 0.500000 \n", "male 1 0.368852 \n", " 2 0.157407 \n", " 3 0.135447 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, index=['Sex', 'Pclass'])" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSp
SexPclass
female134.6117650.553191
228.7229730.486842
321.7500000.895833
male141.2813860.311475
230.7407070.342593
326.5075890.498559
\n", "
" ], "text/plain": [ " Age SibSp\n", "Sex Pclass \n", "female 1 34.611765 0.553191\n", " 2 28.722973 0.486842\n", " 3 21.750000 0.895833\n", "male 1 41.281386 0.311475\n", " 2 30.740707 0.342593\n", " 3 26.507589 0.498559" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AgeSibSp
SexPclass
female134.6117650.553191
228.7229730.486842
321.7500000.895833
male141.2813860.311475
230.7407070.342593
326.5075890.498559
\n", "
" ], "text/plain": [ " Age SibSp\n", "Sex Pclass \n", "female 1 34.611765 0.553191\n", " 2 28.722973 0.486842\n", " 3 21.750000 0.895833\n", "male 1 41.281386 0.311475\n", " 2 30.740707 0.342593\n", " 3 26.507589 0.498559" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansum
AgeSibSpAgeSibSp
SexPclass
female134.6117650.5531912942.0052
228.7229730.4868422125.5037
321.7500000.8958332218.50129
male141.2813860.3114754169.4238
230.7407070.3425933043.3337
326.5075890.4985596706.42173
\n", "
" ], "text/plain": [ " mean sum \n", " Age SibSp Age SibSp\n", "Sex Pclass \n", "female 1 34.611765 0.553191 2942.00 52\n", " 2 28.722973 0.486842 2125.50 37\n", " 3 21.750000 0.895833 2218.50 129\n", "male 1 41.281386 0.311475 4169.42 38\n", " 2 30.740707 0.342593 3043.33 37\n", " 3 26.507589 0.498559 6706.42 173" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try np.sum, np.size, len\n", "pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansum
AgeSibSpAgeSibSp
EmbarkedCQSCQSCQSCQS
SexPclassSurvived
female1050.000000NaN13.5000000.000000NaN1.00000050.00NaN27.000.0NaN2.0
135.67567633.00000033.6190480.5238101.0000000.5869571320.0033.01412.0022.01.027.0
20NaNNaN36.000000NaNNaN0.500000NaNNaN216.00NaNNaN3.0
119.14285730.00000029.0916670.7142860.0000000.475410134.0030.01745.505.00.029.0
3020.70000028.10000023.6888890.5000000.1111111.600000103.50140.51066.004.01.088.0
111.04545517.60000022.5483870.6000000.2500000.636364121.5088.0699.009.06.021.0
male1043.05000044.00000045.3625000.1600002.0000000.294118861.0044.01814.504.02.015.0
136.437500NaN36.1216670.352941NaN0.392857583.00NaN866.926.0NaN11.0
2029.50000057.00000033.4144740.6250000.0000000.280488206.5057.02539.505.00.023.0
11.000000NaN17.0950000.000000NaN0.6000001.00NaN239.330.0NaN9.0
3027.55555628.07692327.1684780.1818180.5833330.562771496.00365.04999.006.021.0130.0
118.48857129.00000022.9333330.4000000.6666670.294118129.4229.0688.004.02.010.0
\n", "
" ], "text/plain": [ " mean \\\n", " Age SibSp \n", "Embarked C Q S C Q \n", "Sex Pclass Survived \n", "female 1 0 50.000000 NaN 13.500000 0.000000 NaN \n", " 1 35.675676 33.000000 33.619048 0.523810 1.000000 \n", " 2 0 NaN NaN 36.000000 NaN NaN \n", " 1 19.142857 30.000000 29.091667 0.714286 0.000000 \n", " 3 0 20.700000 28.100000 23.688889 0.500000 0.111111 \n", " 1 11.045455 17.600000 22.548387 0.600000 0.250000 \n", "male 1 0 43.050000 44.000000 45.362500 0.160000 2.000000 \n", " 1 36.437500 NaN 36.121667 0.352941 NaN \n", " 2 0 29.500000 57.000000 33.414474 0.625000 0.000000 \n", " 1 1.000000 NaN 17.095000 0.000000 NaN \n", " 3 0 27.555556 28.076923 27.168478 0.181818 0.583333 \n", " 1 18.488571 29.000000 22.933333 0.400000 0.666667 \n", "\n", " sum \n", " Age SibSp \n", "Embarked S C Q S C Q S \n", "Sex Pclass Survived \n", "female 1 0 1.000000 50.00 NaN 27.00 0.0 NaN 2.0 \n", " 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n", " 2 0 0.500000 NaN NaN 216.00 NaN NaN 3.0 \n", " 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n", " 3 0 1.600000 103.50 140.5 1066.00 4.0 1.0 88.0 \n", " 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n", "male 1 0 0.294118 861.00 44.0 1814.50 4.0 2.0 15.0 \n", " 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n", " 2 0 0.280488 206.50 57.0 2539.50 5.0 0.0 23.0 \n", " 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n", " 3 0 0.562771 496.00 365.0 4999.00 6.0 21.0 130.0 \n", " 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 " ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Try np.sum, np.size, len\n", "table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],\n", " columns=['Embarked'])\n", "table" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansum
AgeSibSpAgeSibSp
EmbarkedCQSCQSCQSCQS
SexPclassSurvived
female1135.67567633.033.6190480.5238101.0000000.5869571320.0033.01412.0022.01.027.0
2119.14285730.029.0916670.7142860.0000000.475410134.0030.01745.505.00.029.0
3111.04545517.622.5483870.6000000.2500000.636364121.5088.0699.009.06.021.0
male1136.437500NaN36.1216670.352941NaN0.392857583.00NaN866.926.0NaN11.0
211.000000NaN17.0950000.000000NaN0.6000001.00NaN239.330.0NaN9.0
3118.48857129.022.9333330.4000000.6666670.294118129.4229.0688.004.02.010.0
\n", "
" ], "text/plain": [ " mean \\\n", " Age SibSp \n", "Embarked C Q S C Q \n", "Sex Pclass Survived \n", "female 1 1 35.675676 33.0 33.619048 0.523810 1.000000 \n", " 2 1 19.142857 30.0 29.091667 0.714286 0.000000 \n", " 3 1 11.045455 17.6 22.548387 0.600000 0.250000 \n", "male 1 1 36.437500 NaN 36.121667 0.352941 NaN \n", " 2 1 1.000000 NaN 17.095000 0.000000 NaN \n", " 3 1 18.488571 29.0 22.933333 0.400000 0.666667 \n", "\n", " sum \n", " Age SibSp \n", "Embarked S C Q S C Q S \n", "Sex Pclass Survived \n", "female 1 1 0.586957 1320.00 33.0 1412.00 22.0 1.0 27.0 \n", " 2 1 0.475410 134.00 30.0 1745.50 5.0 0.0 29.0 \n", " 3 1 0.636364 121.50 88.0 699.00 9.0 6.0 21.0 \n", "male 1 1 0.392857 583.00 NaN 866.92 6.0 NaN 11.0 \n", " 2 1 0.600000 1.00 NaN 239.33 0.0 NaN 9.0 \n", " 3 1 0.294118 129.42 29.0 688.00 4.0 2.0 10.0 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table.query('Survived == 1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Duplicates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.duplicated().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Null and missing values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we check how many null values there are.\n", "\n", "We use sum() instead of count(), because count() would return the total number of records (*isnull()* yields a boolean for every cell). For the same reason, we do not use size() here either. You can print 'df.isnull()' to see a DataFrame of boolean values." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PassengerId 0\n", "Survived 0\n", "Pclass 0\n", "Name 0\n", "Sex 0\n", "Age 177\n", "SibSp 0\n", "Parch 0\n", "Ticket 0\n", "Fare 0\n", "Cabin 687\n", "Embarked 2\n", "dtype: int64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original (891, 10)\n", "Cleaned (889, 10)\n" ] } ], "source": [ "# Drop records with missing values\n", "df_original = df.copy()\n", "df_clean = df.dropna()\n", "print(\"Original\", df.shape)\n", "print(\"Cleaned\", df_clean.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how='all'*, which deletes a sample only if all of its values are missing, instead of the default *how='any'*." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.00021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"female28.012W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.0 0 0 211536 13.00 NaN S \n", "887 female 19.0 0 0 112053 30.00 B42 S \n", "888 female 28.0 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.0 0 0 111369 30.00 C148 C \n", "890 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill missing values with the median\n", "df_filled = df.fillna(df.median())\n", "df_filled[-5:]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.00021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.00011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"femaleNaN12W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.00011136930.00C148C
89089103Dooley, Mr. Patrickmale32.0003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.0 0 0 211536 13.00 NaN S \n", "887 female 19.0 0 0 112053 30.00 B42 S \n", "888 female NaN 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.0 0 0 111369 30.00 C148 C \n", "890 male 32.0 0 0 370376 7.75 NaN Q " ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#The original df has not been modified\n", "df[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n", "\n", "Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n", "\n", "In addition, we could drop rows with any or all null values (method *dropna()*).\n", "\n", "If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.0000000021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.0000000011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"female29.69911812W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.0000000011136930.00C148C
89089103Dooley, Mr. Patrickmale32.000000003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.000000 0 0 211536 13.00 NaN S \n", "887 female 19.000000 0 0 112053 30.00 B42 S \n", "888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.000000 0 0 111369 30.00 C148 C \n", "890 male 32.000000 0 0 370376 7.75 NaN Q " ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Age'].fillna(df['Age'].mean(), inplace=True)\n", "df[-5:]" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
88688702Montvila, Rev. Juozasmale27.0000000021153613.00NaNS
88788811Graham, Miss. Margaret Edithfemale19.0000000011205330.00B42S
88888903Johnston, Miss. Catherine Helen \"Carrie\"female29.69911812W./C. 660723.45NaNS
88989011Behr, Mr. Karl Howellmale26.0000000011136930.00C148C
89089103Dooley, Mr. Patrickmale32.000000003703767.75NaNQ
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.000000 0 0 211536 13.00 NaN S \n", "887 female 19.000000 0 0 112053 30.00 B42 S \n", "888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.000000 0 0 111369 30.00 C148 C \n", "890 male 32.000000 0 0 370376 7.75 NaN Q " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Another possibility is to assign the modified dataframe\n", "# First we get the df with NaN values\n", "df = df_original.copy()\n", "#Fill NaN and assign to the column\n", "df['Age'] = df['Age'].fillna(df['Age'].median())\n", "df[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to see how to change the Sex value of PassengerId 889, and then replace the missing values of Sex. It is just an example for practicing." ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "PassengerId 890\n", "Survived 1\n", "Pclass 1\n", "Name Behr, Mr. 
Karl Howell\n", "Sex male\n", "Age 26\n", "SibSp 0\n", "Parch 0\n", "Ticket 111369\n", "Fare 30\n", "Cabin C148\n", "Embarked C\n", "Name: 889, dtype: object" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# There are no labels for rows, so we use the numeric index\n", "df.iloc[889]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'male'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#We access row and column\n", "df.iloc[889]['Sex']" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/cif/anaconda3/lib/python3.5/site-packages/ipykernel/__main__.py:2: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame\n", "\n", "See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " from ipykernel import kernelapp as app\n" ] } ], "source": [ "# But chained indexing assigns to a copy, so df is not modified\n", "df.iloc[889]['Sex'] = np.nan" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'male'" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# If we want to change a value, we should not chain selections\n", "# The selection can be done with the column name\n", "df.loc[889, 'Sex']" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'male'" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Or with the index of the column\n", "df.iloc[889, 4]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.000000 0 0 211536 13.00 NaN S \n", "887 female 19.000000 0 0 112053 30.00 B42 S \n", "888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n", "889 NaN 26.000000 0 0 111369 30.00 C148 C \n", "890 male 32.000000 0 0 370376 7.75 NaN Q " ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This indexing works for changing values\n", "df.loc[889, 'Sex'] = np.nan\n", "df[-5:]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Ticket Fare Cabin Embarked \n", "886 male 27.000000 0 0 211536 13.00 NaN S \n", "887 female 19.000000 0 0 112053 30.00 B42 S \n", "888 female 29.699118 1 2 W./C. 6607 23.45 NaN S \n", "889 male 26.000000 0 0 111369 30.00 C148 C \n", "890 male 32.000000 0 0 370376 7.75 NaN Q " ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Assign the result back rather than calling fillna in place on a column selection\n", "df['Sex'] = df['Sex'].fillna('male')\n", "df[-5:]" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=ffill**) or the next valid value (**method=bfill**). For example, with time series, it is common to carry the last valid value forward (ffill). Another alternative is to use the method **interpolate()**.\n", "\n", "Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "**Scikit-learn** also provides a preprocessing facility for managing null values: the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class (renamed **SimpleImputer**, in *sklearn.impute*, since scikit-learn 0.20). We can include the imputer as a step in a *Pipeline*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysing non numerical columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n", "\n", "**Name** and **Ticket** do not seem informative.\n", "\n", "Regarding **Cabin**, most values were missing, so we can ignore it. 
\n", "\n", "**Sex** and **Embarked** are categorical features, so we will encode them as integers." ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Fare Embarked \n", "886 male 27.000000 0 0 13.00 S \n", "887 female 19.000000 0 0 30.00 S \n", "888 female 29.699118 1 2 23.45 S \n", "889 male 26.000000 0 0 30.00 C \n", "890 male 32.000000 0 0 7.75 Q " ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We remove Cabin and Ticket. We should specify the axis\n", "# Use axis 0 for dropping rows and axis 1 for dropping columns\n", "df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n", "df[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Encoding categorical values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Sex* is a categorical feature stored as text. Scikit-learn estimators expect numerical input, so we need to encode it as numbers. Note that estimators interpret integer codes as ordered values: this is harmless for a binary feature such as *Sex*, but misleading for features with more than two unordered categories. " ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#First we check if there are any null values. Observe the use of any()\n", "df['Sex'].isnull().any()" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['male', 'female'], dtype=object)" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Now we check the values of Sex\n", "df['Sex'].unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to encode the values with our pandas knowledge." 
] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Fare Embarked \n", "886 0 27.000000 0 0 13.00 S \n", "887 1 19.000000 0 0 30.00 S \n", "888 1 29.699118 1 2 23.45 S \n", "889 0 26.000000 0 0 30.00 C \n", "890 0 32.000000 0 0 7.75 Q " ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n", "df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n", "df[-5:]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked Gender \n", "0 0 A/5 21171 7.2500 NaN S 0 \n", "1 0 PC 17599 71.2833 C85 C 1 \n", "2 0 STON/O2. 3101282 7.9250 NaN S 1 \n", "3 0 113803 53.1000 C123 S 1 \n", "4 0 373450 8.0500 NaN S 0 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#An alternative is to define a mapping and create a new column with the encoded values\n", "df = df_original.copy()\n", "df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check nulls\n", "df['Embarked'].isnull().any()" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check how many nulls\n", "\n", "df['Embarked'].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 111, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['S', 'C', 'Q', nan], dtype=object)" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check values\n", "df['Embarked'].unique()" ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Embarked\n", "C 168\n", 
"Q 77\n", "S 644\n", "dtype: int64" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Check distribution of Embarked\n", "df.groupby('Embarked').size()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Replace nulls with the most common value, assigning the result back to the column\n", "df['Embarked'] = df['Embarked'].fillna('S')\n", "df['Embarked'].isnull().any()" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass Name \\\n", "886 887 0 2 Montvila, Rev. Juozas \n", "887 888 1 1 Graham, Miss. Margaret Edith \n", "888 889 0 3 Johnston, Miss. Catherine Helen \"Carrie\" \n", "889 890 1 1 Behr, Mr. Karl Howell \n", "890 891 0 3 Dooley, Mr. Patrick \n", "\n", " Sex Age SibSp Parch Fare Embarked \n", "886 male 27.000000 0 0 13.00 0 \n", "887 female 19.000000 0 0 30.00 0 \n", "888 female 29.699118 1 2 23.45 0 \n", "889 male 26.000000 0 0 30.00 1 \n", "890 male 32.000000 0 0 7.75 2 " ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Now we replace the categories with integers, as we did previously\n", "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n", "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n", "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n", "df[-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although this transformation works, it introduces *an error*: some classifiers would interpret the integers as ordered and assume, for example, that Q (2) is higher than S (0). \n", "\n", "To avoid this error, scikit-learn provides a facility that transforms each categorical feature into binary ones: it creates a new dummy binary feature per category (one-hot encoding). In this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n", "\n", "We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Pandas](http://pandas.pydata.org/)\n", "* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n", "* [Pandas. 
Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }