{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)\n",
"* [Examining a DataFrame](#Examining-a-DataFrame)\n",
"* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)\n",
"* [Grouping](#Grouping)\n",
"* [Pivot tables](#Pivot-tables)\n",
"* [Null and missing values](#Null-and-missing-values)\n",
"* [Analysing non numerical columns](#Analysing-non-numerical-columns)\n",
"* [Encoding categorical values](#Encoding-categorical-values)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data munging with Pandas and Scikit-learn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
"\n",
"*Scikit-learn* estimators assume that all values are numerical, which is common in many machine learning libraries, so we need to preprocess our raw dataset.\n",
"Some of the most common tasks are:\n",
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
"* Encode categorical variables as integers\n",
"* Combine datasets\n",
"* Rename variables and convert types\n",
"* Transform / scale variables\n",
"\n",
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
"\n",
"First we load the dataset and we get a dataframe."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from pandas import Series, DataFrame\n",
"\n",
"df = pd.read_csv('data-titanic/train.csv')\n",
"\n",
"# Show the first 5 rows\n",
"df[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Examining a DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine properties of the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Information about columns and their types\n",
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that some features have a numerical type (int64 and float64), while others have the type *object*, which is the type Pandas uses for strings in this DataFrame. We observe that most features are numerical, except for Name, Sex, Ticket, Cabin and Embarked."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
"df.dtypes[df.dtypes == object]"
]
},
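{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the same idea, *select_dtypes()* splits the columns by type. The small DataFrame *toy* below is a hypothetical example, not part of the Titanic dataset, so the cell can be run on its own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one text column and two numeric columns\n",
"toy = pd.DataFrame({'Name': ['Ann', 'Bob'], 'Age': [22.0, 38.0], 'Pclass': [3, 1]})\n",
"\n",
"# Columns with a numeric dtype (int64, float64, ...)\n",
"numeric = toy.select_dtypes(include='number').columns.tolist()\n",
"# Every other column (here, the text column)\n",
"non_numeric = toy.select_dtypes(exclude='number').columns.tolist()\n",
"print(numeric, non_numeric)"
]
},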
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's explore the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of samples and features\n",
"df.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Basic statistics of the dataset in all the numeric columns\n",
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that some of the statistics do not make sense for some columns (PassengerId or Pclass). We could have selected only the interesting columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Describe statistics of relevant columns. We pass a list of columns\n",
"df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting rows in a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the first 5 rows\n",
"df.head(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the last 5 rows\n",
"df.tail(5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select several rows\n",
"df[2:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select the first 5 values of a column by name\n",
"df['Survived'][:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select several columns. Observe that the first parameter is a list\n",
"df[['Survived', 'Sex', 'Age']][:5]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Passengers older than 20. Observe dataframe columns can be accessed like attributes.\n",
"df.Age > 20"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select passengers older than 20 (only the last 5). We use boolean indexing\n",
"df[df.Age > 20][-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select passengers older than 20 that survived (only the last 5)\n",
"df[(df.Age > 20) & (df.Survived == 1)][-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative syntax: DataFrame.query() takes the condition as a string instead of standard Python boolean indexing\n",
"# In large dataframes, DataFrame.query() using numexpr can be considerably faster; look at the references\n",
"df.query('Age > 20 and Survived == 1')[-5:]"
]
},
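{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two selection styles can be compared on a small hypothetical DataFrame (*toy* is an assumption made for illustration, not the Titanic data). Note that each condition is wrapped in its own parentheses, since *&* takes precedence over the comparison operators."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"toy = pd.DataFrame({'Age': [18.0, 30.0, 45.0, 52.0], 'Survived': [1, 0, 1, 1]})\n",
"\n",
"# Boolean indexing: combine two boolean Series with &\n",
"by_mask = toy[(toy['Age'] > 20) & (toy['Survived'] == 1)]\n",
"# The same condition written as a query string\n",
"by_query = toy.query('Age > 20 and Survived == 1')\n",
"\n",
"# Both approaches select the same rows\n",
"print(by_mask.equals(by_query))"
]
},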
{
"cell_type": "markdown",
"metadata": {},
"source": [
"DataFrames provide a set of functions for selection that we will need later.\n",
"\n",
"\n",
"|Operation | Syntax | Result |\n",
"|-----------------------------|--------------|-----------|\n",
"|Select column | df[col] | Series |\n",
"|Select row by label | df.loc[label] | Series |\n",
"|Select row by integer location | df.iloc[loc] | Series |\n",
"|Slice rows | df[5:10] | DataFrame |\n",
"|Select rows by boolean vector | df[bool_vec] | DataFrame |"
]
},
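{
"cell_type": "markdown",
"metadata": {},
"source": [
"The operations in the table can be sketched on a hypothetical DataFrame with string row labels (the name *toy* and its values are assumptions for illustration), checking the type of each result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with explicit row labels\n",
"toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0], 'Survived': [0, 1, 1]}, index=['a', 'b', 'c'])\n",
"\n",
"col = toy['Age']               # select column -> Series\n",
"row_by_label = toy.loc['b']    # select row by label -> Series\n",
"row_by_pos = toy.iloc[1]       # select row by integer location -> Series\n",
"rows = toy[1:3]                # slice rows -> DataFrame\n",
"older = toy[toy['Age'] > 25]   # select rows by boolean vector -> DataFrame\n",
"\n",
"for result in (col, row_by_label, row_by_pos, rows, older):\n",
"    print(type(result).__name__)"
]
},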
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select column and show last 4\n",
"df['Age'][-4:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select a column by label. With loc we select with [index-labels, column-labels]; here all rows of column Age, and show last 4\n",
"df.loc[:, 'Age'][-4:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select a column by integer position (Age is column 5, counting from 0), and show last 4\n",
"df.iloc[:, 5][-4:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Slice rows - last 5 rows\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Select based on boolean vector and show last 5 rows\n",
"df[df.Age > 20][-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows can be grouped by one or more columns, and aggregation operators can then be applied to the resulting GroupBy object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of passengers per sex (SQL like)\n",
"df.groupby('Sex').size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
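"source": [
"# Mean age of passengers per Passenger class\n",
"\n",
"# First we calculate the mean of every numeric column\n",
"# (numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average)\n",
"df.groupby('Pclass').mean(numeric_only=True)"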
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#And now we answer the initial query (only mean age)\n",
"df.groupby('Pclass')['Age'].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Alternative syntax\n",
"df.groupby('Pclass').Age.mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mean Age and SibSp of passengers grouped by passenger class and sex\n",
"# Several columns are selected with a list (double brackets)\n",
"df.groupby(['Pclass', 'Sex'])[['Age', 'SibSp']].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n",
"df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age', 'SibSp']].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mean Age, SibSp and Survived of passengers older than 25 who survived, grouped by Passenger Class and Sex\n",
"# Each condition needs its own parentheses, because & binds more tightly than >\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We can also decide which function to apply to each column\n",
"\n",
"# Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
"df[(df.Age > 25) & (df.Survived == 1)].groupby(['Pclass', 'Sex'])[['Age', 'SibSp', 'Survived']].agg({'Age': 'mean',\n",
" 'SibSp': 'mean', 'Survived': 'sum'})"
]
},
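{
"cell_type": "markdown",
"metadata": {},
"source": [
"A hedged sketch of per-column aggregation on a hypothetical DataFrame (*toy* and its values are assumptions for illustration). It uses named aggregation, with keyword arguments of the form *new_name=(column, function)*, as an alternative to the dictionary syntax above that also lets us name the result columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: two passengers per class\n",
"toy = pd.DataFrame({'Pclass': [1, 1, 3, 3],\n",
"                    'Age': [30.0, 40.0, 20.0, 26.0],\n",
"                    'Survived': [1, 0, 1, 0]})\n",
"\n",
"# One aggregation function per column, with explicit names for the results\n",
"summary = toy.groupby('Pclass').agg(mean_age=('Age', 'mean'),\n",
"                                    survivors=('Survived', 'sum'))\n",
"summary"
]
},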
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pivot tables"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Pivot tables are an intuitive way to analyze data, and an alternative to grouping columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.pivot_table(df, index='Sex')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try np.sum, np.size, len\n",
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Try np.sum, np.size, len\n",
"table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],\n",
" columns=['Embarked'])\n",
"table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table.query('Survived == 1')"
]
},
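{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how pivot tables relate to grouping, the following sketch builds the same table in both ways. The DataFrame *toy* is a hypothetical example chosen so that the means are easy to check by hand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"toy = pd.DataFrame({\n",
"    'Sex': ['female', 'female', 'male', 'female', 'male', 'male'],\n",
"    'Pclass': [1, 1, 1, 3, 3, 3],\n",
"    'Age': [30.0, 34.0, 40.0, 20.0, 26.0, 30.0],\n",
"})\n",
"\n",
"# Mean age with Sex as rows and Pclass as columns\n",
"pivot = pd.pivot_table(toy, index='Sex', columns='Pclass', values='Age', aggfunc='mean')\n",
"# The same result with groupby: aggregate, then move Pclass from the index to the columns\n",
"grouped = toy.groupby(['Sex', 'Pclass'])['Age'].mean().unstack('Pclass')\n",
"pivot"
]
},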
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Duplicates"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.duplicated().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case there are no duplicates. If there were, we could remove them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
]
},
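{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the Titanic data has no duplicates, here is a small sketch on a hypothetical DataFrame (*toy* is an assumption for illustration) showing *drop_duplicates()* with all the columns and with a subset of them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: rows 0 and 1 are identical\n",
"toy = pd.DataFrame({'Name': ['Ann', 'Ann', 'Ann', 'Bob'],\n",
"                    'Ticket': ['A1', 'A1', 'C3', 'B2']})\n",
"\n",
"has_duplicates = toy.duplicated().any()\n",
"# Compare all the columns: only the repeated (Ann, A1) row is dropped\n",
"deduplicated = toy.drop_duplicates()\n",
"# Compare only Name: the first row of each name is kept\n",
"one_per_name = toy.drop_duplicates(subset=['Name'])\n",
"print(has_duplicates, len(deduplicated), len(one_per_name))"
]
},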
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Null and missing values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we check how many null values there are.\n",
"\n",
"We use sum() instead of count(), since count() would give us the total number of records. Notice that we do not use size() here either. You can print 'df.isnull()' to see a DataFrame with boolean values."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.isnull().sum()"
]
},
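{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between sum() and count() can be sketched on a hypothetical DataFrame (*toy* is an assumption for illustration): sum() adds up the True values, while count() counts every non-null entry of the boolean DataFrame, that is, every record."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with missing values\n",
"toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],\n",
"                    'Cabin': [np.nan, np.nan, 'E46']})\n",
"\n",
"nulls = toy.isnull().sum()      # number of missing values per column\n",
"records = toy.isnull().count()  # number of records per column (True and False alike)\n",
"non_null = toy.count()          # number of non-missing values per column\n",
"print(nulls.tolist(), records.tolist(), non_null.tolist())"
]
},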
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Drop records with missing values\n",
"df_original = df.copy()\n",
"df_clean = df.dropna()\n",
"print(\"Original\", df.shape)\n",
"print(\"Cleaned\", df_clean.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most of the samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how='all'*, which deletes a sample only if all its values are missing, instead of the default *how='any'*."
]
},
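{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of these options on a hypothetical DataFrame (*toy* is an assumption for illustration). It also shows the *subset* argument, which restricts the check to some columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data: row 0 is complete, row 1 is empty, row 2 lacks Age\n",
"toy = pd.DataFrame({'Age': [22.0, np.nan, np.nan],\n",
"                    'Cabin': ['C85', np.nan, 'E46']})\n",
"\n",
"drop_any = toy.dropna()                # default how='any': keeps only row 0\n",
"drop_all = toy.dropna(how='all')       # drops only the empty row 1\n",
"drop_age = toy.dropna(subset=['Age'])  # only looks at the Age column\n",
"print(len(drop_any), len(drop_all), len(drop_age))"
]
},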
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fill missing values with the median of each numeric column\n",
"df_filled = df.fillna(df.median(numeric_only=True))\n",
"df_filled[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#The original df has not been modified\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Observe that the passenger with PassengerId 889 now has an Age of 28 (the median) instead of NaN. \n",
"\n",
"Regarding the *Cabin* column, there are still NaN values, since it is not numeric. We will see later how to deal with it.\n",
"\n",
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
"\n",
"If we want to modify the *df* object directly, we should add the parameter *inplace* with value *True*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calling fillna on the DataFrame with a {column: value} dict modifies df itself\n",
"df.fillna({'Age': df['Age'].mean()}, inplace=True)\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Another possibility is to assign the filled column back to the dataframe\n",
"# First we get the df with NaN values\n",
"df = df_original.copy()\n",
"#Fill NaN and assign to the column\n",
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to see how to change the Sex value of the passenger in row 889, and then replace the missing values of Sex. It is just an example for practicing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# There are no labels for rows, so we use the numeric index\n",
"df.iloc[889]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#We access row and column\n",
"df.iloc[889]['Sex']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# But chained indexing works on a temporary copy, so this assignment does not change df (pandas may warn about it)\n",
"df.iloc[889]['Sex'] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# If we want to change, we should not chain selections\n",
"# The selection can be done with the column name\n",
"df.loc[889, 'Sex']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Or with the index of the column\n",
"df.iloc[889, 4]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This indexing works for changing values\n",
"df.loc[889, 'Sex'] = np.nan\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.fillna({'Sex': 'male'}, inplace=True)\n",
"df[-5:]"
]
},
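{
"cell_type": "markdown",
"metadata": {},
"source": [
"The steps above can be condensed into a self-contained sketch. The DataFrame *toy* is a hypothetical stand-in for the Titanic data: we set a value with a single *loc* call (no chained selection) and then fill the gap by assigning the filled column back."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"toy = pd.DataFrame({'Sex': ['male', 'female', 'male']})\n",
"\n",
"# A single loc[row, column] call modifies the DataFrame itself\n",
"toy.loc[1, 'Sex'] = np.nan\n",
"n_missing = int(toy['Sex'].isnull().sum())\n",
"\n",
"# Fill the missing value and assign the result back to the column\n",
"toy['Sex'] = toy['Sex'].fillna('male')\n",
"print(n_missing, toy['Sex'].tolist())"
]
},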
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=ffill**) or the next valid value (**method=bfill**). For example, with time series, it is frequent to use the last valid value (ffill). Another alternative is to use the method **interpolate()**.\n",
"\n",
"Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.\n"
]
},
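{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of these three strategies on a hypothetical Series (the values are assumptions for illustration). It uses the *ffill()* and *bfill()* methods, which propagate the previous and the next valid value respectively, and *interpolate()* with its default linear behaviour."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical series with a gap of two missing values\n",
"s = pd.Series([1.0, np.nan, np.nan, 4.0])\n",
"\n",
"forward = s.ffill()       # previous valid value is propagated forward\n",
"backward = s.bfill()      # next valid value is propagated backward\n",
"linear = s.interpolate()  # values on a straight line between the neighbours\n",
"print(forward.tolist(), backward.tolist(), linear.tolist())"
]
},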
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Scikit-learn** also provides a preprocessing facility for managing null values: the [**SimpleImputer**](http://scikit-learn.org/stable/modules/impute.html) class (called *Imputer* in older versions). We can include it as a step in a *Pipeline*."
]
},
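{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch, assuming a scikit-learn version that ships *sklearn.impute.SimpleImputer* (older releases exposed *sklearn.preprocessing.Imputer* instead). The array *X* is hypothetical data with a single feature."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.impute import SimpleImputer\n",
"\n",
"# Hypothetical feature matrix: one column (an age) with one missing value\n",
"X = np.array([[22.0], [np.nan], [38.0], [26.0]])\n",
"\n",
"# Learn the median of each column and use it to replace the missing values\n",
"imputer = SimpleImputer(strategy='median')\n",
"X_filled = imputer.fit_transform(X)\n",
"X_filled"
]
},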
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysing non numerical columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
"\n",
"**Name** and **Ticket** do not seem informative.\n",
"\n",
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
"\n",
"**Sex** and **Embarked** are categorical features, so we will encode them as integers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We remove Cabin and Ticket. We should specify the axis\n",
"# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Encoding categorical values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Sex* is a categorical feature stored as text. Scikit-learn estimators expect numerical input, so we have to encode it as numbers. Keep in mind that estimators interpret numbers as ordered values, which is harmless for a binary feature such as *Sex* but not for features with more categories. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First we check if there are any null values. Observe the use of any()\n",
"df['Sex'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Now we check the values of Sex\n",
"df['Sex'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to encode the values with our pandas knowledge."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df[-5:]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# An alternative is to create a new column with the encoded values and define a mapping\n",
"df = df_original.copy()\n",
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check nulls\n",
"df['Embarked'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check how many nulls\n",
"\n",
"df['Embarked'].isnull().sum()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check values\n",
"df['Embarked'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check distribution of Embarked\n",
"df.groupby('Embarked').size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Replace nulls with the most common value\n",
"df.fillna({'Embarked': 'S'}, inplace=True)\n",
"df['Embarked'].isnull().any()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Now we replace the categories with integers, as previously\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"df[-5:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although this transformation can be acceptable, we are introducing *an error*: some classifiers could assume that there is an order in S, C, Q, and that Q is higher than S. \n",
"\n",
"To avoid this error, Scikit-learn provides a facility for transforming categorical features into numerical ones without implying an order (one-hot encoding): it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
"\n",
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
]
},
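{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough preview of the idea, and not necessarily the approach used in the next notebook, the pandas function *get_dummies()* builds one binary column per category. The DataFrame *toy* is a hypothetical example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with the three ports of embarkation\n",
"toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})\n",
"\n",
"# One binary (0/1) column per category; exactly one of them is 1 in each row\n",
"dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)\n",
"dummies"
]
},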
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Pandas](http://pandas.pydata.org/)\n",
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n",
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)\n",
"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.1"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}