mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-14 02:32:27 +00:00
998 lines
25 KiB
Plaintext
998 lines
25 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Data munging with Pandas and Scikit-learn](#Data-munging-with-Pandas-and-Scikit-learn)\n",
|
|
"* [Examining a DataFrame](#Examining-a-DataFrame)\n",
|
|
"* [Selecting rows in a DataFrame](#Selecting-rows-in-a-DataFrame)\n",
|
|
"* [Grouping](#Grouping)\n",
|
|
"* [Pivot tables](#Pivot-tables)\n",
|
|
"* [Null and missing values](#Null-and-missing-values)\n",
|
|
"* [Analysing non numerical columns](#Analysing-non-numerical-columns)\n",
|
|
"* [Encoding categorical values](#Encoding-categorical-values)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data munging with Pandas and Scikit-learn"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This notebook provides a more detailed introduction to Pandas and scikit-learn using the Titanic dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
|
|
"\n",
|
|
"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
|
|
"Some of the most common tasks are:\n",
|
|
"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
|
|
"* Encode categorical variables as integers\n",
|
|
"* Combine datasets\n",
|
|
"* Rename variables and convert types\n",
|
|
"* Transform / scale variables\n",
|
|
"\n",
|
|
"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
|
|
"\n",
|
|
"First we load the dataset and we get a dataframe."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from pandas import Series, DataFrame\n",
|
|
"\n",
|
|
"df = pd.read_csv('data-titanic/train.csv')\n",
|
|
"\n",
|
|
"# Show the first 5 rows\n",
|
|
"df[:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Examining a DataFrame"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can examine properties of the dataset."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Information about columns and their types\n",
|
|
"df.info()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We see some features have a numerical type (int64 and float64), and others has a type *object*. The object type is a String in Pandas. We observe that most features are integers, except for Name, Sex, Ticket, Cabin and Embarked."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We can list non numerical properties, with a boolean indexing of the Series df.dtypes\n",
|
|
"df.dtypes[df.dtypes == object]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's explore the DataFrame."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Number of samples and features\n",
|
|
"df.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Basic statistics of the dataset in all the numeric columns\n",
|
|
"df.describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Observe that some of the statistics do not make sense in some columns (PassengerId or Pclass), we could have selected only the interesting columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Describe statistics of relevant columns. We pass a list of columns\n",
|
|
"df[['Survived', 'Age', 'SibSp', 'Parch', 'Fare']].describe()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Selecting rows in a DataFrame"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select the first 5 rows\n",
|
|
"df.head(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select the last 5 rows\n",
|
|
"df.tail(5)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select several rows\n",
|
|
"df[2:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select the first 5 values of a column by name\n",
|
|
"df['Survived'][:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select several columns. Observe that the first parameter is a list\n",
|
|
"df[['Survived', 'Sex', 'Age']][:5]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Passengers older than 20. Observe dataframe columns can be accessed like attributes.\n",
|
|
"df.Age > 30"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select passengers older than 20 (only the last 5). We use boolean indexing\n",
|
|
"df[df.Age > 20][-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select passengers older than 20 that survived (only the last 5)\n",
|
|
"df[(df.Age > 20) & (df.Survived == 1)][-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Alternative syntax with query to the standard Python \n",
|
|
"# In large dataframes, the perfomance of DataFrame.query() using numexpr is considerable faster, look at the references\n",
|
|
"df.query('Age > 20 and Survived == 1')[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"DataFrames provide a set of functions for selection that we will need later\n",
|
|
"\n",
|
|
"\n",
|
|
"|Operation | Syntax | Result |\n",
|
|
"|-----------------------------|\n",
|
|
"|Select column | df[col] | Series |\n",
|
|
"|Select row by label | df.loc[label] | Series |\n",
|
|
"|Select row by integer location | df.iloc[loc] | Series |\n",
|
|
"|Slice rows\t | df[5:10]\t | DataFrame |\n",
|
|
"|Select rows by boolean vector | df[bool_vec] | DataFrame |"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select column and show last 4\n",
|
|
"df['Age'][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select row by label. We select with [index-labels, column-labels], and show last 4\n",
|
|
"df.loc[:, 'Age'][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Select row by column index (Age is the column 5), and show last 4\n",
|
|
"df.iloc[:, 5][-4:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Slice rows - last 5 columns\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Select based on boolean vector and show last 5 columns\n",
|
|
"df[df.Age > 20][-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Grouping"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Rows can be grouped by one or more columns, and apply aggregated operators on the GroupBy object."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Number of users per sex (SQL like)\n",
|
|
"df.groupby('Sex').size()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Mean age of passengers per Passenger class\n",
|
|
"\n",
|
|
"#First we calculate the mean\n",
|
|
"df.groupby('Pclass').mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#And now we answer the initial query (only mean age)\n",
|
|
"df.groupby('Pclass')['Age'].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Alternative syntax\n",
|
|
"df.groupby('Pclass').Age.mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Mean Age and SibSp of passengers grouped by passenger class and sex\n",
|
|
"df.groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Show mean Age and SibSp for passengers older than 25 grouped by Passenger Class and Sex\n",
|
|
"df[df.Age > 25].groupby(['Pclass', 'Sex'])[['Age','SibSp']].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Mean age, SibSp , Survived of passengers older than 25 which survived, grouped by Passenger Class and Sex \n",
|
|
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].mean()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We can also decide which function apply in each column\n",
|
|
"\n",
|
|
"#Show mean Age, mean SibSp, and number of passengers older than 25 that survived, grouped by Passenger Class and Sex\n",
|
|
"df[(df.Age > 25 & (df.Survived == 1))].groupby(['Pclass', 'Sex'])[['Age','SibSp','Survived']].agg({'Age': np.mean, \n",
|
|
" 'SibSp': np.mean, 'Survived': np.sum})"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Pivot tables"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Pivot tables are an intuitive way to analyze data, and alternative to group columns."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pd.pivot_table(df, index='Sex')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=np.mean)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Try np.sum, np.size, len\n",
|
|
"pd.pivot_table(df, index=['Sex', 'Pclass'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Try np.sum, np.size, len\n",
|
|
"table = pd.pivot_table(df, index=['Sex', 'Pclass', 'Survived'], values=['Age', 'SibSp'], aggfunc=[np.mean, np.sum],\n",
|
|
" columns=['Embarked'])\n",
|
|
"table"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"table.query('Survived == 1')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Duplicates"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": []
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.duplicated().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Null and missing values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Here we check how many null values there are.\n",
|
|
"\n",
|
|
"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Drop records with missing values\n",
|
|
"df_original = df.copy()\n",
|
|
"df_clean = df.dropna()\n",
|
|
"print(\"Original\", df.shape)\n",
|
|
"print(\"Cleaned\", df_clean.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Most of samples have been deleted. We could have used [*dropna*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) with the argument *how=all* that deletes a sample if all the values are missing, instead of the default *how=any*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Fill missing values with the median, we avoid empty (None) values with numeric_only\n",
|
|
"df_filled = df.fillna(df.median(numeric_only=True))\n",
|
|
"df_filled[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#The original df has not been modified\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
|
|
"\n",
|
|
"Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
|
|
"\n",
|
|
"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
|
|
"\n",
|
|
"If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Another possibility is to assign the modified dataframe\n",
|
|
"# First we get the df with NaN values\n",
|
|
"df = df_original.copy()\n",
|
|
"#Fill NaN and assign to the column\n",
|
|
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to see how to change the Sex value of PassengerId 889, and then replace the missing values of Sex. It is just an example for practicing."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# There are not labels for rows, so we use the numeric index\n",
|
|
"df.iloc[889]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#We access row and column\n",
|
|
"df.iloc[889]['Sex']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# But we are working on a copy, so we get a warning\n",
|
|
"df.iloc[889]['Sex'] = np.nan"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# If we want to change it, we should not chain selections\n",
|
|
"# The selection can be done with the column name\n",
|
|
"df.loc[889, 'Sex']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Or with the index of the column\n",
|
|
"df.iloc[889, 4]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This indexing works for changing values\n",
|
|
"df.loc[889, 'Sex'] = np.nan\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df['Sex'].fillna('male', inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"There are other interesting possibilities of **fillna**. We can fill with the previous valid value (**method=bfill**) or the next valid value (**method=ffill**). For example, with time series, it is frequent to use the last valid value (bfill). Another alternative is to use the method **interpolate()**.\n",
|
|
"\n",
|
|
"Look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/missing_data.html) for more details.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"\n",
|
|
"**Scikit-learn** provides also a preprocessing facility for managing null values in the [**Imputer**](http://scikit-learn.org/stable/modules/preprocessing.html) class. We can include *Imputer* as a step in the *Pipeline*."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Analysing non numerical columns"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
|
|
"\n",
|
|
"**Name** and **Ticket** do not seem informative.\n",
|
|
"\n",
|
|
"Regarding **Cabin**, most values were missing, so we can ignore it. \n",
|
|
"\n",
|
|
"**Sex** and **Embarked** are categorical features, so we will encode as integers."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We remove Cabin and Ticket. We should specify the axis\n",
|
|
"# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
|
|
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Encoding categorical values"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"*Sex* has been codified as a categorical feature. It is better to encode features as continuous variables, since scikit-learn estimators expect continuous input, and they would interpret the categories as being ordered, which is not the case. "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#First we check if there is any null values. Observe the use of any()\n",
|
|
"df['Sex'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Now we check the values of Sex\n",
|
|
"df['Sex'].unique()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to encode the values with our pandas knowledge."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
|
|
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#An alternative is to create a new column with the encoded valuesm and define a mapping\n",
|
|
"df = df_original.copy()\n",
|
|
"df['Gender'] = df['Sex'].map( {'male': 0, 'female': 1} ).astype(int)\n",
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Check nulls\n",
|
|
"df['Embarked'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Check how many nulls\n",
|
|
"\n",
|
|
"df['Embarked'].isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Check values\n",
|
|
"df['Embarked'].unique()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Check distribution of Embarked\n",
|
|
"df.groupby('Embarked').size()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#Replace nulls with the most common value\n",
|
|
"df['Embarked'].fillna('S', inplace=True)\n",
|
|
"df['Embarked'].isnull().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Now we replace as previosly the categories with integers\n",
|
|
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
|
|
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
|
|
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
|
|
"df[-5:]"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Although this transformation can be ok, we are introducing *an error*. Some classifiers could think that there is an order in S, C, Q, and that Q is higher than S. \n",
|
|
"\n",
|
|
"To avoid this error, Scikit learn provides a facility for transforming all the categorical features into integer ones. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
|
|
"\n",
|
|
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [Pandas](http://pandas.pydata.org/)\n",
|
|
"* [Learning Pandas, Michael Heydt, Packt Publishing, 2017](https://learning.oreilly.com/library/view/learning-pandas/9781787123137/)\n",
|
|
"* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
|
|
"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
|
|
"* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)\n",
|
|
"* [Useful Pandas Snippets](https://gist.github.com/bsweger/e5817488d161f37dcbd2)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.12"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|