sitc/ml2/3_4_Visualisation_Pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Introduction: preprocessing](#Introduction:-preprocessing)\n",
    "* [Visualisation with Pandas](#Visualisation-with-Pandas)\n",
    "* [Loading and Cleaning](#Loading-and-Cleaning)\n",
    "* [General exploration](#General-exploration)\n",
    "* [Feature Age](#Feature-Age)\n",
    "* [Feature Sex](#Feature-Sex)\n",
    "* [Feature Pclass](#Feature-Pclass)\n",
    "* [Feature Fare](#Feature-Fare)\n",
    "* [Feature Embarked](#Feature-Embarked)\n",
    "* [Features SibSp](#Features-SibSp)\n",
    "* [Feature ParCh](#Feature-ParCh)\n",
    "* [Recap: Filling null values](#Recap:-Filling-null-values)\n",
    "\t* [Feature Age: null values](#Feature-Age:-null-values)\n",
    "\t* [Feature Embarking: null values](#Feature-Embarking:-null-values)\n",
    "\t* [Feature Cabin: null values](#Feature-Cabin:-null-values)\n",
    "* [Encoding categorical features](#Encoding-categorical-features)\n",
    "\t* [Recap: encoding categorical features](#Recap:-encoding-categorical-features)\n",
    "\t* [Encoding Categorical Variables as Binary ones](#Encoding-Categorical-Variables-as-Binary-ones)\n",
    "* [Cleaning: dropping](#Cleaning:-dropping)\n",
    "* [Feature Engineering](#Feature-Engineering)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction: preprocessing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
    "\n",
    "Visualisation is usually combined with munging. We have done this in separate notebooks for learning purposes. We are going to examine the dataset again, combining both techniques, and applying the knowledge we got in the previous notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualisation with Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas provides a very good integration with matplotlib. DataFrames have the following methods:\n",
    "* **plot()**, for a number of charts, that can be selected with the argument *kind*:\n",
    "  * 'bar' for bar plots\n",
    "  * 'hist' for histograms\n",
    "  * 'box' for boxplots\n",
    "  * 'kde' for density plots\n",
    "  * 'area' for area plots\n",
    "  * 'scatter' for scatter plots\n",
    "  * 'hexbin' for hexagonal bin plots\n",
    "  * 'pie' for pie charts\n",
    "  \n",
    "Every plot kind has an equivalent on the Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of the parameters.\n",
    "\n",
    "In addition, the module *pandas.tools.plotting* provides: **scatter_matrix**.\n",
    "\n",
    "You can consult the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading and Cleaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General import and load data\n",
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "# if matplotlib is not set inline, you will not see plots\n",
    "\n",
    "#alternatives auto gtk gtk2 inline osx qt qt5 wx tk\n",
    "#%matplotlib auto\n",
    "#%matplotlib qt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#We get a URL with raw content (not an HTML one)\n",
    "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df_original = df.copy() # Copy to have a version of df without modifications\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cleaning\n",
    "df_clean = df.copy() # We copy to see what happens with na values\n",
    "df_clean['Age'] = df['Age'].fillna(df['Age'].median())\n",
    "df_clean['Sex'] = df_clean['Sex'].map({'male': 0, 'female': 1})\n",
    "df_clean = df_clean.drop(['Cabin', 'Ticket'], axis=1)\n",
    "df_clean['Embarked'] = df_clean['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
    "df_clean.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  General exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous session, we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General description of the dataset\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column types\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Columns non numeric\n",
    "df.dtypes[df.dtypes == object]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of null values\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analise distribution\n",
    "df.hist(figsize=(10,10))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
    "# while a value  near -1 or 1 indicates strong correlation.\n",
    "df.corr(numeric_only = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We do not find any relevant correlation. We could also represent this with a scatterplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General description of the relationship between variables using Seaborn PairGrid\n",
    "# We use df_clean because the null values in df would cause an error; you can check it.\n",
    "g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
    "g.map(sns.scatterplot)\n",
    "g.add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are too many variables, we are going to represent only a subset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PairGrid of variables\n",
    "g = sns.PairGrid(df_clean, hue=\"Survived\", vars=['Pclass', 'Sex', 'Age'])\n",
    "g.map_diag(plt.hist)\n",
    "g.map_offdiag(plt.scatter)\n",
    "g.add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe, for example, that more women and more people in the 3rd class survived. \n",
    "\n",
    "We can represent these findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Pclass\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that more women survived in all the passenger classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to put in practice our knowledge about munging and visualisation. We will analyse every feature of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We saw that there are 177 missing values of age. We are going to implement this feature with more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Histogram of Age\n",
    "# For Series, you can use hist(), plot.hist() or plot(kind='hist')\n",
    "df['Age'].hist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the histogram is slightly *right-skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
    "\n",
    "If we have a significantly skewed distribution, extreme values in the long tail can exert a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness. Taking the natural logarithm or the square root of each point is a simple transformation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We see with more bins the distribution\n",
    "df['Age'].hist(bins=30, range=(0, df['Age'].max()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we analyse the relationship of Age and Survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we visualise age and survived to see if there is some relationship\n",
    "sns.FacetGrid(df, hue=\"Survived\", height=5).map(sns.kdeplot, \"Age\").add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We do no observe significant differences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We plot the histogram per age\n",
    "g = sns.FacetGrid(df, col='Survived')\n",
    "g.map(plt.hist, \"Age\", color=\"steelblue\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that non-survived is left skewed. Most children survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Alternative to Seaborn with matplotlib integrated in pandas\n",
    "df.hist(column='Age', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can observe the details for children\n",
    "df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Mean of survival for young\n",
    "df[df.Age < 20]['Survived'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There were null values; we will recap at the end of this notebook how to manage them.\n",
    "\n",
    "We are now going to see the distribution of passengers younger than 20 who survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Pclass']).plot(kind='bar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Passengers older than 25 who survived grouped by Sex\n",
    "\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to improve it a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We pass 'Sex' from columns to rows with unstack, so that now Pclass is in the columns\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we make that the plot shows both values combined, and change the labels\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar', stacked=True)                                                                                                    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-06T17:31:13.351504Z",
     "start_time": "2019-03-06T17:31:13.092499Z"
    }
   },
   "outputs": [],
   "source": [
    "#Small touches\n",
    "\n",
    "pclass_labels = ['First', 'Second', 'Third']\n",
    "sex_labels = ['Female', 'Male']\n",
    "\n",
    "plt = df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar', \n",
    "                                                            stacked=True, rot=0, subplots=False, figsize=(5,10))\n",
    "plt.set_xticklabels(pclass_labels)\n",
    "plt.legend(labels=sex_labels)\n",
    "plt.set_xlabel('Passenger class')\n",
    "plt.set_title('Passenger class per sex')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-06T17:31:13.578569Z",
     "start_time": "2019-03-06T17:31:13.353882Z"
    }
   },
   "outputs": [],
   "source": [
    "#The same horizontal\n",
    "pclass_labels = ['First', 'Second', 'Third']\n",
    "sex_labels = ['Female', 'Male']\n",
    "\n",
    "plt = df.query('Age > 25 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='barh', \n",
    "                                                            stacked=True, rot=0, subplots=False)\n",
    "plt.set_yticklabels(pclass_labels)\n",
    "plt.legend(labels=sex_labels)\n",
    "\n",
    "plt.set_ylabel('Passenger class')\n",
    "plt.set_title('Passenger class per sex')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Sex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now going to explore the Sex attribute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passengers by sex\n",
    "df.groupby('Sex').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see men are more numerous than women."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot with seaborn\n",
    "sns.countplot(x='Sex', data=df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same graph with matplotlib and pandas\n",
    "colors_sex = ['#ff69b4', 'b']\n",
    "df.groupby('Sex').size().plot(kind='bar', rot=0, color=colors_sex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passengers survived by sex\n",
    "df.groupby('Sex')['Survived'].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passengers survived by sex\n",
    "df.groupby('Sex')['Survived'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that 74% of females survived, while only 18% of males survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Graphical representation\n",
    "# You can add the parameter estimator to change the estimator. (e.g. estimator=np.median)\n",
    "# For example, estimator=np.size is you get the same as with countplot\n",
    "#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)\n",
    "sns.barplot(x='Sex', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see now whether men and women follow the same age distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Age', by='Sex')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems they follow a similar distribution. We can separate per passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Age', by='Pclass')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see there are more young men in third class. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Pclass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have already seen how passengers are distributed with Pclass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Pclass').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most passengers are in 3rd class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Survivors per class\n",
    "sns.barplot(x='Pclass', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, passenger class is very significant, since most survivors are in first class.\n",
    "\n",
    "We can also see the distribution of classes per sex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.catplot(x='Pclass',data=df,hue='Sex',kind='count')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Pclass', 'Sex']).Survived.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see most women in first class and second survived, 96% and 92% respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Fare"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to analyse the feature *Fare* and will take the opportunity to introduce how to manage outliers.\n",
    "\n",
    "As we see in the PairGrid chart, Fare is directly related to the Passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Fare'].hist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(['Fare','Pclass'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the distribution is right-skewed. We are going to detect outliers using a box plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(data=df['Fare'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can see the same with matplotlib.\n",
    "# There is a bug, and if you import seaborn, you should add 'sym='k.' to show the outliers\n",
    "df.boxplot(column='Fare', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since Fare depends on Pclass, we are going to show outliers per passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Fare', by = 'Pclass', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that most outliers are in class 1. In particular, we see some values higher than 500 that should be an error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.Fare > 400]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can replace this value by the median(), the mean(), or the second highest value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculate hight values\n",
    "df.sort_values('Fare', ascending=False).head(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace\n",
    "df.loc[df.Fare > 400, 'Fare'] = 263.0\n",
    "\n",
    "# Check we have removed outliers\n",
    "df.sort_values('Fare', ascending=False).head(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Fare', by='Pclass', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Embarked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can analyze the distribution based on the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Embarked').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot(x='Embarked', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are missing values, we will replace them with the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
    "\n",
    "We can see if this has an impact on its survival."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Embarked']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x='Embarked', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems passengers embarked in C (Cherbourg) have a higher chance of survival.\n",
    "We can analyse this by sex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Embarked\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is also an improvement by gender for passengers embarking in Cherbourg."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have to fill in the null values (2 nulls) and encode this variable, since it is categorical. We will do it after reviewing the rest of the features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Features SibSp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We analyse the distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('SibSp').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot(x='SibSp', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that most passengers traveled without siblings or spouses. \n",
    "\n",
    "We analyse whether this had an impact on its survival."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('SibSp').Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='SibSp', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that it does not provide too much information. While the survival rate for all passengers is 38%, passengers with 0 SibSp have a 34% survival rate. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that for SibSp, there is almost the same number of men and women. Now we calculate the survival probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that when SibSp > 2, the survival probability decreases to half. We are going to check if there is a difference in the age. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).Age.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Age', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Effectively, when SibSp > 3, age is lower. We are going to check the relationship with Pclass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Pclass']).size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Pclass']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Sex\", y='SibSp', hue='Pclass', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in 3rd class, females had higher SibSp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Survived', hue='Pclass', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that SibSp is relevant for determining the survival rate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature ParCh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often die or evacuate together, so it is expected that this will also affect our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Parch').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot(x='Parch', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see most of the passenger had any parent or children.\n",
    "\n",
    "We analyze now the relationship with Survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Parch').Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Probability survival\n",
    "df.groupby('Parch').Survived.mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the probability of surviving is higher in 2 and 3. Since there were too few rows for Parch >= 3, this part is not relevant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Parch', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Pclass', 'Sex', 'Parch'])[['Parch', 'SibSp', 'Survived']].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that Parch has an important impact on men in first and second class. We are going to check the age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('(Sex == \"male\") and (Pclass == [1, 2]) and (Parch == [1, 2])')[['Survived', 'Age']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in those cases, the age is 27. We can compare with the rest of men if first and second class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('(Sex == \"male\") and (Pclass == [1, 2])')[['Survived', 'Age']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that there is a significant difference, so we suspect that this feature has an impact on men in first and second class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recap: Filling null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Age: null values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We fill null values of Age with its median."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We create a new feature to maintain the original \n",
    "df['AgeFilled'] = df['Age'].fillna(df['Age'].median())\n",
    "df['AgeFilled'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bug: if you include Seaborn,  add 'sym='k.' to show the outliers\n",
    "df.boxplot(column='AgeFilled', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another alternative is to use the function interpolate()."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['AgeFilled'] = df['Age'].interpolate()\n",
    "df['AgeFilled'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bug: if you include Seaborn,  add 'sym='k.' to show the outliers\n",
    "df.boxplot(column='AgeFilled', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Embarking: null values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see most passengers are in 'S'. There were also missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Embarked'].isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we discussed previously, we will replace these missing values with the most popular one (mode): S."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Replace nulls with the most common value\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "df['Embarked'].isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Cabin: null values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to analyse Cabin in the exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encoding categorical features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recap: encoding categorical features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous notebook, we saw how to encode categorical features. We are going to explore an alternative way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#df = df_original.copy()\n",
    "#df['SexEncoded'] = df.Sex\n",
    "#\n",
    "#df[\"SexEncoded\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
    "#\n",
    "#df['EmbarkedEncoded'] = df.Embarked\n",
    "#df[\"EmbarkedEncoded\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
    "#df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encoding Categorical Variables as Binary Ones"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable and we can assume an order in *Pclass*.\n",
    "\n",
    "Nevertheless, we will introduce a general approach to encoding categorical variables using facilities provided by scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**LabelEncoder** transform categories into integers (0, 1, ...). We are going to use it for *Sex*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
    "\n",
    "df = df_original.copy() # take original df\n",
    "\n",
    "# We define here the categorical columns have non-integer values, so we need to convert them\n",
    "# into integers first with LabelEncoder. This can be omitted if they are already integers.\n",
    "\n",
    "label_enc = LabelEncoder()\n",
    "label_sex = label_enc.fit_transform(df['Sex'])\n",
    "df['SexCoded'] = label_sex\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, we see it has been easy and we have *Sex* as a binary variable.\n",
    "\n",
    "Now we are going to do the same with *Embarked* and *Pclass*. There are several alternatives in scikit-learn, such as *DictVectorizer* or *OneHotEncoder*.\n",
    "\n",
    "We are going to use *pd.get_dummies*, which provides a very easy-to-use way to encode categorical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Remove nulls\n",
    "# Fill missing values in 'Embarked'\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning: dropping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should drop columns we will not use. In the exercise, you will need to use 'Cabin'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Drop unwanted columns\n",
    "df = df.drop(columns=['Cabin', 'Ticket'])\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature Engineering is the process of using domain/expert  knowledge of the data to create features that make machine learning algorithms work better. We are going to define several [new ones](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/) in the exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Basic Feature Engineering with the Titanic Data](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}