sitc/ml2/3_4_Visualisation_Pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Introduction: preprocessing](#Introduction:-preprocessing)\n",
    "* [Visualisation with Pandas](#Visualisation-with-Pandas)\n",
    "* [Loading and Cleaning](#Loading-and-Cleaning)\n",
    "* [General exploration](#General-exploration)\n",
    "* [Feature Age](#Feature-Age)\n",
    "* [Feature Sex](#Feature-Sex)\n",
    "* [Feature Pclass](#Feature-Pclass)\n",
    "* [Feature Fare](#Feature-Fare)\n",
    "* [Feature Embarked](#Feature-Embarked)\n",
    "* [Features SibSp](#Features-SibSp)\n",
    "* [Feature ParCh](#Feature-ParCh)\n",
    "* [Recap: Filling null values](#Recap:-Filling-null-values)\n",
    "\t* [Feature Age: null values](#Feature-Age:-null-values)\n",
    "\t* [Feature Embarking: null values](#Feature-Embarking:-null-values)\n",
    "\t* [Feature Cabin: null values](#Feature-Cabin:-null-values)\n",
    "* [Encoding categorical features](#Encoding-categorical-features)\n",
    "\t* [Recap: encoding categorical features](#Recap:-encoding-categorical-features)\n",
    "\t* [Encoding Categorical Variables as Binary ones](#Encoding-Categorical-Variables-as-Binary-ones)\n",
    "* [Cleaning: dropping](#Cleaning:-dropping)\n",
    "* [Feature Engineering](#Feature-Engineering)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction: preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
    "\n",
    "Visualisation is usually combined with munging. We have done this in separated notebooks for learning purposes. We we are going to examine again the dataset, combinging both techniques, and applying the knowledge we got in the previous notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualisation with Pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas provides a very good integration with matplotlib. DataFrames have the following methods:\n",
    "* **plot()**, for a number of charts, that can be selected with the argument *kind*:\n",
    "  * 'bar' for bar plots\n",
    "  * 'hist' for histograms\n",
    "  * 'box' for boxplots\n",
    "  * 'kde' for density plots\n",
    "  * 'area' for area plots\n",
    "  * 'scatter' for scatter plots\n",
    "  * 'hexbin' for hexagonal bin plots\n",
    "  * 'pie' for pie charts\n",
    "  \n",
    "Every plot kind has an equivalent on Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of parameters.\n",
    "\n",
    "In addition, the module *pandas.tools.plotting* provides: **scatter_matrix**.\n",
    "\n",
    "You can consult more details in the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html)."
   ]
  },
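  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the methods listed above, the following cell builds a small synthetic DataFrame and plots it with **plot()** and **scatter_matrix**. The name *demo* and its random values are only illustrative assumptions (they are not part of the Titanic dataset), and we assume a pandas version where *scatter_matrix* lives in *pandas.plotting*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas.plotting import scatter_matrix\n",
    "\n",
    "# Hypothetical demo data: 100 rows of three normally distributed columns\n",
    "rng = np.random.RandomState(0)\n",
    "demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])\n",
    "\n",
    "# Two equivalent ways of selecting the kind of chart\n",
    "demo.plot(kind='hist', alpha=0.5)\n",
    "demo.plot.hist(alpha=0.5)\n",
    "\n",
    "# Scatter plots for every pair of columns, with histograms in the diagonal\n",
    "scatter_matrix(demo, figsize=(6, 6))"
   ]
  },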
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Loading and Cleaning"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General import and load data\n",
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "# if matplotlib is not set inline, you will not see plots\n",
    "\n",
    "#alternatives auto gtk gtk2 inline osx qt qt5 wx tk\n",
    "#%matplotlib auto\n",
    "#%matplotlib qt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#We get a URL with raw content (not HTML one)\n",
    "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df_original = df.copy() # Copy to have a version of df without modifications\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cleaning\n",
    "df_clean = df.copy() # We copy to see what happens with na values\n",
    "df_clean['Age'] = df['Age'].fillna(df['Age'].median())\n",
    "df_clean.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
    "df_clean.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
    "df_clean.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
    "df_clean.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
    "df_clean.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
    "df_clean.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
    "df_clean.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  General exploration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous session we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General description of the dataset\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Column types\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Columns non numeric\n",
    "df.dtypes[df.dtypes == object]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Number of null values\n",
    "df.isnull().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Analise distributon\n",
    "df.hist(figsize=(10,10))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can see the pairwise correlation between variables. A value near 0 means low correlation\n",
    "# while a value  near -1 or 1 indicates strong correlation.\n",
    "df.corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We do not find any relevant correlation. We could also represent this with a scatterplot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General description of relationship betweek variables uwing Seaborn PairGrid\n",
    "# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
    "g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
    "g.map_diag(plt.hist)\n",
    "g.map_offdiag(plt.scatter)\n",
    "g.add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two many variables, we are going to represent only a subset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PairGrid of variables\n",
    "g = sns.PairGrid(df_clean, hue=\"Survived\", vars=['Pclass', 'Sex', 'Age'])\n",
    "g.map_diag(plt.hist)\n",
    "g.map_offdiag(plt.scatter)\n",
    "g.add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can observe, for example, that more women survived as well as more people in 3rd class. \n",
    "\n",
    "We can represent these findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Pclass\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that more women survived in all the passenger classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to put in practice our knowledge about munging and visualisation. We will analyse every feature of the dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Age"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We saw that there are 177 missing values of age. We are going this feature with more detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Histogram of Age\n",
    "# For Series, you can use hist(), plot.hist() or plot(kind='hist')\n",
    "df['Age'].hist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the histogram is slightly *right skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
    "\n",
    "In case we have a significant *skewed distribution*, the extreme values in the long tail can have a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness.Taking the natural logarithm or the square root of each point are two simple transformations. "
   ]
  },
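  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the two transformations mentioned above, the next cell compares the skewness of *Fare* before and after applying them. *Fare* is chosen here only as an illustrative, strongly skewed column, and using *np.log1p*, i.e. log(1 + x), is our assumption to cope with possible zero values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch (uses pd, np and df from the cells above)\n",
    "fare = df['Fare']\n",
    "\n",
    "# Skewness near 0 means a roughly symmetric distribution\n",
    "skewness = pd.Series({\n",
    "    'original': fare.skew(),\n",
    "    'sqrt': np.sqrt(fare).skew(),\n",
    "    'log1p': np.log1p(fare).skew(),  # log(1 + x), defined for fares equal to 0\n",
    "})\n",
    "skewness"
   ]
  },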
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We see with more bins the distribution\n",
    "df['Age'].hist(bins=30, range=(0, df['Age'].max()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we analyse the relationship of Age and Survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we visualise age and survived to see if there is some relationship\n",
    "sns.FacetGrid(df, hue=\"Survived\", size=5).map(sns.kdeplot, \"Age\").add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We do no observe significant differences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We plot the histogram per age\n",
    "g = sns.FacetGrid(df, col='Survived')\n",
    "g.map(plt.hist, \"Age\", color=\"steelblue\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that non survived is left skewed. Most children survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Alternative to Seaborn with matplotlib integrated in pandas\n",
    "df.hist(column='Age', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can observe the detail for children\n",
    "df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Mean of survival for young\n",
    "df[df.Age < 20]['Survived'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There were null values, we will recap at the end of this notebook how to manage them.\n",
    "\n",
    "We are going now to see the distribution of passengers younger than 20 that survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Pclass']).plot(kind='bar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Passengers older than 25 that survived grouped by Sex\n",
    "\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to improve it a bit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We pass 'Sex' from columns to rows with unstack, so that now Pclass is in the columns\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we make that the plot shows both values combined, and change the labels\n",
    "df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar', stacked=True)                                                                                                    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-06T17:31:13.351504Z",
     "start_time": "2019-03-06T17:31:13.092499Z"
    }
   },
   "outputs": [],
   "source": [
    "#Small touches\n",
    "\n",
    "pclass_labels = ['First', 'Second', 'Third']\n",
    "sex_labels = ['Female', 'Male']\n",
    "\n",
    "plt = df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='bar', \n",
    "                                                            stacked=True, rot=0, subplots=False, figsize=(5,10))\n",
    "plt.set_xticklabels(pclass_labels)\n",
    "plt.legend(labels=sex_labels)\n",
    "plt.set_xlabel('Passenger class')\n",
    "plt.set_title('Passenger class per sex')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-06T17:31:13.578569Z",
     "start_time": "2019-03-06T17:31:13.353882Z"
    }
   },
   "outputs": [],
   "source": [
    "#The same horizontal\n",
    "pclass_labels = ['First', 'Second', 'Third']\n",
    "sex_labels = ['Female', 'Male']\n",
    "\n",
    "plt = df.query('Age > 25 and Survived == 1').groupby(['Sex','Pclass']).size().unstack(['Sex']).plot(kind='barh', \n",
    "                                                            stacked=True, rot=0, subplots=False)\n",
    "plt.set_yticklabels(pclass_labels)\n",
    "plt.legend(labels=sex_labels)\n",
    "\n",
    "plt.set_ylabel('Passenger class')\n",
    "plt.set_title('Passenger class per sex')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Sex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now going to explore the Sex attribute"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passengers by sex\n",
    "df.groupby('Sex').size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see men are more numerous than women."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot with seaborn\n",
    "sns.countplot('Sex', data=df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same graph with matplotlib and pandas\n",
    "colors_sex = ['#ff69b4', 'b']\n",
    "df.groupby('Sex').size().plot(kind='bar', rot=0, color=colors_sex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passergers survived by sex\n",
    "df.groupby('Sex')['Survived'].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many passergers survived by sex\n",
    "df.groupby('Sex')['Survived'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that 74% of female survived, while only 18% of male survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Graphical representation\n",
    "# You can add the parameter estimator to change the estimator. (e.g. estimator=np.median)\n",
    "# For example, estimator=np.size is you get the same chart than with countplot\n",
    "#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)\n",
    "sns.barplot(x='Sex', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see now if men and women follow the same age distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Age', by='Sex')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems they follow a similar distribution. We can separate per passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Age', by='Pclass')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see there are more young men in third class. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Pclass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have already seen how passengers are distributed with Pclass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Pclass').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot('Pclass', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Most passengers are in 3rd class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Survivors per class\n",
    "sns.barplot(x='Pclass', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As expected, passenger class is very significant, since most survivors are in first class.\n",
    "\n",
    "We can also see the distribution of classes per sex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.factorplot('Pclass',data=df,hue='Sex',kind='count')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Pclass', 'Sex']).Survived.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see most women in first class and second survived, 96% and 92% respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Fare"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to analyse the feature *Fare* and will take the opportunity to introduce how to manage outliers.\n",
    "\n",
    "As we see in the PairGrid chart, Fare is directly related to the Passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Fare'].hist()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(['Fare','Pclass'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the distribution is right sweked. We are going to detect outliers using a box plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.boxplot(data=df['Fare'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can see the same with matplotlib.\n",
    "# There is a bug and if you import seaborn, you should add 'sym='k.' to show the outliers\n",
    "df.boxplot(column='Fare', return_type='axes', sym='k.')"
   ]
  },
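  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, a box plot draws as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. As a rough sketch, the same rule can be computed directly with pandas; the variable names below are only illustrative."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch of the 1.5 * IQR rule used by default in box plots\n",
    "q1, q3 = df['Fare'].quantile([0.25, 0.75])\n",
    "iqr = q3 - q1\n",
    "upper_limit = q3 + 1.5 * iqr\n",
    "\n",
    "# Number of fares above the upper limit, i.e. drawn as outliers in the box plot\n",
    "(df['Fare'] > upper_limit).sum()"
   ]
  },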
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since Fare depends on Pclass, we are going to show outliers per passenger class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Fare', by = 'Pclass', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that most outliers are in class 1. In particular, we see some values higher thatn 500 that should be an error."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[df.Fare > 400]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can replace this value by the median(), the mean(), or the second highest value."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Calculate hight values\n",
    "df.sort_values('Fare', ascending=False).head(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace\n",
    "df.loc[df.Fare > 400, 'Fare'] = 263.0\n",
    "\n",
    "# Check we have removed outliers\n",
    "df.sort_values('Fare', ascending=False).head(8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Fare', by='Pclass', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Embarked"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can analyze the distribution based on the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Embarked').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot('Embarked', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since there are missing values, we will replace them by the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
    "\n",
    "We can see if this has impact on its survival."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Embarked']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x='Embarked', y='Survived', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems passengers embarked in C (Cherbourg) have a higher chance of survival.\n",
    "We can analyse this by sex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Embarked\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There is also an improvement by gender for passengers embarking in Cherbourg."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have to fill null values (2 null values) and encode this variable, since it is categorical. We will do it after reviewing the rest of features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Features SibSp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We analyse the distribution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('SibSp').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot('SibSp', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that most passengers traveled without siblings or spouses. \n",
    "\n",
    "We analyse if this had impact on its survival."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('SibSp').Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='SibSp', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that it does not provide too much information. While the survival mean of all passengers is 38%, passengers with 0 SibSp has 34% of probability. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that for SibSp, there is almost the same number of men and women. Now we calculate the survival probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Survived', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that when SibSp > 2, the survival probability decreases to the half. We are going to check if there is a difference in the age. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Sex']).Age.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Age', hue='Sex', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Effectively, when SibSp > 3, age is lower. We are going to check the relationship with Pclass."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Pclass']).size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['SibSp', 'Pclass']).Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"Sex\", y='SibSp', hue='Pclass', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in 3rd class, females had higher SibSp."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.barplot(x=\"SibSp\", y='Survived', hue='Pclass', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems that SibSp is relevant for determining the survival rate."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature ParCh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often all die or evacuate together, so it is expected that it will also have an impact on our model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Parch').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "sns.countplot('Parch', data=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see most of the passenger had any parent or children.\n",
    "\n",
    "We analyze now the relationship with Survived."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('Parch').Survived.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Probability survival\n",
    "df.groupby('Parch').Survived.mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see the probability of surviving is higher in 2 and 3. Sincethere were too few rows for Parch >= 3, this part is not relevant."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.hist(column='Parch', by='Survived', sharey=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Pclass', 'Sex', 'Parch'])['Parch', 'SibSp', 'Survived'].agg({'Parch': np.size, 'SibSp': np.mean, 'Survived': np.mean})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that Parch has an important impact for men in first and second class. We are going to check the age."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('(Sex == \"male\") and (Pclass == [1, 2]) and (Parch == [1, 2])')[['Survived', 'Age']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that in those cases, the age is 27. We can compare with the rest of men if first and second class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.query('(Sex == \"male\") and (Pclass == [1, 2])')[['Survived', 'Age']].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We observe that there is a significant difference, so we suspect that this feature has impact of men in first and second class."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recap: Filling null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Age: null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We fill null values of Age with its median."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We create a new feature to maintain the original \n",
    "df['AgeFilled'] = df['Age'].fillna(df['Age'].median())\n",
    "df['AgeFilled'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: if Seaborn is imported, add sym='k.' to show the outliers\n",
    "df.boxplot(column='AgeFilled', return_type='axes', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative is the `interpolate()` method, which by default fills each gap linearly from the neighbouring rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['AgeFilled'] = df['Age'].interpolate()\n",
    "df['AgeFilled'].describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Note: if Seaborn is imported, add sym='k.' to show the outliers\n",
    "df.boxplot(column='AgeFilled', return_type='axes', sym='k.')"
   ]
  },
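  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between the two strategies concrete, here is a minimal sketch on a toy Series (the values are invented for illustration and are not taken from the Titanic data). Median filling puts the same value in every gap, whereas `interpolate()` with its default linear method fills each gap from the neighbouring rows, so its result depends on the row order.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy ages with two gaps (hypothetical values)\n",
    "ages = pd.Series([20.0, np.nan, 40.0, np.nan, 30.0])\n",
    "\n",
    "# Every gap receives the median of the known values\n",
    "median_filled = ages.fillna(ages.median())\n",
    "\n",
    "# Each gap receives a value on the line between its neighbours\n",
    "linear_filled = ages.interpolate()\n",
    "\n",
    "print(median_filled.tolist())\n",
    "print(linear_filled.tolist())\n",
    "```\n",
    "\n",
    "Since the passenger rows have no meaningful order, linear interpolation of Age is arguably harder to justify than the median; it is shown here as an option rather than a recommendation."
   ]
  },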
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Embarking: null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We saw that most passengers embarked at 'S'. There were also missing values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Embarked'].isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we discussed previously, we will replace these missing values with the most frequent value (the mode): S."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replace nulls with the most common value.\n",
    "# Assigning the result avoids calling inplace=True on a column selection.\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "df['Embarked'].isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Cabin: null values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to analyse Cabin in the exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Encoding categorical features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Recap: encoding categorical features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous notebook we saw how to encode categorical features. We are going to explore an alternative way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#df = df_original.copy()\n",
    "#df['SexEncoded'] = df.Sex\n",
    "#\n",
    "#df.loc[df[\"SexEncoded\"] == 'male', \"SexEncoded\"] = 0\n",
    "#df.loc[df[\"SexEncoded\"] == \"female\", \"SexEncoded\"] = 1\n",
    "#\n",
    "#df['EmbarkedEncoded'] = df.Embarked\n",
    "#df.loc[df[\"EmbarkedEncoded\"] == \"S\", \"EmbarkedEncoded\"] = 0\n",
    "#df.loc[df[\"EmbarkedEncoded\"] == \"C\", \"EmbarkedEncoded\"] = 1\n",
    "#df.loc[df[\"EmbarkedEncoded\"] == \"Q\", \"EmbarkedEncoded\"] = 2\n",
    "#df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Encoding Categorical Variables as Binary ones"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we saw previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable, and we can consider that an order exists in *Pclass*.\n",
    "\n",
    "Nevertheless, we are going to introduce a general approach to encode categorical variables using some facilities provided by scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**LabelEncoder** transforms categories into integers (0, 1, ...). We are going to use it for *Sex*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import LabelEncoder, OneHotEncoder\n",
    "\n",
    "df = df_original.copy() # take original df\n",
    "\n",
    "# The categorical column Sex has non-integer values, so we first convert it\n",
    "# into integers with LabelEncoder. This can be omitted if the values are already integers.\n",
    "\n",
    "label_enc = LabelEncoder()\n",
    "label_sex = label_enc.fit_transform(df['Sex'])\n",
    "df['SexCoded'] = label_sex\n",
    "\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, we see it has been easy and we have *Sex* as a binary variable.\n",
    "\n",
    "Now we are going to do the same with *Embarked* and *Pclass*. There are several alternatives in scikit-learn, such as *DictVectorizer* or *OneHotEncoder*.\n",
    "\n",
    "We are going to use *pd.get_dummies*, which provides a very easy-to-use way to encode categorical variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fill nulls with the most common value\n",
    "df['Embarked'] = df['Embarked'].fillna('S')\n",
    "df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])\n",
    "df.head()"
   ]
  },
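  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration of what this encoding does, the sketch below applies `pd.get_dummies` to a toy DataFrame (the rows are made up for the example). Each category becomes its own indicator column, named after the original column and the category value, so no artificial order is introduced.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data (hypothetical rows, not the Titanic file)\n",
    "toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S'],\n",
    "                    'Pclass': [1, 3, 2, 3]})\n",
    "\n",
    "# One indicator column per category of each listed column\n",
    "encoded = pd.get_dummies(toy, columns=['Embarked', 'Pclass'])\n",
    "\n",
    "print(encoded.columns.tolist())\n",
    "print(encoded['Embarked_S'].astype(int).tolist())\n",
    "```\n",
    "\n",
    "The original *Embarked* and *Pclass* columns are replaced by the indicator columns, which is why they no longer appear in the resulting DataFrame."
   ]
  },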
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning: dropping"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We should drop columns we will not use. In the exercise, you will need to use 'Cabin'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature Engineering is the process of using domain or expert knowledge of the data to create features that make machine learning algorithms work better. We are going to define several [new ones](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/) in the exercise."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Basic Feature Engineering with the Titanic Data](https://triangleinequality.wordpress.com/2013/09/08/basic-feature-engineering-with-the-titanic-data/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}