sitc/ml2/3_5_Exercise_1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise - The Titanic Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this exercise we are going to put in practice what we have learnt in the notebooks of the session. \n",
    "\n",
    "Answer directly in your copy of the exercise and submit it as a moodle task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "# if matplotlib is not set inline, you will not see plots\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv.\n",
    "\n",
    "Print *df*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Munging and Exploratory visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain number of passengers and features of the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain general statistics (count, mean, std, min, max, 25%, 50%, 75%) about the column Age"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain the median of the age of the passengers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain number of missing values per feature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many passsengers have survived? List them grouped by Sex and Pclass.\n",
    "\n",
    "Assign the result to a variable df_1 and print it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualise df_1 as an histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here you can find some features that have been proposed for this dataset. Your task is to analyse them and provide some insights. \n",
    "\n",
    "Use pandas and visualisation to justify your conclusions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature FamilySize "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regarding SbSp and Parch, we can define a new feature, 'FamilySize' that is the combination of both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['FamilySize'] = df['SibSp'] + df['Parch']\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Alone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems many people who went alone survived. We can define a new feature 'Alone'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Alone'] = (df.FamilySize == 0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Salutation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we observe well in the name variable, there is a 'title' (Mr., Miss., Mrs.). We can add a feature wit this title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Taken from http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/\n",
    "def name_extract(word):\n",
    "    return word.split(',')[1].split('.')[0].strip()\n",
    "\n",
    "df['Salutation'] = df['Name'].apply(name_extract)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can list the different salutations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Salutation'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Salutation']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There only 4 main salutations, so we combine the rest of salutations in 'Others'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def group_salutation(old_salutation):\n",
    "    if old_salutation == 'Mr':\n",
    "        return('Mr')\n",
    "    else:\n",
    "        if old_salutation == 'Mrs':\n",
    "            return('Mrs')\n",
    "        else:\n",
    "            if old_salutation == 'Master':\n",
    "                return('Master')\n",
    "            else: \n",
    "                if old_salutation == 'Miss':\n",
    "                    return('Miss')\n",
    "                else:\n",
    "                    return('Others')\n",
    "df['Salutation'] = df['Salutation'].apply(group_salutation)\n",
    "df.groupby(['Salutation']).size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution\n",
    "colors_sex = ['#ff69b4', 'b', 'r', 'y', 'm', 'c']\n",
    "df.groupby('Salutation').size().plot(kind='bar', color=colors_sex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Age', by = 'Salutation', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Features Children and Female"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specific features for Children and Female since there are more survivors\n",
    "df['Children']   = df['Age'].map(lambda x: 1 if x < 6.0 else 0)\n",
    "df['Female']     = df['Sex'].map(lambda x: 1 if x == \"female\" else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature AgeGroup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group ages to simplify machine learning algorithms.  0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
    "df['AgeGroup'] = np.nan\n",
    "df.loc[(df.Age<6),'AgeGroup'] = 0\n",
    "df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n",
    "df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n",
    "df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n",
    "df.loc[(df.Age>=60),'AgeGroup'] = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Deck\n",
    "Only 1st class passengers have cabins, the rest are ‘Unknown’. A cabin number looks like ‘C123’. The letter refers to the deck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def substrings_in_string(big_string, substrings):\n",
    "    if type(big_string) == float:\n",
    "        if np.isnan(big_string):\n",
    "            return 'X'\n",
    "    for substring in substrings:\n",
    "        if substring in big_string:\n",
    "            return substring[0::]\n",
    "    print(big_string)\n",
    "    return 'X'\n",
    " \n",
    "#Turning cabin number into Deck\n",
    "cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']\n",
    "df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature FarePerPerson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This feature is created from two previous features: Fare and FamilySize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['FarePerPerson']= df['Fare'] / (df['FamilySize'] + 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature AgeClass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since age and class are both numbers we can just multiply them and get a new feature.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['AgeClass']=df['Age']*df['Pclass']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}