You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
sitc/ml2/3_5_Exercise_1.ipynb

845 lines
24 KiB
Plaintext

This file contains ambiguous Unicode characters!

This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise - The Titanic Dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise we are going to put in practice what we have learnt in the notebooks of the session. \n",
"\n",
"Answer directly in your copy of the exercise and submit it as a moodle task."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"sns.set(color_codes=True)\n",
"\n",
"# if matplotlib is not set inline, you will not see plots\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assign the variable *df* a Dataframe with the Titanic Dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
"\n",
"Print *df*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Munging and Exploratory visualisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain number of passengers and features of the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain general statistics (count, mean, std, min, max, 25%, 50%, 75%) about the column Age"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain the median of the age of the passengers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obtain number of missing values per feature"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many passsengers have survived? List them grouped by Sex and Pclass.\n",
"\n",
"Assign the result to a variable df_1 and print it"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visualise df_1 as an histogram."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Engineering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here you can find some features that have been proposed for this dataset. Your task is to analyse them and provide some insights. \n",
"\n",
"Use pandas and visualisation to justify your conclusions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature FamilySize "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Regarding SbSp and Parch, we can define a new feature, 'FamilySize' that is the combination of both."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>PassengerId</th>\n",
" <th>Survived</th>\n",
" <th>Pclass</th>\n",
" <th>Name</th>\n",
" <th>Sex</th>\n",
" <th>Age</th>\n",
" <th>SibSp</th>\n",
" <th>Parch</th>\n",
" <th>Ticket</th>\n",
" <th>Fare</th>\n",
" <th>Cabin</th>\n",
" <th>Embarked</th>\n",
" <th>FamilySize</th>\n",
" <th>AgeGroup</th>\n",
" <th>Deck</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Braund, Mr. Owen Harris</td>\n",
" <td>male</td>\n",
" <td>22.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>A/5 21171</td>\n",
" <td>7.2500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
" <td>female</td>\n",
" <td>38.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>PC 17599</td>\n",
" <td>71.2833</td>\n",
" <td>C85</td>\n",
" <td>C</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>3</td>\n",
" <td>Heikkinen, Miss. Laina</td>\n",
" <td>female</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>STON/O2. 3101282</td>\n",
" <td>7.9250</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
" <td>female</td>\n",
" <td>35.0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>113803</td>\n",
" <td>53.1000</td>\n",
" <td>C123</td>\n",
" <td>S</td>\n",
" <td>1</td>\n",
" <td>3.0</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Allen, Mr. William Henry</td>\n",
" <td>male</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>373450</td>\n",
" <td>8.0500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>886</th>\n",
" <td>887</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>Montvila, Rev. Juozas</td>\n",
" <td>male</td>\n",
" <td>27.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>211536</td>\n",
" <td>13.0000</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>887</th>\n",
" <td>888</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Graham, Miss. Margaret Edith</td>\n",
" <td>female</td>\n",
" <td>19.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>112053</td>\n",
" <td>30.0000</td>\n",
" <td>B42</td>\n",
" <td>S</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>B</td>\n",
" </tr>\n",
" <tr>\n",
" <th>888</th>\n",
" <td>889</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
" <td>female</td>\n",
" <td>NaN</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>W./C. 6607</td>\n",
" <td>23.4500</td>\n",
" <td>NaN</td>\n",
" <td>S</td>\n",
" <td>3</td>\n",
" <td>NaN</td>\n",
" <td>X</td>\n",
" </tr>\n",
" <tr>\n",
" <th>889</th>\n",
" <td>890</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>Behr, Mr. Karl Howell</td>\n",
" <td>male</td>\n",
" <td>26.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>111369</td>\n",
" <td>30.0000</td>\n",
" <td>C148</td>\n",
" <td>C</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>C</td>\n",
" </tr>\n",
" <tr>\n",
" <th>890</th>\n",
" <td>891</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>Dooley, Mr. Patrick</td>\n",
" <td>male</td>\n",
" <td>32.0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>370376</td>\n",
" <td>7.7500</td>\n",
" <td>NaN</td>\n",
" <td>Q</td>\n",
" <td>0</td>\n",
" <td>3.0</td>\n",
" <td>X</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>891 rows × 15 columns</p>\n",
"</div>"
],
"text/plain": [
" PassengerId Survived Pclass \\\n",
"0 1 0 3 \n",
"1 2 1 1 \n",
"2 3 1 3 \n",
"3 4 1 1 \n",
"4 5 0 3 \n",
".. ... ... ... \n",
"886 887 0 2 \n",
"887 888 1 1 \n",
"888 889 0 3 \n",
"889 890 1 1 \n",
"890 891 0 3 \n",
"\n",
" Name Sex Age SibSp \\\n",
"0 Braund, Mr. Owen Harris male 22.0 1 \n",
"1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n",
"2 Heikkinen, Miss. Laina female 26.0 0 \n",
"3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n",
"4 Allen, Mr. William Henry male 35.0 0 \n",
".. ... ... ... ... \n",
"886 Montvila, Rev. Juozas male 27.0 0 \n",
"887 Graham, Miss. Margaret Edith female 19.0 0 \n",
"888 Johnston, Miss. Catherine Helen \"Carrie\" female NaN 1 \n",
"889 Behr, Mr. Karl Howell male 26.0 0 \n",
"890 Dooley, Mr. Patrick male 32.0 0 \n",
"\n",
" Parch Ticket Fare Cabin Embarked FamilySize AgeGroup \\\n",
"0 0 A/5 21171 7.2500 NaN S 1 3.0 \n",
"1 0 PC 17599 71.2833 C85 C 1 3.0 \n",
"2 0 STON/O2. 3101282 7.9250 NaN S 0 3.0 \n",
"3 0 113803 53.1000 C123 S 1 3.0 \n",
"4 0 373450 8.0500 NaN S 0 3.0 \n",
".. ... ... ... ... ... ... ... \n",
"886 0 211536 13.0000 NaN S 0 3.0 \n",
"887 0 112053 30.0000 B42 S 0 3.0 \n",
"888 2 W./C. 6607 23.4500 NaN S 3 NaN \n",
"889 0 111369 30.0000 C148 C 0 3.0 \n",
"890 0 370376 7.7500 NaN Q 0 3.0 \n",
"\n",
" Deck \n",
"0 X \n",
"1 C \n",
"2 X \n",
"3 C \n",
"4 X \n",
".. ... \n",
"886 X \n",
"887 B \n",
"888 X \n",
"889 C \n",
"890 X \n",
"\n",
"[891 rows x 15 columns]"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df['FamilySize'] = df['SibSp'] + df['Parch']\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Alone"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It seems many people who went alone survived. We can define a new feature 'Alone'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Alone'] = (df.FamilySize == 0)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Salutation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we observe well in the name variable, there is a 'title' (Mr., Miss., Mrs.). We can add a feature wit this title."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Taken from http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/\n",
"def name_extract(word):\n",
" return word.split(',')[1].split('.')[0].strip()\n",
"\n",
"df['Salutation'] = df['Name'].apply(name_extract)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can list the different salutations."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Salutation'].unique()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.groupby(['Salutation']).size()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There only 4 main salutations, so we combine the rest of salutations in 'Others'."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-2-515fd9f54fd1>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 13\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0;32mreturn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Others'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Salutation'\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Salutation'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mapply\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mgroup_salutation\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 16\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mgroupby\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'Salutation'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mNameError\u001b[0m: name 'df' is not defined"
]
}
],
"source": [
"def group_salutation(old_salutation):\n",
" if old_salutation == 'Mr':\n",
" return('Mr')\n",
" else:\n",
" if old_salutation == 'Mrs':\n",
" return('Mrs')\n",
" else:\n",
" if old_salutation == 'Master':\n",
" return('Master')\n",
" else: \n",
" if old_salutation == 'Miss':\n",
" return('Miss')\n",
" else:\n",
" return('Others')\n",
"df['Salutation'] = df['Salutation'].apply(group_salutation)\n",
"df.groupby(['Salutation']).size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Distribution\n",
"colors_sex = ['#ff69b4', 'b', 'r', 'y', 'm', 'c']\n",
"df.groupby('Salutation').size().plot(kind='bar', color=colors_sex)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.boxplot(column='Age', by = 'Salutation', sym='k.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Features Children and Female"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Specific features for Children and Female since there are more survivors\n",
"df['Children'] = df['Age'].map(lambda x: 1 if x < 6.0 else 0)\n",
"df['Female'] = df['Sex'].map(lambda x: 1 if x == \"female\" else 0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature AgeGroup"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
"df['AgeGroup'] = np.nan\n",
"df.loc[(df.Age<6),'AgeGroup'] = 0\n",
"df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n",
"df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n",
"df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n",
"df.loc[(df.Age>=60),'AgeGroup'] = 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Deck\n",
"Only 1st class passengers have cabins, the rest are Unknown. A cabin number looks like C123. The letter refers to the deck."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"def substrings_in_string(big_string, substrings):\n",
" if type(big_string) == float:\n",
" if np.isnan(big_string):\n",
" return 'X'\n",
" for substring in substrings:\n",
" if substring in big_string:\n",
" return substring[0::]\n",
" print(big_string)\n",
" return 'X'\n",
" \n",
"#Turning cabin number into Deck\n",
"cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']\n",
"df['Deck']=df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature FarePerPerson"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This feature is created from two previous features: Fare and FamilySize."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['FarePerPerson']= df['Fare'] / (df['FamilySize'] + 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature AgeClass"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since age and class are both numbers we can just multiply them and get a new feature.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['AgeClass']=df['Age']*df['Pclass']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}