mirror of https://github.com/gsi-upm/sitc, synced 2025-11-04 01:18:16 +00:00
{
 | 
						||
 "cells": [
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    ""
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "# Course Notes for Learning Intelligent Systems"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "# Exercise - The Titanic Dataset"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "In this exercise we are going to put into practice what we have learnt in the notebooks of the session. \n",
    "\n",
    "Answer directly in your copy of the exercise and submit it as a Moodle task."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": 11,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "import pandas as pd\n",
 | 
						||
    "\n",
 | 
						||
    "import seaborn as sns\n",
 | 
						||
    "import matplotlib.pyplot as plt\n",
 | 
						||
    "import numpy as np\n",
 | 
						||
    "sns.set(color_codes=True)\n",
 | 
						||
    "\n",
 | 
						||
    "# if matplotlib is not set inline, you will not see plots\n",
 | 
						||
    "%matplotlib inline"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "# Reading Data"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "Assign to the variable *df* a DataFrame with the Titanic dataset from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\n",
    "\n",
    "Print *df*."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "# Munging and Exploratory visualisation"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "Obtain the number of passengers and features of the dataset."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "Obtain general statistics (count, mean, std, min, max, 25%, 50%, 75%) about the column Age"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "Obtain the median of the age of the passengers"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "Obtain the number of missing values per feature."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "How many passengers survived? List them grouped by Sex and Pclass.\n",
    "\n",
    "Assign the result to a variable df_1 and print it."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "Visualise df_1 as a histogram."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": []
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "# Feature Engineering"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "Here you can find some features that have been proposed for this dataset. Your task is to analyse them and provide some insights. \n",
 | 
						||
    "\n",
 | 
						||
    "Use pandas and visualisation to justify your conclusions"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature FamilySize "
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "From SibSp and Parch, we can define a new feature, 'FamilySize', that is the sum of both."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": 20,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [
 | 
						||
    {
 | 
						||
     "data": {
 | 
						||
      "text/html": [
 | 
						||
       "<div>\n",
 | 
						||
       "<style scoped>\n",
 | 
						||
       "    .dataframe tbody tr th:only-of-type {\n",
 | 
						||
       "        vertical-align: middle;\n",
 | 
						||
       "    }\n",
 | 
						||
       "\n",
 | 
						||
       "    .dataframe tbody tr th {\n",
 | 
						||
       "        vertical-align: top;\n",
 | 
						||
       "    }\n",
 | 
						||
       "\n",
 | 
						||
       "    .dataframe thead th {\n",
 | 
						||
       "        text-align: right;\n",
 | 
						||
       "    }\n",
 | 
						||
       "</style>\n",
 | 
						||
       "<table border=\"1\" class=\"dataframe\">\n",
 | 
						||
       "  <thead>\n",
 | 
						||
       "    <tr style=\"text-align: right;\">\n",
 | 
						||
       "      <th></th>\n",
 | 
						||
       "      <th>PassengerId</th>\n",
 | 
						||
       "      <th>Survived</th>\n",
 | 
						||
       "      <th>Pclass</th>\n",
 | 
						||
       "      <th>Name</th>\n",
 | 
						||
       "      <th>Sex</th>\n",
 | 
						||
       "      <th>Age</th>\n",
 | 
						||
       "      <th>SibSp</th>\n",
 | 
						||
       "      <th>Parch</th>\n",
 | 
						||
       "      <th>Ticket</th>\n",
 | 
						||
       "      <th>Fare</th>\n",
 | 
						||
       "      <th>Cabin</th>\n",
 | 
						||
       "      <th>Embarked</th>\n",
 | 
						||
       "      <th>FamilySize</th>\n",
 | 
						||
       "      <th>AgeGroup</th>\n",
 | 
						||
       "      <th>Deck</th>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "  </thead>\n",
 | 
						||
       "  <tbody>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>0</th>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>Braund, Mr. Owen Harris</td>\n",
 | 
						||
       "      <td>male</td>\n",
 | 
						||
       "      <td>22.0</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>A/5 21171</td>\n",
 | 
						||
       "      <td>7.2500</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>1</th>\n",
 | 
						||
       "      <td>2</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>\n",
 | 
						||
       "      <td>female</td>\n",
 | 
						||
       "      <td>38.0</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>PC 17599</td>\n",
 | 
						||
       "      <td>71.2833</td>\n",
 | 
						||
       "      <td>C85</td>\n",
 | 
						||
       "      <td>C</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>C</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>2</th>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>Heikkinen, Miss. Laina</td>\n",
 | 
						||
       "      <td>female</td>\n",
 | 
						||
       "      <td>26.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>STON/O2. 3101282</td>\n",
 | 
						||
       "      <td>7.9250</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>3</th>\n",
 | 
						||
       "      <td>4</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>\n",
 | 
						||
       "      <td>female</td>\n",
 | 
						||
       "      <td>35.0</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>113803</td>\n",
 | 
						||
       "      <td>53.1000</td>\n",
 | 
						||
       "      <td>C123</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>C</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>4</th>\n",
 | 
						||
       "      <td>5</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>Allen, Mr. William Henry</td>\n",
 | 
						||
       "      <td>male</td>\n",
 | 
						||
       "      <td>35.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>373450</td>\n",
 | 
						||
       "      <td>8.0500</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>...</th>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "      <td>...</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>886</th>\n",
 | 
						||
       "      <td>887</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>2</td>\n",
 | 
						||
       "      <td>Montvila, Rev. Juozas</td>\n",
 | 
						||
       "      <td>male</td>\n",
 | 
						||
       "      <td>27.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>211536</td>\n",
 | 
						||
       "      <td>13.0000</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>887</th>\n",
 | 
						||
       "      <td>888</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>Graham, Miss. Margaret Edith</td>\n",
 | 
						||
       "      <td>female</td>\n",
 | 
						||
       "      <td>19.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>112053</td>\n",
 | 
						||
       "      <td>30.0000</td>\n",
 | 
						||
       "      <td>B42</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>B</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>888</th>\n",
 | 
						||
       "      <td>889</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>Johnston, Miss. Catherine Helen \"Carrie\"</td>\n",
 | 
						||
       "      <td>female</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>2</td>\n",
 | 
						||
       "      <td>W./C. 6607</td>\n",
 | 
						||
       "      <td>23.4500</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>S</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>889</th>\n",
 | 
						||
       "      <td>890</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>1</td>\n",
 | 
						||
       "      <td>Behr, Mr. Karl Howell</td>\n",
 | 
						||
       "      <td>male</td>\n",
 | 
						||
       "      <td>26.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>111369</td>\n",
 | 
						||
       "      <td>30.0000</td>\n",
 | 
						||
       "      <td>C148</td>\n",
 | 
						||
       "      <td>C</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>C</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "    <tr>\n",
 | 
						||
       "      <th>890</th>\n",
 | 
						||
       "      <td>891</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3</td>\n",
 | 
						||
       "      <td>Dooley, Mr. Patrick</td>\n",
 | 
						||
       "      <td>male</td>\n",
 | 
						||
       "      <td>32.0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>370376</td>\n",
 | 
						||
       "      <td>7.7500</td>\n",
 | 
						||
       "      <td>NaN</td>\n",
 | 
						||
       "      <td>Q</td>\n",
 | 
						||
       "      <td>0</td>\n",
 | 
						||
       "      <td>3.0</td>\n",
 | 
						||
       "      <td>X</td>\n",
 | 
						||
       "    </tr>\n",
 | 
						||
       "  </tbody>\n",
 | 
						||
       "</table>\n",
 | 
						||
       "<p>891 rows × 15 columns</p>\n",
 | 
						||
       "</div>"
 | 
						||
      ],
 | 
						||
      "text/plain": [
 | 
						||
       "     PassengerId  Survived  Pclass  \\\n",
 | 
						||
       "0              1         0       3   \n",
 | 
						||
       "1              2         1       1   \n",
 | 
						||
       "2              3         1       3   \n",
 | 
						||
       "3              4         1       1   \n",
 | 
						||
       "4              5         0       3   \n",
 | 
						||
       "..           ...       ...     ...   \n",
 | 
						||
       "886          887         0       2   \n",
 | 
						||
       "887          888         1       1   \n",
 | 
						||
       "888          889         0       3   \n",
 | 
						||
       "889          890         1       1   \n",
 | 
						||
       "890          891         0       3   \n",
 | 
						||
       "\n",
 | 
						||
       "                                                  Name     Sex   Age  SibSp  \\\n",
 | 
						||
       "0                              Braund, Mr. Owen Harris    male  22.0      1   \n",
 | 
						||
       "1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   \n",
 | 
						||
       "2                               Heikkinen, Miss. Laina  female  26.0      0   \n",
 | 
						||
       "3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   \n",
 | 
						||
       "4                             Allen, Mr. William Henry    male  35.0      0   \n",
 | 
						||
       "..                                                 ...     ...   ...    ...   \n",
 | 
						||
       "886                              Montvila, Rev. Juozas    male  27.0      0   \n",
 | 
						||
       "887                       Graham, Miss. Margaret Edith  female  19.0      0   \n",
 | 
						||
       "888           Johnston, Miss. Catherine Helen \"Carrie\"  female   NaN      1   \n",
 | 
						||
       "889                              Behr, Mr. Karl Howell    male  26.0      0   \n",
 | 
						||
       "890                                Dooley, Mr. Patrick    male  32.0      0   \n",
 | 
						||
       "\n",
 | 
						||
       "     Parch            Ticket     Fare Cabin Embarked  FamilySize  AgeGroup  \\\n",
 | 
						||
       "0        0         A/5 21171   7.2500   NaN        S           1       3.0   \n",
 | 
						||
       "1        0          PC 17599  71.2833   C85        C           1       3.0   \n",
 | 
						||
       "2        0  STON/O2. 3101282   7.9250   NaN        S           0       3.0   \n",
 | 
						||
       "3        0            113803  53.1000  C123        S           1       3.0   \n",
 | 
						||
       "4        0            373450   8.0500   NaN        S           0       3.0   \n",
 | 
						||
       "..     ...               ...      ...   ...      ...         ...       ...   \n",
 | 
						||
       "886      0            211536  13.0000   NaN        S           0       3.0   \n",
 | 
						||
       "887      0            112053  30.0000   B42        S           0       3.0   \n",
 | 
						||
       "888      2        W./C. 6607  23.4500   NaN        S           3       NaN   \n",
 | 
						||
       "889      0            111369  30.0000  C148        C           0       3.0   \n",
 | 
						||
       "890      0            370376   7.7500   NaN        Q           0       3.0   \n",
 | 
						||
       "\n",
 | 
						||
       "    Deck  \n",
 | 
						||
       "0      X  \n",
 | 
						||
       "1      C  \n",
 | 
						||
       "2      X  \n",
 | 
						||
       "3      C  \n",
 | 
						||
       "4      X  \n",
 | 
						||
       "..   ...  \n",
 | 
						||
       "886    X  \n",
 | 
						||
       "887    B  \n",
 | 
						||
       "888    X  \n",
 | 
						||
       "889    C  \n",
 | 
						||
       "890    X  \n",
 | 
						||
       "\n",
 | 
						||
       "[891 rows x 15 columns]"
 | 
						||
      ]
 | 
						||
     },
 | 
						||
     "execution_count": 20,
 | 
						||
     "metadata": {},
 | 
						||
     "output_type": "execute_result"
 | 
						||
    }
 | 
						||
   ],
 | 
						||
   "source": [
 | 
						||
    "df['FamilySize'] = df['SibSp'] + df['Parch']\n",
 | 
						||
    "df"
 | 
						||
   ]
 | 
						||
  },
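One possible way to analyse FamilySize is to compare the survival rate across family sizes. The sketch below uses a small made-up table (the `toy` DataFrame and its values are hypothetical, not rows of the Titanic file); with the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data, NOT the real Titanic file; column names mirror the dataset.
toy = pd.DataFrame({
    "SibSp":    [1, 0, 0, 3, 0, 1],
    "Parch":    [0, 0, 2, 1, 0, 1],
    "Survived": [0, 1, 1, 0, 0, 1],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"]

# The mean of a 0/1 column is the survival rate; count shows how many rows support it.
family_survival = toy.groupby("FamilySize")["Survived"].agg(["mean", "count"])
print(family_survival)
```

Looking at the count next to the rate helps avoid drawing conclusions from groups with very few passengers.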
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature Alone"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "It seems that many passengers who travelled alone survived. We can define a new feature, 'Alone'."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df['Alone'] = (df.FamilySize == 0)\n",
 | 
						||
    "df.head()"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature Salutation"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "If we look closely at the Name variable, it contains a title (Mr., Miss., Mrs.). We can add a feature with this title."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "#Taken from http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/\n",
 | 
						||
    "def name_extract(word):\n",
 | 
						||
    "    return word.split(',')[1].split('.')[0].strip()\n",
 | 
						||
    "\n",
 | 
						||
    "df['Salutation'] = df['Name'].apply(name_extract)\n",
 | 
						||
    "df.head()"
 | 
						||
   ]
 | 
						||
  },
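As an alternative sketch to the `split`-based function above, the title can also be pulled out with a regular expression through pandas' vectorised string methods. The variable names `names` and `salutation` are illustrative only, and the regular expression assumes every name follows the "Surname, Title. Given names" pattern.

```python
import pandas as pd

# Three names in the "Surname, Title. Given names" format used by the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Capture the text between the comma and the first following period.
salutation = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(salutation.tolist())
```

A name that does not match the pattern yields a missing value instead of raising an error, which may or may not be preferable to the `IndexError` that the `split` version would raise.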
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "We can list the different salutations."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df['Salutation'].unique()"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df.groupby(['Salutation']).size()"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "There are only 4 main salutations, so we combine the remaining ones into 'Others'."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": 2,
 | 
						||
   "metadata": {
 | 
						||
    "collapsed": true
 | 
						||
   },
   "outputs": [],
 | 
						||
   "source": [
    "def group_salutation(old_salutation):\n",
    "    # Keep the four most frequent titles; map every other title to 'Others'\n",
    "    if old_salutation in ('Mr', 'Mrs', 'Master', 'Miss'):\n",
    "        return old_salutation\n",
    "    return 'Others'\n",
    "\n",
    "df['Salutation'] = df['Salutation'].apply(group_salutation)\n",
    "df.groupby(['Salutation']).size()"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "# Distribution\n",
 | 
						||
    "colors_sex = ['#ff69b4', 'b', 'r', 'y', 'm', 'c']\n",
 | 
						||
    "df.groupby('Salutation').size().plot(kind='bar', color=colors_sex)"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df.boxplot(column='Age', by = 'Salutation', sym='k.')"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Features Children and Female"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
    "# Specific features for children and women, since these groups have more survivors\n",
    "# Note: a missing Age (NaN) fails the comparison, so it is mapped to 0\n",
    "df['Children']   = df['Age'].map(lambda x: 1 if x < 6.0 else 0)\n",
    "df['Female']     = df['Sex'].map(lambda x: 1 if x == \"female\" else 0)"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature AgeGroup"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": 12,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "# Group ages to simplify machine learning algorithms.  0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
 | 
						||
    "df['AgeGroup'] = np.nan\n",
 | 
						||
    "df.loc[(df.Age<6),'AgeGroup'] = 0\n",
 | 
						||
    "df.loc[(df.Age>=6) & (df.Age < 11),'AgeGroup'] = 1\n",
 | 
						||
    "df.loc[(df.Age>=11) & (df.Age < 16),'AgeGroup'] = 2\n",
 | 
						||
    "df.loc[(df.Age>=16) & (df.Age < 60),'AgeGroup'] = 3\n",
 | 
						||
    "df.loc[(df.Age>=60),'AgeGroup'] = 4"
 | 
						||
   ]
 | 
						||
  },
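The same binning can be sketched with `pd.cut`, which may be easier to maintain than a chain of `.loc` assignments. This is only an illustration on a hypothetical `ages` Series; the bin edges are taken from the comment in the cell above, with `np.inf` assumed as an open upper bound.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, chosen to hit every bin edge plus one missing value.
ages = pd.Series([4, 6, 15.5, 16, 59, 60, np.nan])

# right=False makes each bin closed on the left: [0, 6), [6, 11), [11, 16), [16, 60), [60, inf)
bins = [0, 6, 11, 16, 60, np.inf]
age_group = pd.cut(ages, bins=bins, right=False, labels=False)
print(age_group.tolist())
```

Missing ages stay missing in the result, matching the behaviour of initialising the column with `np.nan`.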
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "## Feature Deck\n",
    "Cabin is recorded mainly for 1st class passengers; for the rest it is missing, and the code below labels those as ‘X’. A cabin number looks like ‘C123’. The letter refers to the deck."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": 14,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
    "def substrings_in_string(big_string, substrings):\n",
    "    # Missing cabins are read as NaN (a float): label them 'X'\n",
    "    if isinstance(big_string, float) and np.isnan(big_string):\n",
    "        return 'X'\n",
    "    # Return the first deck letter of the list that appears in the cabin string\n",
    "    for substring in substrings:\n",
    "        if substring in big_string:\n",
    "            return substring\n",
    "    return 'X'\n",
    "\n",
    "# Turn the cabin number into the deck letter\n",
    "cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']\n",
    "df['Deck'] = df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))"
 | 
						||
   ]
 | 
						||
  },
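A shorter sketch of the same idea, assuming the deck is always the first character of the cabin string, uses the vectorised `.str` accessor. The `cabins` Series below is a hypothetical sample; note that taking the first character is not identical to the substring search above when a cabin entry lists several cabins on different decks.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of Cabin values, including one missing entry.
cabins = pd.Series(["C85", np.nan, "C123", "B42"])

# Take the first character as the deck; missing cabins become 'X'.
deck = cabins.str[0].fillna("X")
print(deck.tolist())
```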
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature FarePerPerson"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "This feature is created from two previous features: Fare and FamilySize."
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df['FarePerPerson']= df['Fare'] / (df['FamilySize'] + 1)"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Feature AgeClass"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "Since age and class are both numbers we can just multiply them and get a new feature.\n"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "code",
 | 
						||
   "execution_count": null,
 | 
						||
   "metadata": {},
 | 
						||
   "outputs": [],
 | 
						||
   "source": [
 | 
						||
    "df['AgeClass']=df['Age']*df['Pclass']"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
 | 
						||
    "## Licence"
 | 
						||
   ]
 | 
						||
  },
 | 
						||
  {
 | 
						||
   "cell_type": "markdown",
 | 
						||
   "metadata": {},
 | 
						||
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
 | 
						||
    "\n",
 | 
						||
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
 | 
						||
   ]
 | 
						||
  }
 | 
						||
 ],
 | 
						||
 "metadata": {
 | 
						||
  "datacleaner": {
 | 
						||
   "position": {
 | 
						||
    "top": "50px"
 | 
						||
   },
 | 
						||
   "python": {
 | 
						||
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
 | 
						||
   },
 | 
						||
   "window_display": false
 | 
						||
  },
 | 
						||
  "kernelspec": {
 | 
						||
   "display_name": "Python 3",
 | 
						||
   "language": "python",
 | 
						||
   "name": "python3"
 | 
						||
  },
 | 
						||
  "language_info": {
 | 
						||
   "codemirror_mode": {
 | 
						||
    "name": "ipython",
 | 
						||
    "version": 3
 | 
						||
   },
 | 
						||
   "file_extension": ".py",
 | 
						||
   "mimetype": "text/x-python",
 | 
						||
   "name": "python",
 | 
						||
   "nbconvert_exporter": "python",
 | 
						||
   "pygments_lexer": "ipython3",
 | 
						||
   "version": "3.8.8"
 | 
						||
  },
 | 
						||
  "latex_envs": {
 | 
						||
   "LaTeX_envs_menu_present": true,
 | 
						||
   "autocomplete": true,
 | 
						||
   "bibliofile": "biblio.bib",
 | 
						||
   "cite_by": "apalike",
 | 
						||
   "current_citInitial": 1,
 | 
						||
   "eqLabelWithNumbers": true,
 | 
						||
   "eqNumInitial": 1,
 | 
						||
   "hotkeys": {
 | 
						||
    "equation": "Ctrl-E",
 | 
						||
    "itemize": "Ctrl-I"
 | 
						||
   },
 | 
						||
   "labels_anchors": false,
 | 
						||
   "latex_user_defs": false,
 | 
						||
   "report_style_numbering": false,
 | 
						||
   "user_envs_cfg": false
 | 
						||
  }
 | 
						||
 },
 | 
						||
 "nbformat": 4,
 | 
						||
 "nbformat_minor": 1
 | 
						||
}
 |