Mirror of https://github.com/gsi-upm/sitc (synced 2025-10-25 20:58:19 +00:00)
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    ""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Systems Engineering, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercise - The Titanic Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this exercise we are going to put into practice what we have learnt in this session's notebooks.\n",
    "\n",
    "Answer directly in your copy of the exercise and submit it as a Moodle task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "# Without the inline backend, plots are not displayed in the notebook\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assign to the variable *df* a DataFrame with the Titanic dataset, read from the URL https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv.\n",
    "\n",
    "Print *df*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Munging and Exploratory Visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain the number of passengers and the number of features in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain general statistics (count, mean, std, min, max, 25%, 50%, 75%) for the Age column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain the median age of the passengers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Obtain the number of missing values per feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How many passengers survived? List them grouped by Sex and Pclass.\n",
    "\n",
    "Assign the result to a variable df_1 and print it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Visualise df_1 as a histogram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here you can find some features that have been proposed for this dataset. Your task is to analyse them and provide some insights.\n",
    "\n",
    "Use pandas and visualisation to justify your conclusions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature FamilySize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Regarding SibSp and Parch, we can define a new feature, 'FamilySize', that is the sum of both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['FamilySize'] = df['SibSp'] + df['Parch']\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Alone"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It seems many people who travelled alone survived. We can define a new feature, 'Alone'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Alone'] = (df.FamilySize == 0)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Salutation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we look closely at the Name variable, it contains a title (Mr., Miss., Mrs.). We can add a feature with this title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Taken from http://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/\n",
    "def name_extract(word):\n",
    "    # Names look like 'Surname, Title. First names': keep the text between ',' and '.'\n",
    "    return word.split(',')[1].split('.')[0].strip()\n",
    "\n",
    "df['Salutation'] = df['Name'].apply(name_extract)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can list the different salutations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Salutation'].unique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby(['Salutation']).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are only 4 main salutations, so we combine the remaining salutations into 'Others'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def group_salutation(old_salutation):\n",
    "    # Keep the four main salutations and group the rest under 'Others'\n",
    "    if old_salutation in ('Mr', 'Mrs', 'Master', 'Miss'):\n",
    "        return old_salutation\n",
    "    return 'Others'\n",
    "\n",
    "df['Salutation'] = df['Salutation'].apply(group_salutation)\n",
    "df.groupby(['Salutation']).size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Distribution of passengers per salutation\n",
    "colors_sex = ['#ff69b4', 'b', 'r', 'y', 'm', 'c']\n",
    "df.groupby('Salutation').size().plot(kind='bar', color=colors_sex)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.boxplot(column='Age', by='Salutation', sym='k.')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Features Children and Female"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Specific features for children and females, since these groups have more survivors\n",
    "df['Children'] = df['Age'].map(lambda x: 1 if x < 6.0 else 0)\n",
    "df['Female'] = df['Sex'].map(lambda x: 1 if x == \"female\" else 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature AgeGroup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group ages to simplify machine learning algorithms. 0: 0-5, 1: 6-10, 2: 11-15, 3: 16-59 and 4: 60-80\n",
    "# Passengers with a missing Age keep AgeGroup = NaN\n",
    "df['AgeGroup'] = np.nan\n",
    "df.loc[(df.Age < 6), 'AgeGroup'] = 0\n",
    "df.loc[(df.Age >= 6) & (df.Age < 11), 'AgeGroup'] = 1\n",
    "df.loc[(df.Age >= 11) & (df.Age < 16), 'AgeGroup'] = 2\n",
    "df.loc[(df.Age >= 16) & (df.Age < 60), 'AgeGroup'] = 3\n",
    "df.loc[(df.Age >= 60), 'AgeGroup'] = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature Deck\n",
    "Mostly 1st class passengers have a recorded cabin; for the rest it is usually unknown (missing). A cabin number looks like 'C123'. The letter refers to the deck."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def substrings_in_string(big_string, substrings):\n",
    "    # Missing cabins are read as NaN (a float): map them to the placeholder deck 'X'\n",
    "    if isinstance(big_string, float) and np.isnan(big_string):\n",
    "        return 'X'\n",
    "    # Return the first deck label contained in the cabin string\n",
    "    for substring in substrings:\n",
    "        if substring in big_string:\n",
    "            return substring\n",
    "    # No known deck label found: show the value and fall back to 'X'\n",
    "    print(big_string)\n",
    "    return 'X'\n",
    "\n",
    "# Turn the cabin number into a Deck label\n",
    "cabin_list = ['A', 'B', 'C', 'D', 'E', 'F', 'T', 'G', 'Unknown']\n",
    "df['Deck'] = df['Cabin'].map(lambda x: substrings_in_string(x, cabin_list))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature FarePerPerson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This feature is created from two previous features: Fare and FamilySize."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# FamilySize does not count the passenger, so add 1 to divide by the whole group\n",
    "df['FarePerPerson'] = df['Fare'] / (df['FamilySize'] + 1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Feature AgeClass"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since age and class are both numbers, we can just multiply them and get a new feature.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['AgeClass'] = df['Age'] * df['Pclass']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}