![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias

## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)

# Table of Contents

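* [The Titanic dataset](#The-Titanic-dataset)
* [Reading Data](#Reading-Data)
* [Reading Data from a File](#Reading-Data-from-a-File)
* [Reading data from a URL](#Reading-data-from-a-URL)
* [References](#References)
* [Licence](#Licence)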

# The Titanic dataset

In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.

![Titanic](images/titanic.jpg)


The main objective is predicting which passengers survived the sinking of the Titanic.

The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another for testing ([test.csv](files/data-titanic/test.csv)). A local copy is included with this notebook in the folder *data-titanic*.


Here follows a description of the variables.

| Variable | Description | Values |
|----------|-------------|--------|
| Survived | Survival | 0 = No; 1 = Yes |
| Pclass | Passenger class | 1 = 1st; 2 = 2nd; 3 = 3rd |
| Name | Name | |
| Sex | Sex | male, female |
| Age | Age | |
| SibSp | Number of Siblings/Spouses Aboard | |
| Parch | Number of Parents/Children Aboard | |
| Ticket | Ticket Number | |
| Fare | Passenger Fare | |
| Cabin | Cabin | |
| Embarked | Port of Embarkation | C = Cherbourg; Q = Queenstown; S = Southampton |
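
The coded values in the table can be turned into readable labels with `Series.map()`. The sketch below uses a few hand-written rows rather than the real file, and the column names `SurvivedLabel` and `EmbarkedPort` are made up for the illustration.

```python
import pandas as pd

# Hypothetical rows written by hand for illustration (not taken from train.csv)
sample = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Code-to-label mappings copied from the variable description table
survival_labels = {0: "No", 1: "Yes"}
port_names = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# map() replaces each code with its label; codes missing from the dict become NaN
sample["SurvivedLabel"] = sample["Survived"].map(survival_labels)
sample["EmbarkedPort"] = sample["Embarked"].map(port_names)
print(sample)
```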


The definitions used for SibSp and Parch are:
* *Sibling*: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
* *Spouse*: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
* *Parent*: Mother or Father of Passenger Aboard Titanic
* *Child*: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
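
Given these definitions, adding `SibSp` and `Parch` gives the number of relatives a passenger had aboard. The following is only a sketch on hand-written values. The derived columns `RelativesAboard` and `TravelsAlone` are hypothetical names and are not part of the dataset.

```python
import pandas as pd

# Hypothetical values written by hand for illustration
relatives = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Siblings/spouses plus parents/children aboard
relatives["RelativesAboard"] = relatives["SibSp"] + relatives["Parch"]

# A passenger with no relatives aboard, according to the definitions above
relatives["TravelsAlone"] = relatives["RelativesAboard"] == 0
print(relatives)
```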

# Reading Data

In the previous notebook we loaded a dataset bundled with scikit-learn. In this notebook we are going to learn how to read data from a file or a URL using the pandas library.

## Reading Data from a File

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

df = pd.read_csv('data-titanic/train.csv')
df

In [None]:
# we can get the number of samples and features
df.shape
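
As a small illustration of what `shape` returns, the sketch below reads a hand-written CSV string (an assumption, standing in for the real file) and unpacks the tuple into rows and columns.

```python
import io
import pandas as pd

# Hypothetical CSV content written by hand; the last row has a missing Age
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,
"""

sample = pd.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple
n_samples, n_columns = sample.shape
print(n_samples, n_columns)

# The same numbers are available separately
print(len(sample), len(sample.columns))
```

Note that `shape` counts every column, so the second number includes the identifier and the target column as well as the features.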

In [None]:
# We can read only a number of rows and tell pandas where the header is, among other options.
df = pd.read_csv('data-titanic/train.csv', header=0, nrows=5)
df
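
A few more `read_csv` options can be combined in the same call. The sketch below assumes a hand-written CSV string instead of the real file, and shows `usecols` (keep only some columns) and `index_col` (use a column as the row index) next to `header` and `nrows`.

```python
import io
import pandas as pd

# Hypothetical CSV content written by hand for illustration
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,0,2,male,35
"""

subset = pd.read_csv(
    io.StringIO(csv_text),
    header=0,                                    # column names are on the first line
    nrows=2,                                     # read only the first two data rows
    usecols=["PassengerId", "Survived", "Sex"],  # keep only these columns
    index_col="PassengerId",                     # use PassengerId as the row index
)
print(subset)
```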

Pandas provides functions for reading other formats, such as Excel (*read_excel()*), JSON (*read_json()*), or HTML (*read_html()*). See the [documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output) for more details.
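
As an example of one of these readers, the sketch below parses a small JSON document with `read_json()`. The JSON string is hand-written for illustration and is not part of the Titanic files.

```python
import io
import pandas as pd

# Hypothetical JSON content: a list of records, one object per row
json_text = (
    '[{"PassengerId": 1, "Sex": "male", "Fare": 7.25},'
    ' {"PassengerId": 2, "Sex": "female", "Fare": 71.5}]'
)

# Wrapping the string in StringIO gives read_json a file-like object
records = pd.read_json(io.StringIO(json_text))
print(records)
```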

## Reading data from a URL

In [None]:
import pandas as pd
# Use the URL of the raw file content (not the HTML page)
url = "https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv"
df = pd.read_csv(url)
df

An alternative is to download the file with the *requests* library and then parse it with *pandas*.

In [None]:
# First we download the file
import pandas as pd
import io
import requests
url = "https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv"
s = requests.get(url, stream=True).content
# Show the first 320 bytes to see what the raw content looks like
s[:320]

In [None]:
df = pd.read_csv(io.StringIO(s.decode('utf-8')))
df
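
The downloaded content is a `bytes` object, which is why it is decoded before being wrapped in `StringIO`. The sketch below uses a hand-written bytes literal as a stand-in for the downloaded content (an assumption, so that it runs without network access) and shows that wrapping the raw bytes in `io.BytesIO` gives the same result.

```python
import io
import pandas as pd

# Hypothetical stand-in for the bytes returned by requests.get(url).content
content = b"PassengerId,Survived,Sex\n1,0,male\n2,1,female\n"

# Option 1: decode the bytes to text and wrap the text in StringIO
from_text = pd.read_csv(io.StringIO(content.decode("utf-8")))

# Option 2: wrap the raw bytes in BytesIO and let pandas decode them
from_bytes = pd.read_csv(io.BytesIO(content), encoding="utf-8")

print(from_text.equals(from_bytes))
```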

## References

* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)
* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)
* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)
* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid.