sitc/ml2/3_1_Read_Data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [The Titanic dataset](#The-Titanic-dataset)\n",
    "* [Reading Data](#Reading-Data)\n",
    "* [Reading Data from a File](#Reading-Data-from-a-File)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# The Titanic dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",
    "\n",
    "![Titanic](images/titanic.jpg)\n",
    "\n",
    "\n",
    "The main objective is predicting which passengers survived the sinking of the Titanic.\n",
    "\n",
    "The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder *data-titanic*.\n",
    "\n",
    "\n",
    "Here follows a description of the variables.\n",
    "\n",
    "|Variable | Description| Values|\n",
    "|-------------------------------|\n",
    "| survival| Survival| (0 = No; 1 = Yes)|\n",
    "|Pclass |Name | |\n",
    "|Sex  |Sex | male, female|\n",
    "|Age |Age|\n",
    "|SibSp |Number of Siblings/Spouses Aboard||\n",
    "|Parch |Number of Parents/Children Aboard||\n",
    "|Ticket|Ticket Number||\n",
    "|Fare            |Passenger Fare||\n",
    "|Cabin           |Cabin||\n",
    "|Embarked        |Port of Embarkation| (C = Cherbourg; Q = Queenstown; S = Southampton)|\n",
    "\n",
    "\n",
    "The definitions used for SibSp and Parch are:\n",
    "* *Sibling*: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n",
    "* *Spouse*: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n",
    "* *Parent*: Mother or Father of Passenger Aboard Titanic\n",
    "* *Child*: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous dataset we load a bundle dataset in scikit-learn. In this notebook we are going to learn how to read from a file or a url using the Pandas library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading Data from a File"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "df = pd.read_csv('data-titanic/train.csv')\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can get the number of samples and features\n",
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#I can read only a number of rows and tell where the header is, among other options.\n",
    "df = df = pd.read_csv('data-titanic/train.csv', header=0, nrows=5)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas provides methods for reading other formats, such as Excel (*read_excel()*), JSON (*read_json()*), or HTML (*read_html()*), look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output) for more details."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading data from a URL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "#We get a URL with raw content (not HTML one)\n",
    "url = \"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An alternative option is reading the file with the library *requests* and then use *pandas*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# First we open the file\n",
    "import pandas as pd\n",
    "import io\n",
    "import requests\n",
    "url = \"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "s = requests.get(url, stream=True).content\n",
    "#Print the first 320 characters for understanding how it works\n",
    "s[:320]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(io.StringIO(s.decode('utf-8')))\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",
    "* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",
    "* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",
    "* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",
    "* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![](images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Table of Contents\n",`
			`"\n",`
			`"* [The Titanic dataset](#The-Titanic-dataset)\n",`
			`"* [Reading Data](#Reading-Data)\n",`
			`"* [Reading Data from a File](#Reading-Data-from-a-File)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# The Titanic dataset"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In this session we will work with the Titanic dataset. This dataset is provided by [Kaggle](http://www.kaggle.com). Kaggle is a crowdsourcing platform that organizes competitions where researchers and companies post their data and users compete to obtain the best models.\n",`
			`"\n",`
			`"![Titanic](images/titanic.jpg)\n",`
			`"\n",`
			`"\n",`
			`"The main objective is predicting which passengers survived the sinking of the Titanic.\n",`
			`"\n",`
			`"The data is available [here](https://www.kaggle.com/c/titanic/data). There are two files, one for training ([train.csv](files/data-titanic/train.csv)) and another file for testing [test.csv](files/data-titanic/test.csv). A local copy has been included in this notebook under the folder data-titanic.\n",`
			`"\n",`
			`"\n",`
			`"Here follows a description of the variables.\n",`
			`"\n",`
			`"\|Variable \| Description\| Values\|\n",`
			`"\|-------------------------------\|\n",`
			`"\| survival\| Survival\| (0 = No; 1 = Yes)\|\n",`
			`"\|Pclass \|Name \| \|\n",`
			`"\|Sex \|Sex \| male, female\|\n",`
			`"\|Age \|Age\|\n",`
			`"\|SibSp \|Number of Siblings/Spouses Aboard\|\|\n",`
			`"\|Parch \|Number of Parents/Children Aboard\|\|\n",`
			`"\|Ticket\|Ticket Number\|\|\n",`
			`"\|Fare \|Passenger Fare\|\|\n",`
			`"\|Cabin \|Cabin\|\|\n",`
			`"\|Embarked \|Port of Embarkation\| (C = Cherbourg; Q = Queenstown; S = Southampton)\|\n",`
			`"\n",`
			`"\n",`
			`"The definitions used for SibSp and Parch are:\n",`
			`"* Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic\n",`
			`"* Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)\n",`
			`"* Parent: Mother or Father of Passenger Aboard Titanic\n",`
			`"* Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Reading Data"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In the previous dataset we load a bundle dataset in scikit-learn. In this notebook we are going to learn how to read from a file or a url using the Pandas library."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Reading Data from a File"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"import numpy as np\n",`
			`"import pandas as pd\n",`
			`"from pandas import Series, DataFrame\n",`
			`"\n",`
			`"df = pd.read_csv('data-titanic/train.csv')\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"# we can get the number of samples and features\n",`
			`"df.shape"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"#I can read only a number of rows and tell where the header is, among other options.\n",`
			`"df = df = pd.read_csv('data-titanic/train.csv', header=0, nrows=5)\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Pandas provides methods for reading other formats, such as Excel (read_excel()), JSON (read_json()), or HTML (read_html()), look at the [documentation](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output) for more details."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Reading data from a URL"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"import pandas as pd\n",`
			`"#We get a URL with raw content (not HTML one)\n",`
Changed cif2cif 2 gsi-upm 2016-03-29 09:21:43 +00:00			`"url = \"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"df = pd.read_csv(url)\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"An alternative option is reading the file with the library requests and then use pandas."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"# First we open the file\n",`
			`"import pandas as pd\n",`
			`"import io\n",`
			`"import requests\n",`
Changed cif2cif 2 gsi-upm 2016-03-29 09:21:43 +00:00			`"url = \"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"s = requests.get(url, stream=True).content\n",`
			`"#Print the first 320 characters for understanding how it works\n",`
			`"s[:320]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Remove outputs and metadata 2019-02-28 14:30:33 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"source": [`
			`"df = pd.read_csv(io.StringIO(s.decode('utf-8')))\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## References"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"* [Pandas API input-output](http://pandas.pydata.org/pandas-docs/stable/api.html#input-output)\n",`
			`"* [Pandas API - pandas.read_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)\n",`
			`"* [DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)\n",`
			`"* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",`
			`"* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Licence"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
			`"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
added import nltk 2017-04-20 14:07:10 +00:00			`"version": "3.5.2"`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 0`
			`}`