sitc/ml1/2_2_Read_Data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](./images/EscUpmPolit_p.gif \"UPM\")\n",
    "\n",
    "# Course Notes for Learning Intelligent Systems\n",
    "\n",
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias\n",
    "\n",
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Reading Data](#Reading-Data)\n",
    "* [Iris flower dataset](#Iris-flower-dataset)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook aims to learn how to read and load a sample dataset.\n",
    "\n",
    "Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, boston, etc.\n",
    "\n",
    "In this notebook, we will use the Iris dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iris flower dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
    "\n",
    "The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
    "\n",
     "![Iris dataset](./images/iris-dataset.jpg \"Iris\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here you can see the species and the features.\n",
    "![Iris features](./images/iris-features.png \"Iris features\")\n",
    "![Iris classes](./images/iris-classes.png \"Iris classes\")"
   ]
  },
   {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read the dataset, we import the datasets bundle and then load the Iris dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import datasets from scikit-learn\n",
    "from sklearn import datasets\n",
    "\n",
    "# load iris dataset\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the `.data` member, which is a 2D (`n_samples`, `n_features`) array. In the case of supervised problem, one or more response variables are stored in the `.target` member."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#type 'bunch' of a dataset\n",
    "type(iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print descrition of the dataset\n",
    "print(iris.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of the features (attributes of the entities)\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#names of the targets(classes of the classifier)\n",
    "print(iris.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#type numpy array\n",
    "type(iris.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to inspect the dataset. You can consult the NumPy tutorial listed in the references."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Data in the iris dataset. The value of the features of the samples.\n",
    "print(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target.  Category of every sample\n",
    "print(iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Iris data is a numpy array\n",
    "# We can inspect its shape (rows, columns). In our case, (n_samples, n_features)\n",
    "print(iris.data.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Using numpy, I can print the dimensions (here we are working with a 2D matrix)\n",
    "print(iris.data.ndim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# I can print n_samples\n",
    "print(iris.data.shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ... n_features\n",
    "print(iris.data.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of the features\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following sessions, we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)\n",
    "* [How to load an example dataset with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset)\n",
    "* [Dataset loading utilities in scikit-learn](http://scikit-learn.org/stable/datasets/)\n",
    "* [How to plot the Iris dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)\n",
    "* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",
    "* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}