sitc/ml1/2_2_Read_Data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")\n",
    "\n",
    "# Course Notes for Learning Intelligent Systems\n",
    "\n",
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias\n",
    "\n",
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Reading Data](#Reading-Data)\n",
    "* [Iris flower dataset](#Iris-flower-dataset)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to read and load a sample dataset.\n",
    "\n",
    "Scikit-learn comes with some bundled [datasets](http://scikit-learn.org/stable/datasets/): iris, digits, boston, etc.\n",
    "\n",
    "In this notebook we are going to use the Iris dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iris flower dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
    "\n",
    "The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features.\n",
    "\n",
    "![Iris](files/images/iris-dataset.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In ordert to read the dataset, we import the datasets bundle and then load the Iris dataset. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# import datasets from scikit-learn\n",
    "from sklearn import datasets\n",
    "\n",
    "# load iris dataset\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A dataset is a dictionary-like object that holds all the data and some metadata about the data. This data is stored in the `.data` member, which is a 2D (`n_samples`, `n_features`) array. In the case of supervised problem, one or more response variables are stored in the `.target` member."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sklearn.datasets.base.Bunch"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#type 'bunch' of a dataset\n",
    "type(iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iris Plants Database\n",
      "\n",
      "Notes\n",
      "-----\n",
      "Data Set Characteristics:\n",
      "    :Number of Instances: 150 (50 in each of three classes)\n",
      "    :Number of Attributes: 4 numeric, predictive attributes and the class\n",
      "    :Attribute Information:\n",
      "        - sepal length in cm\n",
      "        - sepal width in cm\n",
      "        - petal length in cm\n",
      "        - petal width in cm\n",
      "        - class:\n",
      "                - Iris-Setosa\n",
      "                - Iris-Versicolour\n",
      "                - Iris-Virginica\n",
      "    :Summary Statistics:\n",
      "\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "                    Min  Max   Mean    SD   Class Correlation\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "    sepal length:   4.3  7.9   5.84   0.83    0.7826\n",
      "    sepal width:    2.0  4.4   3.05   0.43   -0.4194\n",
      "    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)\n",
      "    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)\n",
      "    ============== ==== ==== ======= ===== ====================\n",
      "\n",
      "    :Missing Attribute Values: None\n",
      "    :Class Distribution: 33.3% for each of 3 classes.\n",
      "    :Creator: R.A. Fisher\n",
      "    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n",
      "    :Date: July, 1988\n",
      "\n",
      "This is a copy of UCI ML iris datasets.\n",
      "http://archive.ics.uci.edu/ml/datasets/Iris\n",
      "\n",
      "The famous Iris database, first used by Sir R.A Fisher\n",
      "\n",
      "This is perhaps the best known database to be found in the\n",
      "pattern recognition literature.  Fisher's paper is a classic in the field and\n",
      "is referenced frequently to this day.  (See Duda & Hart, for example.)  The\n",
      "data set contains 3 classes of 50 instances each, where each class refers to a\n",
      "type of iris plant.  One class is linearly separable from the other 2; the\n",
      "latter are NOT linearly separable from each other.\n",
      "\n",
      "References\n",
      "----------\n",
      "   - Fisher,R.A. \"The use of multiple measurements in taxonomic problems\"\n",
      "     Annual Eugenics, 7, Part II, 179-188 (1936); also in \"Contributions to\n",
      "     Mathematical Statistics\" (John Wiley, NY, 1950).\n",
      "   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.\n",
      "     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.\n",
      "   - Dasarathy, B.V. (1980) \"Nosing Around the Neighborhood: A New System\n",
      "     Structure and Classification Rule for Recognition in Partially Exposed\n",
      "     Environments\".  IEEE Transactions on Pattern Analysis and Machine\n",
      "     Intelligence, Vol. PAMI-2, No. 1, 67-71.\n",
      "   - Gates, G.W. (1972) \"The Reduced Nearest Neighbor Rule\".  IEEE Transactions\n",
      "     on Information Theory, May 1972, 431-433.\n",
      "   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al\"s AUTOCLASS II\n",
      "     conceptual clustering system finds 3 classes in the data.\n",
      "   - Many, many more ...\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# print descrition of the dataset\n",
    "print(iris.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n"
     ]
    }
   ],
   "source": [
    "# names of the features (attributes of the entities)\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['setosa' 'versicolor' 'virginica']\n"
     ]
    }
   ],
   "source": [
    "#names of the targets(classes of the classifier)\n",
    "print(iris.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numpy.ndarray"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#type numpy array\n",
    "type(iris.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to inspect the dataset. You can consult the NumPy tutorial listed in the references."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 5.1  3.5  1.4  0.2]\n",
      " [ 4.9  3.   1.4  0.2]\n",
      " [ 4.7  3.2  1.3  0.2]\n",
      " [ 4.6  3.1  1.5  0.2]\n",
      " [ 5.   3.6  1.4  0.2]\n",
      " [ 5.4  3.9  1.7  0.4]\n",
      " [ 4.6  3.4  1.4  0.3]\n",
      " [ 5.   3.4  1.5  0.2]\n",
      " [ 4.4  2.9  1.4  0.2]\n",
      " [ 4.9  3.1  1.5  0.1]\n",
      " [ 5.4  3.7  1.5  0.2]\n",
      " [ 4.8  3.4  1.6  0.2]\n",
      " [ 4.8  3.   1.4  0.1]\n",
      " [ 4.3  3.   1.1  0.1]\n",
      " [ 5.8  4.   1.2  0.2]\n",
      " [ 5.7  4.4  1.5  0.4]\n",
      " [ 5.4  3.9  1.3  0.4]\n",
      " [ 5.1  3.5  1.4  0.3]\n",
      " [ 5.7  3.8  1.7  0.3]\n",
      " [ 5.1  3.8  1.5  0.3]\n",
      " [ 5.4  3.4  1.7  0.2]\n",
      " [ 5.1  3.7  1.5  0.4]\n",
      " [ 4.6  3.6  1.   0.2]\n",
      " [ 5.1  3.3  1.7  0.5]\n",
      " [ 4.8  3.4  1.9  0.2]\n",
      " [ 5.   3.   1.6  0.2]\n",
      " [ 5.   3.4  1.6  0.4]\n",
      " [ 5.2  3.5  1.5  0.2]\n",
      " [ 5.2  3.4  1.4  0.2]\n",
      " [ 4.7  3.2  1.6  0.2]\n",
      " [ 4.8  3.1  1.6  0.2]\n",
      " [ 5.4  3.4  1.5  0.4]\n",
      " [ 5.2  4.1  1.5  0.1]\n",
      " [ 5.5  4.2  1.4  0.2]\n",
      " [ 4.9  3.1  1.5  0.1]\n",
      " [ 5.   3.2  1.2  0.2]\n",
      " [ 5.5  3.5  1.3  0.2]\n",
      " [ 4.9  3.1  1.5  0.1]\n",
      " [ 4.4  3.   1.3  0.2]\n",
      " [ 5.1  3.4  1.5  0.2]\n",
      " [ 5.   3.5  1.3  0.3]\n",
      " [ 4.5  2.3  1.3  0.3]\n",
      " [ 4.4  3.2  1.3  0.2]\n",
      " [ 5.   3.5  1.6  0.6]\n",
      " [ 5.1  3.8  1.9  0.4]\n",
      " [ 4.8  3.   1.4  0.3]\n",
      " [ 5.1  3.8  1.6  0.2]\n",
      " [ 4.6  3.2  1.4  0.2]\n",
      " [ 5.3  3.7  1.5  0.2]\n",
      " [ 5.   3.3  1.4  0.2]\n",
      " [ 7.   3.2  4.7  1.4]\n",
      " [ 6.4  3.2  4.5  1.5]\n",
      " [ 6.9  3.1  4.9  1.5]\n",
      " [ 5.5  2.3  4.   1.3]\n",
      " [ 6.5  2.8  4.6  1.5]\n",
      " [ 5.7  2.8  4.5  1.3]\n",
      " [ 6.3  3.3  4.7  1.6]\n",
      " [ 4.9  2.4  3.3  1. ]\n",
      " [ 6.6  2.9  4.6  1.3]\n",
      " [ 5.2  2.7  3.9  1.4]\n",
      " [ 5.   2.   3.5  1. ]\n",
      " [ 5.9  3.   4.2  1.5]\n",
      " [ 6.   2.2  4.   1. ]\n",
      " [ 6.1  2.9  4.7  1.4]\n",
      " [ 5.6  2.9  3.6  1.3]\n",
      " [ 6.7  3.1  4.4  1.4]\n",
      " [ 5.6  3.   4.5  1.5]\n",
      " [ 5.8  2.7  4.1  1. ]\n",
      " [ 6.2  2.2  4.5  1.5]\n",
      " [ 5.6  2.5  3.9  1.1]\n",
      " [ 5.9  3.2  4.8  1.8]\n",
      " [ 6.1  2.8  4.   1.3]\n",
      " [ 6.3  2.5  4.9  1.5]\n",
      " [ 6.1  2.8  4.7  1.2]\n",
      " [ 6.4  2.9  4.3  1.3]\n",
      " [ 6.6  3.   4.4  1.4]\n",
      " [ 6.8  2.8  4.8  1.4]\n",
      " [ 6.7  3.   5.   1.7]\n",
      " [ 6.   2.9  4.5  1.5]\n",
      " [ 5.7  2.6  3.5  1. ]\n",
      " [ 5.5  2.4  3.8  1.1]\n",
      " [ 5.5  2.4  3.7  1. ]\n",
      " [ 5.8  2.7  3.9  1.2]\n",
      " [ 6.   2.7  5.1  1.6]\n",
      " [ 5.4  3.   4.5  1.5]\n",
      " [ 6.   3.4  4.5  1.6]\n",
      " [ 6.7  3.1  4.7  1.5]\n",
      " [ 6.3  2.3  4.4  1.3]\n",
      " [ 5.6  3.   4.1  1.3]\n",
      " [ 5.5  2.5  4.   1.3]\n",
      " [ 5.5  2.6  4.4  1.2]\n",
      " [ 6.1  3.   4.6  1.4]\n",
      " [ 5.8  2.6  4.   1.2]\n",
      " [ 5.   2.3  3.3  1. ]\n",
      " [ 5.6  2.7  4.2  1.3]\n",
      " [ 5.7  3.   4.2  1.2]\n",
      " [ 5.7  2.9  4.2  1.3]\n",
      " [ 6.2  2.9  4.3  1.3]\n",
      " [ 5.1  2.5  3.   1.1]\n",
      " [ 5.7  2.8  4.1  1.3]\n",
      " [ 6.3  3.3  6.   2.5]\n",
      " [ 5.8  2.7  5.1  1.9]\n",
      " [ 7.1  3.   5.9  2.1]\n",
      " [ 6.3  2.9  5.6  1.8]\n",
      " [ 6.5  3.   5.8  2.2]\n",
      " [ 7.6  3.   6.6  2.1]\n",
      " [ 4.9  2.5  4.5  1.7]\n",
      " [ 7.3  2.9  6.3  1.8]\n",
      " [ 6.7  2.5  5.8  1.8]\n",
      " [ 7.2  3.6  6.1  2.5]\n",
      " [ 6.5  3.2  5.1  2. ]\n",
      " [ 6.4  2.7  5.3  1.9]\n",
      " [ 6.8  3.   5.5  2.1]\n",
      " [ 5.7  2.5  5.   2. ]\n",
      " [ 5.8  2.8  5.1  2.4]\n",
      " [ 6.4  3.2  5.3  2.3]\n",
      " [ 6.5  3.   5.5  1.8]\n",
      " [ 7.7  3.8  6.7  2.2]\n",
      " [ 7.7  2.6  6.9  2.3]\n",
      " [ 6.   2.2  5.   1.5]\n",
      " [ 6.9  3.2  5.7  2.3]\n",
      " [ 5.6  2.8  4.9  2. ]\n",
      " [ 7.7  2.8  6.7  2. ]\n",
      " [ 6.3  2.7  4.9  1.8]\n",
      " [ 6.7  3.3  5.7  2.1]\n",
      " [ 7.2  3.2  6.   1.8]\n",
      " [ 6.2  2.8  4.8  1.8]\n",
      " [ 6.1  3.   4.9  1.8]\n",
      " [ 6.4  2.8  5.6  2.1]\n",
      " [ 7.2  3.   5.8  1.6]\n",
      " [ 7.4  2.8  6.1  1.9]\n",
      " [ 7.9  3.8  6.4  2. ]\n",
      " [ 6.4  2.8  5.6  2.2]\n",
      " [ 6.3  2.8  5.1  1.5]\n",
      " [ 6.1  2.6  5.6  1.4]\n",
      " [ 7.7  3.   6.1  2.3]\n",
      " [ 6.3  3.4  5.6  2.4]\n",
      " [ 6.4  3.1  5.5  1.8]\n",
      " [ 6.   3.   4.8  1.8]\n",
      " [ 6.9  3.1  5.4  2.1]\n",
      " [ 6.7  3.1  5.6  2.4]\n",
      " [ 6.9  3.1  5.1  2.3]\n",
      " [ 5.8  2.7  5.1  1.9]\n",
      " [ 6.8  3.2  5.9  2.3]\n",
      " [ 6.7  3.3  5.7  2.5]\n",
      " [ 6.7  3.   5.2  2.3]\n",
      " [ 6.3  2.5  5.   1.9]\n",
      " [ 6.5  3.   5.2  2. ]\n",
      " [ 6.2  3.4  5.4  2.3]\n",
      " [ 5.9  3.   5.1  1.8]]\n"
     ]
    }
   ],
   "source": [
    "#Data in the iris dataset. The value of the features of the samples.\n",
    "print(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n",
      " 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1\n",
      " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2\n",
      " 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2\n",
      " 2 2]\n"
     ]
    }
   ],
   "source": [
    "# Target.  Category of every sample\n",
    "print(iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(150, 4)\n"
     ]
    }
   ],
   "source": [
    "# Iris data is a numpy array\n",
    "# We can inspect its shape (rows, columns). In our case, (n_samples, n_features)\n",
    "print(iris.data.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\n"
     ]
    }
   ],
   "source": [
    "#Using numpy, I can print the dimensions (here we are working with 2D matriz)\n",
    "print(iris.data.ndim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "150\n"
     ]
    }
   ],
   "source": [
    "# I can print n_samples\n",
    "print(iris.data.shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "4\n"
     ]
    }
   ],
   "source": [
    "# ... n_features\n",
    "print(iris.data.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n"
     ]
    }
   ],
   "source": [
    "# names of the features\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In following sessions we will learn how to load a dataset from a file (csv, excel, ...) using the pandas library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)\n",
    "* [How to load an example dataset with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset)\n",
    "* [Dataset loading utilities in scikit-learn](http://scikit-learn.org/stable/datasets/)\n",
    "* [How to plot the Iris dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)\n",
    "* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",
    "* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1+"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}