sitc/ml1/2_4_Preprocessing.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Preprocessing](#Preprocessing)\n",
    "* [Training set and Test set](#Training-set-and-Test-set)\n",
    "* [Preprocessing](#Preprocessing)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to split the dataset into a training and a test datasets and then preprocess the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training set and Test set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
    "\n",
    "We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "x_iris, y_iris = iris.data, iris.target\n",
    "# Test set will be the 25% taken randomly\n",
    "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(112, 4) (38, 4)\n"
     ]
    }
   ],
   "source": [
    "# Dimensions of train and testing\n",
    "print(x_train.shape, x_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[ 5.7  2.9  4.2  1.3]\n",
      " [ 6.7  3.1  4.4  1.4]\n",
      " [ 4.7  3.2  1.6  0.2]\n",
      " [ 6.5  2.8  4.6  1.5]\n",
      " [ 6.1  2.6  5.6  1.4]\n",
      " [ 6.3  3.3  6.   2.5]\n",
      " [ 4.8  3.4  1.9  0.2]\n",
      " [ 5.1  3.5  1.4  0.3]\n",
      " [ 6.4  3.1  5.5  1.8]\n",
      " [ 6.9  3.2  5.7  2.3]\n",
      " [ 6.8  3.2  5.9  2.3]\n",
      " [ 4.4  3.   1.3  0.2]\n",
      " [ 6.3  3.4  5.6  2.4]\n",
      " [ 6.1  2.9  4.7  1.4]\n",
      " [ 6.9  3.1  5.1  2.3]\n",
      " [ 6.4  2.9  4.3  1.3]\n",
      " [ 6.   3.   4.8  1.8]\n",
      " [ 5.2  3.5  1.5  0.2]\n",
      " [ 6.3  3.3  4.7  1.6]\n",
      " [ 7.2  3.2  6.   1.8]\n",
      " [ 4.9  3.1  1.5  0.1]\n",
      " [ 5.7  3.8  1.7  0.3]\n",
      " [ 6.5  3.   5.8  2.2]\n",
      " [ 4.8  3.   1.4  0.1]\n",
      " [ 6.   2.2  5.   1.5]\n",
      " [ 6.2  2.8  4.8  1.8]\n",
      " [ 6.1  3.   4.6  1.4]\n",
      " [ 6.1  2.8  4.   1.3]\n",
      " [ 6.5  3.   5.2  2. ]\n",
      " [ 5.9  3.   5.1  1.8]\n",
      " [ 5.6  2.7  4.2  1.3]\n",
      " [ 6.7  3.1  4.7  1.5]\n",
      " [ 5.6  2.8  4.9  2. ]\n",
      " [ 6.4  3.2  5.3  2.3]\n",
      " [ 6.7  3.1  5.6  2.4]\n",
      " [ 6.7  3.   5.2  2.3]\n",
      " [ 5.8  2.7  5.1  1.9]\n",
      " [ 5.7  3.   4.2  1.2]]\n"
     ]
    }
   ],
   "source": [
    "#Test set\n",
    "print (x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
    "\n",
    "The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize the features\n",
    "from sklearn import preprocessing\n",
    "scaler = preprocessing.StandardScaler().fit(x_train)\n",
    "x_train = scaler.transform(x_train)\n",
    "x_test = scaler.transform(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[-0.09752318 -0.32858743  0.34599443  0.25682671]\n",
      " [ 1.06445511  0.09442168  0.45718919  0.39124069]\n",
      " [-1.25950146  0.30592623 -1.09953753 -1.22172707]\n",
      " [ 0.83205945 -0.54009199  0.56838396  0.52565467]\n",
      " [ 0.36726814 -0.9631011   1.12435779  0.39124069]\n",
      " [ 0.59966379  0.51743079  1.34674732  1.86979447]\n",
      " [-1.14330363  0.72893534 -0.93274538 -1.22172707]\n",
      " [-0.79471015  0.9404399  -1.2107323  -1.08731309]\n",
      " [ 0.71586162  0.09442168  1.06876041  0.92889661]\n",
      " [ 1.29685076  0.30592623  1.17995517  1.60096651]\n",
      " [ 1.18065293  0.30592623  1.29114994  1.60096651]\n",
      " [-1.60809495 -0.11708288 -1.26632968 -1.22172707]\n",
      " [ 0.59966379  0.72893534  1.12435779  1.73538049]\n",
      " [ 0.36726814 -0.32858743  0.62398134  0.39124069]\n",
      " [ 1.29685076  0.09442168  0.84637087  1.60096651]\n",
      " [ 0.71586162 -0.32858743  0.40159181  0.25682671]\n",
      " [ 0.25107031 -0.11708288  0.67957873  0.92889661]\n",
      " [-0.67851232  0.9404399  -1.15513491 -1.22172707]\n",
      " [ 0.59966379  0.51743079  0.62398134  0.66006865]\n",
      " [ 1.64544425  0.30592623  1.34674732  0.92889661]\n",
      " [-1.0271058   0.09442168 -1.15513491 -1.35614105]\n",
      " [-0.09752318  1.57495356 -1.04394015 -1.08731309]\n",
      " [ 0.83205945 -0.11708288  1.23555256  1.46655253]\n",
      " [-1.14330363 -0.11708288 -1.2107323  -1.35614105]\n",
      " [ 0.25107031 -1.80911932  0.79077349  0.52565467]\n",
      " [ 0.48346596 -0.54009199  0.67957873  0.92889661]\n",
      " [ 0.36726814 -0.11708288  0.56838396  0.39124069]\n",
      " [ 0.36726814 -0.54009199  0.23479966  0.25682671]\n",
      " [ 0.83205945 -0.11708288  0.90196826  1.19772457]\n",
      " [ 0.13487248 -0.11708288  0.84637087  0.92889661]\n",
      " [-0.21372101 -0.75159654  0.34599443  0.25682671]\n",
      " [ 1.06445511  0.09442168  0.62398134  0.52565467]\n",
      " [-0.21372101 -0.54009199  0.73517611  1.19772457]\n",
      " [ 0.71586162  0.30592623  0.95756564  1.60096651]\n",
      " [ 1.06445511  0.09442168  1.12435779  1.73538049]\n",
      " [ 1.06445511 -0.11708288  0.90196826  1.60096651]\n",
      " [ 0.01867465 -0.75159654  0.84637087  1.06331059]\n",
      " [-0.09752318 -0.11708288  0.34599443  0.12241273]]\n"
     ]
    }
   ],
   "source": [
    "# As we see, the iris dataset is now  normalized\n",
    "print(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
    "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
    "* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
    "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
    "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
    "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Licences\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}