sitc/ml1/2_4_Preprocessing.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Preprocessing](#Preprocessing)\n",
    "* [Training set and Test set](#Training-set-and-Test-set)\n",
    "* [Preprocessing](#Preprocessing)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to split a dataset into training and test sets and then preprocess the data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training set and Test set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common practice when evaluating a machine learning algorithm is to split the available data into two sets: the **training set**, on which we learn data properties, and the **testing set**, on which we test these properties.\n",
    "\n",
    "We are going to use *scikit-learn* to split the data into random training and testing sets, with 75% of the samples for training and 25% for testing. We set `random_state` so that the split is reproducible; otherwise, we would get different training and testing sets on every run."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "x_iris, y_iris = iris.data, iris.target\n",
    "# Hold out a random 25% of the samples as the test set\n",
    "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Dimensions of the training and test sets\n",
    "print(x_train.shape, x_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test set\n",
    "print(x_test)"
   ]
  },
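  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check of the role of `random_state`, the following minimal sketch repeats the split twice with the same seed and once with a different one. It reloads the data so that it runs on its own; the names `iris_check`, `split_a`, `split_b` and `split_c`, and the alternative seed `7`, are illustrative assumptions and not part of the steps above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Load a fresh copy so the notebook's own variables are left untouched\n",
    "iris_check = datasets.load_iris()\n",
    "x_check, y_check = iris_check.data, iris_check.target\n",
    "\n",
    "# Same seed twice, then a different (arbitrary) seed\n",
    "split_a = train_test_split(x_check, y_check, test_size=0.25, random_state=33)\n",
    "split_b = train_test_split(x_check, y_check, test_size=0.25, random_state=33)\n",
    "split_c = train_test_split(x_check, y_check, test_size=0.25, random_state=7)\n",
    "\n",
    "# Element 0 is the training feature matrix, element 1 the test feature matrix\n",
    "print(split_a[0].shape, split_a[1].shape)\n",
    "# Compare the test sets obtained with the same seed\n",
    "print(np.array_equal(split_a[1], split_b[1]))\n",
    "# Compare the test sets obtained with different seeds\n",
    "print(np.array_equal(split_a[1], split_c[1]))"
   ]
  },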
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look more or less like standard normally distributed data, that is, Gaussian with zero mean and unit variance.\n",
    "\n",
    "The `preprocessing` module provides a utility class, `StandardScaler`, that computes the mean and standard deviation on the training set so that the same transformation can later be applied to the testing set."
   ]
  },
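  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Written out, and assuming the default `StandardScaler` settings, each feature value $x$ is mapped to\n",
    "\n",
    "$$z = \\frac{x - \\mu}{\\sigma}$$\n",
    "\n",
    "where $\\mu$ and $\\sigma$ are the mean and standard deviation of that feature computed on the training set only. The symbols $x$, $z$, $\\mu$ and $\\sigma$ are notation introduced here for illustration."
   ]
  },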
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Standardize the features\n",
    "from sklearn import preprocessing\n",
    "scaler = preprocessing.StandardScaler().fit(x_train)\n",
    "x_train = scaler.transform(x_train)\n",
    "x_test = scaler.transform(x_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# As we can see, the test features are now standardized\n",
    "print(x_test)"
   ]
  },
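  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what the scaler actually did, the following minimal sketch rebuilds the same split from the raw data and inspects the result. It assumes the default `StandardScaler` settings; the names `iris_raw`, `x_tr`, `x_te`, `check_scaler` and `manual` are illustrative and chosen so that the variables defined earlier are not overwritten. `mean_` and `scale_` are the per-feature statistics stored by the fitted scaler."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import datasets, preprocessing\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# Rebuild the split from the raw data so this cell runs on its own\n",
    "iris_raw = datasets.load_iris()\n",
    "x_tr, x_te, y_tr, y_te = train_test_split(\n",
    "    iris_raw.data, iris_raw.target, test_size=0.25, random_state=33)\n",
    "\n",
    "# Fit on the training set only, then transform both sets\n",
    "check_scaler = preprocessing.StandardScaler().fit(x_tr)\n",
    "x_tr_std = check_scaler.transform(x_tr)\n",
    "x_te_std = check_scaler.transform(x_te)\n",
    "\n",
    "# Reproduce the transform by hand from the training-set statistics\n",
    "manual = (x_te - check_scaler.mean_) / check_scaler.scale_\n",
    "print(np.allclose(manual, x_te_std))\n",
    "\n",
    "# Per-feature mean and standard deviation of the standardized training set\n",
    "print(x_tr_std.mean(axis=0).round(3), x_tr_std.std(axis=0).round(3))\n",
    "# The same statistics for the test set, which was scaled with training-set values\n",
    "print(x_te_std.mean(axis=0).round(3), x_te_std.std(axis=0).round(3))"
   ]
  },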
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
    "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
    "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
    "* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
    "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Licences\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.6"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}