{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](files/images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [Preprocessing](#Preprocessing)\n", "* [Training set and Test set](#Training-set-and-Test-set)\n", "* [Preprocessing](#Preprocessing)\n", "* [References](#References)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this notebook is to learn how to split the dataset into a training and a test datasets and then preprocess the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "iris = datasets.load_iris()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training set and Test set" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n", "\n", "We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ratio 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "x_iris, y_iris = iris.data, iris.target\n", "# Test set will be the 25% taken randomly\n", "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Dimensions of train and testing\n", "print(x_train.shape, x_test.shape)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Test set\n", "print (x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n", "\n", "The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Standardize the features\n", "from sklearn import preprocessing\n", "scaler = preprocessing.StandardScaler().fit(x_train)\n", "x_train = scaler.transform(x_train)\n", "x_test = scaler.transform(x_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# As we see, the iris dataset is now normalized\n", "print(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n", "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n", "* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n", "* [Matplotlib web page](http://matplotlib.org/index.html)\n", "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n", "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Licences\n", "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.6" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 1 }