{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Rescaling, Standardizing and Normalizing Data\n", "\n", "When our data comprises attributes with varying scales, many machine learning algorithms can benefit from rescaling the attributes to have the same scale. \n", "\n", "Many machine learning algorithms perform better or converge faster when features are on a relatively similar scale and/or close to normally distributed. \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Examples:\n", "* linear and logistic regression\n", "* nearest neighbors\n", "* neural networks\n", "* support vector machines with radial bias kernel functions\n", "* principal components analysis\n", "* linear discriminant analysis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "The three terms are frequently used interchangeably, but their meanings differ.\n", "\n", "* **Rescaling** means adding/subtracting a constant, then multiplying/dividing by another constant so the features can lie between given minimum and maximum values. For example, convert Celsius to Fahrenheit or [0, 1].\n", "\n", "* **Standardizing**: subtract the mean and divide by the standard deviation, obtaining a \"standard normal\" distribution.\n", "\n", "* **Normalizing**: dividing by a vector norm, so values are between [0, 1]. This process can be helpful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Rescaling with with Scikit-Learn\n", "We have different classes. The main ones are:\n", "\n", "* **MinMaxScaler**: substracts the minimum value in the feature and divides it by the range (maximum - minimum). The result is a value in [0, 1]. Preserves the original form of the data and does not reduce the importance of outliners.\n", "* **MaxAbsScaler**divides each feature by its maximum absolute value. The result is a value in [-1 1]. This method preserves the original form of the data and does not reduce the importance of outliners." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "* **RobustScaler**: substracts the median and divides by the interquartile range (75% value - 25% value). It is used to reduce the effect of outliers. The result is not scaled in a predefined interval.\n", "* **StandardScaler**: substracts the mean and divides by the standard deviation. The result is a standard distribution with a variance of 1 and a mean of 0." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "from sklearn import preprocessing\n", "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "matplotlib.style.use('ggplot')" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "X_train = np.array([[ 1., -1., 2.],\n", " [ 2, 0., 0.],\n", " [ 0., 1., -1.]])" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "#scaler = preprocessing.MinMaxScaler()\n", "#scaler = preprocessing.MaxAbsScaler()\n", "#scaler = preprocessing.RobustScaler()\n", "scaler = preprocessing.StandardScaler()" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
StandardScaler()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
StandardScaler()