![](files/images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias

## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)

# Table of Contents
* [Training set and Test set](#Training-set-and-Test-set)
* [Preprocessing](#Preprocessing)
* [References](#References)

# Preprocessing

The goal of this notebook is to learn how to separate the dataset into training and test datasets and then preprocess the data.

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

## Training set and Test set

A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. 

We are going to use *scikit-learn* to split the data into random training and testing sets, with a ratio of 75% for training and 25% for testing. We set `random_state` so that the split is reproducible; otherwise, we would get different training and testing sets on every run.

In [3]:
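# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
x_iris, y_iris = iris.data, iris.target
# The test set is a random 25% of the data
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)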

In [4]:
# Dimensions of the training and test sets
print(x_train.shape, x_test.shape)

(112, 4) (38, 4)
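
The effect of `random_state` and the resulting set sizes can be checked with a short sketch like the one below. It is only an illustration: the variable names (`x_test_a`, `x_test_b`, `x_test_c`, `same_seed_equal`) and the second seed value `0` are assumptions introduced here, not part of the original notebook. With 150 samples and `test_size=0.25`, scikit-learn rounds the test size up to 38, which leaves 112 samples for training.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_iris, y_iris = iris.data, iris.target

# Two splits with the same seed (33, as in the notebook)
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x_iris, y_iris, test_size=0.25, random_state=33)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x_iris, y_iris, test_size=0.25, random_state=33)

# A split with a different seed (0 is an arbitrary choice for illustration)
x_train_c, x_test_c, y_train_c, y_test_c = train_test_split(
    x_iris, y_iris, test_size=0.25, random_state=0)

# The same seed should reproduce exactly the same test set
same_seed_equal = np.array_equal(x_test_a, x_test_b)
print("Same seed, same test set:", same_seed_equal)

# A different seed will normally select different samples
print("Different seed, same test set:", np.array_equal(x_test_a, x_test_c))

# Sizes of both partitions
print(x_train_a.shape, x_test_a.shape)
```

Fixing the seed is mostly useful for teaching and debugging; when estimating how well a model generalizes, it can be worth repeating the experiment with several seeds.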


In [5]:
# Test set
print(x_test)

[[ 5.7  2.9  4.2  1.3]
 [ 6.7  3.1  4.4  1.4]
 [ 4.7  3.2  1.6  0.2]
 [ 6.5  2.8  4.6  1.5]
 [ 6.1  2.6  5.6  1.4]
 [ 6.3  3.3  6.   2.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.1  3.5  1.4  0.3]
 [ 6.4  3.1  5.5  1.8]
 [ 6.9  3.2  5.7  2.3]
 [ 6.8  3.2  5.9  2.3]
 [ 4.4  3.   1.3  0.2]
 [ 6.3  3.4  5.6  2.4]
 [ 6.1  2.9  4.7  1.4]
 [ 6.9  3.1  5.1  2.3]
 [ 6.4  2.9  4.3  1.3]
 [ 6.   3.   4.8  1.8]
 [ 5.2  3.5  1.5  0.2]
 [ 6.3  3.3  4.7  1.6]
 [ 7.2  3.2  6.   1.8]
 [ 4.9  3.1  1.5  0.1]
 [ 5.7  3.8  1.7  0.3]
 [ 6.5  3.   5.8  2.2]
 [ 4.8  3.   1.4  0.1]
 [ 6.   2.2  5.   1.5]
 [ 6.2  2.8  4.8  1.8]
 [ 6.1  3.   4.6  1.4]
 [ 6.1  2.8  4.   1.3]
 [ 6.5  3.   5.2  2. ]
 [ 5.9  3.   5.1  1.8]
 [ 5.6  2.7  4.2  1.3]
 [ 6.7  3.1  4.7  1.5]
 [ 5.6  2.8  4.9  2. ]
 [ 6.4  3.2  5.3  2.3]
 [ 6.7  3.1  5.6  2.4]
 [ 6.7  3.   5.2  2.3]
 [ 5.8  2.7  5.1  1.9]
 [ 5.7  3.   4.2  1.2]]


## Preprocessing

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.

The `preprocessing` module provides a utility class, `StandardScaler`, that computes the mean and standard deviation of each feature on the training set. The same transformation is then applied to the testing set, so that no information from the test data influences the scaling parameters.

In [10]:
# Standardize the features
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
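
To see what the scaler does, the following sketch recomputes the transformation by hand as `(x - mean) / std`, using the `mean_` and `scale_` attributes that `StandardScaler` stores after `fit`. The names `x_train_std`, `x_test_std` and `manual` are introduced here for illustration only and do not appear in the original notebook; the split is repeated so the block runs on its own.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# Fit on the training set only, then transform both sets
scaler = preprocessing.StandardScaler().fit(x_train)
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)

# Manual standardization of the test set with the training statistics
manual = (x_test - scaler.mean_) / scaler.scale_
print("Manual result matches transform():", np.allclose(manual, x_test_std))

# The training set has (numerically) zero mean and unit standard deviation
print("Train mean per feature:", x_train_std.mean(axis=0).round(6))
print("Train std per feature: ", x_train_std.std(axis=0).round(6))

# The test set is only approximately standardized, because it was
# scaled with statistics estimated on the training set
print("Test mean per feature: ", x_test_std.mean(axis=0).round(3))
print("Test std per feature:  ", x_test_std.std(axis=0).round(3))
```

The test-set means and standard deviations are expected to be close to, but not exactly, 0 and 1. This is the intended behaviour when the scaler is fitted on the training data alone.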

In [11]:
# As we can see, the iris test set is now standardized
print(x_test)

[[-0.09752318 -0.32858743  0.34599443  0.25682671]
 [ 1.06445511  0.09442168  0.45718919  0.39124069]
 [-1.25950146  0.30592623 -1.09953753 -1.22172707]
 [ 0.83205945 -0.54009199  0.56838396  0.52565467]
 [ 0.36726814 -0.9631011   1.12435779  0.39124069]
 [ 0.59966379  0.51743079  1.34674732  1.86979447]
 [-1.14330363  0.72893534 -0.93274538 -1.22172707]
 [-0.79471015  0.9404399  -1.2107323  -1.08731309]
 [ 0.71586162  0.09442168  1.06876041  0.92889661]
 [ 1.29685076  0.30592623  1.17995517  1.60096651]
 [ 1.18065293  0.30592623  1.29114994  1.60096651]
 [-1.60809495 -0.11708288 -1.26632968 -1.22172707]
 [ 0.59966379  0.72893534  1.12435779  1.73538049]
 [ 0.36726814 -0.32858743  0.62398134  0.39124069]
 [ 1.29685076  0.09442168  0.84637087  1.60096651]
 [ 0.71586162 -0.32858743  0.40159181  0.25682671]
 [ 0.25107031 -0.11708288  0.67957873  0.92889661]
 [-0.67851232  0.9404399  -1.15513491 -1.22172707]
 [ 0.59966379  0.51743079  0.62398134  0.66006865]
 [ 1.64544425  0.30592623  1.34

## References

* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)
* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)
* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.
* [Matplotlib web page](http://matplotlib.org/index.html)
* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)
* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)

### Licences
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid.