![](files/images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)

# Table of Contents
* [Preprocessing](#Preprocessing)
* [Training set and Test set](#Training-set-and-Test-set)
* [Preprocessing](#Preprocessing)
* [References](#References)

# Preprocessing

The goal of this notebook is to learn how to split a dataset into training and test sets and then preprocess the data.

In [None]:
from sklearn import datasets
iris = datasets.load_iris()

## Training set and Test set

A common way to evaluate a machine learning algorithm is to split the available data into two sets: the **training set**, on which the model learns the properties of the data, and the **testing set**, on which we test what it has learned.

We are going to use *scikit-learn* to split the data into random training and testing sets, using 75% of the samples for training and 25% for testing. We set `random_state` so that the split is reproducible; otherwise, we would get different training and testing sets on every run.

In [None]:
from sklearn.model_selection import train_test_split
x_iris, y_iris = iris.data, iris.target
# The test set is a random 25% of the samples
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)

In [None]:
# Dimensions of the training and testing sets
print(x_train.shape, x_test.shape)
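
The role of `random_state` can be checked directly. The following is a minimal sketch, not part of the original notebook: it repeats the split with the same seed and compares the results. The names `split_a`, `split_b` and `same_split` are introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_iris, y_iris = iris.data, iris.target

# Two calls with the same random_state should return identical splits
split_a = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)
split_b = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)

# Compare the four arrays (x_train, x_test, y_train, y_test) pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)

# Check how many samples ended up in each set
x_train, x_test, y_train, y_test = split_a
print(x_train.shape, x_test.shape)
```

With the 150 samples of the iris dataset and `test_size=0.25`, the split contains 112 training samples and 38 testing samples, since scikit-learn rounds the test size up.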

In [None]:
# Test set
print(x_test)

## Preprocessing

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.

The `preprocessing` module provides a utility class, `StandardScaler`, that computes the mean and standard deviation of each feature on the training set, so that the same transformation can later be applied to the testing set.

In [None]:
# Standardize the features
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
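
To make explicit what the scaler does, here is a small sketch, offered as an illustration rather than as part of the original notebook. It standardizes the test set by hand, using only the mean and standard deviation of the training set, and compares the result with `StandardScaler`. The names `train_mean`, `train_std`, `x_test_manual` and `matches` are assumptions made for this example.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

# Fit the scaler on the training set only
scaler = preprocessing.StandardScaler().fit(x_train)

# Manual standardization: z = (x - mean) / std, with training statistics
train_mean = x_train.mean(axis=0)
train_std = x_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses
x_test_manual = (x_test - train_mean) / train_std

# The manual result should agree with the scaler up to floating-point error
matches = np.allclose(x_test_manual, scaler.transform(x_test))
print(matches)
```

The point of fitting on the training set alone is that no information from the testing set leaks into the preprocessing step.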

In [None]:
# As we can see, the test set is now standardized
print(x_test)
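
As a further check, the per-feature statistics can be inspected after scaling. This is a hedged sketch that rebuilds the split and the scaler so it runs on its own; the names `x_train_std` and `x_test_std` are introduced only for this example.

```python
import numpy as np
from sklearn import datasets, preprocessing
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=33)

scaler = preprocessing.StandardScaler().fit(x_train)
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)

# Training features: mean close to 0 and standard deviation close to 1
print(x_train_std.mean(axis=0).round(3))
print(x_train_std.std(axis=0).round(3))

# Test features: scaled with the training statistics, so only approximately centered
print(x_test_std.mean(axis=0).round(3))
print(x_test_std.std(axis=0).round(3))
```

The training set has zero mean and unit variance by construction. The testing set generally does not, because it is transformed with the statistics of the training set rather than its own.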

## References

* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)
* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)
* [Matplotlib web page](http://matplotlib.org/index.html)
* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)
* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)

### Licences
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.