![](files/images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)

# Table of Contents
* [kNN Model](#kNN-Model)
* [Loading data and preprocessing](#Loading-data-and-preprocessing)
* [Train classifier](#Train-classifier)
* [Evaluating the algorithm](#Evaluating-the-algorithm)
    * [Precision, recall and f-score](#Precision,-recall-and-f-score)
    * [Confusion matrix](#Confusion-matrix)
    * [K-Fold validation](#K-Fold-validation)
* [Tuning the algorithm](#Tuning-the-algorithm)
* [References](#References)

# kNN Model

The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.

The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm).
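To make the idea behind kNN concrete before using the scikit-learn implementation, here is a minimal sketch written from scratch. It assumes Euclidean distance and a simple majority vote; the function `knn_predict` and the toy data are hypothetical and are not part of the notebook or of scikit-learn.

```python
from collections import Counter

import numpy as np


def knn_predict(x_train, y_train, x_query, k=3):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query point to every training sample
    distances = np.linalg.norm(x_train - x_query, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_x, toy_y, np.array([0.3, 0.3]), k=3))  # point near the first cluster
print(knn_predict(toy_x, toy_y, np.array([4.8, 5.0]), k=3))  # point near the second cluster
```

Note that there is no real training step in this sketch: the "model" is just the stored training data, and all the work happens at prediction time.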

## Loading data and preprocessing

The first step is loading and preprocessing the data as explained in the previous notebooks.

In [None]:
# library for displaying plots
import matplotlib.pyplot as plt

# display plots in the notebook 
%matplotlib inline

In [None]:
## First, we repeat the load and preprocessing steps

# Load data
from sklearn import datasets
iris = datasets.load_iris()

# Training and test splitting
from sklearn.model_selection import train_test_split

x_iris, y_iris = iris.data, iris.target

# The test set is a random 25% of the samples
x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)

# Preprocess: standardize features (zero mean, unit variance); the scaler is fitted on the training set only
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
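As a quick check of what the scaling step does, the following sketch compares `StandardScaler` with the same computation done by hand. The toy matrix is a hypothetical example, and the variable names are chosen so that they do not overwrite the notebook's own variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: 4 samples, 2 features on very different scales
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

toy_scaler = StandardScaler().fit(toy)
scaled = toy_scaler.transform(toy)

# Same result computed by hand: (x - mean) / std, column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

Scaling matters for kNN because the algorithm relies on distances: without it, the feature with the largest numeric range would dominate the choice of neighbors.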

## Train classifier

The usual steps for creating a classifier are:
1. Create classifier object
2. Call *fit* to train the classifier
3. Call *predict* to obtain predictions

Once the model is created, the most relevant methods are:
* model.fit(x_train, y_train): train the model
* model.predict(x): predict
* model.score(x, y): evaluate the model on data x with true labels y (for classifiers, it returns the mean accuracy)
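As a compact illustration of these three methods, the sketch below trains a classifier and checks that `score` gives the same value as computing the accuracy of the predictions explicitly. The `demo_` names, `n_neighbors=5` and `random_state=0` are assumptions chosen for this example, so that the notebook's own variables are left untouched.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Separate names (prefixed with demo_) so the notebook variables are not overwritten
demo_x, demo_y = datasets.load_iris(return_X_y=True)
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_x, demo_y, test_size=0.25, random_state=0)

demo_model = KNeighborsClassifier(n_neighbors=5)  # 1. create the classifier object
demo_model.fit(demo_x_train, demo_y_train)        # 2. train it
demo_pred = demo_model.predict(demo_x_test)       # 3. obtain predictions

# For classifiers, score() predicts internally and returns the mean accuracy
demo_score = demo_model.score(demo_x_test, demo_y_test)
demo_accuracy = metrics.accuracy_score(demo_y_test, demo_pred)
print(demo_score, demo_accuracy)
```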

In [None]:
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Create kNN model
model = KNeighborsClassifier(n_neighbors=15)

# Train the model using the training set
model.fit(x_train, y_train) 

In [None]:
print("Prediction ", model.predict(x_train))
print("Expected ", y_train)

In [None]:
# Evaluate accuracy on the training set

from sklearn import metrics
y_train_pred = model.predict(x_train)
print("Accuracy in training", metrics.accuracy_score(y_train, y_train_pred))

In [None]:
# Now we evaluate accuracy on the test set
y_test_pred = model.predict(x_test)
print("Accuracy in testing ", metrics.accuracy_score(y_test, y_test_pred))

Next, we visualize the nearest neighbors classification by plotting the decision boundaries for each class.

We are going to import a function defined in the file [util_knn.py](files/util_knn.py) using the *magic command* **%run**.

In [None]:
%run util_knn.py

plot_classification_iris()

## Evaluating the algorithm

### Precision, recall and f-score

For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score.

* **Precision**: The proportion of instances predicted as positive that are actually positive (it measures how right our classifier is when it says that an instance is positive).
* **Recall**: The proportion of actual positive instances that were predicted as positive (it measures how right our classifier is when faced with a positive instance).
* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number.
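The three definitions above can be worked through by hand on a small binary example. The labels below are hypothetical; the sketch counts true positives (TP), false positives (FP) and false negatives (FN), and then compares the results with scikit-learn's own functions.

```python
from sklearn import metrics

# Hypothetical binary labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(precision, metrics.precision_score(y_true, y_pred))
print(recall, metrics.recall_score(y_true, y_pred))
print(f1, metrics.f1_score(y_true, y_pred))
```

In a multiclass problem such as iris, these metrics are computed once per class, treating that class as the positive one; this is what the per-class rows of `classification_report` show.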

In [None]:
print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))

### Confusion matrix

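Another useful tool is the confusion matrix. In scikit-learn's convention, each row corresponds to a true class and each column to a predicted class, so correct predictions lie on the diagonal.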

In [None]:
print(metrics.confusion_matrix(y_test, y_test_pred))

We see that all the 'setosa' and 'versicolor' samples are classified correctly.
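To practice reading a confusion matrix, here is a minimal sketch with hypothetical labels for a three-class problem (these are not the iris predictions). Rows are true classes and columns are predicted classes, so the per-class recall can be read as the diagonal divided by the row sums.

```python
from sklearn import metrics

# Hypothetical labels for a three-class problem
cm_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
cm_pred = [0, 0, 1, 1, 2, 2, 2, 1, 2]

cm = metrics.confusion_matrix(cm_true, cm_pred)
print(cm)

# Correct predictions are on the diagonal
overall_accuracy = cm.diagonal().sum() / cm.sum()
# Recall per class: correct predictions divided by the number of true samples of that class
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(overall_accuracy, per_class_recall)
```

Here one sample of class 1 is predicted as class 2 and one sample of class 2 is predicted as class 1, which shows up as the two off-diagonal entries.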

### K-Fold validation

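A single train/test partition can bias the evaluation, because the result depends on which samples happen to fall in the test set. To reduce this effect, it is recommended to use **k-fold cross-validation**: the data is split into k folds, and each fold is used once for testing while the remaining k-1 folds are used for training.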
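To make the splitting concrete, here is a minimal sketch of how a k-fold iterator partitions sample indices. The 10-sample data set and the choice of 5 unshuffled folds are assumptions made only for readability; the notebook cell that follows uses 10 shuffled folds on the iris data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data set of 10 samples, identified by their indices
samples = np.arange(10)

# 5 folds without shuffling, so the folds are easy to read
kf = KFold(n_splits=5)
folds = list(kf.split(samples))

for i, (train_idx, test_idx) in enumerate(folds):
    print("Fold", i, "train:", train_idx, "test:", test_idx)
```

Every sample appears in exactly one test fold, so each one is used for testing once and for training in the other folds.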

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# create a composite estimator: a pipeline of preprocessing followed by the kNN model
model = Pipeline([
        ('scaler', StandardScaler()),
        ('kNN', KNeighborsClassifier())
])

# create a k-fold cross validation iterator of k=10 folds
cv = KFold(10, shuffle=True, random_state=33)

# by default the score used is the one returned by score method of the estimator (accuracy)
scores = cross_val_score(model, x_iris, y_iris, cv=cv)
print(scores)

We get an array of k scores. We can calculate the mean and the standard error of the mean to obtain a final figure.

In [None]:
from scipy.stats import sem
def mean_score(scores):
    return ("Mean score: {0:.3f} (+/- {1:.3f})").format(np.mean(scores), sem(scores))
print(mean_score(scores))

So, we get an average accuracy of 0.940.
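For reference, the "+/-" figure reported by `sem` is the standard error of the mean: the sample standard deviation divided by the square root of the number of scores. The sketch below checks this by hand; the scores are hypothetical and are not the output of the cell above.

```python
import numpy as np
from scipy.stats import sem

# Hypothetical fold scores, not the output of the notebook cell
toy_scores = np.array([0.9, 1.0, 0.8, 1.0, 0.9])

# Standard error of the mean: sample standard deviation (ddof=1) over sqrt(n)
manual_sem = np.std(toy_scores, ddof=1) / np.sqrt(len(toy_scores))

print(np.mean(toy_scores), manual_sem, sem(toy_scores))
```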

## Tuning the algorithm

We are going to tune the algorithm by finding the best value for the k parameter (the number of neighbors).

In [None]:
# Try every k from 1 to 20 and record the accuracy on the test set
k_range = range(1, 21)
accuracy = []
for k in k_range:
    m = KNeighborsClassifier(n_neighbors=k)
    m.fit(x_train, y_train)
    y_test_pred = m.predict(x_test)
    accuracy.append(metrics.accuracy_score(y_test, y_test_pred))

# Plot accuracy as a function of k
plt.plot(k_range, accuracy)
plt.xlabel('k value')
plt.ylabel('Accuracy')
plt.show()


The result depends heavily on the input data. Run train_test_split again (for example, with a different random_state) and check how the accuracy changes with k.
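One possible way to reduce this dependence on a single split is to select k with cross-validation instead of a fixed test set. The sketch below uses scikit-learn's `GridSearchCV`, which does not appear in the original notebook; it reuses the pipeline and the 10-fold iterator from the K-Fold section, and the `grid_` names are assumptions chosen to avoid overwriting the notebook's variables.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

grid_x, grid_y = datasets.load_iris(return_X_y=True)

grid_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('kNN', KNeighborsClassifier()),
])

# '<step name>__<parameter name>' selects a parameter of a pipeline step
param_grid = {'kNN__n_neighbors': list(range(1, 21))}
grid_cv = KFold(10, shuffle=True, random_state=33)

# Evaluate every k with 10-fold cross-validation (default scoring: accuracy)
search = GridSearchCV(grid_pipeline, param_grid, cv=grid_cv)
search.fit(grid_x, grid_y)

print("Best k:", search.best_params_['kNN__n_neighbors'])
print("Mean cross-validated accuracy:", search.best_score_)
```

Because the scaler is part of the pipeline, it is refitted on the training folds of each split, so no information from the held-out fold leaks into the preprocessing.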

## References

* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.


## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.