![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)

# Introduction to SVM

In this notebook we are going to train a classifier with the preprocessed Titanic dataset. 

We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook.

## Load and clean

In [None]:
# General import and load data
import pandas as pd
import numpy as np

from pandas import Series, DataFrame

# Training and test splitting
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

# Estimators
from sklearn.svm import SVC

# Evaluation
from sklearn import metrics
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# Optimization
from sklearn.model_selection import GridSearchCV

# Visualisation
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(color_codes=True)


# if matplotlib is not set inline, you will not see plots
#alternatives auto gtk gtk2 inline osx qt qt5 wx tk
#%matplotlib auto
#%matplotlib qt
%matplotlib inline
%run plot_learning_curve

In [None]:
# We use the URL of the raw content (not the HTML page)
url = "https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv"
df = pd.read_csv(url)

# Fill missing values (plain assignment instead of inplace=True on a column,
# which recent pandas versions warn about)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Sex'] = df['Sex'].fillna('male')
df['Embarked'] = df['Embarked'].fillna('S')

# Encode categorical variables
df.loc[df["Sex"] == "male", "Sex"] = 0
df.loc[df["Sex"] == "female", "Sex"] = 1
df.loc[df["Embarked"] == "S", "Embarked"] = 0
df.loc[df["Embarked"] == "C", "Embarked"] = 1
df.loc[df["Embarked"] == "Q", "Embarked"] = 2

# Drop columns
df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)

# Show preprocessed df
df.head()

In [None]:
#Check types are numeric
df.dtypes

We still have two columns of type object, so we convert them to integers.

In [None]:
df['Sex'] = df['Sex'].astype(np.int64)
df['Embarked'] = df['Embarked'].astype(np.int64)
df.dtypes
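
As an alternative to assigning the codes with `loc` and then casting, the encoding can be sketched with `Series.map`, which returns an integer column directly when every label has a code. The `toy` DataFrame below is a hypothetical stand-in, not the Titanic data.

```python
import pandas as pd

# Hypothetical toy frame with the same categorical columns as the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# map() replaces each label with its code; the result is already an integer column
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})

print(toy.dtypes)
```

Labels missing from the dictionary would become `NaN`, so this sketch assumes missing values were filled beforehand.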

In [None]:
# Check there are no missing values
df.isnull().any()

# Train and test splitting

We use the same techniques we applied to the Iris dataset.

However, we need to separate the target column 'Survived' from the features.

In [None]:
# Features of the model
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
# Transform dataframe in numpy arrays
X = df[features].values
y = df['Survived'].values



# The test set is a random 25% of the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

# Preprocess: normalize
#scaler = preprocessing.StandardScaler().fit(X_train)
#X_train = scaler.transform(X_train)
#X_test = scaler.transform(X_test)
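
The commented-out normalization step can also be expressed as a scikit-learn pipeline, so that the scaler is always fitted on the training data only. This is a minimal sketch on synthetic data generated with `make_classification` (an assumption made for illustration, not the Titanic features). The names `X_demo`, `y_demo` and `scaled_svm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 7 features (assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=33)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=33)

# The pipeline fits the scaler on the training data, then trains the SVM on the scaled data
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scaled_svm.fit(X_tr, y_tr)

# At prediction time the same scaling is applied automatically
accuracy = scaled_svm.score(X_te, y_te)
print("Accuracy with scaling:", accuracy)
```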

# Define model

In [None]:

types_of_kernels = ['linear', 'rbf', 'poly']

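kernel = types_of_kernels[0]
# gamma is only used by the 'rbf' and 'poly' kernels; it has no effect with 'linear'
gamma = 3.0

# Create SVM model (probability=True enables predict_proba, used later for the ROC curve)
model = SVC(kernel=kernel, probability=True, gamma=gamma)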

# Train and evaluate

In [None]:
#This step will take some time 
# Train - This is not needed if you use K-Fold

model.fit(X_train, y_train)

predicted = model.predict(X_test)
expected = y_test

In [None]:
# Accuracy
metrics.accuracy_score(expected, predicted)

OK, we get around 82% accuracy! (The result depends on the split.)

## Null accuracy

We can evaluate the accuracy obtained if the model always predicts the most frequent class (the *null accuracy*), following this [reference](https://medium.com/analytics-vidhya/model-validation-for-classification-5ff4a0373090).

In [None]:
# Count number of samples per class
s_y_test = Series(y_test)
s_y_test.value_counts()

In [None]:
# Mean of ones
y_test.mean()

In [None]:
# Mean of zeros
1 - y_test.mean() 


In [None]:
# Calculate null accuracy (binary classification coded as 0/1)
max(y_test.mean(), 1 - y_test.mean())

In [None]:
# Calculate null accuracy (multiclass classification)
s_y_test.value_counts().head(1) / len(y_test)

Since our accuracy was 0.82, the model performs better than the null accuracy.
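
The same baseline can be obtained with scikit-learn's `DummyClassifier`, which always predicts the most frequent class seen during training. The labels below are hypothetical and only illustrate the idea.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 6 negatives and 4 positives
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

null_accuracy = baseline.score(X_demo, y_demo)
print(null_accuracy)  # fraction of samples in the majority class
```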

## Confusion matrix and F-score

We can obtain more information from the confusion matrix and the F1-score metric.
In a confusion matrix, we can see:

| |**Predicted: 0**|**Predicted: 1**|
|---|---|---|
|**Actual: 0**| TN | FP |
|**Actual: 1**| FN | TP |

* **True negatives (TN)**: actual negatives that were predicted as negatives
* **False positives (FP)**: actual negatives that were predicted as positives
* **False negatives (FN)**: actual positives that were predicted as negatives
* **True positives (TP)**: actual positives that were predicted as positives

We can calculate several metrics from the confusion matrix:

* **Recall** (also called *sensitivity*): when the actual value is positive, how often is the prediction correct? (TP / (TP + FN))
* **Specificity**: when the actual value is negative, how often is the prediction correct? (TN / (TN + FP))
* **False Positive Rate**: when the actual value is negative, how often is the prediction incorrect? (FP / (TN + FP))
* **Precision**: when a positive value is predicted, how often is the prediction correct? (TP / (TP + FP))

A good summary metric is the F1-score, the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)
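
The formulas above can be checked on a small example. This sketch uses hypothetical labels and predictions (not the Titanic results) and unpacks the four cells of the matrix with `ravel()`.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical ground truth and predictions for ten samples
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_pred).ravel())

recall = tp / (tp + fn)
specificity = tn / (tn + fp)
false_positive_rate = fp / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(tn, fp, fn, tp)
print(recall, specificity, false_positive_rate, precision, f1)
```

The value of `f1` computed by hand should agree with `f1_score(y_true, y_pred)`.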

In [None]:
# Confusion matrix
print(metrics.confusion_matrix(expected, predicted))

In [None]:
# Report
print(classification_report(expected, predicted))

## ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)

The [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)  curve illustrates the performance of a binary classifier system as its discrimination threshold is varied.

In [None]:
y_pred_prob = model.predict_proba(X_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for Titanic')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)

In [None]:
# Thresholds used by the decision function; thresholds[0] corresponds to no instances being predicted positive
thresholds

In [None]:
#Histogram of probability vs actual
dprob = pd.DataFrame(data = {'probability':y_pred_prob, 'actual':y_test})
dprob.probability.hist(by=dprob.actual, sharex=True, sharey=True)

The ROC curve helps to select a threshold that balances sensitivity and specificity.

In [None]:
# Function to evaluate thresholds of the ROC curve
def evaluate_threshold(threshold):
    # Take the last ROC point whose threshold is above the given value
    print('Sensitivity:', tpr[thresholds > threshold][-1])
    print('Specificity:', 1 - fpr[thresholds > threshold][-1])

In [None]:
evaluate_threshold(0.74)

In [None]:
evaluate_threshold(0.5)

By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold to obtain the predictions from the probability of the positive class.

threshold = 0.8

predicted = model.predict_proba(X_test)[:, 1] > threshold
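
To illustrate the effect of moving the threshold, here is a minimal sketch with hypothetical positive-class probabilities (the kind of values found in the second column of `predict_proba`). The array `proba_pos` is invented for the example.

```python
import numpy as np

# Hypothetical probabilities of the positive class for five samples
proba_pos = np.array([0.10, 0.55, 0.79, 0.81, 0.95])

# Default decision threshold
labels_default = (proba_pos > 0.5).astype(int)
# Stricter threshold: fewer samples are labelled as positive
labels_strict = (proba_pos > 0.8).astype(int)

print(labels_default)
print(labels_strict)
```

Raising the threshold trades sensitivity for specificity, which is the trade-off read from the ROC curve.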

AUC is the fraction of the ROC plot that lies underneath the curve. It represents the likelihood that the classifier assigns a higher predicted probability to a randomly chosen positive observation than to a randomly chosen negative one. A simple rule to evaluate a classifier based on this summary value is the following:
* .90-1 = very good (A)
* .80-.90 = good (B)
* .70-.80 = not so good (C)
* .60-.70 = poor (D)
* .50-.60 = fail (F)
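
The ranking interpretation of AUC can be verified numerically. The sketch below uses four hypothetical scores, counts the fraction of (positive, negative) pairs in which the positive sample receives the higher score, and compares the result with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc)
print(roc_auc_score(y_true, y_score))
```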

In [None]:
# AUC (computed from the predicted probabilities of the positive class)
print(roc_auc_score(y_test, y_pred_prob))

# Train and Evaluate with K-Fold

This is an alternative to splitting the dataset into train and test sets. It runs about k times slower than the other method, but it gives a more reliable estimate of the accuracy.

In [None]:
# This step will take some time
# Cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=33)
# StratifiedKFold is a variation of k-fold which returns stratified folds:
# each set contains approximately the same percentage of samples of each target class as the complete set.
#cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=33)
scores = cross_val_score(model, X, y, cv=cv)
print("Scores in every iteration", scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

We get about 78% accuracy with K-Fold, which is quite good!
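
The stratified variant mentioned in the comments can be sketched as follows. The data is synthetic (generated with `make_classification` with a 60/40 class balance, an assumption made only for illustration), and `cv_strat` and `scores_strat` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic, moderately imbalanced stand-in for the dataset
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, weights=[0.6, 0.4], random_state=33
)

# Each fold keeps approximately the class proportions of the whole set
cv_strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)
scores_strat = cross_val_score(SVC(kernel="linear"), X_demo, y_demo, cv=cv_strat)

print("Scores in every iteration", scores_strat)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores_strat.mean(), scores_strat.std() * 2))
```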

We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training score decreases with the number of samples, and the cross-validation score reaches the training score at the end. It seems we will not get a better result with more samples.

In [None]:
plot_learning_curve(model, "Learning curve with K-Fold", X, y, cv=cv)

# Train and Optimize

In this section we provide an alternative version of the previous one, with hyperparameter optimization.

In [None]:
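# Tune parameters
# Note: gamma only affects the 'rbf' and 'poly' kernels, so the search
# changes the result only if one of those kernels was selected above
gammas = np.logspace(-6, -1, 10)
gs = GridSearchCV(model, param_grid=dict(gamma=gammas))
gs.fit(X_train, y_train)
scores = gs.score(X_test, y_test)
print(scores)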
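
A grid search usually explores several hyperparameters at once. The following is a minimal sketch, on synthetic data, of a joint search over `C` and `gamma` for an RBF kernel. The grid values and the names `param_grid` and `search` are assumptions chosen for illustration, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=150, n_features=7, random_state=33)

# Candidate values for the regularization strength C and the kernel coefficient gamma
param_grid = {"C": [0.1, 1, 10], "gamma": np.logspace(-3, 0, 4)}

# Every combination is evaluated with 3-fold cross-validation
search = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=3)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a model retrained on all the data passed to `fit` with the best combination found.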

In [None]:
# Refine model
model = SVC(kernel='linear', gamma=gs.best_estimator_.gamma)
plot_learning_curve(model, "optimized with GridSearch", X, y, cv=cv)

# Visualise

In [None]:
# Plot with standard configuration of SVM
%run plot_svm
plot_svm(df)

Anyone in the blue region survived, while anyone in the red region did not. Check out the graph for the linear transformation: it created its decision boundary right at 50%!

# References

* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)
* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
* [How to choose the right metric for evaluating an ML model](https://www.kaggle.com/vipulgandhi/how-to-choose-right-metric-for-evaluating-ml-model)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.