sitc/ml2/3_7_SVM.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction SVM "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
    "\n",
    "We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and clean"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# General import and load data\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "# Training and test spliting\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn import preprocessing\n",
    "\n",
    "# Estimators\n",
    "from sklearn.svm import SVC\n",
    "\n",
    "# Evaluation\n",
    "from sklearn import metrics\n",
    "from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import roc_curve\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "# Optimization\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "# Visualisation\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "\n",
    "# if matplotlib is not set inline, you will not see plots\n",
    "#alternatives auto gtk gtk2 inline osx qt qt5 wx tk\n",
    "#%matplotlib auto\n",
    "#%matplotlib qt\n",
    "%matplotlib inline\n",
    "%run plot_learning_curve"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#We get a URL with raw content (not HTML one)\n",
    "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
    "df = pd.read_csv(url)\n",
    "df.head()\n",
    "\n",
    "\n",
    "#Fill missing values\n",
    "df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
    "df['Sex'].fillna('male', inplace=True)\n",
    "df['Embarked'].fillna('S', inplace=True)\n",
    "\n",
    "# Encode categorical variables\n",
    "df['Age'] = df['Age'].fillna(df['Age'].median())\n",
    "df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
    "df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
    "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
    "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
    "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
    "\n",
    "# Drop colums\n",
    "df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
    "\n",
    "#Show proprocessed df\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Check types are numeric\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have still two columns as objects, so we change the type."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['Sex'] = df['Sex'].astype(np.int64)\n",
    "df['Embarked'] = df['Embarked'].astype(np.int64)\n",
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Check there are not missing values\n",
    "df.isnull().any()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train and test splitting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the same techniques we applied in the Iris dataset. \n",
    "\n",
    "Nevertheless, we need to remove the column 'Survived' "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Features of the model\n",
    "features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']\n",
    "# Transform dataframe in numpy arrays\n",
    "X = df[features].values\n",
    "y = df['Survived'].values\n",
    "\n",
    "\n",
    "\n",
    "# Test set will be the 25% taken randomly\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)\n",
    "\n",
    "# Preprocess: normalize\n",
    "#scaler = preprocessing.StandardScaler().fit(X_train)\n",
    "#X_train = scaler.transform(X_train)\n",
    "#X_test = scaler.transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Define model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "types_of_kernels = ['linear', 'rbf', 'poly']\n",
    "\n",
    "kernel = types_of_kernels[0]\n",
    "gamma = 3.0\n",
    "\n",
    "# Create SVM model\n",
    "model = SVC(kernel=kernel, probability=True, gamma=gamma)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train and evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#This step will take some time \n",
    "# Train - This is not needed if you use K-Fold\n",
    "\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "predicted = model.predict(X_test)\n",
    "expected = y_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Accuracy\n",
    "metrics.accuracy_score(expected, predicted)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ok, we get around 82% of accuracy! (results depend on the splitting)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Null accuracy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can evaluate the accuracy if the model always predict the most frequent class, following this [reference](https://medium.com/analytics-vidhya/model-validation-for-classification-5ff4a0373090)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count number of samples per class\n",
    "s_y_test = Series(y_test)\n",
    "s_y_test.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mean of ones\n",
    "y_test.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Mean of zeros\n",
    "1 - y_test.mean() \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate null accuracy (binary classification coded as 0/1)\n",
    "max(y_test.mean(), 1 - y_test.mean())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate null accuracy (multiclass classification)\n",
    "s_y_test.value_counts().head(1) / len(y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, since our accuracy was 0.82 is better than the null accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confussion matrix and F-score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can obtain more information from the confussion matrix and the metric F1-score.\n",
    "In a confussion matrix, we can see:\n",
    "\n",
    "||**Predicted**: 0| **Predicted: 1**|\n",
    "|---------------------------|\n",
    "|**Actual: 0**| TN | FP |\n",
    "|**Actual: 1**| FN|TP|\n",
    "\n",
    "* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
    "* **False positives (FP)**: actual negatives that were predicted as positives\n",
    "* **False negatives (TN)**: actual positives that were predicted as negatives\n",
    "* **True negatives (TN)**: actual positives that were predicted as posiives\n",
    "\n",
    "We can calculate several metrics from the confussion matrix\n",
    "\n",
    "* **Recall** (also called *sensitivity*): when the actual value is positive, how often the prediction is correct? \n",
    "(TP / (TP + FN))\n",
    "* **Specificity**: when the actual value is negative, how often the prediction is correct? (TN / (TN + FP))\n",
    "* **False Positive Rate**: when the actual value is negative, how often the prediction is incorrect? (FP / (TN + FP))\n",
    "* **Precision**: when a positive value is predicted, how many times is correct? (TP / (TP + FP)\n",
    "A good metric is F1-score: 2TP / (2TP + FP + FN)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Confusion matrix\n",
    "print(metrics.confusion_matrix(expected, predicted))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Report\n",
    "print(classification_report(expected, predicted))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## ROC (Receiver Operating Characteristic ) and AUC (Area Under the Curve)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)  curve illustrates the performance of a binary classifier system as its discrimination threshold is varied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_prob = model.predict_proba(X_test)[:,1]\n",
    "fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)\n",
    "plt.plot(fpr, tpr)\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.0])\n",
    "plt.title('ROC curve for Titanic')\n",
    "plt.xlabel('False Positive Rate (1 - Recall)')\n",
    "plt.ylabel('True Positive Rate (Sensitivity)')\n",
    "plt.grid(True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Threshold used by the decision function, thresholds[0] is the number of \n",
    "thresholds"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Histogram of probability vs actual\n",
    "dprob = pd.DataFrame(data = {'probability':y_pred_prob, 'actual':y_test})\n",
    "dprob.probability.hist(by=dprob.actual, sharex=True, sharey=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "ROC curve helps to select a threshold to balance sensitivity and recall."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Function to evaluate thresholds of the ROC curve\n",
    "def evaluate_threshold(threshold):\n",
    "    print('Sensitivity:', tpr[thresholds > threshold][-1])\n",
    "    print('Recall:', 1 - fpr[thresholds > threshold][-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluate_threshold(0.74)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "evaluate_threshold(0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, the thresdhold to decide a class is 0.5, If we modify it, we should use the new thresdhold.\n",
    "\n",
    "threshold = 0.8\n",
    "\n",
    "predicted = model.predict_proba(X) > threshold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "AUC is the percentage of the ROC plot underneath the curve. Represents the likelihood that the predictor assigns  a higher predicted probability to the positive observation.  A simple rule  to evaluate a classifier based on this summary value is the following:\n",
    "* .90-1 = very good (A)\n",
    "* .80-.90 = good (B)\n",
    "* .70-.80 = not so good (C)\n",
    "* .60-.70 = poor (D)\n",
    "* .50-.60 = fail (F)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# AUX\n",
    "print(roc_auc_score(expected, predicted))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train and Evaluate with K-Fold"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is alternative to splitting the dataset into train and test. It will run k times slower than the other method, but it will be more accurate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This step will take some time\n",
    "# Cross-validationt\n",
    "cv = KFold(n_splits=5, shuffle=True, random_state=33)\n",
    "# StratifiedKFold has is a variation of k-fold which returns stratified folds:\n",
    "# each set contains approximately the same percentage of samples of each target class as the complete set.\n",
    "#cv = StratifiedKFold(y, n_folds=3, shuffle=True, random_state=33)\n",
    "scores = cross_val_score(model, X, y, cv=cv)\n",
    "print(\"Scores in every iteration\", scores)\n",
    "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get 78% of success with K-Fold, quite good!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The traning scores decreases with the number of samples. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_learning_curve(model, \"Learning curve with K-Fold\", X, y, cv=cv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train and Optimize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section we are going to provide an alternative version of the previous one with optimization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Tune parameters\n",
    "gammas = np.logspace(-6, -1, 10)\n",
    "gs = GridSearchCV(model, param_grid=dict(gamma=gammas))\n",
    "gs.fit(X_train, y_train)\n",
    "scores = gs.score(X_test, y_test)\n",
    "print(scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Refine model\n",
    "model = SVC(kernel='linear', gamma=gs.best_estimator_.gamma)\n",
    "plot_learning_curve(model, \"optimized with GridSearch\", X, y, cv=cv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualise"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot with standard configuration of SVM\n",
    "%run plot_svm\n",
    "plot_svm(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any value in the blue survived while anyone in the red did not. Checkout the graph for the linear transformation. It created its decision boundary right on 50%! "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
    "* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n",
    "* [How to choose the right metric for evaluating an ML model](https://www.kaggle.com/vipulgandhi/how-to-choose-right-metric-for-evaluating-ml-model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}