# Introduction to SVM

In this notebook we are going to train a classifier with the preprocessed Titanic dataset. 

We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load and clean" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# General import and load data\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from pandas import Series, DataFrame\n", "\n", "# Training and test splitting\n", "from sklearn.model_selection import train_test_split\n", "from sklearn import preprocessing\n", "\n", "# Estimators\n", "from sklearn.svm import SVC\n", "\n", "# Evaluation\n", "from sklearn import metrics\n", "from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold\n", "from sklearn.metrics import classification_report\n", "from sklearn.metrics import roc_curve\n", "from sklearn.metrics import roc_auc_score\n", "\n", "# Optimization\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "# Visualisation\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "sns.set(color_codes=True)\n", "\n", "\n", "# if matplotlib is not set inline, you will not see plots\n", "#alternatives auto gtk gtk2 inline osx qt qt5 wx tk\n", "#%matplotlib auto\n", "#%matplotlib qt\n", "%matplotlib inline\n", "%run plot_learning_curve" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use the URL of the raw CSV content (not the HTML page)\n", "url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n", "df = pd.read_csv(url)\n", "df.head()\n", "\n", "\n", "# Fill missing values (assign back instead of chained inplace=True)\n", "df['Age'] = df['Age'].fillna(df['Age'].mean())\n", "df['Sex'] = df['Sex'].fillna('male')\n", "df['Embarked'] = df['Embarked'].fillna('S')\n", "\n", "# Encode categorical variables\n", "df.loc[df[\"Sex\"] == 
\"male\", \"Sex\"] = 0\n", "df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n", "df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n", "df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n", "df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n", "\n", "# Drop columns\n", "df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n", "\n", "# Show preprocessed df\n", "df.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check types are numeric\n", "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We still have two columns of type object, so we convert them to integers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df['Sex'] = df['Sex'].astype(np.int64)\n", "df['Embarked'] = df['Embarked'].astype(np.int64)\n", "df.dtypes" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check there are no missing values\n", "df.isnull().any()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train and test splitting" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the same techniques we applied in the Iris dataset. 
\n", "\n", "Nevertheless, we need to separate the target column 'Survived' from the features." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Features of the model\n", "features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']\n", "# Transform the dataframe into numpy arrays\n", "X = df[features].values\n", "y = df['Survived'].values\n", "\n", "\n", "\n", "# The test set is a random 25% of the samples\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)\n", "\n", "# Preprocess: normalize\n", "#scaler = preprocessing.StandardScaler().fit(X_train)\n", "#X_train = scaler.transform(X_train)\n", "#X_test = scaler.transform(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Define model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "types_of_kernels = ['linear', 'rbf', 'poly']\n", "\n", "kernel = types_of_kernels[0]\n", "gamma = 3.0\n", "\n", "# Create SVM model (gamma is ignored by the linear kernel)\n", "model = SVC(kernel=kernel, probability=True, gamma=gamma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train and evaluate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This step will take some time\n", "# Train - This is not needed if you use K-Fold\n", "\n", "model.fit(X_train, y_train)\n", "\n", "predicted = model.predict(X_test)\n", "expected = y_test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Accuracy\n", "metrics.accuracy_score(expected, predicted)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, we get an accuracy of around 82%! 
(results depend on the splitting)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Null accuracy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can evaluate the accuracy a model would obtain if it always predicted the most frequent class (the *null accuracy*), following this [reference](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Count number of samples per class\n", "s_y_test = Series(y_test)\n", "s_y_test.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Proportion of ones\n", "y_test.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Proportion of zeros\n", "1 - y_test.mean() \n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate null accuracy (binary classification coded as 0/1)\n", "max(y_test.mean(), 1 - y_test.mean())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate null accuracy (multiclass classification)\n", "s_y_test.value_counts().head(1) / len(y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our accuracy of 0.82 is better than the null accuracy." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Confusion matrix and F-score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can obtain more information from the confusion matrix and the F1-score metric.\n", "In a confusion matrix, we can see:\n", "\n", "| |**Predicted: 0**|**Predicted: 1**|\n", "|---|---|---|\n", "|**Actual: 0**| TN | FP |\n", "|**Actual: 1**| FN | TP |\n", "\n", "* **True negatives (TN)**: actual negatives that were predicted as negatives\n", "* **False positives (FP)**: actual negatives that were predicted as positives\n", "* **False negatives (FN)**: actual positives that were predicted as negatives\n", "* **True positives (TP)**: actual positives that were predicted as positives\n", "\n", "We can calculate several metrics from the confusion matrix:\n", "\n", "* **Recall** (also called *sensitivity*): when the actual value is positive, how often is the prediction correct? \n", "(TP / (TP + FN))\n", "* **Specificity**: when the actual value is negative, how often is the prediction correct? (TN / (TN + FP))\n", "* **False Positive Rate**: when the actual value is negative, how often is the prediction incorrect? (FP / (TN + FP))\n", "* **Precision**: when a positive value is predicted, how often is the prediction correct? 
(TP / (TP + FP))\n", "\n", "A good summary metric is the F1-score, the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Confusion matrix\n", "print(metrics.confusion_matrix(expected, predicted))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Report\n", "print(classification_report(expected, predicted))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve illustrates the performance of a binary classifier system as its discrimination threshold is varied." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y_pred_prob = model.predict_proba(X_test)[:,1]\n", "fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)\n", "plt.plot(fpr, tpr)\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.0])\n", "plt.title('ROC curve for Titanic')\n", "plt.xlabel('False Positive Rate (1 - Specificity)')\n", "plt.ylabel('True Positive Rate (Sensitivity)')\n", "plt.grid(True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Threshold used by the decision function, thresholds[0] is the number of \n", "thresholds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Histogram of predicted probability, split by actual class\n", "dprob = pd.DataFrame(data = {'probability':y_pred_prob, 'actual':y_test})\n", "dprob.probability.hist(by=dprob.actual, sharex=True, sharey=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ROC curve helps to select a threshold that balances sensitivity and specificity." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Function to evaluate thresholds of the ROC curve\n", "def evaluate_threshold(threshold):\n", "    print('Sensitivity:', tpr[thresholds > threshold][-1])\n", "    print('Specificity:', 1 - fpr[thresholds > threshold][-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluate_threshold(0.74)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "evaluate_threshold(0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the threshold used to decide the class is 0.5. If we change it, we have to apply the new threshold to the predicted probabilities of the positive class ourselves.\n", "\n", "threshold = 0.8\n", "\n", "predicted = model.predict_proba(X)[:, 1] > threshold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "AUC is the fraction of the ROC plot that lies underneath the curve. It represents the likelihood that the classifier assigns a higher predicted probability to a randomly chosen positive observation than to a randomly chosen negative one. A simple rule to evaluate a classifier based on this summary value is the following:\n", "* .90-1 = very good (A)\n", "* .80-.90 = good (B)\n", "* .70-.80 = not so good (C)\n", "* .60-.70 = poor (D)\n", "* .50-.60 = fail (F)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# AUC, computed from the predicted probabilities rather than the hard labels\n", "print(roc_auc_score(expected, y_pred_prob))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train and Evaluate with K-Fold" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is an alternative to splitting the dataset into train and test sets. It runs about k times slower than a single split, but the performance estimate is more reliable." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This step will take some time\n", "# Cross-validation\n", "cv = KFold(n_splits=5, shuffle=True, random_state=33)\n", "# StratifiedKFold is a variation of k-fold which returns stratified folds:\n", "# each set contains approximately the same percentage of samples of each target class as the complete set.\n", "#cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=33)\n", "scores = cross_val_score(model, X, y, cv=cv)\n", "print(\"Scores in every iteration\", scores)\n", "print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get around 78% accuracy with K-Fold, quite good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training score decreases with the number of samples. The cross-validation score reaches the training score at the end. It seems we will not get a better result with more samples." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_learning_curve(model, \"Learning curve with K-Fold\", X, y, cv=cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Train and Optimize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section we provide an alternative version of the previous one, adding hyperparameter optimization." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Tune parameters\n", "gammas = np.logspace(-6, -1, 10)\n", "gs = GridSearchCV(model, param_grid=dict(gamma=gammas))\n", "gs.fit(X_train, y_train)\n", "scores = gs.score(X_test, y_test)\n", "print(scores)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Refine model\n", "model = SVC(kernel='linear', gamma=gs.best_estimator_.gamma)\n", "plot_learning_curve(model, \"optimized with GridSearch\", X, y, cv=cv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Visualise" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot with standard configuration of SVM\n", "%run plot_svm\n", "plot_svm(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Anyone in the blue region survived, while anyone in the red region did not. 