{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to SVM"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
"\n",
"We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load and clean"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# General import and load data\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from pandas import Series, DataFrame\n",
"\n",
"# Training and test splitting\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn import preprocessing\n",
"\n",
"# Estimators\n",
"from sklearn.svm import SVC\n",
"\n",
"# Evaluation\n",
"from sklearn import metrics\n",
"from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold\n",
"from sklearn.metrics import classification_report\n",
"from sklearn.metrics import roc_curve\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"# Optimization\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"# Visualisation\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"sns.set(color_codes=True)\n",
"\n",
"\n",
"# if matplotlib is not set inline, you will not see plots\n",
"#alternatives auto gtk gtk2 inline osx qt qt5 wx tk\n",
"#%matplotlib auto\n",
"#%matplotlib qt\n",
"%matplotlib inline\n",
"%run plot_learning_curve"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We use the URL of the raw CSV content (not the HTML page)\n",
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
"df = pd.read_csv(url)\n",
"\n",
"# Fill missing values (assign the result; fillna(inplace=True) on a selected column is unreliable in recent pandas)\n",
"df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
"df['Sex'] = df['Sex'].fillna('male')\n",
"df['Embarked'] = df['Embarked'].fillna('S')\n",
"\n",
"# Encode categorical variables\n",
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
"\n",
"# Drop columns\n",
"df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
"\n",
"# Show preprocessed df\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check types are numeric\n",
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We still have two columns of type object, so we convert them to integers."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['Sex'] = df['Sex'].astype(np.int64)\n",
"df['Embarked'] = df['Embarked'].astype(np.int64)\n",
"df.dtypes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check that there are no missing values\n",
"df.isnull().any()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train and test splitting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the same techniques we applied to the Iris dataset. \n",
"\n",
"Nevertheless, we need to separate the target column 'Survived' from the features."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Features of the model\n",
"features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']\n",
"# Transform dataframe in numpy arrays\n",
"X = df[features].values\n",
"y = df['Survived'].values\n",
"\n",
"\n",
"\n",
"# Test set will be the 25% taken randomly\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)\n",
"\n",
"# Preprocess: normalize\n",
"#scaler = preprocessing.StandardScaler().fit(X_train)\n",
"#X_train = scaler.transform(X_train)\n",
"#X_test = scaler.transform(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Define model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
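"# Available kernels\n",
"types_of_kernels = ['linear', 'rbf', 'poly']\n",
"\n",
"kernel = types_of_kernels[0]\n",
"# gamma only affects the 'rbf' and 'poly' kernels listed above; the linear kernel ignores it\n",
"gamma = 3.0\n",
"\n",
"# Create SVM model (probability=True enables predict_proba, needed later for the ROC curve)\n",
"model = SVC(kernel=kernel, probability=True, gamma=gamma)"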
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train and evaluate"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#This step will take some time \n",
"# Train - This is not needed if you use K-Fold\n",
"\n",
"model.fit(X_train, y_train)\n",
"\n",
"predicted = model.predict(X_test)\n",
"expected = y_test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Accuracy\n",
"metrics.accuracy_score(expected, predicted)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OK, we get around 82% accuracy! (The result depends on the split.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Null accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can evaluate the accuracy obtained if the model always predicts the most frequent class, following this [reference](http://blog.kaggle.com/2015/10/23/scikit-learn-video-9-better-evaluation-of-classification-models/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count number of samples per class\n",
"s_y_test = Series(y_test)\n",
"s_y_test.value_counts()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mean of ones\n",
"y_test.mean()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Mean of zeros\n",
"1 - y_test.mean() \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate null accuracy (binary classification coded as 0/1)\n",
"max(y_test.mean(), 1 - y_test.mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Calculate null accuracy (multiclass classification)\n",
"s_y_test.value_counts().head(1) / len(y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our accuracy of 0.82 is therefore better than the null accuracy."
]
},
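{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check, the same baseline can be obtained with scikit-learn's `DummyClassifier`. The cell below is a minimal sketch on made-up labels (`toy_X` and `toy_y` are hypothetical and are not the Titanic data); with the strategy `most_frequent` the classifier ignores the features and always predicts the majority class, so its accuracy is the null accuracy."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch with made-up labels (not the Titanic data)\n",
"import numpy as np\n",
"from sklearn.dummy import DummyClassifier\n",
"\n",
"toy_y = np.array([0, 0, 0, 1, 1, 0, 1, 0])  # 5 zeros and 3 ones\n",
"toy_X = np.zeros((len(toy_y), 1))           # features are ignored by the dummy model\n",
"\n",
"# Always predict the most frequent class seen during fit\n",
"dummy = DummyClassifier(strategy='most_frequent')\n",
"dummy.fit(toy_X, toy_y)\n",
"\n",
"# Accuracy of the majority-class baseline: 5 correct out of 8\n",
"null_accuracy = dummy.score(toy_X, toy_y)\n",
"print(null_accuracy)  # 0.625"
]
},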
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Confusion matrix and F-score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can obtain more information from the confusion matrix and the F1-score metric.\n",
"A confusion matrix has the following layout:\n",
"\n",
"| | **Predicted: 0** | **Predicted: 1** |\n",
"|---|---|---|\n",
"| **Actual: 0** | TN | FP |\n",
"| **Actual: 1** | FN | TP |\n",
"\n",
"* **True negatives (TN)**: actual negatives that were predicted as negatives\n",
"* **False positives (FP)**: actual negatives that were predicted as positives\n",
"* **False negatives (FN)**: actual positives that were predicted as negatives\n",
"* **True positives (TP)**: actual positives that were predicted as positives\n",
"\n",
"We can calculate several metrics from the confusion matrix:\n",
"\n",
"* **Recall** (also called *sensitivity*): when the actual value is positive, how often is the prediction correct? (TP / (TP + FN))\n",
"* **Specificity**: when the actual value is negative, how often is the prediction correct? (TN / (TN + FP))\n",
"* **False Positive Rate**: when the actual value is negative, how often is the prediction incorrect? (FP / (TN + FP))\n",
"* **Precision**: when a positive value is predicted, how often is the prediction correct? (TP / (TP + FP))\n",
"\n",
"A good summary metric is the F1-score, the harmonic mean of precision and recall: 2TP / (2TP + FP + FN)"
]
},
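{
"cell_type": "markdown",
"metadata": {},
"source": [
"The formulas above can be checked by hand. The following cell is a small sketch on made-up labels (`toy_true` and `toy_pred` are hypothetical, not the Titanic predictions): it unpacks the four cells of the confusion matrix and computes each metric directly, then compares the hand-computed F1-score with `sklearn.metrics.f1_score`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch with made-up labels (not the Titanic predictions)\n",
"import numpy as np\n",
"from sklearn.metrics import confusion_matrix, f1_score\n",
"\n",
"toy_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])\n",
"toy_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 0])\n",
"\n",
"# For binary 0/1 labels, ravel() returns the cells in the order TN, FP, FN, TP\n",
"tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()\n",
"\n",
"recall = tp / (tp + fn)\n",
"specificity = tn / (tn + fp)\n",
"false_positive_rate = fp / (tn + fp)\n",
"precision = tp / (tp + fp)\n",
"f1 = 2 * tp / (2 * tp + fp + fn)\n",
"\n",
"print('Recall:', recall)\n",
"print('Specificity:', specificity)\n",
"print('False positive rate:', false_positive_rate)\n",
"print('Precision:', precision)\n",
"print('F1-score:', f1)\n",
"\n",
"# The hand-computed F1-score agrees with scikit-learn\n",
"assert np.isclose(f1, f1_score(toy_true, toy_pred))"
]
},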
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Confusion matrix\n",
"print(metrics.confusion_matrix(expected, predicted))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Report\n",
"print(classification_report(expected, predicted))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [ROC](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve illustrates the performance of a binary classifier system as its discrimination threshold is varied."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_pred_prob = model.predict_proba(X_test)[:,1]\n",
"fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)\n",
"plt.plot(fpr, tpr)\n",
"plt.xlim([0.0, 1.0])\n",
"plt.ylim([0.0, 1.0])\n",
"plt.title('ROC curve for Titanic')\n",
"plt.xlabel('False Positive Rate (1 - Specificity)')\n",
"plt.ylabel('True Positive Rate (Sensitivity)')\n",
"plt.grid(True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Thresholds used to compute fpr and tpr; thresholds[0] corresponds to predicting no positives (it is set above every score)\n",
"thresholds"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Histogram of probability vs actual\n",
"dprob = pd.DataFrame(data = {'probability':y_pred_prob, 'actual':y_test})\n",
"dprob.probability.hist(by=dprob.actual, sharex=True, sharey=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ROC curve helps to select a threshold that balances sensitivity and specificity."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
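"# Function to evaluate thresholds of the ROC curve\n",
"def evaluate_threshold(threshold):\n",
"    # Last ROC point whose threshold is still above the requested one\n",
"    print('Sensitivity:', tpr[thresholds > threshold][-1])\n",
"    # Specificity is 1 - false positive rate\n",
"    print('Specificity:', 1 - fpr[thresholds > threshold][-1])"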
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluate_threshold(0.74)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"evaluate_threshold(0.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold on the probability of the positive class:\n",
"\n",
"`threshold = 0.8`\n",
"\n",
"`predicted = model.predict_proba(X_test)[:, 1] > threshold`"
]
},
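{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the effect of moving the threshold, the cell below is a minimal sketch on made-up probabilities (`toy_proba` is hypothetical and stands in for the positive-class column of `predict_proba`). A higher threshold labels fewer samples as positive. Note that for `SVC` the method `predict` relies on the decision function, so its output may not coincide exactly with thresholding `predict_proba` at 0.5."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch with made-up probabilities (not the model output)\n",
"import numpy as np\n",
"\n",
"# Stand-in for model.predict_proba(X_test)[:, 1]\n",
"toy_proba = np.array([0.10, 0.55, 0.79, 0.81, 0.95, 0.40])\n",
"\n",
"threshold = 0.8\n",
"pred_default = (toy_proba > 0.5).astype(int)       # usual 0.5 cut-off\n",
"pred_strict = (toy_proba > threshold).astype(int)  # stricter cut-off\n",
"\n",
"print(pred_default)  # [0 1 1 1 1 0]\n",
"print(pred_strict)   # [0 0 0 1 1 0]"
]
},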
{
"cell_type": "markdown",
"metadata": {},
"source": [
"AUC is the fraction of the ROC plot that lies underneath the curve. It represents the probability that the classifier assigns a higher predicted probability to a randomly chosen positive observation than to a randomly chosen negative one. A simple rule to evaluate a classifier based on this summary value is the following:\n",
"* .90-1 = very good (A)\n",
"* .80-.90 = good (B)\n",
"* .70-.80 = not so good (C)\n",
"* .60-.70 = poor (D)\n",
"* .50-.60 = fail (F)"
]
},
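{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ranking interpretation of AUC can be verified numerically. The cell below is a small sketch on made-up data (`toy_true` and `toy_score` are hypothetical): it compares every positive sample with every negative sample, counts how often the positive one receives the higher score (ties count as one half), and checks that this fraction matches `roc_auc_score`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch with made-up scores (not the Titanic predictions)\n",
"import numpy as np\n",
"from sklearn.metrics import roc_auc_score\n",
"\n",
"toy_true = np.array([0, 0, 1, 1, 0, 1])\n",
"toy_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])\n",
"\n",
"pos = toy_score[toy_true == 1]\n",
"neg = toy_score[toy_true == 0]\n",
"\n",
"# Score difference for every (positive, negative) pair\n",
"diff = pos[:, None] - neg[None, :]\n",
"\n",
"# Fraction of pairs ranked correctly; ties count as one half\n",
"pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size\n",
"\n",
"print(pairwise_auc)\n",
"assert np.isclose(pairwise_auc, roc_auc_score(toy_true, toy_score))"
]
},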
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# AUC, computed from the predicted probabilities rather than from the hard 0/1 predictions\n",
"print(roc_auc_score(expected, y_pred_prob))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train and Evaluate with K-Fold"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is an alternative to splitting the dataset into a single train and test set. It runs roughly k times slower than the other method, but it gives a more reliable estimate of the model's performance."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This step will take some time\n",
"# Cross-validation\n",
"cv = KFold(n_splits=5, shuffle=True, random_state=33)\n",
"# StratifiedKFold is a variation of k-fold which returns stratified folds:\n",
"# each set contains approximately the same percentage of samples of each target class as the complete set.\n",
"#cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=33)\n",
"scores = cross_val_score(model, X, y, cv=cv)\n",
"print(\"Scores in every iteration\", scores)\n",
"print(\"Accuracy: %0.2f (+/- %0.2f)\" % (scores.mean(), scores.std() * 2))"
]
},
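{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between `KFold` and `StratifiedKFold` is easiest to see on a small imbalanced example. The cell below is a sketch on made-up labels (`toy_X`, `toy_y` and `splitter` are hypothetical names): it counts the positive samples that land in each test fold. With stratification every test fold receives the same share of positives, whereas plain `KFold` gives no such guarantee."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch with made-up labels (not the Titanic data)\n",
"import numpy as np\n",
"from sklearn.model_selection import KFold, StratifiedKFold\n",
"\n",
"toy_y = np.array([0] * 8 + [1] * 4)  # imbalanced: 8 negatives, 4 positives\n",
"toy_X = np.zeros((len(toy_y), 1))    # features are irrelevant for the split\n",
"\n",
"splitters = [\n",
"    ('KFold', KFold(n_splits=4, shuffle=True, random_state=33)),\n",
"    ('StratifiedKFold', StratifiedKFold(n_splits=4, shuffle=True, random_state=33)),\n",
"]\n",
"\n",
"for name, splitter in splitters:\n",
"    # Number of positive samples in each test fold\n",
"    positives = [int(toy_y[test_idx].sum()) for _, test_idx in splitter.split(toy_X, toy_y)]\n",
"    print(name, 'positives per test fold:', positives)"
]
},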
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get around 78% accuracy with K-Fold, quite good!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training score decreases with the number of samples. The cross-validation score reaches the training score at the end. It seems we will not get a better result with more samples."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plot_learning_curve(model, \"Learning curve with K-Fold\", X, y, cv=cv)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train and Optimize"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this section we provide an alternative version of the previous one, adding hyperparameter optimization."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
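"# Tune gamma. The linear kernel ignores gamma, so we tune an RBF-kernel SVC here\n",
"gammas = np.logspace(-6, -1, 10)\n",
"gs = GridSearchCV(SVC(kernel='rbf'), param_grid=dict(gamma=gammas))\n",
"gs.fit(X_train, y_train)\n",
"print(\"Best gamma:\", gs.best_params_['gamma'])\n",
"# Accuracy of the best estimator on the held-out test set\n",
"scores = gs.score(X_test, y_test)\n",
"print(scores)"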
]
},
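{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grid search can also cover several parameters at once and include preprocessing. The cell below is a sketch under stated assumptions, not the method used in this notebook: it runs on synthetic data from `make_classification`, and the pipeline, the parameter ranges and the names `pipe`, `search`, `Xtr`, `Xte`, `ytr`, `yte` are all hypothetical. Putting `StandardScaler` inside the pipeline means the scaler is fitted only on the training part of each cross-validation fold."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch on synthetic data (not the Titanic data)\n",
"import numpy as np\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.model_selection import GridSearchCV, train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.svm import SVC\n",
"\n",
"toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=33)\n",
"Xtr, Xte, ytr, yte = train_test_split(toy_X, toy_y, test_size=0.25, random_state=33)\n",
"\n",
"# make_pipeline names each step after its lowercased class name, hence the 'svc__' prefix\n",
"pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))\n",
"param_grid = {\n",
"    'svc__gamma': np.logspace(-3, 1, 5),\n",
"    'svc__C': [0.1, 1, 10],\n",
"}\n",
"\n",
"search = GridSearchCV(pipe, param_grid=param_grid, cv=5)\n",
"search.fit(Xtr, ytr)\n",
"\n",
"print('Best parameters:', search.best_params_)\n",
"print('Best cross-validation accuracy:', search.best_score_)\n",
"print('Test accuracy:', search.score(Xte, yte))"
]
},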
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Refine model: reuse the kernel and gamma selected by the grid search\n",
"model = SVC(kernel=gs.best_estimator_.kernel, gamma=gs.best_estimator_.gamma)\n",
"plot_learning_curve(model, \"Optimized with GridSearch\", X, y, cv=cv)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualise"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Plot with standard configuration of SVM\n",
"%run plot_svm\n",
"plot_svm(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Points in the blue region are classified as survivors, while points in the red region are classified as non-survivors. Check out the graph for the linear kernel: it places its decision boundary right at 50%!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Titanic Machine Learning from Disaster](https://www.kaggle.com/c/titanic/forums/t/5105/ipython-notebook-tutorial-for-titanic-machine-learning-from-disaster)\n",
"* [API SVC scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)\n",
"* [How to choose the right metric for evaluating an ML model](https://www.kaggle.com/vipulgandhi/how-to-choose-right-metric-for-evaluating-ml-model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}