sitc/ml1/2_6_Model_Tuning.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [Model Tuning](#Model-Tuning)\n",
    "* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
    "* [Train classifier](#Train-classifier)\n",
    "* [More about Pipelines](#More-about-Pipelines)\n",
    "* [Tuning the algorithm](#Tuning-the-algorithm)\n",
    "\t* [Grid Search for Parameter optimization](#Grid-Search-for-Parameter-optimization)\n",
    "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
    "\t* [K-Fold validation](#K-Fold-validation)\n",
    "* [References](#References)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Model Tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 9.47. Could we get a better accuracy if we tune the parameters of the estimator?\n",
    "\n",
    "The goal of this notebook is to learn how to tune an algorithm by opimizing its parameters using grid search."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data and preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# library for displaying plots\n",
    "import matplotlib.pyplot as plt\n",
    "# display plots in the notebook \n",
    "%matplotlib inline\n",
    "\n",
    "## First, we repeat the load and preprocessing steps\n",
    "\n",
    "# Load data\n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# Training and test spliting\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "x_iris, y_iris = iris.data, iris.target\n",
    "# Test set will be the 25% taken randomly\n",
    "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
    "\n",
    "# Preprocess: normalize\n",
    "from sklearn import preprocessing\n",
    "scaler = preprocessing.StandardScaler().fit(x_train)\n",
    "x_train = scaler.transform(x_train)\n",
    "x_test = scaler.transform(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As previously, we train the model and evaluate the result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "import numpy as np\n",
    "\n",
    "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('ds', DecisionTreeClassifier())\n",
    "])\n",
    "\n",
    "# Fit the model\n",
    "model.fit(x_train, y_train) \n",
    "\n",
    "# create a k-fold cross validation iterator of k=10 folds\n",
    "cv = KFold(10, shuffle=True, random_state=33)\n",
    "\n",
    "# by default the score used is the one returned by score method of the estimator (accuracy)\n",
    "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
    "\n",
    "from scipy.stats import sem\n",
    "def mean_score(scores):\n",
    "    return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
    "print(mean_score(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We obtain an accuracy of 0.947."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## More about Pipelines"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we use a Pipeline, every chained estimator is stored in the dictionary *named_steps* and as a list in *steps*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.named_steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.steps"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get the list of parameters of the model. As you will observe, the parameters of the estimators in the pipeline can be accessed using the &lt;estimator&gt;__&lt;parameter&gt; syntax. We will use this for tuning the parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_params().keys()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's see what happens if we change a parameter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_params(ds__class_weight='balanced')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another alternative is to create the pipeline with the values we want to set, but it can be useful to access the estimators of the Pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('ds', DecisionTreeClassifier(class_weight='balanced'))\n",
    "])\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The same approach can be used for accessing attributes such as *feature_importances_* we saw in the previous notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Fit the model\n",
    "model.fit(x_train, y_train) \n",
    "# Using named_steps\n",
    "my_decision_tree = model.named_steps['ds']\n",
    "print(my_decision_tree.feature_importances_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Using steps, we take the last step (-1) or the second step (1)\n",
    "#name, my_desision_tree = model.steps[1]\n",
    "name, my_desision_tree = model.steps[-1]\n",
    "print(my_decision_tree.feature_importances_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tuning the algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the most important feature for this classifier is `petal width`.\n",
    "\n",
    "Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to understand better the algorithm, as well as which parameters can be tuned. As you see, we can change several ones, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n",
    "\n",
    "We can get the full list parameters of an estimator with the method *get_params()*. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.get_params()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can try different values for these parameters and observe the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Grid Search for Parameter optimization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Changing manually the parameters to find their optimal values is not practical. Instead, we can consider to find the optimal value of the parameters as an *optimization problem*. \n",
    "\n",
    "The sklearn comes with several optimization techniques for this purpose, such as  **grid search** and  **randomized search**. In this notebook we are going to introduce the former one."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sklearn provides an object that, given data, computes the score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.tree import DecisionTreeClassifier\n",
    "import numpy as np\n",
    "\n",
    "param_grid = {'max_depth': np.arange(3, 10)} \n",
    "\n",
    "gs = GridSearchCV(DecisionTreeClassifier(), param_grid)\n",
    "\n",
    "gs.fit(x_train, y_train)\n",
    "\n",
    "# summarize the results of the grid search\n",
    "print(\"Best score: \", gs.best_score_)\n",
    "print(\"Best params: \", gs.best_params_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to show the results of grid search"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We print the score for each value of max_depth\n",
    "for i, max_depth in enumerate(gs.cv_results_['params']):\n",
    "    print(\"%0.3f (+/-%0.03f) for %r\" % (gs.cv_results_['mean_test_score'][i],\n",
    "                                        gs.cv_results_['std_test_score'][i] * 2,\n",
    "                                        max_depth))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now evaluate the KFold with this optimized parameter as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('ds', DecisionTreeClassifier(max_depth=3))\n",
    "])\n",
    "\n",
    "# Fit the model\n",
    "model.fit(x_train, y_train) \n",
    "\n",
    "# create a k-fold cross validation iterator of k=10 folds\n",
    "cv = KFold(10, shuffle=True, random_state=33)\n",
    "\n",
    "# by default the score used is the one returned by score method of the estimator (accuracy)\n",
    "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
    "def mean_score(scores):\n",
    "    return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
    "print(mean_score(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have got an *improvement* from 0.947 to 0.953 with k-fold.\n",
    "\n",
    "We are now to try to fit the best combination of the parameters of the algorithm. It can take some time to compute it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the parameters by cross-validation\n",
    "\n",
    "from sklearn.metrics import classification_report\n",
    "\n",
    "# set of parameters to test\n",
    "tuned_parameters = [{'max_depth': np.arange(3, 10),\n",
    "#                     'max_weights': [1, 10, 100, 1000]},\n",
    "                     'criterion': ['gini', 'entropy'], \n",
    "                     'splitter': ['best', 'random'],\n",
    "                    # 'min_samples_leaf': [2, 5, 10],\n",
    "                     'class_weight':['balanced', None],\n",
    "                     'max_leaf_nodes': [None, 5, 10, 20]\n",
    "                    }]\n",
    "\n",
    "scores = ['precision', 'recall']\n",
    "\n",
    "for score in scores:\n",
    "    print(\"# Tuning hyper-parameters for %s\" % score)\n",
    "    print()\n",
    "\n",
    "    # cv = the fold of the cross-validation cv, defaulted to 5\n",
    "    gs = GridSearchCV(DecisionTreeClassifier(), tuned_parameters, cv=10, scoring='%s_weighted' % score)\n",
    "    gs.fit(x_train, y_train)\n",
    "\n",
    "    print(\"Best parameters set found on development set:\")\n",
    "    print()\n",
    "    print(gs.best_params_)\n",
    "    print()\n",
    "    print(\"Grid scores on development set:\")\n",
    "    print()\n",
    "    means = gs.cv_results_['mean_test_score']\n",
    "    stds = gs.cv_results_['std_test_score']\n",
    "\n",
    "    for mean_score, std_score, params in zip(means, stds, gs.cv_results_['params']):\n",
    "        print(\"%0.3f (+/-%0.03f) for %r\" % (mean_score, std_score * 2, params))\n",
    "    print()\n",
    "\n",
    "    print(\"Detailed classification report:\")\n",
    "    print()\n",
    "    print(\"The model is trained on the full development set.\")\n",
    "    print(\"The scores are computed on the full evaluation set.\")\n",
    "    print()\n",
    "    y_true, y_pred = y_test, gs.predict(x_test)\n",
    "    print(classification_report(y_true, y_pred))\n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's evaluate the resulting tuning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('ds', DecisionTreeClassifier(max_leaf_nodes=20, criterion='gini', \n",
    "                                      splitter='random', class_weight='balanced', max_depth=3))\n",
    "])\n",
    "\n",
    "# Fit the model\n",
    "model.fit(x_train, y_train) \n",
    "\n",
    "# create a k-fold cross validation iterator of k=10 folds\n",
    "cv = KFold(10, shuffle=True, random_state=33)\n",
    "\n",
    "# by default the score used is the one returned by score method of the estimator (accuracy)\n",
    "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
    "def mean_score(scores):\n",
    "    return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
    "print(mean_score(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we get an average accuracy of 0.96!! Better than 0.947 (without tuning) and 0.953 (tuning only *max_depth*)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n",
    "* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
    "* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
    "* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
    "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}