2016-03-15 12:55:14 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Model Tuning](#Model-Tuning)\n",
"* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
"* [Train classifier](#Train-classifier)\n",
"* [More about Pipelines](#More-about-Pipelines)\n",
"* [Tuning the algorithm](#Tuning-the-algorithm)\n",
"\t* [Grid Search for Parameter optimization](#Grid-Search-for-Parameter-optimization)\n",
"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
"\t* [K-Fold validation](#K-Fold-validation)\n",
"* [References](#References)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model Tuning"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous [notebook](2_5_2_Decision_Tree_Model.ipynb), we got an accuracy of 0.947. Could we get a better accuracy if we tune the parameters of the estimator?\n",
"\n",
"The goal of this notebook is to learn how to tune an algorithm by optimizing its parameters using grid search."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load data and preprocessing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# library for displaying plots\n",
"import matplotlib.pyplot as plt\n",
"# display plots in the notebook \n",
"%matplotlib inline\n",
"\n",
"## First, we repeat the load and preprocessing steps\n",
"\n",
"# Load data\n",
"from sklearn import datasets\n",
"iris = datasets.load_iris()\n",
"\n",
"# Training and test splitting\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"x_iris, y_iris = iris.data, iris.target\n",
"# Test set will be the 25% taken randomly\n",
"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
"\n",
"# Preprocess: normalize\n",
"from sklearn import preprocessing\n",
"scaler = preprocessing.StandardScaler().fit(x_train)\n",
"x_train = scaler.transform(x_train)\n",
"x_test = scaler.transform(x_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As before, we train the model and evaluate the result."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import cross_val_score, KFold\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"import numpy as np\n",
"\n",
"# create a composite estimator made by a pipeline of preprocessing and the decision tree model\n",
"model = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('ds', DecisionTreeClassifier())\n",
"])\n",
"\n",
"# Fit the model\n",
"model.fit(x_train, y_train) \n",
"\n",
"# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n",
"\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"\n",
"from scipy.stats import sem\n",
"def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
"print(mean_score(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We obtain an accuracy of 0.947."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More about Pipelines"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we use a Pipeline, every chained estimator is stored in the dictionary *named_steps* and as a list in *steps*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.named_steps"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can get the list of parameters of the model. As you will observe, the parameters of the estimators in the pipeline can be accessed using the `<estimator>__<parameter>` syntax. We will use this for tuning the parameters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.get_params().keys()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see what happens if we change a parameter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.set_params(ds__class_weight='balanced')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, we can create the pipeline with the desired values already set. Even so, it can be useful to access the estimators of the Pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('ds', DecisionTreeClassifier(class_weight='balanced'))\n",
"])\n",
"model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same approach can be used for accessing attributes such as *feature_importances_* we saw in the previous notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Fit the model\n",
"model.fit(x_train, y_train) \n",
"# Using named_steps\n",
"my_decision_tree = model.named_steps['ds']\n",
"print(my_decision_tree.feature_importances_)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using steps, we take the last step (-1) or the second step (1)\n",
"# name, my_decision_tree = model.steps[1]\n",
"name, my_decision_tree = model.steps[-1]\n",
"print(my_decision_tree.feature_importances_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tuning the algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the most important feature for this classifier is `petal width`.\n",
"\n",
"Look at the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) of *scikit-learn* to better understand the algorithm, as well as which parameters can be tuned. As you can see, we can change several of them, such as *criterion*, *splitter*, *max_features*, *max_depth*, *min_samples_split*, *class_weight*, etc.\n",
"\n",
"We can get the full list of parameters of an estimator with the method *get_params()*. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.get_params()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can try different values for these parameters and observe the results."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grid Search for Parameter optimization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Manually changing the parameters to find their optimal values is not practical. Instead, we can treat the search for the optimal parameter values as an *optimization problem*.\n",
"\n",
"scikit-learn comes with several optimization techniques for this purpose, such as **grid search** and **randomized search**. In this notebook we are going to introduce the former."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"scikit-learn provides an object, `GridSearchCV`, that, given data, fits an estimator for each combination in a parameter grid, computes its cross-validation score, and chooses the parameters that maximize that score. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"import numpy as np\n",
"\n",
"param_grid = {'max_depth': np.arange(3, 10)} \n",
"\n",
"gs = GridSearchCV(DecisionTreeClassifier(), param_grid)\n",
"\n",
"gs.fit(x_train, y_train)\n",
"\n",
"# summarize the results of the grid search\n",
"print(\"Best score: \", gs.best_score_)\n",
"print(\"Best params: \", gs.best_params_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to show the results of the grid search."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We print the score for each value of max_depth\n",
"for i, max_depth in enumerate(gs.cv_results_['params']):\n",
" print(\"%0.3f (+/-%0.03f) for %r\" % (gs.cv_results_['mean_test_score'][i],\n",
" gs.cv_results_['std_test_score'][i] * 2,\n",
" max_depth))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now evaluate the model with k-fold cross-validation, using this optimized parameter, as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a composite estimator made by a pipeline of preprocessing and the decision tree model\n",
"model = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('ds', DecisionTreeClassifier(max_depth=3))\n",
"])\n",
"\n",
"# Fit the model\n",
"model.fit(x_train, y_train) \n",
"\n",
"# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n",
"\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
"print(mean_score(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get an *improvement* from 0.947 to 0.953 with k-fold cross-validation.\n",
"\n",
"We will now try to find the best combination of parameters for the algorithm. This can take some time to compute."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set the parameters by cross-validation\n",
"\n",
"from sklearn.metrics import classification_report, recall_score, precision_score, make_scorer\n",
"\n",
"# set of parameters to test\n",
"tuned_parameters = [{'max_depth': np.arange(3, 10),\n",
"#                     'max_weights': [1, 10, 100, 1000]},\n",
"                     'criterion': ['gini', 'entropy'],\n",
"                     'splitter': ['best', 'random'],\n",
"                     # 'min_samples_leaf': [2, 5, 10],\n",
"                     'class_weight': ['balanced', None],\n",
"                     'max_leaf_nodes': [None, 5, 10, 20]\n",
"                    }]\n",
"\n",
"scores = ['precision', 'recall']\n",
"\n",
"for score in scores:\n",
"    print(\"# Tuning hyper-parameters for %s\" % score)\n",
"    print()\n",
"\n",
"    # weighted average of the per-class scores; zero_division=0 scores undefined cases as 0\n",
"    if score == 'precision':\n",
"        scorer = make_scorer(precision_score, average='weighted', zero_division=0)\n",
"    elif score == 'recall':\n",
"        scorer = make_scorer(recall_score, average='weighted', zero_division=0)\n",
"\n",
"    # cv = number of cross-validation folds (the default is 5)\n",
"    gs = GridSearchCV(DecisionTreeClassifier(), tuned_parameters, cv=10, scoring=scorer)\n",
"    gs.fit(x_train, y_train)\n",
"\n",
"    print(\"Best parameters set found on development set:\")\n",
"    print()\n",
"    print(gs.best_params_)\n",
"    print()\n",
"    print(\"Grid scores on development set:\")\n",
"    print()\n",
"    means = gs.cv_results_['mean_test_score']\n",
"    stds = gs.cv_results_['std_test_score']\n",
"\n",
"    # loop names chosen so they do not shadow the mean_score() helper defined earlier\n",
"    for mean, std, params in zip(means, stds, gs.cv_results_['params']):\n",
"        print(\"%0.3f (+/-%0.03f) for %r\" % (mean, std * 2, params))\n",
"    print()\n",
"\n",
"    print(\"Detailed classification report:\")\n",
"    print()\n",
"    print(\"The model is trained on the full development set.\")\n",
"    print(\"The scores are computed on the full evaluation set.\")\n",
"    print()\n",
"    y_true, y_pred = y_test, gs.predict(x_test)\n",
"    print(classification_report(y_true, y_pred))\n",
"    print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's evaluate the resulting tuning."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# create a composite estimator made by a pipeline of preprocessing and the decision tree model\n",
"model = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('ds', DecisionTreeClassifier(max_leaf_nodes=20, criterion='gini', \n",
" splitter='random', class_weight='balanced', max_depth=3))\n",
"])\n",
"\n",
"# Fit the model\n",
"model.fit(x_train, y_train) \n",
"\n",
"# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n",
"\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
"print(mean_score(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, we get an average accuracy of 0.96! This is better than 0.947 (without tuning) and 0.953 (tuning only *max_depth*)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n",
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Licence"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}