sitc/ml1/2_5_2_Decision_Tree_Model.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Decision Tree Learning](#Decision-Tree-Learning)\n",
    "* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
    "* [Train classifier](#Train-classifier)\n",
    "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
    "\t* [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",
    "\t* [Confusion matrix](#Confusion-matrix)\n",
    "\t* [K-Fold cross validation](#K-Fold-cross-validation)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Decision Tree Learning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
    "\n",
    "There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
    "\n",
    "This notebook will follow the same steps that the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n",
    "\n",
    "You need to install pydotplus: `conda install pydotplus` for the visualization."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load data and preprocessing\n",
    "\n",
    "Here we repeat the same operations for loading data and preprocessing than in the previous notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# library for displaying plots\n",
    "import matplotlib.pyplot as plt\n",
    "# display plots in the notebook \n",
    "%matplotlib inline\n",
    "\n",
    "## First, we repeat the load and preprocessing steps\n",
    "\n",
    "# Load data\n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# Training and test spliting\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "x_iris, y_iris = iris.data, iris.target\n",
    "# Test set will be the 25% taken randomly\n",
    "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
    "\n",
    "# Preprocess: normalize\n",
    "from sklearn import preprocessing\n",
    "scaler = preprocessing.StandardScaler().fit(x_train)\n",
    "x_train = scaler.transform(x_train)\n",
    "x_test = scaler.transform(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The usual steps for creating a classifier are:\n",
    "1. Create classifier object\n",
    "2. Call *fit* to train the classifier\n",
    "3. Call *predict* to obtain predictions\n",
    "\n",
    "*DecisionTreeClassifier* is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,\n",
       "            max_features=None, max_leaf_nodes=None,\n",
       "            min_impurity_decrease=0.0, min_impurity_split=None,\n",
       "            min_samples_leaf=1, min_samples_split=2,\n",
       "            min_weight_fraction_leaf=0.0, presort=False, random_state=1,\n",
       "            splitter='best')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "import numpy as np\n",
    "\n",
    "from sklearn import tree\n",
    "\n",
    "max_depth=3\n",
    "random_state=1\n",
    "\n",
    "# Create decision tree model\n",
    "model = tree.DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)\n",
    "\n",
    "# Train the model using the training sets\n",
    "model.fit(x_train, y_train) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prediction  [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
      " 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
      " 0 0 0 0 2 2 0 1 1 2 1 0 0 2 1 1 0 1 1 0 2 1 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
      " 0]\n",
      "Expected  [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",
      " 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",
      " 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 2 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",
      " 0]\n"
     ]
    }
   ],
   "source": [
    "print(\"Prediction \", model.predict(x_train))\n",
    "print(\"Expected \", y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class in a leaf:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Predicted probabilities [[0.         0.97368421 0.02631579]\n",
      " [1.         0.         0.        ]\n",
      " [0.         0.97368421 0.02631579]\n",
      " [0.         0.97368421 0.02631579]\n",
      " [0.         0.97368421 0.02631579]\n",
      " [1.         0.         0.        ]\n",
      " [1.         0.         0.        ]\n",
      " [0.         0.97368421 0.02631579]\n",
      " [1.         0.         0.        ]\n",
      " [0.         0.         1.        ]]\n"
     ]
    }
   ],
   "source": [
    "# Print the \n",
    "print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))"
   ]
  },
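  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check of the statement that the predicted probabilities are leaf class fractions, the following sketch recomputes the fractions of one leaf by hand. It reuses `model`, `x_train`, `y_train` and `iris` from the cells above and assumes that no sample weights were used when fitting; `apply` returns the index of the leaf that each sample falls into."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Index of the leaf in which each training sample ends up\n",
    "leaf_ids = model.apply(x_train)\n",
    "\n",
    "# Training samples that share a leaf with the first training sample\n",
    "same_leaf = leaf_ids == leaf_ids[0]\n",
    "\n",
    "# Fraction of each class among those samples\n",
    "counts = np.bincount(y_train[same_leaf], minlength=len(iris.target_names))\n",
    "fractions = counts / counts.sum()\n",
    "\n",
    "print(\"Fractions computed by hand:\", fractions)\n",
    "print(\"predict_proba:\", model.predict_proba(x_train[:1])[0])"
   ]
  },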
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy in training 0.9821428571428571\n"
     ]
    }
   ],
   "source": [
    "# Evaluate Accuracy in training\n",
    "\n",
    "from sklearn import metrics\n",
    "y_train_pred = model.predict(x_train)\n",
    "print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy in testing  0.9210526315789473\n"
     ]
    }
   ],
   "source": [
    "# Now we evaluate error in testing\n",
    "y_test_pred = model.predict(x_test)\n",
    "print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to visualize the DecisionTree classification. It will plot the decision boundaries for each class.\n",
    "\n",
    "The current version of pydot does not work well in Python 3.\n",
    "For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n",
    "\n",
    "You can skip this example. Since it can require installing additional packages, we include here the result.\n",
    "![Decision Tree](files/images/cart.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "ModuleNotFoundError",
     "evalue": "No module named 'pydotplus'",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-7-1bf5ec7fb043>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mIPython\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdisplay\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mImage\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexternals\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msix\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mStringIO\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mpydotplus\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mpydot\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m      4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m      5\u001b[0m \u001b[0mdot_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStringIO\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pydotplus'"
     ]
    }
   ],
   "source": [
    "from IPython.display import Image \n",
    "from sklearn.externals.six import StringIO\n",
    "import pydotplus as pydot\n",
    "\n",
    "dot_data = StringIO()  \n",
    "tree.export_graphviz(model, out_file=dot_data,  \n",
    "                         feature_names=iris.feature_names,  \n",
    "                         class_names=iris.target_names,  \n",
    "                         filled=True, rounded=True,  \n",
    "                         special_characters=True)  \n",
    "\n",
    "\n",
    "graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",
    "graph.write_png('iris-tree.png')\n",
    "Image(graph.create_png())  "
   ]
  },
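  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If pydotplus or graphviz are not available, a possible alternative is to draw the tree with matplotlib or to print it as text. This is only a sketch: it assumes scikit-learn 0.21 or later, where `tree.plot_tree` and `tree.export_text` were introduced, and it reuses `model` and `iris` from the cells above. Note that the thresholds shown refer to the standardized features, since the model was trained on scaled data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "from sklearn import tree\n",
    "\n",
    "# Draw the fitted tree with matplotlib (no graphviz needed)\n",
    "plt.figure(figsize=(12, 6))\n",
    "tree.plot_tree(model, feature_names=list(iris.feature_names),\n",
    "               class_names=list(iris.target_names), filled=True, rounded=True)\n",
    "plt.show()\n",
    "\n",
    "# Print the same tree as indented text rules\n",
    "print(tree.export_text(model, feature_names=list(iris.feature_names)))"
   ]
  },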
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we show a graph of the decision tree boundaries. For each pair of iris features, the decision tree learns decision boundaries made of combinations of simple thresholding rules inferred from the training samples.\n",
    "\n",
    "We are going to import a function defined in the file [util_ds.py](files/util_ds.py) using the *magic command* **%run**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run util_ds\n",
    "\n",
    "# display plots in the notebook \n",
    "%matplotlib inline\n",
    "plot_tree_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next we are going to export the pseudocode of the the learnt decision tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run util_ds\n",
    "get_code(model, iris.feature_names, iris.target_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also obtain the feature importance of the fitted model as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(iris.feature_names)\n",
    "print(model.feature_importances_)"
   ]
  },
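  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read the two printed lists together, the importances can be paired with the feature names and sorted. A minimal sketch, reusing `model` and `iris` from the cells above:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pair each feature name with its importance and sort from most to least important\n",
    "ranked = sorted(zip(iris.feature_names, model.feature_importances_),\n",
    "                key=lambda pair: pair[1], reverse=True)\n",
    "for name, importance in ranked:\n",
    "    print(\"{0:<20} {1:.3f}\".format(name, importance))"
   ]
  },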
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the most important feature for this classifier is `petal width`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Precision, recall and f-score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",
    "\n",
    "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
    "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
    "* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(metrics.classification_report(y_test, y_test_pred,target_names=iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confusion matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another useful metric is the confusion matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(metrics.confusion_matrix(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see we classify well all the 'setosa' and 'versicolor' samples. "
   ]
  },
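  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The confusion matrix also contains everything needed to compute precision, recall and F1-score per class. The sketch below is only illustrative: it reuses `y_test`, `y_test_pred` and `iris` from the cells above and assumes that every class is predicted at least once, so that no division by zero occurs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn import metrics\n",
    "\n",
    "# Rows are true classes, columns are predicted classes\n",
    "cm = metrics.confusion_matrix(y_test, y_test_pred)\n",
    "\n",
    "true_positives = np.diag(cm)\n",
    "precision = true_positives / cm.sum(axis=0)  # correct among samples predicted as each class\n",
    "recall = true_positives / cm.sum(axis=1)     # correct among actual members of each class\n",
    "f1 = 2 * precision * recall / (precision + recall)\n",
    "\n",
    "for name, p, r, f in zip(iris.target_names, precision, recall, f1):\n",
    "    print(\"{0:<12} precision={1:.2f} recall={2:.2f} f1={3:.2f}\".format(name, p, r, f))"
   ]
  },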
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Fold cross validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**.\n",
    "\n",
    "Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('DecisionTree', DecisionTreeClassifier())\n",
    "])\n",
    "\n",
    "# create a k-fold cross validation iterator of k=10 folds\n",
    "cv = KFold(10, shuffle=True, random_state=33)\n",
    "\n",
    "# by default the score used is the one returned by score method of the estimator (accuracy)\n",
    "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
    "print(scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import sem\n",
    "def mean_score(scores):\n",
    "    return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
    "print(mean_score(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we get an average accuracy of 0.947."
   ]
  },
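  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One of the other strategies mentioned above, stratified k-fold, keeps the class proportions roughly equal in every fold. A minimal sketch, reusing the pipeline `model`, the data and the `mean_score` helper from the cells above; the number of folds and the seed simply mirror the previous example and are not tuned values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import StratifiedKFold, cross_val_score\n",
    "\n",
    "# Same number of folds and seed as before, but preserving class proportions in each fold\n",
    "stratified_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=33)\n",
    "\n",
    "stratified_scores = cross_val_score(model, x_iris, y_iris, cv=stratified_cv)\n",
    "print(stratified_scores)\n",
    "print(mean_score(stratified_scores))"
   ]
  },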
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n",
    "* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
    "* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
    "* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
    "* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
Add ml1 2016-03-15 12:55:14 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![](files/images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
added update -all 2019-02-28 11:26:33 +00:00			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Table of Contents\n",`
			`"* [Decision Tree Learning](#Decision-Tree-Learning)\n",`
			`"* [Load data and preprocessing](#Load-data-and-preprocessing)\n",`
			`"* [Train classifier](#Train-classifier)\n",`
			`"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",`
			`"\t* [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",`
			`"\t* [Confusion matrix](#Confusion-matrix)\n",`
			`"\t* [K-Fold cross validation](#K-Fold-cross-validation)\n",`
			`"* [References](#References)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Decision Tree Learning"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
Fix visualization section 2017-12-11 17:12:06 +00:00			`"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"There are a number of well known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. The scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",`
			`"\n",`
added update -all 2019-02-28 11:26:33 +00:00			`"This notebook will follow the same steps that the previous notebook for learning using the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of the decision tree algorithms.\n",`
			`"\n",`
			"You need to install pydotplus: `conda install pydotplus` for the visualization."
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Load data and preprocessing\n",`
			`"\n",`
			`"Here we repeat the same operations for loading data and preprocessing than in the previous notebooks."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 1,`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# library for displaying plots\n",`
			`"import matplotlib.pyplot as plt\n",`
			`"# display plots in the notebook \n",`
			`"%matplotlib inline\n",`
			`"\n",`
			`"## First, we repeat the load and preprocessing steps\n",`
			`"\n",`
			`"# Load data\n",`
			`"from sklearn import datasets\n",`
			`"iris = datasets.load_iris()\n",`
			`"\n",`
			`"# Training and test spliting\n",`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"from sklearn.model_selection import train_test_split\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"x_iris, y_iris = iris.data, iris.target\n",`
			`"# Test set will be the 25% taken randomly\n",`
			`"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",`
			`"\n",`
			`"# Preprocess: normalize\n",`
			`"from sklearn import preprocessing\n",`
			`"scaler = preprocessing.StandardScaler().fit(x_train)\n",`
			`"x_train = scaler.transform(x_train)\n",`
			`"x_test = scaler.transform(x_test)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"metadata": {},`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"## Train classifier"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The usual steps for creating a classifier are:\n",`
			`"1. Create classifier object\n",`
			`"2. Call fit to train the classifier\n",`
			`"3. Call predict to obtain predictions\n",`
			`"\n",`
			`"DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 2,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,\n",`
			`" max_features=None, max_leaf_nodes=None,\n",`
			`" min_impurity_decrease=0.0, min_impurity_split=None,\n",`
			`" min_samples_leaf=1, min_samples_split=2,\n",`
			`" min_weight_fraction_leaf=0.0, presort=False, random_state=1,\n",`
			`" splitter='best')"`
			`]`
			`},`
			`"execution_count": 2,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"from sklearn.tree import DecisionTreeClassifier\n",`
			`"import numpy as np\n",`
			`"\n",`
			`"from sklearn import tree\n",`
			`"\n",`
			`"max_depth=3\n",`
			`"random_state=1\n",`
			`"\n",`
			`"# Create decision tree model\n",`
			`"model = tree.DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)\n",`
			`"\n",`
			`"# Train the model using the training sets\n",`
			`"model.fit(x_train, y_train) "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 3,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Prediction [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",`
			`" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",`
			`" 0 0 0 0 2 2 0 1 1 2 1 0 0 2 1 1 0 1 1 0 2 1 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",`
			`" 0]\n",`
			`"Expected [1 0 1 1 1 0 0 1 0 2 0 0 1 2 0 1 2 2 1 1 0 0 2 0 0 2 1 1 2 2 2 2 0 0 1 1 0\n",`
			`" 1 2 1 2 0 2 0 1 0 2 1 0 2 2 0 0 2 0 0 0 2 2 0 1 0 1 0 1 1 1 1 1 0 1 0 1 2\n",`
			`" 0 0 0 0 2 2 0 1 1 2 1 0 0 1 1 1 0 1 1 0 2 2 2 1 2 0 1 0 0 0 2 1 2 1 2 1 2\n",`
			`" 0]\n"`
			`]`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"print(\"Prediction \", model.predict(x_train))\n",`
			`"print(\"Expected \", y_train)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class in a leaf:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 4,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Predicted probabilities [[0. 0.97368421 0.02631579]\n",`
			`" [1. 0. 0. ]\n",`
			`" [0. 0.97368421 0.02631579]\n",`
			`" [0. 0.97368421 0.02631579]\n",`
			`" [0. 0.97368421 0.02631579]\n",`
			`" [1. 0. 0. ]\n",`
			`" [1. 0. 0. ]\n",`
			`" [0. 0.97368421 0.02631579]\n",`
			`" [1. 0. 0. ]\n",`
			`" [0. 0. 1. ]]\n"`
			`]`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# Print the \n",`
			`"print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 5,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Accuracy in training 0.9821428571428571\n"`
			`]`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# Evaluate Accuracy in training\n",`
			`"\n",`
			`"from sklearn import metrics\n",`
			`"y_train_pred = model.predict(x_train)\n",`
			`"print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 6,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
			`"Accuracy in testing 0.9210526315789473\n"`
			`]`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# Now we evaluate error in testing\n",`
			`"y_test_pred = model.predict(x_test)\n",`
			`"print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Now we are going to visualize the DecisionTree classification. It will plot the decision boundaries for each class.\n",`
			`"\n",`
			`"The current version of pydot does not work well in Python 3.\n",`
			"For obtaining an image, you need to install `pip install pydotplus` and then `conda install graphviz`.\n",
			`"\n",`
			`"You can skip this example. Since it can require installing additional packages, we include here the result.\n",`
			`"![Decision Tree](files/images/cart.png)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Updated notebooks 2019-03-06 16:44:30 +00:00			`"execution_count": 7,`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"ename": "ModuleNotFoundError",`
			`"evalue": "No module named 'pydotplus'",`
			`"output_type": "error",`
			`"traceback": [`
			`"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",`
			`"\u001b[0;31mModuleNotFoundError\u001b[0m Traceback (most recent call last)",`
			"\u001b[0;32m<ipython-input-7-1bf5ec7fb043>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0mIPython\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdisplay\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mImage\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0msklearn\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexternals\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msix\u001b[0m \u001b[0;32mimport\u001b[0m \u001b[0mStringIO\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 3\u001b[0;31m \u001b[0;32mimport\u001b[0m \u001b[0mpydotplus\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0mpydot\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5\u001b[0m \u001b[0mdot_data\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mStringIO\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
			`"\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'pydotplus'"`
			`]`
			`}`
			`],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"from IPython.display import Image \n",`
			`"from sklearn.externals.six import StringIO\n",`
			`"import pydotplus as pydot\n",`
			`"\n",`
			`"dot_data = StringIO() \n",`
			`"tree.export_graphviz(model, out_file=dot_data, \n",`
			`" feature_names=iris.feature_names, \n",`
			`" class_names=iris.target_names, \n",`
			`" filled=True, rounded=True, \n",`
			`" special_characters=True) \n",`
			`"\n",`
			`"\n",`
			`"graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",`
			`"graph.write_png('iris-tree.png')\n",`
			`"Image(graph.create_png()) "`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Here we show a graph of the decision tree boundaries. For each pair of iris features, the decision tree learns decision boundaries made of combinations of simple thresholding rules inferred from the training samples.\n",`
			`"\n",`
			`"We are going to import a function defined in the file [util_ds.py](files/util_ds.py) using the magic command %run."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"%run util_ds\n",`
			`"\n",`
			`"# display plots in the notebook \n",`
			`"%matplotlib inline\n",`
			`"plot_tree_iris()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Next we are going to export the pseudocode of the the learnt decision tree."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
added update -all 2019-02-28 11:26:33 +00:00			`"execution_count": null,`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"metadata": {},`
added update -all 2019-02-28 11:26:33 +00:00			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"%run util_ds\n",`
			`"get_code(model, iris.feature_names, iris.target_names)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We can also obtain the feature importance of the fitted model as follows."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
added update -all 2019-02-28 11:26:33 +00:00			`"execution_count": null,`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"metadata": {},`
added update -all 2019-02-28 11:26:33 +00:00			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"print(iris.feature_names)\n",`
			`"print(model.feature_importances_)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			"We see that the most important feature for this classifier is `petal width`."
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Evaluating the algorithm"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Precision, recall and f-score"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score.\n",`
			`"\n",`
			`"* Precision: the proportion of instances predicted as positive that are actually positive (it measures how right our classifier is when it says that an instance is positive).\n",`
			`"* Recall: the proportion of actual positive instances that were predicted as positive (it measures how right our classifier is when faced with a positive instance).\n",`
			`"* F1-score: the harmonic mean of precision and recall, which combines both in a single number."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Confusion matrix"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Another useful metric is the confusion matrix, in which each row corresponds to a true class and each column to a predicted class."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"print(metrics.confusion_matrix(y_test, y_test_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We see that all the 'setosa' and 'versicolor' samples are classified correctly."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### K-Fold cross validation"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
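			`"In order to avoid bias in the training and testing dataset partition, it is recommended to use k-fold cross validation, which splits the data into k folds and uses each fold once for testing while training on the remaining k-1 folds.\n",`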
			`"\n",`
			`"Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as Stratified K-fold, Group K-fold, Leave-One-Out, Leave-P-Out, Leave-One-Group-Out, Leave-P-Groups-Out or Shuffle & Split."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from sklearn.model_selection import cross_val_score, KFold\n",`
			`"from sklearn.pipeline import Pipeline\n",`
			`"from sklearn.preprocessing import StandardScaler\n",`
			`"\n",`
			`"# create a composite estimator made by a pipeline of preprocessing and the decision tree model\n",`
			`"model = Pipeline([\n",`
			`" ('scaler', StandardScaler()),\n",`
			`" ('DecisionTree', DecisionTreeClassifier())\n",`
			`"])\n",`
			`"\n",`
			`"# create a k-fold cross validation iterator of k=10 folds\n",`
			`"cv = KFold(10, shuffle=True, random_state=33)\n",`
			`"\n",`
			`"# by default the score used is the one returned by score method of the estimator (accuracy)\n",`
			`"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",`
			`"print(scores)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We get an array of k scores. We can calculate the mean and the standard error of the mean to obtain a final figure."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"from scipy.stats import sem\n",`
			`"def mean_score(scores):\n",`
			`" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",`
			`"print(mean_score(scores))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"So, we get an average accuracy of 0.947."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## References"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n",`
			`"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",`
			`"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",`
			`"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",`
			`"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Licence\n",`
			`"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
			`"© Carlos A. Iglesias, Universidad Politécnica de Madrid."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.7.1"`
			`},`
			`"latex_envs": {`
			`"LaTeX_envs_menu_present": true,`
			`"autocomplete": true,`
			`"bibliofile": "biblio.bib",`
			`"cite_by": "apalike",`
			`"current_citInitial": 1,`
			`"eqLabelWithNumbers": true,`
			`"eqNumInitial": 1,`
			`"hotkeys": {`
			`"equation": "Ctrl-E",`
			`"itemize": "Ctrl-I"`
			`},`
			`"labels_anchors": false,`
			`"latex_user_defs": false,`
			`"report_style_numbering": false,`
			`"user_envs_cfg": false`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 1`
			`}`