mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-05 15:31:42 +00:00
503 lines
14 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Decision Tree Learning](#Decision-Tree-Learning)\n",
|
|
"* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
|
|
"* [Train classifier](#Train-classifier)\n",
|
|
"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
|
|
"\t* [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",
|
|
"\t* [Confusion matrix](#Confusion-matrix)\n",
|
|
"\t* [K-Fold cross validation](#K-Fold-cross-validation)\n",
|
|
"* [References](#References)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Decision Tree Learning"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The goal of this notebook is to learn how to create a classification object using a [decision tree learning algorithm](https://en.wikipedia.org/wiki/Decision_tree_learning). \n",
|
|
"\n",
|
|
"There are a number of well-known machine learning algorithms for decision tree learning, such as ID3, C4.5, C5.0 and CART. scikit-learn uses an optimised version of the [CART (Classification and Regression Trees) algorithm](https://en.wikipedia.org/wiki/Predictive_analytics#Classification_and_regression_trees).\n",
|
|
"\n",
|
|
"This notebook follows the same steps as the previous notebook on the [kNN Model](2_5_1_kNN_Model.ipynb), and details some peculiarities of decision tree algorithms.\n",
|
|
"\n",
|
|
"For the visualization, you need to install pydotplus: `conda install pydotplus`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Load data and preprocessing\n",
|
|
"\n",
|
|
"Here we repeat the same data loading and preprocessing operations as in the previous notebooks."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# library for displaying plots\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"# display plots in the notebook \n",
|
|
"%matplotlib inline\n",
|
|
"\n",
|
|
"## First, we repeat the load and preprocessing steps\n",
|
|
"\n",
|
|
"# Load data\n",
|
|
"from sklearn import datasets\n",
|
|
"iris = datasets.load_iris()\n",
|
|
"\n",
|
|
"# Training and test splitting\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"\n",
|
|
"x_iris, y_iris = iris.data, iris.target\n",
|
|
"# Test set will be the 25% taken randomly\n",
|
|
"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
|
|
"\n",
|
|
"# Preprocess: normalize\n",
|
|
"from sklearn import preprocessing\n",
|
|
"scaler = preprocessing.StandardScaler().fit(x_train)\n",
|
|
"x_train = scaler.transform(x_train)\n",
|
|
"x_test = scaler.transform(x_test)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Train classifier"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The usual steps for creating a classifier are:\n",
|
|
"1. Create classifier object\n",
|
|
"2. Call *fit* to train the classifier\n",
|
|
"3. Call *predict* to obtain predictions\n",
|
|
"\n",
|
|
"*DecisionTreeClassifier* is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.tree import DecisionTreeClassifier\n",
|
|
"import numpy as np\n",
|
|
"\n",
|
|
"from sklearn import tree\n",
|
|
"\n",
|
|
"max_depth=3\n",
|
|
"random_state=1\n",
|
|
"\n",
|
|
"# Create decision tree model\n",
|
|
"model = tree.DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)\n",
|
|
"\n",
|
|
"# Train the model using the training sets\n",
|
|
"model.fit(x_train, y_train) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(\"Prediction \", model.predict(x_train))\n",
|
|
"print(\"Expected \", y_train)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Alternatively, the probability of each class can be predicted, which is the fraction of training samples of the same class in a leaf:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Print the predicted class probabilities for the first 10 training samples\n",
|
|
"print(\"Predicted probabilities\", model.predict_proba(x_train[:10]))"
|
|
]
|
|
},
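{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough check of the statement above, the following sketch (not part of the original notebook) recomputes the class fractions directly from the fitted tree. It assumes the `model` and `x_train` objects defined in the previous cells; `leaf_ids`, `leaf_values` and `leaf_fractions` are names introduced here for illustration. The rows are normalised explicitly, so the check does not depend on whether `tree_.value` stores per-class counts or per-class fractions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Index of the leaf reached by each of the first 10 training samples\n",
"leaf_ids = model.apply(x_train[:10])\n",
"\n",
"# Per-class values stored at those leaves, shape (10, n_classes)\n",
"leaf_values = model.tree_.value[leaf_ids, 0, :]\n",
"\n",
"# Normalise each row so that it sums to 1\n",
"leaf_fractions = leaf_values / leaf_values.sum(axis=1, keepdims=True)\n",
"\n",
"# Compare with the probabilities returned by the classifier\n",
"print(np.allclose(leaf_fractions, model.predict_proba(x_train[:10])))"
]
},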
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Evaluate Accuracy in training\n",
|
|
"\n",
|
|
"from sklearn import metrics\n",
|
|
"y_train_pred = model.predict(x_train)\n",
|
|
"print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Now we evaluate accuracy in testing\n",
|
|
"y_test_pred = model.predict(x_test)\n",
|
|
"print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now we are going to visualize the learnt decision tree. The plot shows the split rule at each node and the class assigned at each leaf.\n",
|
|
"\n",
|
|
"The current version of pydot does not work well in Python 3.\n",
|
|
"To obtain an image, you need to run `pip install pydotplus` and then `conda install graphviz`.\n",
|
|
"\n",
|
|
"You can skip this example. Since it may require installing additional packages, we include the result here.\n",
|
|
"![Decision Tree](files/images/cart.png)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from IPython.display import Image \n",
|
|
"from io import StringIO\n",
|
|
"import pydotplus as pydot\n",
|
|
"\n",
|
|
"dot_data = StringIO() \n",
|
|
"tree.export_graphviz(model, out_file=dot_data, \n",
|
|
" feature_names=iris.feature_names, \n",
|
|
" class_names=iris.target_names, \n",
|
|
" filled=True, rounded=True, \n",
|
|
" special_characters=True) \n",
|
|
"\n",
|
|
"\n",
|
|
"graph = pydot.graph_from_dot_data(dot_data.getvalue()) \n",
|
|
"graph.write_png('iris-tree.png')\n",
|
|
"Image(graph.create_png()) "
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Here we show a graph of the decision tree boundaries. For each pair of iris features, the decision tree learns decision boundaries made of combinations of simple thresholding rules inferred from the training samples.\n",
|
|
"\n",
|
|
"We are going to import a function defined in the file [util_ds.py](files/util_ds.py) using the *magic command* **%run**."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%run util_ds\n",
|
|
"\n",
|
|
"# display plots in the notebook \n",
|
|
"%matplotlib inline\n",
|
|
"plot_tree_iris()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Next we are going to export the pseudocode of the learnt decision tree."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%run util_ds\n",
|
|
"get_code(model, iris.feature_names, iris.target_names)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can also obtain the feature importance of the fitted model as follows."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(iris.feature_names)\n",
|
|
"print(model.feature_importances_)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We see that the most important feature for this classifier is `petal width`."
|
|
]
|
|
},
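{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the ranking easier to read, the two arrays can be paired and sorted. This is a small illustrative sketch, not part of the original notebook; it assumes the `iris` and `model` objects from the cells above, and `ranking` is a name introduced here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pair each feature name with its importance and sort from most to least important\n",
"ranking = sorted(zip(iris.feature_names, model.feature_importances_),\n",
"                 key=lambda pair: pair[1], reverse=True)\n",
"for name, importance in ranking:\n",
"    print(\"{0:<20} {1:.3f}\".format(name, importance))"
]
},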
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Evaluating the algorithm"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Precision, recall and f-score"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score.\n",
|
|
"\n",
|
|
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
|
|
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
|
|
"* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(metrics.classification_report(y_test, y_test_pred,target_names=iris.target_names))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Confusion matrix"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Another useful metric is the confusion matrix, where each row corresponds to a true class and each column to a predicted class."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(metrics.confusion_matrix(y_test, y_test_pred))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We see that all the 'setosa' and 'versicolor' samples are classified correctly."
|
|
]
|
|
},
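{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-class precision, recall and F1-score can also be recovered by hand from the confusion matrix. The sketch below is an illustration, not part of the original notebook. It assumes `metrics`, `iris`, `y_test` and `y_test_pred` from the previous cells, and that every class is predicted at least once (otherwise a column sum would be zero); `cm`, `true_positives`, `precision`, `recall` and `f1` are names introduced here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Rows are true classes and columns are predicted classes\n",
"cm = metrics.confusion_matrix(y_test, y_test_pred)\n",
"\n",
"# Correctly classified samples of each class lie on the diagonal\n",
"true_positives = np.diag(cm)\n",
"# Precision: correct predictions / all samples predicted as the class (column sums)\n",
"precision = true_positives / cm.sum(axis=0)\n",
"# Recall: correct predictions / all true members of the class (row sums)\n",
"recall = true_positives / cm.sum(axis=1)\n",
"# F1-score: harmonic mean of precision and recall\n",
"f1 = 2 * precision * recall / (precision + recall)\n",
"\n",
"for name, p, r, f in zip(iris.target_names, precision, recall, f1):\n",
"    print(\"{0:<12} precision={1:.2f} recall={2:.2f} f1={3:.2f}\".format(name, p, r, f))"
]
},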
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### K-Fold cross validation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"To avoid bias in the partition into training and testing datasets, it is recommended to use **k-fold cross validation**.\n",
|
|
"\n",
|
|
"Sklearn comes with other strategies for [cross validation](http://scikit-learn.org/stable/modules/cross_validation.html#cross-validation), such as stratified K-fold, label k-fold, Leave-One-Out, Leave-P-Out, Leave-One-Label-Out, Leave-P-Label-Out or Shuffle & Split."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.model_selection import cross_val_score, KFold\n",
|
|
"from sklearn.pipeline import Pipeline\n",
|
|
"from sklearn.preprocessing import StandardScaler\n",
|
|
"\n",
|
|
"# create a composite estimator made by a pipeline of preprocessing and the decision tree model\n",
|
|
"model = Pipeline([\n",
|
|
" ('scaler', StandardScaler()),\n",
|
|
" ('DecisionTree', DecisionTreeClassifier())\n",
|
|
"])\n",
|
|
"\n",
|
|
"# create a k-fold cross validation iterator of k=10 folds\n",
|
|
"cv = KFold(10, shuffle=True, random_state=33)\n",
|
|
"\n",
|
|
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
|
|
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
|
|
"print(scores)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from scipy.stats import sem\n",
|
|
"def mean_score(scores):\n",
|
|
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
|
|
"print(mean_score(scores))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"So, we get an average accuracy of 0.947."
|
|
]
|
|
},
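{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, the standard error reported by `sem` is the sample standard deviation of the fold scores divided by the square root of the number of folds. A minimal sketch, not part of the original notebook, assuming the `scores` array from the cross-validation cell above (`manual_sem` is a name introduced here):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import sem\n",
"\n",
"# Sample standard deviation (ddof=1) divided by the square root of the number of folds\n",
"manual_sem = np.std(scores, ddof=1) / np.sqrt(len(scores))\n",
"\n",
"# Compare with the value computed by scipy\n",
"print(np.isclose(manual_sem, sem(scores)))"
]
},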
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [Plot the decision surface of a decision tree on the iris dataset](http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html)\n",
|
|
"* [Learning scikit-learn: Machine Learning in Python](http://proquest.safaribooksonline.com/book/programming/python/9781783281930/1dot-machine-learning-a-gentle-introduction/ch01s02_html), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n",
|
|
"* [Python Machine Learning](http://proquest.safaribooksonline.com/book/programming/python/9781783555130), Sebastian Raschka, Packt Publishing, 2015.\n",
|
|
"* [Parameter estimation using grid search with cross-validation](http://scikit-learn.org/stable/auto_examples/model_selection/grid_search_digits.html)\n",
|
|
"* [Decision trees in python with scikit-learn and pandas](http://chrisstrelioff.ws/sandbox/2015/06/08/decision_trees_in_python_with_scikit_learn_and_pandas.html)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence\n",
|
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"datacleaner": {
|
|
"position": {
|
|
"top": "50px"
|
|
},
|
|
"python": {
|
|
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
|
|
},
|
|
"window_display": false
|
|
},
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.9"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|