{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](files/images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "* [kNN Model](#kNN-Model)\n", "* [Load data and preprocessing](#Load-data-and-preprocessing)\n", "* [Train classifier](#Train-classifier)\n", "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n", " * [Precision, recall and f-score](#Precision,-recall-and-f-score)\n", "\t* [Confusion matrix](#Confusion-matrix)\n", "\t* [K-Fold validation](#K-Fold-validation)\n", "* [Tuning the algorithm](#Tuning-the-algorithm)\n", "* [References](#References)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# kNN Model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n", "\n", "The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading data and preprocessing\n", "\n", "The first step is loading and preprocessing the data as explained in the previous notebooks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# library for displaying plots\n", "import matplotlib.pyplot as plt\n", "\n", "# display plots in the notebook \n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## First, we repeat the load and preprocessing steps\n", "\n", "# Load data\n", "from sklearn import datasets\n", "iris = datasets.load_iris()\n", "\n", "# Training and test spliting\n", "from sklearn.model_selection import train_test_split\n", "\n", "x_iris, y_iris = iris.data, iris.target\n", "\n", "# Test set will be the 25% taken randomly\n", "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n", "\n", "# Preprocess: normalize\n", "from sklearn import preprocessing\n", "scaler = preprocessing.StandardScaler().fit(x_train)\n", "x_train = scaler.transform(x_train)\n", "x_test = scaler.transform(x_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usual steps for creating a classifier are:\n", "1. Create classifier object\n", "2. Call *fit* to train the classifier\n", "3. Call *predict* to obtain predictions\n", "\n", "Once the model is created, the most relevant methods are:\n", "* model.fit(x_train, y_train): train the model\n", "* model.predict(x): predict\n", "* model.score(x, y): evaluate the prediction" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "import numpy as np\n", "\n", "# Create kNN model\n", "model = KNeighborsClassifier(n_neighbors=15)\n", "\n", "# Train the model using the training sets\n", "model.fit(x_train, y_train) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Prediction \", model.predict(x_train))\n", "print(\"Expected \", y_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Evaluate Accuracy in training\n", "\n", "from sklearn import metrics\n", "y_train_pred = model.predict(x_train)\n", "print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Now we evaluate error in testing\n", "y_test_pred = model.predict(x_test)\n", "print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are going to visualize the Nearest Neighbors classification. It will plot the decision boundaries for each class.\n", "\n", "We are going to import a function defined in the file [util_knn.py](files/util_knn.py) using the *magic command* **%run**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run util_knn.py\n", "\n", "plot_classification_iris()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating the algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Precision, recall and f-score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n", "\n", "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n", "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n", "* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Confusion matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful metric is the confusion matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(metrics.confusion_matrix(y_test, y_test_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see we classify well all the 'setosa' and 'versicolor' samples. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### K-Fold validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score, KFold\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n", "model = Pipeline([\n", " ('scaler', StandardScaler()),\n", " ('kNN', KNeighborsClassifier())\n", "])\n", "\n", "# create a k-fold cross validation iterator of k=10 folds\n", "cv = KFold(10, shuffle=True, random_state=33)\n", "\n", "# by default the score used is the one returned by score method of the estimator (accuracy)\n", "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n", "print(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import sem\n", "def mean_score(scores):\n", " return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n", "print(mean_score(scores))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we get an average accuracy of 0.940." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning the algorithm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to tune the algorithm, and calculate which is the best value for the k hyperparameter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "k_range = range(1, 21)\n", "accuracy = []\n", "for k in k_range:\n", " m = KNeighborsClassifier(k)\n", " m.fit(x_train, y_train)\n", " y_test_pred = m.predict(x_test)\n", " accuracy.append(metrics.accuracy_score(y_test, y_test_pred))\n", "plt.plot(k_range, accuracy)\n", "plt.xlabel('k value')\n", "plt.ylabel('Accuracy')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence\n", "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "datacleaner": { "position": { "top": "50px" }, "python": { "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])" }, "window_display": false }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.12" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 1 }