sitc/ml1/2_5_1_kNN_Model.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [kNN Model](#kNN-Model)\n",
    "* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
    "* [Train classifier](#Train-classifier)\n",
    "* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
    "    * [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",
    "\t* [Confusion matrix](#Confusion-matrix)\n",
    "\t* [K-Fold validation](#K-Fold-validation)\n",
    "* [Tuning the algorithm](#Tuning-the-algorithm)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# kNN Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n",
    "\n",
    "The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading data and preprocessing\n",
    "\n",
    "The first step is loading and preprocessing the data as explained in the previous notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# library for displaying plots\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# display plots in the notebook \n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## First, we repeat the load and preprocessing steps\n",
    "\n",
    "# Load data\n",
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# Training and test spliting\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "x_iris, y_iris = iris.data, iris.target\n",
    "\n",
    "# Test set will be the 25% taken randomly\n",
    "x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
    "\n",
    "# Preprocess: normalize\n",
    "from sklearn import preprocessing\n",
    "scaler = preprocessing.StandardScaler().fit(x_train)\n",
    "x_train = scaler.transform(x_train)\n",
    "x_test = scaler.transform(x_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train classifier"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The usual steps for creating a classifier are:\n",
    "1. Create classifier object\n",
    "2. Call *fit* to train the classifier\n",
    "3. Call *predict* to obtain predictions\n",
    "\n",
    "Once the model is created, the most relevant methods are:\n",
    "* model.fit(x_train, y_train): train the model\n",
    "* model.predict(x): predict\n",
    "* model.score(x, y): evaluate the prediction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "import numpy as np\n",
    "\n",
    "# Create kNN model\n",
    "model = KNeighborsClassifier(n_neighbors=15)\n",
    "\n",
    "# Train the model using the training sets\n",
    "model.fit(x_train, y_train) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Prediction \", model.predict(x_train))\n",
    "print(\"Expected \", y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate Accuracy in training\n",
    "\n",
    "from sklearn import metrics\n",
    "y_train_pred = model.predict(x_train)\n",
    "print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Now we evaluate error in testing\n",
    "y_test_pred = model.predict(x_test)\n",
    "print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to visualize the Nearest Neighbors classification. It will plot the decision boundaries for each class.\n",
    "\n",
    "We are going to import a function defined in the file [util_knn.py](files/util_knn.py) using the *magic command* **%run**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%run util_knn.py\n",
    "\n",
    "plot_classification_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluating the algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Precision, recall and f-score"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",
    "\n",
    "* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
    "* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
    "* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Confusion matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another useful metric is the confusion matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(metrics.confusion_matrix(y_test, y_test_pred))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see we classify well all the 'setosa' and 'versicolor' samples. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### K-Fold validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_val_score, KFold\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
    "model = Pipeline([\n",
    "        ('scaler', StandardScaler()),\n",
    "        ('kNN', KNeighborsClassifier())\n",
    "])\n",
    "\n",
    "# create a k-fold cross validation iterator of k=10 folds\n",
    "cv = KFold(10, shuffle=True, random_state=33)\n",
    "\n",
    "# by default the score used is the one returned by score method of the estimator (accuracy)\n",
    "scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
    "print(scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.stats import sem\n",
    "def mean_score(scores):\n",
    "    return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
    "print(mean_score(scores))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So, we get an average accuracy of 0.940."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Tuning the algorithm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are going to tune the algorithm, and calculate which is the best value for the k hyperparameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "k_range = range(1, 21)\n",
    "accuracy = []\n",
    "for k in k_range:\n",
    "    m = KNeighborsClassifier(k)\n",
    "    m.fit(x_train, y_train)\n",
    "    y_test_pred = m.predict(x_test)\n",
    "    accuracy.append(metrics.accuracy_score(y_test, y_test_pred))\n",
    "plt.plot(k_range, accuracy)\n",
    "plt.xlabel('k value')\n",
    "plt.ylabel('Accuracy')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
    "* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.12"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
Add ml1 2016-03-15 12:55:14 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![](files/images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
updated notebooks 2019-02-28 10:32:00 +00:00			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Table of Contents\n",`
			`"* [kNN Model](#kNN-Model)\n",`
			`"* [Load data and preprocessing](#Load-data-and-preprocessing)\n",`
			`"* [Train classifier](#Train-classifier)\n",`
			`"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",`
			`" * [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",`
			`"\t* [Confusion matrix](#Confusion-matrix)\n",`
			`"\t* [K-Fold validation](#K-Fold-validation)\n",`
			`"* [Tuning the algorithm](#Tuning-the-algorithm)\n",`
			`"* [References](#References)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# kNN Model"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
Review J 2016-03-28 10:26:20 +00:00			`"The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
Review J 2016-03-28 10:26:20 +00:00			`"## Loading data and preprocessing\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"The first step is loading and preprocessing the data as explained in the previous notebooks."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"metadata": {},`
Add ml1 2016-03-15 12:55:14 +00:00			`"outputs": [],`
			`"source": [`
			`"# library for displaying plots\n",`
			`"import matplotlib.pyplot as plt\n",`
			`"\n",`
			`"# display plots in the notebook \n",`
Review J 2016-03-28 10:26:20 +00:00			`"%matplotlib inline"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"metadata": {},`
Review J 2016-03-28 10:26:20 +00:00			`"outputs": [],`
			`"source": [`
Add ml1 2016-03-15 12:55:14 +00:00			`"## First, we repeat the load and preprocessing steps\n",`
			`"\n",`
			`"# Load data\n",`
			`"from sklearn import datasets\n",`
			`"iris = datasets.load_iris()\n",`
			`"\n",`
			`"# Training and test spliting\n",`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"from sklearn.model_selection import train_test_split\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"x_iris, y_iris = iris.data, iris.target\n",`
			`"\n",`
			`"# Test set will be the 25% taken randomly\n",`
			`"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",`
			`"\n",`
			`"# Preprocess: normalize\n",`
			`"from sklearn import preprocessing\n",`
			`"scaler = preprocessing.StandardScaler().fit(x_train)\n",`
			`"x_train = scaler.transform(x_train)\n",`
			`"x_test = scaler.transform(x_test)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"metadata": {},`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"## Train classifier"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The usual steps for creating a classifier are:\n",`
			`"1. Create classifier object\n",`
			`"2. Call fit to train the classifier\n",`
			`"3. Call predict to obtain predictions\n",`
			`"\n",`
			`"Once the model is created, the most relevant methods are:\n",`
			`"* model.fit(x_train, y_train): train the model\n",`
			`"* model.predict(x): predict\n",`
			`"* model.score(x, y): evaluate the prediction"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"from sklearn.neighbors import KNeighborsClassifier\n",`
			`"import numpy as np\n",`
			`"\n",`
			`"# Create kNN model\n",`
			`"model = KNeighborsClassifier(n_neighbors=15)\n",`
			`"\n",`
			`"# Train the model using the training sets\n",`
			`"model.fit(x_train, y_train) "`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"print(\"Prediction \", model.predict(x_train))\n",`
			`"print(\"Expected \", y_train)"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# Evaluate Accuracy in training\n",`
			`"\n",`
			`"from sklearn import metrics\n",`
			`"y_train_pred = model.predict(x_train)\n",`
			`"print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"# Now we evaluate error in testing\n",`
			`"y_test_pred = model.predict(x_test)\n",`
			`"print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Now we are going to visualize the Nearest Neighbors classification. It will plot the decision boundaries for each class.\n",`
			`"\n",`
			`"We are going to import a function defined in the file [util_knn.py](files/util_knn.py) using the magic command %run."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"%run util_knn.py\n",`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"plot_classification_iris()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Evaluating the algorithm"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Precision, recall and f-score"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",`
			`"\n",`
			`"* Precision: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",`
			`"* Recall: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",`
			`"* F1-score: This is the harmonic mean of precision and recall, and tries to combine both in a single number."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
Review J 2016-03-28 10:26:20 +00:00			`"print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))"`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### Confusion matrix"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Another useful metric is the confusion matrix"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"print(metrics.confusion_matrix(y_test, y_test_pred))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"metadata": {},`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"We see we classify well all the 'setosa' and 'versicolor' samples. "`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"### K-Fold validation"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In order to avoid bias in the training and testing dataset partition, it is recommended to use k-fold validation."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"from sklearn.model_selection import cross_val_score, KFold\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"from sklearn.pipeline import Pipeline\n",`
			`"from sklearn.preprocessing import StandardScaler\n",`
			`"\n",`
			`"# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",`
			`"model = Pipeline([\n",`
			`" ('scaler', StandardScaler()),\n",`
			`" ('kNN', KNeighborsClassifier())\n",`
			`"])\n",`
			`"\n",`
			`"# create a k-fold cross validation iterator of k=10 folds\n",`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"cv = KFold(10, shuffle=True, random_state=33)\n",`
Add ml1 2016-03-15 12:55:14 +00:00			`"\n",`
			`"# by default the score used is the one returned by score method of the estimator (accuracy)\n",`
			`"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",`
			`"print(scores)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
Fix sklearn.model_selection. Remove output 2019-02-28 14:25:19 +00:00			`"metadata": {},`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"from scipy.stats import sem\n",`
			`"def mean_score(scores):\n",`
			`" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",`
			`"print(mean_score(scores))"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"So, we get an average accuracy of 0.940."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Tuning the algorithm"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
Update 2_5_1_kNN_Model.ipynb Fixed typo. 2022-02-28 11:38:27 +00:00			`"We are going to tune the algorithm, and calculate which is the best value for the k hyperparameter."`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "code",`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"execution_count": null,`
			`"metadata": {},`
			`"outputs": [],`
Add ml1 2016-03-15 12:55:14 +00:00			`"source": [`
			`"k_range = range(1, 21)\n",`
			`"accuracy = []\n",`
			`"for k in k_range:\n",`
			`" m = KNeighborsClassifier(k)\n",`
			`" m.fit(x_train, y_train)\n",`
			`" y_test_pred = m.predict(x_test)\n",`
			`" accuracy.append(metrics.accuracy_score(y_test, y_test_pred))\n",`
			`"plt.plot(k_range, accuracy)\n",`
			`"plt.xlabel('k value')\n",`
			`"plt.ylabel('Accuracy')\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## References"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",`
Actualizada bibliografía 2022-02-21 12:26:24 +00:00			`"* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Licence\n",`
			`"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
updated notebooks 2019-02-28 10:32:00 +00:00			`"© Carlos A. Iglesias, Universidad Politécnica de Madrid."`
Add ml1 2016-03-15 12:55:14 +00:00			`]`
			`}`
			`],`
			`"metadata": {`
Clean 2_5_1 2021-02-27 19:50:03 +00:00			`"datacleaner": {`
			`"position": {`
			`"top": "50px"`
			`},`
			`"python": {`
			`"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"`
			`},`
			`"window_display": false`
			`},`
Add ml1 2016-03-15 12:55:14 +00:00			`"kernelspec": {`
Actualizada bibliografía 2022-02-21 12:26:24 +00:00			`"display_name": "Python 3 (ipykernel)",`
Add ml1 2016-03-15 12:55:14 +00:00			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
Actualizada bibliografía 2022-02-21 12:26:24 +00:00			`"version": "3.8.12"`
updated notebooks 2019-02-28 10:32:00 +00:00			`},`
			`"latex_envs": {`
			`"LaTeX_envs_menu_present": true,`
			`"autocomplete": true,`
			`"bibliofile": "biblio.bib",`
			`"cite_by": "apalike",`
			`"current_citInitial": 1,`
			`"eqLabelWithNumbers": true,`
			`"eqNumInitial": 1,`
			`"hotkeys": {`
			`"equation": "Ctrl-E",`
			`"itemize": "Ctrl-I"`
			`},`
			`"labels_anchors": false,`
			`"latex_user_defs": false,`
			`"report_style_numbering": false,`
			`"user_envs_cfg": false`
Add ml1 2016-03-15 12:55:14 +00:00			`}`
			`},`
			`"nbformat": 4,`
adapted some calls to new scikit version 2018-02-23 14:48:59 +00:00			`"nbformat_minor": 1`
Add ml1 2016-03-15 12:55:14 +00:00			`}`