1
0
mirror of https://github.com/gsi-upm/sitc synced 2024-12-22 19:58:12 +00:00
sitc/ml1/2_5_1_kNN_Model.ipynb

445 lines
12 KiB
Plaintext
Raw Normal View History

2016-03-15 12:55:14 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2019-02-28 10:32:00 +00:00
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
2016-03-15 12:55:14 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [kNN Model](#kNN-Model)\n",
"* [Load data and preprocessing](#Load-data-and-preprocessing)\n",
"* [Train classifier](#Train-classifier)\n",
"* [Evaluating the algorithm](#Evaluating-the-algorithm)\n",
" * [Precision, recall and f-score](#Precision,-recall-and-f-score)\n",
"\t* [Confusion matrix](#Confusion-matrix)\n",
"\t* [K-Fold validation](#K-Fold-validation)\n",
"* [Tuning the algorithm](#Tuning-the-algorithm)\n",
"* [References](#References)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# kNN Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2016-03-28 10:26:20 +00:00
"The goal of this notebook is to learn how to train a model, make predictions with that model and evaluate these predictions.\n",
2016-03-15 12:55:14 +00:00
"\n",
"The notebook uses the [kNN (k nearest neighbors) algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2016-03-28 10:26:20 +00:00
"## Loading data and preprocessing\n",
2016-03-15 12:55:14 +00:00
"\n",
"The first step is loading and preprocessing the data as explained in the previous notebooks."
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
2016-03-15 12:55:14 +00:00
"outputs": [],
"source": [
"# library for displaying plots\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# display plots in the notebook \n",
2016-03-28 10:26:20 +00:00
"%matplotlib inline"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
2016-03-28 10:26:20 +00:00
"outputs": [],
"source": [
2016-03-15 12:55:14 +00:00
"## First, we repeat the load and preprocessing steps\n",
"\n",
"# Load data\n",
"from sklearn import datasets\n",
"iris = datasets.load_iris()\n",
"\n",
"# Training and test spliting\n",
"from sklearn.model_selection import train_test_split\n",
2016-03-15 12:55:14 +00:00
"\n",
"x_iris, y_iris = iris.data, iris.target\n",
"\n",
"# Test set will be the 25% taken randomly\n",
"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)\n",
"\n",
"# Preprocess: normalize\n",
"from sklearn import preprocessing\n",
"scaler = preprocessing.StandardScaler().fit(x_train)\n",
"x_train = scaler.transform(x_train)\n",
"x_test = scaler.transform(x_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
2016-03-15 12:55:14 +00:00
"source": [
"## Train classifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The usual steps for creating a classifier are:\n",
"1. Create classifier object\n",
"2. Call *fit* to train the classifier\n",
"3. Call *predict* to obtain predictions\n",
"\n",
"Once the model is created, the most relevant methods are:\n",
"* model.fit(x_train, y_train): train the model\n",
"* model.predict(x): predict\n",
"* model.score(x, y): evaluate the prediction"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"import numpy as np\n",
"\n",
"# Create kNN model\n",
"model = KNeighborsClassifier(n_neighbors=15)\n",
"\n",
"# Train the model using the training sets\n",
"model.fit(x_train, y_train) "
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"print(\"Prediction \", model.predict(x_train))\n",
"print(\"Expected \", y_train)"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"# Evaluate Accuracy in training\n",
"\n",
"from sklearn import metrics\n",
"y_train_pred = model.predict(x_train)\n",
"print(\"Accuracy in training\", metrics.accuracy_score(y_train, y_train_pred))"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"# Now we evaluate error in testing\n",
"y_test_pred = model.predict(x_test)\n",
"print(\"Accuracy in testing \", metrics.accuracy_score(y_test, y_test_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to visualize the Nearest Neighbors classification. It will plot the decision boundaries for each class.\n",
"\n",
"We are going to import a function defined in the file [util_knn.py](files/util_knn.py) using the *magic command* **%run**."
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"%run util_knn.py\n",
"\n",
2016-03-15 12:55:14 +00:00
"plot_classification_iris()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating the algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Precision, recall and f-score"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For evaluating classification algorithms, we usually calculate three metrics: precision, recall and F1-score\n",
"\n",
"* **Precision**: This computes the proportion of instances predicted as positives that were correctly evaluated (it measures how right our classifier is when it says that an instance is positive).\n",
"* **Recall**: This counts the proportion of positive instances that were correctly evaluated (measuring how right our classifier is when faced with a positive instance).\n",
"* **F1-score**: This is the harmonic mean of precision and recall, and tries to combine both in a single number."
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
2016-03-28 10:26:20 +00:00
"print(metrics.classification_report(y_test, y_test_pred, target_names=iris.target_names))"
2016-03-15 12:55:14 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Confusion matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another useful metric is the confusion matrix"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"print(metrics.confusion_matrix(y_test, y_test_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
2016-03-15 12:55:14 +00:00
"source": [
"We see we classify well all the 'setosa' and 'versicolor' samples. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### K-Fold validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to avoid bias in the training and testing dataset partition, it is recommended to use **k-fold validation**."
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"from sklearn.model_selection import cross_val_score, KFold\n",
2016-03-15 12:55:14 +00:00
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# create a composite estimator made by a pipeline of preprocessing and the KNN model\n",
"model = Pipeline([\n",
" ('scaler', StandardScaler()),\n",
" ('kNN', KNeighborsClassifier())\n",
"])\n",
"\n",
"# create a k-fold cross validation iterator of k=10 folds\n",
"cv = KFold(10, shuffle=True, random_state=33)\n",
2016-03-15 12:55:14 +00:00
"\n",
"# by default the score used is the one returned by score method of the estimator (accuracy)\n",
"scores = cross_val_score(model, x_iris, y_iris, cv=cv)\n",
"print(scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
2016-03-15 12:55:14 +00:00
"source": [
"We get an array of k scores. We can calculate the mean and the standard error to obtain a final figure"
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"from scipy.stats import sem\n",
"def mean_score(scores):\n",
" return (\"Mean score: {0:.3f} (+/- {1:.3f})\").format(np.mean(scores), sem(scores))\n",
"print(mean_score(scores))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, we get an average accuracy of 0.940."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tuning the algorithm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are going to tune the algorithm, and calculate which is the best value for the k hyperparameter."
2016-03-15 12:55:14 +00:00
]
},
{
"cell_type": "code",
2021-02-27 19:50:03 +00:00
"execution_count": null,
"metadata": {},
"outputs": [],
2016-03-15 12:55:14 +00:00
"source": [
"k_range = range(1, 21)\n",
"accuracy = []\n",
"for k in k_range:\n",
" m = KNeighborsClassifier(k)\n",
" m.fit(x_train, y_train)\n",
" y_test_pred = m.predict(x_test)\n",
" accuracy.append(metrics.accuracy_score(y_test, y_test_pred))\n",
"plt.plot(k_range, accuracy)\n",
"plt.xlabel('k value')\n",
"plt.ylabel('Accuracy')\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The result is very dependent of the input data. Execute again the train_test_split and test again how the result changes with k."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [KNeighborsClassifier API scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)\n",
2022-02-21 12:26:24 +00:00
"* [Learning scikit-learn: Machine Learning in Python](https://learning.oreilly.com/library/view/scikit-learn-machine/9781788833479/), Raúl Garreta; Guillermo Moncecchi, Packt Publishing, 2013.\n"
2016-03-15 12:55:14 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
2019-02-28 10:32:00 +00:00
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
2016-03-15 12:55:14 +00:00
]
}
],
"metadata": {
2021-02-27 19:50:03 +00:00
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
2016-03-15 12:55:14 +00:00
"kernelspec": {
2022-02-21 12:26:24 +00:00
"display_name": "Python 3 (ipykernel)",
2016-03-15 12:55:14 +00:00
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
2022-02-21 12:26:24 +00:00
"version": "3.8.12"
2019-02-28 10:32:00 +00:00
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
2016-03-15 12:55:14 +00:00
}
},
"nbformat": 4,
"nbformat_minor": 1
2016-03-15 12:55:14 +00:00
}