{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Visualisation](#Visualisation)\n",
    "* [Exploratory visualisation](#Exploratory-visualisation)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to analyse a dataset through visualisation. Other tasks, such as cleaning or munging (changing the format of) the dataset, are covered in other sessions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section covers different ways to inspect the distribution of samples per class and per feature.\n",
    "\n",
    "First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram).\n",
    "\n",
    "A histogram is a graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous (quantitative) variable.\n",
    "\n",
    "To build a histogram, we first 'bin' the range of values, that is, divide the entire range into a series of intervals, and then count how many values fall into each interval.\n",
    "\n",
    "In our case, the target is not continuous: it takes only three values (0, 1 and 2), one per class. No real binning is needed, since each bar simply counts the samples of one class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# library for displaying plots\n",
    "import matplotlib.pyplot as plt\n",
    "# display plots in the notebook\n",
    "# if this is not set, you will not see the graphic here\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
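    "# Plot a histogram with one bin per class. The default of 10 bins would\n",
    "# leave the bars misaligned with the class ticks, so we centre the bin\n",
    "# edges on the class indices 0, 1 and 2.\n",
    "plt.hist(iris.target, bins=[-0.5, 0.5, 1.5, 2.5], rwidth=0.8)\n",
    "plt.ylabel('Number of instances')\n",
    "plt.xlabel('iris class')\n",
    "plt.xticks(range(len(iris.target_names)), iris.target_names);"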
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen, the dataset is balanced: every class has the same number of samples (50).\n",
    "The next step is to look at the distribution of the features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a reminder of the name and index of each feature\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A reminder of the name and index of each class (target)\n",
    "print(iris.target_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A [**scatter plot**](https://en.wikipedia.org/wiki/Scatter_plot) (in Spanish, *gráfico de dispersión*) displays the values of, typically, two variables for a set of data, one variable on each axis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# scatter plots x (sepal length) against y (class index), one point per sample\n",
    "plt.scatter(iris.data[:,0], iris.target)\n",
    "plt.yticks(range(len(iris.target_names)), iris.target_names)\n",
    "plt.xlabel(iris.feature_names[0])\n",
    "plt.ylabel('iris class');"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot sepal length against sepal width, one colour per class\n",
    "class_ids = sorted(set(iris.target))\n",
    "\n",
    "# x and y are all the samples from column 0 (sepal length) and 1 (sepal width) respectively\n",
    "x, y = iris.data[:, 0], iris.data[:, 1]\n",
    "\n",
    "# To understand this code better, see what happens when you replace class_id by 0, 1 or 2 in the line\n",
    "#   cond = iris.target == class_id\n",
    "# cond is a boolean mask that selects the samples of that class.\n",
    "for class_id in class_ids:\n",
    "    cond = iris.target == class_id\n",
    "    plt.plot(x[cond], y[cond], linestyle='none', marker='o', label=iris.target_names[class_id])\n",
    "\n",
    "plt.xlabel(iris.feature_names[0])\n",
    "plt.ylabel(iris.feature_names[1])\n",
    "plt.legend(numpoints=1)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
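    "As we can see, the Setosa class seems to be linearly separable from the other two classes using these two features, whereas Versicolor and Virginica overlap.\n",
    "\n",
    "The same data can also be drawn with a single `scatter` call, using a colour bar to identify each class."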
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x_index = 0\n",
    "y_index = 1\n",
    "# the formatter labels each colour bar tick with the name of its class\n",
    "formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n",
    "plt.scatter(iris.data[:, x_index], iris.data[:, y_index], s=40,\n",
    "            c=iris.target)\n",
    "plt.colorbar(ticks=[0, 1, 2], format=formatter)\n",
    "plt.xlabel(iris.feature_names[x_index])\n",
    "plt.ylabel(iris.feature_names[y_index]);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This alternative visualisation also suggests that the Setosa class is linearly separable.\n",
    "\n",
    "Students interested in practising advanced visualisations can check the [Advanced visualisation notebook](2_3_1_Advanced_Visualisation.ipynb)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
    "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
    "* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
    "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
    "* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
    "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
    "* [Iris dataset visualisation notebook](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook)\n",
    "* [Tutorial plotting with Seaborn](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}