mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-04 23:21:42 +00:00
292 lines
8.0 KiB
Plaintext
292 lines
8.0 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Visualisation](#Visualisation)\n",
|
|
"* [Exploratory visualisation](#Exploratory-visualisation)\n",
|
|
"* [References](#References)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Visualisation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The goal of this notebook is to learn how to analyse a dataset. We will cover other tasks such as cleaning or munging (changing the format) the dataset in other sessions."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exploratory visualisation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This section covers different ways to inspect the distribution of samples per feature.\n",
|
|
"\n",
|
|
"First of all, let's see how many samples of each class we have, using a [histogram](https://en.wikipedia.org/wiki/Histogram). \n",
|
|
"\n",
|
|
"A histogram is a graphical representation of the distribution of numerical data. It is an estimation of the probability distribution of a continuous variable (quantitative variable). \n",
|
|
"\n",
|
|
"For building a histogram, we need first to 'bin' the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval. \n",
|
|
"\n",
|
|
"In our case, since the values are not continuous and we have only three values, we do not need to bin them."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn import datasets\n",
|
|
"iris = datasets.load_iris()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# library for displaying plots\n",
|
|
"import matplotlib.pyplot as plt\n",
|
|
"# display plots in the notebook\n",
|
|
"# if this is not set, you will not see the graphic here\n",
|
|
"%matplotlib inline"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Plot histogram, the default is 10 bins\n",
|
|
"plt.hist(iris.target)\n",
|
|
"plt.ylabel('Number of instances')\n",
|
|
"plt.xlabel('iris class')\n",
|
|
"plt.xticks(range(len(iris.target_names)), iris.target_names);"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As can be seen, we have the same distribution of samples for every class.\n",
|
|
"The next step is to see the distribution of the features"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# This is a reminder of the name and index of each feature\n",
|
|
"print(iris.feature_names)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# A reminder of feature names and indexes\n",
|
|
"print(iris.target_names)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A [**scatter plot**](https://en.wikipedia.org/wiki/Scatter_plot) (*gráfico de dispersión*) displays the value of typically two variables for a set of data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# scatter makes a plot of x vs y\n",
|
|
"plt.scatter(iris.data[:,0], iris.target)\n",
|
|
"plt.yticks(range(len(iris.target_names)), iris.target_names);\n",
|
|
"plt.xlabel(iris.feature_names[0])\n",
|
|
"plt.ylabel('iris class')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Plot the distribution of the dataset\n",
|
|
"names = set(iris.target)\n",
|
|
"\n",
|
|
"# x and y are all the samples from column 0 (sepal_length) and 1 (sepal_width) respectively\n",
|
|
"x,y = iris.data[:,0], iris.data[:,1]\n",
|
|
"\n",
|
|
"for name in names:\n",
|
|
" cond = iris.target == name\n",
|
|
" plt.plot(x[cond], y[cond], linestyle='none', marker='o', label=iris.target_names[name])\n",
|
|
"\n",
|
|
"plt.legend(numpoints=1)\n",
|
|
"plt.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"As we can see, the Setosa class seems to be linearly separable with these two features.\n",
|
|
"\n",
|
|
"Another nice visualisation is given below."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"scrolled": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"x_index = 0\n",
|
|
"y_index = 1\n",
|
|
"formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])\n",
|
|
"plt.scatter(iris.data[:, x_index], iris.data[:, y_index], s=40,\n",
|
|
"c=iris.target)\n",
|
|
"plt.colorbar(ticks=[0, 1, 2], format=formatter)\n",
|
|
"plt.xlabel(iris.feature_names[x_index])\n",
|
|
"plt.ylabel(iris.feature_names[y_index]);"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"This alternate visualisation also suggests that the Setosa class seems to be linearly separable.\n",
|
|
"\n",
|
|
"Students interested in practicing advanced visualisations can check [Advanced visualisation notebook](2_3_1_Advanced_Visualisation.ipynb).\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
|
|
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
|
|
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
|
|
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
|
|
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
|
|
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
|
|
"* [Iris dataset visualisation notebook](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook)\n",
|
|
"* [Tutorial plotting with Seaborn](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence\n",
|
|
"\n",
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.5"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 1
|
|
}
|