mirror of https://github.com/gsi-upm/sitc synced 2024-11-14 10:32:29 +00:00
sitc/ml1/2_3_1_Advanced_Visualisation.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Advanced Visualisation](#Advanced-Visualisation)\n",
"* [Install seaborn](#Install-seaborn)\n",
"* [Transform Data into Dataframe](#Transform-Data-into-Dataframe)\n",
"* [Visualisation with seaborn](#Visualisation-with-seaborn)\n",
"* [References](#References)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Visualisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous notebook we developed plots with the [matplotlib](http://matplotlib.org/) plotting library.\n",
"\n",
"This notebook introduces another plotting library, [**seaborn**](https://stanford.edu/~mwaskom/software/seaborn/), which provides advanced facilities for data visualization.\n",
"\n",
"*Seaborn* is a library for making attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
"\n",
"*Seaborn* works best with *DataFrames* (a structure created with the library *pandas*), although many of its functions also accept arrays and lists."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the *seaborn* package, for example with `conda install seaborn`."
]
},
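{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch for checking the installation, assuming the package was installed into the same environment that runs this notebook: import it and print its version. If the import fails, the kernel is probably using a different environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check that seaborn can be imported and show the installed version\n",
"import seaborn as sns\n",
"\n",
"print(sns.__version__)"
]
},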
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform Data into Dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Seaborn* works most conveniently when the data is represented as a *DataFrame* object from the library *pandas*.\n",
"\n",
"A *DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. We will not go into the details of DataFrames in this session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"\n",
"\n",
"# iris data set from scikit learn (it is a Bunch object)\n",
"iris = datasets.load_iris()\n",
"\n",
"# transform into dataframe\n",
"iris_df = DataFrame(iris.data)\n",
"iris_df.columns = iris.feature_names\n",
"\n",
"iris_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris_df['species'] = iris.target\n",
"iris_df.head()"
]
},
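{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `species` column holds the integer class codes (0, 1, 2) from `iris.target`. As a sketch, assuming you also want the species names, the codes can be looked up in `iris.target_names`. The variable name `species_names` is a hypothetical choice for illustration and is not used in the rest of the notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# iris.target_names is an array of class names; indexing it with the\n",
"# integer codes in iris.target yields one name per observation\n",
"species_names = pd.Series(iris.target_names[iris.target], name=\"species_name\")\n",
"\n",
"# Count the observations per species\n",
"species_names.value_counts()"
]
},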
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualisation with seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following examples are taken from [a kaggle tutorial](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook) and [the seaborn tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html).\n",
"\n",
"To plot multiple pairwise bivariate distributions in a dataset, you can use the *pairplot()* function or the *PairGrid* class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scatterplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A **scatterplot matrix** (*matriz de diagramas de dispersión*) presents every pairwise relationship between a set of variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"sns.set(color_codes=True)\n",
"\n",
"# if matplotlib is not set inline, you will not see plots\n",
"%matplotlib inline \n",
"\n",
"sns.pairplot(iris_df)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PairGrid"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**PairGrid** allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# PairGrid\n",
"g = sns.PairGrid(iris_df)\n",
"g.map(plt.scatter);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A very common way to use this plot is to color the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
"\n",
"We are going to color each class, so that we can easily identify **clustering** and **linear relationships**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df, hue=\"species\")\n",
"g.map_diag(plt.hist)\n",
"g.map_offdiag(plt.scatter)\n",
"#names = {i: name for i,name in enumerate(iris.target_names)}\n",
"#g.add_legend(legend_data=names)\n",
"g.add_legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default every numeric column in the dataset is used, but you can focus on particular relationships if you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df, vars=['sepal length (cm)', 'sepal width (cm)'], hue=\"species\")\n",
"g.map(plt.scatter);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df)\n",
"g.map_upper(plt.scatter)\n",
"g.map_lower(sns.kdeplot, cmap=\"Blues_d\")\n",
"g.map_diag(sns.kdeplot, lw=3, legend=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pairplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PairGrid is flexible, but to take a quick look at a dataset, it may be easier to use pairplot(). This function uses scatterplots and histograms by default, although other kinds are available (for example, regression plots on the off-diagonals and KDEs on the diagonal)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(iris_df, hue=\"species\", height=2.5);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.pairplot(iris_df, hue=\"species\", palette=\"Set2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Box Plots and Violin Plots"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Box plots**](https://en.wikipedia.org/wiki/Box_plot) (*diagramas de caja*) are a convenient way of graphically depicting groups of numerical data through their quartiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We can look at an individual feature in Seaborn through a boxplot\n",
"sns.boxplot(x=\"species\", y=\"sepal length (cm)\", data=iris_df)"
]
},
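{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect the plot with the numbers behind it, the quartiles that each box summarises can be computed directly with *pandas*. This is only a sketch that reuses the `iris_df` DataFrame built above; the quartile levels 0.25, 0.5 and 0.75 correspond to the lower edge, the median line and the upper edge of each box."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quartiles of sepal length for each species (the values the boxes depict)\n",
"quartiles = iris_df.groupby(\"species\")[\"sepal length (cm)\"].quantile([0.25, 0.5, 0.75])\n",
"\n",
"# One row per species, one column per quartile level\n",
"quartiles.unstack()"
]
},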
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One way we can extend this plot is adding a layer of individual points on top of\n",
"# it through Seaborn's stripplot\n",
"# \n",
"# We'll use jitter=True so that all the points don't fall in single vertical lines\n",
"# above the species\n",
"#\n",
"# Saving the resulting axes as ax each time causes the resulting plot to be shown\n",
"# on top of the previous axes\n",
"ax = sns.boxplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)\n",
"ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"gray\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Violin plots**](https://en.wikipedia.org/wiki/Violin_plot) (*diagramas de violín*) are a method of plotting numeric data. A violin plot is similar to a box plot, with a rotated kernel density plot added on each side. In other words, it is a histogram (or more often a smoothed variant such as a kernel density estimate) turned on its side and mirrored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A violin plot combines the benefits of the previous two plots and simplifies them\n",
"# Denser regions of the data are fatter, and sparser ones thinner, in a violin plot\n",
"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kernel Density Estimation (KDE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another useful representation is the [Kernel density estimation (KDE)](https://en.wikipedia.org/wiki/Kernel_density_estimation) plot. KDE is a non-parametric way to estimate the probability density function of a random variable. The kdeplot represents the shape of a distribution. Like the histogram, the KDE plot encodes the density of observations on one axis with height along the other axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n",
"# which creates and visualizes a kernel density estimate of the underlying feature\n",
"sns.FacetGrid(iris_df, hue=\"species\", height=6) \\\n",
" .map(sns.kdeplot, \"petal length (cm)\") \\\n",
" .add_legend()"
]
},
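{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what the estimate looks like outside *seaborn*, the density can also be computed with `scipy.stats.gaussian_kde` and drawn with *matplotlib*. This assumes *scipy* is installed. The 200-point grid is an arbitrary choice, the bandwidth is the *scipy* default, and the estimate is computed for all species together, so the curve need not match the seaborn plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import gaussian_kde\n",
"from sklearn import datasets\n",
"\n",
"# Petal length is the third feature (column index 2) of the iris data\n",
"petal_length = datasets.load_iris().data[:, 2]\n",
"\n",
"# Fit a Gaussian KDE and evaluate it on an evenly spaced grid\n",
"kde = gaussian_kde(petal_length)\n",
"grid = np.linspace(petal_length.min(), petal_length.max(), 200)\n",
"\n",
"plt.plot(grid, kde(grid))\n",
"plt.xlabel(\"petal length (cm)\")\n",
"plt.ylabel(\"estimated density\")"
]
},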
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Choosing the right visualisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the data, we can choose the visualisation that suits it best. The following [diagram](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/) guides this selection.\n",
"\n",
"\n",
"![](files/images/data-chart-type.png \"Graphs\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
"* [Iris dataset visualisation notebook](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook)\n",
"* [Tutorial plotting with Seaborn](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html)\n",
"* [Choose the Right Chart Type for your Data](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}