mirror of
https://github.com/gsi-upm/sitc
synced 2024-12-22 11:48:12 +00:00
470 lines
14 KiB
Plaintext
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"* [Advanced Visualisation](#Advanced-Visualisation)\n",
"* [Install seaborn](#Install-seaborn)\n",
"* [Transform Data into Dataframe](#Transform-Data-into-Dataframe)\n",
"* [Visualisation with seaborn](#Visualisation-with-seaborn)\n",
"* [References](#References)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Advanced Visualisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous notebook we developed plots with the [matplotlib](http://matplotlib.org/) plotting library.\n",
"\n",
"This notebook introduces another plotting library, [**seaborn**](https://stanford.edu/~mwaskom/software/seaborn/), which provides advanced facilities for data visualization.\n",
"\n",
"*Seaborn* is a library for making attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
"\n",
"*Seaborn* works most naturally with *DataFrames* (a structure provided by the *pandas* library), so we will first load our data into one."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install the *seaborn* package before running this notebook, for example with `conda install seaborn`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform Data into Dataframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *seaborn* functions used in this notebook take their data as a *DataFrame* object from the *pandas* library.\n",
"\n",
"A *DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. We will not go into the details of DataFrames in this session."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"\n",
"\n",
"# iris data set from scikit learn (it is a Bunch object)\n",
"iris = datasets.load_iris()\n",
"\n",
"# transform into dataframe\n",
"iris_df = DataFrame(iris.data)\n",
"iris_df.columns = iris.feature_names\n",
"\n",
"iris_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris_df['species'] = iris.target\n",
"iris_df.head()"
]
},
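{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `species` column holds integer class codes (0, 1, 2). As a minimal sketch, the cell below builds a separate DataFrame that also carries the species names from `iris.target_names`, which can make plot legends easier to read. The names `named_df` and `species_name` are assumptions made for this illustration; the rest of the notebook keeps using `iris_df` and `species`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"\n",
"# Work on a separate DataFrame so that iris_df is left unchanged\n",
"named_df = DataFrame(iris.data, columns=iris.feature_names)\n",
"\n",
"# iris.target holds integer codes; indexing target_names with them yields the names\n",
"# ('species_name' is a hypothetical column name used only in this sketch)\n",
"named_df['species_name'] = iris.target_names[iris.target]\n",
"\n",
"# Number of observations per species name\n",
"named_df['species_name'].value_counts()"
]
},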
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualisation with seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following examples are taken from [a kaggle tutorial](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook) and [the seaborn tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html).\n",
"\n",
"To plot multiple pairwise bivariate distributions in a dataset, you can use the *pairplot()* function or the *PairGrid* class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Scatterplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A **scatterplot matrix** (*matriz de diagramas de dispersión*) presents every pairwise relationship between a set of variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"sns.set(color_codes=True)\n",
"\n",
"# if matplotlib is not set inline, you will not see plots\n",
"%matplotlib inline \n",
"\n",
"sns.pairplot(iris_df)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PairGrid"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**PairGrid** allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# PairGrid\n",
"g = sns.PairGrid(iris_df)\n",
"g.map(plt.scatter);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A very common way to use this plot colors the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
"\n",
"We are going to color each class, so that we can easily identify **clustering** and **linear relationships**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df, hue=\"species\")\n",
"g.map_diag(plt.hist)\n",
"g.map_offdiag(plt.scatter)\n",
"#names = {i: name for i,name in enumerate(iris.target_names)}\n",
"#g.add_legend(legend_data=names)\n",
"g.add_legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default every numeric column in the dataset is used, but you can focus on particular relationships if you want."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df, vars=['sepal length (cm)', 'sepal width (cm)'], hue=\"species\")\n",
"g.map(plt.scatter);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It’s also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.PairGrid(iris_df)\n",
"g.map_upper(plt.scatter)\n",
"g.map_lower(sns.kdeplot, cmap=\"Blues_d\")\n",
"g.map_diag(sns.kdeplot, lw=3, legend=True);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pairplot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PairGrid is flexible, but to take a quick look at a dataset, it may be easier to use pairplot(). This function uses scatterplots and histograms by default; it can also draw regression plots on the off-diagonals and KDEs on the diagonal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(iris_df, hue=\"species\", height=2.5);"
]
},
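{
"cell_type": "markdown",
"metadata": {},
"source": [
"The regression and KDE variants of `pairplot()` are selected with its `kind` and `diag_kind` arguments. The cell below is a minimal sketch that reloads the data so that it runs on its own; the variable name `df` is an assumption of this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"df = DataFrame(iris.data, columns=iris.feature_names)\n",
"\n",
"# kind='reg' draws a scatterplot with a fitted regression line off the diagonal;\n",
"# diag_kind='kde' draws kernel density estimates on the diagonal\n",
"sns.pairplot(df, kind='reg', diag_kind='kde', height=2.5);"
]
},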
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.pairplot(iris_df, hue=\"species\", palette=\"Set2\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Violin Plots (boxplot)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Box plots**](https://en.wikipedia.org/wiki/Box_plot) (*diagramas de caja*) are a convenient way of graphically depicting groups of numerical data through their quartiles."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We can look at an individual feature in Seaborn through a boxplot\n",
"sns.boxplot(x=\"species\", y=\"sepal length (cm)\", data=iris_df)"
]
},
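{
"cell_type": "markdown",
"metadata": {},
"source": [
"The box in a box plot spans the first to the third quartile, with a line at the median. As a rough check, the same quartiles can be computed with *pandas*. This is only a sketch: it reloads the data so that the cell runs on its own, and the names `df` and `quartiles` are assumptions of the example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"df = DataFrame(iris.data, columns=iris.feature_names)\n",
"df['species'] = iris.target\n",
"\n",
"# First quartile, median and third quartile of sepal length for each species\n",
"quartiles = df.groupby('species')['sepal length (cm)'].quantile([0.25, 0.5, 0.75])\n",
"\n",
"# Reshape to one row per species and one column per quantile\n",
"quartiles.unstack()"
]
},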
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# One way to extend this plot is to add a layer of individual points on top of\n",
"# it with seaborn's stripplot.\n",
"#\n",
"# jitter=True spreads the points horizontally so that they do not all fall on a\n",
"# single vertical line above each species.\n",
"#\n",
"# Both calls draw on the current matplotlib axes, so the points are overlaid on\n",
"# the box plot.\n",
"ax = sns.boxplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)\n",
"ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"gray\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[**Violin plots**](https://en.wikipedia.org/wiki/Violin_plot) (*diagramas de violín*) are a method of plotting numeric data. A violin plot is similar to a box plot, with a rotated kernel density plot added on each side: in effect, a histogram (or, more often, a smoothed variant such as a kernel density estimate) turned on its side and mirrored."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A violin plot combines the benefits of the previous two plots and simplifies them.\n",
"# Denser regions of the data are fatter, and sparser regions are thinner.\n",
"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kernel Density Estimation (KDE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another useful representation is the [Kernel density estimation (KDE)](https://en.wikipedia.org/wiki/Kernel_density_estimation) plot. KDE is a non-parametric way to estimate the probability density function of a random variable. The kdeplot represents the shape of a distribution. Like the histogram, the KDE plot encodes the density of observations on one axis with height along the other axis:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A final seaborn plot useful for looking at univariate distributions is the kdeplot,\n",
"# which creates and visualizes a kernel density estimate of the underlying feature\n",
"sns.FacetGrid(iris_df, hue=\"species\", height=6) \\\n",
" .map(sns.kdeplot, \"petal length (cm)\") \\\n",
" .add_legend()"
]
},
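{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea of a kernel density estimate concrete, a Gaussian KDE can be written in a few lines of *numpy*: the estimate at each point is the average of normal densities centred on the observations. This is only a sketch under stated assumptions: it pools all species together, and the fixed bandwidth `h = 0.3` is an arbitrary choice, whereas seaborn selects its bandwidth automatically, so the curve is not expected to match a seaborn plot exactly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn import datasets\n",
"\n",
"iris = datasets.load_iris()\n",
"x = iris.data[:, 2]  # petal length (cm) is the third feature\n",
"\n",
"h = 0.3  # bandwidth: an arbitrary value assumed for this sketch\n",
"grid = np.linspace(x.min() - 1, x.max() + 1, 200)\n",
"\n",
"# Standard normal kernel evaluated for every (grid point, observation) pair\n",
"z = (grid[:, None] - x[None, :]) / h\n",
"kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)\n",
"\n",
"# The density estimate is the mean kernel value divided by the bandwidth\n",
"density = kernel.mean(axis=1) / h\n",
"\n",
"plt.plot(grid, density)\n",
"plt.xlabel('petal length (cm)')\n",
"plt.ylabel('estimated density');"
]
},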
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Choosing the right visualisation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Depending on the data, we can choose the visualisation that suits it best. The following [diagram](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/) guides this selection.\n",
"\n",
"\n",
"![](files/images/data-chart-type.png \"Graphs\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Mastering Pandas](https://learning.oreilly.com/library/view/mastering-pandas/9781789343236/), Femi Anthony, Packt Publishing, 2015.\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
"* [Iris dataset visualisation notebook](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook)\n",
"* [Tutorial plotting with Seaborn](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html)\n",
"* [Choose the Right Chart Type for your Data](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.12"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}