"![](files/images/EscUpmPolit_p.gif \"UPM\")"
"# Course Notes for Learning Intelligent Systems"
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
"# Table of Contents\n",
"* [Advanced Visualisation](#Advanced-Visualisation)\n",
"* [Install seaborn](#Install-seaborn)\n",
"* [Transform Data into Dataframe](#Transform-Data-into-Dataframe)\n",
"* [Visualisation with seaborn](#Visualisation-with-seaborn)\n",
"* [References](#References)\n"
"# Advanced Visualisation"
"In the previous notebook we developed plots with the [matplotlib]( plotting library.\n",
8 years ago
"This notebook introduces another plotting library, [**seaborn**](, which provides advanced facilities for data visualization.\n",
"*Seaborn* is a library for making attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
"*Seaborn* requires its input to be *DataFrames* (a structure created with the library *pandas*)."
"## Install seaborn"
"You should install the SeaBorn package. Use `conda install seaborn`."
"## Transform Data into Dataframe"
"*Seaborn* requires that data is represented as a *DataFrame* object from the library *pandas*. \n",
"A *DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. We will not go into the details of DataFrames in this session."
"from pandas import DataFrame\n",
"from sklearn import datasets\n",
"# iris data set from scikit learn (it is a Bunch object)\n",
"iris = datasets.load_iris()\n",
"# transform into dataframe\n",
"iris_df = DataFrame(\n",
"iris_df.columns = iris.feature_names\n",
8 years ago
"iris_df['species'] =\n",
"# Visualisation with seaborn"
"The following examples are taken from [a kaggle tutorial]( and [the seaborn tutorial](\n",
8 years ago
"To plot multiple pairwise bivariate distributions in a dataset, you can use the *pairplot()* function and *PairGrid()*."
"## Scatterplot"
"A **scatterplot matrix** (*matriz de diagramas de dispersión*) presents every pairwise relationship between a set of variables."
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"# if matplotlib is not set inline, you will not see plots\n",
"%matplotlib inline \n",
"## PairGrid"
"**PairGrid** allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship."
"# PairGrid\n",
"g = sns.PairGrid(iris_df)\n",
"A very common way to use this plot colors the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
"We are going to color each class, so that we can easily identify **clustering** and **linear relationships**."
"g = sns.PairGrid(iris_df, hue=\"species\")\n",
"#names = {i: name for i,name in enumerate(iris.target_names)}\n",
"cell_type": "markdown",
"By default every numeric column in the dataset is used, but you can focus on particular relationships if you want."
"g = sns.PairGrid(iris_df, vars=['sepal length (cm)', 'sepal width (cm)'], hue=\"species\")\n",
"Its also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship."
8 years ago
"g = sns.PairGrid(iris_df)\n",
"g.map_lower(sns.kdeplot, cmap=\"Blues_d\")\n",
"g.map_diag(sns.kdeplot, lw=3, legend=True);"
"## Pairplot"
"cell_type": "markdown",
"PairGrid is flexible, but to take a quick look at a dataset, it may be easier to use pairplot(). This function uses scatterplots and histograms by default, although a few other kinds will be added (currently, you can also plot regression plots on the off-diagonals and KDEs on the diagonal)."
"sns.pairplot(iris_df, hue=\"species\", height=2.5);"
"You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking."
"g = sns.pairplot(iris_df, hue=\"species\", palette=\"Set2\")"
"## Violin Plots (boxplot)"
"**Box plots** (*diagramas de caja*) are a convenient way of graphically depicting groups of numerical data through their quartiles."
"# We can look at an individual feature in Seaborn through a boxplot\n",
"sns.boxplot(x=\"species\", y=\"sepal length (cm)\", data=iris_df)"
"# One way we can extend this plot is adding a layer of individual points on top of\n",
"# it through Seaborn's striplot\n",
"# \n",
"# We'll use jitter=True so that all the points don't fall in single vertical lines\n",
"# above the species\n",
"# Saving the resulting axes as ax each time causes the resulting plot to be shown\n",
"# on top of the previous axes\n",
"ax = sns.boxplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)\n",
"ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"gray\")"
"[**Violin plots**]( (*diagramas de violín*) are a method of plotting numeric data. A violin plot is a box plot with a rotated kernel density plot on each side. A violin plot is just a histogram (or more often a smoothed variant like a kernel density) turned on its side and mirrored."
8 years ago
"# A violin plot combines the benefits of the previous two plots and simplifies them\n",
"# Denser regions of the data are fatter, and sparser thiner in a violin plot\n",
"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, size=6)"
"## Kernel Density Estimation (KDE)"
"Another useful representation is the [Kernel density estimation (KDE)]( plot. KDE is a non-parametric way to estimate the probability density function of a random variable. The kdeplot represents the shape of a distribution. Like the histogram, the KDE plots encodes the density of observations on one axis with height along the other axis:"
8 years ago
8 years ago
"# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n",
"# which creates and visualizes a kernel density estimate of the underlying feature\n",
"sns.FacetGrid(iris_df, hue=\"species\", height=6) \\\n",
" .map(sns.kdeplot, \"petal length (cm)\") \\\n",
" .add_legend()"
"# Choosing the right visualisation"
"Depending on the data, we can choose which visualisation suits better. the following [diagram]( guides this selection.\n",
"![](files/images/data-chart-type.png \"Graphs\")"
"## References"
"* [Feature selection](\n",
"* [Classification probability](\n",
"* [Mastering Pandas](, Femi Anthony, Packt Publishing, 2015.\n",
8 years ago
"* [Matplotlib web page](\n",
"* [Using matlibplot in IPython](\n",
"* [Seaborn Tutorial](\n",
"* [Iris dataset visualisation notebook](\n",
"* [Tutorial plotting with Seaborn](\n",
"* [Choose the Right Chart Type for your Data]("
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license]( \n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
