sitc/ml1/2_3_1_Advanced_Visualisation.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [Advanced Visualisation](#Advanced-Visualisation)\n",
    "* [Install seaborn](#Install-seaborn)\n",
    "* [Transform Data into Dataframe](#Transform-Data-into-Dataframe)\n",
    "* [Visualisation with seaborn](#Visualisation-with-seaborn)\n",
    "* [References](#References)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Advanced Visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the previous notebook we developed plots with the [matplotlib](http://matplotlib.org/) plotting library.\n",
    "\n",
    "This notebook introduces another plotting library, [**seaborn**](https://stanford.edu/~mwaskom/software/seaborn/),  which provides advanced facilities for data visualization.\n",
    "\n",
    "*Seaborn* is a library for making attractive and informative statistical graphics in Python. It is built on top of *matplotlib* and tightly integrated with the *PyData* stack, including support for *numpy* and *pandas* data structures and statistical routines from *scipy* and *statsmodels*.\n",
    "\n",
    "*Seaborn* requires its input to be *DataFrames* (a structure created with the library *pandas*)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Install seaborn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should install the SeaBorn package. Use `conda install seaborn`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform Data into Dataframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Seaborn* requires that data is represented as a *DataFrame* object from the library *pandas*. \n",
    "\n",
    "A *DataFrame* is a 2-dimensional labeled data structure with columns of potentially different types. We will not go into the details of DataFrames in this session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pandas import DataFrame\n",
    "from sklearn import datasets\n",
    "\n",
    "\n",
    "# iris data set from scikit learn (it is a Bunch object)\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "# transform into dataframe\n",
    "iris_df = DataFrame(iris.data)\n",
    "iris_df.columns = iris.feature_names\n",
    "\n",
    "iris_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iris_df['species'] = iris.target\n",
    "iris_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualisation with seaborn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following examples are taken from [a kaggle tutorial](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook) and [the seaborn tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html).\n",
    "\n",
    "To plot multiple pairwise bivariate distributions in a dataset, you can use the *pairplot()* function and *PairGrid()*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scatterplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A **scatterplot matrix** (*matriz de diagramas de dispersión*) presents every pairwise relationship between a set of variables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "sns.set(color_codes=True)\n",
    "\n",
    "# if matplotlib is not set inline, you will not see plots\n",
    "%matplotlib inline \n",
    "\n",
    "sns.pairplot(iris_df)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## PairGrid"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**PairGrid** allows you to quickly draw a grid of small subplots using the same plot type to visualize data in each. In a PairGrid, each row and column is assigned to a different variable, so the resulting plot shows each pairwise relationship in the dataset. This style of plot is sometimes called a “scatterplot matrix”, as this is the most common way to show each relationship"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# PairGrid\n",
    "g = sns.PairGrid(iris_df)\n",
    "g.map(plt.scatter);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A very common way to use this plot colors the observations by a separate categorical variable. For example, the iris dataset has four measurements for each of the three different species of iris flowers.\n",
    "\n",
    "We are going to color each class, so that we can easily identify **clustering** and **linear relationships**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.PairGrid(iris_df, hue=\"species\")\n",
    "g.map_diag(plt.hist)\n",
    "g.map_offdiag(plt.scatter)\n",
    "#names = {i: name for i,name in enumerate(iris.target_names)}\n",
    "#g.add_legend(legend_data=names)\n",
    "g.add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default every numeric column in the dataset is used, but you can focus on particular relationships if you want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.PairGrid(iris_df, vars=['sepal length (cm)', 'sepal width (cm)'], hue=\"species\")\n",
    "g.map(plt.scatter);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It’s also possible to use a different function in the upper and lower triangles to emphasize different aspects of the relationship."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.PairGrid(iris_df)\n",
    "g.map_upper(plt.scatter)\n",
    "g.map_lower(sns.kdeplot, cmap=\"Blues_d\")\n",
    "g.map_diag(sns.kdeplot, lw=3, legend=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pairplot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PairGrid is flexible, but to take a quick look at a dataset, it may be easier to use pairplot(). This function uses scatterplots and histograms by default, although a few other kinds will be added (currently, you can also plot regression plots on the off-diagonals and KDEs on the diagonal)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.pairplot(iris_df, hue=\"species\", height=2.5);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also control the aesthetics of the plot with keyword arguments, and it returns the PairGrid instance for further tweaking."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "g = sns.pairplot(iris_df, hue=\"species\", palette=\"Set2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Violin Plots (boxplot)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[**Box plots** or **boxplot** ](https://en.wikipedia.org/wiki/Box_plot) (*diagramas de caja*) are a convenient way of graphically depicting groups of numerical data through their quartiles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can look at an individual feature in Seaborn through a boxplot\n",
    "sns.boxplot(x=\"species\", y=\"sepal length (cm)\", data=iris_df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# One way we can extend this plot is adding a layer of individual points on top of\n",
    "# it through Seaborn's striplot\n",
    "# \n",
    "# We'll use jitter=True so that all the points don't fall in single vertical lines\n",
    "# above the species\n",
    "#\n",
    "# Saving the resulting axes as ax each time causes the resulting plot to be shown\n",
    "# on top of the previous axes\n",
    "ax = sns.boxplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)\n",
    "ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"gray\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[**Violin plots**](https://en.wikipedia.org/wiki/Violin_plot) (*diagramas de violín*)  are a method of plotting numeric data. A violin plot is a box plot with a rotated kernel density plot on each side.  A violin plot is just a histogram (or more often a smoothed variant like a kernel density) turned on its side and mirrored."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A violin plot combines the benefits of the previous two plots and simplifies them\n",
    "# Denser regions of the data are fatter, and sparser thiner in a violin plot\n",
    "sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, size=6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Kernel Density Estimation (KDE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another useful representation is the [Kernel density estimation (KDE)](https://en.wikipedia.org/wiki/Kernel_density_estimation) plot. KDE is a non-parametric way to estimate the probability density function of a random variable. The kdeplot represents the shape of a distribution. Like the histogram, the KDE plots encodes the density of observations on one axis with height along the other axis:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# A final seaborn plot useful for looking at univariate relations is the kdeplot,\n",
    "# which creates and visualizes a kernel density estimate of the underlying feature\n",
    "sns.FacetGrid(iris_df, hue=\"species\", height=6) \\\n",
    "   .map(sns.kdeplot, \"petal length (cm)\") \\\n",
    "   .add_legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Choosing the right visualisation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Depending on the data, we can choose which visualisation suits better. the following [diagram](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/) guides this selection.\n",
    "\n",
    "\n",
    "![](files/images/data-chart-type.png \"Graphs\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
    "* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
    "* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
    "* [Matplotlib web page](http://matplotlib.org/index.html)\n",
    "* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
    "* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)\n",
    "* [Iris dataset visualisation notebook](https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations/notebook)\n",
    "* [Tutorial plotting with Seaborn](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html)\n",
    "* [Choose the Right Chart Type for your Data](http://www.labnol.org/software/find-right-chart-type-for-your-data/6523/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}