sitc/ml2/3_2_Pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [Introduction to Pandas](#Introduction-to-Pandas)\n",
    "* [Series](#Series)\n",
    "* [DataFrame](#DataFrame)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Pandas\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook provides an overview of the *pandas* library. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.\n",
    "\n",
    "The main advantage of *Pandas* is that provides extensive facilities for grouping, merging and querying  pandas data structures, and also includes facilities for time series analysis, as well as i/o and visualisation facilities.\n",
    "\n",
    "Pandas in built on top of *NumPy*, so we will have usually to import both libraries.\n",
    "\n",
    "Pandas provides two main data structures:\n",
    "* **Series** is a one dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).. It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index.\n",
    "* **DataFrame** is a two dimensional labelled object with columns of potentially different types. It is similar to a database table, or a spreadsheet. It can be seen as a dictionary of Series that share the same index.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are not going to use Series objects directly as frequently as DataFrames. Here we provide a short introduction"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "# create series object from an array\n",
    "s = Series([5, 10, 15])\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see each value has an associated label starting with 0 if no index is specified when the Series object is created. \n",
    "\n",
    "It is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the indexes are the keys of the dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = {'a': 5, 'b': 10, 'c': 15}\n",
    "s = Series(d)\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can get the list of indexes\n",
    "s.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# and the values\n",
    "s.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another option is to create the Series object from two lists, for  values and indexes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Series with population in 2015 of most populous cities in Spain\n",
    "s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', \n",
    "                                                                      'Zaragoza', 'Malaga'])\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Population of Madrid\n",
    "s['Madrid']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing and slicing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Until now, we have not seen any advantage in using Panda Series. We are going to show some examples of their possibilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Boolean condition\n",
    "s > 1000000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cities with population greater than 1.000.000\n",
    "s[s > 1000000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe that (s > 1000000) returns a Series object. We can use this Boolean vector as a filter to obtain a slice of the original series containing only the elements where the filter evaluates to True. The original Series s is not modified. This selection is called *boolean indexing*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cities with population greater than the mean\n",
    "s[s > s.mean()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Cities with population greater than the median\n",
    "s[s > s.median()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Check cities with a population greater than 700.000\n",
    "s > 700000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# List cities with a population greater than 700.000\n",
    "s[s > 700000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Another way to write the same boolean indexing selection\n",
    "bigger_than_700000 = s > 700000\n",
    "bigger_than_700000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Cities with population > 700000\n",
    "s[bigger_than_700000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operations on series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also carry out other mathematical operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Divide population by 2\n",
    "s / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the average population\n",
    "s.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get the highest population\n",
    "s.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Item assignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also change values directly or based on a condition. You can consult additional features in the manual."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Change population of one city\n",
    "s['Madrid'] = 3320000\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Increase by 10% cities with population greater than 700000\n",
    "# Since pandas 3.X use where\n",
    "s = s.where(s <= 700000, s * 1.1)\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we said previously, **DataFrames** are two-dimensional data structures. You can see like a dict of Series that share the index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We are going to create a DataFrame from a dict of Series\n",
    "d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n",
    "    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}\n",
    "df = DataFrame(d)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this dataframe, the *indexes* (row labels) are *a*, *b*, *c* and *d* and the *columns* (column labels) are *one* and *two*.\n",
    "\n",
    "We see that the resulting DataFrame is the union of indexes, and missing values are included as NaN (to write this value we will use *np.nan*).\n",
    "\n",
    "If we specify an index, the dictionary is filtered."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We can filter\n",
    "df = DataFrame(d, index=['d', 'b', 'a'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another option is to use the constructor with *index* and *columns*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next notebook we are going to learn more about dataframes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Pandas](http://pandas.pydata.org/)\n",
    "* [Pandas. Introduction to Data Structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)\n",
    "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
    "* [Boolean Operators in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#boolean-operators)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}