sitc/ml2/3_2_Pandas.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "* [Introduction to Pandas](#Introduction-to-Pandas)\n",
    "* [Series](#Series)\n",
    "* [DataFrame](#DataFrame)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Pandas\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook provides an overview of the *pandas* library. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.\n",
    "\n",
    "The main advantage of *Pandas* is that it provides extensive facilities for grouping, merging and querying pandas data structures. It also includes facilities for time series analysis, as well as I/O and visualisation.\n",
    "\n",
    "Pandas is built on top of *NumPy*, so we will usually import both libraries.\n",
    "\n",
    "Pandas provides two main data structures:\n",
    "* **Series** is a one-dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index label.\n",
    "* **DataFrame** is a two-dimensional labelled object with columns of potentially different types. It is similar to a database table or a spreadsheet. It can be seen as a dictionary of Series that share the same index.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will not use Series objects directly as often as DataFrames, so this section provides only a short introduction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     5\n",
       "1    10\n",
       "2    15\n",
       "dtype: int64"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from pandas import Series, DataFrame\n",
    "\n",
    "# Create a Series object from a list\n",
    "s = Series([5, 10, 15])\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each value has an associated label. If no index is specified when the Series object is created, the labels are consecutive integers starting at 0.\n",
    "\n",
    "A Series is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the index labels are the keys of the dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a     5\n",
       "b    10\n",
       "c    15\n",
       "dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "d = {'a': 5, 'b': 10, 'c': 15}\n",
    "s = Series(d)\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['a', 'b', 'c'], dtype='object')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We can get the index labels\n",
    "s.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 5, 10, 15])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and the values\n",
    "s.values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another option is to create the Series object from two lists: one for the values and one for the index labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "Valencia      786189\n",
       "Sevilla       693878\n",
       "Zaragoza      664953\n",
       "Malaga        569130\n",
       "dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Population in 2015 of the most populated cities in Spain\n",
    "s = Series([3141991, 1604555, 786189, 693878, 664953, 569130],\n",
    "           index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'])\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3141991"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Population of Madrid\n",
    "s['Madrid']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Indexing and slicing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So far, we have not seen any advantage in using pandas Series. The following examples show some of their possibilities."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid        True\n",
       "Barcelona     True\n",
       "Valencia     False\n",
       "Sevilla      False\n",
       "Zaragoza     False\n",
       "Malaga       False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Boolean condition\n",
    "s > 1000000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Cities with population greater than 1,000,000\n",
    "s[s > 1000000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe that `s > 1000000` returns a boolean Series with the same index as s. We can use this boolean vector as a filter to get a *slice* of the original Series that contains only the elements where the value of the filter is True. The original Series s is not modified. This kind of selection is called *boolean indexing*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Cities with population greater than the mean\n",
    "s[s > s.mean()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "Valencia      786189\n",
       "dtype: int64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Cities with population greater than the median\n",
    "s[s > s.median()]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid        True\n",
       "Barcelona     True\n",
       "Valencia      True\n",
       "Sevilla      False\n",
       "Zaragoza     False\n",
       "Malaga       False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Check which cities have a population greater than 700,000\n",
    "s > 700000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "Valencia      786189\n",
       "dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# List cities with a population greater than 700,000\n",
    "s[s > 700000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid        True\n",
       "Barcelona     True\n",
       "Valencia      True\n",
       "Sevilla      False\n",
       "Zaragoza     False\n",
       "Malaga       False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Another way to write the same boolean indexing selection\n",
    "bigger_than_700000 = s > 700000\n",
    "bigger_than_700000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3141991\n",
       "Barcelona    1604555\n",
       "Valencia      786189\n",
       "dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Cities with population greater than 700,000\n",
    "s[bigger_than_700000]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Operations on series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also apply arithmetic operations, which act on every element, and aggregation functions such as the mean or the maximum."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       1570995.5\n",
       "Barcelona     802277.5\n",
       "Valencia      393094.5\n",
       "Sevilla       346939.0\n",
       "Zaragoza      332476.5\n",
       "Malaga        284565.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Divide population by 2\n",
    "s / 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1243449.3333333333"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the average population\n",
    "s.mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3141991"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Get the highest population\n",
    "s.max()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Item assignment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also change values directly or based on a condition. You can consult additional features in the pandas documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3320000\n",
       "Barcelona    1604555\n",
       "Valencia      786189\n",
       "Sevilla       693878\n",
       "Zaragoza      664953\n",
       "Malaga        569130\n",
       "dtype: int64"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Change population of one city\n",
    "s['Madrid'] = 3320000\n",
    "s"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Madrid       3652000.0\n",
       "Barcelona    1765010.5\n",
       "Valencia      864807.9\n",
       "Sevilla       693878.0\n",
       "Zaragoza      664953.0\n",
       "Malaga        569130.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Increase by 10% the population of cities with more than 700,000 inhabitants\n",
    "s[s > 700000] = 1.1 * s[s > 700000]\n",
    "s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we said previously, **DataFrames** are two-dimensional data structures. You can think of a DataFrame as a dict of Series that share the same index."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c</th>\n",
       "      <td>3.0</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   one  two\n",
       "a  1.0  1.0\n",
       "b  2.0  2.0\n",
       "c  3.0  3.0\n",
       "d  NaN  4.0"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We are going to create a DataFrame from a dict of Series\n",
    "d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n",
    "    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}\n",
    "df = DataFrame(d)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this DataFrame, the *index* (row labels) contains *a*, *b*, *c* and *d*, and the *columns* (column labels) are *one* and *two*.\n",
    "\n",
    "The index of the resulting DataFrame is the union of the indexes of the two Series, and missing values are filled with NaN (to write this value ourselves, we use *np.nan*).\n",
    "\n",
    "If we specify an index, only the rows with those labels are kept, in the order given."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>one</th>\n",
       "      <th>two</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>2.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   one  two\n",
       "d  NaN  4.0\n",
       "b  2.0  2.0\n",
       "a  1.0  1.0"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Keep only the rows labelled d, b and a, in that order\n",
    "df = DataFrame(d, index=['d', 'b', 'a'])\n",
    "df"
   ]
  },
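  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another option is to pass both *index* and *columns* to the constructor. A column label that is not a key of the dictionary (here *three*) produces a column filled with NaN."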
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>two</th>\n",
       "      <th>three</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>d</th>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>b</th>\n",
       "      <td>2.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>a</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   two three\n",
       "d  4.0   NaN\n",
       "b  2.0   NaN\n",
       "a  1.0   NaN"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the next notebook we will learn more about DataFrames."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Pandas](http://pandas.pydata.org/)\n",
    "* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n",
    "* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",
    "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",
    "* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"![](images/EscUpmPolit_p.gif \"UPM\")"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Course Notes for Learning Intelligent Systems"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
Corrected link 2016-07-13 14:28:39 +00:00			`"## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)"`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Table of Contents\n",`
			`"\n",`
			`"* [Introduction to Pandas](#Introduction-to-Pandas)\n",`
			`"* [Series](#Series)\n",`
			`"* [DataFrame](#DataFrame)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Introduction to Pandas\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"This notebook provides an overview of the pandas library. "`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.\n",`
			`"\n",`
			`"The main advantage of Pandas is that provides extensive facilities for grouping, merging and querying pandas data structures, and also includes facilities for time series analysis, as well as i/o and visualisation facilities.\n",`
			`"\n",`
			`"Pandas in built on top of NumPy, so we will have usually to import both libraries.\n",`
			`"\n",`
			`"Pandas provides two main data structures:\n",`
			`"* Series is a one dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).. It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index.\n",`
			`"* DataFrame is a two dimensional labelled object with columns of potentially different types. It is similar to a database table, or a spreadsheet. It can be seen as a dictionary of Series that share the same index.\n"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# Series"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We are not going to use Series objects directly as frequently as DataFrames. Here we provide a short introduction"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 1,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"0 5\n",`
			`"1 10\n",`
			`"2 15\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 1,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
Fix error while importing numpy 2016-04-05 13:33:47 +00:00			`"import numpy as np\n",`
Not done reviewing ml2 yet 2016-03-28 12:03:08 +00:00			`"import pandas as pd\n",`
			`"from pandas import Series, DataFrame\n",`
			`"\n",`
			`"# create series object from an array\n",`
			`"s = Series([5, 10, 15])\n",`
			`"s"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We see each value has an associated label starting with 0 if no index is specified when the Series object is created. \n",`
			`"\n",`
			`"It is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the indexes are the keys of the dictionary."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 2,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"a 5\n",`
			`"b 10\n",`
			`"c 15\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 2,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"d = {'a': 5, 'b': 10, 'c': 15}\n",`
			`"s = Series(d)\n",`
			`"s"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 3,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Index(['a', 'b', 'c'], dtype='object')"`
			`]`
			`},`
			`"execution_count": 3,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# We can get the list of indexes\n",`
			`"s.index"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 4,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"array([ 5, 10, 15])"`
			`]`
			`},`
			`"execution_count": 4,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# and the values\n",`
			`"s.values"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Another option is to create the Series object from two lists, for values and indexes."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 5,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"Valencia 786189\n",`
			`"Sevilla 693878\n",`
			`"Zaragoza 664953\n",`
			`"Malaga 569130\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 5,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Series with population in 2015 of more populated cities in Spain\n",`
			`"s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', \n",`
			`" 'Zaragoza', 'Malaga'])\n",`
			`"s"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 6,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"3141991"`
			`]`
			`},`
			`"execution_count": 6,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Population of Madrid\n",`
			`"s['Madrid']"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Indexing and slicing"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Until now, we have not seen any advantage in using Panda Series. we are going to show now some examples of their possibilities."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 7,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid True\n",`
			`"Barcelona True\n",`
			`"Valencia False\n",`
			`"Sevilla False\n",`
			`"Zaragoza False\n",`
			`"Malaga False\n",`
			`"dtype: bool"`
			`]`
			`},`
			`"execution_count": 7,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"#Boolean condition\n",`
			`"s > 1000000"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 8,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 8,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Cities with population greater than 1.000.000\n",`
			`"s[s > 1000000]"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Observe that (s > 1000000) returns a Series object. We can use this boolean vector as a filter to get a slice of the original series that contains only the elements where the value of the filter is True. The original Series s is not modified. This selection is called boolean indexing."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 9,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 9,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Cities with population greater than the mean\n",`
			`"s[s > s.mean()]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 10,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"Valencia 786189\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 10,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Cities with population greater than the median\n",`
			`"s[s > s.median()]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 11,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid True\n",`
			`"Barcelona True\n",`
			`"Valencia True\n",`
			`"Sevilla False\n",`
			`"Zaragoza False\n",`
			`"Malaga False\n",`
			`"dtype: bool"`
			`]`
			`},`
			`"execution_count": 11,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Check cities with a population greater than 700.000\n",`
			`"s > 700000"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 12,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"Valencia 786189\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 12,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# List cities with a population greater than 700.000\n",`
			`"s[s > 700000]"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 13,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid True\n",`
			`"Barcelona True\n",`
			`"Valencia True\n",`
			`"Sevilla False\n",`
			`"Zaragoza False\n",`
			`"Malaga False\n",`
			`"dtype: bool"`
			`]`
			`},`
			`"execution_count": 13,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"#Another way to write the same boolean indexing selection\n",`
			`"bigger_than_700000 = s > 700000\n",`
			`"bigger_than_700000"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 14,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3141991\n",`
			`"Barcelona 1604555\n",`
			`"Valencia 786189\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 14,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Cities with a population greater than 700,000, selected with the stored mask\n",`
			`"s[bigger_than_700000]"`
			`]`
			`},`
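			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Conditions can also be combined. The cell below is a minimal sketch rather than part of the original example: it rebuilds the population figures shown above in a new variable, cities (a name chosen here so that s is left untouched), and combines two conditions with the element-wise operator &. Each condition needs its own parentheses, because & binds more tightly than the comparison operators."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import pandas as pd\n",`
			`"\n",`
			`"# Illustrative copy of the population figures shown in the outputs above\n",`
			`"cities = pd.Series([3141991, 1604555, 786189, 693878, 664953, 569130],\n",`
			`"                   index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'])\n",`
			`"\n",`
			`"# Cities with more than 600,000 and fewer than 1,000,000 inhabitants\n",`
			`"cities[(cities > 600000) & (cities < 1000000)]"`
			`]`
			`},`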
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Operations on series"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Arithmetic operators are applied element-wise to a Series, and aggregation methods such as mean and max reduce it to a single value."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 15,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 1570995.5\n",`
			`"Barcelona 802277.5\n",`
			`"Valencia 393094.5\n",`
			`"Sevilla 346939.0\n",`
			`"Zaragoza 332476.5\n",`
			`"Malaga 284565.0\n",`
			`"dtype: float64"`
			`]`
			`},`
			`"execution_count": 15,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Divide population by 2\n",`
			`"s / 2"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 16,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"1243449.3333333333"`
			`]`
			`},`
			`"execution_count": 16,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Get the average population\n",`
			`"s.mean()"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 17,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"3141991"`
			`]`
			`},`
			`"execution_count": 17,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Get the highest population\n",`
			`"s.max()"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Item assignment"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"We can also change values directly or based on a condition. You can consult additional features in the manual."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 18,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3320000\n",`
			`"Barcelona 1604555\n",`
			`"Valencia 786189\n",`
			`"Sevilla 693878\n",`
			`"Zaragoza 664953\n",`
			`"Malaga 569130\n",`
			`"dtype: int64"`
			`]`
			`},`
			`"execution_count": 18,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Change population of one city\n",`
			`"s['Madrid'] = 3320000\n",`
			`"s"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 19,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"Madrid 3652000.0\n",`
			`"Barcelona 1765010.5\n",`
			`"Valencia 864807.9\n",`
			`"Sevilla 693878.0\n",`
			`"Zaragoza 664953.0\n",`
			`"Malaga 569130.0\n",`
			`"dtype: float64"`
			`]`
			`},`
			`"execution_count": 19,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Increase by 10% the population of cities with more than 700,000 inhabitants (the result is upcast to float64)\n",`
			`"s[s > 700000] = 1.1 * s[s > 700000]\n",`
			`"s"`
			`]`
			`},`
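			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Note that the output above has dtype float64 instead of int64, because multiplying by 1.1 produces floating point values. As a hedged sketch (the names cities and updated are ours, and the figures are the original populations), the same update can be computed without modifying the Series in place by using where, which keeps the values where the condition holds and takes the replacement elsewhere. Rounding and casting then restores an integer dtype."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import pandas as pd\n",`
			`"\n",`
			`"# Illustrative copy of the original population figures\n",`
			`"cities = pd.Series([3141991, 1604555, 786189, 693878, 664953, 569130],\n",`
			`"                   index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'])\n",`
			`"\n",`
			`"# Keep populations up to 700,000 unchanged and scale the larger ones by 10%\n",`
			`"updated = cities.where(cities <= 700000, cities * 1.1)\n",`
			`"\n",`
			`"# Round to whole inhabitants and return to an integer dtype\n",`
			`"updated = updated.round().astype('int64')\n",`
			`"updated"`
			`]`
			`},`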
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"# DataFrame"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"As we said previously, DataFrames are two-dimensional data structures. You can think of a DataFrame as a dict of Series that share the same index."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 20,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>one</th>\n",`
			`" <th>two</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>a</th>\n",`
			`" <td>1.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>b</th>\n",`
			`" <td>2.0</td>\n",`
			`" <td>2.0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>c</th>\n",`
			`" <td>3.0</td>\n",`
			`" <td>3.0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>d</th>\n",`
			`" <td>NaN</td>\n",`
			`" <td>4.0</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" one two\n",`
			`"a 1.0 1.0\n",`
			`"b 2.0 2.0\n",`
			`"c 3.0 3.0\n",`
			`"d NaN 4.0"`
			`]`
			`},`
			`"execution_count": 20,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# We are going to create a DataFrame from a dict of Series\n",`
			`"d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n",`
			`" 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}\n",`
			`"df = DataFrame(d)\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In this DataFrame, the index (row labels) is a, b, c and d, and the columns (column labels) are one and two.\n",`
			`"\n",`
			`"The index of the resulting DataFrame is the union of the indexes of the two Series, and missing values are shown as NaN (to write this value ourselves we use np.nan).\n",`
			`"\n",`
			`"If we pass an index to the constructor, only the rows with those labels are kept, in the order given."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 21,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>one</th>\n",`
			`" <th>two</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>d</th>\n",`
			`" <td>NaN</td>\n",`
			`" <td>4.0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>b</th>\n",`
			`" <td>2.0</td>\n",`
			`" <td>2.0</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>a</th>\n",`
			`" <td>1.0</td>\n",`
			`" <td>1.0</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" one two\n",`
			`"d NaN 4.0\n",`
			`"b 2.0 2.0\n",`
			`"a 1.0 1.0"`
			`]`
			`},`
			`"execution_count": 21,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Keep only the rows labelled d, b and a, in that order\n",`
			`"df = DataFrame(d, index=['d', 'b', 'a'])\n",`
			`"df"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"Another option is to pass both index and columns to the constructor. Labels that are not present in the data, such as the column three here, are filled with NaN."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 22,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/html": [`
			`"<div>\n",`
			`"<table border=\"1\" class=\"dataframe\">\n",`
			`" <thead>\n",`
			`" <tr style=\"text-align: right;\">\n",`
			`" <th></th>\n",`
			`" <th>two</th>\n",`
			`" <th>three</th>\n",`
			`" </tr>\n",`
			`" </thead>\n",`
			`" <tbody>\n",`
			`" <tr>\n",`
			`" <th>d</th>\n",`
			`" <td>4.0</td>\n",`
			`" <td>NaN</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>b</th>\n",`
			`" <td>2.0</td>\n",`
			`" <td>NaN</td>\n",`
			`" </tr>\n",`
			`" <tr>\n",`
			`" <th>a</th>\n",`
			`" <td>1.0</td>\n",`
			`" <td>NaN</td>\n",`
			`" </tr>\n",`
			`" </tbody>\n",`
			`"</table>\n",`
			`"</div>"`
			`],`
			`"text/plain": [`
			`" two three\n",`
			`"d 4.0 NaN\n",`
			`"b 2.0 NaN\n",`
			`"a 1.0 NaN"`
			`]`
			`},`
			`"execution_count": 22,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"# Select rows d, b, a and columns two, three; 'three' is not a key of d, so it is all NaN\n",`
			`"df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])\n",`
			`"df"`
			`]`
			`},`
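			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The NaN entries in these tables mark cells for which there is no data. The following cell is a small sketch under our own assumptions (the name frame and its values are illustrative): it writes a missing value explicitly with np.nan, then uses isnull to flag the missing cells and dropna to discard the rows that contain them."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": null,`
			`"metadata": {`
			`"collapsed": false`
			`},`
			`"outputs": [],`
			`"source": [`
			`"import numpy as np\n",`
			`"import pandas as pd\n",`
			`"\n",`
			`"# Illustrative DataFrame with one explicitly missing value\n",`
			`"frame = pd.DataFrame({'one': [1., 2., np.nan], 'two': [1., 2., 3.]},\n",`
			`"                     index=['a', 'b', 'c'])\n",`
			`"\n",`
			`"# True where a value is missing\n",`
			`"print(frame.isnull())\n",`
			`"\n",`
			`"# Rows that contain no missing values\n",`
			`"frame.dropna()"`
			`]`
			`},`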
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"In the next notebook we are going to learn more about dataframes."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## References"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"* [Pandas](http://pandas.pydata.org/)\n",`
			`"* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n",`
			`"* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n",`
			`"* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n",`
			`"* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"## Licence"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"metadata": {},`
			`"source": [`
			`"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",`
			`"\n",`
			`"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."`
			`]`
			`}`
			`],`
			`"metadata": {`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.5.2"`
			`}`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 0`
			`}`