{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "\n", "* [Introduction to Pandas](#Introduction-to-Pandas)\n", "* [Series](#Series)\n", "* [DataFrame](#DataFrame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Pandas\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides an overview of the *pandas* library. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.\n", "\n", "The main advantage of *Pandas* is that it provides extensive facilities for grouping, merging and querying its data structures, and also includes facilities for time series analysis, as well as I/O and visualisation facilities.\n", "\n", "Pandas is built on top of *NumPy*, so we will usually have to import both libraries.\n", "\n", "Pandas provides two main data structures:\n", "* **Series** is a one-dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index.\n", "* **DataFrame** is a two-dimensional labelled object with columns of potentially different types. It is similar to a database table, or a spreadsheet. 
It can be seen as a dictionary of Series that share the same index.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are not going to use Series objects directly as frequently as DataFrames. Here we provide a short introduction." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0 5\n", "1 10\n", "2 15\n", "dtype: int64" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pandas import Series, DataFrame\n", "\n", "# Create a Series object from a list\n", "s = Series([5, 10, 15])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that each value has an associated label, starting from 0, if no index is specified when the Series object is created. \n", "\n", "It is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the indexes are the keys of the dictionary." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "a 5\n", "b 10\n", "c 15\n", "dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = {'a': 5, 'b': 10, 'c': 15}\n", "s = Series(d)\n", "s" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Index(['a', 'b', 'c'], dtype='object')" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can get the list of indexes\n", "s.index" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 5, 10, 15])" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# and the values\n", "s.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option is to create the Series object from two lists, for values and indexes." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "Valencia 786189\n", "Sevilla 693878\n", "Zaragoza 664953\n", "Malaga 569130\n", "dtype: int64" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Series with the 2015 population of the most populated cities in Spain\n", "s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', \n", " 'Zaragoza', 'Malaga'])\n", "s" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3141991" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Population of Madrid\n", "s['Madrid']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing and slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until now, we have not seen any advantage in using pandas Series. We are now going to show some examples of what they can do." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid True\n", "Barcelona True\n", "Valencia False\n", "Sevilla False\n", "Zaragoza False\n", "Malaga False\n", "dtype: bool" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Boolean condition\n", "s > 1000000" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cities with population greater than 1.000.000\n", "s[s > 1000000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that (s > 1000000) returns a Series object. 
We can use this boolean vector as a filter to get a *slice* of the original series that contains only the elements where the value of the filter is True. The original Series s is not modified. This selection is called *boolean indexing*." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cities with population greater than the mean\n", "s[s > s.mean()]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "Valencia 786189\n", "dtype: int64" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Cities with population greater than the median\n", "s[s > s.median()]" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid True\n", "Barcelona True\n", "Valencia True\n", "Sevilla False\n", "Zaragoza False\n", "Malaga False\n", "dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Check cities with a population greater than 700.000\n", "s > 700000" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "Valencia 786189\n", "dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# List cities with a population greater than 700.000\n", "s[s > 700000]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid True\n", "Barcelona True\n", "Valencia True\n", "Sevilla False\n", "Zaragoza False\n", "Malaga False\n", 
"dtype: bool" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Another way to write the same boolean indexing selection\n", "bigger_than_700000 = s > 700000\n", "bigger_than_700000" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3141991\n", "Barcelona 1604555\n", "Valencia 786189\n", "dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Cities with population > 700000\n", "s[bigger_than_700000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operations on series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also carry out other mathematical operations." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 1570995.5\n", "Barcelona 802277.5\n", "Valencia 393094.5\n", "Sevilla 346939.0\n", "Zaragoza 332476.5\n", "Malaga 284565.0\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Divide population by 2\n", "s / 2" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "1243449.3333333333" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the average population\n", "s.mean()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3141991" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Get the highest population\n", "s.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Item assignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also change values directly or based on a condition. 
You can consult additional features in the manual." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3320000\n", "Barcelona 1604555\n", "Valencia 786189\n", "Sevilla 693878\n", "Zaragoza 664953\n", "Malaga 569130\n", "dtype: int64" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Change population of one city\n", "s['Madrid'] = 3320000\n", "s" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "Madrid 3652000.0\n", "Barcelona 1765010.5\n", "Valencia 864807.9\n", "Sevilla 693878.0\n", "Zaragoza 664953.0\n", "Malaga 569130.0\n", "dtype: float64" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Increase by 10% the population of cities with more than 700000 inhabitants\n", "s[s > 700000] = 1.1 * s[s > 700000]\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we said previously, **DataFrames** are two-dimensional data structures. You can see them as a dict of Series that share the same index." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two\n", "a 1.0 1.0\n", "b 2.0 2.0\n", "c 3.0 3.0\n", "d NaN 4.0" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We are going to create a DataFrame from a dict of Series\n", "d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n", " 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}\n", "df = DataFrame(d)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataframe, the *indexes* (row labels) are *a*, *b*, *c* and *d* and the *columns* (column labels) are *one* and *two*.\n", "\n", "We see that the index of the resulting DataFrame is the union of the indexes of the Series, and missing values are included as NaN (to write this value we will use *np.nan*).\n", "\n", "If we specify an index, only the rows with those labels are kept, in the order given." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " one two\n", "d NaN 4.0\n", "b 2.0 2.0\n", "a 1.0 1.0" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We can filter\n", "df = DataFrame(d, index=['d', 'b', 'a'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option is to use the constructor with *index* and *columns*." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " two three\n", "d 4.0 NaN\n", "b 2.0 NaN\n", "a 1.0 NaN" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next notebook we are going to learn more about dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Pandas](http://pandas.pydata.org/)\n", "* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }