{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Table of Contents\n", "\n", "* [Introduction to Pandas](#Introduction-to-Pandas)\n", "* [Series](#Series)\n", "* [DataFrame](#DataFrame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Pandas\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook provides an overview of the *pandas* library. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.\n", "\n", "The main advantage of *Pandas* is that it provides extensive facilities for grouping, merging and querying pandas data structures, and also includes facilities for time series analysis, as well as I/O and visualisation facilities.\n", "\n", "Pandas is built on top of *NumPy*, so we will usually have to import both libraries.\n", "\n", "Pandas provides two main data structures:\n", "* **Series** is a one-dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index.\n", "* **DataFrame** is a two-dimensional labelled object with columns of potentially different types. It is similar to a database table, or a spreadsheet. 
It can be seen as a dictionary of Series that share the same index.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will not use Series objects directly as often as DataFrames, so here we provide only a short introduction." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from pandas import Series, DataFrame\n", "\n", "# create a Series object from a list\n", "s = Series([5, 10, 15])\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that each value has an associated label, starting with 0, if no index is specified when the Series object is created. \n", "\n", "It is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the indexes are the keys of the dictionary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d = {'a': 5, 'b': 10, 'c': 15}\n", "s = Series(d)\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can get the list of indexes\n", "s.index" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# and the values\n", "s.values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option is to create the Series object from two lists, one for the values and one for the indexes." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Series with the 2015 population of the most populated cities in Spain\n", "s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', \n", " 'Zaragoza', 'Malaga'])\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Population of Madrid\n", "s['Madrid']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Indexing and slicing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Until now, we have not seen any advantage in using pandas Series. We are now going to show some examples of what they can do." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Boolean condition\n", "s > 1000000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cities with population greater than 1.000.000\n", "s[s > 1000000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe that (s > 1000000) returns a boolean Series object. We can use this boolean vector as a filter to get a *slice* of the original series that contains only the elements where the value of the filter is True. The original Series s is not modified. This selection is called *boolean indexing*." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cities with population greater than the mean\n", "s[s > s.mean()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cities with population greater than the median\n", "s[s > s.median()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check cities with a population greater than 700.000\n", "s > 700000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List cities with a population greater than 700.000\n", "s[s > 700000]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Another way to write the same boolean indexing selection\n", "bigger_than_700000 = s > 700000\n", "bigger_than_700000" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cities with population > 700000\n", "s[bigger_than_700000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Operations on series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also carry out other mathematical operations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Divide population by 2\n", "s / 2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the average population\n", "s.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the highest population\n", "s.max()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Item assignment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also change values directly or based on a condition. You can consult additional features in the manual." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Change population of one city\n", "s['Madrid'] = 3320000\n", "s" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Increase by 10% cities with population greater than 700000\n", "s[s > 700000] = 1.1 * s[s > 700000]\n", "s" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# DataFrame" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we said previously, **DataFrames** are two-dimensional data structures. You can see them as a dict of Series that share the same index." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We are going to create a DataFrame from a dict of Series\n", "d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),\n", " 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}\n", "df = DataFrame(d)\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this dataframe, the *indexes* (row labels) are *a*, *b*, *c* and *d* and the *columns* (column labels) are *one* and *two*.\n", "\n", "We see that the index of the resulting DataFrame is the union of the indexes of the Series, and missing values are included as NaN (to write this value we will use *np.nan*).\n", "\n", "If we specify an index, only the rows with those labels are kept, in the given order." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We can filter\n", "df = DataFrame(d, index=['d', 'b', 'a'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option is to use the constructor with *index* and *columns*." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next notebook we are going to learn more about dataframes." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [Pandas](http://pandas.pydata.org/)\n", "* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)\n", "* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)\n", "* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)\n", "* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Licence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 1 }