![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Machine Learning II](3_0_0_Intro_ML_2.ipynb)

# Table of Contents

* [Introduction to Pandas](#Introduction-to-Pandas)
* [Series](#Series)
* [DataFrame](#DataFrame)

# Introduction to Pandas


This notebook provides an overview of the *pandas* library. 

[Pandas](http://pandas.pydata.org/) is a Python library that provides easy-to-use data structures and data analysis tools.

The main advantage of *Pandas* is that it provides extensive facilities for grouping, merging and querying its data structures. It also includes facilities for time series analysis, I/O and visualisation.

Pandas is built on top of *NumPy*, so we will usually have to import both libraries.

Pandas provides two main data structures:
* **Series** is a one-dimensional labelled object, capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). It is similar to an array, a list, a dictionary or a column in a table. Every value in a Series object has an index.
* **DataFrame** is a two-dimensional labelled object with columns of potentially different types. It is similar to a database table or a spreadsheet. It can be seen as a dictionary of Series that share the same index.
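As a minimal sketch of the two structures described above, the following builds one of each and shows that a DataFrame column is itself a Series. The names `temperatures` and `weather` and their values are illustrative assumptions, not course data.

```python
import pandas as pd

# A Series: one-dimensional values, each paired with an index label
temperatures = pd.Series([21.5, 19.0, 23.1], index=['mon', 'tue', 'wed'])

# A DataFrame: columns of potentially different types sharing one row index
weather = pd.DataFrame(
    {'temperature': [21.5, 19.0, 23.1], 'sky': ['sunny', 'cloudy', 'sunny']},
    index=['mon', 'tue', 'wed'],
)

# Selecting one column of a DataFrame gives back a Series
column = weather['temperature']
print(type(column).__name__)      # Series
print(list(column.index))         # ['mon', 'tue', 'wed']
```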


# Series

We will not use Series objects directly as often as DataFrames, so here we provide only a short introduction.

In [None]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# create series object from an array
s = Series([5, 10, 15])
s

We see that each value has an associated label. If no index is specified when the Series object is created, the labels are integers starting at 0.

It is similar to a dictionary. In fact, we can also create a Series object from a dictionary as follows. In this case, the indexes are the keys of the dictionary.

In [None]:
d = {'a': 5, 'b': 10, 'c': 15}
s = Series(d)
s

In [None]:
# We can get the list of indexes
s.index

In [None]:
# and the values
s.values

Another option is to create the Series object from two lists, for values and indexes.

In [None]:
# Series with the 2015 population of the most populated cities in Spain
s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 
 'Zaragoza', 'Malaga'])
s

In [None]:
# Population of Madrid
s['Madrid']

## Indexing and slicing

Until now, we have not seen any advantage in using pandas Series. We are now going to show some examples of their possibilities.

In [None]:
#Boolean condition
s > 1000000

In [None]:
# Cities with population greater than 1,000,000
s[s > 1000000]

Observe that `s > 1000000` returns a Series of boolean values with the same index as `s`. We can use this boolean vector as a filter to get a *slice* of the original series that contains only the elements where the value of the filter is True. The original Series `s` is not modified. This selection is called *boolean indexing*.
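The following sketch illustrates boolean indexing step by step and shows how two conditions can be combined. It re-creates the population data under the hypothetical name `cities` so that the notebook's `s` is not overwritten.

```python
import pandas as pd

cities = pd.Series(
    [3141991, 1604555, 786189, 693878, 664953, 569130],
    index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'],
)

mask = cities > 1000000            # boolean Series with the same index
print(mask.dtype)                  # bool
selected = cities[mask]            # a new Series; cities itself is unchanged
print(len(cities), len(selected))  # 6 2

# Conditions combine with & (and), | (or) and ~ (not).
# Each condition needs its own parentheses.
mid_sized = cities[(cities > 600000) & (cities < 1000000)]
print(list(mid_sized.index))       # ['Valencia', 'Sevilla', 'Zaragoza']
```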

In [None]:
# Cities with population greater than the mean
s[s > s.mean()]

In [None]:
# Cities with population greater than the median
s[s > s.median()]

In [None]:
# Check cities with a population greater than 700,000
s > 700000

In [None]:
# List cities with a population greater than 700,000
s[s > 700000]

In [None]:
#Another way to write the same boolean indexing selection
bigger_than_700000 = s > 700000
bigger_than_700000

In [None]:
#Cities with population > 700000
s[bigger_than_700000]

## Operations on series

We can also carry out other mathematical operations.

In [None]:
# Divide population by 2
s / 2

In [None]:
# Get the average population
s.mean()

In [None]:
# Get the highest population
s.max()
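A few more operations in the same spirit, as a hedged sketch: arithmetic on a Series is applied element by element, while methods such as `sum()` and `idxmax()` reduce it to a single result. The name `cities` is again a stand-in for the notebook's `s`.

```python
import pandas as pd

cities = pd.Series(
    [3141991, 1604555, 786189, 693878, 664953, 569130],
    index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'],
)

total = cities.sum()             # sum of all values
largest_city = cities.idxmax()   # index label of the highest value
print(total)                     # 7460696
print(largest_city)              # Madrid

# Element-wise division: each city's share of the total population
share = cities / total
print(share.round(3))
```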

## Item assignment

We can also change values directly or based on a condition. You can consult additional features in the manual.

In [None]:
# Change population of one city
s['Madrid'] = 3320000
s

In [None]:
# Increase by 10% cities with population greater than 700000
s[s > 700000] = 1.1 * s[s > 700000]
s
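An alternative way to write a conditional update, sketched here under the assumption that we want to keep the original data intact: work on a copy and use `.loc` with the boolean mask. The names `cities` and `updated` are hypothetical.

```python
import pandas as pd

# Float dtype so that the scaled values can be stored without conversion
cities = pd.Series(
    [3141991, 1604555, 786189, 693878, 664953, 569130],
    index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', 'Zaragoza', 'Malaga'],
    dtype=float,
)

updated = cities.copy()                 # leave the original untouched
updated.loc[updated > 700000] *= 1.1    # only rows matching the mask change

print(updated)
print(cities['Madrid'])                 # 3141991.0, the original is unchanged
```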

# DataFrame

As we said previously, **DataFrames** are two-dimensional data structures. You can see them as a dict of Series that share the same index.

In [None]:
# We are going to create a DataFrame from a dict of Series
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
 'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)
df

In this dataframe, the *indexes* (row labels) are *a*, *b*, *c* and *d* and the *columns* (column labels) are *one* and *two*.

We see that the index of the resulting DataFrame is the union of the indexes of the Series, and missing values are included as NaN (to write this value we will use *np.nan*).
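A small sketch of how missing values can be detected and written. It rebuilds the same data under the hypothetical names `columns_dict` and `frame` so that the notebook's `d` and `df` are not overwritten.

```python
import numpy as np
import pandas as pd

columns_dict = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
frame = pd.DataFrame(columns_dict)

# isna() marks the missing cells; here only row 'd' of column 'one'
missing = frame.isna()
print(missing)
print(int(missing.sum().sum()))        # 1

# np.nan is how we write a missing value ourselves
frame.loc['a', 'two'] = np.nan
print(int(frame.isna().sum().sum()))   # 2

# NaN never compares equal to itself, so test with isna() rather than ==
print(np.nan == np.nan)                # False
```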

If we specify an index, the dictionary is filtered: only the rows with the given labels are kept, in the order given.

In [None]:
# We can filter
df = DataFrame(d, index=['d', 'b', 'a'])
df
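As a sketch of what this filtering does, the result can be compared with building the full DataFrame first and then selecting rows with `reindex`. The names `columns_dict`, `filtered` and `reindexed` are hypothetical.

```python
import pandas as pd

columns_dict = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}

# Passing index= keeps only these row labels, in this order
filtered = pd.DataFrame(columns_dict, index=['d', 'b', 'a'])

# Building everything first and reindexing afterwards selects the same rows
reindexed = pd.DataFrame(columns_dict).reindex(['d', 'b', 'a'])

print(list(filtered.index))          # ['d', 'b', 'a']
print(filtered.equals(reindexed))
```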

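Another option is to use the constructor with both *index* and *columns*. Only the listed columns are kept, and a column name that is not a key of the dictionary (here *three*) is filled with NaN.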

In [None]:
df = DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
df
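The effect of the *columns* argument can be checked with a short sketch like the one below, again using the hypothetical names `columns_dict` and `frame`.

```python
import pandas as pd

columns_dict = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
}
frame = pd.DataFrame(columns_dict, index=['d', 'b', 'a'], columns=['two', 'three'])

print(list(frame.columns))               # ['two', 'three']
# 'one' was not requested, so it is dropped
print('one' in frame.columns)            # False
# 'three' is not a key of the dictionary, so every value is missing
print(bool(frame['three'].isna().all())) # True
```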

In the next notebook we are going to learn more about dataframes.

## References

* [Pandas](http://pandas.pydata.org/)
* [Learning Pandas, Michael Heydt, Packt Publishing, 2015](http://proquest.safaribooksonline.com/book/programming/python/9781783985128)
* [Pandas. Introduction to Data Structures](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dsintro)
* [Introducing Pandas Objects](https://www.oreilly.com/learning/introducing-pandas-objects)
* [Boolean Operators in Pandas](http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-operators)

## Licence

The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.