![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)

# String Data
It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).

We can do it using regular expressions or specific libraries.
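As a minimal sketch of the regular-expression route, the snippet below normalises and checks an email column value using only the standard library. The pattern and the helper name clean_email are illustrative assumptions: the pattern is deliberately loose (it is not a full RFC 5322 validator), and lowercasing the whole address is a simplification.

```python
import re

# Deliberately simple pattern: exactly one "@", no whitespace,
# and at least one dot in the domain part (an assumption for this sketch).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """Return a normalised address, or None if it does not look like an email."""
    candidate = raw.strip().lower()
    return candidate if EMAIL_RE.match(candidate) else None


print(clean_email("  Pepe@Gmail.com "))  # pepe@gmail.com
print(clean_email("This my address"))    # None
```

Values that come back as None can then be dropped or flagged for manual review, depending on the dataset.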

## Beautifier
A simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains and so on. It helps to remove Unicode artifacts, special characters and unnecessary redirection patterns from URLs, and gives you clean data.

Install with **pip install beautifier**.

## Email cleanup

In [3]:
from beautifier import Email
email = Email('me@imsach.in')

In [5]:
email.domain

'imsach.in'

In [7]:
email.username

'me'

In [9]:
email.is_free_email

False

In [13]:
email2 = Email('This my address')

In [15]:
email2.is_valid

False

In [23]:
email3 = Email('pepe@gmail.com')

In [18]:
email3.is_valid

True

In [27]:
email3.is_free_email

True

## URL cleanup

In [29]:
from beautifier import Url
url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')

In [31]:
url.cleanup

'https://in.linkedin.com/in/sachinphilip'

In [33]:
url.domain

'in.linkedin.com'

In [35]:
url.param

['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']

In [37]:
url.parameters

'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'

In [39]:
url.username

'sachinphilip'
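When an extra dependency is not wanted, similar information can be obtained with the standard library module urllib.parse. The following is only a sketch that mirrors the cells above; it is not how Beautifier is implemented, and treating the last path segment as the username is an assumption that happens to fit this LinkedIn URL.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

raw = ("https://in.linkedin.com/in/sachinphilip"
       "?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk")
parts = urlsplit(raw)

# Rebuild the URL without the query string and the fragment
clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# Host name and query parameters as (key, value) pairs
domain = parts.hostname
params = parse_qsl(parts.query)

# Assumption: the last path segment is the profile name
username = parts.path.rstrip("/").rsplit("/", 1)[-1]

print(clean)     # https://in.linkedin.com/in/sachinphilip
print(domain)    # in.linkedin.com
print(params)
print(username)  # sachinphilip
```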

## Unicode
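Problem: some Unicode text has been broken because it was decoded with the wrong character encoding, so we see unrelated characters instead of the intended ones.

**Mojibake** is garbled text that results from displaying characters in an unintended character encoding, for example "Ã©" shown instead of "é". Unreadable bytes are often shown as the replacement character "�".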
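To see how mojibake arises, the sketch below encodes a string as UTF-8 and then decodes the bytes as Latin-1, which is a common mistake. Reversing the two steps repairs the text, but only because here we know which wrong encoding was applied; the variable names are illustrative.

```python
original = "¯\\_(ツ)_/¯"

# Wrong decoding: UTF-8 bytes interpreted as Latin-1
garbled = original.encode("utf-8").decode("latin-1")

# Manual repair: undo the wrong decoding, then decode correctly
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))         # the katakana character becomes 'ã\x83\x84'
print(repaired == original)  # True
```

In real data the wrong encoding is usually unknown and may have been applied more than once, which is why a heuristic tool is convenient.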

We will use the library **ftfy** (fixes text for you) to fix it.

First, install the library: **conda install ftfy**.

In [41]:
import ftfy
foo = '&macr;\\_(ã\x83\x84)_/&macr;'
bar = '\ufeffParty'
baz = '\001\033[36;44mI&#x92;m'
print(ftfy.fix_text(foo))
print(ftfy.fix_text(bar))
print(ftfy.fix_text(baz))

¯\_(ツ)_/¯
Party
I'm


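We can inspect the characters of a string with **ftfy.explain_unicode**, which prints the code point, Unicode category and name of each character. This helps to understand what ftfy has to fix.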

In [1]:
import ftfy  # required again if the kernel has been restarted
foo = '&macr;\\_(ã\x83\x84)_/&macr;'
ftfy.explain_unicode(foo)

## Dates
Sometimes we want to extract a date from text. We can use regular expressions or handy packages, such as **python-dateutil**.

Install the library: **pip install python-dateutil**.

In [8]:
from dateutil.parser import parse
now = parse("Thu Aug 22 10:22:46 UTC 2019")
print(now)

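2019-08-22 10:22:46+00:00

With **fuzzy=True**, the parser ignores the tokens it does not recognise. Fields that are missing from the text are taken from the **default** argument, which is the current date unless specified, so the result of the next cell can depend on the day it is run.
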
In [9]:
dt = parse("Today is Thursday 8, 2019 at 10:20:00AM", fuzzy=True)
print(dt)

2019-08-22 10:20:00
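For texts with a known date format, the regular-expression route can be sketched with the standard library alone. The example assumes ISO-like dates (YYYY-MM-DD); the pattern and the helper name extract_dates are illustrative, and candidates that are not real calendar dates are discarded.

```python
import re
from datetime import datetime

# Assumed format for this sketch: four-digit year, month, day (e.g. 2019-08-22)
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_dates(text):
    """Return all valid YYYY-MM-DD dates found in the text, in order."""
    dates = []
    for match in DATE_RE.finditer(text):
        try:
            dates.append(datetime.strptime(match.group(0), "%Y-%m-%d").date())
        except ValueError:
            pass  # looks like a date but is not valid, e.g. 2019-13-45
    return dates


text = "Submitted on 2019-08-22, revised on 2019-09-03, invalid 2019-13-45."
print(extract_dates(text))
# [datetime.date(2019, 8, 22), datetime.date(2019, 9, 3)]
```

This approach is stricter than fuzzy parsing: it only finds the formats we anticipate, but it never guesses missing fields.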


# References
* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, 
* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)
* [Beautifier](https://github.com/labtocat/beautifier)
* [ftfy](https://ftfy.readthedocs.io/en/latest/)
* [python-dateutil](https://dateutil.readthedocs.io/en/stable/)

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.