![](images/EscUpmPolit_p.gif "UPM")

# Course Notes for Learning Intelligent Systems

Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias

## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)

# String Data
String columns often need to be cleaned so that they follow a predefined format (e.g., emails, URLs, ...).

We can do this with regular expressions or with specialized libraries.
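As a minimal sketch of the regular-expression route, the snippet below normalizes and validates email strings using only the standard library. The pattern and the helper name `clean_email` are assumptions made for illustration. The pattern is deliberately simple and is not a complete validator.

```python
import re

# Deliberately simple pattern (an assumption for this example):
# exactly one "@", no whitespace, and at least one dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """Return a trimmed, lower-cased address, or None if it does not look like an email."""
    candidate = raw.strip().lower()
    return candidate if EMAIL_RE.match(candidate) else None


print(clean_email("  Pepe@Gmail.com "))   # pepe@gmail.com
print(clean_email("This my address"))     # None
```

Lower-casing the whole address is a common normalization choice. Drop that step if the case of the local part must be preserved.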

## Beautifier
A simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains, and so on. The library helps to remove Unicode debris, special characters, and unnecessary redirection patterns from URLs and gives you clean data.

Install it with **pip install beautifier**.

## Email cleanup

In [1]:
from beautifier import Email
email = Email('me@imsach.in')

In [2]:
email.domain

'imsach.in'

In [3]:
email.username

'me'

In [4]:
email.is_free_email

False

In [5]:
email2 = Email('This my address')

In [6]:
email2.is_valid

False

In [7]:
email3 = Email('pepe@gmail.com')

In [8]:
email3.is_valid

True

In [9]:
email3.is_free_email

True

## URL cleanup

In [10]:
from beautifier import Url
url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')

In [11]:
url.cleanup

'https://in.linkedin.com/in/sachinphilip'

In [12]:
url.domain

'in.linkedin.com'

In [13]:
url.param

['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']

In [14]:
url.parameters

'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'

In [15]:
url.username

'sachinphilip'
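For comparison, similar information can be extracted with the standard library module `urllib.parse`. This is only a rough sketch, not how Beautifier is implemented. Taking the last path segment as the username is an assumption that happens to fit this particular URL.

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit

raw = ("https://in.linkedin.com/in/sachinphilip"
       "?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk")
parts = urlsplit(raw)

# Host name of the URL.
domain = parts.netloc
# Query string split into (key, value) pairs.
params = parse_qsl(parts.query)
# Rebuild the URL without query string and fragment.
clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
# Assumption: the last path segment plays the role of the username.
last_segment = parts.path.rstrip("/").split("/")[-1]

print(clean)
print(domain)
print(params)
print(last_segment)
```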

## Unicode
Problem: some Unicode text has been corrupted because it was decoded with a different character encoding from the one used to write it.

The resulting garbled text is called **mojibake**: characters displayed in an unintended character encoding (for example, "�").
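To see how such text arises, the following sketch encodes a string as UTF-8 and then decodes the bytes as Latin-1, which is a typical way of producing mojibake. Reversing the two steps by hand only works when both encodings are known; the choice of Latin-1 here is an assumption for the example.

```python
original = "¯\\_(ツ)_/¯"

# Wrong decoding: the UTF-8 bytes are interpreted as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(repr(garbled))

# Manual repair: undo the wrong decoding, then decode correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

In real data the wrong encoding is usually unknown, which is why a library that guesses it is useful.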

We will use the library **ftfy** (fixes text for you) to fix it.

First, you should install the library: **conda install ftfy** (or **pip install ftfy**).

In [16]:
import ftfy
foo = '&macr;\\_(ã\x83\x84)_/&macr;'
bar = '\ufeffParty'
baz = '\001\033[36;44mI’m'
print(ftfy.fix_text(foo))
print(ftfy.fix_text(bar))
print(ftfy.fix_text(baz))

¯\_(ツ)_/¯
Party
I'm


To see exactly what a string contains, `explain_unicode` prints each character with its code point, Unicode category, and name.

In [17]:
ftfy.explain_unicode(foo)

U+0026 & [Po] AMPERSAND
U+006D m [Ll] LATIN SMALL LETTER M
U+0061 a [Ll] LATIN SMALL LETTER A
U+0063 c [Ll] LATIN SMALL LETTER C
U+0072 r [Ll] LATIN SMALL LETTER R
U+003B ; [Po] SEMICOLON
U+005C \ [Po] REVERSE SOLIDUS
U+005F _ [Pc] LOW LINE
U+0028 ( [Ps] LEFT PARENTHESIS
U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE
U+0083 \x83 [Cc] 
U+0084 \x84 [Cc] 
U+0029 ) [Pe] RIGHT PARENTHESIS
U+005F _ [Pc] LOW LINE
U+002F / [Po] SOLIDUS
U+0026 & [Po] AMPERSAND
U+006D m [Ll] LATIN SMALL LETTER M
U+0061 a [Ll] LATIN SMALL LETTER A
U+0063 c [Ll] LATIN SMALL LETTER C
U+0072 r [Ll] LATIN SMALL LETTER R
U+003B ; [Po] SEMICOLON


## Dates
Sometimes we want to extract a date from text. We can use regular expressions or handy packages, such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).

Install the library: **pip install python-dateutil**.

In [18]:
from dateutil.parser import parse
now = parse("Thu Aug 22 10:22:46 UTC 2019")
print(now)

2019-08-22 10:22:46+00:00


In [19]:
dt = parse("Today is Thursday 8, 2019 at 10:20:00AM", fuzzy=True)
print(dt)

2019-08-08 10:20:00
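The regular-expression route mentioned at the start of this section can be sketched with the standard library alone. The sample sentence and the ISO-like pattern are assumptions made for the example. Real text usually needs a pattern adapted to its own date format.

```python
import re
from datetime import datetime

# Hypothetical input text, invented for this example.
text = "Backup finished on 2019-08-22 at 10:22:46 without errors"

# Capture a date (YYYY-MM-DD) followed later by a time (HH:MM:SS).
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}).*?(\d{2}:\d{2}:\d{2})")

match = pattern.search(text)
if match:
    # Join the two captured groups and parse them with an explicit format.
    stamp = datetime.strptime(" ".join(match.groups()), "%Y-%m-%d %H:%M:%S")
    print(stamp)  # 2019-08-22 10:22:46
```

Unlike this explicit pattern, `dateutil` fills any field it cannot find in the text from a default date (today, unless a `default` argument is passed), so fuzzy results may depend on when the cell is run.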


# References
* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.
* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.
* [Beautifier](https://github.com/labtocat/beautifier) package
* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package
* [python-dateutil](https://dateutil.readthedocs.io/en/stable/) package

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).

© Carlos A. Iglesias, Universidad Politécnica de Madrid.