{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# String Data\n", "String columns often need to be cleaned so that they follow a predefined format (e.g., emails, URLs, ...).\n", "\n", "We can do this with regular expressions or with specialized libraries." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Beautifier\n", "A simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains, and so on. The library helps to remove Unicode, special characters, and unnecessary redirection patterns from URLs and gives you clean data.\n", "\n", "Install with **'pip install beautifier'**." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Email cleanup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from beautifier import Email\n", "email = Email('me@imsach.in')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'imsach.in'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email.domain" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'me'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email.username" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email.is_free_email" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "email2 = Email('This my address')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email2.is_valid" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "email3 = Email('pepe@gmail.com')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"email3.is_valid" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email3.is_free_email" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## URL cleanup" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from beautifier import Url\n", "url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'https://in.linkedin.com/in/sachinphilip'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url.cleanup" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'in.linkedin.com'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url.domain" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url.param" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url.parameters" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": 
"fragment" } }, "outputs": [ { "data": { "text/plain": [ "'sachinphilip'" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "url.username" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Unicode\n", "Problem: Some unicode code has been broken. We see the character in a different character dataset.\n", "\n", "A **mojibake** is a character displayed in an unintended character encoding. Example: \"�\").\n", "\n", "We will use the library **ftfy** (fixed text for you) to fix it.\n", "\n", "First, you should install the library: **conda install ftfy** (or **pip install ftfy**)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "¯\\_(ツ)_/¯\n", "Party\n", "I'm\n" ] } ], "source": [ "import ftfy\n", "foo = '¯\\\\_(ã\\x83\\x84)_/¯'\n", "bar = '\\ufeffParty'\n", "baz = '\\001\\033[36;44mI’m'\n", "print(ftfy.fix_text(foo))\n", "print(ftfy.fix_text(bar))\n", "print(ftfy.fix_text(baz))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can understand which heuristics ftfy is using." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U+0026 & [Po] AMPERSAND\n", "U+006D m [Ll] LATIN SMALL LETTER M\n", "U+0061 a [Ll] LATIN SMALL LETTER A\n", "U+0063 c [Ll] LATIN SMALL LETTER C\n", "U+0072 r [Ll] LATIN SMALL LETTER R\n", "U+003B ; [Po] SEMICOLON\n", "U+005C \\ [Po] REVERSE SOLIDUS\n", "U+005F _ [Pc] LOW LINE\n", "U+0028 ( [Ps] LEFT PARENTHESIS\n", "U+00E3 ã [Ll] LATIN SMALL LETTER A WITH TILDE\n", "U+0083 \\x83 [Cc] \n", "U+0084 \\x84 [Cc] \n", "U+0029 ) [Pe] RIGHT PARENTHESIS\n", "U+005F _ [Pc] LOW LINE\n", "U+002F / [Po] SOLIDUS\n", "U+0026 & [Po] AMPERSAND\n", "U+006D m [Ll] LATIN SMALL LETTER M\n", "U+0061 a [Ll] LATIN SMALL LETTER A\n", "U+0063 c [Ll] LATIN SMALL LETTER C\n", "U+0072 r [Ll] LATIN SMALL LETTER R\n", "U+003B ; [Po] SEMICOLON\n" ] } ], "source": [ "ftfy.explain_unicode(foo)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Dates\n", "Sometimes we want to extract dates from text. We can use regular expressions or handy packages such as [**python-dateutil**](https://dateutil.readthedocs.io/en/stable/). An alternative is [arrow](https://arrow.readthedocs.io/en/latest/).\n", "\n", "Install the library: **pip install python-dateutil**." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-08-22 10:22:46+00:00\n" ] } ], "source": [ "from dateutil.parser import parse\n", "now = parse(\"Thu Aug 22 10:22:46 UTC 2019\")\n", "print(now)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2019-08-08 10:20:00\n" ] } ], "source": [ "# fuzzy=True skips tokens that are not part of a date;\n", "# fields missing from the text (here the month) are taken from the current date\n", "dt = parse(\"Today is Thursday 8, 2019 at 10:20:00AM\", fuzzy=True)\n", "print(dt)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# References\n", "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n", "* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/), A. Sharma, 2018.\n", "* [Beautifier](https://github.com/labtocat/beautifier) package\n", "* [Ftfy](https://ftfy.readthedocs.io/en/latest/) package\n", "* [python-dateutil](https://dateutil.readthedocs.io/en/stable/) package" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence\n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." 
] } ], "metadata": { "celltoolbar": "Slideshow", "datacleaner": { "position": { "top": "50px" }, "python": { "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])" }, "window_display": false }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 4 }