## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)

# String Data
It is common to clean string columns so that they follow a predefined format (e.g. emails, URLs, ...).

We can do this with regular expressions or with specialized libraries.
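As a minimal sketch of the regular-expression route, the snippet below normalizes a few email strings and strips the query string from a URL using only Python's `re` module. The pattern, the helper names `clean_email` and `strip_query`, the sample URL, and the choice to lowercase the whole address are assumptions made for illustration; this is not a complete validator.

```python
import re

# Deliberately simple pattern (an assumption for illustration): exactly one "@",
# no whitespace, and at least one dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_email(raw):
    """Return a trimmed, lowercased address, or None if it does not look like an email."""
    candidate = raw.strip().lower()
    return candidate if EMAIL_RE.match(candidate) else None


def strip_query(url):
    """Drop the query string and fragment from a URL."""
    return re.sub(r"[?#].*$", "", url.strip())


emails = ["  Me@Imsach.in ", "This my address", "pepe@gmail.com"]
print([clean_email(e) for e in emails])
print(strip_query("https://example.com/in/someone?authtoken=abc&secret=xyz"))
```

Returning `None` for strings that do not match keeps invalid rows easy to filter out later, at the cost of rejecting unusual but legal addresses.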

## Beautifier
Simple [library](https://github.com/labtocat/beautifier) to clean up and prettify URL patterns, domains and so on. It helps to remove unicode characters, special characters and unnecessary redirection patterns from URLs and gives you clean data.

Install with **`pip install beautifier`**.

## Email cleanup

```python
from beautifier import Email
email = Email('me@imsach.in')
```

```python
email.domain
```

    'imsach.in'

```python
email.username
```

    'me'

```python
email.is_free_email
```

    False

```python
email2 = Email('This my address')
```

```python
email2.is_valid
```

    False

```python
email3 = Email('pepe@gmail.com')
```

```python
email3.is_valid
```

    True

```python
email3.is_free_email
```

    True

## URL cleanup

```python
from beautifier import Url
url = Url('https://in.linkedin.com/in/sachinphilip?authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk')
```

```python
url.cleanup
```

    'https://in.linkedin.com/in/sachinphilip'

```python
url.domain
```

    'in.linkedin.com'

```python
url.param
```

    ['authtoken=887nasdadasd6hasdtg21', 'secret=98jy766yhhuhnjk']

```python
url.parameters
```

    'authtoken=887nasdadasd6hasdtg21&secret=98jy766yhhuhnjk'

```python
url.username
```

    'sachinphilip'

## Unicode
Problem: some Unicode text has been broken, so characters show up as if they belonged to a different character set.

**Mojibake** is text displayed in an unintended character encoding (example: "�").

We will use the library **ftfy** (fixes text for you) to repair it.

First, install the library: **`conda install ftfy`**.

```python
import ftfy
foo = '¯\\_(ã\x83\x84)_/¯'      # mojibake: UTF-8 bytes that were decoded as Latin-1
bar = '\ufeffParty'             # leading byte order mark
baz = '\001\033[36;44mI’m'      # control character and terminal escape sequence
print(ftfy.fix_text(foo))
print(ftfy.fix_text(bar))
print(ftfy.fix_text(baz))
```

    ¯\_(ツ)_/¯
    Party
    I'm

We can also inspect a string character by character with `explain_unicode`, which prints each code point together with its Unicode category and name.
```python
# Needs `import ftfy` and `foo` from the previous cell; a fresh kernel raises NameError.
ftfy.explain_unicode(foo)
```

## Dates
Sometimes we want to extract a date from text.

```python
from dateutil.parser import parse

now = parse("Thu Aug 22 10:22:46 UTC 2019")
print(now)

# fuzzy=True lets the parser skip tokens it cannot interpret as part of a date.
dt = parse("Today is Thursday 8, 2019 at 10:20:00AM", fuzzy=True)
print(dt)
```

Iglesias, Universidad Politécnica de Madrid.
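When the date format is known in advance, a stricter alternative to fuzzy parsing is to locate candidate substrings with a regular expression and validate each one with the standard library's `datetime.strptime`. The sketch below swaps `dateutil` for this approach and assumes ISO-style `YYYY-MM-DD` dates; the function name `extract_dates` and the sample sentence are hypothetical.

```python
import re
from datetime import datetime

# Candidate substrings shaped like YYYY-MM-DD (assumed format for this sketch).
ISO_DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_dates(text):
    """Return every valid YYYY-MM-DD date found in text, in order of appearance."""
    dates = []
    for candidate in ISO_DATE_RE.findall(text):
        try:
            dates.append(datetime.strptime(candidate, "%Y-%m-%d").date())
        except ValueError:
            # Matches the pattern but is not a real calendar day, e.g. 2019-02-30.
            pass
    return dates


print(extract_dates("Logged on 2019-08-22, retried on 2019-02-30 and 2019-09-01."))
```

Unlike fuzzy parsing, this never guesses missing fields: a string either matches the expected format and is a real calendar day, or it is skipped.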