{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Duplicated values\n", "\n", "There are two possible approaches: **remove** these rows or **filling** them. It depends on every case.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filling NaN values\n", "If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n", "\n", "* For **string** fields, we can fill NaN with **' '**.\n", "\n", "* For **numbers**, we can fill with the **mean** or **median** value. \n" ] }, { "cell_type": "raw", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "# Fill NaN with ' '\n", "df['col'] = df['col'].fillna(' ')\n", "# Fill NaN with 99\n", "df['col'] = df['col'].fillna(99)\n", "# Fill NaN with the mean of the column\n", "df['col'] = df['col'].fillna(df['col'].mean())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Propagate non-null values forward or backwards\n", "You can also propagate non-null values forward or backwards by putting\n", "method=’pad’ as the method argument. It will fill the next value in the\n", "dataframe with the previous non-NaN value. Maybe you just want to fill one\n", "value ( limit=1 )or you want to fill all the values." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1
0NaN
1NaN
22.0
33.0
44.0
5NaN
6NaN
\n", "
" ], "text/plain": [ " col1\n", "0 NaN\n", "1 NaN\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "5 NaN\n", "6 NaN" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1
0NaN
1NaN
22.0
33.0
44.0
54.0
6NaN
\n", "
" ], "text/plain": [ " col1\n", "0 NaN\n", "1 NaN\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "5 4.0\n", "6 NaN" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We fill forward the value 4.0 and fill the next one (limit = 1)\n", "df.fillna(method='pad', limit=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also backfilling with **bfill**." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
col1
02.0
12.0
22.0
33.0
44.0
5NaN
6NaN
\n", "
" ], "text/plain": [ " col1\n", "0 2.0\n", "1 2.0\n", "2 2.0\n", "3 3.0\n", "4 4.0\n", "5 NaN\n", "6 NaN" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Fill the first two NaN values with the first available value\n", "df.fillna(method='bfill')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Removing NaN values\n", "We can remove them by row or column." ] }, { "cell_type": "raw", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "/# Drop any rows which have any nans\n", "df.dropna()\n", "/# Drop columns that have any nans\n", "df.dropna(axis=1)\n", "/# Only drop columns which have at least 90% non-NaNs\n", "df.dropna(thresh=int(df.shape[0] * .9), axis=1)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# References\n", "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n", "* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence\n", "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 1 }