{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Unknown values\n", "\n", "There are two possible approaches: **remove** the rows that contain unknown values, or **fill** them. The right choice depends on the case." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Filling NaN values\n", "If we need to fill errors or blanks, we can use the method **fillna()**; **dropna()** removes them instead.\n", "\n", "* For **string** fields, we can fill NaN with **' '**.\n", "\n", "* For **numbers**, we can fill with the **mean** or **median** value. 
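
As a minimal sketch of the two cases above, assuming a hypothetical toy DataFrame with columns name and age (neither appears in the original notes), the following fills a string column with a blank and compares the mean and the median as fill values for a numeric column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text column and one numeric column with gaps
df = pd.DataFrame({
    'name': ['Ann', np.nan, 'Bob', np.nan],
    'age': [20.0, np.nan, 30.0, 100.0],
})

# String field: replace NaN with a single blank
df['name'] = df['name'].fillna(' ')

# Numeric field: compute both candidate fill values (NaN is skipped)
age_mean = df['age'].mean()      # (20 + 30 + 100) / 3 = 50.0
age_median = df['age'].median()  # middle value of 20, 30, 100 = 30.0

# Keep the original column and store each variant in a new column
df['age_mean'] = df['age'].fillna(age_mean)
df['age_median'] = df['age'].fillna(age_median)
```

When the data contains an outlier such as the 100 here, the median is often the safer fill value, because the mean is pulled towards the outlier.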
\n" ] }, { "cell_type": "raw", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "# Fill NaN with ' '\n", "df['col'] = df['col'].fillna(' ')\n", "# Fill NaN with 99\n", "df['col'] = df['col'].fillna(99)\n", "# Fill NaN with the mean of the column\n", "df['col'] = df['col'].fillna(df['col'].mean())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Propagate non-null values forward or backward\n", "You can also **propagate** non-null values with these methods:\n", "\n", "* **ffill**: Fill gaps by propagating the last valid observation forward, up to the next valid one.\n", "* **bfill**: Fill gaps backward using the next valid observation.\n", "* **interpolate**: Fill NaN values using interpolation.\n", "\n", "For example, **ffill** fills each NaN in the DataFrame with the previous non-NaN value.\n", "\n", "You can fill only one consecutive value (**limit=1**) or all of them (the default). You can also pass inplace=True to modify the DataFrame in place." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "df = pd.DataFrame(data={'col1': [np.nan, np.nan, 2, 3, 4, np.nan, np.nan]})" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>col1</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>NaN</td></tr>\n", " <tr><th>1</th><td>NaN</td></tr>\n", " <tr><th>2</th><td>2.0</td></tr>\n", " <tr><th>3</th><td>3.0</td></tr>\n", " <tr><th>4</th><td>4.0</td></tr>\n", " <tr><th>5</th><td>NaN</td></tr>\n", " <tr><th>6</th><td>NaN</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "   col1\n", "0   NaN\n", "1   NaN\n", "2   2.0\n", "3   3.0\n", "4   4.0\n", "5   NaN\n", "6   NaN" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We forward-fill the value 4.0 into the next NaN only (limit = 1), so the last NaN remains." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>col1</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>NaN</td></tr>\n", " <tr><th>1</th><td>NaN</td></tr>\n", " <tr><th>2</th><td>2.0</td></tr>\n", " <tr><th>3</th><td>3.0</td></tr>\n", " <tr><th>4</th><td>4.0</td></tr>\n", " <tr><th>5</th><td>4.0</td></tr>\n", " <tr><th>6</th><td>NaN</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "   col1\n", "0   NaN\n", "1   NaN\n", "2   2.0\n", "3   3.0\n", "4   4.0\n", "5   4.0\n", "6   NaN" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.ffill(limit=1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.ffill()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can also backfill with **bfill**. Since we do not pass *limit*, every NaN that is followed by a valid value is filled; the trailing NaN values remain because no valid observation follows them." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>col1</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>0</th><td>2.0</td></tr>\n", " <tr><th>1</th><td>2.0</td></tr>\n", " <tr><th>2</th><td>2.0</td></tr>\n", " <tr><th>3</th><td>3.0</td></tr>\n", " <tr><th>4</th><td>4.0</td></tr>\n", " <tr><th>5</th><td>NaN</td></tr>\n", " <tr><th>6</th><td>NaN</td></tr>\n", " </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "   col1\n", "0   2.0\n", "1   2.0\n", "2   2.0\n", "3   3.0\n", "4   4.0\n", "5   NaN\n", "6   NaN" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.bfill()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Removing NaN values\n", "We can remove NaN values by row or by column (pass inplace=True if you want to modify the DataFrame in place)." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
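<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>col1</th></tr>\n", " </thead>\n", " <tbody>\n", " <tr><th>2</th><td>2.0</td></tr>\n", " <tr><th>3</th><td>3.0</td></tr>\n", " <tr><th>4</th><td>4.0</td></tr>\n", " </tbody>\n", "</table>\n", "</div>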
" ], "text/plain": [ "   col1\n", "2   2.0\n", "3   3.0\n", "4   4.0" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Drop rows that contain any NaN\n", "df1 = df.dropna()\n", "# Drop columns that contain any NaN (axis = 1 -> drop columns, axis = 0 -> drop rows)\n", "df2 = df.dropna(axis=1)\n", "# Keep only columns that have at least 90% non-NaN values (drop the rest)\n", "df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n", "df1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# References\n", "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019\n", "* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence\n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid." 
] } ], "metadata": { "celltoolbar": "Slideshow", "datacleaner": { "position": { "top": "50px" }, "python": { "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])" }, "window_display": false }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 4 }