sitc/ml21/preprocessing/04_Unknown_Values.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "![](images/EscUpmPolit_p.gif \"UPM\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# Course Notes for Learning Intelligent Systems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## [Introduction to  Preprocessing](00_Intro_Preprocessing.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Unknown values\n",
    "\n",
    "Two possible approaches are **remove** these rows or **fill** them. It depends on every case."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Filling NaN values\n",
    "If we need to fill errors or blanks, we can use the methods **fillna()** or **dropna()**.\n",
    "\n",
    "* For **string** fields, we can fill NaN with **' '**.\n",
    "\n",
    "* For **numbers**, we can fill with the **mean** or **median** value. \n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "# Fill NaN with ' '\n",
    "df['col'] = df['col'].fillna(' ')\n",
    "# Fill NaN with 99\n",
    "df['col'] = df['col'].fillna(99)\n",
    "# Fill NaN with the mean of the column\n",
    "df['col'] = df['col'].fillna(df['col'].mean())"
   ]
  },
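  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The snippet above is kept as raw text because `df['col']` is only a placeholder. A minimal runnable sketch of the same idea follows, assuming a small DataFrame `demo` whose column names and values are invented for illustration. It fills the text column with a blank and the numeric column with its median."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical data, invented for this illustration\n",
    "demo = pd.DataFrame({'name': ['a', np.nan, 'c'], 'score': [1.0, np.nan, 4.0]})\n",
    "# Fill missing strings with a blank\n",
    "demo['name'] = demo['name'].fillna(' ')\n",
    "# Fill missing numbers with the column median (NaN is ignored when computing it)\n",
    "demo['score'] = demo['score'].fillna(demo['score'].median())\n",
    "demo"
   ]
  },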
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Propagate non-null values forward or backward\n",
    "You can also **propagate** non-null values with these methods:\n",
    "\n",
    "* **ffill**: Fill values by propagating the last valid observation to the next valid.\n",
    "* **bfill**:  Fill values using the following valid observation to fill the gap.\n",
    "* **interpolate**:  Fill NaN values using interpolation.\n",
    "\n",
    "It will fill the next value in the dataframe with the previous non-NaN value. \n",
    "\n",
    "You may want to fill in one value (**limit=1**) or all the values. You can also indicate inplace=True to fill in-place."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame(data={'col1':[np.nan, np.nan, 2,3,4, np.nan, np.nan]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1\n",
       "0   NaN\n",
       "1   NaN\n",
       "2   2.0\n",
       "3   3.0\n",
       "4   4.0\n",
       "5   NaN\n",
       "6   NaN"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We fill forward the value 4.0 and fill the next one (limit = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1\n",
       "0   NaN\n",
       "1   NaN\n",
       "2   2.0\n",
       "3   3.0\n",
       "4   4.0\n",
       "5   4.0\n",
       "6   NaN"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    " df.ffill(limit = 1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.ffill()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "We can also backfilling with **bfill**. Since we do not include *limit*, we fill all the values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1\n",
       "0   2.0\n",
       "1   2.0\n",
       "2   2.0\n",
       "3   3.0\n",
       "4   4.0\n",
       "5   NaN\n",
       "6   NaN"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.bfill()"
   ]
  },
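  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The list of methods also mentions **interpolate**, which the cells above do not demonstrate. A minimal sketch follows, assuming a DataFrame `gaps` with an interior gap (invented for illustration, since `df` only has NaNs at its edges). With the default linear method, the interior NaNs are replaced by evenly spaced values between their valid neighbours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical data with an interior gap, invented for this illustration\n",
    "gaps = pd.DataFrame({'col1': [1.0, np.nan, np.nan, 4.0]})\n",
    "# Default method is linear: the two NaNs become 2.0 and 3.0\n",
    "gaps.interpolate()"
   ]
  },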
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Removing NaN values\n",
    "We can remove them by row or column (use inplace=True if you want to modify the DataFrame)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1\n",
       "2   2.0\n",
       "3   3.0\n",
       "4   4.0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Drop any rows which have any nans\n",
    "df1 = df.dropna()\n",
    "# Drop columns that have any nans (axis = 1 -> drop columns, axis = 0 -> drop rows)\n",
    "df2 = df.dropna(axis=1)\n",
    "# Only drop columns which have at least 90% non-NaNs \n",
    "df3 = df.dropna(thresh=int(df.shape[0] * .9), axis=1)\n",
    "df1"
   ]
  },
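  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since `df` has a single column, the `thresh` variant is easier to follow on a wider table. A sketch follows, assuming a DataFrame `wide` (invented for illustration) in which column `a` is complete and column `b` is mostly missing. Only columns with at least `thresh` non-NaN values survive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical data, invented for this illustration: 10 rows, column 'b' has 9 NaNs\n",
    "wide = pd.DataFrame({'a': range(10), 'b': [1.0] + [np.nan] * 9})\n",
    "# thresh is the minimum number of non-NaN values a column needs in order to be kept\n",
    "min_valid = int(wide.shape[0] * 0.9)\n",
    "kept = wide.dropna(thresh=min_valid, axis=1)\n",
    "# Column 'a' has 10 valid values and is kept; column 'b' has only 1 and is dropped\n",
    "list(kept.columns)"
   ]
  },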
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "# References\n",
    "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019, \n",
    "* [Data Preprocessing for Machine learning in Python, GeeksForGeeks](https://www.geeksforgeeks.org/data-preprocessing-machine-learning-python/)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n    print(_datacleaner.dataframe_metadata())\nexcept:\n    print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
Added preprocessing notebooks 2024-04-03 20:50:36 +00:00