{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## [Introduction to Preprocessing](00_Intro_Preprocessing.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Categorical Data\n", "\n", "Many ML algorithms require numeric input, so we need to transform categorical data into numbers.\n", "\n", "For example:\n", "* **'Sex'** with values *'M'*, *'F'*, *'Unknown'*.\n", "* **'Position'** with values *'phD'*, *'Professor'*, *'TA'*, *'graduate'*.\n", "* **'Temperature'** with values *'low'*, *'medium'*, *'high'*.\n", "\n", "There are two main approaches:\n", "* Integer encoding\n", "* One hot encoding" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Integer Encoding\n", "We assign a number to every value:\n", "\n", "['M', 'F', 'Unknown', 'M'] --> [0, 1, 2, 0]\n", "\n", "['phD', 'Professor', 'TA', 'graduate', 'phD'] --> [0, 1, 2, 3, 0]\n", "\n", "['low', 'medium', 'high', 'low'] --> [0, 1, 2, 0]\n", "\n", "The main problem with this representation is that integers have a natural order, so some ML algorithms may read the codes as a ranking (for example, 'TA' = 2 > 'Professor' = 1) that does not exist in the data.\n", "\n", "In our examples, this representation is suitable for **temperature**, whose values are ordered, but not for the other two." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## One Hot Encoding\n", "A binary column is created for each value of the categorical variable."
] }, { "cell_type": "raw", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [
"Sex                            M   F   U\n",
"-----                          ---------\n",
"M                              1   0   0\n",
"F         is transformed into  0   1   0\n",
"Unknown                        0   0   1\n",
"M                              1   0   0" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Transforming categorical data with Pandas and Scikit-Learn\n", "\n", "We can use:\n", "* Pandas **get_dummies()** (one hot encoding)\n", "* Scikit-Learn **LabelEncoder** (integer encoding) and **OneHotEncoder** (one hot encoding).\n", "\n", "We are going to see both approaches, starting with one hot encoding." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One Hot Encoding\n", "We can use Pandas (*get_dummies*) or Scikit-Learn (*OneHotEncoder*)." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"     Name  Age     Sex   Position\n",
"0  Marius   18    Male   graduate\n",
"1   Maria   19  Female  professor\n",
"2    John   20    Male         TA\n",
"3   Carla   30  Female        phD\n" ] } ], "source": [ "import pandas as pd\n", "\n", "# Small example DataFrame with two categorical columns: 'Sex' and 'Position'\n", "data = {\"Name\": [\"Marius\", \"Maria\", \"John\", \"Carla\"],\n", "        \"Age\": [18, 19, 20, 30],\n", "        \"Sex\": [\"Male\", \"Female\", \"Male\", \"Female\"],\n", "        \"Position\": [\"graduate\", \"professor\", \"TA\", \"phD\"]\n", "        }\n", "df = pd.DataFrame(data)\n", "print(df)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Age</th><th>sex_encoded</th><th>position_encoded</th><th>Sex_Female</th><th>Sex_Male</th><th>Position_TA</th><th>Position_graduate</th><th>Position_phD</th><th>Position_professor</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Marius</td><td>18</td><td>1</td><td>1</td><td>False</td><td>True</td><td>False</td><td>True</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>1</th><td>Maria</td><td>19</td><td>0</td><td>3</td><td>True</td><td>False</td><td>False</td><td>False</td><td>False</td><td>True</td></tr>\n",
"    <tr><th>2</th><td>John</td><td>20</td><td>1</td><td>0</td><td>False</td><td>True</td><td>True</td><td>False</td><td>False</td><td>False</td></tr>\n",
"    <tr><th>3</th><td>Carla</td><td>30</td><td>0</td><td>2</td><td>True</td><td>False</td><td>False</td><td>False</td><td>True</td><td>False</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [
"     Name  Age  sex_encoded  position_encoded  Sex_Female  Sex_Male  \\\n",
"0  Marius   18            1                 1       False      True  \n",
"1   Maria   19            0                 3        True     False  \n",
"2    John   20            1                 0       False      True  \n",
"3   Carla   30            0                 2        True     False  \n",
"\n",
"   Position_TA  Position_graduate  Position_phD  Position_professor  \n",
"0        False               True         False               False  \n",
"1        False              False         False                True  \n",
"2         True              False         False               False  \n",
"3        False              False          True               False  " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_onehot = pd.get_dummies(df, columns=['Sex', 'Position'])\n", "df_onehot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use *OneHotEncoder* from Scikit-Learn." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Sex_Female</th><th>Sex_Male</th><th>Position_TA</th><th>Position_graduate</th><th>Position_phD</th><th>Position_professor</th><th>Name</th><th>Age</th><th>sex_encoded</th><th>position_encoded</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>0.0</td><td>1.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>Marius</td><td>18</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>Maria</td><td>19</td><td>0</td><td>3</td></tr>\n",
"    <tr><th>2</th><td>0.0</td><td>1.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>John</td><td>20</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>Carla</td><td>30</td><td>0</td><td>2</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [
"   Sex_Female  Sex_Male  Position_TA  Position_graduate  Position_phD  \\\n",
"0         0.0       1.0          0.0                1.0           0.0  \n",
"1         1.0       0.0          0.0                0.0           0.0  \n",
"2         0.0       1.0          1.0                0.0           0.0  \n",
"3         1.0       0.0          0.0                0.0           1.0  \n",
"\n",
"   Position_professor    Name  Age  sex_encoded  position_encoded  \n",
"0                 0.0  Marius   18            1                 1  \n",
"1                 1.0   Maria   19            0                 3  \n",
"2                 0.0    John   20            1                 0  \n",
"3                 0.0   Carla   30            0                 2  " ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import OneHotEncoder\n", "from sklearn.compose import make_column_transformer\n", "\n", "# One-hot encode 'Sex' and 'Position'; pass the remaining columns through unchanged\n", "transformer = make_column_transformer(\n", "    (OneHotEncoder(), ['Sex', 'Position']),\n", "    remainder='passthrough',\n", "    verbose_feature_names_out=False)\n", "\n", "# Fit the encoder and transform the DataFrame\n", "transformed = transformer.fit_transform(df)\n", "\n", "# Rebuild a DataFrame with the generated column names\n", "df_onehotencoder = pd.DataFrame(\n", "    transformed,\n", "    columns=transformer.get_feature_names_out())\n", "df_onehotencoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas' *get_dummies* is easier for transforming DataFrames. *OneHotEncoder* remembers the categories seen during fitting and applies the same encoding to new data, which makes it a good fit for integrating the step in a machine learning pipeline." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Integer encoding\n", "We will use **LabelEncoder**, which assigns the integer codes following the sorted order of the values (for example, 'Female' is 0 and 'Male' is 1). It is possible to get the original values with *inverse_transform*. See [LabelEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).\n", "\n", "Note that the Scikit-Learn documentation describes *LabelEncoder* as intended for target labels; for input features, *OrdinalEncoder* provides the equivalent integer encoding." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Age</th><th>Sex</th><th>Position</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Marius</td><td>18</td><td>Male</td><td>graduate</td></tr>\n",
"    <tr><th>1</th><td>Maria</td><td>19</td><td>Female</td><td>professor</td></tr>\n",
"    <tr><th>2</th><td>John</td><td>20</td><td>Male</td><td>TA</td></tr>\n",
"    <tr><th>3</th><td>Carla</td><td>30</td><td>Female</td><td>phD</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [
"     Name  Age     Sex   Position\n",
"0  Marius   18    Male   graduate\n",
"1   Maria   19  Female  professor\n",
"2    John   20    Male         TA\n",
"3   Carla   30  Female        phD" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Create an instance of LabelEncoder\n", "labelencoder = LabelEncoder()\n", "# Note: df_encoded refers to the same object as df (it is not a copy),\n", "# so the columns added to df_encoded below are also added to df\n", "df_encoded = df\n", "# Known category values, for reference only: LabelEncoder infers the classes from the data\n", "sex_values = ('Male', 'Female')\n", "position_values = ('graduate', 'professor', 'TA', 'phD')\n", "df_encoded" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Age</th><th>Sex</th><th>Position</th><th>sex_encoded</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Marius</td><td>18</td><td>Male</td><td>graduate</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Maria</td><td>19</td><td>Female</td><td>professor</td><td>0</td></tr>\n",
"    <tr><th>2</th><td>John</td><td>20</td><td>Male</td><td>TA</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>Carla</td><td>30</td><td>Female</td><td>phD</td><td>0</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [
"     Name  Age     Sex   Position  sex_encoded\n",
"0  Marius   18    Male   graduate            1\n",
"1   Maria   19  Female  professor            0\n",
"2    John   20    Male         TA            1\n",
"3   Carla   30  Female        phD            0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_encoded['sex_encoded'] = labelencoder.fit_transform(df_encoded['Sex'])\n", "df_encoded" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Name</th><th>Age</th><th>Sex</th><th>Position</th><th>sex_encoded</th><th>position_encoded</th></tr>\n",
"  </thead>\n", "  <tbody>\n",
"    <tr><th>0</th><td>Marius</td><td>18</td><td>Male</td><td>graduate</td><td>1</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Maria</td><td>19</td><td>Female</td><td>professor</td><td>0</td><td>3</td></tr>\n",
"    <tr><th>2</th><td>John</td><td>20</td><td>Male</td><td>TA</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>3</th><td>Carla</td><td>30</td><td>Female</td><td>phD</td><td>0</td><td>2</td></tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [
"     Name  Age     Sex   Position  sex_encoded  position_encoded\n",
"0  Marius   18    Male   graduate            1                 1\n",
"1   Maria   19  Female  professor            0                 3\n",
"2    John   20    Male         TA            1                 0\n",
"3   Carla   30  Female        phD            0                 2" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_encoded['position_encoded'] = labelencoder.fit_transform(df_encoded['Position'])\n", "df_encoded" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "# References\n", "* [Cleaning and Prepping Data with Python for Data Science — Best Practices and Helpful Packages](https://medium.com/@rrfd/cleaning-and-prepping-data-with-python-for-data-science-best-practices-and-helpful-packages-af1edfbe2a3), DeFilippi, 2019.\n", "* [Binarizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Binarizer.html), Scikit-Learn." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Licence\n", "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n", "\n", "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
] } ], "metadata": { "celltoolbar": "Slideshow", "datacleaner": { "position": { "top": "50px" }, "python": { "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])" }, "window_display": false }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false } }, "nbformat": 4, "nbformat_minor": 4 }