{ "cells": [ { "cell_type": "markdown", "id": "849ad57e-6adb-4c2e-afd6-73db37eef572", "metadata": {}, "source": [ "![](images/EscUpmPolit_p.gif \"UPM\")" ] }, { "cell_type": "markdown", "id": "179cc802-9f1d-40b0-bf0c-9d4fb7ea1262", "metadata": {}, "source": [ "# Course Notes for Learning Intelligent Systems" ] }, { "cell_type": "markdown", "id": "9858d815-0390-4e77-a5ff-a8d2a1960981", "metadata": {}, "source": [ "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias" ] }, { "cell_type": "markdown", "id": "238bab60-75f0-4d29-ab05-66afc463b506", "metadata": {}, "source": [ "# AutoClean\n", "[AutoClean](https://github.com/elisemercury/AutoClean) is a simple library for cleaning data. It supports:\n", "\n", "* Handling of duplicates\n", "* Various imputation methods for missing values\n", "* Handling of outliers\n", "* Encoding of categorical data (OneHot, Label)\n", "* Extraction of datetime values\n", "\n", "Install the package with **pip install py-AutoClean**.\n", "\n", "Parameters:\n", "\n", "* **duplicates**\n", "    * default: False\n", "    * other values: 'auto', True\n", "* **missing_num**\n", "    * default: False\n", "    * other values: 'auto', 'linreg', 'knn', 'mean', 'median', 'most_frequent', 'delete', False\n", "* **missing_categ**\n", "    * default: False\n", "    * other values: 'auto', 'logreg', 'knn', 'most_frequent', 'delete', False\n", "* **encode_categ**\n", "    * default: False\n", "    * other values: 'auto', ['onehot'], ['label'], False; to encode only specific columns, add a list of column names or indexes, e.g. ['auto', ['col1', 2]]\n", "* **extract_datetime**\n", "    * default: False\n", "    * other values: 'auto', 'D', 'M', 'Y', 'h', 'm', 's'\n", "* **outliers**\n", "    * default: False\n", "    * other values: 'auto', 'winz', 'delete'\n", "* **outlier_param**\n", "    * default: 1.5\n", "    * other values: any int or float, False\n", "* **logfile**\n", "    * default: True\n", "    * other values: False\n", "* 
**verbose**\n", "    * default: False\n", "    * other values: True" ] }, { "cell_type": "code", "execution_count": 29, "id": "491b034b-994e-4f06-b4bc-df0590a62aab", "metadata": {}, "outputs": [ { "data": { "text/html": [ "

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen \"Carrie\" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |

891 rows × 12 columns

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | C128 | S | False | True | False | False | True |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 65.6344 | C85 | C | True | False | True | False | False |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | C128 | S | True | False | False | False | True |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | True | False | False | False | True |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | C128 | S | False | True | False | False | True |