mirror of https://github.com/gsi-upm/sitc synced 2026-03-03 02:08:17 +00:00

Updated to Pandas 3.X and corrected typos

This commit is contained in:
cif
2026-03-02 17:40:58 +01:00
parent 5c440527ac
commit 65da5ae714
8 changed files with 105 additions and 135 deletions


@@ -41,22 +41,22 @@
"source": [
"In the previous session, we learnt how to apply machine learning algorithms to the Iris dataset.\n",
"\n",
-"We are going now to review the full process. As probably you have notice, data preparation, cleaning and transformation takes more than 90 % of data mining effort.\n",
+"We are going to review the full process now. As you probably have noticed, data preparation, cleaning, and transformation account for more than 90% of the data mining effort.\n",
"\n",
"The phases are:\n",
"\n",
"* **Data ingestion**: reading the data from the data lake\n",
"* **Preprocessing**: \n",
-" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outlier, and resolve inconsistencies \n",
+" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outliers, and resolve inconsistencies \n",
" * **Data integration**: Integrate multiple datasets\n",
-" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have mean of 0 and std of 1), transformation for smoothing a variable (e.g. square toot, ...), aggregation of data from several datasets\n",
-" * **Data reduction**: dimensionality reduction, clustering and sampling. \n",
+" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have a mean of 0 and std of 1), transformation for smoothing a variable (e.g., square root, ...), aggregation of data from several datasets\n",
+" * **Data reduction**: dimensionality reduction, clustering, and sampling. \n",
" * **Data discretization**: for numerical values and algorithms that do not accept continuous variables\n",
-" * **Feature engineering**: selection of most relevant features, creation of new features and delete non relevant features\n",
+" * **Feature engineering**: selection of the most relevant features, creation of new features, and deletion of non-relevant features\n",
" * **Sampling**: divide the dataset into training and test datasets.\n",
"* **Machine learning**: apply machine learning algorithms and obtain an estimator, tuning its parameters.\n",
"* **Evaluation** of the model\n",
-"* **Prediction**: use the model for new data."
+"* **Prediction**: Use the model for new data."
]
},
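The transformation step described in this cell (normalization into [0, 1] and standardisation to mean 0 and std 1) can be sketched directly in pandas; the column name and values below are illustrative assumptions, not data from the notebook:

```python
import pandas as pd

# Hypothetical sample column for illustration.
df = pd.DataFrame({"sepal_length": [4.9, 5.1, 6.3, 7.0]})
col = df["sepal_length"]

# Normalization: rescale numeric values into [0, 1].
normalized = (col - col.min()) / (col.max() - col.min())

# Standardisation: rescale to a mean of 0 and std of 1
# (pandas .std() uses the sample estimator, ddof=1).
standardized = (col - col.mean()) / col.std()
```

After this, `normalized` spans exactly 0 to 1 and `standardized` has zero mean and unit standard deviation; scikit-learn offers the same operations as `MinMaxScaler` and `StandardScaler`.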
{
@@ -92,7 +92,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
@@ -114,7 +114,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
-"version": "3.8.12"
+"version": "3.12.2"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
@@ -135,5 +135,5 @@
}
},
"nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
}