Mirror of https://github.com/gsi-upm/sitc (synced 2026-03-03 02:08:17 +00:00)
Updated to Pandas 3.X and corrected typos
@@ -41,22 +41,22 @@
 "source": [
 "In the previous session, we learnt how to apply machine learning algorithms to the Iris dataset.\n",
 "\n",
-"We are going now to review the full process. As probably you have notice, data preparation, cleaning and transformation takes more than 90 % of data mining effort.\n",
+"We are going to review the full process now. As you probably have noticed, data preparation, cleaning, and transformation account for more than 90% of the data mining effort.\n",
 "\n",
 "The phases are:\n",
 "\n",
 "* **Data ingestion**: reading the data from the data lake\n",
 "* **Preprocessing**: \n",
-" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outlier, and resolve inconsistencies \n",
+" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outliers, and resolve inconsistencies \n",
 " * **Data integration**: Integrate multiple datasets\n",
-" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have mean of 0 and std of 1), transformation for smoothing a variable (e.g. square toot, ...), aggregation of data from several datasets\n",
-" * **Data reduction**: dimensionality reduction, clustering and sampling. \n",
+" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have a mean of 0 and std of 1), transformation for smoothing a variable (e.g., square root, ...), aggregation of data from several datasets\n",
+" * **Data reduction**: dimensionality reduction, clustering, and sampling. \n",
 " * **Data discretization**: for numerical values and algorithms that do not accept continuous variables\n",
-" * **Feature engineering**: selection of most relevant features, creation of new features and delete non relevant features\n",
+" * **Feature engineering**: selection of the most relevant features, creation of new features, and deletion of non-relevant features\n",
 " * Apply Sampling for dividing the dataset into training and test datasets.\n",
 "* **Machine learning**: apply machine learning algorithms and obtain an estimator, tuning its parameters.\n",
 "* **Evaluation** of the model\n",
-"* **Prediction**: use the model for new data."
+"* **Prediction**: Use the model for new data."
 ]
 },
 {
@@ -92,7 +92,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]
@@ -114,7 +114,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.12"
+"version": "3.12.2"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,
@@ -135,5 +135,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
 }
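
The markdown cell in the first hunk names normalization (rescaling to the range 0 to 1), standardisation (mean 0, std 1), and sampling into training and test datasets without showing them. The following is a minimal sketch of those three steps using only the Python standard library. It is not code from the notebook: the helper names (`min_max_normalize`, `standardize`, `split_train_test`), the sample values, and the choice of population standard deviation are all assumptions made for illustration.

```python
import random
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Avoid division by zero for a constant column.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def split_train_test(rows, test_ratio=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical measurements in cm (illustrative values, not real Iris rows).
sepal_lengths = [4.0, 5.0, 6.0, 7.0, 8.0]

normalized = min_max_normalize(sepal_lengths)
standardized = standardize(sepal_lengths)
train, test = split_train_test(sepal_lengths, test_ratio=0.2, seed=42)

print(normalized)
print(len(train), len(test))
```

In practice the scaling parameters (minimum, maximum, mean, std) would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data into preprocessing.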