mirror of https://github.com/gsi-upm/sitc synced 2024-11-18 04:22:28 +00:00
sitc/ml1/2_4_Preprocessing.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Course Notes for Learning Intelligent Systems"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Preprocessing](#Preprocessing)\n",
"* [Training set and Test set](#Training-set-and-Test-set)\n",
"* [Preprocessing](#Preprocessing)\n",
"* [References](#References)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this notebook is to learn how to split a dataset into training and test sets and then preprocess the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import datasets\n",
"iris = datasets.load_iris()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training set and Test set"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common practice for evaluating a machine learning algorithm is to split the available data into two sets: the **training set**, on which we learn data properties, and the **testing set**, on which we test these properties.\n",
"\n",
"We are going to use *scikit-learn* to split the data into random training and testing sets, with 75% of the samples for training and 25% for testing. We set `random_state` so that the split is reproducible; otherwise, we would get different training and testing sets on every run."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"x_iris, y_iris = iris.data, iris.target\n",
"# Test set will be the 25% taken randomly\n",
"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Dimensions of the training and testing sets\n",
"print(x_train.shape, x_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Test set\n",
"print(x_test)"
]
},
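{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the `random_state` behaviour described above, the following cell repeats the split twice with the same seed and once with a different one, then compares the resulting test sets. The variable names (`a_test`, `b_test`, `c_test`) and the second seed value `7` are illustrative assumptions, not part of the original notebook. The data is reloaded so that `x_train` and `x_test` are left untouched."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import datasets\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Reload the data so this check does not depend on earlier cells\n",
"x, y = datasets.load_iris(return_X_y=True)\n",
"\n",
"# train_test_split returns [x_train, x_test, y_train, y_test]; index 1 is the test features\n",
"a_test = train_test_split(x, y, test_size=0.25, random_state=33)[1]\n",
"b_test = train_test_split(x, y, test_size=0.25, random_state=33)[1]\n",
"# Arbitrary different seed, chosen only for illustration\n",
"c_test = train_test_split(x, y, test_size=0.25, random_state=7)[1]\n",
"\n",
"print('Same seed, identical test set:', np.array_equal(a_test, b_test))\n",
"print('Different seed, identical test set:', np.array_equal(a_test, c_test))"
]
},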
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not look more or less like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
"\n",
"The `preprocessing` module provides a utility class, `StandardScaler`, that computes the mean and standard deviation on the training set. The same transformation is then applied to the testing set, so no statistics are learned from the test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standardize the features\n",
"from sklearn import preprocessing\n",
"scaler = preprocessing.StandardScaler().fit(x_train)\n",
"x_train = scaler.transform(x_train)\n",
"x_test = scaler.transform(x_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The test features are now standardized with the training-set mean and standard deviation\n",
"print(x_test)"
]
},
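{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the effect of `StandardScaler` concrete, the sketch below prints the per-feature mean and standard deviation after scaling. It repeats the split and the fit under separate, hypothetical names (`x_tr`, `x_te`, `check_scaler`) so that the variables used in the rest of the notebook are not modified. Because the scaler is fitted on the training set only, the training features should come out with mean close to 0 and standard deviation close to 1, while the test features will generally be near those values but not exactly equal to them."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import datasets, preprocessing\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Rebuild the same split under separate names (illustrative only)\n",
"x, y = datasets.load_iris(return_X_y=True)\n",
"x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=33)\n",
"\n",
"# Fit on the training set only, then transform both sets\n",
"check_scaler = preprocessing.StandardScaler().fit(x_tr)\n",
"x_tr_std = check_scaler.transform(x_tr)\n",
"x_te_std = check_scaler.transform(x_te)\n",
"\n",
"# Per-feature statistics after scaling\n",
"print('train mean:', np.round(x_tr_std.mean(axis=0), 3))\n",
"print('train std: ', np.round(x_tr_std.std(axis=0), 3))\n",
"print('test mean: ', np.round(x_te_std.mean(axis=0), 3))\n",
"print('test std:  ', np.round(x_te_std.std(axis=0), 3))"
]
},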
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
"* [Using Matplotlib in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Licences\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.6"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}