mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-25 07:52:27 +00:00
315 lines
9.6 KiB
Plaintext
315 lines
9.6 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Course Notes for Learning Intelligent Systems"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © 2016 Carlos A. Iglesias"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Table of Contents\n",
|
|
"* [Preprocessing](#Preprocessing)\n",
|
|
"* [Training set and Test set](#Training-set-and-Test-set)\n",
|
|
"* [Preprocessing](#Preprocessing)\n",
|
|
"* [References](#References)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The goal of this notebook is to learn how separate the dataset into training and test datasets and then preprocess the data."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"metadata": {
|
|
"collapsed": true
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn import datasets\n",
|
|
"iris = datasets.load_iris()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Training set and Test set"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"A common practice in machine learning to evaluate an algorithm is to split the data at hand into two sets, one that we call the **training set** on which we learn data properties and one that we call the **testing set** on which we test these properties. \n",
|
|
"\n",
|
|
"We are going to use *scikit-learn* to split the data into random training and testing sets. We follow the ration 75% for training and 25% for testing. We use `random_state` to ensure that the result is always the same and it is reproducible. (Otherwise, we would get different training and testing sets every time)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"from sklearn.cross_validation import train_test_split\n",
|
|
"x_iris, y_iris = iris.data, iris.target\n",
|
|
"# Test set will be the 25% taken randomly\n",
|
|
"x_train, x_test, y_train, y_test = train_test_split(x_iris, y_iris, test_size=0.25, random_state=33)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"(112, 4) (38, 4)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Dimensions of train and testing\n",
|
|
"print(x_train.shape, x_test.shape)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[[ 5.7 2.9 4.2 1.3]\n",
|
|
" [ 6.7 3.1 4.4 1.4]\n",
|
|
" [ 4.7 3.2 1.6 0.2]\n",
|
|
" [ 6.5 2.8 4.6 1.5]\n",
|
|
" [ 6.1 2.6 5.6 1.4]\n",
|
|
" [ 6.3 3.3 6. 2.5]\n",
|
|
" [ 4.8 3.4 1.9 0.2]\n",
|
|
" [ 5.1 3.5 1.4 0.3]\n",
|
|
" [ 6.4 3.1 5.5 1.8]\n",
|
|
" [ 6.9 3.2 5.7 2.3]\n",
|
|
" [ 6.8 3.2 5.9 2.3]\n",
|
|
" [ 4.4 3. 1.3 0.2]\n",
|
|
" [ 6.3 3.4 5.6 2.4]\n",
|
|
" [ 6.1 2.9 4.7 1.4]\n",
|
|
" [ 6.9 3.1 5.1 2.3]\n",
|
|
" [ 6.4 2.9 4.3 1.3]\n",
|
|
" [ 6. 3. 4.8 1.8]\n",
|
|
" [ 5.2 3.5 1.5 0.2]\n",
|
|
" [ 6.3 3.3 4.7 1.6]\n",
|
|
" [ 7.2 3.2 6. 1.8]\n",
|
|
" [ 4.9 3.1 1.5 0.1]\n",
|
|
" [ 5.7 3.8 1.7 0.3]\n",
|
|
" [ 6.5 3. 5.8 2.2]\n",
|
|
" [ 4.8 3. 1.4 0.1]\n",
|
|
" [ 6. 2.2 5. 1.5]\n",
|
|
" [ 6.2 2.8 4.8 1.8]\n",
|
|
" [ 6.1 3. 4.6 1.4]\n",
|
|
" [ 6.1 2.8 4. 1.3]\n",
|
|
" [ 6.5 3. 5.2 2. ]\n",
|
|
" [ 5.9 3. 5.1 1.8]\n",
|
|
" [ 5.6 2.7 4.2 1.3]\n",
|
|
" [ 6.7 3.1 4.7 1.5]\n",
|
|
" [ 5.6 2.8 4.9 2. ]\n",
|
|
" [ 6.4 3.2 5.3 2.3]\n",
|
|
" [ 6.7 3.1 5.6 2.4]\n",
|
|
" [ 6.7 3. 5.2 2.3]\n",
|
|
" [ 5.8 2.7 5.1 1.9]\n",
|
|
" [ 5.7 3. 4.2 1.2]]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"#Test set\n",
|
|
"print (x_test)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Standardization of datasets is a common requirement for many machine learning estimators implemented in the scikit; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.\n",
|
|
"\n",
|
|
"The preprocessing module further provides a utility class `StandardScaler` to compute the mean and standard deviation on a training set. Later, the same transformation will be applied on the testing set."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Standardize the features\n",
|
|
"from sklearn import preprocessing\n",
|
|
"scaler = preprocessing.StandardScaler().fit(x_train)\n",
|
|
"x_train = scaler.transform(x_train)\n",
|
|
"x_test = scaler.transform(x_test)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"collapsed": false
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[[-0.09752318 -0.32858743 0.34599443 0.25682671]\n",
|
|
" [ 1.06445511 0.09442168 0.45718919 0.39124069]\n",
|
|
" [-1.25950146 0.30592623 -1.09953753 -1.22172707]\n",
|
|
" [ 0.83205945 -0.54009199 0.56838396 0.52565467]\n",
|
|
" [ 0.36726814 -0.9631011 1.12435779 0.39124069]\n",
|
|
" [ 0.59966379 0.51743079 1.34674732 1.86979447]\n",
|
|
" [-1.14330363 0.72893534 -0.93274538 -1.22172707]\n",
|
|
" [-0.79471015 0.9404399 -1.2107323 -1.08731309]\n",
|
|
" [ 0.71586162 0.09442168 1.06876041 0.92889661]\n",
|
|
" [ 1.29685076 0.30592623 1.17995517 1.60096651]\n",
|
|
" [ 1.18065293 0.30592623 1.29114994 1.60096651]\n",
|
|
" [-1.60809495 -0.11708288 -1.26632968 -1.22172707]\n",
|
|
" [ 0.59966379 0.72893534 1.12435779 1.73538049]\n",
|
|
" [ 0.36726814 -0.32858743 0.62398134 0.39124069]\n",
|
|
" [ 1.29685076 0.09442168 0.84637087 1.60096651]\n",
|
|
" [ 0.71586162 -0.32858743 0.40159181 0.25682671]\n",
|
|
" [ 0.25107031 -0.11708288 0.67957873 0.92889661]\n",
|
|
" [-0.67851232 0.9404399 -1.15513491 -1.22172707]\n",
|
|
" [ 0.59966379 0.51743079 0.62398134 0.66006865]\n",
|
|
" [ 1.64544425 0.30592623 1.34674732 0.92889661]\n",
|
|
" [-1.0271058 0.09442168 -1.15513491 -1.35614105]\n",
|
|
" [-0.09752318 1.57495356 -1.04394015 -1.08731309]\n",
|
|
" [ 0.83205945 -0.11708288 1.23555256 1.46655253]\n",
|
|
" [-1.14330363 -0.11708288 -1.2107323 -1.35614105]\n",
|
|
" [ 0.25107031 -1.80911932 0.79077349 0.52565467]\n",
|
|
" [ 0.48346596 -0.54009199 0.67957873 0.92889661]\n",
|
|
" [ 0.36726814 -0.11708288 0.56838396 0.39124069]\n",
|
|
" [ 0.36726814 -0.54009199 0.23479966 0.25682671]\n",
|
|
" [ 0.83205945 -0.11708288 0.90196826 1.19772457]\n",
|
|
" [ 0.13487248 -0.11708288 0.84637087 0.92889661]\n",
|
|
" [-0.21372101 -0.75159654 0.34599443 0.25682671]\n",
|
|
" [ 1.06445511 0.09442168 0.62398134 0.52565467]\n",
|
|
" [-0.21372101 -0.54009199 0.73517611 1.19772457]\n",
|
|
" [ 0.71586162 0.30592623 0.95756564 1.60096651]\n",
|
|
" [ 1.06445511 0.09442168 1.12435779 1.73538049]\n",
|
|
" [ 1.06445511 -0.11708288 0.90196826 1.60096651]\n",
|
|
" [ 0.01867465 -0.75159654 0.84637087 1.06331059]\n",
|
|
" [-0.09752318 -0.11708288 0.34599443 0.12241273]]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# As we see, the iris dataset is now normalized\n",
|
|
"print(x_test)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [Feature selection](http://scikit-learn.org/stable/modules/feature_selection.html)\n",
|
|
"* [Classification probability](http://scikit-learn.org/stable/auto_examples/classification/plot_classification_probability.html)\n",
|
|
"* [Mastering Pandas](http://proquest.safaribooksonline.com/book/programming/python/9781783981960), Femi Anthony, Packt Publishing, 2015.\n",
|
|
"* [Matplotlib web page](http://matplotlib.org/index.html)\n",
|
|
"* [Using matlibplot in IPython](http://ipython.readthedocs.org/en/stable/interactive/plotting.html)\n",
|
|
"* [Seaborn Tutorial](https://stanford.edu/~mwaskom/software/seaborn/tutorial.html)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Licences\n",
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© 2016 Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.1+"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 0
|
|
}
|