mirror of https://github.com/gsi-upm/sitc synced 2024-11-17 20:12:28 +00:00
sitc/ml1/2_2_Read_Data.ipynb

2016-03-15 12:55:14 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")\n",
"\n",
"# Course Notes for Learning Intelligent Systems\n",
"\n",
"Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, © Carlos A. Iglesias\n",
"\n",
"## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"* [Reading Data](#Reading-Data)\n",
"* [Iris flower dataset](#Iris-flower-dataset)\n",
"* [References](#References)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading Data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this notebook is to learn how to read and load a sample dataset.\n",
"\n",
"Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, wine, etc.\n",
"\n",
"In this notebook we are going to use the Iris dataset."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Iris flower dataset"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
"\n",
"The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
"\n",
"![Iris](files/images/iris-dataset.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To read the dataset, we import the `datasets` module from scikit-learn and then load the Iris dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import datasets from scikit-learn\n",
"from sklearn import datasets\n",
"\n",
"# load iris dataset\n",
"iris = datasets.load_iris()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the `.data` member, which is a 2D array of shape (`n_samples`, `n_features`). In the case of a supervised problem, one or more response variables are stored in the `.target` member."
]
},
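{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the dictionary-like behaviour described above, the next cell lists the keys of the loaded object and checks that key access and attribute access give the same arrays. The dataset is reloaded so that the cell runs on its own; the exact set of keys may vary with the installed scikit-learn version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"# reload the dataset so that this cell is self-contained\n",
"iris = load_iris()\n",
"\n",
"# a Bunch behaves like a dictionary: list the available keys\n",
"print(sorted(iris.keys()))\n",
"\n",
"# key access and attribute access refer to the same objects\n",
"print(iris['data'] is iris.data)\n",
"print(iris['target'] is iris.target)"
]
},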
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the loaded dataset is of type 'Bunch'\n",
"type(iris)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# print description of the dataset\n",
"print(iris.DESCR)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# names of the features (attributes of the entities)\n",
"print(iris.feature_names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# names of the targets (classes of the classifier)\n",
"print(iris.target_names)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the data is stored in a NumPy array\n",
"type(iris.data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we are going to inspect the dataset. You can consult the NumPy tutorial listed in the references."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Data in the iris dataset: the feature values of every sample\n",
"print(iris.data)"
]
},
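{
"cell_type": "markdown",
"metadata": {},
"source": [
"The array can also be inspected with NumPy indexing and aggregation. The following cell is only a sketch: it reloads the dataset so that it runs on its own, and the name `X` is an illustrative alias for `iris.data`, not something used elsewhere in this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"X = iris.data  # illustrative alias for the feature matrix\n",
"\n",
"# first five samples (rows), all features (columns)\n",
"print(X[:5])\n",
"\n",
"# first ten values of the first feature (column 0)\n",
"print(iris.feature_names[0], X[:10, 0])\n",
"\n",
"# mean of each feature, computed along the sample axis\n",
"print(X.mean(axis=0))"
]
},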
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Target. Category of every sample\n",
"print(iris.target)"
]
},
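{
"cell_type": "markdown",
"metadata": {},
"source": [
"The targets are stored as integer codes. As a small sketch, the next cell counts the samples per class with `np.bincount` and translates the codes of the first few samples into species names by indexing `iris.target_names`. The dataset is reloaded so that the cell is self-contained; the variable names `counts`, `name` and `count` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import load_iris\n",
"\n",
"iris = load_iris()\n",
"\n",
"# number of samples per class (targets are integer codes 0, 1, 2)\n",
"counts = np.bincount(iris.target)\n",
"for name, count in zip(iris.target_names, counts):\n",
"    print(name, count)\n",
"\n",
"# translate the integer codes of the first five samples into species names\n",
"print(iris.target_names[iris.target[:5]])"
]
},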
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Iris data is a numpy array\n",
"# We can inspect its shape (rows, columns). In our case, (n_samples, n_features)\n",
"print(iris.data.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Using NumPy, we can print the number of dimensions (here we are working with a 2D matrix)\n",
"print(iris.data.ndim)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# We can print n_samples\n",
"print(iris.data.shape[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ... n_features\n",
"print(iris.data.shape[1])"
]
},
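{
"cell_type": "markdown",
"metadata": {},
"source": [
"When only the arrays are needed, `load_iris` can return them directly. The sketch below assumes the `return_X_y` parameter available in current scikit-learn releases and unpacks the shape into `n_samples` and `n_features`; the names `X` and `y` are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_iris\n",
"\n",
"# return_X_y=True returns the (data, target) pair instead of a Bunch\n",
"X, y = load_iris(return_X_y=True)\n",
"\n",
"# unpack the shape of the feature matrix\n",
"n_samples, n_features = X.shape\n",
"print(n_samples, n_features)\n",
"\n",
"# there is one target value per sample\n",
"print(y.shape)"
]
},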
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# names of the features\n",
"print(iris.feature_names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following sessions we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)\n",
"* [How to load an example dataset with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset)\n",
"* [Dataset loading utilities in scikit-learn](http://scikit-learn.org/stable/datasets/)\n",
"* [How to plot the Iris dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)\n",
"* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",
"* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"\n",
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).\n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.5"
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}