sitc/ml1/2_2_Read_Data.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](files/images/EscUpmPolit_p.gif \"UPM\")\n",
    "\n",
    "# Course Notes for Learning Intelligent Systems\n",
    "\n",
    "Department of Telematic Engineering Systems, Universidad Politécnica de Madrid, ©  Carlos A. Iglesias\n",
    "\n",
    "## [Introduction to Machine Learning](2_0_0_Intro_ML.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "* [Reading Data](#Reading-Data)\n",
    "* [Iris flower dataset](#Iris-flower-dataset)\n",
    "* [References](#References)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The goal of this notebook is to learn how to read and load a sample dataset.\n",
    "\n",
    "Scikit-learn comes with some bundled [datasets](https://scikit-learn.org/stable/datasets.html): iris, digits, wine, etc.\n",
    "\n",
    "In this notebook we are going to use the Iris dataset."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Iris flower dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The [Iris flower dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), available at the [UCI dataset repository](https://archive.ics.uci.edu/ml/datasets/Iris), is a classic dataset for classification.\n",
    "\n",
    "The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. Based on the combination of these four features, a machine learning model will learn to differentiate the species of Iris.\n",
    "\n",
    "![Iris](files/images/iris-dataset.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To read the dataset, we import the `datasets` module from scikit-learn and then load the Iris dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import datasets from scikit-learn\n",
    "from sklearn import datasets\n",
    "\n",
    "# load iris dataset\n",
    "iris = datasets.load_iris()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A dataset is a dictionary-like object that holds all the data and some metadata about the data. The data is stored in the `.data` member, which is a 2D array of shape (`n_samples`, `n_features`). In the case of a supervised problem, one or more response variables are stored in the `.target` member."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the dataset object is of type 'Bunch'\n",
    "type(iris)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print the description of the dataset\n",
    "print(iris.DESCR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of the features (attributes of the entities)\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of the targets (classes of the classifier)\n",
    "print(iris.target_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the data member is a NumPy array\n",
    "type(iris.data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we are going to inspect the dataset. You can consult the NumPy tutorial listed in the references."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data in the iris dataset: the feature values of every sample\n",
    "print(iris.data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target: the category of every sample\n",
    "print(iris.target)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Iris data is a numpy array\n",
    "# We can inspect its shape (rows, columns). In our case, (n_samples, n_features)\n",
    "print(iris.data.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using NumPy, I can print the number of dimensions (here we are working with a 2D matrix)\n",
    "print(iris.data.ndim)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# I can print n_samples\n",
    "print(iris.data.shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ... n_features\n",
    "print(iris.data.shape[1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of the features\n",
    "print(iris.feature_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following sessions we will learn how to load a dataset from a file (CSV, Excel, ...) using the pandas library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [Iris flower data set](https://en.wikipedia.org/wiki/Iris_flower_data_set)\n",
    "* [How to load an example dataset with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html#loading-example-dataset)\n",
    "* [Dataset loading utilities in scikit-learn](http://scikit-learn.org/stable/datasets/)\n",
    "* [How to plot the Iris dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html)\n",
    "* [An introduction to NumPy and Scipy](http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf)\n",
    "* [NumPy tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "©  Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.5"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
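
The notebook describes a dataset as a dictionary-like object whose `.data` member has shape (`n_samples`, `n_features`) and whose `.target` member holds the class of every sample. The sketch below illustrates that idea using only the Python standard library. `SimpleBunch` is a hypothetical, simplified stand-in written for this example (it is not scikit-learn's actual container), and the three rows are illustrative measurements, not the full 150-sample dataset.

```python
class SimpleBunch(dict):
    """Hypothetical stand-in for a dictionary-like dataset container."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails:
        # fall back to a dictionary lookup so keys read like attributes.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Three illustrative rows, one per species (assumption: toy data only).
toy = SimpleBunch(
    data=[
        [5.1, 3.5, 1.4, 0.2],
        [7.0, 3.2, 4.7, 1.4],
        [6.3, 3.3, 6.0, 2.5],
    ],
    target=[0, 1, 2],
    target_names=["setosa", "versicolor", "virginica"],
    feature_names=[
        "sepal length (cm)",
        "sepal width (cm)",
        "petal length (cm)",
        "petal width (cm)",
    ],
)

# Key access and attribute access return the same underlying object.
same_object = toy["data"] is toy.data
print(same_object)

# Shape as (n_samples, n_features), computed by hand for a list of lists.
shape = (len(toy.data), len(toy.data[0]))
print(shape)

# Map each integer target to its class name.
labels = [toy.target_names[t] for t in toy.target]
print(labels)
```

With the real object returned by `datasets.load_iris()`, the same access patterns apply, except that `.data` and `.target` are NumPy arrays, so the shape is read directly from `iris.data.shape`.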