Compare commits: 08d2621a98...master (17 commits)

Commits in range: 7ecea5343e, d485774488, 65da5ae714, 5c440527ac, 722da8fc6c, 8e9d3cfdad, f09997743d, 8c82c6fbcd, ff76909b87, 5d01f26e72, 8e177963af, ecd53ceee5, 99f8032d05, 470a3d692d, d66bb25245, f10e6650c9, c9d16bdda3
lod/01_1_SPARQL_Server.ipynb (new file, 157 lines added)

@@ -0,0 +1,157 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<header style=\"width:100%;position:relative\">\n",
    " <div style=\"width:80%;float:right;\">\n",
    " <h1>Course Notes for Learning Intelligent Systems</h1>\n",
    " <h2>Department of Telematic Engineering Systems</h2>\n",
    " <h3>Universidad Politécnica de Madrid. © Carlos A. Iglesias </h3>\n",
    " </div>\n",
    " <img style=\"width:15%;\" src=\"https://github.com/gsi-upm/sitc/blob/9844820e6653b0e169113a06538f8e54554c4fbc/images/EscUpmPolit_p.gif?raw=true\" alt=\"UPM\" />\n",
    "</header>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This lecture explains how to run the Fuseki SPARQL server using Docker."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Installing Fuseki with Docker\n",
    "This section is taken from [[1](#1), [2](#2)].\n",
    "\n",
    "## Install Docker if not installed\n",
    "You should have **Docker** installed. If not, refer to the [Docker install guide](https://docs.docker.com/engine/install/).\n",
    "\n",
    "## Install the Fuseki image\n",
    "In a terminal, run\n",
    "```\n",
    "docker run -p 3030:3030 --name fuseki -e ADMIN_PASSWORD=fuseki -it stain/jena-fuseki\n",
    "```\n",
    "You can change the admin password or the published ports to whatever you prefer.\n",
    "\n",
    "You should see the server logs in the terminal.\n",
    "\n",
    "\n",
    "\n",
    "## Access the admin UI\n",
    "Now open a browser at [http://localhost:3030](http://localhost:3030) and log in as user *admin* with password *fuseki* (or the password you set earlier).\n",
    "\n",
    "\n",
    "\n",
    "## Load data\n",
    "Download this dataset to your computer:\n",
    "\n",
    "[Beatles dataset](https://github.com/gsi-upm/sitc/blob/master/lod/BeatlesMusicians.ttl).\n",
    "\n",
    "At the bottom of the UI you will see 'No datasets created - add one'. Click on *add one*. Set *beatles* as the dataset name and *in-memory* as the dataset type.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Click the *create dataset* button.\n",
    "\n",
    "\n",
    "\n",
    "Click the *add data* button. In the new screen, click *select files*, choose the file you previously downloaded, and click the *upload now* button.\n",
    "\n",
    "\n",
    "\n",
    "Now open the *query* tab.\n",
    "\n",
    "\n",
    "\n",
    "You will see a generic query that lists triples; click the *Play* button (the triangle to the right of the SPARQL query).\n",
    "\n",
    "Scroll down to see the query results.\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Congratulations! You have a SPARQL server running and serving a dataset!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* <a id=\"1\">[1]</a> [stain/jena-fuseki image on Docker Hub](https://hub.docker.com/r/stain/jena-fuseki)\n",
    "* <a id=\"2\">[2]</a> [SPARQL queries of Beatles recording sessions](http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by-sa/2.0/). \n",
    "\n",
    "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "datacleaner": {
   "position": {
    "top": "50px"
   },
   "python": {
    "varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
   },
   "window_display": false
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
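Once the Fuseki container from the steps above is running and the *beatles* dataset has been created, the server can also be queried programmatically over the standard SPARQL Protocol, not only through the web UI. A minimal sketch using only the Python standard library (the endpoint path `/beatles/sparql` follows Fuseki's per-dataset service naming; adjust the dataset name and port if you changed them):

```python
import json
import urllib.parse
import urllib.request

# Fuseki serves each dataset at /<dataset>/sparql (here: the "beatles" dataset)
ENDPOINT = "http://localhost:3030/beatles/sparql"

def sparql_select(endpoint, query):
    """Send a SELECT query via a SPARQL Protocol GET request and return the bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# The generic "list triples" query shown in the Fuseki UI
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
# rows = sparql_select(ENDPOINT, query)  # requires the server to be running
```

The same request works with `curl` or any HTTP client, since the SPARQL Protocol is plain HTTP.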
@@ -7,7 +7,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "7276f055a8c504d3c80098c62ed41a4f",
+"checksum": "91ddca8e4ff354f86a8bdbfee755a886",
 "grade": false,
 "grade_id": "cell-0bfe38f97f6ab2d2",
 "locked": true,
@@ -19,10 +19,10 @@
 "<header style=\"width:100%;position:relative\">\n",
 " <div style=\"width:80%;float:right;\">\n",
 " <h1>Course Notes for Learning Intelligent Systems</h1>\n",
-" <h3>Department of Telematic Engineering Systems</h3>\n",
+" <h2>Department of Telematic Engineering Systems</h2>\n",
-" <h5>Universidad Politécnica de Madrid</h5>\n",
+" <h3>Universidad Politécnica de Madrid</h3>\n",
 " </div>\n",
-" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
+" <img style=\"width:15%;\" src=\"https://github.com/gsi-upm/sitc/blob/9844820e6653b0e169113a06538f8e54554c4fbc/images/EscUpmPolit_p.gif?raw=true\" alt=\"UPM\" />\n",
 "</header>"
 ]
 },
@@ -58,7 +58,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "40ccd05ad0704781327031a84dfb9939",
+"checksum": "5760eec341771cfd3ebd91627da0a481",
 "grade": false,
 "grade_id": "cell-4f8492996e74bf20",
 "locked": true,
@@ -71,10 +71,10 @@
 "\n",
 "* This notebook\n",
 "* External SPARQL editors (optional)\n",
-" * YASGUI-GSI http://yasgui.gsi.upm.es\n",
+" * *YASGUI* https://yasgui.org/\n",
-" * DBpedia virtuoso http://dbpedia.org/sparql\n",
+" * *DBpedia Virtuoso* http://dbpedia.org/sparql\n",
 "\n",
-"Using the YASGUI-GSI editor has several advantages over other options.\n",
+"Using the YASGUI editor has several advantages over other options.\n",
 "It features:\n",
 "\n",
 "* Selection of data source, either by specifying the URL or by selecting from a dropdown menu\n",
@@ -96,7 +96,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "81894e9d65e5dd9f3b6e1c5f66804bf6",
+"checksum": "b1285fb82e4438a22e05ff134d1e080d",
 "grade": false,
 "grade_id": "cell-70ac24910356c3cf",
 "locked": true,
@@ -107,13 +107,15 @@
 "source": [
 "## Instructions\n",
 "\n",
-"We will be using a semantic server, available at: http://fuseki.gsi.upm.es/sitc.\n",
+"We will use an available semantic server. There are two possible settings:\n",
+"* A shared server available at [http://fuseki.gsi.upm.es/sitc](http://fuseki.gsi.upm.es/sitc/)\n",
+"* Install the SPARQL server yourself and run it locally. To do this, follow the instructions in the notebook [how to install a SPARQL Server](01_1_SPARQL_Server.ipynb). In this case, your server is available at [http://localhost:3030](http://localhost:3030).\n",
 "\n",
 "This server contains a dataset about [Beatles songs](http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html), which we will query with SPARQL.\n",
 "\n",
-"We will provide you some example code to get you started, the *question* you will have to answer using SPARQL, a template for the answer.\n",
+"We will provide you with some example code to get you started, along with the *question* you will have to answer using SPARQL and a template for the answer.\n",
 "\n",
-"After every query, you will find some python code to test the results of the query.\n",
+"After every query, you will find some Python code to test the results of the query.\n",
 "**Make sure you've run the tests before moving to the next exercise**.\n",
 "If the test gives you an error, you've probably done something wrong.\n",
 "You do not need to understand or modify the test code."
@@ -126,7 +128,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "1d332d3d11fd6b57f0ec0ac3c358c6cb",
+"checksum": "a5918241a0b402cb091a85d245aaa3fd",
 "grade": false,
 "grade_id": "cell-eb13908482825e42",
 "locked": true,
@@ -138,7 +140,7 @@
 "For convenience, the examples in the notebook are executable (using the `%%sparql` magic command), and they are accompanied by some code to test the results.\n",
 "If the tests pass, you probably got the answer right.\n",
 "\n",
-"**Run this line to enable the `%%sparql` magic command.**"
+"### Run this line to enable the `%%sparql` magic command"
 ]
 },
 {
@@ -200,7 +202,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "34710d3bb8e2cf826833a43adb7fb448",
+"checksum": "77cd823ca5a1556311f3dcf7e1533bce",
 "grade": false,
 "grade_id": "cell-2a44c0da2c206d01",
 "locked": true,
@@ -216,10 +218,23 @@
 "In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.\n",
 "Some examples are:\n",
 "\n",
-"* DBpedia's virtuoso query editor https://dbpedia.org/sparql\n",
+"* DBpedia's Virtuoso query editor https://dbpedia.org/sparql\n",
-"* A javascript based client hosted at GSI: http://yasgui.gsi.upm.es/\n",
+"* A JavaScript-based client: https://yasgui.org/\n",
 "\n",
-"[^1]: http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html"
+"[^1]: http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html\n",
+"\n",
+"### Set your SPARQL server"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"## Set your Fuseki server URL\n",
+"\n",
+"fuseki = 'http://localhost:3030/beatles/sparql'"
 ]
 },
 {
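The `$fuseki` reference in the `%%sparql` cells below relies on the magic expanding `$name` arguments from the notebook's namespace before contacting the server, so one variable switches every query between the shared and the local endpoint. A minimal sketch of that expansion step (a hypothetical helper illustrating the idea; the actual `%%sparql` magic's implementation may differ):

```python
import re

def expand_magic_line(line, user_ns):
    """Replace $name references in a magic's argument line with values
    from the user namespace, as IPython-style magics commonly do."""
    return re.sub(r"\$(\w+)", lambda m: str(user_ns[m.group(1)]), line)

# The notebook sets: fuseki = 'http://localhost:3030/beatles/sparql'
user_ns = {"fuseki": "http://localhost:3030/beatles/sparql"}
endpoint = expand_magic_line("$fuseki", user_ns)
# endpoint now holds the URL the query will be sent to
```

Changing `fuseki` in one cell then redirects every `%%sparql $fuseki` query without editing each cell.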
@@ -250,7 +265,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "markdown",
-"checksum": "f7428fe79cd33383dfd3b09a0d951b6e",
+"checksum": "1d6bd8b03621fb0b2d74c80b91e9b91a",
 "grade": false,
 "grade_id": "cell-8391a5322a9ad4a7",
 "locked": true,
@@ -259,7 +274,7 @@
 }
 },
 "source": [
-"#### First select - Exploring the dataset\n",
+"### First select - Exploring the dataset\n",
 "\n"
 ]
 },
@@ -324,7 +339,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "3bc71f851a33fa401d18ea3ab02cf61f",
+"checksum": "3b3a78b2676844ff59d3555a5f7690a0",
 "grade": false,
 "grade_id": "cell-8ce8c954513f17e7",
 "locked": true,
@@ -334,7 +349,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "SELECT ?entity ?type\n",
 "WHERE {\n",
@@ -388,7 +403,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "65be7168bedb4f6dc2f19e2138bab232",
+"checksum": "1879bf78f30a1ec22499eec5f975803d",
 "grade": false,
 "grade_id": "cell-6e904d692b5facad",
 "locked": false,
@@ -398,7 +413,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "SELECT ?entity ?prop\n",
 "WHERE {\n",
@@ -454,7 +469,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "SELECT ?type\n",
 "WHERE {\n",
@@ -479,7 +494,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "SELECT DISTINCT ?type\n",
 "WHERE {\n",
@@ -522,7 +537,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "35563ff455c7e8b1c91f61db97b2011b",
+"checksum": "73cbbe78a138e5f927f18155106c0761",
 "grade": false,
 "grade_id": "cell-e615f9a77c4bc9a5",
 "locked": false,
@@ -532,7 +547,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "SELECT DISTINCT ?property\n",
 "WHERE {\n",
@@ -585,7 +600,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
@@ -655,7 +670,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "069811507dbac4b86dc5d3adc82ba4ec",
+"checksum": "79d1ab4722a04ee11b1f892cd4434510",
 "grade": false,
 "grade_id": "cell-0223a51f609edcf9",
 "locked": false,
@@ -665,7 +680,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
@@ -725,7 +740,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "b68a279085a1ed087f5e474a6602299e",
+"checksum": "47ad334640472eeec3d52ee87035fc60",
 "grade": false,
 "grade_id": "cell-8f43547dd788bb33",
 "locked": false,
@@ -735,7 +750,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
@@ -812,7 +827,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "335403f01e484ce5563ff059e9764ff4",
+"checksum": "ac6cb33399d7efab311bd58479a75929",
 "grade": false,
 "grade_id": "cell-a0f0b9d9b05c9631",
 "locked": false,
@@ -822,7 +837,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
@@ -891,7 +906,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "8fb253675d2e8510e2c6780b960721e5",
+"checksum": "d2ed010ea9127e45edc81b2c1a8c94d9",
 "grade": false,
 "grade_id": "cell-523b963fa4e288d0",
 "locked": false,
@@ -901,7 +916,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
@@ -971,7 +986,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "c7b6620f5ba28b482197ab693cb7142a",
+"checksum": "390eea10127829419f4026f292d907ad",
 "grade": false,
 "grade_id": "cell-e89d08031e30b299",
 "locked": false,
@@ -981,7 +996,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX m: <http://learningsparql.com/ns/musician/>\n",
@@ -1039,7 +1054,7 @@
 "\n",
 "Once results are grouped, they can be aggregated using any aggregation function, such as `COUNT`.\n",
 "\n",
-"Using `GROUP BY` and `COUNT`, get the count of songs in which Ringo Starr has played each of the instruments:"
+"Using `GROUP BY` and `COUNT`, get the count of songs that use each instrument:"
 ]
 },
 {
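The `GROUP BY`/`COUNT` pattern described in the hunk above can be sketched as a query of this general shape (a sketch only: `s:instrument` is a hypothetical property name following the `s:` schema prefix used in the surrounding cells, and the real dataset's property may differ):

```sparql
PREFIX s: <http://learningsparql.com/ns/schema/>

# Count how many songs mention each instrument, most frequent first
SELECT ?instrument (COUNT(?song) AS ?total)
WHERE {
  ?song s:instrument ?instrument .
}
GROUP BY ?instrument
ORDER BY DESC(?total)
```

Every variable in the `SELECT` clause of a grouped query must either appear in `GROUP BY` or be wrapped in an aggregate such as `COUNT`.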
@@ -1049,7 +1064,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "7556bacb20c1fbd059dec165c982908d",
|
"checksum": "899b2cf4ee7d010e5e5d02ca28ead13d",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-1429e4eb5400dbc7",
|
"grade_id": "cell-1429e4eb5400dbc7",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1059,7 +1074,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX m: <http://learningsparql.com/ns/musician/>\n",
|
"PREFIX m: <http://learningsparql.com/ns/musician/>\n",
|
||||||
@@ -1123,7 +1138,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
||||||
@@ -1156,7 +1171,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "3139d9b7e620266946ffe1ae0cf67581",
|
"checksum": "24f94cc322288f4c467af5bfadd6a4c9",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-ee208c762d00da9c",
|
"grade_id": "cell-ee208c762d00da9c",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1166,7 +1181,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
||||||
@@ -1228,7 +1243,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
||||||
@@ -1263,7 +1278,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
||||||
@@ -1297,7 +1312,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "3bc508872193750d57d07efbf334c212",
|
"checksum": "513916421ec8451b8a10a6923fde775b",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-dcd68c45c1608a28",
|
"grade_id": "cell-dcd68c45c1608a28",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1307,7 +1322,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
||||||
@@ -1381,7 +1396,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "300df0a3cf9729dd4814b3153b2fedb4",
|
"checksum": "7b6c7de8beb6cd78b99fbc8eaa1b2c87",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-0c7cc924a13d792a",
|
"grade_id": "cell-0c7cc924a13d792a",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1391,7 +1406,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
"PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
|
||||||
@@ -1456,7 +1471,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "e4e898c8a16b8aa5865dfde2f6e68ec6",
|
"checksum": "f4b8b1601e9cd1c05464ebad8e6836a6",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-d750b6d64c6aa0a7",
|
"grade_id": "cell-d750b6d64c6aa0a7",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1466,7 +1481,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
@@ -1521,7 +1536,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "fade6ab714376e0eabfa595dd6bd6a8b",
|
"checksum": "b3a570ae42656d907a1ce60f199fdbec",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-2f5aa516f8191787",
|
"grade_id": "cell-2f5aa516f8191787",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1531,7 +1546,7 @@
|
|||||||
},
|
},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
|
"%%sparql $fuseki\n",
|
||||||
"\n",
|
"\n",
|
||||||
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
|
||||||
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
"PREFIX s: <http://learningsparql.com/ns/schema/>\n",
|
||||||
@@ -1577,7 +1592,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"### Which songs had Ringo in drums OR Lennon in lead vocals? (UNION)"
|
"### Which songs had Ringo in dums OR Lennon in lead vocals? (UNION)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1612,7 +1627,7 @@
|
|||||||
"deletable": false,
|
"deletable": false,
|
||||||
"nbgrader": {
|
"nbgrader": {
|
||||||
"cell_type": "code",
|
"cell_type": "code",
|
||||||
"checksum": "09262d81449c498c37e4b9d9b1dcdfed",
|
"checksum": "4c1d0eaf45f7e69233bb998f5dfb9a48",
|
||||||
"grade": false,
|
"grade": false,
|
||||||
"grade_id": "cell-d3a742bd87d9c793",
|
"grade_id": "cell-d3a742bd87d9c793",
|
||||||
"locked": false,
|
"locked": false,
|
||||||
@@ -1622,7 +1637,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
@@ -1643,7 +1658,7 @@
 "editable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "11061e79ec06ccb3a9c496319a528366",
+"checksum": "d583b30a1e00960df3a4411b6854c8c8",
 "grade": true,
 "grade_id": "cell-409402df0e801d09",
 "locked": true,
@@ -1654,7 +1669,7 @@
 },
 "outputs": [],
 "source": [
-"assert len(solution()['tuples']) == 209"
+"assert len(solution()['tuples']) == 246"
 ]
 },
 {
@@ -1695,7 +1710,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "9ddd2d1f50f841b889bfd29b175d06da",
+"checksum": "161fd1c6ed06206d4661e4e6c3e255c7",
 "grade": false,
 "grade_id": "cell-9d1ec854eb530235",
 "locked": false,
@@ -1705,7 +1720,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
 "\n",
@@ -1789,7 +1804,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "d18e8b6e1d32aed395a533febb29fcb5",
+"checksum": "16d79b02f510bbfffcb2cc36af159081",
 "grade": false,
 "grade_id": "cell-7ea1f5154cdd8324",
 "locked": false,
@@ -1799,7 +1814,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
@@ -1836,7 +1851,7 @@
 "deletable": false,
 "nbgrader": {
 "cell_type": "code",
-"checksum": "f926fa3a3568d122454a12312859cda1",
+"checksum": "65cece835581895b44257668948a5130",
 "grade": false,
 "grade_id": "cell-b6bee887a1b1fc60",
 "locked": false,
@@ -1846,7 +1861,7 @@
 },
 "outputs": [],
 "source": [
-"%%sparql http://fuseki.gsi.upm.es/sitc/\n",
+"%%sparql $fuseki\n",
 "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n",
 "PREFIX s: <http://learningsparql.com/ns/schema/>\n",
 "PREFIX i: <http://learningsparql.com/ns/instrument/>\n",
@@ -1898,7 +1913,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.10"
+"version": "3.12.2"
 },
 "toc": {
 "base_numbering": 1,
BIN  lod/created-dataset.jpg         (new file, 111 KiB)
BIN  lod/docker-run-jena-fuseki.jpg  (new file, 48 KiB)
BIN  lod/fuseki-running.jpg          (new file, 82 KiB)
BIN  lod/new-dataset.jpg             (new file, 93 KiB)
BIN  lod/query-results.jpg           (new file, 92 KiB)
BIN  lod/query-ui.jpg                (new file, 95 KiB)
BIN  lod/upload-data.jpg             (new file, 91 KiB)
@@ -330,7 +330,7 @@
 "# Saving the resulting axes as ax each time causes the resulting plot to be shown\n",
 "# on top of the previous axes\n",
 "ax = sns.boxplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)\n",
-"ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"gray\")"
+"ax = sns.stripplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, jitter=True, edgecolor=\"auto\")"
 ]
 },
 {
@@ -348,7 +348,7 @@
 "source": [
 "# A violin plot combines the benefits of the previous two plots and simplifies them\n",
 "# Denser regions of the data are fatter, and sparser thinner in a violin plot\n",
-"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df, size=6)"
+"sns.violinplot(x=\"species\", y=\"petal length (cm)\", data=iris_df)"
 ]
 },
 {
@@ -72,7 +72,7 @@
 "Machine learning algorithms are programs that learn a model from a dataset to make predictions or learn structures to organize the data.\n",
 "\n",
 "In scikit-learn, machine learning algorithms take as input a *numpy* array (n_samples, n_features), where\n",
-"* **n_samples**: number of samples. Each sample is an item to process (i.e., classify). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
+"* **n_samples**: number of samples. Each sample is an item to be processed (i.e., classified). A sample can be a document, a picture, a sound, a video, a row in a database or CSV file, or whatever you can describe with a fixed set of quantitative traits.\n",
 "* **n_features**: The number of features or distinct traits that can be used to describe each item quantitatively.\n",
 "\n",
 "The number of features should be defined in advance. A specific type of feature set is high-dimensional (e.g., millions of features), but most values are zero for a given sample. Using (numpy) arrays, all those zero values would also take up memory. For this reason, these feature sets are often represented with sparse matrices (scipy.sparse) instead of (numpy) arrays.\n",
@@ -112,7 +112,7 @@
 "metadata": {},
 "source": [
 "In *unsupervised machine learning models*, the machine learning model algorithm takes as input the feature vectors. It produces a predictive model that is used to fit its parameters to summarize the best regularities found in the data.\n",
 ""
 ]
 },
 {
@@ -140,7 +140,7 @@
 " * **model.fit_transform()**: Some estimators implement this method, which performs a fit and a transform on the same input data.\n",
 "\n",
 "\n",
 ""
 ]
 },
 {
@@ -53,10 +53,10 @@ import matplotlib.pyplot as plt
 
 from sklearn.datasets import load_iris
 from sklearn.tree import DecisionTreeClassifier
+from sklearn.inspection import DecisionBoundaryDisplay
 
 def plot_tree_iris():
     """
 
     Taken from http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html
     """
     # Parameters
@@ -67,11 +67,11 @@ def plot_tree_iris():
     # Load data
     iris = load_iris()
 
-    for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
-                                    [1, 2], [1, 3], [2, 3]]):
+    for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3], [1, 2], [1, 3], [2, 3]]):
         # We only take the two corresponding features
         X = iris.data[:, pair]
         y = iris.target
+        '''
 
         # Shuffle
         idx = np.arange(X.shape[0])
@@ -84,34 +84,38 @@ def plot_tree_iris():
         mean = X.mean(axis=0)
         std = X.std(axis=0)
         X = (X - mean) / std
+        '''
         # Train
-        model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
+        clf = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)
 
         # Plot the decision boundary
-        plt.subplot(2, 3, pairidx + 1)
-        x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
-        y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
-        xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
-                             np.arange(y_min, y_max, plot_step))
-
-        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
-        Z = Z.reshape(xx.shape)
-        cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
-
-        plt.xlabel(iris.feature_names[pair[0]])
-        plt.ylabel(iris.feature_names[pair[1]])
-        plt.axis("tight")
+        # Taken from https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html
+        ax = plt.subplot(2, 3, pairidx + 1)
+        plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
+        DecisionBoundaryDisplay.from_estimator(
+            clf,
+            X,
+            cmap=plt.cm.RdYlBu,
+            response_method="predict",
+            ax=ax,
+            xlabel=iris.feature_names[pair[0]],
+            ylabel=iris.feature_names[pair[1]],
+        )
 
         # Plot the training points
         for i, color in zip(range(n_classes), plot_colors):
-            idx = np.where(y == i)
-            plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
-                        cmap=plt.cm.Paired)
+            idx = np.asarray(y == i).nonzero()
+            plt.scatter(
+                X[idx, 0],
+                X[idx, 1],
+                c=color,
+                label=iris.target_names[i],
+                edgecolor="black",
+                s=15
+            )
         plt.axis("tight")
 
     plt.suptitle("Decision surface of a decision tree using paired features")
-    plt.legend()
+    #plt.legend()
+    plt.legend(bbox_to_anchor=(1.04, 1), loc="upper left")
     plt.show()
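The hunk above replaces the single-argument `np.where(y == i)` with `np.asarray(y == i).nonzero()`, the spelling NumPy's documentation recommends for condition-only index lookup. A minimal sketch of the equivalence, using a hypothetical label array (not the notebook's iris data):

```python
import numpy as np

# Hypothetical class labels
y = np.array([0, 1, 0, 2, 1])

# Both forms return a tuple of index arrays; .nonzero() makes the intent explicit
idx = np.asarray(y == 1).nonzero()
print(idx[0])  # indices where y == 1
```

The returned tuple can be used directly for fancy indexing, e.g. `y[idx]`, just as the patched `plt.scatter(X[idx, 0], ...)` call does.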
@@ -27,17 +27,17 @@
 "source": [
 "# Introduction to Machine Learning II\n",
 " \n",
-"In this lab session, we will go deeper in some aspects that were introduced in the previous session. This time we will delve into a little bit more detail about reading datasets, analyzing data and selecting features. In addition, we will explore the machine learning algorithm SVM in a binary classification problem provided by the Titanic dataset.\n",
+"In this lab session, we will go deeper into the aspects introduced in the previous session. This time, we will delve a bit more deeply into reading datasets, analyzing data, and selecting features. In addition, we will explore the SVM machine learning algorithm on a binary classification problem using the Titanic dataset.\n",
 "\n",
 "# Objectives\n",
 "\n",
-"In this lecture we are going to introduce some more details about machine learning aspects. \n",
+"In this lecture, we will introduce more details about machine learning. \n",
 "\n",
 "The main objectives of this session are:\n",
 "* Learn how to read data from a file or URL with pandas\n",
 "* Learn how to use the pandas DataFrame data structure\n",
 "* Learn how to select features\n",
-"* Understand better and SVM machine learning algorithm"
+"* Understand better the SVM machine learning algorithm"
 ]
 },
 {
@@ -82,7 +82,7 @@
 "metadata": {},
 "source": [
 "## Licence\n",
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]
@@ -104,7 +104,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.12"
+"version": "3.12.2"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,
@@ -125,5 +125,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
 }
@@ -235,7 +235,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -249,7 +249,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.1"
+"version": "3.12.2"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,
@@ -270,5 +270,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
 }
@@ -150,7 +150,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Series with population in 2015 of more populated cities in Spain\n",
+"# Series with population in 2015 of most populous cities in Spain\n",
 "s = Series([3141991, 1604555, 786189, 693878, 664953, 569130], index=['Madrid', 'Barcelona', 'Valencia', 'Sevilla', \n",
 "            'Zaragoza', 'Malaga'])\n",
 "s"
@@ -177,7 +177,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Until now, we have not seen any advantage in using Panda Series. we are going to show now some examples of their possibilities."
+"Until now, we have not seen any advantage in using Panda Series. We are going to show some examples of their possibilities."
 ]
 },
 {
@@ -204,7 +204,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Observe that (s > 1000000) returns a Series object. We can use this boolean vector as a filter to get a *slice* of the original series that contains only the elements where the value of the filter is True. The original Series s is not modified. This selection is called *boolean indexing*."
+"Observe that (s > 1000000) returns a Series object. We can use this Boolean vector as a filter to obtain a slice of the original series containing only the elements where the filter evaluates to True. The original Series s is not modified. This selection is called *boolean indexing*."
 ]
 },
 {
@@ -323,7 +323,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"We can also change values directly or based on a condition. You can consult additional feautures in the manual."
+"We can also change values directly or based on a condition. You can consult additional features in the manual."
 ]
 },
 {
@@ -344,7 +344,8 @@
 "outputs": [],
 "source": [
 "# Increase by 10% cities with population greater than 700000\n",
-"s[s > 700000] = 1.1 * s[s > 700000]\n",
+"# Since pandas 3.X use where\n",
+"s = s.where(s <= 700000, s * 1.1)\n",
 "s"
 ]
 },
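The hunk above swaps boolean-mask assignment for `Series.where`, which keeps values where the condition is True and substitutes the second argument elsewhere. A minimal sketch with a hypothetical three-city series (not the notebook's full data):

```python
import pandas as pd

# Hypothetical populations, mirroring the example in the patched cell
s = pd.Series([3141991, 786189, 569130], index=["Madrid", "Valencia", "Malaga"])

# Keep values <= 700000; replace the rest with the scaled value s * 1.1
s = s.where(s <= 700000, s * 1.1)
print(s)
```

The condition selects what to *keep*, so it is the negation of the original mask `s > 700000`; the result is a new Series, which suits the assignment style the commit adopts.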
@@ -449,7 +450,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
+"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
 "\n",
 "© Carlos A. Iglesias, Universidad Politécnica de Madrid."
 ]
@@ -471,7 +472,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.8.12"
+"version": "3.12.2"
 },
 "latex_envs": {
 "LaTeX_envs_menu_present": true,
@@ -492,5 +493,5 @@
 }
 },
 "nbformat": 4,
-"nbformat_minor": 1
+"nbformat_minor": 4
 }
@@ -61,19 +61,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is loosely the process of manually converting or mapping data from one \"raw\" form (*datos en bruto*) into another format that allows for more convenient consumption of the data with the help of semi-automated tools.\n",
+"[**Data munging**](https://en.wikipedia.org/wiki/Data_wrangling) or data wrangling is the process of manually converting or mapping data from one \"raw\" form (datos en bruto) into another format that makes the data more easily consumable with the help of semi-automated tools.\n",
 "\n",
-"*Scikit-learn* estimators which assume that all values are numerical. This is a common in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
+"*Scikit-learn* estimators, which assume that all values are numerical. This is a common feature in many machine learning libraries. So, we need to preprocess our raw dataset. \n",
 "Some of the most common tasks are:\n",
-"* Remove samples with missing values or replace the missing values with a value (median, mean or interpolation)\n",
+"* Remove samples with missing values or replace the missing values with a value (median, mean, or interpolation)\n",
 "* Encode categorical variables as integers\n",
 "* Combine datasets\n",
 "* Rename variables and convert types\n",
 "* Transform / scale variables\n",
 "\n",
-"We are going to play again with the Titanic dataset to practice with Pandas Dataframes and introduce a number of preprocessing facilities of scikit-learn.\n",
+"We are going to play again with the Titanic dataset to practice with Pandas DataFrames and to introduce a number of preprocessing facilities from scikit-learn.\n",
 "\n",
-"First we load the dataset and we get a dataframe."
+"First, we load the dataset, and we get a dataframe."
 ]
 },
 {
@@ -281,14 +281,13 @@
 "source": [
 "DataFrames provide a set of functions for selection that we will need later\n",
 "\n",
-"\n",
-"|Operation | Syntax | Result |\n",
-"|-----------------------------|\n",
-"|Select column | df[col] | Series |\n",
-"|Select row by label | df.loc[label] | Series |\n",
-"|Select row by integer location | df.iloc[loc] | Series |\n",
-"|Slice rows\t | df[5:10]\t | DataFrame |\n",
-"|Select rows by boolean vector | df[bool_vec] | DataFrame |"
+"| Operation | Syntax | Result |\n",
+"|--------------------------------|--------------|----------|\n",
+"| Select column | `df[col]` | Series |\n",
+"| Select row by label | `df.loc[label]` | Series |\n",
+"| Select row by integer location | `df.iloc[loc]` | Series |\n",
+"| Slice rows | `df[5:10]` | DataFrame |\n",
+"| Select rows by boolean vector | `df[bool_vec]` | DataFrame |"
 ]
 },
 {
@@ -451,7 +450,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Pivot tables are an intuitive way to analyze data, and an alternative to group columns.\n",
+"Pivot tables are an intuitive way to analyze data and an alternative to grouping columns.\n",
 "\n",
 "This command makes a table with rows Sex and columns Pclass, and\n",
 "averages the result of the column Survived, thereby giving the percentage of survivors in each grouping."
@@ -581,7 +580,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"In this case there not duplicates. In case we would needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
+"In this case, there are no duplicates. In case we needed, we could have removed them with [*df.drop_duplicates()*](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html), which can receive a list of columns to be considered for identifying duplicates (otherwise, it uses all the columns)."
 ]
 },
 {
@@ -597,7 +596,7 @@
 "source": [
 "Here we check how many null values there are.\n",
 "\n",
-"We use sum() instead of count() or we would get the total number of records). Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
+"We use sum() instead of count() to get the total number of records. Notice how we do not use size() now, either. You can print 'df.isnull()' and will see a DataFrame with boolean values."
 ]
 },
 {
@@ -654,13 +653,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Observe that the Passenger with 889 has now an Agent of 28 (median) instead of NaN. \n",
+"Observe that the Passenger with 889 now has an Agent of 28 (median) instead of NaN. \n",
 "\n",
 "Regarding the column *cabins*, there are still NaN values, since the *Cabin* column is not numeric. We will see later how to change it.\n",
 "\n",
-"In addition, we could drop rows with any or all null values (method *dropna()*).\n",
-"\n",
-"If we want to modify directly the *df* object, we should add the parameter *inplace* with value *True*."
+"In addition, we could drop rows with any or all null values (method *dropna()*)."
 ]
 },
 {
@@ -669,7 +666,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
+"df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
 "df[-5:]"
 ]
 },
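The hunk above (like several that follow) drops `inplace=True` in favour of reassignment, since mutating a column through a temporary Series no longer works reliably under pandas Copy-on-Write. A minimal sketch with a hypothetical age column (not the Titanic data itself):

```python
import pandas as pd

# Hypothetical column with one missing value
age = pd.Series([22.0, 38.0, None, 35.0])

# fillna returns a new Series; reassigning replaces the deprecated
# age.fillna(..., inplace=True) style
age = age.fillna(age.mean())
print(age.tolist())
```

`Series.mean()` skips NaN by default, so the missing entry is filled with the mean of the three observed values.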
@@ -762,7 +759,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"df['Sex'].fillna('male', inplace=True)\n",
+"df['Sex'] = df['Sex'].fillna('male')\n",
 "df[-5:]"
 ]
 },
@@ -794,7 +791,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As we saw, we have several non numerical columns: **Name**, **Sex**, **Ticket**, **Cabin** and **Embarked**.\n",
+"As we saw, we have several non-numerical columns: **Name**, **Sex**, **Ticket**, **Cabin**, and **Embarked**.\n",
 "\n",
 "**Name** and **Ticket** do not seem informative.\n",
 "\n",
@@ -811,7 +808,7 @@
 "source": [
 "# We remove Cabin and Ticket. We should specify the axis\n",
 "# Use axis 0 for dropping rows and axis 1 for dropping columns\n",
-"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
+"df = df.drop(['Cabin', 'Ticket'], axis=1)\n",
 "df[-5:]"
 ]
 },
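The same pattern applies to `DataFrame.drop`: it returns a new frame, so the commit reassigns instead of using `inplace=True`. A minimal sketch with a hypothetical two-row frame (not the real Titanic columns):

```python
import pandas as pd

# Hypothetical frame with two columns we want to discard
df = pd.DataFrame({"A": [1, 2], "Cabin": ["C1", None], "Ticket": ["t1", "t2"]})

# axis=1 drops columns; the result is a new DataFrame
df = df.drop(["Cabin", "Ticket"], axis=1)
print(df.columns.tolist())
```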
@@ -862,8 +859,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
-"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
+"df[\"Sex\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
 "df[-5:]"
 ]
 },
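The hunk above collapses two conditional `.loc` assignments into one `Series.map` call, which both avoids dtype-mixing during the intermediate step and handles all categories in a single pass. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical categorical column
sex = pd.Series(["male", "female", "male"])

# One dictionary lookup per element; values absent from the dict become NaN
sex = sex.map({"male": 0, "female": 1})
print(sex.tolist())
```

The same idiom covers the three-category `Embarked` column patched further below (`{"S": 0, "C": 1, "Q": 2}`).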
@@ -927,7 +923,7 @@
 "outputs": [],
 "source": [
 "#Replace nulls with the most common value\n",
-"df['Embarked'].fillna('S', inplace=True)\n",
+"df['Embarked'] = df['Embarked'].fillna('S')\n",
 "df['Embarked'].isnull().any()"
 ]
 },
@@ -937,10 +933,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"# Now we replace as previosly the categories with integers\n",
-"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
-"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
-"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
+"# Now we replace, as previously, the categories with integers\n",
+"df[\"Embarked\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
 "df[-5:]"
 ]
 },
@@ -948,9 +942,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Although this transformation can be ok, we are introducing *an error*. Some classifiers could think that there is an order in S, C, Q, and that Q is higher than S. \n",
|
"Although this transformation may be acceptable, we are introducing an error. Some classifiers could think that there is an order in S, C, and Q, and that Q is higher than S. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"To avoid this error, Scikit learn provides a facility for transforming all the categorical features into integer ones. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0 and Q=0.\n",
|
"To avoid this error, scikit-learn provides a facility for transforming all categorical features into integer features. In fact, it creates a new dummy binary feature per category. This means, in this case, Embarked=S would be represented as S=1, C=0, and Q=0.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
|
"We will learn how to do this in the next notebook. More details can be found in the [Scikit-learn documentation](http://scikit-learn.org/stable/modules/preprocessing.html)."
|
||||||
]
|
]
|
||||||
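The dummy-variable encoding described above (Embarked=S becoming S=1, C=0, Q=0) can be sketched with `pd.get_dummies` on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"Embarked": ["S", "C", "Q"]})

# One binary column per category; columns are named <col>_<value>
dummies = pd.get_dummies(df, columns=["Embarked"])

print(dummies.columns.tolist())  # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
```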
@@ -1007,7 +1001,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.11.5"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -1028,5 +1022,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -59,7 +59,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"# Introduction: preprocessing"
|
"# Introduction: preprocessing."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -68,7 +68,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
|
"In the previous session, we introduced two libraries for visualisation: *matplotlib* and *seaborn*. We are going to review new functionalities in this notebook, as well as the integration of *pandas* with *matplotlib*.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Visualisation is usually combined with munging. We have done this in separated notebooks for learning purposes. We we are going to examine again the dataset, combinging both techniques, and applying the knowledge we got in the previous notebook."
|
"Visualisation is usually combined with munging. We have done this in separate notebooks for learning purposes. We are going to examine the dataset again, combining both techniques, and applying the knowledge we got in the previous notebook."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -93,11 +93,11 @@
|
|||||||
" * 'hexbin' for hexagonal bin plots\n",
|
" * 'hexbin' for hexagonal bin plots\n",
|
||||||
" * 'pie' for pie charts\n",
|
" * 'pie' for pie charts\n",
|
||||||
" \n",
|
" \n",
|
||||||
"Every plot kind has an equivalent on Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of parameters.\n",
|
"Every plot kind has an equivalent on the Dataframe.plot accessor. This means, you can use **df.plot(kind='line')** or **df.plot.line**. Check the [plot documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html#pandas.DataFrame.plot) to learn the rest of the parameters.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"In addition, the module *pandas.tools.plotting* provides: **scatter_matrix**.\n",
|
"In addition, the module *pandas.tools.plotting* provides: **scatter_matrix**.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"You can consult more details in the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html)."
|
"You can consult the [documentation](http://pandas.pydata.org/pandas-docs/stable/visualization.html) for more details."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
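The equivalence between the keyword form and the accessor form mentioned above can be sketched as follows (the headless Agg backend is an assumption so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no window required
import pandas as pd

df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [10, 12, 9]})

# These two calls build the same line chart
ax1 = df.plot(kind="line", x="year", y="sales")
ax2 = df.plot.line(x="year", y="sales")
```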
{
|
{
|
||||||
@@ -135,7 +135,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"#We get a URL with raw content (not HTML one)\n",
|
"#We get a URL with raw content (not an HTML one)\n",
|
||||||
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
|
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
|
||||||
"df = pd.read_csv(url)\n",
|
"df = pd.read_csv(url)\n",
|
||||||
"df_original = df.copy() # Copy to have a version of df without modifications\n",
|
"df_original = df.copy() # Copy to have a version of df without modifications\n",
|
||||||
@@ -151,12 +151,9 @@
|
|||||||
"# Cleaning\n",
|
"# Cleaning\n",
|
||||||
"df_clean = df.copy() # We copy to see what happens with na values\n",
|
"df_clean = df.copy() # We copy to see what happens with na values\n",
|
||||||
"df_clean['Age'] = df['Age'].fillna(df['Age'].median())\n",
|
"df_clean['Age'] = df['Age'].fillna(df['Age'].median())\n",
|
||||||
"df_clean.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
|
"df_clean['Sex'] = df_clean['Sex'].map({'male': 0, 'female': 1})\n",
|
||||||
"df_clean.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
|
"df_clean = df_clean.drop(['Cabin', 'Ticket'], axis=1)\n",
|
||||||
"df_clean.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
|
"df_clean['Embarked'] = df_clean['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
|
||||||
"df_clean.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
|
|
||||||
"df_clean.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
|
|
||||||
"df_clean.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
|
|
||||||
"df_clean.head()"
|
"df_clean.head()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -171,7 +168,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In the previous session we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
|
"In the previous session, we saw that *Seaborn* provides several facilities for working with DataFrames. We are going to review some of them."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -249,8 +246,8 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# General description of relationship between variables uwing Seaborn PairGrid\n",
|
"# General description of the relationship between variables using Seaborn PairGrid\n",
|
||||||
"# We use df_clean, since the null values of df would gives us an error, you can check it.\n",
|
"# We use df_clean because the null values in df would cause an error; you can check it.\n",
|
||||||
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
|
"g = sns.PairGrid(df_clean, hue=\"Survived\")\n",
|
||||||
"g.map(sns.scatterplot)\n",
|
"g.map(sns.scatterplot)\n",
|
||||||
"g.add_legend()"
|
"g.add_legend()"
|
||||||
@@ -260,7 +257,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"There are two many variables, we are going to represent only a subset."
|
"There are too many variables, so we are going to represent only a subset."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -280,7 +277,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We can observe, for example, that more women survived as well as more people in 3rd class. \n",
|
"We can observe, for example, that more women and more people in the 3rd class survived. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"We can represent these findings."
|
"We can represent these findings."
|
||||||
]
|
]
|
||||||
@@ -319,7 +316,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We saw that there are 177 missing values of age. We are going this feature with more detail."
|
"We saw that there are 177 missing values of age. We are going to examine this feature in more detail."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -337,9 +334,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see the histogram is slightly *right skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
|
"We see the histogram is slightly *right-skewed* (*sesgada a la derecha*), so we will replace null values with the median instead of the mean.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"In case we have a significant *skewed distribution*, the extreme values in the long tail can have a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness.Taking the natural logarithm or the square root of each point are two simple transformations. "
|
"If we have a significantly skewed distribution, extreme values in the long tail can exert a disproportionately large influence on our model. So, it can be good to transform the variable before building our model to reduce skewness. Taking the natural logarithm or the square root of each point is a simple transformation. "
|
||||||
]
|
]
|
||||||
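The log transform suggested above for skewed variables can be sketched on an illustrative right-skewed series (the values are made up, not real fares):

```python
import numpy as np
import pandas as pd

# An illustrative right-skewed series with one extreme value
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.3, 512.3])

log_fare = np.log1p(fare)  # log(1 + x), well-defined at zero

# The transform pulls in the long right tail, reducing skewness
print(fare.skew() > log_fare.skew())  # True
```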
},
|
},
|
||||||
{
|
{
|
||||||
@@ -391,7 +388,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We observe that non survived is left skewed. Most children survived."
|
"We observe that the distribution of non-survivors is left-skewed. Most children survived."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -410,7 +407,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# We can observe the detail for children\n",
|
"# We can observe the details for children\n",
|
||||||
"df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)"
|
"df[df.Age < 20].hist(column='Age', by='Survived', sharey=True)"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -428,9 +425,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"There were null values, we will recap at the end of this notebook how to manage them.\n",
|
"There were null values; we will recap at the end of this notebook how to manage them.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"We are going now to see the distribution of passengers younger than 20 that survived."
|
"We are now going to see the distribution of passengers younger than 20 who survived."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -448,7 +445,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# Passengers older than 25 that survived grouped by Sex\n",
|
"# Passengers younger than 20 who survived, grouped by Sex and Pclass\n",
|
||||||
"\n",
|
"\n",
|
||||||
"df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')"
|
"df.query('Age < 20 and Survived == 1').groupby(['Sex','Pclass']).size().plot(kind='bar')"
|
||||||
]
|
]
|
||||||
@@ -586,7 +583,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# How many passergers survived by sex\n",
|
"# How many passengers survived by sex\n",
|
||||||
"df.groupby('Sex')['Survived'].sum()"
|
"df.groupby('Sex')['Survived'].sum()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -596,7 +593,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# How many passergers survived by sex\n",
|
"# Proportion of passengers who survived, by sex\n",
|
||||||
"df.groupby('Sex')['Survived'].mean()"
|
"df.groupby('Sex')['Survived'].mean()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -604,7 +601,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see that 74% of female survived, while only 18% of male survived."
|
"We see that 74% of females survived, while only 18% of males survived."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
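The difference between the two groupby cells above (counts versus rates) can be sketched on a toy frame (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

counts = df.groupby("Sex")["Survived"].sum()   # survivors per group
rates = df.groupby("Sex")["Survived"].mean()   # survival rate per group

print(int(counts["female"]))  # 2
```

Because `Survived` is 0/1, the mean is directly the proportion of survivors in each group.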
{
|
{
|
||||||
@@ -615,7 +612,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"#Graphical representation\n",
|
"#Graphical representation\n",
|
||||||
"# You can add the parameter estimator to change the estimator. (e.g. estimator=np.median)\n",
|
"# You can add the parameter estimator to change the estimator. (e.g. estimator=np.median)\n",
|
||||||
"# For example, estimator=np.size is you get the same chart than with countplot\n",
|
"# For example, with estimator=np.size you get the same chart as with countplot\n",
|
||||||
"#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)\n",
|
"#sns.barplot(x='Sex', y='Survived', data=df, estimator=np.size)\n",
|
||||||
"sns.barplot(x='Sex', y='Survived', data=df)"
|
"sns.barplot(x='Sex', y='Survived', data=df)"
|
||||||
]
|
]
|
||||||
@@ -624,7 +621,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We can see now if men and women follow the same age distribution."
|
"We can see now whether men and women follow the same age distribution."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -771,7 +768,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see the distribution is right sweked. We are going to detect outliers using a box plot"
|
"We see the distribution is right-skewed. We are going to detect outliers using a box plot."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -790,7 +787,7 @@
|
|||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"# We can see the same with matplotlib.\n",
|
"# We can see the same with matplotlib.\n",
|
||||||
"# There is a bug and if you import seaborn, you should add 'sym='k.' to show the outliers\n",
|
"# There is a bug, and if you import seaborn, you should add sym='k.' to show the outliers\n",
|
||||||
"df.boxplot(column='Fare', return_type='axes', sym='k.')"
|
"df.boxplot(column='Fare', return_type='axes', sym='k.')"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -814,7 +811,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see that most outliers are in class 1. In particular, we see some values higher thatn 500 that should be an error."
|
"We see that most outliers are in class 1. In particular, we see some values higher than 500 that should be an error."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
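The box-plot outliers discussed above follow the usual 1.5×IQR fence; a minimal sketch on illustrative values:

```python
import pandas as pd

fare = pd.Series([7.25, 8.05, 13.0, 26.0, 71.3, 512.3])  # illustrative values

# Box-plot rule: flag points beyond Q3 + 1.5 * IQR
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
outliers = fare[fare > q3 + 1.5 * (q3 - q1)]

print(outliers.tolist())  # [512.3]
```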
{
|
{
|
||||||
@@ -902,9 +899,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Since there are missing values, we will replace them by the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
|
"Since there are missing values, we will replace them with the most popular value ('S'), and we will also encode it since it is a categorical variable.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"We can see if this has impact on its survival."
|
"We can see if this has an impact on survival."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -953,7 +950,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We have to fill null values (2 null values) and encode this variable, since it is categorical. We will do it after reviewing the rest of features."
|
"We have to fill in the null values (2 nulls) and encode this variable, since it is categorical. We will do it after reviewing the rest of the features."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -995,7 +992,7 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"We can see that most passengers traveled without siblings or spouses. \n",
|
"We can see that most passengers traveled without siblings or spouses. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"We analyse if this had impact on its survival."
|
"We analyse whether this had an impact on survival."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1020,7 +1017,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see that it does not provide too much information. While the survival mean of all passengers is 38%, passengers with 0 SibSp has 34% of probability. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender"
|
"We see that it does not provide too much information. While the survival rate for all passengers is 38%, passengers with 0 SibSp have a 34% survival rate. Surprisingly, passengers with 1 sibling or spouse have a higher probability, 53%. We are going to see the distribution by gender."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1061,7 +1058,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We observe that when SibSp > 2, the survival probability decreases to the half. We are going to check if there is a difference in the age. "
|
"We observe that when SibSp > 2, the survival probability decreases to half. We are going to check if there is a difference in the age. "
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1150,7 +1147,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often all die or evacuate together, so it is expected that it will also have an impact on our model."
|
"The feature Parch (Parents-Children Aboard) is somewhat related to the previous one, since it reflects family ties. It is well known that in emergencies, family groups often die or evacuate together, so it is expected that this will also affect our model."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1204,7 +1201,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We see the probability of surviving is higher in 2 and 3. Sincethere were too few rows for Parch >= 3, this part is not relevant."
|
"We see the probability of surviving is higher in 2 and 3. Since there were too few rows for Parch >= 3, this part is not relevant."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1229,7 +1226,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We observe that Parch has an important impact for men in first and second class. We are going to check the age."
|
"We observe that Parch has an important impact on men in first and second class. We are going to check the age."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1261,7 +1258,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We observe that there is a significant difference, so we suspect that this feature has impact of men in first and second class."
|
"We observe that there is a significant difference, so we suspect that this feature has an impact on men in first and second class."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1275,7 +1272,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Feature Age: null values"
|
"## Feature Age: null values."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1337,7 +1334,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Feature Embarking: null values"
|
"## Feature Embarking: null values."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1360,7 +1357,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"As we discussed previously, we will replace these missing values by the most popular one (mode): S."
|
"As we discussed previously, we will replace these missing values with the most popular one (mode): S."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1370,7 +1367,7 @@
|
|||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"#Replace nulls with the most common value\n",
|
"#Replace nulls with the most common value\n",
|
||||||
"df['Embarked'].fillna('S', inplace=True)\n",
|
"df['Embarked'] = df['Embarked'].fillna('S')\n",
|
||||||
"df['Embarked'].isnull().any()"
|
"df['Embarked'].isnull().any()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -1378,14 +1375,14 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Feature Cabin: null values"
|
"## Feature Cabin: null values."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We are going to analyse Cabin in the exercise"
|
"We are going to analyse Cabin in the exercise."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1399,14 +1396,14 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Recap: encoding categorical features"
|
"## Recap: encoding categorical features."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In the previous notebook we saw how to encode categorical features. We are going to explore an alternative way."
|
"In the previous notebook, we saw how to encode categorical features. We are going to explore an alternative way."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -1418,13 +1415,10 @@
|
|||||||
"#df = df_original.copy()\n",
|
"#df = df_original.copy()\n",
|
||||||
"#df['SexEncoded'] = df.Sex\n",
|
"#df['SexEncoded'] = df.Sex\n",
|
||||||
"#\n",
|
"#\n",
|
||||||
"#df.loc[df[\"SexEncoded\"] == 'male', \"SexEncoded\"] = 0\n",
|
"#df[\"SexEncoded\"] = df[\"Sex\"].map({\"male\": 0, \"female\": 1})\n",
|
||||||
"#df.loc[df[\"SexEncoded\"] == \"female\", \"SexEncoded\"] = 1\n",
|
|
||||||
"#\n",
|
"#\n",
|
||||||
"#df['EmbarkedEncoded'] = df.Embarked\n",
|
"#df['EmbarkedEncoded'] = df.Embarked\n",
|
||||||
"#df.loc[df[\"EmbarkedEncoded\"] == \"S\", \"EmbarkedEncoded\"] = 0\n",
|
"#df[\"EmbarkedEncoded\"] = df[\"Embarked\"].map({\"S\": 0, \"C\": 1, \"Q\": 2})\n",
|
||||||
"#df.loc[df[\"EmbarkedEncoded\"] == \"C\", \"EmbarkedEncoded\"] = 1\n",
|
|
||||||
"#df.loc[df[\"EmbarkedEncoded\"] == \"Q\", \"EmbarkedEncoded\"] = 2\n",
|
|
||||||
"#df.head()"
|
"#df.head()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -1432,16 +1426,16 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"## Encoding Categorical Variables as Binary ones"
|
"## Encoding Categorical Variables as Binary Ones"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"As we see previously, translating categorical variables into integer can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable, and we can consider there exists an order in *Pclass*.\n",
|
"As we saw previously, translating categorical variables into integers can introduce an order. In our case, this is not a problem, since *Sex* is a binary variable and we can assume an order in *Pclass*.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Nevertheless, we are going to introduce a general approach to encode categorical variables using some facilities provided by scikit-learn."
|
"Nevertheless, we will introduce a general approach to encoding categorical variables using facilities provided by scikit-learn."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
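The `LabelEncoder` step used in the cell below can be sketched in isolation (the toy column is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# LabelEncoder assigns integers to classes sorted alphabetically:
# 'female' -> 0, 'male' -> 1
enc = LabelEncoder()
df["Sex"] = enc.fit_transform(df["Sex"])

print(df["Sex"].tolist())  # [1, 0, 0, 1]
```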
{
|
{
|
||||||
@@ -1461,8 +1455,8 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"df = df_original.copy() # take original df\n",
|
"df = df_original.copy() # take original df\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# We define here the categorical columns have non integer values, so we need to convert them\n",
|
"# The categorical columns defined here have non-integer values, so we need to convert them\n",
|
||||||
"# into integers first with LabelEncoder. This can be omitted if the are already integers.\n",
|
"# into integers first with LabelEncoder. This can be omitted if they are already integers.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"label_enc = LabelEncoder()\n",
|
"label_enc = LabelEncoder()\n",
|
||||||
"label_sex = label_enc.fit_transform(df['Sex'])\n",
|
"label_sex = label_enc.fit_transform(df['Sex'])\n",
|
||||||
@@ -1489,7 +1483,8 @@
|
|||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"#Remove nulls\n",
|
"#Remove nulls\n",
|
||||||
"df['Embarked'].fillna('S', inplace=True)\n",
|
"# Fill missing values in 'Embarked'\n",
|
||||||
|
"df['Embarked'] = df['Embarked'].fillna('S')\n",
|
||||||
"df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])\n",
|
"df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])\n",
|
||||||
"df.head()"
|
"df.head()"
|
||||||
]
|
]
|
||||||
@@ -1514,7 +1509,8 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"df.drop(['Cabin', 'Ticket'], axis=1, inplace=True)\n",
|
"# Drop unwanted columns\n",
|
||||||
|
"df = df.drop(columns=['Cabin', 'Ticket'])\n",
|
||||||
"df.head()"
|
"df.head()"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
@@ -1557,7 +1553,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||||
"\n",
|
"\n",
|
||||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||||
]
|
]
|
||||||
@@ -1579,7 +1575,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.8.12"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -1600,5 +1596,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -151,7 +151,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"How many passsengers have survived? List them grouped by Sex and Pclass.\n",
|
"How many passengers have survived? List them grouped by Sex and Pclass.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Assign the result to a variable df_1 and print it"
|
"Assign the result to a variable df_1 and print it"
|
||||||
]
|
]
|
||||||
@@ -448,7 +448,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Since age and class are both numbers we can just multiply them and get a new feature.\n"
|
"Since age and class are both numbers, we can simply multiply them to obtain a new feature.\n"
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -471,7 +471,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||||
"\n",
|
"\n",
|
||||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||||
]
|
]
|
||||||
@@ -502,7 +502,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.8.12"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -523,5 +523,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -41,22 +41,22 @@
|
|||||||
"source": [
|
"source": [
|
||||||
"In the previous session, we learnt how to apply machine learning algorithms to the Iris dataset.\n",
|
"In the previous session, we learnt how to apply machine learning algorithms to the Iris dataset.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"We are going now to review the full process. As probably you have notice, data preparation, cleaning and transformation takes more than 90 % of data mining effort.\n",
|
"We are going to review the full process now. As you probably have noticed, data preparation, cleaning, and transformation account for more than 90% of the data mining effort.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"The phases are:\n",
|
"The phases are:\n",
|
||||||
"\n",
|
"\n",
|
||||||
"* **Data ingestion**: reading the data from the data lake\n",
|
"* **Data ingestion**: reading the data from the data lake\n",
|
||||||
"* **Preprocessing**: \n",
|
"* **Preprocessing**: \n",
|
||||||
" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outlier, and resolve inconsistencies \n",
|
" * **Data cleaning (munging)**: fill missing values, smooth noisy data (binning methods), identify or remove outliers, and resolve inconsistencies \n",
|
||||||
" * **Data integration**: Integrate multiple datasets\n",
|
" * **Data integration**: Integrate multiple datasets\n",
|
||||||
" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have mean of 0 and std of 1), transformation for smoothing a variable (e.g. square toot, ...), aggregation of data from several datasets\n",
|
" * **Data transformation**: normalization (rescale numeric values between 0 and 1), standardisation (rescale values to have a mean of 0 and std of 1), transformation for smoothing a variable (e.g., square root, ...), aggregation of data from several datasets\n",
|
||||||
" * **Data reduction**: dimensionality reduction, clustering and sampling. \n",
|
" * **Data reduction**: dimensionality reduction, clustering, and sampling. \n",
|
||||||
" * **Data discretization**: for numerical values and algorithms that do not accept continuous variables\n",
|
" * **Data discretization**: for numerical values and algorithms that do not accept continuous variables\n",
|
||||||
" * **Feature engineering**: selection of most relevant features, creation of new features and delete non relevant features\n",
|
" * **Feature engineering**: selection of the most relevant features, creation of new features, and deletion of non-relevant features\n",
|
||||||
" * Apply Sampling for dividing the dataset into training and test datasets.\n",
|
" * Apply Sampling for dividing the dataset into training and test datasets.\n",
|
||||||
"* **Machine learning**: apply machine learning algorithms and obtain an estimator, tuning its parameters.\n",
|
"* **Machine learning**: apply machine learning algorithms and obtain an estimator, tuning its parameters.\n",
|
||||||
"* **Evaluation** of the model\n",
|
"* **Evaluation** of the model\n",
|
||||||
"* **Prediction**: use the model for new data."
|
"* **Prediction**: Use the model for new data."
|
||||||
]
|
]
|
||||||
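The phases listed above can be sketched end-to-end with scikit-learn. This is a minimal illustration on the Iris dataset mentioned in the text; the pipeline steps and parameter values are our own choices, not part of the notebook:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Data ingestion
X, y = load_iris(return_X_y=True)

# Sampling: divide the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Data transformation (standardization) + machine learning in one pipeline
model = Pipeline([
    ('scale', StandardScaler()),  # rescale to mean 0, std 1
    ('svc', SVC()),
])
model.fit(X_train, y_train)

# Evaluation of the model on held-out data
score = model.score(X_test, y_test)
print(score)
```

Bundling the scaler and the estimator in a `Pipeline` ensures the same transformation fitted on the training set is applied to the test set, avoiding data leakage.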
},
|
},
|
||||||
{
|
{
|
||||||
@@ -92,7 +92,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||||
"\n",
|
"\n",
|
||||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||||
]
|
]
|
||||||
@@ -114,7 +114,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.8.12"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -135,5 +135,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -39,9 +39,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In this notebook we are going to train a classifier with the preprocessed Titanic dataset. \n",
|
"In this notebook, we will train a classifier on the preprocessed Titanic dataset. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"We are going to use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
|
"We will use the dataset we obtained in the [pandas munging notebook](3_3_Data_Munging_with_Pandas.ipynb) for simplicity. You can try some of the techniques learnt in the previous notebook."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -63,7 +63,7 @@
|
|||||||
"\n",
|
"\n",
|
||||||
"from pandas import Series, DataFrame\n",
|
"from pandas import Series, DataFrame\n",
|
||||||
"\n",
|
"\n",
|
||||||
"# Training and test spliting\n",
|
"# Training and test splitting\n",
|
||||||
"from sklearn.model_selection import train_test_split\n",
|
"from sklearn.model_selection import train_test_split\n",
|
||||||
"from sklearn import preprocessing\n",
|
"from sklearn import preprocessing\n",
|
||||||
"\n",
|
"\n",
|
||||||
@@ -100,29 +100,33 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"#We get a URL with raw content (not HTML one)\n",
|
"# We get a URL with raw content (not an HTML one)\n",
|
||||||
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
|
"url=\"https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv\"\n",
|
||||||
"df = pd.read_csv(url)\n",
|
"df = pd.read_csv(url)\n",
|
||||||
"df.head()\n",
|
"df.head()\n",
|
||||||
"\n",
|
"\n",
|
||||||
"\n",
|
"\n",
|
||||||
"#Fill missing values\n",
|
"# --- Fill missing values ---\n",
|
||||||
"df['Age'].fillna(df['Age'].mean(), inplace=True)\n",
|
"## Age: fill with mean\n",
|
||||||
"df['Sex'].fillna('male', inplace=True)\n",
|
"df['Age'] = df['Age'].fillna(df['Age'].mean())\n",
|
||||||
"df['Embarked'].fillna('S', inplace=True)\n",
|
|
||||||
"\n",
|
"\n",
|
||||||
"# Encode categorical variables\n",
|
"## Sex: fill missing with 'male'\n",
|
||||||
"df['Age'] = df['Age'].fillna(df['Age'].median())\n",
|
"df['Sex'] = df['Sex'].fillna('male')\n",
|
||||||
"df.loc[df[\"Sex\"] == \"male\", \"Sex\"] = 0\n",
|
|
||||||
"df.loc[df[\"Sex\"] == \"female\", \"Sex\"] = 1\n",
|
|
||||||
"df.loc[df[\"Embarked\"] == \"S\", \"Embarked\"] = 0\n",
|
|
||||||
"df.loc[df[\"Embarked\"] == \"C\", \"Embarked\"] = 1\n",
|
|
||||||
"df.loc[df[\"Embarked\"] == \"Q\", \"Embarked\"] = 2\n",
|
|
||||||
"\n",
|
"\n",
|
||||||
"# Drop colums\n",
|
"## Embarked: fill missing with 'S'\n",
|
||||||
"df.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)\n",
|
"df['Embarked'] = df['Embarked'].fillna('S')\n",
|
||||||
"\n",
|
"\n",
|
||||||
"#Show proprocessed df\n",
|
"# --- Encode categorical variables ---\n",
|
||||||
|
"# Sex: male=0, female=1\n",
|
||||||
|
"df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})\n",
|
||||||
|
"\n",
|
||||||
|
"# Embarked: S=0, C=1, Q=2\n",
|
||||||
|
"df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})\n",
|
||||||
|
"\n",
|
||||||
|
"# --- Drop unnecessary columns ---\n",
|
||||||
|
"df = df.drop(columns=['Cabin', 'Ticket', 'Name'])\n",
|
||||||
|
"\n",
|
||||||
|
"# Show preprocessed df\n",
|
||||||
"df.head()"
|
"df.head()"
|
||||||
]
|
]
|
||||||
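The filling and encoding steps in the cell above can be checked on a tiny synthetic frame (the column names mirror the Titanic ones; the values are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Age': [22.0, np.nan, 30.0],
    'Sex': ['male', 'female', np.nan],
    'Embarked': ['S', np.nan, 'Q'],
})

# Fill missing values, assigning the result instead of using inplace=True
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Sex'] = df['Sex'].fillna('male')
df['Embarked'] = df['Embarked'].fillna('S')

# Encode categorical variables with map
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

print(df)
```

Assignment (`df[col] = ...`) is preferred over `inplace=True` on chained accessors, which raises a `FutureWarning` in recent pandas and will stop working under copy-on-write semantics.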
},
|
},
|
||||||
@@ -239,7 +243,7 @@
|
|||||||
"metadata": {},
|
"metadata": {},
|
||||||
"outputs": [],
|
"outputs": [],
|
||||||
"source": [
|
"source": [
|
||||||
"#This step will take some time \n",
|
"# This step will take some time \n",
|
||||||
"# Train - This is not needed if you use K-Fold\n",
|
"# Train - This is not needed if you use K-Fold\n",
|
||||||
"\n",
|
"\n",
|
||||||
"model.fit(X_train, y_train)\n",
|
"model.fit(X_train, y_train)\n",
|
||||||
@@ -447,7 +451,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"ROC curve helps to select a threshold to balance sensitivity and recall."
|
"The ROC curve helps to select a threshold that balances sensitivity and specificity."
|
||||||
]
|
]
|
||||||
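As a minimal sketch (on four hand-written scores, not the Titanic data), `sklearn.metrics.roc_curve` returns one candidate threshold per point of the curve, so the trade-off can be inspected directly:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted probabilities of class 1

# Each threshold yields one (false-positive rate, true-positive rate) point
fpr, tpr, thresholds = roc_curve(y_true, y_score)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f'threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}')
```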
},
|
},
|
||||||
{
|
{
|
||||||
@@ -484,7 +488,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"By default, the thresdhold to decide a class is 0.5, If we modify it, we should use the new thresdhold.\n",
|
"By default, the threshold to decide a class is 0.5. If we modify it, we should use the new threshold.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"threshold = 0.8\n",
|
"threshold = 0.8\n",
|
||||||
"\n",
|
"\n",
|
||||||
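Applying a custom threshold amounts to comparing the positive-class probability from `predict_proba` against it, instead of calling `predict`. A sketch on synthetic data (the 0.8 value mirrors the cell above; the estimator choice is ours):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression().fit(X, y)

threshold = 0.8
proba = clf.predict_proba(X)[:, 1]        # probability of class 1
y_pred = (proba >= threshold).astype(int)

# A stricter threshold predicts class 1 no more often than the default 0.5
print(y_pred.sum(), clf.predict(X).sum())
```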
@@ -524,7 +528,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"This is alternative to splitting the dataset into train and test. It will run k times slower than the other method, but it will be more accurate."
|
"This is an alternative to splitting the dataset into training and test sets. It will run k times slower than the other method, but it will be more accurate."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -555,7 +559,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The traning scores decreases with the number of samples. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
|
"We can plot the [learning curve](http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html). The training scores decrease as the number of samples increases. The cross-validation reaches the training score at the end. It seems we will not get a better result with more samples."
|
||||||
]
|
]
|
||||||
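The curve can also be computed without plotting via `sklearn.model_selection.learning_curve`, which refits the estimator on growing training subsets (a sketch on the Iris data with our own parameter choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 5 training sizes from 20% to 100% of the available training data, 5-fold CV
train_sizes, train_scores, valid_scores = learning_curve(
    SVC(), X, y, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# One row per training size, one column per CV fold
print(train_sizes)
print(train_scores.mean(axis=1))
print(valid_scores.mean(axis=1))
```

If the validation score has already plateaued near the training score at the largest size, adding more samples is unlikely to help, which is the diagnosis drawn in the text.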
},
|
},
|
||||||
{
|
{
|
||||||
@@ -578,7 +582,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In this section we are going to provide an alternative version of the previous one with optimization"
|
"In this section, we provide an optimized alternative to the previous version."
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -628,7 +632,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"Any value in the blue survived while anyone in the red did not. Checkout the graph for the linear transformation. It created its decision boundary right on 50%! "
|
"Values in the blue region survived, while those in the red region did not. Check out the graph of the linear transformation. It created its decision boundary right at 50%! "
|
||||||
]
|
]
|
||||||
},
|
},
|
||||||
{
|
{
|
||||||
@@ -658,7 +662,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||||
"\n",
|
"\n",
|
||||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||||
]
|
]
|
||||||
@@ -680,7 +684,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.8.12"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -701,5 +705,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -39,9 +39,9 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"In this exercise, we are going to put in practice what we have learnt in the notebooks of the session. \n",
|
"In this exercise, we are going to put into practice what we have learnt from the session notebooks. \n",
|
||||||
"\n",
|
"\n",
|
||||||
"In the previous notebook we have been applying the SVM machine learning algorithm.\n",
|
"In the previous notebook, we applied the SVM machine learning algorithm.\n",
|
||||||
"\n",
|
"\n",
|
||||||
"Your task is to apply other machine learning algorithms (at least 2) that you have seen in theory or others you are interested in.\n",
|
"Your task is to apply other machine learning algorithms (at least 2) that you have seen in theory or others you are interested in.\n",
|
||||||
"\n",
|
"\n",
|
||||||
@@ -59,7 +59,7 @@
|
|||||||
"cell_type": "markdown",
|
"cell_type": "markdown",
|
||||||
"metadata": {},
|
"metadata": {},
|
||||||
"source": [
|
"source": [
|
||||||
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
"The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
||||||
"\n",
|
"\n",
|
||||||
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
|
||||||
]
|
]
|
||||||
@@ -81,7 +81,7 @@
|
|||||||
"name": "python",
|
"name": "python",
|
||||||
"nbconvert_exporter": "python",
|
"nbconvert_exporter": "python",
|
||||||
"pygments_lexer": "ipython3",
|
"pygments_lexer": "ipython3",
|
||||||
"version": "3.8.12"
|
"version": "3.12.2"
|
||||||
},
|
},
|
||||||
"latex_envs": {
|
"latex_envs": {
|
||||||
"LaTeX_envs_menu_present": true,
|
"LaTeX_envs_menu_present": true,
|
||||||
@@ -102,5 +102,5 @@
|
|||||||
}
|
}
|
||||||
},
|
},
|
||||||
"nbformat": 4,
|
"nbformat": 4,
|
||||||
"nbformat_minor": 1
|
"nbformat_minor": 4
|
||||||
}
|
}
|
||||||
|
|||||||
@@ -35,7 +35,7 @@ def plot_svm(df):
|
|||||||
order = np.random.permutation(n_sample)
|
order = np.random.permutation(n_sample)
|
||||||
|
|
||||||
X = X[order]
|
X = X[order]
|
||||||
y = y[order].astype(np.float)
|
y = y[order].astype(float)
|
||||||
|
|
||||||
# do a cross validation
|
# do a cross validation
|
||||||
nighty_precent_of_sample = int(.9 * n_sample)
|
nighty_precent_of_sample = int(.9 * n_sample)
|
||||||
|
|||||||