mirror of
				https://github.com/gsi-upm/sitc
				synced 2025-10-30 23:18:18 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			464 lines
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
			
		
		
	
	
			464 lines
		
	
	
		
			11 KiB
		
	
	
	
		
			Plaintext
		
	
	
	
	
	
| {
 | |
|  "cells": [
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "editable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "markdown",
 | |
|      "checksum": "7276f055a8c504d3c80098c62ed41a4f",
 | |
|      "grade": false,
 | |
|      "grade_id": "cell-0bfe38f97f6ab2d2",
 | |
|      "locked": true,
 | |
|      "schema_version": 3,
 | |
|      "solution": false
 | |
|     }
 | |
|    },
 | |
|    "source": [
 | |
|     "<header style=\"width:100%;position:relative\">\n",
 | |
|     "  <div style=\"width:80%;float:right;\">\n",
 | |
|     "    <h1>Course Notes for Learning Intelligent Systems</h1>\n",
 | |
|     "    <h3>Department of Telematic Engineering Systems</h3>\n",
 | |
|     "    <h5>Universidad Politécnica de Madrid</h5>\n",
 | |
|     "  </div>\n",
 | |
|     "        <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
 | |
|     "</header>"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "editable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "markdown",
 | |
|      "checksum": "42642609861283bc33914d16750b7efa",
 | |
|      "grade": false,
 | |
|      "grade_id": "cell-0cd673883ee592d1",
 | |
|      "locked": true,
 | |
|      "schema_version": 3,
 | |
|      "solution": false
 | |
|     }
 | |
|    },
 | |
|    "source": [
 | |
|     "## Introduction\n",
 | |
|     "\n",
 | |
|     "In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
 | |
|     "\n",
 | |
|     "In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
 | |
|     "\n",
 | |
|     "The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
 | |
|     "\n",
 | |
|     "Hence, there are two parts.\n",
 | |
|     "First, you will query a set of graphs annotated by students of this course.\n",
 | |
|     "Then, you will query a synthetic dataset that contains similar information."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "editable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "markdown",
 | |
|      "checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
 | |
|      "grade": false,
 | |
|      "grade_id": "cell-10264483046abcc4",
 | |
|      "locked": true,
 | |
|      "schema_version": 3,
 | |
|      "solution": false
 | |
|     }
 | |
|    },
 | |
|    "source": [
 | |
|     "## Objectives\n",
 | |
|     "\n",
 | |
|     "* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
 | |
|     "* Understanding the challenges in querying multiple sources, with different annotators.\n"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "editable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "markdown",
 | |
|      "checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
 | |
|      "grade": false,
 | |
|      "grade_id": "cell-4f8492996e74bf20",
 | |
|      "locked": true,
 | |
|      "schema_version": 3,
 | |
|      "solution": false
 | |
|     }
 | |
|    },
 | |
|    "source": [
 | |
|     "## Tools\n",
 | |
|     "\n",
 | |
|     "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "editable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "markdown",
 | |
|      "checksum": "c5f8646518bd832a47d71f9d3218237a",
 | |
|      "grade": false,
 | |
|      "grade_id": "cell-eb13908482825e42",
 | |
|      "locked": true,
 | |
|      "schema_version": 3,
 | |
|      "solution": false
 | |
|     }
 | |
|    },
 | |
|    "source": [
 | |
|     "Run this line to enable the `%%sparql` magic command."
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "from helpers import *"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Exercises\n"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
 | |
|     "The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
 | |
|     "\n",
 | |
|     "**Each graph is a separate set of triples**.\n",
 | |
|     "For this exercise, you could think of graphs as individual endpoints.\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "First, let us get a list of graphs available:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "%%sparql http://fuseki.gsi.upm.es/hotels\n",
 | |
|     "    \n",
 | |
|     "SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
 | |
|     "    GRAPH ?g {\n",
 | |
|     "        ?s ?p ?o\n",
 | |
|     "    }\n",
 | |
|     "}\n",
 | |
|     "GROUP BY ?g\n",
 | |
|     "ORDER BY desc(?count)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "You should see many graphs, with different triple counts.\n",
 | |
|     "\n",
 | |
|     "The biggest one should be http://fuseki.gsi.upm.es/synthetic"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Once you have this list, you can query specific graphs like so:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "%%sparql http://fuseki.gsi.upm.es/hotels\n",
 | |
|     "    \n",
 | |
|     "SELECT *\n",
 | |
|     "WHERE {\n",
 | |
|     "    GRAPH <http://fuseki.gsi.upm.es/synthetic>{\n",
 | |
|     "    ?s ?p ?o .\n",
 | |
|     "    }\n",
 | |
|     "}\n",
 | |
|     "LIMIT 10"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "There are two exercises in this notebook.\n",
 | |
|     "\n",
 | |
|     "In each of them, you are asked to run five queries, to answer the following questions:\n",
 | |
|     "\n",
 | |
|     "* Number of hotels (or entities) with reviews\n",
 | |
|     "* Number of reviews\n",
 | |
|     "* The hotel with the lowest average score\n",
 | |
|     "* The hotel with the highest average score\n",
 | |
|     "* A list of hotels with their addresses and telephone numbers"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Manually annotated data"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
 | |
|     "\n",
 | |
|     "To design the queries, what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
 | |
|     "\n",
 | |
|     "Here's an example to get the entities and their types in a graph:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "%%sparql http://fuseki.gsi.upm.es/hotels\n",
 | |
|     "\n",
 | |
|     "PREFIX schema: <http://schema.org/>\n",
 | |
|     "    \n",
 | |
|     "SELECT ?s ?o\n",
 | |
|     "WHERE {\n",
 | |
|     "    GRAPH <http://fuseki.gsi.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
 | |
|     "        ?s a ?o .\n",
 | |
|     "    }\n",
 | |
|     "\n",
 | |
|     "}\n",
 | |
|     "LIMIT 40"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Synthetic dataset\n",
 | |
|     "\n",
 | |
|     "Now, run the same queries in the synthetic dataset.\n",
 | |
|     "\n",
 | |
|     "The query below should get you started:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "%%sparql http://fuseki.gsi.upm.es/hotels\n",
 | |
|     "    \n",
 | |
|     "SELECT *\n",
 | |
|     "WHERE {\n",
 | |
|     "    GRAPH <http://fuseki.gsi.upm.es/synthetic>{\n",
 | |
|     "    ?s ?p ?o .\n",
 | |
|     "    }\n",
 | |
|     "}\n",
 | |
|     "LIMIT 10"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Optional exercise\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
 | |
|     "\n",
 | |
|     "Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
 | |
|     "\n",
 | |
|     "You can also query all the graphs at the same time. e.g. to get all types used:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {},
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "%%sparql http://fuseki.gsi.upm.es/hotels\n",
 | |
|     "\n",
 | |
|     "PREFIX schema: <http://schema.org/>\n",
 | |
|     "    \n",
 | |
|     "SELECT DISTINCT ?o\n",
 | |
|     "WHERE {\n",
 | |
|     "    GRAPH ?g {\n",
 | |
|     "        ?s a ?o .\n",
 | |
|     "    }\n",
 | |
|     "    {\n",
 | |
|     "        SELECT ?g\n",
 | |
|     "        WHERE {\n",
 | |
|     "           GRAPH ?g {}\n",
 | |
|     "           FILTER (str(?g) != 'http://fuseki.gsi.upm.es/synthetic')\n",
 | |
|     "        }\n",
 | |
|     "    }\n",
 | |
|     "\n",
 | |
|     "\n",
 | |
|     "}\n",
 | |
|     "LIMIT 50"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "### Discussion\n",
 | |
|     "\n",
 | |
|     "Compare the results of the synthetic and the manual dataset, and answer these questions:"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Both datasets should use the same schema. Are there any differences when it comes to using them?"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "code",
 | |
|      "checksum": "860c3977cd06736f1342d535944dbb63",
 | |
|      "grade": true,
 | |
|      "grade_id": "cell-9bd08e4f5842cb89",
 | |
|      "locked": false,
 | |
|      "points": 0,
 | |
|      "schema_version": 3,
 | |
|      "solution": true
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# YOUR ANSWER HERE"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Are the annotations used correctly in every graph?"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "code",
 | |
|      "checksum": "1946a7ed4aba8d168bb3fad898c05651",
 | |
|      "grade": true,
 | |
|      "grade_id": "cell-9dc1c9033198bb18",
 | |
|      "locked": false,
 | |
|      "points": 0,
 | |
|      "schema_version": 3,
 | |
|      "solution": true
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# YOUR ANSWER HERE"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "Has any of the datasets been harder to query? If so, why?"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "code",
 | |
|    "execution_count": null,
 | |
|    "metadata": {
 | |
|     "deletable": false,
 | |
|     "nbgrader": {
 | |
|      "cell_type": "code",
 | |
|      "checksum": "6714abc5226618b76dc4c1aaed6d1a49",
 | |
|      "grade": true,
 | |
|      "grade_id": "cell-6c18003ced54be23",
 | |
|      "locked": false,
 | |
|      "points": 0,
 | |
|      "schema_version": 3,
 | |
|      "solution": true
 | |
|     }
 | |
|    },
 | |
|    "outputs": [],
 | |
|    "source": [
 | |
|     "# YOUR ANSWER HERE"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## References"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
 | |
|     "* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
 | |
|    ]
 | |
|   },
 | |
|   {
 | |
|    "cell_type": "markdown",
 | |
|    "metadata": {},
 | |
|    "source": [
 | |
|     "## Licence\n",
 | |
|     "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
 | |
|     "\n",
 | |
|     "© Universidad Politécnica de Madrid."
 | |
|    ]
 | |
|   }
 | |
|  ],
 | |
|  "metadata": {
 | |
|   "kernelspec": {
 | |
|    "display_name": "Python 3",
 | |
|    "language": "python",
 | |
|    "name": "python3"
 | |
|   },
 | |
|   "language_info": {
 | |
|    "codemirror_mode": {
 | |
|     "name": "ipython",
 | |
|     "version": 3
 | |
|    },
 | |
|    "file_extension": ".py",
 | |
|    "mimetype": "text/x-python",
 | |
|    "name": "python",
 | |
|    "nbconvert_exporter": "python",
 | |
|    "pygments_lexer": "ipython3",
 | |
|    "version": "3.8.1"
 | |
|   }
 | |
|  },
 | |
|  "nbformat": 4,
 | |
|  "nbformat_minor": 2
 | |
| }
 |