Add SPARQL custom endpoint

2026-06-06 23:51:58 +00:00 · 2019-02-20 19:34:09 +01:00
parent fc07718ae8
commit a4f8f69b19
1 changed files with 459 additions and 0 deletions
--- a/lod/02_SPARQL_Custom_Endpoint.ipynb
+++ b/lod/02_SPARQL_Custom_Endpoint.ipynb
@@ -0,0 +1,459 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "deletable": false,
+    "editable": false,
+    "nbgrader": {
+     "checksum": "7276f055a8c504d3c80098c62ed41a4f",
+     "grade": false,
+     "grade_id": "cell-0bfe38f97f6ab2d2",
+     "locked": true,
+     "schema_version": 1,
+     "solution": false
+    }
+   },
+   "source": [
+    "<header style=\"width:100%;position:relative\">\n",
+    "  <div style=\"width:80%;float:right;\">\n",
+    "    <h1>Course Notes for Learning Intelligent Systems</h1>\n",
+    "    <h3>Department of Telematic Engineering Systems</h3>\n",
+    "    <h5>Universidad Politécnica de Madrid</h5>\n",
+    "  </div>\n",
+    "        <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
+    "</header>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "deletable": false,
+    "editable": false,
+    "nbgrader": {
+     "checksum": "42642609861283bc33914d16750b7efa",
+     "grade": false,
+     "grade_id": "cell-0cd673883ee592d1",
+     "locked": true,
+     "schema_version": 1,
+     "solution": false
+    }
+   },
+   "source": [
+    "## Introduction\n",
+    "\n",
+    "In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
+    "\n",
+    "In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
+    "\n",
+    "The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
+    "\n",
+    "Hence, there are two parts.\n",
+    "First, you will query a set of graphs annotated by students of this course.\n",
+    "Then, you will query a synthetic dataset that contains similar information."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "deletable": false,
+    "editable": false,
+    "nbgrader": {
+     "checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
+     "grade": false,
+     "grade_id": "cell-10264483046abcc4",
+     "locked": true,
+     "schema_version": 1,
+     "solution": false
+    }
+   },
+   "source": [
+    "## Objectives\n",
+    "\n",
+    "* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
+    "* Understanding the challenges in querying multiple sources, with different annotators.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "deletable": false,
+    "editable": false,
+    "nbgrader": {
+     "checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
+     "grade": false,
+     "grade_id": "cell-4f8492996e74bf20",
+     "locked": true,
+     "schema_version": 1,
+     "solution": false
+    }
+   },
+   "source": [
+    "## Tools\n",
+    "\n",
+    "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "deletable": false,
+    "editable": false,
+    "nbgrader": {
+     "checksum": "c5f8646518bd832a47d71f9d3218237a",
+     "grade": false,
+     "grade_id": "cell-eb13908482825e42",
+     "locked": true,
+     "schema_version": 1,
+     "solution": false
+    }
+   },
+   "source": [
+    "Run this line to enable the `%%sparql` magic command."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from helpers import *"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Exercises\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
+    "The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
+    "\n",
+    "**Each graph is a separate set of triples**.\n",
+    "For this exercise, you could think of graphs as individual endpoints.\n",
+    "\n",
+    "\n",
+    "First, let us get a list of graphs available:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
+    "    \n",
+    "SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
+    "    GRAPH ?g {\n",
+    "        ?s ?p ?o\n",
+    "    }\n",
+    "}\n",
+    "GROUP BY ?g\n",
+    "ORDER BY desc(?count)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "You should see many graphs, with different triple counts.\n",
+    "\n",
+    "The biggest one should be http://fuseki.cluster.gsi.dit.upm.es/synthetic"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Once you have this list, you can query specific graphs like so:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
+    "    \n",
+    "SELECT *\n",
+    "WHERE {\n",
+    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
+    "    ?s ?p ?o .\n",
+    "    }\n",
+    "}\n",
+    "LIMIT 10"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are two exercises in this notebook.\n",
+    "\n",
+    "In each of them, you are asked to run five queries, to answer the following questions:\n",
+    "\n",
+    "* Number of hotels (or entities) with reviews\n",
+    "* Number of reviews\n",
+    "* The hotel with the lowest average score\n",
+    "* The hotel with the highest average score\n",
+    "* A list of hotels with their addresses and telephone numbers"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Manually annotated data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
+    "\n",
+    "To design the queries, you can either use what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
+    "\n",
+    "You will get a better understanding if you follow the exploratory path.\n",
+    "\n",
+    "Here's an example to get the entities and their types in a graph:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
+    "\n",
+    "PREFIX schema: <http://schema.org/>\n",
+    "    \n",
+    "SELECT ?s ?o\n",
+    "WHERE {\n",
+    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
+    "        ?s a ?o .\n",
+    "    }\n",
+    "\n",
+    "}\n",
+    "LIMIT 40"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Synthetic dataset\n",
+    "\n",
+    "Now, run the same queries in the synthetic dataset.\n",
+    "\n",
+    "The query below should get you started:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
+    "    \n",
+    "SELECT *\n",
+    "WHERE {\n",
+    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
+    "    ?s ?p ?o .\n",
+    "    }\n",
+    "}\n",
+    "LIMIT 10"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Optional exercise\n",
+    "\n",
+    "\n",
+    "Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
+    "\n",
+    "Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
+    "\n",
+    "You can also query all the graphs at the same time. e.g. to get all types used:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
+    "\n",
+    "PREFIX schema: <http://schema.org/>\n",
+    "    \n",
+    "SELECT DISTINCT ?o\n",
+    "WHERE {\n",
+    "    GRAPH ?g {\n",
+    "        ?s a ?o .\n",
+    "    }\n",
+    "    {\n",
+    "        SELECT ?g\n",
+    "        WHERE {\n",
+    "           GRAPH ?g {}\n",
+    "           FILTER (str(?g) != 'http://fuseki.cluster.gsi.dit.upm.es/synthetic')\n",
+    "        }\n",
+    "    }\n",
+    "\n",
+    "\n",
+    "}\n",
+    "LIMIT 50"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Discussion\n",
+    "\n",
+    "Compare the results of the synthetic and the manual dataset, and answer these questions:"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Both datasets should use the same schema. Are there any differences when it comes to using them?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "nbgrader": {
+     "checksum": "860c3977cd06736f1342d535944dbb63",
+     "grade": true,
+     "grade_id": "cell-9bd08e4f5842cb89",
+     "locked": false,
+     "points": 0,
+     "schema_version": 1,
+     "solution": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# YOUR ANSWER HERE"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Are the annotations used correctly in every graph?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "nbgrader": {
+     "checksum": "1946a7ed4aba8d168bb3fad898c05651",
+     "grade": true,
+     "grade_id": "cell-9dc1c9033198bb18",
+     "locked": false,
+     "points": 0,
+     "schema_version": 1,
+     "solution": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# YOUR ANSWER HERE"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Has any of the datasets been harder to query? If so, why?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "deletable": false,
+    "nbgrader": {
+     "checksum": "6714abc5226618b76dc4c1aaed6d1a49",
+     "grade": true,
+     "grade_id": "cell-6c18003ced54be23",
+     "locked": false,
+     "points": 0,
+     "schema_version": 1,
+     "solution": true
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# YOUR ANSWER HERE"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## References"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
+    "* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Licence\n",
+    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
+    "\n",
+    "© 2018 Universidad Politécnica de Madrid."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}