From a4f8f69b191150bec43e47d25d9cf79ec4f4122d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?J=2E=20Fernando=20S=C3=A1nchez?= Date: Wed, 20 Feb 2019 19:34:09 +0100 Subject: [PATCH] Add SPARQL custom endpoint --- lod/02_SPARQL_Custom_Endpoint.ipynb | 459 ++++++++++++++++++++++++++++ 1 file changed, 459 insertions(+) create mode 100644 lod/02_SPARQL_Custom_Endpoint.ipynb diff --git a/lod/02_SPARQL_Custom_Endpoint.ipynb b/lod/02_SPARQL_Custom_Endpoint.ipynb new file mode 100644 index 0000000..a4726d0 --- /dev/null +++ b/lod/02_SPARQL_Custom_Endpoint.ipynb @@ -0,0 +1,459 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "7276f055a8c504d3c80098c62ed41a4f", + "grade": false, + "grade_id": "cell-0bfe38f97f6ab2d2", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "
\n", + "
\n", + "

Course Notes for Learning Intelligent Systems

\n", + "

Department of Telematic Engineering Systems

\n", + "
Universidad Politécnica de Madrid
\n", + "
\n", + " \"UPM\"\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "42642609861283bc33914d16750b7efa", + "grade": false, + "grade_id": "cell-0cd673883ee592d1", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Introduction\n", + "\n", + "In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n", + "\n", + "In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n", + "\n", + "The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n", + "\n", + "Hence, there are two parts.\n", + "First, you will query a set of graphs annotated by students of this course.\n", + "Then, you will query a synthetic dataset that contains similar information." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b", + "grade": false, + "grade_id": "cell-10264483046abcc4", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Objectives\n", + "\n", + "* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n", + "* Understanding the challenges in querying multiple sources, with different annotators.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "2fedf0d73fc90104d1ab72c3413dfc83", + "grade": false, + "grade_id": "cell-4f8492996e74bf20", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Tools\n", + "\n", + "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "c5f8646518bd832a47d71f9d3218237a", + "grade": false, + "grade_id": "cell-eb13908482825e42", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "Run this line to enable the `%%sparql` magic command." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from helpers import *" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercises\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Querying the manually annotated dataset will be slightly different from querying DBpedia.\n", + "The main difference is that this dataset uses different graphs to separate the annotations from different students.\n", + "\n", + "**Each graph is a separate set of triples**.\n", + "For this exercise, you could think of graphs as individual endpoints.\n", + "\n", + "\n", + "First, let us get a list of graphs available:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n", + " \n", + "SELECT ?g (COUNT(?s) as ?count) WHERE {\n", + " GRAPH ?g {\n", + " ?s ?p ?o\n", + " }\n", + "}\n", + "GROUP BY ?g\n", + "ORDER BY desc(?count)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "You should see many graphs, with different triple counts.\n", + "\n", + "The biggest one should be http://fuseki.cluster.gsi.dit.upm.es/synthetic" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Once you have this list, you can query specific graphs like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n", + " \n", + "SELECT *\n", + "WHERE {\n", + " GRAPH {\n", + " ?s ?p ?o .\n", + " }\n", + "}\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are two exercises in this notebook.\n", + "\n", + "In each of them, you are asked to run five queries, to answer the following questions:\n", + "\n", + "* Number of hotels (or entities) with reviews\n", + "* Number of reviews\n", + "* The hotel with the lowest average score\n", + "* The hotel with the highest average score\n", + "* A list of hotels with their addresses and telephone numbers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Manually annotated data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n", + "\n", + "To design the queries, you can either use what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n", + "\n", + "You will get a better understanding if you follow the exploratory path.\n", + "\n", + "Here's an example to get the entities and their types in a graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n", + "\n", + "PREFIX schema: \n", + " \n", + "SELECT ?s ?o\n", + "WHERE {\n", + " GRAPH {\n", + " ?s a ?o .\n", + " }\n", + "\n", + "}\n", + "LIMIT 40" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Synthetic dataset\n", + "\n", + "Now, run the same queries in the synthetic dataset.\n", + "\n", + "The query below should get you started:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n", + " \n", + "SELECT *\n", + "WHERE {\n", + " GRAPH {\n", + " ?s ?p ?o .\n", + " }\n", + "}\n", + "LIMIT 10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Optional exercise\n", + "\n", + "\n", + "Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n", + "\n", + "Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n", + "\n", + "You can also query all the graphs at the same time. e.g. to get all types used:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n", + "\n", + "PREFIX schema: \n", + " \n", + "SELECT DISTINCT ?o\n", + "WHERE {\n", + " GRAPH ?g {\n", + " ?s a ?o .\n", + " }\n", + " {\n", + " SELECT ?g\n", + " WHERE {\n", + " GRAPH ?g {}\n", + " FILTER (str(?g) != 'http://fuseki.cluster.gsi.dit.upm.es/synthetic')\n", + " }\n", + " }\n", + "\n", + "\n", + "}\n", + "LIMIT 50" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Discussion\n", + "\n", + "Compare the results of the synthetic and the manual dataset, and answer these questions:" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Both datasets should use the same schema. Are there any differences when it comes to using them?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "860c3977cd06736f1342d535944dbb63", + "grade": true, + "grade_id": "cell-9bd08e4f5842cb89", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "outputs": [], + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Are the annotations used correctly in every graph?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "1946a7ed4aba8d168bb3fad898c05651", + "grade": true, + "grade_id": "cell-9dc1c9033198bb18", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "outputs": [], + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Has any of the datasets been harder to query? If so, why?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "6714abc5226618b76dc4c1aaed6d1a49", + "grade": true, + "grade_id": "cell-6c18003ced54be23", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "outputs": [], + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## References" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n", + "* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Licence\n", + "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n", + "\n", + "© 2018 Universidad Politécnica de Madrid." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}