2025-06-16 21:02:20 +00:00
9 changed files with 2003 additions and 2369 deletions
--- a/lod/01_SPARQL_Introduction.ipynb
+++ b/lod/01_SPARQL_Introduction.ipynb
--- a/lod/02_SPARQL_Custom_Endpoint.ipynb
+++ b/lod/02_SPARQL_Custom_Endpoint.ipynb
@ -1,459 +0,0 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "7276f055a8c504d3c80098c62ed41a4f",
     "grade": false,
     "grade_id": "cell-0bfe38f97f6ab2d2",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "<header style=\"width:100%;position:relative\">\n",
    "  <div style=\"width:80%;float:right;\">\n",
    "    <h1>Course Notes for Learning Intelligent Systems</h1>\n",
    "    <h3>Department of Telematic Engineering Systems</h3>\n",
    "    <h5>Universidad Politécnica de Madrid</h5>\n",
    "  </div>\n",
    "        <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
    "</header>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "42642609861283bc33914d16750b7efa",
     "grade": false,
     "grade_id": "cell-0cd673883ee592d1",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Introduction\n",
    "\n",
    "In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
    "\n",
    "In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
    "\n",
    "The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
    "\n",
    "Hence, there are two parts.\n",
    "First, you will query a set of graphs annotated by students of this course.\n",
    "Then, you will query a synthetic dataset that contains similar information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
     "grade": false,
     "grade_id": "cell-10264483046abcc4",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Objectives\n",
    "\n",
    "* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
    "* Understanding the challenges in querying multiple sources, with different annotators.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
     "grade": false,
     "grade_id": "cell-4f8492996e74bf20",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Tools\n",
    "\n",
    "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "c5f8646518bd832a47d71f9d3218237a",
     "grade": false,
     "grade_id": "cell-eb13908482825e42",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "Run this line to enable the `%%sparql` magic command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
    "The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
    "\n",
    "**Each graph is a separate set of triples**.\n",
    "For this exercise, you could think of graphs as individual endpoints.\n",
    "\n",
    "\n",
    "First, let us get a list of graphs available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
    "    GRAPH ?g {\n",
    "        ?s ?p ?o\n",
    "    }\n",
    "}\n",
    "GROUP BY ?g\n",
    "ORDER BY desc(?count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "You should see many graphs, with different triple counts.\n",
    "\n",
    "The biggest one should be http://fuseki.cluster.gsi.dit.upm.es/synthetic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have this list, you can query specific graphs like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT *\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
    "    ?s ?p ?o .\n",
    "    }\n",
    "}\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two exercises in this notebook.\n",
    "\n",
    "In each of them, you are asked to run five queries, to answer the following questions:\n",
    "\n",
    "* Number of hotels (or entities) with reviews\n",
    "* Number of reviews\n",
    "* The hotel with the lowest average score\n",
    "* The hotel with the highest average score\n",
    "* A list of hotels with their addresses and telephone numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Manually annotated data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
    "\n",
    "To design the queries, you can either use what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
    "\n",
    "You will get a better understanding if you follow the exploratory path.\n",
    "\n",
    "Here's an example to get the entities and their types in a graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "\n",
    "PREFIX schema: <http://schema.org/>\n",
    "    \n",
    "SELECT ?s ?o\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
    "        ?s a ?o .\n",
    "    }\n",
    "\n",
    "}\n",
    "LIMIT 40"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Synthetic dataset\n",
    "\n",
    "Now, run the same queries in the synthetic dataset.\n",
    "\n",
    "The query below should get you started:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT *\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
    "    ?s ?p ?o .\n",
    "    }\n",
    "}\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional exercise\n",
    "\n",
    "\n",
    "Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
    "\n",
    "Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
    "\n",
    "You can also query all the graphs at the same time. e.g. to get all types used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "\n",
    "PREFIX schema: <http://schema.org/>\n",
    "    \n",
    "SELECT DISTINCT ?o\n",
    "WHERE {\n",
    "    GRAPH ?g {\n",
    "        ?s a ?o .\n",
    "    }\n",
    "    {\n",
    "        SELECT ?g\n",
    "        WHERE {\n",
    "           GRAPH ?g {}\n",
    "           FILTER (str(?g) != 'http://fuseki.cluster.gsi.dit.upm.es/synthetic')\n",
    "        }\n",
    "    }\n",
    "\n",
    "\n",
    "}\n",
    "LIMIT 50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "\n",
    "Compare the results of the synthetic and the manual dataset, and answer these questions:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both datasets should use the same schema. Are there any differences when it comes to using them?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "860c3977cd06736f1342d535944dbb63",
     "grade": true,
     "grade_id": "cell-9bd08e4f5842cb89",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are the annotations used correctly in every graph?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "1946a7ed4aba8d168bb3fad898c05651",
     "grade": true,
     "grade_id": "cell-9dc1c9033198bb18",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Has any of the datasets been harder to query? If so, why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "6714abc5226618b76dc4c1aaed6d1a49",
     "grade": true,
     "grade_id": "cell-6c18003ced54be23",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
    "* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2018 Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/lod/README.md
+++ b/lod/README.md
@ -0,0 +1,11 @@
 # Files included #
 * `validate.py` validates and serializes a turtle dataset
 * `sparql.py` runs a custom sparql query on a given dataset (by default, `reviews.ttl`)
 * `extract_data.py` extracts RDFa, micro-data and JSON-LD data from a given URL
 # Installation #
 ```
 pip install --user -r requirements.txt
 ```
--- a/lod/SPARQL.ipynb
+++ b/lod/SPARQL.ipynb
--- a/lod/extract_data.py
+++ b/lod/extract_data.py
@ -0,0 +1,49 @@
 import sys
 from future.standard_library import install_aliases
 install_aliases()
 from urllib import request, parse
 from rdflib import Graph, term
 from lxml import etree
 if len(sys.argv) < 2:
    print('Usage: python {} <URL>'.format(sys.argv[0]))
    print('')
    print('Extract rdfa, microdata and json-ld annotations from a website')
    exit(1)
 url = sys.argv[1]
 g = Graph()
 g.parse(url, format='rdfa')
 g.parse(url, format='microdata')
 def sanitize_triple(t):
    """Function to remove bad URIs from the graph that would otherwise
    make the serialization fail."""
    def sanitize_triple_item(item):
        if isinstance(item, term.URIRef) and '/' not in item:
            return term.URIRef(parse.quote(str(item)))
        return item
    return (sanitize_triple_item(t[0]),
            sanitize_triple_item(t[1]),
            sanitize_triple_item(t[2]))
 with request.urlopen(url) as response:
    # Get all json-ld objects embedded in the html file
    html = response.read().decode('utf-8', errors='ignore')
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(html, parser=parser)
    if root:
        for jsonld in root.findall(".//script[@type='application/ld+json']"):
            g.parse(data=jsonld.text, publicID=url, format='json-ld')
 fixedgraph = Graph()
 fixedgraph += [sanitize_triple(s) for s in g]
 print(g.serialize(format='turtle').decode('utf-8', errors='ignore'))
--- a/lod/helpers.py
+++ b/lod/helpers.py
@ -1,22 +1,12 @@
 '''
 Helper functions and ipython magic for the SPARQL exercises.
 The tests in the notebooks rely on the `LAST_QUERY` variable, which is updated by the `%%sparql` magic after every query.
 This variable contains the full query used (`LAST_QUERY["query"]`), the endpoint it was sent to (`LAST_QUERY["endpoint"]`), and a dictionary with the response of the endpoint (`LAST_QUERY["results"]`).
 For convenience, the results are also given as tuples (`LAST_QUERY["tuples"]`), and as a dictionary of of `{column:[values]}` (`LAST_QUERY["columns"]`).
 '''
 from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)
-from IPython.display import HTML, display, Image, display_javascript
+
 from IPython.display import HTML, display, Image
 from urllib.request import Request, urlopen
 from urllib.parse import quote_plus, urlencode
 from urllib.error import HTTPError
 import json
 import sys
 js = "IPython.CodeCell.options_default.highlight_modes['magic_sparql'] = {'reg':[/^%%sparql/]};"
 display_javascript(js, raw=True)
 def send_query(query, endpoint):
@ -30,11 +20,7 @@ def send_query(query, endpoint):
                headers={'content-type': 'application/x-www-form-urlencoded',
                         'accept': FORMATS},
                method='POST')
-    res = urlopen(r)
+    return json.loads(urlopen(r).read().decode('utf-8'));
    data = res.read().decode('utf-8')
    if res.getcode() == 200:
        return json.loads(data)
    raise Exception('Error getting results: {}'.format(data))
 def tabulate(tuples, header=None):
@ -53,14 +39,11 @@ def tabulate(tuples, header=None):
 LAST_QUERY = {}
 def solution():
    return LAST_QUERY
 def query(query, endpoint=None, print_table=False):
    global LAST_QUERY
-    endpoint = endpoint or "http://fuseki.cluster.gsi.dit.upm.es/sitc/"
+    endpoint = endpoint or "http://dbpedia.org/sparql"
    results = send_query(query, endpoint)
    tuples = to_table(results)
@ -97,30 +80,12 @@ def to_table(results):
@register_cell_magic
 def sparql(line, cell):
    '''
    Sparql magic command for ipython. It can be used in a cell like this:
    ```
    %%sparql
    ... Your SPARQL query ...
    ```
    by default, it will use the DBpedia endpoint, but you can use a different endpoint like this:
    ```
    %%sparql http://my-sparql-endpoint...
    ... Your SPARQL query ...
    ```
    '''
    try:
        return query(cell, endpoint=line, print_table=True)
    except HTTPError as ex:
        error_message = ex.read().decode('utf-8')
        print('Error {}. Reason: {}'.format(ex.status, ex.reason))
-        print(error_message, file=sys.stderr)
+        print(error_message)
 def show_photos(values):
--- a/lod/reviews.ttl
+++ b/lod/reviews.ttl
@ -0,0 +1,29 @@
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
 _:Hotel1 a schema:Hotel ;
         schema:description "A fictitious hotel" .
 _:Review1 a schema:Review ;
          schema:reviewBody "This is a great review" ;
          schema:reviewRating [
           a schema:Rating ;
           schema:author <http://jfernando.es/me> ;
           schema:ratingValue "0.7"
          ] ;
          schema:itemReviewed _:Hotel1 .
 _:Review2 a schema:Review ;
          schema:reviewBody "This is a not so great review" ;
          schema:reviewRating [
           a schema:Rating ;
           schema:author [ a schema:Person ;
           schema:givenName "anonymous" ] ;
           schema:ratingValue "0.3"
          ] ;
          schema:itemReviewed _:Hotel1 .
--- a/lod/sparql.py
+++ b/lod/sparql.py
@ -0,0 +1,23 @@
 # !/bin/env python #
 # Ejemplo de consultas SPARQL sobre turtle #
 # python consultas.py #
 import rdflib
 import sys
 dataset = sys.argv[1] if len(sys.argv) > 1 else 'reviews.ttl'
 g = rdflib.Graph()
 schema = rdflib.Namespace("http://schema.org/")
 # Read Turtle file #
 g.parse(dataset, format='turtle')
 results = g.query(
    """SELECT DISTINCT ?review ?p ?o
       WHERE {
          ?review a schema:Review.
          ?review ?p ?o.
       }""", initNs={'schema': schema})
 for row in results:
    print("%s %s %s" % row)
--- a/lod/validate.py
+++ b/lod/validate.py
@ -0,0 +1,6 @@
 import rdflib
 import sys
 g = rdflib.Graph()
 dataset = sys.argv[1] if len(sys.argv) > 1 else 'reviews.ttl'
 g.parse(dataset, format="n3")
 print(g.serialize(format="n3").decode('utf-8'))