sitc/lod/02_SPARQL_Custom_Endpoint.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "7276f055a8c504d3c80098c62ed41a4f",
     "grade": false,
     "grade_id": "cell-0bfe38f97f6ab2d2",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "<header style=\"width:100%;position:relative\">\n",
    "  <div style=\"width:80%;float:right;\">\n",
    "    <h1>Course Notes for Learning Intelligent Systems</h1>\n",
    "    <h3>Department of Telematic Engineering Systems</h3>\n",
    "    <h5>Universidad Politécnica de Madrid</h5>\n",
    "  </div>\n",
    "        <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
    "</header>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "42642609861283bc33914d16750b7efa",
     "grade": false,
     "grade_id": "cell-0cd673883ee592d1",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Introduction\n",
    "\n",
    "In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
    "\n",
    "In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
    "\n",
    "The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
    "\n",
    "Hence, there are two parts.\n",
    "First, you will query a set of graphs annotated by students of this course.\n",
    "Then, you will query a synthetic dataset that contains similar information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
     "grade": false,
     "grade_id": "cell-10264483046abcc4",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Objectives\n",
    "\n",
    "* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
    "* Understanding the challenges in querying multiple sources, with different annotators.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
     "grade": false,
     "grade_id": "cell-4f8492996e74bf20",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "## Tools\n",
    "\n",
    "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "checksum": "c5f8646518bd832a47d71f9d3218237a",
     "grade": false,
     "grade_id": "cell-eb13908482825e42",
     "locked": true,
     "schema_version": 1,
     "solution": false
    }
   },
   "source": [
    "Run this line to enable the `%%sparql` magic command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
    "The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
    "\n",
    "**Each graph is a separate set of triples**.\n",
    "For this exercise, you could think of graphs as individual endpoints.\n",
    "\n",
    "\n",
    "First, let us get a list of graphs available:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
    "    GRAPH ?g {\n",
    "        ?s ?p ?o\n",
    "    }\n",
    "}\n",
    "GROUP BY ?g\n",
    "ORDER BY desc(?count)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "You should see many graphs, with different triple counts.\n",
    "\n",
    "The biggest one should be http://fuseki.cluster.gsi.dit.upm.es/synthetic"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once you have this list, you can query specific graphs like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT *\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
    "    ?s ?p ?o .\n",
    "    }\n",
    "}\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are two exercises in this notebook.\n",
    "\n",
    "In each of them, you are asked to run five queries, to answer the following questions:\n",
    "\n",
    "* Number of hotels (or entities) with reviews\n",
    "* Number of reviews\n",
    "* The hotel with the lowest average score\n",
    "* The hotel with the highest average score\n",
    "* A list of hotels with their addresses and telephone numbers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Manually annotated data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
    "\n",
    "To design the queries, what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
    "\n",
    "Here's an example to get the entities and their types in a graph:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "\n",
    "PREFIX schema: <http://schema.org/>\n",
    "    \n",
    "SELECT ?s ?o\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
    "        ?s a ?o .\n",
    "    }\n",
    "\n",
    "}\n",
    "LIMIT 40"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Synthetic dataset\n",
    "\n",
    "Now, run the same queries in the synthetic dataset.\n",
    "\n",
    "The query below should get you started:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "    \n",
    "SELECT *\n",
    "WHERE {\n",
    "    GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
    "    ?s ?p ?o .\n",
    "    }\n",
    "}\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Optional exercise\n",
    "\n",
    "\n",
    "Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
    "\n",
    "Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
    "\n",
    "You can also query all the graphs at the same time. e.g. to get all types used:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
    "\n",
    "PREFIX schema: <http://schema.org/>\n",
    "    \n",
    "SELECT DISTINCT ?o\n",
    "WHERE {\n",
    "    GRAPH ?g {\n",
    "        ?s a ?o .\n",
    "    }\n",
    "    {\n",
    "        SELECT ?g\n",
    "        WHERE {\n",
    "           GRAPH ?g {}\n",
    "           FILTER (str(?g) != 'http://fuseki.cluster.gsi.dit.upm.es/synthetic')\n",
    "        }\n",
    "    }\n",
    "\n",
    "\n",
    "}\n",
    "LIMIT 50"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "\n",
    "Compare the results of the synthetic and the manual dataset, and answer these questions:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both datasets should use the same schema. Are there any differences when it comes to using them?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "860c3977cd06736f1342d535944dbb63",
     "grade": true,
     "grade_id": "cell-9bd08e4f5842cb89",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Are the annotations used correctly in every graph?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "1946a7ed4aba8d168bb3fad898c05651",
     "grade": true,
     "grade_id": "cell-9dc1c9033198bb18",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Has any of the datasets been harder to query? If so, why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "checksum": "6714abc5226618b76dc4c1aaed6d1a49",
     "grade": true,
     "grade_id": "cell-6c18003ced54be23",
     "locked": false,
     "points": 0,
     "schema_version": 1,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "# YOUR ANSWER HERE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
    "* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}