sitc/lod/04_SPARQL_Advanced.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "7276f055a8c504d3c80098c62ed41a4f",
     "grade": false,
     "grade_id": "cell-0bfe38f97f6ab2d2",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "<header style=\"width:100%;position:relative\">\n",
    "  <div style=\"width:80%;float:right;\">\n",
    "    <h1>Course Notes for Learning Intelligent Systems</h1>\n",
    "    <h3>Department of Telematic Engineering Systems</h3>\n",
    "    <h5>Universidad Politécnica de Madrid</h5>\n",
    "  </div>\n",
    "        <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
    "</header>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "bd478e6253226d24ba7f33cb9f6ba706",
     "grade": false,
     "grade_id": "cell-0cd673883ee592d1",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "## Advanced SPARQL\n",
    "\n",
    "This notebook complements [the SPARQL notebook](./01_SPARQL.ipynb) with some advanced commands.\n",
    "\n",
    "If you have not completed the exercises in the previous notebook, please do so before continuing.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "9ea4fd529653214745b937d5fc4559e5",
     "grade": false,
     "grade_id": "cell-10264483046abcc4",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "## Objectives\n",
    "\n",
    "* To cover some SPARQL concepts that are less frequently used "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
     "grade": false,
     "grade_id": "cell-4f8492996e74bf20",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "## Tools\n",
    "\n",
    "See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "c5f8646518bd832a47d71f9d3218237a",
     "grade": false,
     "grade_id": "cell-eb13908482825e42",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "Run this line to enable the `%%sparql` magic command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from helpers import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Working with dates"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To explore dates, we will focus on our Writers example.\n",
    "\n",
    "First, search for writers born in the XX century.\n",
    "You can use a special filter, knowing that `\"2000\"^^xsd:date` is the first date of year 2000."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "69e23e6e3dc06ca9d2b5d878c2baba94",
     "grade": false,
     "grade_id": "cell-ab7755944d46f9ca",
     "locked": false,
     "schema_version": 3,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dct:<http://purl.org/dc/terms/>\n",
    "PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
    "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
    "\n",
    "SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
    "\n",
    "WHERE {\n",
    "    ?escritor dct:subject dbc:Spanish_novelists .\n",
    "    ?escritor rdfs:label ?nombre .\n",
    "    ?escritor dbo:birthDate ?fechaNac .\n",
    "    FILTER(lang(?nombre) = \"es\") .\n",
    "    # YOUR ANSWER HERE\n",
    "}\n",
    "# YOUR ANSWER HERE\n",
    "LIMIT 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "211c632634327a1fd805326fa0520cdd",
     "grade": true,
     "grade_id": "cell-cf3821f2d33fb0f6",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'Camilo José Cela' in solution()['columns']['nombre']\n",
    "assert 'Javier Marías' in solution()['columns']['nombre']\n",
    "assert all(int(x) > 1899 and int(x) < 2001 for x in solution()['columns']['nac'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, get the list of Spanish novelists that are still alive.\n",
    "\n",
    "A person is alive if their death date is not defined and the were born less than 100 years ago.\n",
    "\n",
    "Remember, we can check whether the optional value for a key was bound in a SPARQL query using `BOUND(?key)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "e4579d551790c33ba4662562c6a05d99",
     "grade": false,
     "grade_id": "cell-474b1a72dec6827c",
     "locked": false,
     "schema_version": 3,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dct:<http://purl.org/dc/terms/>\n",
    "PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
    "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
    "\n",
    "SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
    "\n",
    "WHERE {\n",
    "    ?escritor dct:subject dbc:Spanish_novelists .\n",
    "    ?escritor rdfs:label ?nombre .\n",
    "    ?escritor dbo:birthDate ?fechaNac .\n",
    "# YOUR ANSWER HERE\n",
    "    FILTER(lang(?nombre) = \"es\") .\n",
    "}\n",
    "# YOUR ANSWER HERE\n",
    "LIMIT 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "770bbddef5210c28486a1929e4513ada",
     "grade": true,
     "grade_id": "cell-46b62dd2856bc919",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'Fernando Arrabal' in solution()['columns']['nombre']\n",
    "assert 'Albert Espinosa' in solution()['columns']['nombre']\n",
    "for year in solution()['columns']['nac']:\n",
    "    assert int(year) >= 1918"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Working with badly formatted dates (OPTIONAL!)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, get the list of Spanish novelists that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
    "\n",
    "For the sake of simplicity, you can use the `year(<date>)` function.\n",
    "\n",
    "Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
    "\n",
    "Hint 2: Some dates are not formatted properly, which makes some queries fail when they shouldn't. As a workaround, you could convert the date to string, and back to date again: `xsd:dateTime(str(?date))`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "2a24f623c23116fd23877facb487dd16",
     "grade": false,
     "grade_id": "cell-ceefd3c8fbd39d79",
     "locked": false,
     "schema_version": 3,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dct:<http://purl.org/dc/terms/>\n",
    "PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
    "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
    "\n",
    "SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac, ?fechaDef\n",
    "\n",
    "WHERE {\n",
    "    ?escritor dct:subject dbc:Spanish_novelists .\n",
    "    ?escritor rdfs:label ?nombre .\n",
    "    ?escritor dbo:birthDate ?fechaNac .\n",
    "    # YOUR ANSWER HERE\n",
    "}\n",
    "# YOUR ANSWER HERE\n",
    "LIMIT 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "18bb2d8d586bf4a5231973e69958ab75",
     "grade": true,
     "grade_id": "cell-461cd6ccc6c2dc79",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "assert 'Javier Sierra' in solution()['columns']['nombre']\n",
    "assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Regular expressions](https://www.w3.org/TR/rdf-sparql-query/#funcex-regex) are a very powerful tool, but we will only cover the basics in this exercise.\n",
    "\n",
    "In essence, regular expressions match strings against patterns.\n",
    "In their simplest form, they can be used to find substrings within a variable.\n",
    "For instance, using `regex(?label, \"substring\")` would only match if and only if the `?label` variable contains `substring`.\n",
    "But regular expressions can be more complex than that.\n",
    "For instance, we can find patterns such as: a 10 digit number, a 5 character long string, or variables without whitespaces.\n",
    "\n",
    "The syntax of the regex function is the following:\n",
    "\n",
    "```\n",
    "regex(?variable, \"pattern\", \"flags\")\n",
    "```\n",
    "\n",
    "Flags are optional configuration options for the regular expression, such as *do not care about case* (`i` flag).\n",
    "\n",
    "As an example, let us find the cities in Madrid that contain \"de\" in their name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "SELECT ?localidad\n",
    "WHERE {\n",
    "    ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid> .\n",
    "    ?localidad rdfs:label ?nombre .\n",
    "    FILTER (lang(?nombre) = \"es\" ).\n",
    "    FILTER regex(?nombre, \"de\", \"i\")\n",
    "}\n",
    "LIMIT 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, use regular expressions to find Spanish novelists whose **first name** is Juan.\n",
    "In other words, their name **starts with** \"Juan\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "6e444c20b411033a6c45fd5a566018fa",
     "grade": false,
     "grade_id": "cell-a57d3546a812f689",
     "locked": false,
     "schema_version": 3,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dct:<http://purl.org/dc/terms/>\n",
    "PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
    "PREFIX dbr:<http://dbpedia.org/resource/>\n",
    "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
    "\n",
    "# YOUR ANSWER HERE\n",
    "\n",
    "WHERE {\n",
    "    {\n",
    "        ?escritor dct:subject dbc:Spanish_poets .\n",
    "    }\n",
    "    UNION {\n",
    "        ?escritor dct:subject dbc:Spanish_novelists .\n",
    "    }\n",
    "    ?escritor rdfs:label ?nombre\n",
    "    FILTER(lang(?nombre) = \"es\") .\n",
    "# YOUR ANSWER HERE\n",
    "}\n",
    "ORDER BY ?nombre\n",
    "LIMIT 1000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "66db9abddfafa91c2dc25577457f71fb",
     "grade": true,
     "grade_id": "cell-c149fe65008f39a9",
     "locked": true,
     "points": 0,
     "schema_version": 3,
     "solution": false
    }
   },
   "outputs": [],
   "source": [
    "assert len(solution()['columns']['nombre']) > 15\n",
    "for i in solution()['columns']['nombre']:\n",
    "    assert 'Juan' in i\n",
    "assert \"Robert Juan-Cantavella\" not in solution()['columns']['nombre']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "1be6d6e4d8e74240ef07deffcbe5e71a",
     "grade": false,
     "grade_id": "cell-0c2f0113d97dc9de",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "## Group concat"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "c8dbb73a781bd24080804f289a1cea0b",
     "grade": false,
     "grade_id": "asdasdasdddddddddddasdasdsad",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "Sometimes, it is useful to aggregate results from form different rows.\n",
    "For instance, we might want to get a comma-separated list of the names in each each autonomous community in Spain.\n",
    "\n",
    "In those cases, we can use the `GROUP_CONCAT` function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dbo: <http://dbpedia.org/ontology/>\n",
    "PREFIX dbr: <http://dbpedia.org/resource/>\n",
    "        \n",
    "SELECT ?com, GROUP_CONCAT(?name, \",\") as ?places  # notice how we rename the variable\n",
    "\n",
    "WHERE {\n",
    "    ?localidad dbo:isPartOf ?com .\n",
    "    ?com dbo:type dbr:Autonomous_communities_of_Spain .\n",
    "    ?localidad rdfs:label ?name .\n",
    "    FILTER (lang(?name)=\"es\")\n",
    "}\n",
    "\n",
    "ORDER BY ?com\n",
    "LIMIT 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "nbgrader": {
     "cell_type": "markdown",
     "checksum": "e100e2f89c832cf832add62c107e4008",
     "grade": false,
     "grade_id": "asdiopjasdoijasdoijasd",
     "locked": true,
     "schema_version": 3,
     "solution": false
    }
   },
   "source": [
    "Try it yourself, to get a list of works by each of these authors:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "nbgrader": {
     "cell_type": "code",
     "checksum": "9f6e26faab2be98c72fb7a917ac5a421",
     "grade": false,
     "grade_id": "cell-2e3de17c75047652",
     "locked": false,
     "schema_version": 3,
     "solution": true
    }
   },
   "outputs": [],
   "source": [
    "%%sparql\n",
    "\n",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
    "PREFIX dct:<http://purl.org/dc/terms/>\n",
    "PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
    "PREFIX dbr:<http://dbpedia.org/resource/>\n",
    "PREFIX dbo:<http://dbpedia.org/ontology/>\n",
    "\n",
    "# YOUR ANSWER HERE\n",
    "\n",
    "WHERE {\n",
    "    ?escritor dct:subject dbc:Spanish_novelists .\n",
    "    ?escritor rdfs:label ?nombre .\n",
    "    ?escritor dbo:birthDate ?fechaNac .\n",
    "    ?escritor dbo:birthPlace dbr:Madrid .\n",
    "    OPTIONAL {\n",
    "        ?obra dbo:author ?escritor .\n",
    "        ?obra rdfs:label ?titulo .\n",
    "    }\n",
    "    OPTIONAL {\n",
    "        ?escritor dbo:deathDate ?fechaDef .\n",
    "    }\n",
    "    FILTER (?fechaNac <= \"2000\"^^xsd:date).\n",
    "    FILTER (?fechaNac >= \"1918\"^^xsd:date).\n",
    "    FILTER (!bound(?fechaDef) || (?fechaNac >= \"1918\"^^xsd:date)) .\n",
    "    FILTER(lang(?nombre) = \"es\") .\n",
    "    FILTER(!bound(?titulo) || lang(?titulo) = \"en\") .\n",
    "\n",
    "}\n",
    "ORDER BY ?nombre\n",
    "LIMIT 10000"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Licence\n",
    "The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  \n",
    "\n",
    "© 2018 Universidad Politécnica de Madrid."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}