mirror of
https://github.com/gsi-upm/sitc
synced 2024-11-14 02:32:27 +00:00
1881 lines
50 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
|
|
"grade": false,
|
|
"grade_id": "cell-0bfe38f97f6ab2d2",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"<header style=\"width:100%;position:relative\">\n",
|
|
" <div style=\"width:80%;float:right;\">\n",
|
|
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
|
|
" <h3>Department of Telematic Engineering Systems</h3>\n",
|
|
" <h5>Universidad Politécnica de Madrid</h5>\n",
|
|
" </div>\n",
|
|
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
|
|
"</header>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "a273399fb0e4a7752cea07a36562def1",
|
|
"grade": false,
|
|
"grade_id": "cell-0cd673883ee592d1",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Introduction to Linked Data\n",
|
|
"\n",
|
|
"This lecture provides a quick introduction to semantic queries in Python.\n",
|
|
"We will be using DBpedia, a semantic version of Wikipedia.\n",
|
|
"\n",
|
|
"The language we will use to query DBpedia is SPARQL, a semantic query language inspired by SQL.\n",
|
|
"For convenience, the examples in the notebook are executable, and they are accompanied by some code to test the results.\n",
|
|
"If the tests pass, you probably got the answer right.\n",
|
|
"\n",
|
|
"However, you can also use any other method to write and send your queries.\n",
|
|
"You may find online query editors particularly useful.\n",
|
|
"In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.\n",
|
|
"Some examples are:\n",
|
|
"\n",
|
|
"* DBpedia's virtuoso query editor https://dbpedia.org/sparql\n",
|
|
"* A javascript based client hosted at GSI: http://yasgui.cluster.gsi.dit.upm.es/"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "255c4bd678939b4448860dc5e0afdae6",
|
|
"grade": false,
|
|
"grade_id": "cell-10264483046abcc4",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Objectives\n",
|
|
"\n",
|
|
"* Learning SPARQL and the Linked Data principles by defining queries to answer a set of problems of increasing difficulty\n",
|
|
"* Verifying the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
|
|
"* Learning how to use integrated SPARQL editors and programming interfaces to SPARQL."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "f04dd27e103bacc5166763900527901e",
|
|
"grade": false,
|
|
"grade_id": "cell-4f8492996e74bf20",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Tools\n",
|
|
"\n",
|
|
"* This notebook\n",
|
|
"* SPARQL editors (optional)\n",
|
|
" * YASGUI-GSI http://yasgui.cluster.gsi.dit.upm.es\n",
|
|
" * DBpedia virtuoso http://dbpedia.org/sparql\n",
|
|
"\n",
|
|
"Using the YASGUI-GSI editor has several advantages over other options.\n",
|
|
"It features:\n",
|
|
"\n",
|
|
"* Selection of data source, either by specifying the URL or by selecting from a dropdown menu\n",
|
|
"* Interactive query editing\n",
|
|
" * A set of pre-defined queries\n",
|
|
" * Syntax errors\n",
|
|
" * Auto-complete\n",
|
|
"* Data visualization\n",
|
|
" * Total number of results\n",
|
|
" * Different formats (table, pivot table, raw response, etc.)\n",
|
|
" * Pagination of results\n",
|
|
" * Search and filter results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "c5f8646518bd832a47d71f9d3218237a",
|
|
"grade": false,
|
|
"grade_id": "cell-eb13908482825e42",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"Run this line to enable the `%%sparql` magic command."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from helpers import *"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The `%%sparql` magic command will allow us to use SPARQL inside normal jupyter cells.\n",
|
|
"\n",
|
|
"For instance, the following code:\n",
|
|
"\n",
|
|
"```\n",
|
|
"%%sparql\n",
|
|
"\n",
|
|
"MY QUERY\n",
|
|
"``` \n",
|
|
"\n",
|
|
"Is the same as `run_query('MY QUERY', endpoint='http://dbpedia.org/sparql')` plus some additional steps, such as saving the results in a nice table format so that they can be used later and storing the results in a variable (`LAST_QUERY`), which we will use in our tests.\n",
|
|
"\n",
|
|
"You do not need to worry about it, and **you can always use one of the suggested online editors if you wish**."
|
|
]
|
|
},
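The `helpers` module is not shown in this notebook, so the following is only a minimal sketch of what a function like `run_query` might do. It assumes the endpoint follows the standard SPARQL protocol (the query is sent in a `query` parameter over HTTP, and JSON results are requested through the `Accept` header). The names `build_request` and `run_query_sketch` are hypothetical and are not part of `helpers`.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

DBPEDIA = 'http://dbpedia.org/sparql'


def build_request(query, endpoint=DBPEDIA):
    """Build (but do not send) an HTTP GET request carrying a SPARQL query."""
    url = endpoint + '?' + urlencode({'query': query})
    # Ask the endpoint for the standard SPARQL JSON results format.
    return Request(url, headers={'Accept': 'application/sparql-results+json'})


def run_query_sketch(query, endpoint=DBPEDIA):
    """Send the query and return the decoded JSON response (needs network access)."""
    with urlopen(build_request(query, endpoint)) as response:
        return json.loads(response.read().decode('utf-8'))
```

Calling `run_query_sketch('SELECT * WHERE { ?s ?p ?o } LIMIT 1')` would contact DBpedia, so it only works with network access; `build_request` can be inspected offline.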
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exercises\n",
|
|
"\n",
|
|
"The following exercises cover the basics of SPARQL with simple use cases.\n",
|
|
"We will provide you some example code to get you started, the *question* you will have to answer using SPARQL, and the skeleton for the answer.\n",
|
|
"\n",
|
|
"After every query, you will find some python code to test the results of the query.\n",
|
|
"Make sure you've run the tests before moving to the next exercise.\n",
|
|
"If the test gives you an error, you've probably done something wrong.\n",
|
|
"You **do not need to understand or modify the test code**.\n",
|
|
"\n",
|
|
"\n",
|
|
"In case you're interested, the tests rely on the `LAST_QUERY` variable, which is updated by the `%%sparql` magic after every query.\n",
|
|
"This variable contains the full query used (`LAST_QUERY[\"query\"]`), the endpoint it was sent to (`LAST_QUERY[\"endpoint\"]`), and a dictionary with the response of the endpoint (`LAST_QUERY[\"results\"]`).\n",
|
|
"For convenience, the results are also given as tuples (`LAST_QUERY[\"tuples\"]`), and as a dictionary of of `{column:[values]}` (`LAST_QUERY[\"columns\"]`)."
|
|
]
|
|
},
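To make the shape of `LAST_QUERY["tuples"]` and `LAST_QUERY["columns"]` more concrete, here is a sketch of how a response in the standard SPARQL JSON results format could be flattened into both structures. The function name `flatten_results` and the sample response are made up for illustration; the actual code in `helpers` may differ. Unbound variables are mapped to empty strings, which is what the tests later in this notebook appear to expect.

```python
def flatten_results(response):
    """Turn a SPARQL JSON response into (tuples, columns)."""
    variables = response['head']['vars']
    tuples = []
    columns = {var: [] for var in variables}
    for binding in response['results']['bindings']:
        # A variable missing from a binding is unbound; use '' for it.
        row = tuple(binding.get(var, {}).get('value', '') for var in variables)
        tuples.append(row)
        for var, value in zip(variables, row):
            columns[var].append(value)
    return tuples, columns


# Hand-written sample response (not real DBpedia output).
sample = {
    'head': {'vars': ['localidad', 'pop']},
    'results': {'bindings': [
        {'localidad': {'type': 'uri', 'value': 'http://example.org/A'},
         'pop': {'type': 'literal', 'value': '1000'}},
        {'localidad': {'type': 'uri', 'value': 'http://example.org/B'}},
    ]},
}

tuples, columns = flatten_results(sample)
print(tuples)
print(columns)
```

Note that all values come back as strings, which is why the tests compare against `'75912'` rather than the number `75912`.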
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"#### First Select\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let's start with a simple query. We will get a list of cities and towns in Madrid.\n",
|
|
"If we take a look at the DBpedia ontology or the page of any town we already know, we discover that the property that links towns to their community is [`isPartOf`](http://dbpedia.org/ontology/isPartOf), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
|
|
"\n",
|
|
"Since there are potentially many cities to get, we will limit our results to the first 10 results:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"SELECT ?localidad\n",
|
|
"WHERE {\n",
|
|
" ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid>\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"However, that query is very verbose because we are using full URIs.\n",
|
|
"To simplify it, we will make use of SPARQL prefixes:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
|
|
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
|
|
" \n",
|
|
"SELECT ?localidad\n",
|
|
"WHERE {\n",
|
|
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"To make sure that the query returned something sensible, we can test it with some python code:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'localidad' in LAST_QUERY['columns']\n",
|
|
"assert len(LAST_QUERY['tuples']) == 10"
|
|
]
|
|
},
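The prefixes introduced above are only a shorthand: a prefixed name such as `dbo:isPartOf` stands for the full URI obtained by appending the local name to the prefix URI. A minimal sketch of that expansion in Python, where the `PREFIXES` dictionary and the `expand` function are hypothetical helpers written for this illustration:

```python
PREFIXES = {
    'dbo': 'http://dbpedia.org/ontology/',
    'dbr': 'http://dbpedia.org/resource/',
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'dbo:isPartOf' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(':')
    return prefixes[prefix] + local_name


print(expand('dbo:isPartOf'))             # http://dbpedia.org/ontology/isPartOf
print(expand('dbr:Community_of_Madrid'))  # http://dbpedia.org/resource/Community_of_Madrid
```

Both queries in this section therefore ask the endpoint exactly the same question.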
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now that you have some experience under your belt, it is time to design your own query.\n",
|
|
"\n",
|
|
"Your first task it to get a list of Spanish Novelits, using the skeleton below and the previous query to guide you.\n",
|
|
"\n",
|
|
"Pages for Spanish novelists are grouped in the *Spanish novelists* DBpedia category. You can use that fact to get your list.\n",
|
|
"In other words, the difference from the previous query will be using `dct:subject` instead of `dbo:isPartOf`, and `dbc:Spanish_novelists` instead of `dbr:Community_of_Madrid`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "d73b49b84482f51dc199b0e22763e9cc",
|
|
"grade": false,
|
|
"grade_id": "cell-7a9509ff3c34127e",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"\n",
|
|
"SELECT ?escritor\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "a5aafd75ac7fa036fe5dafc4ed30c535",
|
|
"grade": true,
|
|
"grade_id": "cell-91240ded2cac7b6d",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert len(LAST_QUERY['columns']) == 1 # We only use one variable, ?escritor\n",
|
|
"assert len(LAST_QUERY['tuples']) == 10 # There should be 10 results"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Using more criteria\n",
|
|
"\n",
|
|
"We can get more than one property in the same query. Let us modify our query to get the population of the cities as well."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
|
|
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
|
|
" \n",
|
|
"SELECT ?localidad ?pop ?when\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
" ?localidad dbo:populationTotal ?pop .\n",
|
|
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
|
|
" ?localidad dbp:populationAsOf ?when .\n",
|
|
"}\n",
|
|
"\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'localidad' in LAST_QUERY['columns']\n",
|
|
"assert 'http://dbpedia.org/resource/Parla' in LAST_QUERY['columns']['localidad']\n",
|
|
"assert ('http://dbpedia.org/resource/San_Sebastián_de_los_Reyes', '75912', '2009') in LAST_QUERY['tuples']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Time to try it yourself.\n",
|
|
"\n",
|
|
"Get the list of Spanish novelists AND their name (using rdfs:label)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "7cbf5260bbc6121b4ec1ec0f62e814c1",
|
|
"grade": false,
|
|
"grade_id": "cell-83dcaae0d09657b5",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"\n",
|
|
"SELECT ?escritor ?name\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "5c7bee95c0c08a8ede47fcaad597f51f",
|
|
"grade": true,
|
|
"grade_id": "cell-8afd28aada7a896c",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'escritor' in LAST_QUERY['columns']\n",
|
|
"assert 'http://dbpedia.org/resource/Eduardo_Mendoza_Garriga' in LAST_QUERY['columns']['escritor']\n",
|
|
"assert ('http://dbpedia.org/resource/Eduardo_Mendoza_Garriga', 'Eduardo Mendoza') in LAST_QUERY['tuples']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Filtering and ordering"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In the previous example, we saw that we got what seemed to be duplicated answers.\n",
|
|
"\n",
|
|
"This happens because entities can have labels in different languages (e.g. English, Spanish).\n",
|
|
"To restrict the search to only those results we're interested in, we can use filtering.\n",
|
|
"\n",
|
|
"We can also decide the order in which our results are shown.\n",
|
|
"\n",
|
|
"For instance, this is how we could use filtering to get only large cities in our example, ordered by population:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
|
|
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
|
|
" \n",
|
|
"SELECT ?localidad ?pop ?when\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
" ?localidad dbo:populationTotal ?pop .\n",
|
|
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
|
|
" ?localidad dbp:populationAsOf ?when .\n",
|
|
" FILTER(?pop > 100000)\n",
|
|
"}\n",
|
|
"ORDER BY ?pop\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Note that ordering happens before limits."
|
|
]
|
|
},
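The difference between ordering before or after limiting can be illustrated in plain Python. The data below is made up, and the list operations are only an analogy for what the endpoint does with `ORDER BY` and `LIMIT`:

```python
from operator import itemgetter

# Made-up (name, population) rows, in the order the store happens to return them.
rows = [('A', 300), ('B', 100), ('C', 200), ('D', 50)]

# ORDER BY, then LIMIT 2: the two smallest populations overall.
order_then_limit = sorted(rows, key=itemgetter(1))[:2]

# LIMIT 2, then ORDER BY: only the first two rows are ever considered.
limit_then_order = sorted(rows[:2], key=itemgetter(1))

print(order_then_limit)   # [('D', 50), ('B', 100)]
print(limit_then_order)   # [('B', 100), ('A', 300)]
```

SPARQL behaves like the first case, so a sorted query with `LIMIT 10` returns the first ten results of the full ordering.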
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "c6080c3ed1dd3e9c3a224ac74e9dedc6",
|
|
"grade": true,
|
|
"grade_id": "cell-cb7b8283568cd349",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# We still have the biggest city\n",
|
|
"assert ('http://dbpedia.org/resource/Madrid', '3141991', '2014') in LAST_QUERY['tuples']\n",
|
|
"# But the smaller ones are gone\n",
|
|
"assert 'http://dbpedia.org/resource/Tres_Cantos' not in LAST_QUERY['columns']['localidad']\n",
|
|
"assert 'http://dbpedia.org/resource/San_Sebastián_de_los_Reyes' not in LAST_QUERY['columns']['localidad']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now, try filtering to get a list of novelists and their name in Spanish, ordered by name `(FILTER (LANG(?nombre) = \"es\") y ORDER BY`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "8b1697739ecd76d45b6597a28429f13d",
|
|
"grade": false,
|
|
"grade_id": "cell-ff3d611cb0304b01",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"\n",
|
|
"SELECT ?escritor, ?nombre\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 1000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "2300be1911eb9cfddc6e2a82dcb244c2",
|
|
"grade": true,
|
|
"grade_id": "cell-d70cc6ea394741bc",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert len(LAST_QUERY['tuples']) >= 50\n",
|
|
"assert 'Adelaida García Morales' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert sum(1 for k in LAST_QUERY['columns']['escritor'] if k == 'http://dbpedia.org/resource/Adelaida_García_Morales') == 1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Dates"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"From now on, we will focus on our Writers example.\n",
|
|
"\n",
|
|
"First, search for writers born in the XX century.\n",
|
|
"You can use a special filter, knowing that `\"2000\"^^xsd:date` is the first date of year 2000."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "1764314669c1e3ad131a0930fa33549c",
|
|
"grade": false,
|
|
"grade_id": "cell-ab7755944d46f9ca",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 1000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "7a3c047b64ce4ffd02c87878f73f212a",
|
|
"grade": true,
|
|
"grade_id": "cell-cf3821f2d33fb0f6",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Camilo José Cela' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert 'Javier Marías' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert all(int(x) > 1899 and int(x) < 2001 for x in LAST_QUERY['columns']['nac'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Optional\n",
|
|
"\n",
|
|
"In our last example, we were missing all the novelists that are missing their birth information in DBpedia.\n",
|
|
"\n",
|
|
"We can specify optional values in a query using the `OPTIONAL` keyword.\n",
|
|
"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query\n",
|
|
"If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
|
|
"\n",
|
|
"Using that, let us retrieve all the novelists born between 1900 and 2000, and the date they died (if they are available)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "429b902d4da0f40aefebba0ab722645e",
|
|
"grade": false,
|
|
"grade_id": "cell-254a18dd973e82ed",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT ?escritor, ?nombre, ?fechaNac, ?fechaDef\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "a6db5e4879286b0617be04711002ad63",
|
|
"grade": true,
|
|
"grade_id": "cell-4d6a64dde67f0e11",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Camilo José Cela' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert '1916-05-11' in LAST_QUERY['columns']['fechaNac']\n",
|
|
"assert '' not in LAST_QUERY['columns']['fechaNac'] # All birthdates are defined\n",
|
|
"assert '' in LAST_QUERY['columns']['fechaDef'] # Some deathdates are not defined"
|
|
]
|
|
},
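As an analogy for the `OPTIONAL` behaviour described in this section, the sketch below joins two made-up dictionaries in Python. With the optional lookup, every writer is kept and a missing death date becomes an empty value; when the death date is required, writers without one are dropped. The data and variable names are invented for this illustration.

```python
# Made-up data: every writer has a birth date, only some have a death date.
births = {'writer1': '1920-01-01', 'writer2': '1960-01-01'}
deaths = {'writer1': '1990-01-01'}

# Like OPTIONAL: keep the row even when there is no matching death date.
with_optional = [(w, born, deaths.get(w, '')) for w, born in births.items()]

# Like a required pattern: the row disappears if the death date is missing.
without_optional = [(w, born, deaths[w]) for w, born in births.items() if w in deaths]

print(with_optional)
print(without_optional)
```

This is why the test above expects some empty values in the `fechaDef` column but none in `fechaNac`.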
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Bound"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can check whether the optional value for a key was bound in a SPARQL query using `BOUND(?key)`.\n",
|
|
"\n",
|
|
"This is very useful for two purposes.\n",
|
|
"First, it allows us to look for patterns that **do not occur** in the graph, such as missing properties.\n",
|
|
"For instance, we could search for the authors with missing birth information so we can add it.\n",
|
|
"Secondly, we can use bound in filters to get conditional filters.\n",
|
|
"We will explore both uses in this exercise."
|
|
]
|
|
},
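Both uses of `BOUND` can be mimicked in Python with made-up rows, where a missing dictionary key plays the role of an unbound variable. The `bound` helper below is hypothetical and only mirrors the idea; it is not a translation of any particular SPARQL query.

```python
# Made-up rows; a missing 'died' key plays the role of an unbound variable.
rows = [
    {'name': 'writer1', 'born': 1920, 'died': 1990},
    {'name': 'writer2', 'born': 1960},
    {'name': 'writer3', 'born': 1975, 'died': 2010},
]


def bound(row, key):
    """Rough Python counterpart of BOUND(?key)."""
    return key in row


# Use 1: find rows where a value is missing.
missing_death = [r['name'] for r in rows if not bound(r, 'died')]

# Use 2: conditional filter, applied only when the value is present.
unbound_or_after_2000 = [
    r['name'] for r in rows
    if not bound(r, 'died') or r['died'] > 2000
]

print(missing_death)           # ['writer2']
print(unbound_or_after_2000)   # ['writer2', 'writer3']
```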
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Get the list of Spanish novelists that are still alive.\n",
|
|
"A person is alive if their death date is not defined and the were born less than 100 years ago"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "555154c87d8722bfeacd0e5cf5abc1a7",
|
|
"grade": false,
|
|
"grade_id": "cell-474b1a72dec6827c",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 1000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "fd420f3d8b7eca269eaba715b3999893",
|
|
"grade": true,
|
|
"grade_id": "cell-46b62dd2856bc919",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Fernando Arrabal' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert 'Albert Espinosa' in LAST_QUERY['columns']['nombre']\n",
|
|
"for year in LAST_QUERY['columns']['nac']:\n",
|
|
" assert int(year) >= 1918"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now, get the list of Spanish novelists that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
|
|
"\n",
|
|
"Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
|
|
"\n",
|
|
"Hint 2: Some dates are not formatted properly, which makes some queries fail when they shouldn't. You might need to convert between different types as a workaround. For instance, you could get the year from a date like this: `year(xsd:dateTime(str(?date)))`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "22505aa8eab7f771bf30ed12fe13f80c",
|
|
"grade": false,
|
|
"grade_id": "cell-ceefd3c8fbd39d79",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac, ?fechaDef\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "f11cf03b1c9ae7dbdaac314579b6c4bf",
|
|
"grade": true,
|
|
"grade_id": "cell-461cd6ccc6c2dc79",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Javier Sierra' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert 'http://dbpedia.org/resource/Sanmao_(author)' in LAST_QUERY['columns']['escritor']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Finding unique elements"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In our last example, our results show some authors more than once.\n",
|
|
"This is because some properties are defined more than once.\n",
|
|
"For instance, birth date is giving using different formats.\n",
|
|
"Even if we exclude that property from our results by not adding it in our `SELECT`, we will get duplicated lines.\n",
|
|
"\n",
|
|
"To solve this, we can use the `DISTINCT` keyword."
|
|
]
|
|
},
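The following sketch uses made-up rows to show why leaving a variable out of the `SELECT` does not remove duplicates on its own, and what `DISTINCT` adds. It is a Python analogy, not the endpoint's actual implementation.

```python
# Made-up matches: author1 appears twice because one property
# (here the birth date) has two values in the graph.
matches = [
    ('author1', 'Name One', '1970-01-01'),
    ('author1', 'Name One', '1970-1-1'),
    ('author2', 'Name Two', '1980-05-05'),
]

# Selecting only two variables still yields one row per match.
projected = [(author, name) for author, name, _ in matches]

# DISTINCT removes repeated rows from the projected results.
distinct = list(dict.fromkeys(projected))

print(projected)
print(distinct)   # [('author1', 'Name One'), ('author2', 'Name Two')]
```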
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Modify your last query to remove duplicated lines.\n",
|
|
"In other words, authors should only appear once."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "1380346cba93b5641132ba21f102e116",
|
|
"grade": false,
|
|
"grade_id": "cell-2a39adc71d26ae73",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT DISTINCT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "c8e5bf05e9d050389b2f8e7f142fdab0",
|
|
"grade": true,
|
|
"grade_id": "cell-542e0e36347fd5d1",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Javier Sierra' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert 'http://dbpedia.org/resource/Albert_Espinosa' in LAST_QUERY['columns']['escritor']\n",
|
|
"\n",
|
|
"from collections import Counter\n",
|
|
"c = Counter(LAST_QUERY['columns']['nombre'])\n",
|
|
"for count in c.values():\n",
|
|
" assert count == 1\n",
|
|
" \n",
|
|
"c1 = Counter(LAST_QUERY['columns']['escritor'])\n",
|
|
"assert all(count==1 for count in c1.values())\n",
|
|
"# c = Counter(LAST_QUERY['columns']['nombre'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Using other resources"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Get the list of living Spanish novelists born in Madrid.\n",
|
|
"\n",
|
|
"Hint: use `dbr:Madrid` and `dbo:birthPlace`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "32e2c9b0ce32483960f5ca794da54fa8",
|
|
"grade": false,
|
|
"grade_id": "cell-d175e41da57c889b",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT DISTINCT ?escritor, ?nombre, ?lugarNac, year(?fechaNac) as ?nac\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "db2cdda5575af942f110d85e2dbe02b5",
|
|
"grade": true,
|
|
"grade_id": "cell-fadd095862db6bc8",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'José Ángel Mañas' in LAST_QUERY['columns']['nombre']\n",
|
|
"assert 'http://dbpedia.org/resource/Madrid' in LAST_QUERY['columns']['lugarNac']\n",
|
|
"MADRID_QUERY = LAST_QUERY['columns'].copy()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Traversing the graph"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Get the list of works of the authors in the previous query (i.e. authors born in Madrid), if they have any.\n",
|
|
"\n",
|
|
"Hint: use `dbo:author`, which is a **property of a literary work** that points to the author."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "abd3d09bdf5801d6d0b27d80326dfead",
|
|
"grade": false,
|
|
"grade_id": "cell-e4b99af9ef91ff6f",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT DISTINCT ?escritor, ?nombre, ?obra\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 10000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "d1305aa44456d51e3c52d78a9381f73a",
|
|
"grade": true,
|
|
"grade_id": "cell-68661b73c2140e4f",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'http://dbpedia.org/resource/A_Heart_So_White' in LAST_QUERY['columns']['obra']\n",
|
|
"assert 'http://dbpedia.org/resource/Tomorrow_in_the_Battle_Think_on_Me' in LAST_QUERY['columns']['obra']\n",
|
|
"assert '' in LAST_QUERY['columns']['obra'] # Some authors don't have works in dbpedia"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can also get a list of the works in string format using GROUP_CONCAT.\n",
|
|
"For instance, `GROUP_CONCAT(?obra, \",\")`, to separate works with a comma.\n",
|
|
"\n",
|
|
"Try it yourself:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "f0ab8a246687b926fb919abbafaf3b53",
|
|
"grade": false,
|
|
"grade_id": "cell-e13fae23ccb78bb8",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"# YOUR CODE HERE\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 10000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Traversing the graph"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Get a list of living Spanish novelists born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n",
|
|
"\n",
|
|
"If the query is right, you should see a list of writers after running the test code.\n",
|
|
"\n",
|
|
"Hint: `foaf:depiction` and `foaf: homepage`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "4ffc5d79f79c2079e93843838e91e053",
|
|
"grade": false,
|
|
"grade_id": "cell-b1f71c67dd71dad4",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT ?escritor ?web ?foto\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"ORDER BY ?nombre\n",
|
|
"LIMIT 100"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "bc497e6eaebe05e31248e3479df43c0c",
|
|
"grade": true,
|
|
"grade_id": "cell-8b8ba7cca701c652",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"fotos = set(filter(lambda x: x != '', LAST_QUERY['columns']['foto']))\n",
|
|
"assert len(fotos) > 2\n",
|
|
"show_photos(fotos) #show the pictures of the writers!"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Union"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We can merge the results of several queries, just like using `JOIN` in SQL.\n",
|
|
"The keyword in SPARQL is `UNION`, because we are merging graphs.\n",
|
|
"\n",
|
|
"`UNION` is useful in many situations.\n",
|
|
"For instance, when there are equivalent properties, or when you want to use two search terms and FILTER would be too inefficient.\n",
|
|
"\n",
|
|
"The syntax is as follows:\n",
|
|
"\n",
|
|
"```sparql\n",
|
|
"SELECT ?title\n",
|
|
"WHERE {\n",
|
|
" { ?book dc10:title ?title }\n",
|
|
" UNION\n",
|
|
" { ?book dc11:title ?title }\n",
|
|
" \n",
|
|
" ... REST OF YOUR QUERY ...\n",
|
|
"\n",
|
|
"}\n",
|
|
"```\n",
|
|
"\n"
|
|
]
|
|
},
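{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `UNION` syntax above can be sketched with a concrete query. The example below is only an illustration: it assumes the `dbo:birthPlace` property and two arbitrarily chosen cities, and it returns people born in either of them:\n",
"\n",
"```sparql\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"\n",
"# DISTINCT removes people that match both branches\n",
"SELECT DISTINCT ?person\n",
"WHERE {\n",
"  { ?person dbo:birthPlace dbr:Madrid }\n",
"  UNION\n",
"  { ?person dbo:birthPlace dbr:Barcelona }\n",
"}\n",
"LIMIT 10\n",
"```\n",
"\n",
"Each branch is a complete graph pattern in its own braces, and both branches bind the same variable so that the results line up in one column."
]
},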
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Using UNION, get a list of distinct spanish novelists AND poets.\n",
|
|
"\n",
|
|
"Hint: Category: Spanish_poets"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "5606810420d8cd259da74a3cc17fa824",
|
|
"grade": false,
|
|
"grade_id": "cell-21eb6323b6d0011d",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"SELECT DISTINCT ?escritor, ?nombre\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 10000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "eec248e71a855a5e713d31ae470f3fd4",
|
|
"grade": true,
|
|
"grade_id": "cell-004e021e877c6ace",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert 'Garcilaso de la Vega' in LAST_QUERY['columns']['nombre']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You can also get the count of results either by inspecting the result (we will not cover this) or by aggregating the results using the `COUNT` operation.\n",
|
|
"\n",
|
|
"The syntax is:\n",
|
|
" \n",
|
|
"```sparql\n",
|
|
"SELECT COUNT(?variable) as ?count_name\n",
|
|
"```\n",
|
|
"\n",
|
|
"Try it yourself with our previous example:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "2452c6213ad156deb5adbcfaeef74b8b",
|
|
"grade": false,
|
|
"grade_id": "cell-e35414e191c5bf16",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"# YOUR CODE HERE\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"LIMIT 10000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "f8b76d57ce959522a3914a442835393a",
|
|
"grade": true,
|
|
"grade_id": "cell-7a7ef8255a5662e2",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert len(LAST_QUERY['columns']) == 1\n",
|
|
"column_name = list(LAST_QUERY['columns'].keys())[0]\n",
|
|
"assert int(LAST_QUERY['columns'][column_name][0]) > 200"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Regular expressions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The last SPARQL concept we will cover are [regular expressions](https://www.w3.org/TR/rdf-sparql-query/#funcex-regex) (`regex`).\n",
|
|
"Regular expressions are a very powerful tool, but we will only cover the basics in this exercise.\n",
|
|
"\n",
|
|
"In essence, regular expressions match strings against patterns.\n",
|
|
"In their simplest form, they can be used to find substrings within a variable.\n",
|
|
"For instance, using `regex(?label, \"substring\")` would only match if and only if the `?label` variable contains `substring`.\n",
|
|
"But regular expressions can be more complex than that.\n",
|
|
"For instance, we can find patterns such as: a 10 digit number, a 5 character long string, or variables without whitespaces.\n",
|
|
"\n",
|
|
"The syntax of the regex function is the following:\n",
|
|
"\n",
|
|
"```\n",
|
|
"regex(?variable, \"pattern\", \"flags\")\n",
|
|
"```\n",
|
|
"\n",
|
|
"Flags are optional configuration options for the regular expression, such as *do not care about case* (`i` flag).\n",
|
|
"\n",
|
|
"As an example, let us find the cities in Madrid that contain \"de\" in their name."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"SELECT ?localidad\n",
|
|
"WHERE {\n",
|
|
" ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid> .\n",
|
|
" ?localidad rdfs:label ?nombre .\n",
|
|
" FILTER (lang(?nombre) = \"es\" ).\n",
|
|
" FILTER regex(?nombre, \"de\", \"i\")\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
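{
"cell_type": "markdown",
"metadata": {},
"source": [
"The example above matches \"de\" anywhere in the name. As a sketch of a more restrictive pattern, `^` anchors the match to the start of the string and `$` anchors it to the end. The query below is an illustration that reuses the properties from the example above, and it keeps only the localities whose Spanish name starts with \"San\":\n",
"\n",
"```sparql\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?localidad ?nombre\n",
"WHERE {\n",
"  ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"  ?localidad rdfs:label ?nombre .\n",
"  FILTER (lang(?nombre) = \"es\")\n",
"  # ^ means that the name has to begin with \"San \"\n",
"  FILTER regex(?nombre, \"^San \")\n",
"}\n",
"LIMIT 10\n",
"```\n",
"\n",
"Without the `i` flag the match is case sensitive, and the trailing space in the pattern excludes names such as \"Santa\"."
]
},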
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now, use regular expressions to find Spanish novelists whose **first name** is Juan.\n",
|
|
"In other words, their name **starts with** \"Juan\"."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "580570dba869801272f9948f1e901bfd",
|
|
"grade": false,
|
|
"grade_id": "cell-a57d3546a812f689",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql\n",
|
|
"\n",
|
|
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
|
|
"PREFIX dct:<http://purl.org/dc/terms/>\n",
|
|
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
|
|
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
|
|
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
|
|
"\n",
|
|
"# YOUR CODE HERE\n",
|
|
"\n",
|
|
"WHERE {\n",
|
|
"# YOUR CODE HERE\n",
|
|
"}\n",
|
|
"# YOUR CODE HERE\n",
|
|
"LIMIT 1000"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "6632242d1d5055e12c3df37941b9e434",
|
|
"grade": true,
|
|
"grade_id": "cell-c149fe65008f39a9",
|
|
"locked": true,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"assert len(LAST_QUERY['columns']['nombre']) > 15\n",
|
|
"for i in LAST_QUERY['columns']['nombre']:\n",
|
|
" assert 'Juan' in i\n",
|
|
"assert \"Robert Juan-Cantavella\" not in LAST_QUERY['columns']['nombre']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Additional exercises"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Find out if there are more dbpedia entries for writers (dbo:Writer) than for football players (dbo:SoccerPlayers)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Get a list of European countries with a population higher than 20 million, in decreasing order of population, including their URI, name in English and population."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Find the country in the world that speaks the most languages. Show its name in Spanish, if available."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Querying custom data\n",
|
|
"\n",
|
|
"In the last part of this course, we will query the data annotated in the previous course on RDF.\n",
|
|
"\n",
|
|
"The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
|
|
"\n",
|
|
"Hence, there are two parts.\n",
|
|
"First, you will query a set of graphs annotated by students of this course.\n",
|
|
"Then, you will query a synthetic dataset that contains similar information."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"In particular, you need to run five queries, each one will answer one of the following questions:\n",
|
|
"\n",
|
|
"* Number of hotels (or entities) with reviews\n",
|
|
"* Number of reviews\n",
|
|
"* The hotel with the lowest average score\n",
|
|
"* The hotel with the highest average score\n",
|
|
"* A list of hotels with their addresses and telephone numbers"
|
|
]
|
|
},
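{
"cell_type": "markdown",
"metadata": {},
"source": [
"The questions about average scores need aggregates that this notebook has not used so far (`AVG`, `GROUP BY` and `ORDER BY`). The sketch below shows the general shape only. It assumes, hypothetically, that hotels and reviews are annotated with schema.org terms (`schema:Hotel`, `schema:review`, `schema:reviewRating`, `schema:ratingValue`). The graphs you query may use other classes and properties, so inspect them first and adapt the pattern:\n",
"\n",
"```sparql\n",
"PREFIX schema: <http://schema.org/>\n",
"\n",
"# Hypothetical vocabulary: check which terms each graph actually uses\n",
"SELECT ?hotel (AVG(?score) AS ?average)\n",
"WHERE {\n",
"  ?hotel a schema:Hotel ;\n",
"         schema:review ?review .\n",
"  ?review schema:reviewRating ?rating .\n",
"  ?rating schema:ratingValue ?score .\n",
"}\n",
"GROUP BY ?hotel\n",
"ORDER BY DESC(?average)\n",
"LIMIT 1\n",
"```\n",
"\n",
"Replacing `DESC` with `ASC` returns the hotel with the lowest average instead. `AVG` expects numeric values, so scores that were annotated as plain strings may need to be converted first."
]
},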
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Manually annotated"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Querying the manually annotated dataset is slightly different from querying DBpedia.\n",
|
|
"The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
|
|
"\n",
|
|
"**Each graph is a separate set of triples**.\n",
|
|
"For this exercise, you could think of graphs as individual endpoints.\n",
|
|
"\n",
|
|
"\n",
|
|
"First, let us get a list of graphs available:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/ejerciciohoteles\n",
|
|
" \n",
|
|
"SELECT ?g WHERE {\n",
|
|
" GRAPH ?g {\n",
|
|
" ?s ?p ?o .\n",
|
|
" }\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Once you have this list, you can query specific graphs like so:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/ejerciciohoteles\n",
|
|
" \n",
|
|
"SELECT *\n",
|
|
"WHERE {\n",
|
|
" GRAPH <http://fuseki.cluster.gsi.dit.upm.es/36de86e6754934381d935f10618fe985>{\n",
|
|
" ?s ?p ?o .\n",
|
|
" }\n",
|
|
"}"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Now, design five queries to answer the questions in the description, and run each of them in at least five of these graphs.\n",
|
|
"\n",
|
|
"You can manually run the queries or use the code below, where you only need to specify your queries and the graphs you have identified.\n",
|
|
"\n",
|
|
"If you need additional prefixes, feel free to modify the TEMPLATE variable."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from IPython.display import display\n",
|
|
"\n",
|
|
"QUERIES = {\n",
|
|
" 'highest score': '''\n",
|
|
" ?s ?p ?o\n",
|
|
"''',\n",
|
|
" 'lowest score': '''\n",
|
|
" ?s ?p ?o\n",
|
|
" ''',\n",
|
|
" 'number of hotels': '''\n",
|
|
" ?s ?p ?o\n",
|
|
" ''',\n",
|
|
" 'number of reviews': '''\n",
|
|
" ?s ?p ?o\n",
|
|
" ''',\n",
|
|
" 'telephones and addresses': '''\n",
|
|
" ?s ?p ?o\n",
|
|
" ''',\n",
|
|
" \n",
|
|
"}\n",
|
|
"\n",
|
|
"TEMPLATE = '''\n",
|
|
"SELECT * WHERE {{\n",
|
|
" GRAPH <{graph}>{{\n",
|
|
" {query}\n",
|
|
" }}\n",
|
|
" }}\n",
|
|
"'''\n",
|
|
"\n",
|
|
"GRAPHS = ['http://fuseki.cluster.gsi.dit.upm.es/36de86e6754934381d935f10618fe985',\n",
|
|
" ]\n",
|
|
"\n",
|
|
"for name, query in QUERIES.items():\n",
|
|
" for graph in GRAPHS:\n",
|
|
" print(name, '@', graph)\n",
|
|
" display(sparql('http://fuseki.cluster.gsi.dit.upm.es/ejerciciohoteles', TEMPLATE.format(graph=graph,\n",
|
|
" query=query)\n",
|
|
" ))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Synthetic dataset\n",
|
|
"\n",
|
|
"Now, run the same queries in the synthetic dataset.\n",
|
|
"\n",
|
|
"The query below should get you started:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotelessintetico \n",
|
|
"\n",
|
|
"SELECT *\n",
|
|
"WHERE {\n",
|
|
" ?s ?p ?o .\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Discussion\n",
|
|
"\n",
|
|
"Compare the results of the synthetic and the manual dataset, and answer these questions:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Both datasets should use the same schema. Are there any differences when it comes to using them?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "11e7e2b7d3dfb45f9534506761f896f9",
|
|
"grade": true,
|
|
"grade_id": "cell-9bd08e4f5842cb89",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Are data correctly annotated in both datasets?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "f676f18c71297e8429448fa0f0833db1",
|
|
"grade": true,
|
|
"grade_id": "cell-9dc1c9033198bb18",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Has any of the datasets been harder to query? Why?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "2a24b20a338d18f4879540f5e03f5889",
|
|
"grade": true,
|
|
"grade_id": "cell-0e63b8e9dcb24676",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Has any of the datasets been harder to query? Why"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "2ec2cf74959db9112c189a4e7a0b3609",
|
|
"grade": true,
|
|
"grade_id": "cell-6c18003ced54be23",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Are data correctly annotated in both datasets"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "4a062d17043e5459a48314b1177cb8f1",
|
|
"grade": true,
|
|
"grade_id": "cell-cdce24ef5f581981",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR CODE HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
|
|
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence\n",
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© 2018 Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.4"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|