mirror of https://github.com/gsi-upm/sitc synced 2024-11-17 20:12:28 +00:00
sitc/lod/03_SPARQL_Writers.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
"grade": false,
"grade_id": "cell-0bfe38f97f6ab2d2",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"<header style=\"width:100%;position:relative\">\n",
" <div style=\"width:80%;float:right;\">\n",
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
" <h3>Department of Telematic Engineering Systems</h3>\n",
" <h5>Universidad Politécnica de Madrid</h5>\n",
" </div>\n",
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
"</header>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "2387a77db61721ecc375a8275111ecaf",
"grade": false,
"grade_id": "cell-0cd673883ee592d1",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Introduction to Linked Open Data\n",
"\n",
"In this lecture, we will apply the same SPARQL concepts as in previous notebooks.\n",
"This time, instead of using a database specifically built for the exercise, we will be using DBpedia.\n",
"DBpedia is a semantic version of Wikipedia.\n",
"\n",
"The language we will use to query DBpedia is SPARQL, a semantic query language inspired by SQL.\n",
"For convenience, the examples in the notebook are executable, and they are accompanied by some code to test the results.\n",
"If the tests pass, you probably got the answer right.\n",
"\n",
"However, you can also use any other method to write and send your queries.\n",
"You may find online query editors particularly useful.\n",
"In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.\n",
"Some examples are:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "bc0ca2e21254707344c60f895cb204b4",
"grade": false,
"grade_id": "cell-10264483046abcc4",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Objectives\n",
"\n",
"* Learning SPARQL and the Linked Data principles by defining queries to answer a set of problems of increasing difficulty\n",
"* Learning how to use integrated SPARQL editors and programming interfaces to SPARQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
"grade": false,
"grade_id": "cell-4f8492996e74bf20",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Tools\n",
"\n",
"See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instructions\n",
"\n",
"As in previous notebooks, the exercises can be done in the notebook, using the `%%sparql` magic, and the set of tests.\n",
"\n",
"\n",
"After every query, you will find some python code to test the results of the query.\n",
"**Make sure you've run the tests before moving to the next exercise**.\n",
"If the test gives you an error, you've probably done something wrong.\n",
"You **do not need to understand or modify the test code**.\n",
"\n",
"If you prefer to edit your queries in a different editor, here are some options:\n",
"\n",
"* DBpedia's virtuoso query editor https://dbpedia.org/sparql\n",
"* A javascript based client hosted at GSI: http://yasgui.gsi.upm.es/\n",
"\n",
"If you use an editor, make sure to copy it to the notebook and run the tests, once you are getting the expected results."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "c5f8646518bd832a47d71f9d3218237a",
"grade": false,
"grade_id": "cell-eb13908482825e42",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"Run this line to enable the `%%sparql` magic command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers import sparql, solution, show_photos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `%%sparql` magic command will allow us to use SPARQL inside normal jupyter cells.\n",
"\n",
"For instance, the following code:\n",
"\n",
"```python\n",
"%%sparql\n",
"\n",
"<MY QUERY>\n",
"``` \n",
"\n",
"Is the same as `run_query('<MY QUERY>', endpoint='http://dbpedia.org/sparql')` plus some additional steps, such as saving the results in a nice table format so that they can be used later and storing the results in a variable (`solution()`), which we will use in our tests.\n",
"\n",
"You do not need to worry about it, and **you can always use one of the suggested online editors if you wish**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First Select"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a simple query. We will get a list of cities and towns in Madrid.\n",
"If we take a look at the DBpedia ontology or the page of any town we already know, we discover that the property that links towns to their community is [`isPartOf`](http://dbpedia.org/ontology/isPartOf), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
"\n",
"Since there are potentially many cities to get, we will limit our results to the first 10 results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"SELECT ?localidad\n",
"WHERE {\n",
" ?localidad <http://dbpedia.org/ontology/isPartOf> <http://dbpedia.org/resource/Community_of_Madrid>\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, that query is very verbose because we are using full URIs.\n",
"To simplify it, we will make use of SPARQL prefixes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
" \n",
"SELECT ?localidad\n",
"WHERE {\n",
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make sure that the query returned something sensible, we can test it with some python code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert 'localidad' in solution()['columns']\n",
"assert len(solution()['tuples']) == 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have some experience under your belt, it is time to design your own query.\n",
"\n",
"Your first task it to get a list of Spanish Novelits, using the skeleton below and the previous query to guide you.\n",
"\n",
"Pages for Spanish novelists are grouped in the *Spanish novelists* DBpedia category. You can use that fact to get your list.\n",
"In other words, the difference from the previous query will be using `dct:subject` instead of `dbo:isPartOf`, and `dbc:Spanish_novelists` instead of `dbr:Community_of_Madrid`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "eef1c62e2797bd3ef01f2061da6f83c4",
"grade": false,
"grade_id": "cell-7a9509ff3c34127e",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"\n",
"SELECT ?escritor\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "5a57d16cb2b53925f6e39fba429b7ef2",
"grade": true,
"grade_id": "cell-91240ded2cac7b6d",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['columns']) == 1 # We only use one variable, ?escritor\n",
"assert len(solution()['tuples']) == 10 # There should be 10 results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using more criteria\n",
"\n",
"We can get more than one property in the same query. Let us modify our query to get the population of the cities as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
" \n",
"SELECT ?localidad ?pop ?when\n",
"\n",
"WHERE {\n",
" ?localidad dbo:populationTotal ?pop .\n",
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
" ?localidad dbp:populationAsOf ?when .\n",
"}\n",
"\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert 'localidad' in solution()['columns']\n",
"assert 'http://dbpedia.org/resource/Parla' in solution()['columns']['localidad']\n",
"assert ('http://dbpedia.org/resource/San_Sebastián_de_los_Reyes', '75912', '2009') in solution()['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time to try it yourself.\n",
"\n",
"Get the list of Spanish novelists AND their name (using rdfs:label)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "9d4193612dea95da2d91762b638ad5e6",
"grade": false,
"grade_id": "cell-83dcaae0d09657b5",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"\n",
"SELECT ?escritor ?name\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "86115c2a8982ad12b7250cf4341ae9c3",
"grade": true,
"grade_id": "cell-8afd28aada7a896c",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'escritor' in solution()['columns']\n",
"assert 'http://dbpedia.org/resource/Eduardo_Mendoza_Garriga' in solution()['columns']['escritor']\n",
"assert ('http://dbpedia.org/resource/Eduardo_Mendoza_Garriga', 'Eduardo Mendoza') in solution()['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering and ordering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous example, we saw that we got what seemed to be duplicated answers.\n",
"\n",
"This happens because entities can have labels in different languages (e.g. English, Spanish).\n",
"To restrict the search to only those results we're interested in, we can use filtering.\n",
"\n",
"We can also decide the order in which our results are shown.\n",
"\n",
"For instance, this is how we could use filtering to get only large cities in our example, ordered by population:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
" \n",
"SELECT ?localidad ?pop ?when\n",
"\n",
"WHERE {\n",
" ?localidad dbo:populationTotal ?pop .\n",
" ?localidad dbo:isPartOf dbr:Community_of_Madrid.\n",
" ?localidad dbp:populationAsOf ?when .\n",
" FILTER(?pop > 100000)\n",
"}\n",
"ORDER BY ?pop\n",
"LIMIT 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that ordering happens before limits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "a38cb1aea7b1f01f6b37c088384e0a3d",
"grade": true,
"grade_id": "cell-cb7b8283568cd349",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"# We still have the biggest city\n",
"assert ('http://dbpedia.org/resource/Madrid', '3141991', '2014') in solution()['tuples']\n",
"# But the smaller ones are gone\n",
"assert 'http://dbpedia.org/resource/Tres_Cantos' not in solution()['columns']['localidad']\n",
"assert 'http://dbpedia.org/resource/San_Sebastián_de_los_Reyes' not in solution()['columns']['localidad']"
]
},
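{
"cell_type": "markdown",
"metadata": {},
"source": [
"The interaction between `ORDER BY` and `LIMIT` can be sketched with the towns example as follows.\n",
"This is only an illustration: the `DESC` modifier is not used elsewhere in this notebook, and the results depend on the data available in DBpedia when the query runs.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT ?localidad ?pop\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"    ?localidad dbo:populationTotal ?pop .\n",
"}\n",
"# Sort by population, largest first, and only then keep the first three rows\n",
"ORDER BY DESC(?pop)\n",
"LIMIT 3\n",
"```\n",
"\n",
"Because sorting is applied before the limit, this should return the three most populated towns rather than three arbitrary ones."
]
},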
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, try filtering to get a list of novelists and their name in Spanish, ordered by name `(FILTER (LANG(?nombre) = \"es\") y ORDER BY`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "b6aaac8ab30d52a042c1efefbbff7550",
"grade": false,
"grade_id": "cell-ff3d611cb0304b01",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"\n",
"SELECT ?escritor ?nombre\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 1000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "3441fbd2267002acbb0d46d9ce94ba97",
"grade": true,
"grade_id": "cell-d70cc6ea394741bc",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['tuples']) >= 50\n",
"assert 'Adelaida García Morales' in solution()['columns']['nombre']\n",
"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Adelaida_García_Morales') == 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From now on, we will focus on our Writers example.\n",
"\n",
"First, we will search for writers born in the XX century, using the [20th-century Spanish novelists](http://dbpedia.org/page/Category:20th-century_Spanish_novelists) category."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "e3ff089c983be1ae937f254b8d9d229a",
"grade": false,
"grade_id": "cell-ab7755944d46f9ca",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "cacdd08a8a267c1173304e319ffff563",
"grade": true,
"grade_id": "cell-cf3821f2d33fb0f6",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Camilo José Cela' in solution()['columns']['nombre']\n",
"assert 'Javier Marías' in solution()['columns']['nombre']\n",
"assert all(x > '1850-12-31' and x < '2001-01-01' for x in solution()['columns']['nac'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our last example, we were missing many novelists that are do not have birth information in DBpedia.\n",
"\n",
"We can specify optional values in a query using the `OPTIONAL` keyword.\n",
"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query.\n",
"If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
"\n",
"Using that, let us retrieve all the novelists, their birth and death date (if they are available)."
]
},
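{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before moving on to the novelists, here is a minimal sketch of `OPTIONAL` applied to the towns example from the beginning of the notebook.\n",
"It assumes that some towns in DBpedia lack a `dbo:populationTotal` value, which may or may not be the case when you run it.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT ?localidad ?pop\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"    # Towns without a population are kept, with ?pop left unbound\n",
"    OPTIONAL { ?localidad dbo:populationTotal ?pop . }\n",
"}\n",
"LIMIT 100\n",
"```\n",
"\n",
"Without the `OPTIONAL` group, any town lacking a population value would be dropped from the results."
]
},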
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f4170cbbf042644e394d1eb9acf12ce3",
"grade": false,
"grade_id": "cell-254a18dd973e82ed",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor ?nombre ?fechaNac ?fechaDef\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 200"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "29c6362adbdb5606e158f696594e1052",
"grade": true,
"grade_id": "cell-4d6a64dde67f0e11",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Wenceslao Fernández Flórez' in solution()['columns']['nombre']\n",
"assert '1879-2-11' in solution()['columns']['fechaNac']\n",
"assert '' in solution()['columns']['fechaNac'] # Not all birthdates are defined\n",
"assert '' in solution()['columns']['fechaDef'] # Some deathdates are not defined"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bound"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check whether the optional value for a key was bound in a SPARQL query using `BOUND(?key)`.\n",
"\n",
"This is very useful for two purposes.\n",
"First, it allows us to look for patterns that **do not occur** in the graph, such as missing properties.\n",
"For instance, we could search for the authors with missing birth information so we can add it.\n",
"Secondly, we can use bound in filters to get conditional filters.\n",
"We will explore both uses in this exercise."
]
},
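{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first use, looking for missing properties, could look like the following sketch on the towns example.\n",
"It is an illustration rather than part of the exercise, and it may return no rows if every town has a population in DBpedia.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT ?localidad\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"    OPTIONAL { ?localidad dbo:populationTotal ?pop . }\n",
"    # Keep only the rows where the optional pattern found nothing\n",
"    FILTER(!BOUND(?pop))\n",
"}\n",
"LIMIT 100\n",
"```\n",
"\n",
"The `OPTIONAL` group is what makes the check meaningful: it lets rows through even when `?pop` has no value, and the filter then keeps exactly those rows."
]
},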
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of Spanish novelists that are still alive.\n",
"A person is alive if their death date is not defined and the were born less than 100 years ago"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f3c11121eb0d1328d2f5da3580f8d648",
"grade": false,
"grade_id": "cell-474b1a72dec6827c",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
" \n",
"# YOUR ANSWER HERE\n",
"}\n",
"\n",
"# YOUR ANSWER HERE\n",
"LIMIT 1000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "770bbddef5210c28486a1929e4513ada",
"grade": true,
"grade_id": "cell-46b62dd2856bc919",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Fernando Arrabal' in solution()['columns']['nombre']\n",
"assert 'Albert Espinosa' in solution()['columns']['nombre']\n",
"for year in solution()['columns']['nac']:\n",
" assert int(year) >= 1918"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, get the list of Spanish novelists that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
"\n",
"Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
"\n",
"Hint 2: Some dates are not formatted properly, which makes some queries fail when they shouldn't. You might need to convert between different types as a workaround. For instance, you could get the year from a date like this: `year(xsd:dateTime(str(?date)))`."
]
},
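{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of the first hint, the following sketch combines `BOUND` with `||` on the towns example from the beginning of the notebook.\n",
"The threshold of 1000 inhabitants is an arbitrary value chosen for the example.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT ?localidad ?pop\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"    OPTIONAL { ?localidad dbo:populationTotal ?pop . }\n",
"    # Accept towns with no recorded population, or with fewer than 1000 inhabitants\n",
"    FILTER(!BOUND(?pop) || ?pop < 1000)\n",
"}\n",
"LIMIT 100\n",
"```\n",
"\n",
"The same shape, a check for a missing value joined with a condition on the value when it exists, is what makes a filter conditional."
]
},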
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "ed34857649c9a6926eb0a3a0e1d8198d",
"grade": false,
"grade_id": "cell-ceefd3c8fbd39d79",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor, ?nombre, YEAR(?fechaNac) as ?nac, ?fechaDef\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "18bb2d8d586bf4a5231973e69958ab75",
"grade": true,
"grade_id": "cell-461cd6ccc6c2dc79",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Sanmao_(author)' in solution()['columns']['escritor']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding unique elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our last example, our results show some authors more than once.\n",
"This is because some properties are defined more than once.\n",
"For instance, birth date is giving using different formats.\n",
"Even if we exclude that property from our results by not adding it in our `SELECT`, we will get duplicated lines.\n",
"\n",
"To solve this, we can use the `DISTINCT` keyword."
]
},
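{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of the effect on the towns example: the label pattern matches once per language, so each town would appear several times if `DISTINCT` were removed.\n",
"This is an illustration under the assumption that towns have labels in more than one language, as seen earlier in the notebook.\n",
"\n",
"```sparql\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT DISTINCT ?localidad\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"    # One match per label, even though ?nombre is not part of the SELECT\n",
"    ?localidad rdfs:label ?nombre .\n",
"}\n",
"LIMIT 100\n",
"```\n",
"\n",
"Note that `DISTINCT` compares whole rows, so two rows that differ in any selected variable are both kept."
]
},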
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modify your last query to remove duplicated lines.\n",
"In other words, authors should only appear once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "34163ddb0400cd8ddd2c2e2cdf29c20b",
"grade": false,
"grade_id": "cell-2a39adc71d26ae73",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "84ab7d64a45e03e6dd902216a2aad030",
"grade": true,
"grade_id": "cell-542e0e36347fd5d1",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Albert_Espinosa' in solution()['columns']['escritor']\n",
"\n",
"from collections import Counter\n",
"c = Counter(solution()['columns']['nombre'])\n",
"for count in c.values():\n",
" assert count == 1\n",
" \n",
"c1 = Counter(solution()['columns']['escritor'])\n",
"assert all(count==1 for count in c1.values())\n",
"# c = Counter(solution()['columns']['nombre'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using other resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of living Spanish novelists born in Madrid.\n",
"\n",
"Hint: use `dbr:Madrid` and `dbo:birthPlace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "25c8edcee216d536aac98fc9aa2b6422",
"grade": false,
"grade_id": "cell-d175e41da57c889b",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, ?lugarNac, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "443608e177f514f2cddafa6c1d1e3cc7",
"grade": true,
"grade_id": "cell-fadd095862db6bc8",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'José Ángel Mañas' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Madrid' in solution()['columns']['lugarNac']\n",
"MADRID_QUERY = solution()['columns'].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Traversing the graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of works of the authors in the previous query (i.e. authors born in Madrid), if they have any.\n",
"\n",
"Hint: use `dbo:author`, which is a **property of a literary work** that points to the author."
]
},
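{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the direction of `dbo:author` concrete, here is a sketch that starts from a single work and follows the property to its author.\n",
"It assumes that `dbr:A_Heart_So_White` has a `dbo:author` value in DBpedia, as the tests below suggest.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT ?autor\n",
"WHERE {\n",
"    # The work is the subject of the triple and the author is the object\n",
"    dbr:A_Heart_So_White dbo:author ?autor .\n",
"}\n",
"```\n",
"\n",
"In the exercise you need the opposite lookup: the author is already known and the work is the variable, but the triple keeps the same subject and object positions."
]
},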
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "c1f22b82c4d0bd4102a6c38f7f933dc6",
"grade": false,
"grade_id": "cell-e4b99af9ef91ff6f",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, ?obra\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "51acaeb26379c6bd2f8c767001ef79ec",
"grade": true,
"grade_id": "cell-68661b73c2140e4f",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'http://dbpedia.org/resource/A_Heart_So_White' in solution()['columns']['obra']\n",
"assert 'http://dbpedia.org/resource/Tomorrow_in_the_Battle_Think_on_Me' in solution()['columns']['obra']\n",
"assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Traversing the graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get a list of living Spanish novelists born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n",
"\n",
"If the query is right, you should see a list of writers after running the test code.\n",
"\n",
"Hint: `foaf:depiction` and `foaf: homepage`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "e3f8e18a006a763f5cdbe49c97b73f5f",
"grade": false,
"grade_id": "cell-b1f71c67dd71dad4",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor ?web ?foto\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"ORDER BY ?nombre\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2d40e7ceb7774b29a709092ee8dfa9f5",
"grade": true,
"grade_id": "cell-8b8ba7cca701c652",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"fotos = set(filter(lambda x: x != '', solution()['columns']['foto']))\n",
"assert len(fotos) > 2\n",
"show_photos(fotos) #show the pictures of the writers!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Union"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can merge the results of several queries, just like using `JOIN` in SQL.\n",
"The keyword in SPARQL is `UNION`, because we are merging graphs.\n",
"\n",
"`UNION` is useful in many situations.\n",
"For instance, when there are equivalent properties, or when you want to use two search terms and FILTER would be too inefficient.\n",
"\n",
"The syntax is as follows:\n",
"\n",
"```sparql\n",
"SELECT ?title\n",
"WHERE {\n",
" { ?book dc10:title ?title }\n",
" UNION\n",
" { ?book dc11:title ?title }\n",
" \n",
" ... REST OF YOUR QUERY ...\n",
"\n",
"}\n",
"```\n",
"\n"
]
},
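{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a concrete but hedged sketch, the towns example could be extended to two regions with `UNION`.\n",
"The resource `dbr:Catalonia` and its use with `dbo:isPartOf` are assumptions made for the illustration; check the DBpedia pages before relying on them.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT DISTINCT ?localidad\n",
"WHERE {\n",
"    # A town is returned if it matches either of the two alternatives\n",
"    { ?localidad dbo:isPartOf dbr:Community_of_Madrid . }\n",
"    UNION\n",
"    { ?localidad dbo:isPartOf dbr:Catalonia . }\n",
"}\n",
"LIMIT 100\n",
"```\n",
"\n",
"`DISTINCT` is included because a resource that matches both alternatives would otherwise be listed twice."
]
},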
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using UNION, get a list of distinct spanish novelists AND poets.\n",
"\n",
"Hint: Category: Spanish_poets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "9c0da379841474601397f5623abc6a9c",
"grade": false,
"grade_id": "cell-21eb6323b6d0011d",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f22c7db423410fcf3e8fce4ec0a8e9f9",
"grade": true,
"grade_id": "cell-004e021e877c6ace",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Garcilaso de la Vega' in solution()['columns']['nombre']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get the count of results either by inspecting the result (we will not cover this) or by aggregating the results using the `COUNT` operation.\n",
"\n",
"The syntax is:\n",
" \n",
"```sparql\n",
"SELECT COUNT(?variable) as ?count_name\n",
"```\n",
"\n",
"Try it yourself with our previous example:"
]
},
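{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference while you work on it, here is a sketch of `COUNT` applied to the towns example rather than to the writers.\n",
"It is written with the parenthesised `(... AS ...)` form from the SPARQL 1.1 standard, and the variable name `?total` is an arbitrary choice.\n",
"\n",
"```sparql\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"# Count each town once, even if the pattern matched it several times\n",
"SELECT (COUNT(DISTINCT ?localidad) AS ?total)\n",
"WHERE {\n",
"    ?localidad dbo:isPartOf dbr:Community_of_Madrid .\n",
"}\n",
"```\n",
"\n",
"The query returns a single row with a single column, which is the shape the tests for this exercise check for."
]
},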
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "cd7ce9212f587afe311c7631b3908de2",
"grade": false,
"grade_id": "cell-e35414e191c5bf16",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"# YOUR ANSWER HERE\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "68609fa02dcf7480e16f0e5eb7849e65",
"grade": true,
"grade_id": "cell-7a7ef8255a5662e2",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['columns']) == 1\n",
"column_name = list(solution()['columns'].keys())[0]\n",
"assert int(solution()['columns'][column_name][0]) > 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find out if there are more dbpedia entries for writers (dbo:Writer) than for football players (dbo:SoccerPlayers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get a list of European countries with a population higher than 20 million, in decreasing order of population, including their URI, name in English and population."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the country in the world that speaks the most languages. Show its name in Spanish, if available."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2018 Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}