You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
sitc/lod/03_SPARQL_Writers.ipynb

1430 lines
38 KiB
Plaintext

{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
"grade": false,
"grade_id": "cell-0bfe38f97f6ab2d2",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"<header style=\"width:100%;position:relative\">\n",
" <div style=\"width:80%;float:right;\">\n",
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
" <h3>Department of Telematic Engineering Systems</h3>\n",
" <h5>Universidad Politécnica de Madrid</h5>\n",
" </div>\n",
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
"</header>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "2387a77db61721ecc375a8275111ecaf",
"grade": false,
"grade_id": "cell-0cd673883ee592d1",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Introduction to Linked Open Data\n",
"\n",
"In this lecture, we will apply the same SPARQL concepts as in previous notebooks.\n",
"This time, instead of using a database specifically built for the exercise, we will be using DBpedia.\n",
"DBpedia is a semantic version of Wikipedia.\n",
"\n",
"The language we will use to query DBpedia is SPARQL, a semantic query language inspired by SQL.\n",
"For convenience, the examples in the notebook are executable, and they are accompanied by some code to test the results.\n",
"If the tests pass, you probably got the answer right.\n",
"\n",
"However, you can also use any other method to write and send your queries.\n",
"You may find online query editors particularly useful.\n",
"In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.\n",
"Some examples are:\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "bc0ca2e21254707344c60f895cb204b4",
"grade": false,
"grade_id": "cell-10264483046abcc4",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Objectives\n",
"\n",
"* Learning SPARQL and the Linked Data principles by defining queries to answer a set of problems of increasing difficulty\n",
"* Learning how to use integrated SPARQL editors and programming interfaces to SPARQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
"grade": false,
"grade_id": "cell-4f8492996e74bf20",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"## Tools\n",
"\n",
"See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instructions\n",
"\n",
"As in previous notebooks, the exercises can be done in the notebook, using the `%%sparql` magic, and the set of tests.\n",
"\n",
"\n",
"After every query, you will find some python code to test the results of the query.\n",
"**Make sure you've run the tests before moving to the next exercise**.\n",
"If the test gives you an error, you've probably done something wrong.\n",
"You **do not need to understand or modify the test code**.\n",
"\n",
"If you prefer to edit your queries in a different editor, here are some options:\n",
"\n",
"* DBpedia's virtuoso query editor https://dbpedia.org/sparql\n",
"* A javascript based client hosted at GSI: http://yasgui.gsi.upm.es/\n",
"\n",
"If you use an editor, make sure to copy it to the notebook and run the tests, once you are getting the expected results."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "markdown",
"checksum": "c5f8646518bd832a47d71f9d3218237a",
"grade": false,
"grade_id": "cell-eb13908482825e42",
"locked": true,
"schema_version": 3,
"solution": false
}
},
"source": [
"Run this line to enable the `%%sparql` magic command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers import sparql, solution, show_photos"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `%%sparql` magic command will allow us to use SPARQL inside normal jupyter cells.\n",
"\n",
"For instance, the following code:\n",
"\n",
"```python\n",
"%%sparql\n",
"\n",
"<MY QUERY>\n",
"``` \n",
"\n",
"Is the same as `run_query('<MY QUERY>', endpoint='http://dbpedia.org/sparql')` plus some additional steps, such as saving the results in a nice table format so that they can be used later and storing the results in a variable (`solution()`), which we will use in our tests.\n",
"\n",
"You do not need to worry about it, and **you can always use one of the suggested online editors if you wish**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### First Select"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with a simple query. We will get a list of towns and other populated areas within the Community of Madrid.\n",
"If we take a look at the DBpedia ontology, or the page of any town we already know, we discover that the property that links towns to their community is [`subdivision`](http://dbpedia.org/ontology/subdivision), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid)\n",
"\n",
"Since there are potentially many cities to get, we will limit our results to the first 10 results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"SELECT ?localidad\n",
"WHERE {\n",
" ?localidad <http://dbpedia.org/ontology/subdivision> <http://dbpedia.org/resource/Community_of_Madrid>\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, that query is very verbose because we are using full URIs.\n",
"To simplify it, we will make use of SPARQL prefixes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
" \n",
"SELECT ?localidad\n",
"WHERE {\n",
" ?localidad dbo:subdivision dbr:Community_of_Madrid.\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make sure that the query returned something sensible, we can test it with some python code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert 'localidad' in solution()['columns']\n",
"assert len(solution()['tuples']) == 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that you have some experience under your belt, it is time to design your own query.\n",
"\n",
"Your first task it to get a list of writers, using the skeleton below and the previous query to guide you.\n",
"\n",
"The DBpedia vocabulary has a special class for writers: `<http://dbpedia.org/ontology/Writer>`.\n",
"\n",
"In other words, the difference from the previous query will be using `a` instead of `dbo:isPartOf`, and `dbo:Writer` instead of `dbr:Community_of_Madrid`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2a5c55e8bca983aca6cc2293f4560f31",
"grade": false,
"grade_id": "cell-7a9509ff3c34127e",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "5a57d16cb2b53925f6e39fba429b7ef2",
"grade": true,
"grade_id": "cell-91240ded2cac7b6d",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['columns']) == 1 # We only use one variable, ?escritor\n",
"assert len(solution()['tuples']) == 10 # There should be 10 results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using more criteria\n",
"\n",
"We can get more than one property in the same query. Let us modify our query to get the total area of the towns we found before."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
" \n",
"SELECT ?localidad ?area\n",
"\n",
"WHERE {\n",
" ?localidad dbo:areaTotal ?area .\n",
" ?localidad dbo:subdivision dbr:Community_of_Madrid .\n",
"}\n",
"\n",
"LIMIT 1000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert 'localidad' in solution()['columns']\n",
"assert ('http://dbpedia.org/resource/Lozoya', '5.794e+07') in solution()['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Time to try it yourself.\n",
"\n",
"Get the list of writers AND their name (using rdfs:label)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2ebdc8d3f3420bb961e2c8c77d027c3b",
"grade": false,
"grade_id": "cell-83dcaae0d09657b5",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"\n",
"SELECT ?escritor ?name\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "d779d690d5d1865973fdcf113b74c221",
"grade": true,
"grade_id": "cell-8afd28aada7a896c",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'escritor' in solution()['columns']\n",
"assert 'http://dbpedia.org/resource/Alison_Stine' in solution()['columns']['escritor']\n",
"assert ('http://dbpedia.org/resource/Alistair_MacLeod', 'Alistair MacLeod') in solution()['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Filtering and ordering"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous example, we saw that we got what seemed to be duplicated answers.\n",
"\n",
"This happens because entities can have labels in different languages (e.g. English, Spanish).\n",
"We can filter results using the `FILTER` keyword.\n",
"\n",
"We can also decide the order in which our results are shown using the `ORDER BY` sentence.\n",
"We can order in ascending (`ASC`) or descending (`DESC`) order.\n",
"\n",
"For instance, this is how we could use filtering to get only large areas in our example, in descending order:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
" \n",
"SELECT ?localidad ?area\n",
"\n",
"WHERE {\n",
" ?localidad dbo:areaTotal ?area .\n",
" ?localidad dbo:type dbr:Municipalities_of_Spain .\n",
" FILTER(?area > 100000)\n",
"}\n",
"ORDER BY DESC(?area)\n",
"LIMIT 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that ordering happens before limits."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "1e09f3c1749dd3c9256a1d0bbc14ff2d",
"grade": true,
"grade_id": "cell-cb7b8283568cd349",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"# We still have the biggest city\n",
"assert 'http://dbpedia.org/resource/Úbeda' in solution()['columns']['localidad']\n",
"# But the smaller ones are gone\n",
"assert 'http://dbpedia.org/resource/El_Cañaveral' not in solution()['columns']['localidad']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, try filtering to get a list of novelists and their name in Spanish, ordered by name `(FILTER (LANG(?nombre) = \"es\") y ORDER BY`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "b200ff7d97fe03bab726040d16b636fe",
"grade": false,
"grade_id": "cell-ff3d611cb0304b01",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor ?nombre\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "637f8a2e0eb286f968f22b0e0fa2215a",
"grade": true,
"grade_id": "cell-d70cc6ea394741bc",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['tuples']) >= 50\n",
"assert 'Abraham Abulafia' in solution()['columns']['nombre']\n",
"assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Abraham_Abulafia') == 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From now on, we will focus on our Writers example.\n",
"More specifically, we will be interested in writers born in the XX century.\n",
"\n",
"To do that, we will filter our novelists to only those born (`dbo:birthDate`) in the 20th century (after 1900)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "e3ff089c983be1ae937f254b8d9d229a",
"grade": false,
"grade_id": "cell-ab7755944d46f9ca",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "e896e64c21f317aeacf82ccd46811059",
"grade": true,
"grade_id": "cell-cf3821f2d33fb0f6",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Kiku Amino' in solution()['columns']['nombre']\n",
"assert 'Albert Hackett' in solution()['columns']['nombre']\n",
"assert all(x > '1900-01-01' and x < '2001-01-01' for x in solution()['columns']['nac'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our last example, we were missing many novelists that are do not have birth information in DBpedia.\n",
"\n",
"We can specify optional values in a query using the `OPTIONAL` keyword.\n",
"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query.\n",
"If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
"\n",
"Using that, let us retrieve all the novelists, their birth and death date (if they are available)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "df4364d90fd37ec886bec8f39f6df8ee",
"grade": false,
"grade_id": "cell-254a18dd973e82ed",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor ?nombre ?fechaNac ?fechaDef\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "26d08d050ac6963b20595f52b5d14781",
"grade": true,
"grade_id": "cell-4d6a64dde67f0e11",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Alister McGrath' in solution()['columns']['nombre']\n",
"# assert '1879-2-11' in solution()['columns']['fechaNac']\n",
"assert '' in solution()['columns']['fechaNac'] # Not all birthdates are defined\n",
"assert '' in solution()['columns']['fechaDef'] # Some deathdates are not defined"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bound"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check whether the optional value for a key was bound in a SPARQL query using `BOUND(?key)`.\n",
"\n",
"This is very useful for two purposes.\n",
"First, it allows us to look for patterns that **do not occur** in the graph, such as missing properties.\n",
"For instance, we could search for the authors with missing birth information so we can add it.\n",
"Secondly, we can use bound in filters to get conditional filters.\n",
"We will explore both uses in this exercise."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of writers that are still alive.\n",
"A person is alive if their death date is not defined and the were born less than 100 years ago"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "7527bd597f9550ec14d454732f6b2183",
"grade": false,
"grade_id": "cell-474b1a72dec6827c",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
" \n",
"# YOUR ANSWER HERE\n",
"}\n",
"\n",
"# YOUR ANSWER HERE\n",
"LIMIT 1000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "8f8c783af97cd3024b90a8f5b7fd7027",
"grade": true,
"grade_id": "cell-46b62dd2856bc919",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Fernando Arrabal' in solution()['columns']['nombre']\n",
"assert 'Javier Sierra' in solution()['columns']['nombre']\n",
"for year in solution()['columns']['nac']:\n",
" assert int(year) >= 1918"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, get the list of writers that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.\n",
"\n",
"Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).\n",
"\n",
"Hint 2: Some dates are not formatted properly, which makes some queries fail when they shouldn't. You might need to convert between different types as a workaround. For instance, you could get the year from a date like this: `year(xsd:dateTime(str(?date)))`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2e608b808ceceb2c8515f892a6b98d06",
"grade": false,
"grade_id": "cell-ceefd3c8fbd39d79",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor, ?nombre, YEAR(?fechaNac) as ?nac, ?fechaDef\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "ec821397f67619e5bfa02a19bdd597fc",
"grade": true,
"grade_id": "cell-461cd6ccc6c2dc79",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Wang Ruowang' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Manuel_de_Pedrolo' in solution()['columns']['escritor']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Finding unique elements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In our last example, our results show some authors more than once.\n",
"This is because some properties are defined more than once.\n",
"For instance, birth date is giving using different formats.\n",
"Even if we exclude that property from our results by not adding it in our `SELECT`, we will get duplicated lines.\n",
"\n",
"To solve this, we can use the `DISTINCT` keyword."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Modify your last query to remove duplicated lines.\n",
"In other words, authors should only appear once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "3d647ccd0f3e861b843af0ec4a33098b",
"grade": false,
"grade_id": "cell-2a39adc71d26ae73",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "524d152d46d3c1166052b6d5871c6aa5",
"grade": true,
"grade_id": "cell-542e0e36347fd5d1",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Anna Langfus' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Paul_Celan' in solution()['columns']['escritor']\n",
"\n",
"from collections import Counter\n",
"c = Counter(solution()['columns']['nombre'])\n",
"for count in c.values():\n",
" assert count == 1\n",
" \n",
"c1 = Counter(solution()['columns']['escritor'])\n",
"assert all(count==1 for count in c1.values())\n",
"# c = Counter(solution()['columns']['nombre'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Using other resources"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of living novelists born in Madrid.\n",
"\n",
"Hint: use `dbr:Madrid` and `dbo:birthPlace`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f067a70a247b62d7eb5cc526efdc53c4",
"grade": false,
"grade_id": "cell-d175e41da57c889b",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, ?lugarNac, year(?fechaNac) as ?nac\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "443608e177f514f2cddafa6c1d1e3cc7",
"grade": true,
"grade_id": "cell-fadd095862db6bc8",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'José Ángel Mañas' in solution()['columns']['nombre']\n",
"assert 'http://dbpedia.org/resource/Madrid' in solution()['columns']['lugarNac']\n",
"MADRID_QUERY = solution()['columns'].copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Traversing the graph"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the list of works of the authors in the previous query (i.e. authors born in Madrid), if they have any.\n",
"\n",
"Hint: use `dbo:author`, which is a **property of a literary work** that points to the author."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "64ea2ef341901ce486bb1dcbed6c3785",
"grade": false,
"grade_id": "cell-e4b99af9ef91ff6f",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre, ?obra\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 1000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "fe47b48969b20b50a16a4ce4ad75e97d",
"grade": true,
"grade_id": "cell-68661b73c2140e4f",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'http://dbpedia.org/resource/Cristina_Guzmán_(novel)' in solution()['columns']['obra']\n",
"assert 'http://dbpedia.org/resource/Life_Is_a_Dream' in solution()['columns']['obra']\n",
"assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Traversing the graph II"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get a list of writers born in Madrid, their name in Spanish, a link to their foto and a website (if they have one).\n",
"\n",
"If the query is right, you should see a list of writers after running the test code.\n",
"\n",
"Hint: `foaf:depiction` and `foaf: homepage`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "d3636d90f8d6a3c824b17ce87ba6c423",
"grade": false,
"grade_id": "cell-b1f71c67dd71dad4",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT ?escritor ?web ?foto\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"ORDER BY ?nombre\n",
"LIMIT 5"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2d40e7ceb7774b29a709092ee8dfa9f5",
"grade": true,
"grade_id": "cell-8b8ba7cca701c652",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"fotos = set(filter(lambda x: x != '', solution()['columns']['foto']))\n",
"assert len(fotos) > 2\n",
"show_photos(fotos) #show the pictures of the writers!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Union"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can merge the results of several queries, just like using `JOIN` in SQL.\n",
"The keyword in SPARQL is `UNION`, because we are merging graphs.\n",
"\n",
"`UNION` is useful in many situations.\n",
"For instance, when there are equivalent properties, or when you want to use two search terms and FILTER would be too inefficient.\n",
"\n",
"The syntax is as follows:\n",
"\n",
"```sparql\n",
"SELECT ?title\n",
"WHERE {\n",
" { ?book dc10:title ?title }\n",
" UNION\n",
" { ?book dc11:title ?title }\n",
" \n",
" ... REST OF YOUR QUERY ...\n",
"\n",
"}\n",
"```\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using UNION, get a list of distinct spanish novelists AND poets.\n",
"\n",
"In this query, instead of looking for writers, try to find the right entities by looking at the `dct:subject` property.\n",
"The entities we are looking after should be in the `Spanish_poets` and `Spanish_novelists` categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "2547e55ac68b37687efddd50c768eb5b",
"grade": false,
"grade_id": "cell-21eb6323b6d0011d",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"SELECT DISTINCT ?escritor, ?nombre\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "565dac8ae632765bc3f128f830e70993",
"grade": true,
"grade_id": "cell-004e021e877c6ace",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert 'Antonio Gala' in solution()['columns']['nombre']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also get the count of results either by inspecting the result (we will not cover this) or by aggregating the results using the `COUNT` operation.\n",
"\n",
"The syntax is:\n",
" \n",
"```sparql\n",
"SELECT COUNT(?variable) as ?count_name\n",
"```\n",
"\n",
"Try it yourself with our previous example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "f8cca6da3b6830a5474eac28c3c8ebde",
"grade": false,
"grade_id": "cell-e35414e191c5bf16",
"locked": false,
"schema_version": 3,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n",
"PREFIX dct:<http://purl.org/dc/terms/>\n",
"PREFIX dbc:<http://dbpedia.org/resource/Category:>\n",
"PREFIX dbr:<http://dbpedia.org/resource/>\n",
"PREFIX dbo:<http://dbpedia.org/ontology/>\n",
"\n",
"# YOUR ANSWER HERE\n",
"\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 10000"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"cell_type": "code",
"checksum": "68609fa02dcf7480e16f0e5eb7849e65",
"grade": true,
"grade_id": "cell-7a7ef8255a5662e2",
"locked": true,
"points": 0,
"schema_version": 3,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['columns']) == 1\n",
"column_name = list(solution()['columns'].keys())[0]\n",
"assert int(solution()['columns'][column_name][0]) > 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find out if there are more dbpedia entries for writers (dbo:Writer) than for football players (dbo:SoccerPlayers)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get a list of European countries with a population higher than 20 million, in decreasing order of population, including their URI, name in English and population."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the country in the world that speaks the most languages. Show its name in Spanish, if available."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2023 Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"toc": {
"base_numbering": 1,
"nav_menu": {},
"number_sections": true,
"sideBar": true,
"skip_h1_title": false,
"title_cell": "Table of Contents",
"title_sidebar": "Contents",
"toc_cell": false,
"toc_position": {},
"toc_section_display": true,
"toc_window_display": false
}
},
"nbformat": 4,
"nbformat_minor": 2
}