<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>Course Notes for Learning Intelligent Systems</h1>
    <h3>Department of Telematic Engineering Systems</h3>
    <h5>Universidad Politécnica de Madrid</h5>
  </div>
        <img style="width:15%;" src="../logo.jpg" alt="UPM" />
</header>

## Introduction to Linked Open Data

In this lecture, we will apply the same SPARQL concepts as in previous notebooks.
This time, instead of using a database specifically built for the exercise, we will be using DBpedia.
DBpedia is a knowledge base of structured data extracted from Wikipedia and published as Linked Data.

The language we will use to query DBpedia is SPARQL, a semantic query language inspired by SQL.
For convenience, the examples in the notebook are executable, and they are accompanied by some code to test the results.
If the tests pass, you probably got the answer right.

However, you can also use any other method to write and send your queries.
You may find online query editors particularly useful.
In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.
Some examples are listed in the Instructions section below.

## Objectives

* Learning SPARQL and the Linked Data principles by defining queries to answer a set of problems of increasing difficulty
* Learning how to use integrated SPARQL editors and programming interfaces to SPARQL.

## Tools

See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)

## Instructions

As in previous notebooks, the exercises can be done in the notebook, using the `%%sparql` magic, and the set of tests.


After every query, you will find some Python code to test the results of the query.
**Make sure you've run the tests before moving to the next exercise**.
If the test gives you an error, you've probably done something wrong.
You **do not need to understand or modify the test code**.

If you prefer to edit your queries in a different editor, here are some options:

* DBpedia's Virtuoso query editor: https://dbpedia.org/sparql
* A JavaScript-based client hosted at GSI: http://yasgui.gsi.upm.es/

If you use an external editor, make sure to copy your query to the notebook and run the tests once you are getting the expected results.

Run this line to enable the `%%sparql` magic command.

In [None]:
from helpers import sparql, solution, show_photos

The `%%sparql` magic command will allow us to use SPARQL inside normal jupyter cells.

For instance, the following code:

```python
%%sparql

<MY QUERY>
```    

is the same as `run_query('<MY QUERY>', endpoint='http://dbpedia.org/sparql')` plus some additional steps, such as formatting the results as a table and storing them so that our tests can retrieve them later with `solution()`.

You do not need to worry about it, and **you can always use one of the suggested online editors if you wish**.
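
To give a rough idea of what happens behind the scenes, the following is a minimal sketch, not the actual code in `helpers`. It assumes the endpoint follows the standard SPARQL protocol and returns the SPARQL JSON results format. The names `build_request` and `to_table` are hypothetical, and the sample response is made up for illustration (only the Lozoya row mirrors a value used in the tests of this notebook).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(query, endpoint='https://dbpedia.org/sparql'):
    """Build (but do not send) a GET request for a SPARQL query."""
    url = endpoint + '?' + urlencode({'query': query})
    # Ask the endpoint for the standard SPARQL JSON results format
    return Request(url, headers={'Accept': 'application/sparql-results+json'})


def to_table(response):
    """Convert a SPARQL JSON response into columns and tuples."""
    names = response['head']['vars']
    rows = response['results']['bindings']
    # Unbound variables are missing from a binding; represent them as ''
    tuples = [tuple(row.get(n, {}).get('value', '') for n in names)
              for row in rows]
    columns = {n: [t[i] for t in tuples] for i, n in enumerate(names)}
    return {'columns': columns, 'tuples': tuples}


# Made-up response in the SPARQL JSON results format
sample = json.loads('''{
  "head": {"vars": ["localidad", "area"]},
  "results": {"bindings": [
    {"localidad": {"type": "uri", "value": "http://dbpedia.org/resource/Lozoya"},
     "area": {"type": "literal", "value": "5.794e+07"}},
    {"localidad": {"type": "uri", "value": "http://example.org/TownWithoutArea"}}
  ]}
}''')
table = to_table(sample)
request = build_request('SELECT * WHERE { ?s ?p ?o } LIMIT 1')
```

Sending the request (for example with `urllib.request.urlopen`) and decoding the body with `json` would produce a dictionary like `sample`, which `to_table` turns into the `columns` and `tuples` layout that the tests in this notebook inspect.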

## Exercises

### First Select

Let's start with a simple query. We will get a list of towns and other populated areas within the Community of Madrid.
If we take a look at the DBpedia ontology, or the page of any town we already know, we discover that the property that links towns to their community is [`subdivision`](http://dbpedia.org/ontology/subdivision), and [the Community of Madrid is also a resource in DBpedia](http://dbpedia.org/resource/Community_of_Madrid).

Since there are potentially many cities to get, we will limit our results to the first 10 results:

In [None]:
%%sparql https://dbpedia.org/sparql

SELECT ?localidad
WHERE {
    ?localidad <http://dbpedia.org/ontology/subdivision> <http://dbpedia.org/resource/Community_of_Madrid>
}
LIMIT 10

However, that query is very verbose because we are using full URIs.
To simplify it, we will make use of SPARQL prefixes:

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
        
SELECT ?localidad
WHERE {
    ?localidad dbo:subdivision dbr:Community_of_Madrid.
}
LIMIT 10
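
Prefixes are only an abbreviation mechanism: the endpoint expands every prefixed name back into a full URI before evaluating the query. A minimal Python sketch of that expansion follows; the `expand` helper is hypothetical and only handles simple `prefix:local` names.

```python
PREFIXES = {
    'dbo': 'http://dbpedia.org/ontology/',
    'dbr': 'http://dbpedia.org/resource/',
}


def expand(term, prefixes=PREFIXES):
    """Expand a prefixed name such as dbo:subdivision into a full URI."""
    prefix, _, local = term.partition(':')
    return '<' + prefixes[prefix] + local + '>'


full = [expand(t) for t in ('dbo:subdivision', 'dbr:Community_of_Madrid')]
```

The two expanded terms are exactly the URIs written out by hand in the first, verbose version of the query.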

To make sure that the query returned something sensible, we can test it with some Python code:

In [None]:
assert 'localidad' in solution()['columns']
assert len(solution()['tuples']) == 10

Now that you have some experience under your belt, it is time to design your own query.

Your first task is to get a list of writers, using the skeleton below and the previous query to guide you.

The DBpedia vocabulary has a special class for writers: `<http://dbpedia.org/ontology/Writer>`.

In other words, the difference from the previous query will be using `a` instead of `dbo:subdivision`, and `dbo:Writer` instead of `dbr:Community_of_Madrid`.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor

WHERE {
# YOUR ANSWER HERE
}
LIMIT 10

In [None]:
assert len(solution()['columns']) == 1 # We only use one variable, ?escritor
assert len(solution()['tuples']) == 10 # There should be 10 results

### Using more criteria

We can get more than one property in the same query. Let us modify our query to get the total area of the towns we found before.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbp: <http://dbpedia.org/property/>
        
SELECT ?localidad ?area

WHERE {
    ?localidad dbo:areaTotal ?area .
    ?localidad dbo:subdivision dbr:Community_of_Madrid .
}

LIMIT 1000

In [None]:
assert 'localidad' in solution()['columns']
assert ('http://dbpedia.org/resource/Lozoya', '5.794e+07') in solution()['tuples']

Time to try it yourself.

Get the list of writers AND their name (using rdfs:label).

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>

SELECT ?escritor ?name

WHERE {
# YOUR ANSWER HERE
}
LIMIT 100

In [None]:
assert 'escritor' in solution()['columns']
assert 'http://dbpedia.org/resource/Alison_Stine' in solution()['columns']['escritor']
assert ('http://dbpedia.org/resource/Alistair_MacLeod', 'Alistair MacLeod') in solution()['tuples']

### Filtering and ordering

In the previous example, we saw that we got what seemed to be duplicated answers.

This happens because entities can have labels in different languages (e.g. English, Spanish).
We can filter results using the `FILTER` keyword.

We can also decide the order in which our results are shown using the `ORDER BY` clause.
We can order in ascending (`ASC`) or descending (`DESC`) order.

For instance, this is how we could use filtering to get only large areas in our example, in descending order:

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
        
SELECT ?localidad ?area

WHERE {
    ?localidad dbo:areaTotal ?area .
    ?localidad dbo:type dbr:Municipalities_of_Spain .
    FILTER(?area > 100000)
}
ORDER BY DESC(?area)
LIMIT 100

Note that ordering happens before limits: the results are sorted first and only then cut to the first 100, so we get the 100 largest areas.

In [None]:
# We still have the biggest city
assert 'http://dbpedia.org/resource/Úbeda' in solution()['columns']['localidad']
# But the smaller ones are gone
assert 'http://dbpedia.org/resource/El_Cañaveral' not in solution()['columns']['localidad']
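
The order in which filtering, ordering and limiting are applied can be sketched in plain Python. The rows below are made up for illustration, and the limit is reduced to 2 to keep the example short.

```python
# Made-up (town, area) rows, only for illustration
rows = [('A', 50_000.0), ('B', 900_000.0), ('C', 250_000.0), ('D', 400_000.0)]

# FILTER(?area > 100000): discard rows that do not satisfy the condition
filtered = [r for r in rows if r[1] > 100_000]

# ORDER BY DESC(?area): sort the remaining rows, largest area first
ordered = sorted(filtered, key=lambda r: r[1], reverse=True)

# LIMIT 2: only now keep the first two rows
top = ordered[:2]
```

If the limit were applied before sorting, we would get an arbitrary pair of towns instead of the two largest ones.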

Now, try filtering to get a list of novelists and their name in Spanish, ordered by name. Hint: use `FILTER (LANG(?nombre) = "es")` and `ORDER BY`.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert len(solution()['tuples']) >= 50
assert 'Abraham Abulafia' in solution()['columns']['nombre']
assert sum(1 for k in solution()['columns']['escritor'] if k == 'http://dbpedia.org/resource/Abraham_Abulafia') == 1

### Optional

From now on, we will focus on our Writers example.
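More specifically, we will be interested in writers born in the 20th century.

To do that, we will filter our novelists to only those born (`dbo:birthDate`) in the 20th century, i.e. after 1900-01-01 and before 2001-01-01.
The tests expect the name in `?nombre` and the birth date in `?nac`.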

In [None]:
# YOUR ANSWER HERE

In [None]:
assert 'Kiku Amino' in solution()['columns']['nombre']
assert 'Albert Hackett' in solution()['columns']['nombre']
assert all(x > '1900-01-01' and x < '2001-01-01' for x in solution()['columns']['nac'])

In our last example, we were missing many novelists that do not have birth information in DBpedia.

We can specify optional values in a query using the `OPTIONAL` keyword.
When a set of clauses is inside an `OPTIONAL` group, the SPARQL endpoint will try to match them, but it will not discard a result if they do not match.
If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).
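
The difference between a required pattern and an optional one can be sketched in Python with two dictionaries. The data is made up, and the empty string stands for an unbound variable, which is how the tests in this notebook see it.

```python
# Made-up data: every writer has a name, only some have a birth date
names = {'w1': 'Writer One', 'w2': 'Writer Two', 'w3': 'Writer Three'}
birth_dates = {'w1': '1950-03-01', 'w3': '1901-12-24'}

# Required pattern: a writer without a birth date produces no row at all
required = [(w, names[w], birth_dates[w]) for w in names if w in birth_dates]

# OPTIONAL pattern: every writer is kept, missing dates are left empty
optional = [(w, names[w], birth_dates.get(w, '')) for w in names]
```

Here `required` loses `w2` entirely, whereas `optional` keeps it with an empty birth date.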

Using that, let us retrieve all the novelists, their birth and death date (if they are available).

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre ?fechaNac ?fechaDef

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert 'Alister McGrath' in solution()['columns']['nombre']
# assert '1879-2-11' in solution()['columns']['fechaNac']
assert '' in solution()['columns']['fechaNac'] # Not all birthdates are defined
assert '' in solution()['columns']['fechaDef'] # Some deathdates are not defined

### Bound

We can check whether the optional value for a key was bound in a SPARQL query using `BOUND(?key)`.

This is very useful for two purposes.
First, it allows us to look for patterns that **do not occur** in the graph, such as missing properties.
For instance, we could search for the authors with missing birth information so we can add it.
Secondly, we can use bound in filters to get conditional filters.
We will explore both uses in this exercise.
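
Both uses can be illustrated with a small Python sketch over made-up rows. The `bound` helper is hypothetical and `None` plays the role of an unbound variable; the SPARQL expressions in the comments are only rough equivalents.

```python
# Made-up rows: (writer, birth year, death year); None means "not bound"
rows = [('w1', 1950, None), ('w2', 1930, 2001), ('w3', None, None)]


def bound(value):
    """Mimic BOUND(?x): true only when the variable has a value."""
    return value is not None


# Use 1: find what is missing, roughly FILTER(!BOUND(?birth))
missing_birth = [w for w, birth, death in rows if not bound(birth)]

# Use 2: conditional filter, roughly FILTER(!BOUND(?death) || ?death < 2000)
no_death_or_early = [w for w, birth, death in rows
                     if not bound(death) or death < 2000]
```

The second filter only compares the death year when there is one, which is the pattern needed for the exercises in this section.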

Get the list of writers that are still alive.
A person is alive if their death date is not defined and they were born less than 100 years ago.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre (YEAR(?fechaNac) AS ?nac)

WHERE {
    
# YOUR ANSWER HERE
}

# YOUR ANSWER HERE
LIMIT 1000

In [None]:
assert 'Fernando Arrabal' in solution()['columns']['nombre']
assert 'Javier Sierra' in solution()['columns']['nombre']
for year in solution()['columns']['nac']:
    assert int(year) >= 1918

Now, get the list of writers that died before their fifties (i.e. younger than 50 years old), or that aren't 50 years old yet.

Hint: you can use boolean logic in your filters (e.g. `&&` and `||`).

Hint 2: Some dates are not formatted properly, which makes some queries fail when they shouldn't. You might need to convert between different types as a workaround. For instance, you could get the year from a date like this: `year(xsd:dateTime(str(?date)))`.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor ?nombre (YEAR(?fechaNac) AS ?nac) ?fechaDef

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert 'Wang Ruowang' in solution()['columns']['nombre']
assert 'http://dbpedia.org/resource/Manuel_de_Pedrolo' in solution()['columns']['escritor']

### Finding unique elements

In our last example, our results show some authors more than once.
This is because some properties are defined more than once.
For instance, the birth date is given in different formats.
Even if we exclude that property from our results by not adding it in our `SELECT`, we will get duplicated lines.

To solve this, we can use the `DISTINCT` keyword.
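
The following Python sketch, using made-up rows, shows why leaving a column out of the `SELECT` is not enough, and what `DISTINCT` does to the projected rows.

```python
# Made-up rows: the same writer appears twice because the birth date
# is stored in two different formats
rows = [
    ('w1', 'Writer One', '1950-03-01'),
    ('w1', 'Writer One', '1950-3-1'),
    ('w2', 'Writer Two', '1930-07-15'),
]

# Projecting fewer columns does not remove the repeated line
projected = [(writer, name) for writer, name, _ in rows]

# DISTINCT keeps one copy of each projected row
# (dict keys preserve insertion order, so the row order is kept)
distinct = list(dict.fromkeys(projected))
```

Note that `DISTINCT` compares whole rows: if the differing birth dates were still among the selected columns, both lines for `w1` would remain.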

Modify your last query to remove duplicated lines.
In other words, authors should only appear once.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT DISTINCT ?escritor ?nombre (YEAR(?fechaNac) AS ?nac)

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert 'Anna Langfus' in solution()['columns']['nombre']
assert 'http://dbpedia.org/resource/Paul_Celan' in solution()['columns']['escritor']

from collections import Counter
c = Counter(solution()['columns']['nombre'])
for count in c.values():
    assert count == 1
    
c1 = Counter(solution()['columns']['escritor'])
assert all(count==1 for count in c1.values())
# c = Counter(solution()['columns']['nombre'])

### Using other resources

Get the list of living novelists born in Madrid.

Hint: use `dbr:Madrid` and `dbo:birthPlace`

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT DISTINCT ?escritor ?nombre ?lugarNac (YEAR(?fechaNac) AS ?nac)

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert 'José Ángel Mañas' in solution()['columns']['nombre']
assert 'http://dbpedia.org/resource/Madrid' in solution()['columns']['lugarNac']
MADRID_QUERY = solution()['columns'].copy()

### Traversing the graph

Get the list of works of the authors in the previous query (i.e. authors born in Madrid), if they have any.

Hint: use `dbo:author`, which is a **property of a literary work** that points to the author.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT DISTINCT ?escritor ?nombre ?obra

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 1000

In [None]:
assert 'http://dbpedia.org/resource/Cristina_Guzmán_(novel)' in solution()['columns']['obra']
assert 'http://dbpedia.org/resource/Life_Is_a_Dream' in solution()['columns']['obra']
assert '' in solution()['columns']['obra'] # Some authors don't have works in dbpedia

### Traversing the graph II

Get a list of writers born in Madrid, their name in Spanish, a link to their photo and a website (if they have one).

If the query is right, you should see a list of writers after running the test code.

Hint: `foaf:depiction` and `foaf:homepage`

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT ?escritor ?web ?foto

WHERE {
# YOUR ANSWER HERE
}
ORDER BY ?nombre
LIMIT 5

In [None]:
fotos = set(filter(lambda x: x != '', solution()['columns']['foto']))
assert len(fotos) > 2
show_photos(fotos) #show the pictures of the writers!

### Union

We can merge the results of several graph patterns, much like `UNION` does for queries in SQL.
The keyword in SPARQL is also `UNION`: it combines the solutions of alternative patterns into a single result.

`UNION` is useful in many situations.
For instance, when there are equivalent properties, or when you want to use two search terms and FILTER would be too inefficient.

The syntax is as follows:

```sparql
SELECT ?title
WHERE  {
  { ?book dc10:title  ?title }
  UNION
  { ?book dc11:title  ?title }
  
  ... REST OF YOUR QUERY ...

}
```
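
To make the semantics concrete, here is a minimal Python sketch over a made-up list of triples. The `match` helper is hypothetical and the prefixed names are kept as plain strings for brevity.

```python
# Made-up triples using two equivalent "title" properties
triples = [
    ('book1', 'dc10:title', 'First Book'),
    ('book2', 'dc11:title', 'Second Book'),
    ('book2', 'dc11:creator', 'Someone'),
]


def match(predicate):
    """Return (book, title) solutions for { ?book <predicate> ?title }."""
    return [(s, o) for s, p, o in triples if p == predicate]


# { ?book dc10:title ?title } UNION { ?book dc11:title ?title }
titles = match('dc10:title') + match('dc11:title')
```

Each branch is evaluated on its own and the solutions are concatenated, so a book described with both properties would appear twice unless `DISTINCT` is used.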



Using `UNION`, get a list of distinct Spanish novelists AND poets.

In this query, instead of looking for writers, try to find the right entities by looking at the `dct:subject` property.
The entities we are looking for should be in the `Spanish_poets` and `Spanish_novelists` categories.

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbo:<http://dbpedia.org/ontology/>

SELECT DISTINCT ?escritor ?nombre

WHERE {
# YOUR ANSWER HERE
}
# YOUR ANSWER HERE
LIMIT 100

In [None]:
assert 'Antonio Gala' in solution()['columns']['nombre']

You can also get the count of results either by inspecting the result (we will not cover this) or by aggregating the results using the `COUNT` operation.

The syntax is:
    
```sparql
SELECT (COUNT(?variable) AS ?count_name)
```
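
As a rough Python analogy over made-up rows, `COUNT(?variable)` counts the solutions in which the variable is bound, and `COUNT(DISTINCT ?variable)` counts the different values it takes. `None` stands for an unbound variable here.

```python
# Made-up solutions: (writer, name); None means "not bound"
rows = [('w1', 'Writer One'), ('w2', 'Writer Two'),
        ('w1', 'Writer One'), (None, 'Unknown')]

# COUNT(?escritor): one per solution where ?escritor is bound
count = sum(1 for writer, _ in rows if writer is not None)

# COUNT(DISTINCT ?escritor): one per different bound value
count_distinct = len({writer for writer, _ in rows if writer is not None})
```

Either way, the query returns a single row with a single column, which is what the test for this exercise checks.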

Try it yourself with our previous example:

In [None]:
%%sparql https://dbpedia.org/sparql

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dct:<http://purl.org/dc/terms/>
PREFIX dbc:<http://dbpedia.org/resource/Category:>
PREFIX dbr:<http://dbpedia.org/resource/>
PREFIX dbo:<http://dbpedia.org/ontology/>

# YOUR ANSWER HERE

WHERE {
# YOUR ANSWER HERE
}
LIMIT 10000

In [None]:
assert len(solution()['columns']) == 1
column_name = list(solution()['columns'].keys())[0]
assert int(solution()['columns'][column_name][0]) > 200

## Additional exercises

Find out if there are more DBpedia entries for writers (`dbo:Writer`) than for football players (`dbo:SoccerPlayer`).

Get a list of European countries with a population higher than 20 million, in decreasing order of population, including their URI, name in English and population.

Find the country in the world that speaks the most languages. Show its name in Spanish, if available.

## References

* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).
* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)

## Licence
The notebook is freely licensed under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/).  

© 2023 Universidad Politécnica de Madrid.