{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
"grade": false,
"grade_id": "cell-0bfe38f97f6ab2d2",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"\n",
" \n",
"
Course Notes for Learning Intelligent Systems
\n",
" Department of Telematic Engineering Systems
\n",
" Universidad Politécnica de Madrid
\n",
"
\n",
" \n",
""
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "6a78a7c2cbcad6ec014af585a381f1ff",
"grade": false,
"grade_id": "cell-0cd673883ee592d1",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Introduction to Linked Open Data\n",
"\n",
"This lecture provides a quick introduction to semantic queries in Python using SPARQL.\n",
"SPARQL is aa semantic query language inspired by SQL.\n",
"\n",
"This is the first in a series of notebooks about SPARQL, which consists of:\n",
"\n",
"* This notebook, which introduces basic concepts using a small public dataset.\n",
"* [A notebook with queries to a custom dataset](02_SPARQL_Custom_Endpoint.ipynb), which links to the RDF exercises.\n",
"* [A notebook with queries to DBpedia](03_SPARQL_Writers.ipynb). DBpedia is the semantic version of Wikipedia. It is very useful, as it contains much more data. However, finding the right properties to query can be challenging.\n",
"* [A notebook with more advanced SPARQL concepts](04_SPARQL_Advanced.ipynb), which extends the previous notebook with more advanced concepts, such as regular expressions and dealing with dates."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "bc0ca2e21254707344c60f895cb204b4",
"grade": false,
"grade_id": "cell-10264483046abcc4",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Objectives\n",
"\n",
"* Learning SPARQL and the Linked Data principles by defining queries to answer a set of problems of increasing difficulty\n",
"* Learning how to use integrated SPARQL editors and programming interfaces to SPARQL."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "99aecbad8f94966d92d72dc911d3ff99",
"grade": false,
"grade_id": "cell-4f8492996e74bf20",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Tools\n",
"\n",
"* This notebook\n",
"* External SPARQL editors (optional)\n",
" * YASGUI-GSI http://yasgui.cluster.gsi.dit.upm.es\n",
" * DBpedia virtuoso http://dbpedia.org/sparql\n",
"\n",
"Using the YASGUI-GSI editor has several advantages over other options.\n",
"It features:\n",
"\n",
"* Selection of data source, either by specifying the URL or by selecting from a dropdown menu\n",
"* Interactive query editing\n",
" * A set of pre-defined queries\n",
" * Syntax errors\n",
" * Auto-complete\n",
"* Data visualization\n",
" * Total number of results\n",
" * Different formats (table, pivot table, raw response, etc.)\n",
" * Pagination of results\n",
" * Search and filter results"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "99e3107f9987cdddae7866dded27f165",
"grade": false,
"grade_id": "cell-70ac24910356c3cf",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Instructions\n",
"\n",
"We will be using a semantic server, available at: http://fuseki.cluster.gsi.dit.upm.es/sitc.\n",
"\n",
"This server contains a dataset about [Beatles songs](http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html), which we will query with SPARQL.\n",
"\n",
"We will provide you some example code to get you started, the *question* you will have to answer using SPARQL, a template for the answer.\n",
"\n",
"After every query, you will find some python code to test the results of the query.\n",
"**Make sure you've run the tests before moving to the next exercise**.\n",
"If the test gives you an error, you've probably done something wrong.\n",
"You do not need to understand or modify the test code."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "1d332d3d11fd6b57f0ec0ac3c358c6cb",
"grade": false,
"grade_id": "cell-eb13908482825e42",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"For convenience, the examples in the notebook are executable (using the `%%sparql` magic command), and they are accompanied by some code to test the results.\n",
"If the tests pass, you probably got the answer right.\n",
"\n",
"**Run this line to enable the `%%sparql` magic command.**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "aca7c5538b8fc53e99c92e94e6818c83",
"grade": false,
"grade_id": "cell-b3f3d92fa2100c3d",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"from helpers import sparql, solution, show_photos"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "e896b6560e45d5c385a43aa85e3523c7",
"grade": false,
"grade_id": "cell-04410e75828c388d",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"The `%%sparql` magic command will allow us to use SPARQL inside normal jupyter cells.\n",
"\n",
"For instance, the following code:\n",
"\n",
"```python \n",
"%%sparql http://dbpedia.org/sparql\n",
"\n",
"\n",
"``` \n",
"\n",
"Is the same as `run_query('', endpoint='http://dbpedia.org/sparql')` plus some additional steps, such as saving the results in a nice table format so that they can be used later and storing the results in a variable (`solution()`), which we will use in our tests.\n",
"\n",
"You do not need to worry about it, and **you can always use one of the suggested online editors if you wish**."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "96ca90572d6b275fa515c6b976115257",
"grade": false,
"grade_id": "cell-2a44c0da2c206d01",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"You can also use any other method to write your queries.\n",
"Just make sure to copy the working query back into the notebook so you can test it.\n",
"\n",
"You may find online query editors particularly useful.\n",
"In addition to running queries from your browser, they provide useful features such as syntax highlighting and autocompletion.\n",
"Some examples are:\n",
"\n",
"* DBpedia's virtuoso query editor https://dbpedia.org/sparql\n",
"* A javascript based client hosted at GSI: http://yasgui.cluster.gsi.dit.upm.es/\n",
"\n",
"[^1]: http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "79c60bd3d4c13f380aae5778c5ce7245",
"grade": false,
"grade_id": "cell-d645128d3af18117",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Exercises\n",
"\n",
"The following exercises cover the basics of SPARQL with simple use cases."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "f7428fe79cd33383dfd3b09a0d951b6e",
"grade": false,
"grade_id": "cell-8391a5322a9ad4a7",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"#### First select - Exploring the dataset\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "f6b5da583694dd5cc9326c670830875d",
"grade": false,
"grade_id": "cell-4f56a152e4d70c02",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"Let's start with a simple query to explore the dataset using SPARQL.\n",
"We will get a list of the types of entities in the dataset.\n",
"\n",
"SPARQL syntax is similar to SQL, mixed with turtle.\n",
"A SPARQL query has two main parts: the `SELECT` block, which specifies what variables we want to get; and the `WHERE` block which, loosely speaking, defines how the variables will be obtained from the graph.\n",
"\n",
"In order to construct the `WHERE` block, we have to know the data we want to extract would be represented in Turtle.\n",
"\n",
"In particular, to write an entity and its type, we would write this triple:\n",
"\n",
"```turtle\n",
" a .\n",
"```\n",
"\n",
"For example:\n",
"\n",
"```turtle\n",
"example:Timmy a example:Boy\n",
"```\n",
"\n",
"In SPARQL, the parts that we wish to extract are replaced with a variable (e.g. `?name`, `?type`).\n",
"Hence, we would have something like this:\n",
"\n",
"```turtle\n",
"?entity a ?type\n",
"```\n",
"\n",
"The name of the variable has no effect on the query, but you should use a sensible name.\n",
"In these notebooks, try to use the names provided in the templates, because they might be used in the tests.\n",
"\n",
"There are additional parts in the query.\n",
"For now, we will only cover the `LIMIT` statement, which limits the number of results we will get.\n",
"Using `LIMIT` is usually a good idea, especially when trying new queries, because the dataset may be too big. \n",
"\n",
"Using all these concepts, we will run our first query, to get the list of entities and their type:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "7a9dc62ab639143c9fc13593e50500d4",
"grade": false,
"grade_id": "cell-8ce8c954513f17e7",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"SELECT ?entity ?type\n",
"WHERE {\n",
" ?entity a ?type\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "d6a79c2f5fd005a9e15a8f67dcfd4784",
"grade": false,
"grade_id": "cell-3d6d622c717c3950",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"You can check that the results you got match our expectations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert len(solution()['tuples']) == 10 # Make sure we got 10 results \n",
"assert len(solution()['columns']) >= 1 # In 2 columns (?entity and ?type)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use the same concepts to write a query that gets the **list of entities and their properties**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "3414fee2b6ccfc90a87a62697b35fbda",
"grade": false,
"grade_id": "cell-6e904d692b5facad",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"SELECT DISTINCT ?entity ?prop\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "97bd5d5383bd94a72c7452bc33e4b0f9",
"grade": true,
"grade_id": "cell-3fc0d3c43dfd04a3",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(s['tuples']) >= 100 # There are at least 100 results\n",
"assert 'entity' in s['columns'] # A column named entity exists\n",
"assert 'http://learningsparql.com/ns/musician/RaymondBrown' in s['columns']['entity'] # RaymondBrown is an entity"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting a list of DISTINCT types"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a better grip of the dataset, we will get a list of types.\n",
"\n",
"We may try to do so with a simple query: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"SELECT ?type\n",
"WHERE {\n",
" ?entity a ?type\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, this list has many duplicates.\n",
"In fact, we only get one type (`Musician`).\n",
"\n",
"To remove duplicates, we will need the `DISTINCT` statement, which only shows unique (distinct) rows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"SELECT DISTINCT ?type\n",
"WHERE {\n",
" ?entity a ?type\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We should see only three types now (`Musician`, `Song`, and `Instrument`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert 'type' in solution()['columns']\n",
"assert len(solution()['tuples']) == 3\n",
"assert 'http://learningsparql.com/ns/schema/Musician' in solution()['columns']['type']\n",
"assert 'http://learningsparql.com/ns/schema/Song' in solution()['columns']['type']\n",
"assert 'http://learningsparql.com/ns/schema/Instrument' in solution()['columns']['type']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, **build a query to get the list of unique properties**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "47c4f68e342ffe59a3804de7b6a3909b",
"grade": false,
"grade_id": "cell-e615f9a77c4bc9a5",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"SELECT DISTINCT ?property\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "c9ffeba2d4ffc3e0b95f15a0ec6012c5",
"grade": true,
"grade_id": "cell-9168718938ab7347",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['tuples']) == 182\n",
"assert 'http://learningsparql.com/ns/instrument/bass' in solution()['columns']['property']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Geting all properties for songs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `WHERE` statement can contain more than one line.\n",
"\n",
"For example, we can restrict the list of properties from the previous exercise, to only get properties of musicians:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT DISTINCT ?prop\n",
"WHERE {\n",
" ?song a s:Musician .\n",
" ?song ?prop ?value .\n",
"}\n",
"LIMIT 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There should be two results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"assert len(solution()['tuples']) == 2 # There are exactly two results"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice the use of prefixes, just like in turtle.\n",
"Also, these two options are equivalent:\n",
"\n",
"```turtle\n",
"?song a s:Musician ;\n",
" ?prop ?value .\n",
"\n",
"# And\n",
"\n",
"?song a s:Musician ;\n",
"?song ?prop ?value .\n",
"```\n",
"\n",
"The first one is just shorter to write.\n",
"\n",
"Alternatively, in this example we can also replace the properties we are not using with square brackets `[]`:\n",
"\n",
"```turtle\n",
"[] a s:Musician ;\n",
" ?prop [] .\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use the same concepts to get a list of **songs and properties**, without duplicates:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "8b0faf938efc1a64a70515da3c132605",
"grade": false,
"grade_id": "cell-0223a51f609edcf9",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"# YOUR ANSWER HERE\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 20"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "e93d7336fd125d95996e60fd312a4e4d",
"grade": true,
"grade_id": "cell-3c7943c6382c62f5",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(set(s['tuples'])) == len(s['tuples']) # There are no duplicates\n",
"assert len(s['tuples']) >= 20"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting a list of song names"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous exercise, we saw the properties for Songs.\n",
"One of them is `rdfs:label`, which gives a human readable name for the entity.\n",
"\n",
"Using `rdfs:label`, get a list of song names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "271f2b194c2db4c558a46e8312b593e6",
"grade": false,
"grade_id": "cell-8f43547dd788bb33",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?name\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 20"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "9f1f7cec8ce4674971543728ada86674",
"grade": true,
"grade_id": "cell-e13a1c921af2f6eb",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert 'Besame Mucho' in s['columns']['name']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting an ordered list of songs (ORDER BY)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `ORDER BY` statement allows us to determine the way results will be sorted.\n",
"This makes it easier to find errors, or missing data.\n",
"\n",
"The syntax is the following:\n",
"\n",
"```sparql\n",
"\n",
"SELECT *\n",
"WHERE { ... }\n",
"ORDER BY ... DESC() ASC()\n",
"... other statements like LIMIT ...\n",
"```\n",
"\n",
"The results can be sorted in ascending or descending order, and using several variables."
]
},
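{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged illustration of the syntax above, the following sketch sorts the entity/type pairs from the first exercise, first by type in descending order and then by entity in ascending order. The variable names are only illustrative, and the query is not part of the graded exercises:\n",
"\n",
"```sparql\n",
"SELECT ?entity ?type\n",
"WHERE {\n",
"  ?entity a ?type .\n",
"}\n",
"ORDER BY DESC(?type) ASC(?entity)\n",
"LIMIT 10\n",
"```\n",
"\n",
"When several sort keys are given, later keys are only used to break ties among rows with equal earlier keys."
]
},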
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `ORDER BY` to get a list of songs in **descending order**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "9dcd9c6d51a61ac129cffa06e1463c66",
"grade": false,
"grade_id": "cell-a0f0b9d9b05c9631",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?name\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"LIMIT 50"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "a044b3fd6b8bd4e098bbe4d818cb4e9f",
"grade": true,
"grade_id": "cell-bc012ca9d7ad2867",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(s['tuples']) >= 20\n",
"assert s['columns']['name'][0][0] > s['columns']['name'][-1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get a list of musicians who collaborated in at least one song (Traversing the graph)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From our inspection of the properties in previous exercises, we know that each song has a list of properties that link to musicians, and each musician has a name. For example:\n",
"\n",
"\n",
"```turtle\n",
"song:HeyJude a schema:Song ;\n",
" instrument:guitar musician:RingoStarr .\n",
"\n",
"musician:RingoStarr a schema:Musician ;\n",
" rdfs:label \"Ringo Starr\" .\n",
"```\n",
"\n",
"Using this structure, and the SPARQL statements you already know, to get the **names** of all musicians that collaborated in at least one song.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "7be32a274bb576eb4c154c2737bc5a26",
"grade": false,
"grade_id": "cell-523b963fa4e288d0",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT DISTINCT ?musician\n",
"WHERE {\n",
" ?song a s:Song .\n",
"# YOUR ANSWER HERE\n",
" ]\n",
"}\n",
"ORDER BY ?name"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "c8e3a929faf2afa72207c6921382654c",
"grade": true,
"grade_id": "cell-aa9a4e18d6fda225",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert 'musician' in s['columns']\n",
"assert 'Paul McCartney' in s['columns']['musician']\n",
"assert 'Peter Coe' in s['columns']['musician']\n",
"assert len(solution()['tuples']) >= 200"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### In how many songs did Ringo collaborate? (COUNT)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"Results can be aggregated using different functions.\n",
"One of the simplest functions is `COUNT`.\n",
"The syntax for COUNT is:\n",
" \n",
"```sparql\n",
"SELECT COUNT(?variable) as ?count_name\n",
"```\n",
"\n",
"Use `COUNT` and `GROUP BY` to get a "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "d8419711d2db43ad657e2658a1ea86c4",
"grade": false,
"grade_id": "cell-e89d08031e30b299",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"# YOUR ANSWER HERE\n",
"WHERE {\n",
" ?song a s:Song .\n",
" ?song ?instrument m:RingoStarr .\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "29404e07edf639cdc0ce0d82e654ec31",
"grade": true,
"grade_id": "cell-903d2be00885e1d2",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"assert solution()['columns']['number'][0] == '412'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting the frequency of each instrument (GROUP BY)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Results can be grouped by one or more of the variables.\n",
"\n",
"Grouping is achieved with the `GROUP BY` statement. \n",
"The syntax for `GROUP BY` is:\n",
"\n",
" \n",
"```sparql\n",
"SELECT GROUP BY ?variable1 ?variable2 ...\n",
"```\n",
"\n",
"Once results are grouped, they can be aggregated using any aggregation function, such as `COUNT`.\n",
"\n",
"Using `GROUP BY` and `COUNT`, get the count of songs that use each instrument:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "7a0a7206384e7e1d9eb4450dd9e9871f",
"grade": false,
"grade_id": "cell-1429e4eb5400dbc7",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?instrument (COUNT(?song) as ?number)\n",
"WHERE {\n",
" ?song a s:Song .\n",
" ?song ?instrument m:RingoStarr .\n",
"}\n",
"# YOUR ANSWER HERE\n",
"ORDER BY DESC(?number)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "bd4dc379fea969d513be0ea97ee75922",
"grade": true,
"grade_id": "cell-907aaf6001e27e50",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(s['tuples']) == 37\n",
"assert s['columns']['number'][-1] == '1'\n",
"assert s['columns']['number'][0] == '233'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### How many different instruments are there in every song?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use other keywords inside our aggregation.\n",
"For example, we could use `DISTINCT` to remove duplicates before aggregating.\n",
"\n",
"Here is an example, which shows the number of songs each musician collaborated in.\n",
"It has to use `DISTINCT` because some artists play multiple instruments in a song."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?artist (COUNT(DISTINCT ?song) as ?number)\n",
"WHERE {\n",
" ?artist a s:Musician .\n",
" ?song ?instrument ?artist .\n",
"}\n",
"GROUP BY ?artist\n",
"ORDER BY DESC(?number)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, use the same principle to get the count of **different** instruments in each song.\n",
"Some songs have several musicians playing the same instrument, but we only care about *different* instruments in each song.\n",
"\n",
"Use `?number` for the count."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "4a231b4d6874dad435512b988c17c39e",
"grade": false,
"grade_id": "cell-ee208c762d00da9c",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX rdfs: \n",
"\n",
"# YOUR ANSWER HERE\n",
"WHERE {\n",
" [] a s:Song ;\n",
" rdfs:label ?song ;\n",
" ?instrument ?musician .\n",
"}\n",
"# YOUR ANSWER HERE\n",
"ORDER BY DESC(?number)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "8118099bf14d9f0eb241c4d93ea6f0b9",
"grade": true,
"grade_id": "cell-ddeec32b8ac3d894",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert s['columns']['number'][0] == '27'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Who is the vocalist in every song? (using OPTIONAL)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, we will get a list of songs and their vocalists.\n",
"\n",
"We coul start with this query:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?song ?vocalist\n",
"WHERE {\n",
" ?song a s:Song .\n",
" ?song i:vocals ?vocalist\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, there are some songs that do not have a vocalist (at least, in the dataset).\n",
"Those songs will not appear in the list above, because we they do not match part of the `WHERE` clause.\n",
"\n",
"In these cases, we can specify optional values in a query using the `OPTIONAL` keyword.\n",
"When a set of clauses are inside an OPTIONAL group, the SPARQL endpoint will try to use them in the query.\n",
"If there are no results for that part of the query, the variables it specifies will not be bound (i.e. they will be empty).\n",
"\n",
"To exemplify this, we can use a property that **does not exist in the dataset**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?song ?musician\n",
"WHERE {\n",
" ?song a s:Song .\n",
" OPTIONAL {\n",
" ?song i:a_made_up_instrument ?musician\n",
" }\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the property does not exist, the query will still return all the songs.\n",
"In the column for our instrument, it returns an empty value.\n",
"\n",
"Now, use the same concept, to get a list of the **names** of the vocalists (if any) in each song."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "4b0a0854457c37640aad67f375ed3a17",
"grade": false,
"grade_id": "cell-dcd68c45c1608a28",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?song ?vocalist\n",
"WHERE {\n",
" ?s a s:Song .\n",
" ?s rdfs:label ?song .\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "f7122b2284b5d59d59ce4a2925f0bb21",
"grade": true,
"grade_id": "cell-1e706b9c1c1331bc",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert 'Paul McCartney' in s['columns']['vocalist']\n",
"assert 'Paul McCartney' in s['columns']['vocalist']\n",
"assert ('Besame Mucho', 'Paul McCartney') in s['tuples']\n",
"assert '' in s['columns']['vocalist'] # Some songs do not have a vocalist"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What songs do not have a vocalist? (Bound)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we only want to list those songs that **do not** have a vocalist.\n",
"\n",
"To do so, we can copy the query from the previous exercise, and filter the results with the `BOUND` function.\n",
"\n",
"`BOUND` will return `true` if the variable has a value, and `false` otherwise.\n",
"\n",
"This is very useful for two purposes.\n",
"Firstly, it allows us to look for patterns that **do not occur** in the graph, such as missing properties.\n",
"For instance, we could search for the authors with missing birth information so we can add it.\n",
"Secondly, we can use bound in filters to get conditional filters.\n",
"\n",
"Add a filter below to only get songs without a vocalist:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "09621e7af911faf39a834e8281bc6d1f",
"grade": false,
"grade_id": "cell-0c7cc924a13d792a",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?song\n",
"WHERE {\n",
" ?s a s:Song .\n",
" ?s rdfs:label ?song .\n",
" OPTIONAL {\n",
" ?s i:vocals ?vocalist\n",
" }\n",
"# YOUR ANSWER HERE\n",
"}\n",
"LIMIT 100"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "cebff8ce42f3f36923e81e083a23d24c",
"grade": true,
"grade_id": "cell-2541abc93ab4d506",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(s['tuples']) == 23"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Who played guitar OR bass in the most songs? (Advanced FILTER with GROUP)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this exercise, we want a table with the name of musicians that played either the guitar (`i:guitar`) or the bass (`i:bass`), the instrument they played, and the times they played it.\n",
"\n",
"If a musician played both instruments, it should appear twice."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "ea9797f3b2d001ea41d7fa7a5170d5fb",
"grade": false,
"grade_id": "cell-d750b6d64c6aa0a7",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX rdfs: \n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"\n",
"SELECT ?musician ?instrument (COUNT(DISTINCT ?song) AS ?number)\n",
"WHERE {\n",
" ?song ?ins ?player .\n",
" ?ins rdfs:label ?instrument .\n",
" ?player rdfs:label ?musician .\n",
"# YOUR ANSWER HERE\n",
"}\n",
"# YOUR ANSWER HERE\n",
"\n",
"ORDER BY DESC(?instrument) DESC(?number)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = solution()\n",
"assert ('George Harrison', 'guitar', '27') in s['tuples']\n",
"assert ('Stuart Sutcliffe', 'bass', '3') in s['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Who played the most instruments? (Advanced FILTER II)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, count how many instruments each musician have played in a song.\n",
"\n",
"**Do not count lead (`i:vocals`) or backing vocals (`i:backingvocals`) as instruments**."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "2d82df272d43f678d3b19bf0b41530c1",
"grade": false,
"grade_id": "cell-2f5aa516f8191787",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX rdfs: \n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"\n",
"# YOUR ANSWER HERE\n",
"WHERE {\n",
" ?song ?ins ?player .\n",
" ?ins rdfs:label ?instrument .\n",
" ?player rdfs:label ?musician .\n",
"# YOUR ANSWER HERE\n",
"}\n",
"GROUP BY ?musician\n",
"ORDER BY DESC(?instrument) DESC(?number)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "bc83dd9577c9111b1f0ef5bd40c4ec08",
"grade": true,
"grade_id": "cell-bcd0f7e26b6c11c2",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert ('John Lennon', '52') in s['tuples']\n",
"assert ('Andy White', '2') in s['tuples']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Which songs had Ringo in dums OR Lennon in lead vocals? (UNION)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can merge the results of several queries, just like using `JOIN` in SQL.\n",
"The keyword in SPARQL is `UNION`, because we are merging graphs.\n",
"\n",
"`UNION` is useful in many situations.\n",
"For instance, when there are equivalent properties, or when you want to use two search terms and FILTER would be too inefficient.\n",
"\n",
"The syntax is as follows:\n",
"\n",
"```sparql\n",
"SELECT ?title\n",
"WHERE {\n",
" { ?book dc10:title ?title }\n",
" UNION\n",
" { ?book dc11:title ?title }\n",
" \n",
" ... REST OF YOUR QUERY ...\n",
"\n",
"}\n",
"```"
]
},
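{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a complete, hedged sketch, the branches of a `UNION` can use different properties and still bind the same variable. The `ex:` vocabulary below is hypothetical and only serves as an illustration:\n",
"\n",
"```sparql\n",
"PREFIX ex: <http://example.org/>\n",
"\n",
"SELECT DISTINCT ?book\n",
"WHERE {\n",
"  # Books written by Alice...\n",
"  { ?book ex:author ex:Alice }\n",
"  UNION\n",
"  # ...or edited by Bob\n",
"  { ?book ex:editor ex:Bob }\n",
"}\n",
"```\n",
"\n",
"`DISTINCT` removes the duplicates that appear when a book matches both branches."
]
},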
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "a1e20e2be817a592683dea89eed0120e",
"grade": false,
"grade_id": "cell-d3a742bd87d9c793",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX rdfs: \n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"\n",
"SELECT DISTINCT ?song\n",
"WHERE {\n",
"# YOUR ANSWER HERE\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "087630476d73bb415b065fafbd6024f0",
"grade": true,
"grade_id": "cell-409402df0e801d09",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"assert len(solution()['tuples']) == 246"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### In how many songs has each musician collaborated at least 10 times? (HAVING)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can filter results after an aggregation, using the `HAVING` statement.\n",
"Its syntax is:\n",
" \n",
"\n",
"```sparql\n",
"SELECT ...\n",
"WHERE ...\n",
"GROUP BY ...\n",
"HAVING ()\n",
"```\n",
"\n",
"e.g.\n",
"\n",
"```sparql\n",
"HAVING (?count > 10)\n",
"```\n",
"\n",
"Use this new statement to get the list of artists that played at least 10 times with the Beatlest, and the number of times they did:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "1d2cb88412c89c35861a4f9fccea3bf2",
"grade": false,
"grade_id": "cell-9d1ec854eb530235",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"\n",
"PREFIX rdfs: \n",
"\n",
"SELECT ?musician (COUNT(DISTINCT ?song) AS ?number) \n",
"WHERE {\n",
" ?song ?instrument [\n",
" rdfs:label ?musician \n",
" ]\n",
"}\n",
"GROUP BY ?musician\n",
"# YOUR ANSWER HERE\n",
"ORDER BY DESC(?number)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "aa20aa4d11632ea5bd6004df3187d979",
"grade": true,
"grade_id": "cell-a79c688b4566dbe8",
"locked": true,
"points": 0,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"s = solution()\n",
"assert len(s['tuples']) == 7\n",
"assert s['columns']['musician'][0] == 'Paul McCartney'\n",
"assert s['columns']['musician'][-1] == 'Mal Evans'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## **Optional** exercises"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"These are additional exercises that can be solved with more advanced concepts.\n",
"\n",
"If you are curious, you could also check the notebook on Advanced SPARQL concepts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What instruments could each musician play? (GROUP_CONCAT)\n",
"\n",
"\n",
"Another option to aggregate results is to concatenate them.\n",
"You can do so with:\n",
"\n",
"```sparql\n",
"GROUP_CONCAT(?name; separator=\",\")\n",
"```\n",
"\n",
"Using `GROUP_CONCAT`, get a list of the instruments that each musician could play."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "508b7f8656e849838aa93cd38f1c6635",
"grade": false,
"grade_id": "cell-7ea1f5154cdd8324",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"PREFIX rdfs: \n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"\n",
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What types of vocals are there? (REGEX)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In one of the exercises, we excluded lead and backing vocals from the list of instruments.\n",
"However, are those the only types of vocals?\n",
"\n",
"You can check if a string or URI matches a regular expression with `regex(?variable, \"\")`."
]
},
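{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch, the query below lists the properties whose URI contains the word `title`, ignoring case. It makes no assumption about the vocabulary of the dataset, and it uses `STR` to turn each URI into a string before matching:\n",
"\n",
"```sparql\n",
"SELECT DISTINCT ?property\n",
"WHERE {\n",
"  ?s ?property ?o .\n",
"  # The optional third argument \"i\" makes the match case-insensitive\n",
"  FILTER regex(STR(?property), \"title\", \"i\")\n",
"}\n",
"```"
]
},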
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "cff1f9c034393f8af055e1f930d5fe32",
"grade": false,
"grade_id": "cell-b6bee887a1b1fc60",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/sitc/\n",
"PREFIX rdfs: \n",
"PREFIX s: \n",
"PREFIX i: \n",
"PREFIX m: \n",
"\n",
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [SPARQL queries of Beatles recording sessions](http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html)\n",
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2018 Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}