1
0
mirror of https://github.com/gsi-upm/sitc synced 2024-12-22 11:48:12 +00:00
sitc/lod/00_SPARQL_Tutorial.ipynb

485 lines
17 KiB
Plaintext
Raw Normal View History

2021-02-22 11:55:40 +00:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](files/images/EscUpmPolit_p.gif \"UPM\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<header style=\"width:100%;position:relative\">\n",
" <div style=\"width:80%;float:right;\">\n",
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
" <h3>Department of Telematic Engineering Systems</h3>\n",
" <h5>Universidad Politécnica de Madrid. © Carlos A. Iglesias </h5>\n",
" </div>\n",
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
"</header>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"\n",
2021-02-22 16:32:31 +00:00
"This lecture provides an introduction to RDF and the SPARQL query language.\n",
2021-02-22 11:55:40 +00:00
"\n",
"This is the first in a series of notebooks about SPARQL, which consists of:\n",
"\n",
2021-02-22 16:32:31 +00:00
"* This notebook, which explains basic concepts of RDF and SPARQL\n",
"* [A notebook](01_SPARQL_Introduction.ipynb) that provides an introduction to SPARQL through a collection of exercises of increasing difficulty.\n",
"* [An optional notebook](02_SPARQL_Custom_Endpoint.ipynb) with queries to a custom dataset.\n",
"The dataset is meant to be done after the [RDF exercises](../rdf/RDF.ipynb) and it is out of the scope of this course.\n",
"You can consult it if you are interested."
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# RDF basics\n",
"This section is taken from [[1](#1), [2](#2)].\n",
"\n",
"RDF allows us to make statements about resources. The format of these statements is simple. A statement always has the following structure:\n",
"\n",
" <subject> <predicate> <object>\n",
" \n",
2021-02-22 16:32:31 +00:00
"An RDF statement expresses a relationship between two resources. The **subject** and the **object** represent the two resources being related; the **predicate** represents the nature of their relationship.\n",
"The relationship is phrased in a directional way (from subject to object).\n",
"In RDF this relationship is known as a **property**.\n",
"Because RDF statements consist of three elements they are called **triples**.\n",
2021-02-22 11:55:40 +00:00
"\n",
2021-02-22 16:32:31 +00:00
"Here are some examples of RDF triples (informally expressed in pseudocode):\n",
2021-02-22 11:55:40 +00:00
"\n",
" <Bob> <is a> <person>.\n",
" <Bob> <is a friend of> <Alice>.\n",
" \n",
2021-02-22 16:32:31 +00:00
"Resources are identified by [IRIs](https://en.wikipedia.org/wiki/Internationalized_Resource_Identifier), which can appear in all three positions of a triple. For example, the IRI for Leonardo da Vinci in DBpedia is:\n",
2021-02-22 11:55:40 +00:00
"\n",
" <http://dbpedia.org/resource/Leonardo_da_Vinci>\n",
"\n",
"IRIs can be abbreviated as *prefixed names*. For example, \n",
" PREFIX dbr: <http://dbpedia.org/resource/>\n",
" <dbr:Leonardo_da_Vinci>\n",
" \n",
"Objects can be literals: \n",
" * strings (e.g., \"plain string\" or \"string with language\"@en)\n",
" * numbers (e.g., \"13.4\"^^xsd:float)\n",
" * dates (e.g., )\n",
" * booleans\n",
" * etc.\n",
" \n",
2021-02-22 16:32:31 +00:00
"RDF data is stored in RDF repositories that expose SPARQL endpoints.\n",
"Let's query one of the most famous RDF repositories: [dbpedia](https://wiki.dbpedia.org/).\n",
"First, we should learn how to execute SPARQL in a notebook."
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Executing SPARQL in a notebook\n",
2021-02-22 16:32:31 +00:00
"There are several ways to execute SPARQL in a notebook.\n",
"Some of the most popular are:\n",
"\n",
2021-02-22 11:55:40 +00:00
"* using libraries such as [sparql-client](https://pypi.org/project/sparql-client/) or [rdflib](https://rdflib.dev/sparqlwrapper/) that enable executing SPARQL within a Python3 kernel\n",
"* using other libraries. In our case, a light library has been developed (the file helpers.py) for accessing SPARQL endpoints using an HTTP connection.\n",
2021-02-22 16:32:31 +00:00
"* using the [graph notebook package](https://pypi.org/project/graph-notebook/)\n",
"* using a SPARQL kernel [sparql kernel](https://github.com/paulovn/sparql-kernel) instead of the Python3 kernel\n",
2021-02-22 11:55:40 +00:00
"\n",
"\n",
2021-02-22 16:32:31 +00:00
"We are going to use the second option to avoid installing new packages.\n",
"\n",
"To use the library, you need to:\n",
"\n",
"1. Import `sparql` from helpers (i.e., `helpers.py`, a file that is available in the github repository)\n",
"2. Use the `%%sparql` magic command to indicate the SPARQL endpoint and then the SPARQL code.\n",
2021-02-22 11:55:40 +00:00
"\n",
"Let's try it!\n",
"\n",
"# Queries agains DBPedia\n",
"\n",
2021-02-22 16:32:31 +00:00
"We are going to execute a SPARQL query against DBPedia. This section is based on [[8](#8)].\n",
2021-02-22 11:55:40 +00:00
"\n",
2021-02-22 16:32:31 +00:00
"First, we just create a query to retrieve arbitrary triples (subject, predicate, object) without any restriction (besides limiting the result to 10 triples)."
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
"outputs": [],
"source": [
"from helpers import sparql"
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
2023-02-13 17:26:14 +00:00
"%%sparql https://dbpedia.org/sparql\n",
2021-02-22 11:55:40 +00:00
"\n",
"SELECT ?s ?p ?o\n",
"WHERE {\n",
" ?s ?p ?o\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, it worked, but the results are not particulary interesting. \n",
2021-02-22 16:32:31 +00:00
"Let's search for a famous football player, Fernando Torres.\n",
"\n",
"To do so, we will search for entities whose English \"human-readable representation\" (i.e., label) matches \"Fernando Torres\":"
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
2023-02-13 17:26:14 +00:00
"%%sparql https://dbpedia.org/sparql\n",
2021-02-22 11:55:40 +00:00
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?athlete rdfs:label \"Fernando Torres\"@en \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"Great, we found the IRI of the node: `http://dbpedia.org/resource/Fernando_Torres`\n",
"\n",
"Now we can start asking for more properties.\n",
"\n",
"To do so, go to http://dbpedia.org/resource/Fernando_Torres and you will see all the information available about Fernando Torres. Pay attention to the names of predicates to be able to create new queries. For example, we are interesting in knowing where Fernando Torres was born (`dbo:birthPlace`).\n",
2021-02-22 11:55:40 +00:00
"\n",
"Let's go!"
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?athlete rdfs:label \"Fernando Torres\"@en ;\n",
" dbo:birthPlace ?birthPlace . \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"If we examine the SPARQL query, we find three blocks:\n",
"\n",
"* **PREFIX** section: IRIs of vocabularies and the prefix used below, to avoid long IRIs. e.g., by defining the `dbo` prefix in our example, the `dbo:birthPlace` below expands to `http://dbpedia.org/ontology/birthPlace`.\n",
"* **SELECT** section: variables we want to return (`*` is an abbreviation that selects all of the variables in a query)\n",
"* **WHERE** clause: triples where some elements are variables. These variables are bound during the query processing process and bounded variables are returned.\n",
2021-02-22 11:55:40 +00:00
"\n",
2021-02-22 16:32:31 +00:00
"Now take a closer look at the **WHERE** section.\n",
"We said earlier that triples are made out of three elements and each triple pattern should finish with a period (`.`) (although the last pattern can omit this).\n",
"However, when two or more triple patterns share the same subject, we omit it all but the first one, and use ` ;` as separator.\n",
"If if both the subject and predicate are the same, we could use a coma `,` instead.\n",
"This allows us to avoid repetition and make queries more readable.\n",
"But don't forget the space before your separators (`;` and `.`).\n",
2021-02-22 11:55:40 +00:00
"\n",
"The result is interesting, we know he was born in Fuenlabrada, but we see an additional (wrong) value, the Spanish national football team. The conversion process from Wikipedia to DBPedia should still be tuned :).\n",
"\n",
2021-02-22 16:32:31 +00:00
"We can *fix* it, by adding some more constaints.\n",
"In our case, only want a birth place that is also a municipality (i.e., its type is `http://dbpedia.org/resource/Municipalities_of_Spain`).\n",
"Let's see!"
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?athlete rdfs:label \"Fernando Torres\"@en ;\n",
" dbo:birthPlace ?birthPlace .\n",
" ?birthPlace dbo:type dbr:Municipalities_of_Spain \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great. Now it looks better.\n",
2021-02-22 16:32:31 +00:00
"Notice that we added a new prefix.\n",
2021-02-22 11:55:40 +00:00
"\n",
2021-02-22 16:32:31 +00:00
"Now, is Fuenlabrada is a big city?\n",
"Let's find out.\n",
"\n",
"**Hint**: you can find more subject / object / predicate nodes related to [Fuenlabrada])http://dbpedia.org/resource/Fuenlabrada) in the RDF graph just as we did before.\n",
"That is how we found the `dbo:areaTotal` property."
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbr: <http://dbpedia.org/resource/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" dbr:Fuenlabrada dbo:areaTotal ?area \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"Well, it shows 39.1 km$^2$.\n",
"\n",
"Let's go back to our Fernando Torres.\n",
"What we are really insterested in is the name of the city he was born in, not its IRI.\n",
"As we saw before, the human-readable name is provided by the `rdfs:label` property:"
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?player rdfs:label \"Fernando Torres\"@en ;\n",
" dbo:birthPlace ?birthPlace .\n",
" ?birthPlace dbo:type dbr:Municipalities_of_Spain ;\n",
" rdfs:label ?placeName \n",
" \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, we are almost there. We see that we receive the city name in many languages. We want just the English name. Let's filter!"
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?player rdfs:label \"Fernando Torres\"@en ;\n",
" dbo:birthPlace ?birthPlace .\n",
" ?birthPlace dbo:type dbr:Municipalities_of_Spain ;\n",
" rdfs:label ?placeName .\n",
" FILTER ( LANG ( ?placeName ) = 'en' )\n",
" \n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"Awesome!\n",
"\n",
"But we said we don't care about the IRI of the place. We only want two pieces of data: Fernando's birth date and the name of his birthplace.\n",
"\n",
"Let's tune our query a bit more."
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
"\n",
"SELECT ?birthDate, ?placeName\n",
"WHERE\n",
" {\n",
" ?player rdfs:label \"Fernando Torres\"@en ;\n",
" dbo:birthDate ?birthDate ;\n",
" dbo:birthPlace ?birthPlace .\n",
" ?birthPlace dbo:type dbr:Municipalities_of_Spain ;\n",
" rdfs:label ?placeName .\n",
" FILTER ( LANG ( ?placeName ) = 'en' )\n",
" \n",
" }"
]
},
{
2021-02-22 16:32:31 +00:00
"cell_type": "markdown",
2021-02-22 11:55:40 +00:00
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"Great 😃\n",
"\n",
"Are there many football players born in Fuenlabrada? Let's find out!"
2021-02-22 11:55:40 +00:00
]
},
{
"cell_type": "code",
2021-02-22 16:32:31 +00:00
"execution_count": null,
2021-02-22 11:55:40 +00:00
"metadata": {},
2021-02-22 16:32:31 +00:00
"outputs": [],
2021-02-22 11:55:40 +00:00
"source": [
"%%sparql https://dbpedia.org/sparql\n",
"\n",
"PREFIX dbo: <http://dbpedia.org/ontology/>\n",
"PREFIX dbp: <http://dbpedia.org/property/>\n",
"\n",
"SELECT *\n",
"WHERE\n",
" {\n",
" ?player a dbo:SoccerPlayer ; \n",
2021-02-22 16:32:31 +00:00
" dbo:birthPlace dbr:Fuenlabrada . \n",
2021-02-22 11:55:40 +00:00
" }"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
2021-02-22 16:32:31 +00:00
"Well, not that many. Observe we have used `a`.\n",
"It is just an abbreviation for `rdf:type`, both can be used interchangeably.\n",
2021-02-22 11:55:40 +00:00
"\n",
"If you want additional examples, you can follow the notebook by [Shawn Graham](https://github.com/o-date/sparql-and-lod/blob/master/sparql-intro.ipynb), which is based on the SPARQL tutorial by Matthew Lincoln, available [here in English](https://programminghistorian.org/en/lessons/retired/graph-databases-and-SPARQL) and [here in Spanish](https://programminghistorian.org/es/lecciones/retirada/sparql-datos-abiertos-enlazados]). You have also a local copy of these tutorials together with this notebook [here in English](https://htmlpreview.github.io/?https://github.com/gsi-upm/sitc/blob/master/lod/tutorial/graph-databases-and-SPARQL.html) and [here in Spanish](https://htmlpreview.github.io/?https://github.com/gsi-upm/sitc/blob/master/lod/tutorial/sparql-datos-abiertos-enlazados.html). \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* <a id=\"1\">[1]</a> [SPARQL by Example. A Tutorial. Lee Feigenbaum. W3C, 2009](https://www.w3.org/2009/Talks/0615-qbe/#q1)\n",
"* <a id=\"2\">[2]</a> [RDF Primer W3C](https://www.w3.org/TR/rdf11-primer/)\n",
"* <a id=\"3\">[3]</a> [SPARQL queries of Beatles recording sessions](http://www.snee.com/bobdc.blog/2017/11/sparql-queries-of-beatles-reco.html)\n",
"* <a id=\"4\">[4]</a> [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
"* <a id=\"5\">[5]</a> [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)\n",
"* <a id=\"6\">[6]</a> [RDF Graph Data Model. Learn about the RDF graph model used by Stardog.](https://www.stardog.com/tutorials/data-model)\n",
"* <a id=\"7\">[7]</a> [Learn SPARQL Write Knowledge Graph queries using SPARQL with step-by-step examples.](https://www.stardog.com/tutorials/sparql/)\n",
"* <a id=\"8\">[8]</a> [Running Basic SPARQL Queries Against DBpedia.](https://medium.com/virtuoso-blog/dbpedia-basic-queries-bc1ac172cc09)\n",
"* <a id=\"8\">[9]</a> [Intro SPARQL based on painters.](https://github.com/o-date/sparql-and-lod/blob/master/sparql-intro.ipynb)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© Carlos A. Iglesias, Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"datacleaner": {
"position": {
"top": "50px"
},
"python": {
"varRefreshCmd": "try:\n print(_datacleaner.dataframe_metadata())\nexcept:\n print([])"
},
"window_display": false
},
"kernelspec": {
2023-02-13 17:26:14 +00:00
"display_name": "Python 3 (ipykernel)",
2021-02-22 11:55:40 +00:00
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
2023-02-13 17:26:14 +00:00
"version": "3.8.10"
2021-02-22 11:55:40 +00:00
},
"latex_envs": {
"LaTeX_envs_menu_present": true,
"autocomplete": true,
"bibliofile": "biblio.bib",
"cite_by": "apalike",
"current_citInitial": 1,
"eqLabelWithNumbers": true,
"eqNumInitial": 1,
"hotkeys": {
"equation": "Ctrl-E",
"itemize": "Ctrl-I"
},
"labels_anchors": false,
"latex_user_defs": false,
"report_style_numbering": false,
"user_envs_cfg": false
}
},
"nbformat": 4,
"nbformat_minor": 1
}