mirror of
https://github.com/gsi-upm/sitc
synced 2024-12-22 11:48:12 +00:00
477 lines
11 KiB
Plaintext
477 lines
11 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "markdown",
|
|
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
|
|
"grade": false,
|
|
"grade_id": "cell-0bfe38f97f6ab2d2",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"<header style=\"width:100%;position:relative\">\n",
|
|
" <div style=\"width:80%;float:right;\">\n",
|
|
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
|
|
" <h3>Department of Telematic Engineering Systems</h3>\n",
|
|
" <h5>Universidad Politécnica de Madrid</h5>\n",
|
|
" </div>\n",
|
|
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
|
|
"</header>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "markdown",
|
|
"checksum": "42642609861283bc33914d16750b7efa",
|
|
"grade": false,
|
|
"grade_id": "cell-0cd673883ee592d1",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Introduction\n",
|
|
"\n",
|
|
"In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
|
|
"\n",
|
|
"In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
|
|
"\n",
|
|
"The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
|
|
"\n",
|
|
"Hence, there are two parts.\n",
|
|
"First, you will query a set of graphs annotated by students of this course.\n",
|
|
"Then, you will query a synthetic dataset that contains similar information."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "markdown",
|
|
"checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
|
|
"grade": false,
|
|
"grade_id": "cell-10264483046abcc4",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Objectives\n",
|
|
"\n",
|
|
"* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
|
|
"* Understanding the challenges in querying multiple sources, with different annotators.\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "markdown",
|
|
"checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
|
|
"grade": false,
|
|
"grade_id": "cell-4f8492996e74bf20",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Tools\n",
|
|
"\n",
|
|
"See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"cell_type": "markdown",
|
|
"checksum": "c5f8646518bd832a47d71f9d3218237a",
|
|
"grade": false,
|
|
"grade_id": "cell-eb13908482825e42",
|
|
"locked": true,
|
|
"schema_version": 3,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"Run this line to enable the `%%sparql` magic command."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from helpers import *"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exercises\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
|
|
"The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
|
|
"\n",
|
|
"**Each graph is a separate set of triples**.\n",
|
|
"For this exercise, you could think of graphs as individual endpoints.\n",
|
|
"\n",
|
|
"\n",
|
|
"First, let us get a list of graphs available:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.gsi.upm.es/hotels\n",
|
|
" \n",
|
|
"SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
|
|
" GRAPH ?g {\n",
|
|
" ?s ?p ?o\n",
|
|
" }\n",
|
|
"}\n",
|
|
"GROUP BY ?g\n",
|
|
"ORDER BY desc(?count)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"You should see many graphs, with different triple counts.\n",
|
|
"\n",
|
|
"The biggest one should be http://fuseki.gsi.upm.es/synthetic"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Once you have this list, you can query specific graphs like so:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.gsi.upm.es/hotels\n",
|
|
" \n",
|
|
"SELECT *\n",
|
|
"WHERE {\n",
|
|
" GRAPH <http://fuseki.gsi.upm.es/synthetic>{\n",
|
|
" ?s ?p ?o .\n",
|
|
" }\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"There are two exercises in this notebook.\n",
|
|
"\n",
|
|
"In each of them, you are asked to run five queries, to answer the following questions:\n",
|
|
"\n",
|
|
"* Number of hotels (or entities) with reviews\n",
|
|
"* Number of reviews\n",
|
|
"* The hotel with the lowest average score\n",
|
|
"* The hotel with the highest average score\n",
|
|
"* A list of hotels with their addresses and telephone numbers"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Manually annotated data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
|
|
"\n",
|
|
"To design the queries, what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
|
|
"\n",
|
|
"Here's an example to get the entities and their types in a graph:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.gsi.upm.es/hotels\n",
|
|
"\n",
|
|
"PREFIX schema: <http://schema.org/>\n",
|
|
" \n",
|
|
"SELECT ?s ?o\n",
|
|
"WHERE {\n",
|
|
" GRAPH <http://fuseki.gsi.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
|
|
" ?s a ?o .\n",
|
|
" }\n",
|
|
"\n",
|
|
"}\n",
|
|
"LIMIT 40"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Synthetic dataset\n",
|
|
"\n",
|
|
"Now, run the same queries in the synthetic dataset.\n",
|
|
"\n",
|
|
"The query below should get you started:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.gsi.upm.es/hotels\n",
|
|
" \n",
|
|
"SELECT *\n",
|
|
"WHERE {\n",
|
|
" GRAPH <http://fuseki.gsi.upm.es/synthetic>{\n",
|
|
" ?s ?p ?o .\n",
|
|
" }\n",
|
|
"}\n",
|
|
"LIMIT 10"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Optional exercise\n",
|
|
"\n",
|
|
"\n",
|
|
"Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
|
|
"\n",
|
|
"Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
|
|
"\n",
|
|
"You can also query all the graphs at the same time. e.g. to get all types used:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%sparql http://fuseki.gsi.upm.es/hotels\n",
|
|
"\n",
|
|
"PREFIX schema: <http://schema.org/>\n",
|
|
" \n",
|
|
"SELECT DISTINCT ?o\n",
|
|
"WHERE {\n",
|
|
" GRAPH ?g {\n",
|
|
" ?s a ?o .\n",
|
|
" }\n",
|
|
" {\n",
|
|
" SELECT ?g\n",
|
|
" WHERE {\n",
|
|
" GRAPH ?g {}\n",
|
|
" FILTER (str(?g) != 'http://fuseki.gsi.upm.es/synthetic')\n",
|
|
" }\n",
|
|
" }\n",
|
|
"\n",
|
|
"\n",
|
|
"}\n",
|
|
"LIMIT 50"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Discussion\n",
|
|
"\n",
|
|
"Compare the results of the synthetic and the manual dataset, and answer these questions:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Both datasets should use the same schema. Are there any differences when it comes to using them?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "860c3977cd06736f1342d535944dbb63",
|
|
"grade": true,
|
|
"grade_id": "cell-9bd08e4f5842cb89",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 3,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Are the annotations used correctly in every graph?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "1946a7ed4aba8d168bb3fad898c05651",
|
|
"grade": true,
|
|
"grade_id": "cell-9dc1c9033198bb18",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 3,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Has any of the datasets been harder to query? If so, why?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"cell_type": "code",
|
|
"checksum": "6714abc5226618b76dc4c1aaed6d1a49",
|
|
"grade": true,
|
|
"grade_id": "cell-6c18003ced54be23",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 3,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## References"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
|
|
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Licence\n",
|
|
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
|
|
"\n",
|
|
"© Universidad Politécnica de Madrid."
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.8.10"
|
|
},
|
|
"toc": {
|
|
"base_numbering": 1,
|
|
"nav_menu": {},
|
|
"number_sections": true,
|
|
"sideBar": true,
|
|
"skip_h1_title": false,
|
|
"title_cell": "Table of Contents",
|
|
"title_sidebar": "Contents",
|
|
"toc_cell": false,
|
|
"toc_position": {},
|
|
"toc_section_display": true,
|
|
"toc_window_display": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|