1
0
mirror of https://github.com/gsi-upm/sitc synced 2025-01-09 20:41:27 +00:00

Compare commits

..

No commits in common. "a4f8f69b191150bec43e47d25d9cf79ec4f4122d" and "993749021323ef7b5a658cd9b28c1b4be8e8dab6" have entirely different histories.

9 changed files with 2003 additions and 2369 deletions

File diff suppressed because it is too large Load Diff

View File

@ -1,459 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "7276f055a8c504d3c80098c62ed41a4f",
"grade": false,
"grade_id": "cell-0bfe38f97f6ab2d2",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"<header style=\"width:100%;position:relative\">\n",
" <div style=\"width:80%;float:right;\">\n",
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
" <h3>Department of Telematic Engineering Systems</h3>\n",
" <h5>Universidad Politécnica de Madrid</h5>\n",
" </div>\n",
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
"</header>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "42642609861283bc33914d16750b7efa",
"grade": false,
"grade_id": "cell-0cd673883ee592d1",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Introduction\n",
"\n",
"In the previous notebook, we learnt how to use SPARQL by querying DBpedia.\n",
"\n",
"In this notebook, we will use SPARQL on manually annotated data. The data was collected as part of a [previous exercise](../lod/).\n",
"\n",
"The goal is to try SPARQL with data annotated by users with limited knowledge of vocabularies and semantics, and to compare the experience with similar queries to a more structured dataset.\n",
"\n",
"Hence, there are two parts.\n",
"First, you will query a set of graphs annotated by students of this course.\n",
"Then, you will query a synthetic dataset that contains similar information."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "a3ecb4b300a5ab82376a4a8cb01f7e6b",
"grade": false,
"grade_id": "cell-10264483046abcc4",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Objectives\n",
"\n",
"* Experiencing the usefulness of the Linked Open Data initiative by querying data from different RDF graphs and endpoints\n",
"* Understanding the challenges in querying multiple sources, with different annotators.\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "2fedf0d73fc90104d1ab72c3413dfc83",
"grade": false,
"grade_id": "cell-4f8492996e74bf20",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Tools\n",
"\n",
"See [the SPARQL notebook](./01_SPARQL_Introduction.ipynb#Tools)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "c5f8646518bd832a47d71f9d3218237a",
"grade": false,
"grade_id": "cell-eb13908482825e42",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"Run this line to enable the `%%sparql` magic command."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from helpers import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercises\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Querying the manually annotated dataset will be slightly different from querying DBpedia.\n",
"The main difference is that this dataset uses different graphs to separate the annotations from different students.\n",
"\n",
"**Each graph is a separate set of triples**.\n",
"For this exercise, you could think of graphs as individual endpoints.\n",
"\n",
"\n",
"First, let us get a list of graphs available:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
" \n",
"SELECT ?g (COUNT(?s) as ?count) WHERE {\n",
" GRAPH ?g {\n",
" ?s ?p ?o\n",
" }\n",
"}\n",
"GROUP BY ?g\n",
"ORDER BY desc(?count)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"You should see many graphs, with different triple counts.\n",
"\n",
"The biggest one should be http://fuseki.cluster.gsi.dit.upm.es/synthetic"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once you have this list, you can query specific graphs like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
" \n",
"SELECT *\n",
"WHERE {\n",
" GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
" ?s ?p ?o .\n",
" }\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are two exercises in this notebook.\n",
"\n",
"In each of them, you are asked to run five queries, to answer the following questions:\n",
"\n",
"* Number of hotels (or entities) with reviews\n",
"* Number of reviews\n",
"* The hotel with the lowest average score\n",
"* The hotel with the highest average score\n",
"* A list of hotels with their addresses and telephone numbers"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manually annotated data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Your task is to design five queries to answer the questions in the description, and run each of them in at least three graphs, other than the `synthetic` graph.\n",
"\n",
"To design the queries, you can either use what you know about the schema.org vocabularies, or explore subjects, predicates and objects in each of the graphs.\n",
"\n",
"You will get a better understanding if you follow the exploratory path.\n",
"\n",
"Here's an example to get the entities and their types in a graph:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
"\n",
"PREFIX schema: <http://schema.org/>\n",
" \n",
"SELECT ?s ?o\n",
"WHERE {\n",
" GRAPH <http://fuseki.cluster.gsi.dit.upm.es/35c20a49f8c6581be1cf7bd56d12d131>{\n",
" ?s a ?o .\n",
" }\n",
"\n",
"}\n",
"LIMIT 40"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Synthetic dataset\n",
"\n",
"Now, run the same queries in the synthetic dataset.\n",
"\n",
"The query below should get you started:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
" \n",
"SELECT *\n",
"WHERE {\n",
" GRAPH <http://fuseki.cluster.gsi.dit.upm.es/synthetic>{\n",
" ?s ?p ?o .\n",
" }\n",
"}\n",
"LIMIT 10"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional exercise\n",
"\n",
"\n",
"Explore the graphs and find the most typical mistakes (e.g. using `http://schema.org/Hotel/Hotel`).\n",
"\n",
"Tip: You can use normal SPARQL queries with `BOUND` and `REGEX` to check if the annotations are correct.\n",
"\n",
"You can also query all the graphs at the same time. e.g. to get all types used:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%sparql http://fuseki.cluster.gsi.dit.upm.es/hotels\n",
"\n",
"PREFIX schema: <http://schema.org/>\n",
" \n",
"SELECT DISTINCT ?o\n",
"WHERE {\n",
" GRAPH ?g {\n",
" ?s a ?o .\n",
" }\n",
" {\n",
" SELECT ?g\n",
" WHERE {\n",
" GRAPH ?g {}\n",
" FILTER (str(?g) != 'http://fuseki.cluster.gsi.dit.upm.es/synthetic')\n",
" }\n",
" }\n",
"\n",
"\n",
"}\n",
"LIMIT 50"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Discussion\n",
"\n",
"Compare the results of the synthetic and the manual dataset, and answer these questions:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both datasets should use the same schema. Are there any differences when it comes to using them?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "860c3977cd06736f1342d535944dbb63",
"grade": true,
"grade_id": "cell-9bd08e4f5842cb89",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Are the annotations used correctly in every graph?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "1946a7ed4aba8d168bb3fad898c05651",
"grade": true,
"grade_id": "cell-9dc1c9033198bb18",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Has any of the datasets been harder to query? If so, why?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "6714abc5226618b76dc4c1aaed6d1a49",
"grade": true,
"grade_id": "cell-6c18003ced54be23",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* [RDFLib documentation](https://rdflib.readthedocs.io/en/stable/).\n",
"* [Wikidata Query Service query examples](https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Licence\n",
"The notebook is freely licensed under under the [Creative Commons Attribution Share-Alike license](https://creativecommons.org/licenses/by/2.0/). \n",
"\n",
"© 2018 Universidad Politécnica de Madrid."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

11
lod/README.md Normal file
View File

@ -0,0 +1,11 @@
# Files included #
* `validate.py` validates and serializes a turtle dataset
* `sparql.py` runs a custom sparql query on a given dataset (by default, `reviews.ttl`)
* `extract_data.py` extracts RDFa, micro-data and JSON-LD data from a given URL
# Installation #
```
pip install --user -r requirements.txt
```

1880
lod/SPARQL.ipynb Normal file

File diff suppressed because it is too large Load Diff

49
lod/extract_data.py Normal file
View File

@ -0,0 +1,49 @@
import sys
from future.standard_library import install_aliases
install_aliases()
from urllib import request, parse
from rdflib import Graph, term
from lxml import etree
if len(sys.argv) < 2:
print('Usage: python {} <URL>'.format(sys.argv[0]))
print('')
print('Extract rdfa, microdata and json-ld annotations from a website')
exit(1)
url = sys.argv[1]
g = Graph()
g.parse(url, format='rdfa')
g.parse(url, format='microdata')
def sanitize_triple(t):
"""Function to remove bad URIs from the graph that would otherwise
make the serialization fail."""
def sanitize_triple_item(item):
if isinstance(item, term.URIRef) and '/' not in item:
return term.URIRef(parse.quote(str(item)))
return item
return (sanitize_triple_item(t[0]),
sanitize_triple_item(t[1]),
sanitize_triple_item(t[2]))
with request.urlopen(url) as response:
# Get all json-ld objects embedded in the html file
html = response.read().decode('utf-8', errors='ignore')
parser = etree.XMLParser(recover=True)
root = etree.fromstring(html, parser=parser)
if root:
for jsonld in root.findall(".//script[@type='application/ld+json']"):
g.parse(data=jsonld.text, publicID=url, format='json-ld')
fixedgraph = Graph()
fixedgraph += [sanitize_triple(s) for s in g]
print(g.serialize(format='turtle').decode('utf-8', errors='ignore'))

View File

@ -1,22 +1,12 @@
'''
Helper functions and ipython magic for the SPARQL exercises.
The tests in the notebooks rely on the `LAST_QUERY` variable, which is updated by the `%%sparql` magic after every query.
This variable contains the full query used (`LAST_QUERY["query"]`), the endpoint it was sent to (`LAST_QUERY["endpoint"]`), and a dictionary with the response of the endpoint (`LAST_QUERY["results"]`).
For convenience, the results are also given as tuples (`LAST_QUERY["tuples"]`), and as a dictionary of of `{column:[values]}` (`LAST_QUERY["columns"]`).
'''
from IPython.core.magic import (register_line_magic, register_cell_magic, from IPython.core.magic import (register_line_magic, register_cell_magic,
register_line_cell_magic) register_line_cell_magic)
from IPython.display import HTML, display, Image, display_javascript
from IPython.display import HTML, display, Image
from urllib.request import Request, urlopen from urllib.request import Request, urlopen
from urllib.parse import quote_plus, urlencode from urllib.parse import quote_plus, urlencode
from urllib.error import HTTPError from urllib.error import HTTPError
import json import json
import sys
js = "IPython.CodeCell.options_default.highlight_modes['magic_sparql'] = {'reg':[/^%%sparql/]};"
display_javascript(js, raw=True)
def send_query(query, endpoint): def send_query(query, endpoint):
@ -30,11 +20,7 @@ def send_query(query, endpoint):
headers={'content-type': 'application/x-www-form-urlencoded', headers={'content-type': 'application/x-www-form-urlencoded',
'accept': FORMATS}, 'accept': FORMATS},
method='POST') method='POST')
res = urlopen(r) return json.loads(urlopen(r).read().decode('utf-8'));
data = res.read().decode('utf-8')
if res.getcode() == 200:
return json.loads(data)
raise Exception('Error getting results: {}'.format(data))
def tabulate(tuples, header=None): def tabulate(tuples, header=None):
@ -53,14 +39,11 @@ def tabulate(tuples, header=None):
LAST_QUERY = {} LAST_QUERY = {}
def solution():
return LAST_QUERY
def query(query, endpoint=None, print_table=False): def query(query, endpoint=None, print_table=False):
global LAST_QUERY global LAST_QUERY
endpoint = endpoint or "http://fuseki.cluster.gsi.dit.upm.es/sitc/" endpoint = endpoint or "http://dbpedia.org/sparql"
results = send_query(query, endpoint) results = send_query(query, endpoint)
tuples = to_table(results) tuples = to_table(results)
@ -97,30 +80,12 @@ def to_table(results):
@register_cell_magic @register_cell_magic
def sparql(line, cell): def sparql(line, cell):
'''
Sparql magic command for ipython. It can be used in a cell like this:
```
%%sparql
... Your SPARQL query ...
```
by default, it will use the DBpedia endpoint, but you can use a different endpoint like this:
```
%%sparql http://my-sparql-endpoint...
... Your SPARQL query ...
```
'''
try: try:
return query(cell, endpoint=line, print_table=True) return query(cell, endpoint=line, print_table=True)
except HTTPError as ex: except HTTPError as ex:
error_message = ex.read().decode('utf-8') error_message = ex.read().decode('utf-8')
print('Error {}. Reason: {}'.format(ex.status, ex.reason)) print('Error {}. Reason: {}'.format(ex.status, ex.reason))
print(error_message, file=sys.stderr) print(error_message)
def show_photos(values): def show_photos(values):

29
lod/reviews.ttl Normal file
View File

@ -0,0 +1,29 @@
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
_:Hotel1 a schema:Hotel ;
schema:description "A fictitious hotel" .
_:Review1 a schema:Review ;
schema:reviewBody "This is a great review" ;
schema:reviewRating [
a schema:Rating ;
schema:author <http://jfernando.es/me> ;
schema:ratingValue "0.7"
] ;
schema:itemReviewed _:Hotel1 .
_:Review2 a schema:Review ;
schema:reviewBody "This is a not so great review" ;
schema:reviewRating [
a schema:Rating ;
schema:author [ a schema:Person ;
schema:givenName "anonymous" ] ;
schema:ratingValue "0.3"
] ;
schema:itemReviewed _:Hotel1 .

23
lod/sparql.py Normal file
View File

@ -0,0 +1,23 @@
# !/bin/env python #
# Ejemplo de consultas SPARQL sobre turtle #
# python consultas.py #
import rdflib
import sys
dataset = sys.argv[1] if len(sys.argv) > 1 else 'reviews.ttl'
g = rdflib.Graph()
schema = rdflib.Namespace("http://schema.org/")
# Read Turtle file #
g.parse(dataset, format='turtle')
results = g.query(
"""SELECT DISTINCT ?review ?p ?o
WHERE {
?review a schema:Review.
?review ?p ?o.
}""", initNs={'schema': schema})
for row in results:
print("%s %s %s" % row)

6
lod/validate.py Normal file
View File

@ -0,0 +1,6 @@
import rdflib
import sys
g = rdflib.Graph()
dataset = sys.argv[1] if len(sys.argv) > 1 else 'reviews.ttl'
g.parse(dataset, format="n3")
print(g.serialize(format="n3").decode('utf-8'))