mirror of https://github.com/gsi-upm/sitc
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "1fba29f718bbaa14890b305223712474",
|
|
"grade": false,
|
|
"grade_id": "cell-2bd9e19ffed99f81",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"<header style=\"width:100%;position:relative\">\n",
|
|
" <div style=\"width:80%;float:right;\">\n",
|
|
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
|
|
" <h3>Department of Telematic Engineering Systems</h3>\n",
|
|
" <h5>Universidad Politécnica de Madrid</h5>\n",
|
|
" </div>\n",
|
|
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
|
|
"</header>"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "59c5cb46c9d722f691206e766e5af557",
|
|
"grade": false,
|
|
"grade_id": "cell-51338a0933103db9",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"# Introduction\n",
|
|
"\n",
|
|
"The goal of this exercise is to understand the usefulness of semantic annotation and the Linked Open Data initiative, by solving a practical use case.\n",
|
|
"\n",
|
|
"The student will achieve the goal through:\n",
|
|
"\n",
|
|
"* Analyzing the sequence of tasks required to generate and publish semantic data\n",
|
|
"* Extending their knowledge using the set of additional documents and specifications\n",
|
|
"* Creating a partial semantic definition using the Turtle format\n",
|
|
"\n",
|
|
"\n",
|
|
"# Objectives\n",
|
|
"\n",
|
|
"The main objective is to learn how annotations can be unified on the web, by following the Linked Data principles.\n",
|
|
"\n",
|
|
"\n",
|
|
"These concepts will be applied in a practical use case: obtaining a Graph of information about hotels and reviews about them.\n",
|
|
"\n",
|
|
"\n",
|
|
"# Tools\n",
|
|
"\n",
|
|
"This notebook is self-contained, but it requires some python libraries.\n",
|
|
"To install them, simply run the following line"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "387f9c38b548f29b56ae5ef5ae76fd4f",
|
|
"grade": false,
|
|
"grade_id": "cell-d7f1ea9c021693b8",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Requirement already satisfied: future in /home/cif/anaconda3/lib/python3.5/site-packages (from -r requirements.txt (line 1)) (0.16.0)\n",
|
|
"Requirement already satisfied: rdflib in /home/cif/anaconda3/lib/python3.5/site-packages (from -r requirements.txt (line 2)) (4.0.1)\n",
|
|
"Requirement already satisfied: rdflib-jsonld in /home/cif/.local/lib/python3.5/site-packages (from -r requirements.txt (line 3)) (0.4.0)\n",
|
|
"Requirement already satisfied: lxml in /home/cif/anaconda3/lib/python3.5/site-packages (from -r requirements.txt (line 4)) (4.2.4)\n",
|
|
"Requirement already satisfied: html5lib in /home/cif/anaconda3/lib/python3.5/site-packages (from -r requirements.txt (line 5)) (1.0.1)\n",
|
|
"Requirement already satisfied: isodate in /home/cif/anaconda3/lib/python3.5/site-packages (from rdflib->-r requirements.txt (line 2)) (0.5.4)\n",
|
|
"Requirement already satisfied: pyparsing in /home/cif/anaconda3/lib/python3.5/site-packages (from rdflib->-r requirements.txt (line 2)) (2.2.0)\n",
|
|
"Requirement already satisfied: webencodings in /home/cif/anaconda3/lib/python3.5/site-packages (from html5lib->-r requirements.txt (line 5)) (0.5.1)\n",
|
|
"Requirement already satisfied: six>=1.9 in /home/cif/anaconda3/lib/python3.5/site-packages (from html5lib->-r requirements.txt (line 5)) (1.11.0)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"!pip install --user -r requirements.txt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Linked Data, RDF and Turtle\n",
|
|
"\n",
|
|
"\n",
|
|
"The term [Linked Data](https://www.w3.org/wiki/LinkedData) refers to a set of best practices for publishing structured data on the Web.\n",
|
|
"These principles have been coined by Tim Berners-Lee in the design issue note Linked Data.\n",
|
|
"The principles are:\n",
|
|
"\n",
|
|
"1. Use URIs as names for things\n",
|
|
"2. Use HTTP URIs so that people can look up those names\n",
|
|
"3. When someone looks up a URI, provide useful information\n",
|
|
"4. Include links to other URIs, so that they can discover more things\n",
|
|
"\n",
|
|
"The [RDF](https://www.w3.org/RDF/) is a standard model for data interchange on the Web.\n",
|
|
"It formalizes some concepts behind Linked Data into a specification, which can be used to develop applications and store information.\n",
|
|
"\n",
|
|
"Explaining RDF is out of the scope of this notebook.\n",
|
|
"The [resources section](#Useful-resources) contains some links if you wish to learn about RDF.\n",
|
|
"\n",
|
|
"The main idea behind RDF is that information is encoded in the form of triples:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"<subject> <predicate> <object>\n",
|
|
"```\n",
|
|
"\n",
|
|
"Each of these, (`<subject>`, `<predicate>` and `<object>`) should be unique identifiers.\n",
|
|
"\n",
|
|
"For example, to say Timmy is a 7 year-old whose dog is Tobby, we would write:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"<http://example.org/Timmy> <http://example.org/hasDog> <http://example.org/Tobby>\n",
|
|
"<http://example.org/Timmy> <http://example.org/age> 7\n",
|
|
"```\n",
|
|
"\n",
|
|
"Note that we are not referring to \"any Timmy\", but to a *very specific* Timmy.\n",
|
|
"We could learn more about this particular boy using that URI.\n",
|
|
"The same goes for the dog, and for the concept of \"having a dog\", which we unambiguously encode as `<http://example.org/hasDog>`.\n",
|
|
"This concept may be described as taking care of a dog, for example, whereas a different property `<http://yourwebsite.com/hasDog>` could be described as being the legal owner of the dog.\n",
|
|
"\n",
|
|
"\n",
|
|
"RDF can be used to embed annotation in many places, including HTML document, using any compatible format.\n",
|
|
"The options include including RDFa, XML, JSON-LD and [Turtle](https://www.w3.org/TR/turtle/).\n",
|
|
"\n",
|
|
"\n",
|
|
"In the exercises, we will be using turtle notation, because it is very readable.\n",
|
|
"\n",
|
|
"Here's an example of document in Turtle, taken from the Turtle specification:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"@base <http://example.org/> .\n",
|
|
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
|
|
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
|
|
"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n",
|
|
"@prefix rel: <http://www.perceive.net/schemas/relationship/> .\n",
|
|
"\n",
|
|
"<#green-goblin>\n",
|
|
" rel:enemyOf <#spiderman> ;\n",
|
|
" a foaf:Person ; # in the context of the Marvel universe\n",
|
|
" foaf:name \"Green Goblin\" .\n",
|
|
"\n",
|
|
"<#spiderman>\n",
|
|
" rel:enemyOf <#green-goblin> ;\n",
|
|
" a foaf:Person ;\n",
|
|
" foaf:name \"Spiderman\", \"Человек-паук\"@ru .\n",
|
|
"```\n",
|
|
"\n",
|
|
"\n",
|
|
"The second exercise will show you how to extract this information from any website.\n",
|
|
"\n",
|
|
"As you can observe in these examples, Turtle defines several ways to specify IRIs in a result. Please, consult the specification for further details. As an overview, IRIs can be:\n",
|
|
" * *relative IRIs*: IRIs resolved relative to the current base IRI. Thus, you should define a base IRI (@base <http://example.org>) and then relative IRIs (i.e. <#spiderman>). The resulting IRI is <http://example.org/spiderman>.\n",
|
|
" * *prefixed names*: a prefixed name (i.e. foaf:Person) is transformed into an IRI by concatenating the IRI of the prefix (@prefix foaf: <http://xmlns.com/foaf/0.1) and the local part of the prefixed name (i.e. Person). So, the resulting IRI is <http://xmlns.com/foaf/0.1/Person\n",
|
|
" * *absolute IRIs*: an already resolved IRI, p.ej. <http://example.com/Auto>."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Vocabularies and schema.org\n",
|
|
"\n",
|
|
"Concepts (predicates, types, etc.) can be defined in vocabularies.\n",
|
|
"These vocabularies can be reused in several applications.\n",
|
|
"In the example above, we used the concept of person from an external vocabulary (`foaf:Person`, i.e. http://xmlns.com/foaf/0.1/Person).\n",
|
|
"That way, we do not need to redefine the concept of Person in every application.\n",
|
|
"There are several well known vocabularies, such as:\n",
|
|
"\n",
|
|
"* Dublin core, for metadata: http://dublincore.org/\n",
|
|
"* FOAF (Friend-of-a-friend) for social networks: http://www.foaf-project.org/\n",
|
|
"* SIOC for online communities: https://www.w3.org/Submission/sioc-spec/\n",
|
|
"\n",
|
|
"Using the same vocabularies also makes it easier to automatically process and classify information.\n",
|
|
"\n",
|
|
"\n",
|
|
"That was the motivation behind Schema.org, a collaboration between Google, Microsoft, Yahoo and Yandex.\n",
|
|
"They aim to provide schemas for structured data annotation of Web sites, e-mails, etc., which can be leveraged by search engines and other automated processes.\n",
|
|
"\n",
|
|
"They rely on RDF for representation, and provide a set of common vocabularies that can be shared by every web developer.\n",
|
|
"\n",
|
|
"\n",
|
|
"There are thousands of properties in the schema.org vocabulary, and they offer a very comprehensive documentation.\n",
|
|
"\n",
|
|
"As an example, this is the documentation for hotels:\n",
|
|
"\n",
|
|
"* List of properties for the Hotel type: https://schema.org/Hotel\n",
|
|
"* Documentation for hotels: https://schema.org/docs/hotels.html\n",
|
|
"\n",
|
|
"\n",
|
|
"You can use the documentation to find properties (e.g. `checkinTime`), as well as the type of that property (e.g. `Datetime`)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "fe9a246ba580c71385e9b83d414a1216",
|
|
"grade": false,
|
|
"grade_id": "cell-a1b60daabb1a9d00",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"# Exercises"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "63879c425ec11742c95c728a578d109e",
|
|
"grade": false,
|
|
"grade_id": "cell-d9289e96b2b0f265",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Instructions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "e0b6464bce9263fb35543acf4acb31da",
|
|
"grade": false,
|
|
"grade_id": "cell-bb418e9bae1fef1a",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"First of all, run the line below.\n",
|
|
"It will import everything you need for the exercises."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "bf98cea45f42e3d0f1ab158693b40da7",
|
|
"grade": false,
|
|
"grade_id": "cell-4a1b60bd9974bbb1",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"application/javascript": [
|
|
"IPython.CodeCell.options_default.highlight_modes['magic_turtle'] = {'reg':[/^%%ttl/]};"
|
|
]
|
|
},
|
|
"metadata": {},
|
|
"output_type": "display_data"
|
|
}
|
|
],
|
|
"source": [
|
|
"from helpers import *\n",
|
|
"from rdflib import term, RDF, Namespace"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "3e23398d5277f2db2b3b5fb84f9623d6",
|
|
"grade": true,
|
|
"grade_id": "cell-da88c2f8170436fe",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"You have to fill in the parts marked:\n",
|
|
"\n",
|
|
"```\n",
|
|
"# YOUR ANSWER HERE\n",
|
|
"```\n",
|
|
"\n",
|
|
"To make sure everything is working, try the following example.\n",
|
|
"The solution is:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n",
|
|
"@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .\n",
|
|
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
|
|
"\n",
|
|
"<http://purl.org/net/bsletten> \n",
|
|
" a foaf:Person;\n",
|
|
" foaf:interest <http://www.w3.org/2000/01/sw/>;\n",
|
|
" foaf:based_near [\n",
|
|
" geo:lat \"34.0736111\" ;\n",
|
|
" geo:lon \"-118.3994444\"\n",
|
|
" ] .\n",
|
|
"```\n",
|
|
"\n",
|
|
"Fill in the answer and run the test code.\n",
|
|
"\n",
|
|
"This order (%%ttl) is a so-called magic cell command to execute a function. You can read more here https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "69182e8fadb9c9751f76786e0fcb8803",
|
|
"grade": false,
|
|
"grade_id": "cell-808cfcbf3891f39f",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/markdown": [
|
|
"Error on line ?\n",
|
|
"\n",
|
|
"Reason: No plugin registered for (ttl, <class 'rdflib.parser.Parser'>)\n",
|
|
"\n",
|
|
"If you don't know what this error means, try an online validator: http://ttl.summerofcode.be/\n"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.Markdown object>"
|
|
]
|
|
},
|
|
"execution_count": 4,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"%%ttl example\n",
|
|
"\n",
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "5982ca82090e267401af135ca1f371a8",
|
|
"grade": true,
|
|
"grade_id": "cell-23e61b9f48d597fc",
|
|
"locked": true,
|
|
"points": 1,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"g = solution('example')\n",
|
|
"test('Some triples have been loaded',\n",
|
|
" len(g))\n",
|
|
"test('A person has been defined',\n",
|
|
" g.subjects(RDF.type, term.URIRef('http://xmlns.com/foaf/0.1/Person')))\n",
|
|
"print('All tests passed. Well done!')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "a64acf02625b48b3c65b6e1bc1ba6c1a",
|
|
"grade": false,
|
|
"grade_id": "cell-e73f1933742f7ab3",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Exercise 1: Definition of a Hotel"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"We will define some basic information about a hotel, and some reviews.\n",
|
|
"This should be the same type of information that some aggregators (e.g. TripAdvisor) offer in their websites.\n",
|
|
"\n",
|
|
"Namely, you need to define at least two hotels (you may add more than one), with the following information:\n",
|
|
"* Description\n",
|
|
"* Address\n",
|
|
"* Contact information\n",
|
|
"* City and country (location)\n",
|
|
"* Email\n",
|
|
"* logo\n",
|
|
"* Opening hours\n",
|
|
"* Price range\n",
|
|
"* Amenities (optional)\n",
|
|
"* Geolocation (optional)\n",
|
|
"* Images (optional)\n",
|
|
"\n",
|
|
"You should also add at least three reviews about hotels, with the following information:\n",
|
|
"* Name of the user that reviewed the Hotel\n",
|
|
"* Rating\n",
|
|
"* Date\n",
|
|
"* Replies by other users (optional)\n",
|
|
"* Aspects rated in each review (cleanliness, staff, etc...) (optional)\n",
|
|
"* Information about the user (name, surname, date the account was created) (optional)\n",
|
|
"\n",
|
|
"\n",
|
|
"You can check any hotel website for inspiration, like this [review of a hotel in TripAdvisor](https://www.tripadvisor.es/Hotel_Review-g1437655-d1088667-Reviews-Hotel_Spa_La_Salve-Torrijos_Province_of_Toledo_Castile_La_Mancha.html)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"To make sure we are following Principles 1 and 2, we should use URIs that can be queried.\n",
|
|
"For the sake of this exercise, you can use the made-up `http://example/sitc/` as base for our URIs.\n",
|
|
"Hence, the URIs of our hotels will look like this: `http://example/sitc/my-fancy-hotel`.\n",
|
|
"These URIs can not be queried, **and should not be used in real annotations**, but we will see how to fix that in a future exercise.\n",
|
|
"\n",
|
|
"\n",
|
|
"We will use the vocabularies defined in https://schema.org e.g.:\n",
|
|
" * https://schema.org/Review defines properties about reviews\n",
|
|
" * https://schema.org/Hotel defines properties about hotels\n",
|
|
" \n",
|
|
"\n",
|
|
"Your definition has to be included in the following cell.\n",
|
|
"\n",
|
|
"So, your task is:\n",
|
|
"* Search the relevant properties of the vocabulary schema.org to represent the attributes of both reviews and hotels.\n",
|
|
"* Write two resources of type Hotel and three resources of type Review.\n",
|
|
"* Check that your syntax is correct, by executing your code in the cell below.\n",
|
|
"\n",
|
|
"**Tip**: Define the schema prefix first, to avoid repeating `<http://schema.org/...>`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "44f8be14db3d3e42b5b85f0485206346",
|
|
"grade": false,
|
|
"grade_id": "definition",
|
|
"locked": false,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"%%ttl hotel\n",
|
|
"\n",
|
|
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
|
|
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
|
|
"@prefix sitc: <http://example/sitc/> .\n",
|
|
"\n",
|
|
"\n",
|
|
"<http://example/sitc/GSIHOTEL> a <http://schema.org/Hotel> ;\n",
|
|
" <http://schema.org/description> \"This is just an example to get you started.\" .\n",
|
|
"\n",
|
|
"\n",
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "4f54963163a64f46058c86be139e5543",
|
|
"grade": true,
|
|
"grade_id": "definition-tests",
|
|
"locked": true,
|
|
"points": 10,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"g = solution('hotel')\n",
|
|
"test('Some triples are loaded',\n",
|
|
" len(g))\n",
|
|
"\n",
|
|
"hotels = set(g.subjects(RDF.type, schema['Hotel']))\n",
|
|
"test('At least 2 hotels are loaded',\n",
|
|
" hotels,\n",
|
|
" 2,\n",
|
|
" atLeast)\n",
|
|
"\n",
|
|
"for hotel in hotels:\n",
|
|
" if 'GSIHOTEL' in hotel: # Do not check the example hotel\n",
|
|
" continue\n",
|
|
" props = g.predicates(hotel)\n",
|
|
" test('Each hotel has all required properties',\n",
|
|
" props,\n",
|
|
" list(schema[i] for i in ['description', 'email', 'logo', 'priceRange']),\n",
|
|
" func=containsAll)\n",
|
|
"\n",
|
|
"reviews = set(g.subjects(RDF.type, schema['Review']))\n",
|
|
"test('At least 3 reviews are loaded',\n",
|
|
" reviews,\n",
|
|
" 3,\n",
|
|
" atLeast)\n",
|
|
"\n",
|
|
"for review in reviews:\n",
|
|
" props = g.predicates(review)\n",
|
|
" test('Each review has all required properties',\n",
|
|
" props,\n",
|
|
" list(schema[i] for i in ['itemReviewed', 'reviewBody', 'reviewRating']),\n",
|
|
" func=containsAll)\n",
|
|
" ratings = list(g.objects(review, schema['reviewRating']))\n",
|
|
" for rating in ratings:\n",
|
|
" value = g.value(rating, schema['ratingValue'])\n",
|
|
" test('The review should have ratings', value)\n",
|
|
"\n",
|
|
"authors = set(g.objects(None, schema['author']))\n",
|
|
"for author in authors:\n",
|
|
" for prop in g.predicates(author, None):\n",
|
|
" if 'name' in str(prop).lower():\n",
|
|
" break\n",
|
|
"else:\n",
|
|
" assert \"At least a reviewer has a name (surname, givenName...)\"\n",
|
|
"\n",
|
|
"print('All tests passed. Congratulations!')\n",
|
|
"print()\n",
|
|
"print('Now you can try to add the optional properties')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Exercise 2: Explore existing data"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"The goal of this exercise is to explore and compare annotations from existing websites.\n",
|
|
"\n",
|
|
"Semantic annotations are very useful on the web, because they allow `robots` to extract information about resources, and how they relate to other resources.\n",
|
|
"\n",
|
|
"For example, `schema.org` annotations on a website allow Google to show summaries and useful information (e.g. price and location of a hotel) in their results.\n",
|
|
"A similar technology powers their knowledge graph and the \"related search\". i.e. when you look for a famous actor, it will first show you their filmography, and a list of related actors.\n",
|
|
"\n",
|
|
"The information has to be provided using the official standards (RDF), to comply with the 3rd principle of linked data.\n",
|
|
"\n",
|
|
"To follow the 4<sup>th</sup> principle of linked data, the annotations should include links to known sources (e.g. DBpedia) whenever possible."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"Let us explore some semantic annotations from popular websites.\n",
|
|
"\n",
|
|
"First, start with hotel reviews and websites. Here are some examples:\n",
|
|
"\n",
|
|
"* TripAdvisor hotels\n",
|
|
"* Trivago\n",
|
|
"* Kayak\n",
|
|
"* Specific hotel reviews\n",
|
|
"\n",
|
|
"\n",
|
|
"These are just two examples:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Could not get rdfa data\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/markdown": [
|
|
"\n",
|
|
"Results:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"@prefix ns1: <http://purl.org/dc/terms/> .\n",
|
|
"@prefix ns2: <http://www.w3.org/ns/rdfa#> .\n",
|
|
"@prefix ns3: <http://www.w3.org/2006/http#> .\n",
|
|
"@prefix ns4: <http://www.w3.org/ns/md#> .\n",
|
|
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
|
|
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
|
|
"@prefix xml: <http://www.w3.org/XML/1998/namespace> .\n",
|
|
"@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n",
|
|
"\n",
|
|
"<http://www.hotellasalve.com/> ns4:item () .\n",
|
|
"\n",
|
|
"[] a ns2:Error ;\n",
|
|
" ns1:date \"2019-02-13T17:42:13.310135\"^^xsd:dateTime ;\n",
|
|
" ns1:description \"__init__() got an unexpected keyword argument 'encoding'\" ;\n",
|
|
" ns2:context [ a ns3:Request ;\n",
|
|
" ns3:requestURI \"http://www.hotellasalve.com/\" ],\n",
|
|
" [ a ns3:Response ;\n",
|
|
" ns3:responseCode <http://www.w3.org/2006/http#400> ] .\n",
|
|
"\n",
|
|
"\n",
|
|
"```\n"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.Markdown object>"
|
|
]
|
|
},
|
|
"execution_count": 5,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"print_data('http://www.hotellasalve.com/')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Could not get rdfa data\n",
|
|
"https://photos.mandarinoriental.com/is/content/MandarinOriental/RZMAD - Madrid/Logos/hotel-ritz-hotel-logo-SVG.svg does not look like a valid URI, trying to serialize this will break.\n",
|
|
"tel:+34 91 701 67 67 does not look like a valid URI, trying to serialize this will break.\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/markdown": [
|
|
"\n",
|
|
"Results:\n",
|
|
"\n",
|
|
"```turtle\n",
|
|
"@prefix ns1: <http://schema.org/> .\n",
|
|
"@prefix ns2: <http://www.w3.org/2006/http#> .\n",
|
|
"@prefix ns3: <http://purl.org/dc/terms/> .\n",
|
|
"@prefix ns4: <http://www.w3.org/ns/rdfa#> .\n",
|
|
"@prefix ns5: <http://www.w3.org/ns/md#> .\n",
|
|
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
|
|
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
|
|
"@prefix xml: <http://www.w3.org/XML/1998/namespace> .\n",
|
|
"@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n",
|
|
"\n",
|
|
"<https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel> ns5:item ( [ a ns1:Hotel ;\n",
|
|
" ns1:address [ a ns1:PostalAddress ;\n",
|
|
" ns1:addressCountry \"Spain\"@en ;\n",
|
|
" ns1:addressLocality \"Madrid\"@en ;\n",
|
|
" ns1:postalCode \"28014\"@en ;\n",
|
|
" ns1:streetAddress \"Plaza de la Lealtad 5\"@en ] ;\n",
|
|
" ns1:description \"Experience our 5 Star hotel in central Madrid, Retiro Park offering luxurious rooms and suites, fine dining, private spa, meeting and wedding facilities.\"@en ;\n",
|
|
" ns1:email <mailto:reservations@mohg.com> ;\n",
|
|
" ns1:image <https%3A//photos.mandarinoriental.com/is/content/MandarinOriental/RZMAD%20-%20Madrid/Logos/hotel-ritz-hotel-logo-SVG.svg> ;\n",
|
|
" ns1:name \"Hotel Ritz, Madrid\"@en ;\n",
|
|
" ns1:tel <tel%3A%2B34%2091%20701%2067%2067> ;\n",
|
|
" ns1:url <https://www.google.com/maps/place/Hotel+Ritz,+Madrid/@40.4156097,-3.6946249,773m/data=!3m2!1e3!4b1!4m5!3m4!1s0xd42288329bef061:0xb9bba45ac90e2184!8m2!3d40.4156056!4d-3.6924362>,\n",
|
|
" <https://www.mandarinoriental.com/> ] ) ;\n",
|
|
" ns4:usesVocabulary ns1: .\n",
|
|
"\n",
|
|
"[] a ns4:Error ;\n",
|
|
" ns3:date \"2019-02-13T17:43:52.577508\"^^xsd:dateTime ;\n",
|
|
" ns3:description \"__init__() got an unexpected keyword argument 'encoding'\" ;\n",
|
|
" ns4:context [ a ns2:Request ;\n",
|
|
" ns2:requestURI \"https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel\" ],\n",
|
|
" [ a ns2:Response ;\n",
|
|
" ns2:responseCode <http://www.w3.org/2006/http#400> ] .\n",
|
|
"\n",
|
|
"\n",
|
|
"```\n"
|
|
],
|
|
"text/plain": [
|
|
"<IPython.core.display.Markdown object>"
|
|
]
|
|
},
|
|
"execution_count": 6,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"print_data('https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "a29112f51cc3299c7cae27841feb7410",
|
|
"grade": false,
|
|
"grade_id": "cell-9bf9c7d7516fae75",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"Once you've extracted and analyzed different sources, answer the following questions:\n",
|
|
"\n",
|
|
"\n",
|
|
"### Questions:\n",
|
|
"\n",
|
|
"What type of data do they offer?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "2a7a6ab7d69f7ca5db64233128260045",
|
|
"grade": true,
|
|
"grade_id": "cell-17508ecf96884653",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "531928c9e3b8462baddd4d700c240995",
|
|
"grade": false,
|
|
"grade_id": "cell-d36826d6323c96e8",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"What vocabularyes and ontologies do they use?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "f100004eceae0c8159ade9d713af47e7",
|
|
"grade": true,
|
|
"grade_id": "cell-17508ecf96884655",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"What are the similarities between sites"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "3d9ad086580ee27d93395dac8c16551d",
|
|
"grade": true,
|
|
"grade_id": "cell-30797c9ac87cc7e1",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"What are the similarities between sites"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "4c03ad45eb1234cadccab2b468a69123",
|
|
"grade": true,
|
|
"grade_id": "answer-similarities",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"What are the biggest differences"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "6ccc2db2be4826a146a6c34bc54f00de",
|
|
"grade": true,
|
|
"grade_id": "cell-17508ecf96884657",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "33e1ec78415c85a795e86211d88316c2",
|
|
"grade": false,
|
|
"grade_id": "cell-5f922dc14ad3236a",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"Are all properties from Exercise 1 given by the websites? What's missing?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"nbgrader": {
|
|
"checksum": "e0b4d9f1a2dfe5a7ab835f7349aa3796",
|
|
"grade": true,
|
|
"grade_id": "answer-missing",
|
|
"locked": false,
|
|
"points": 0,
|
|
"schema_version": 1,
|
|
"solution": true
|
|
}
|
|
},
|
|
"source": [
|
|
"# YOUR ANSWER HERE"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "26eb04e562aa6c7d29efa8318982a337",
|
|
"grade": false,
|
|
"grade_id": "cell-7a3c1553c4d6a9b7",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"## Optional\n",
|
|
"\n",
|
|
"There is nothing special about review sites.\n",
|
|
"You can get information about any website.\n",
|
|
"\n",
|
|
"Verify this running checking:\n",
|
|
"\n",
|
|
"* News sites: e.g. https://edition.cnn.com/\n",
|
|
"* CMS: e.g. http://www.etsit.upm.es\n",
|
|
"* Twitter profiles: e.g. https://www.twitter.com/cif\n",
|
|
"* Mastodon (a Twitter alternative) profiles: e.g. https://mastodon.social/@Gargron/\n",
|
|
"* Twitter status pages: e.g. http://mobile.twitter.com/TBLInternetBot/status/1054438951237312514\n",
|
|
"* Mastodon (a Twitter alternative) status pages: e.g. https://mastodon.social/@Gargron/101202440923902326\n",
|
|
"* Wikipedia entries: e.g. https://es.wikipedia.org/wiki/Tim_Berners-Lee\n",
|
|
"* Facebook groups: e.g. https://www.facebook.com/universidadpolitecnicademadrid/"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"print_data('https://mastodon.social/@Gargron')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"deletable": false,
|
|
"editable": false,
|
|
"nbgrader": {
|
|
"checksum": "cffc12120c51a7d994063f66d788570a",
|
|
"grade": false,
|
|
"grade_id": "cell-ec8df1a53c3d3f23",
|
|
"locked": true,
|
|
"schema_version": 1,
|
|
"solution": false
|
|
}
|
|
},
|
|
"source": [
|
|
"# Useful resources\n",
|
|
"\n",
|
|
"* TTL validator: http://ttl.summerofcode.be/\n",
|
|
"* RDF-turtle specification: https://www.w3.org/TR/turtle/\n",
|
|
"* Schema.org documentation: https://schema.org\n",
|
|
"* Wikipedia entry on the Turtle syntax: https://en.wikipedia.org/wiki/Turtle_(syntax)\n",
|
|
"* RDFLib, the most popular python library for RDF (we use it in the tests): https://rdflib.readthedocs.io/"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Bibliography\n",
|
|
"\n",
|
|
"* W3C website on Linked Data: https://www.w3.org/wiki/LinkedData\n",
|
|
"* W3C website on RDF: https://www.w3.org/RDF/\n",
|
|
"* Turtle W3C recommendation: https://www.w3.org/TR/turtle/"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.5.5"
|
|
},
|
|
"latex_envs": {
|
|
"LaTeX_envs_menu_present": true,
|
|
"autocomplete": true,
|
|
"bibliofile": "biblio.bib",
|
|
"cite_by": "apalike",
|
|
"current_citInitial": 1,
|
|
"eqLabelWithNumbers": true,
|
|
"eqNumInitial": 1,
|
|
"hotkeys": {
|
|
"equation": "Ctrl-E",
|
|
"itemize": "Ctrl-I"
|
|
},
|
|
"labels_anchors": false,
|
|
"latex_user_defs": false,
|
|
"report_style_numbering": false,
|
|
"user_envs_cfg": false
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|