mirror of https://github.com/gsi-upm/sitc synced 2024-12-25 04:58:13 +00:00

Add RDF/Turtle exercise

J. Fernando Sánchez 2019-02-13 17:51:18 +01:00
parent 8913c5ecde
commit a6670235ba
4 changed files with 1024 additions and 1 deletions


@ -1872,7 +1872,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
"version": "3.7.2"
}
},
"nbformat": 4,

875
rdf/RDF.ipynb Normal file

@ -0,0 +1,875 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "1fba29f718bbaa14890b305223712474",
"grade": false,
"grade_id": "cell-2bd9e19ffed99f81",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"<header style=\"width:100%;position:relative\">\n",
" <div style=\"width:80%;float:right;\">\n",
" <h1>Course Notes for Learning Intelligent Systems</h1>\n",
" <h3>Department of Telematic Engineering Systems</h3>\n",
" <h5>Universidad Politécnica de Madrid</h5>\n",
" </div>\n",
" <img style=\"width:15%;\" src=\"../logo.jpg\" alt=\"UPM\" />\n",
"</header>"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "59c5cb46c9d722f691206e766e5af557",
"grade": false,
"grade_id": "cell-51338a0933103db9",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# Introduction\n",
"\n",
"The goal of this exercise is to understand the usefulness of semantic annotation and the Linked Open Data initiative, by solving a practical use case.\n",
"\n",
"The student will achieve the goal through:\n",
"\n",
"* Analyzing the sequence of tasks required to generate and publish semantic data\n",
"* Extending their knowledge using the set of additional documents and specifications\n",
"* Creating a partial semantic definition using the Turtle format\n",
"\n",
"\n",
"# Objectives\n",
"\n",
"The main objective is to learn how annotations can be unified on the web, by following the Linked Data principles.\n",
"\n",
"\n",
"These concepts will be applied in a practical use case: obtaining a Graph of information about hotels and reviews about them.\n",
"\n",
"\n",
"# Tools\n",
"\n",
"This notebook is self-contained, but it requires some python libraries.\n",
"To install them, simply run the following line"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "387f9c38b548f29b56ae5ef5ae76fd4f",
"grade": false,
"grade_id": "cell-d7f1ea9c021693b8",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"!pip install --user -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Linked Data, RDF and Turtle\n",
"\n",
"\n",
"The term [Linked Data](https://www.w3.org/wiki/LinkedData) refers to a set of best practices for publishing structured data on the Web.\n",
"These principles have been coined by Tim Berners-Lee in the design issue note Linked Data.\n",
"The principles are:\n",
"\n",
"1. Use URIs as names for things\n",
"2. Use HTTP URIs so that people can look up those names\n",
"3. When someone looks up a URI, provide useful information\n",
"4. Include links to other URIs, so that they can discover more things\n",
"\n",
"The [RDF](https://www.w3.org/RDF/) is a standard model for data interchange on the Web.\n",
"It formalizes some concepts behind Linked Data into a specification, which can be used to develop applications and store information.\n",
"\n",
"Explaining RDF is out of the scope of this notebook.\n",
"The [resources section](#Useful-resources) contains some links if you wish to learn about RDF.\n",
"\n",
"The main idea behind RDF is that information is encoded in the form of triples:\n",
"\n",
"```turtle\n",
"<subject> <predicate> <object>\n",
"```\n",
"\n",
"Each of these, (`<subject>`, `<predicate>` and `<object>`) should be unique identifiers.\n",
"\n",
"For example, to say Timmy is a 6 year-old whose dog is Tobby, we would write:\n",
"\n",
"```turtle\n",
"<http://example.org/Timmy> <http://example.org/hasDog> <http://example.org/Tobby>\n",
"<http://example.org/Timmy> <http://example.org/age> 7\n",
"```\n",
"\n",
"Note that we are not referring to \"any Timmy\", but to a *very specific* Timmy.\n",
"We could learn more about this particular boy using that URI.\n",
"The same goes for the dog, and for the concept of \"having a dog\", which we unambiguously encode as `<http://example.org/hasDog>`.\n",
"This concept may be described as taking care of a dog, for example, whereas a different property `<http://yourwebsite.com/hasDog>` could be described as being the legal owner of the dog.\n",
"\n",
"\n",
"RDF can be used to embed annotation in many places, including HTML document, using any compatible format.\n",
"The options include including RDFa, XML, JSON-LD and [Turtle](https://www.w3.org/TR/turtle/).\n",
"\n",
"\n",
"In the exercises, we will be using turtle notation, because it is very readable.\n",
"\n",
"Here's an example of document in Turtle, taken from the Turtle specification:\n",
"\n",
"```turtle\n",
"@base <http://example.org/> .\n",
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n",
"@prefix rel: <http://www.perceive.net/schemas/relationship/> .\n",
"\n",
"<#green-goblin>\n",
" rel:enemyOf <#spiderman> ;\n",
" a foaf:Person ; # in the context of the Marvel universe\n",
" foaf:name \"Green Goblin\" .\n",
"\n",
"<#spiderman>\n",
" rel:enemyOf <#green-goblin> ;\n",
" a foaf:Person ;\n",
" foaf:name \"Spiderman\", \"Человек-паук\"@ru .\n",
"```\n",
"\n",
"\n",
"The second exercise will show you how to extract this information from any website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vocabularies and schema.org\n",
"\n",
"Concepts (predicates, types, etc.) can be defined in vocabularies.\n",
"These vocabularies can be reused in several applications.\n",
"In the example above, we used the concept of person from an external vocabulary (`foaf:Person`, i.e. http://xmlns.com/foaf/0.1/Person).\n",
"That way, we do not need to redefine the concept of Person in every application.\n",
"There are several well known vocabularies, such as:\n",
"\n",
"* Dublin core, for metadata: http://dublincore.org/\n",
"* FOAF (Friend-of-a-friend) for social networks: http://www.foaf-project.org/\n",
"* SIOC for online communities: https://www.w3.org/Submission/sioc-spec/\n",
"\n",
"Using the same vocabularies also makes it easier to automatically process and classify information.\n",
"\n",
"\n",
"That was the motivation behind Schema.org, a collaboration between Google, Microsoft, Yahoo and Yandex.\n",
"They aim to provide schemas for structured data annotation of Web sites, e-mails, etc., which can be leveraged by search engines and other automated processes.\n",
"\n",
"They rely on RDF for representation, and provide a set of common vocabularies that can be shared by every web developer.\n",
"\n",
"\n",
"There are thousands of properties in the schema.org vocabulary, and they offer a very comprehensive documentation.\n",
"\n",
"As an example, this is the documentation for hotels:\n",
"\n",
"* List of properties for the Hotel type: https://schema.org/Hotel\n",
"* Documentation for hotels: https://schema.org/docs/hotels.html\n",
"\n",
"\n",
"You can use the documentation to find properties (e.g. `checkinTime`), as well as the type of that property (e.g. `Datetime`)."
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "fe9a246ba580c71385e9b83d414a1216",
"grade": false,
"grade_id": "cell-a1b60daabb1a9d00",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# Exercises"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "63879c425ec11742c95c728a578d109e",
"grade": false,
"grade_id": "cell-d9289e96b2b0f265",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Instructions"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "e0b6464bce9263fb35543acf4acb31da",
"grade": false,
"grade_id": "cell-bb418e9bae1fef1a",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"First of all, run the line below.\n",
"It will import everything you need for the exercises."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "bf98cea45f42e3d0f1ab158693b40da7",
"grade": false,
"grade_id": "cell-4a1b60bd9974bbb1",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"from helpers import *\n",
"from rdflib import term, RDF, Namespace"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "3e23398d5277f2db2b3b5fb84f9623d6",
"grade": true,
"grade_id": "cell-da88c2f8170436fe",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"You have to fill in the parts marked:\n",
"\n",
"```\n",
"# YOUR ANSWER HERE\n",
"```\n",
"\n",
"To make sure everything is working, try the following example.\n",
"The solution is:\n",
"\n",
"```turtle\n",
"@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n",
"@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .\n",
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
"\n",
"<http://purl.org/net/bsletten> \n",
" a foaf:Person;\n",
" foaf:interest <http://www.w3.org/2000/01/sw/>;\n",
" foaf:based_near [\n",
" geo:lat \"34.0736111\" ;\n",
" geo:lon \"-118.3994444\"\n",
" ] .\n",
"```\n",
"\n",
"Fill in the answer and run the test code."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "69182e8fadb9c9751f76786e0fcb8803",
"grade": false,
"grade_id": "cell-808cfcbf3891f39f",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%ttl example\n",
"\n",
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "5982ca82090e267401af135ca1f371a8",
"grade": true,
"grade_id": "cell-23e61b9f48d597fc",
"locked": true,
"points": 1,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"g = solution('example')\n",
"test('Some triples have been loaded',\n",
" len(g))\n",
"test('A person has been defined',\n",
" g.subjects(RDF.type, term.URIRef('http://xmlns.com/foaf/0.1/Person')))\n",
"print('All tests passed. Well done!')"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "a64acf02625b48b3c65b6e1bc1ba6c1a",
"grade": false,
"grade_id": "cell-e73f1933742f7ab3",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Exercise 1: Definition of a Hotel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will define some basic information about a hotel, and some reviews.\n",
"This should be the same type of information that some aggregators (e.g. TripAdvisor) offer in their websites.\n",
"\n",
"Namely, you need to define at least two hotels (you may add more than one), with the following information:\n",
"* Description\n",
"* Address\n",
"* Contact information\n",
"* City and country (location)\n",
"* Email\n",
"* logo\n",
"* Opening hours\n",
"* Price range\n",
"* Amenities (optional)\n",
"* Geolocation (optional)\n",
"* Images (optional)\n",
"\n",
"You should also add at least three reviews about hotels, with the following information:\n",
"* Name of the user that reviewed the Hotel\n",
"* Rating\n",
"* Date\n",
"* Replies by other users (optional)\n",
"* Aspects rated in each review (cleanliness, staff, etc...) (optional)\n",
"* Information about the user (name, surname, date the account was created) (optional)\n",
"\n",
"\n",
"You can check any hotel website for inspiration, like this [review of a hotel in TripAdvisor](https://www.tripadvisor.es/Hotel_Review-g1437655-d1088667-Reviews-Hotel_Spa_La_Salve-Torrijos_Province_of_Toledo_Castile_La_Mancha.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make sure we are following Principles 1 and 2, we should use URIs that can be queried.\n",
"For the sake of this exercise, you have two options:\n",
" \n",
"* Use the made-up `http://example/sitc/` as base for our URIs.\n",
"Hence, the URIs of our hotels will look like this: `http://example/sitc/my-fancy-hotel`.\n",
"These URIs can not be queried, **and should not be used in real annotations**, but we will see how to fix that in a future exercise.\n",
"* Use (blank nodes)[https://en.wikipedia.org/wiki/Blank_node] (e.g. `_:my-fancy-hotel`), which cannot be used by other people, but can be re-used in your annotations.\n",
"\n",
"\n",
"We will use the vocabularies defined in https://schema.org e.g.:\n",
" * https://schema.org/Review defines properties about reviews\n",
" * https://schema.org/Hotel defines properties about hotels\n",
" \n",
"\n",
"Your definition has to be included in the following cell.\n",
"\n",
"**Tip**: Define the schema prefix first, to avoid repeating `<http://schema.org/...>`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "44f8be14db3d3e42b5b85f0485206346",
"grade": false,
"grade_id": "definition",
"locked": false,
"schema_version": 1,
"solution": true
}
},
"outputs": [],
"source": [
"%%ttl hotel\n",
"\n",
"@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .\n",
"@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n",
"@prefix sitc: <http://example/sitc/> .\n",
"\n",
"\n",
"<http://example/sitc/GSIHOTEL> a <http://schema.org/Hotel> ;\n",
" <http://schema.org/description> \"This is just an example to get you started.\" .\n",
"\n",
"\n",
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "4f54963163a64f46058c86be139e5543",
"grade": true,
"grade_id": "definition-tests",
"locked": true,
"points": 10,
"schema_version": 1,
"solution": false
}
},
"outputs": [],
"source": [
"g = solution('hotel')\n",
"test('Some triples are loaded',\n",
" len(g))\n",
"\n",
"hotels = set(g.subjects(RDF.type, schema['Hotel']))\n",
"test('At least 2 hotels are loaded',\n",
" hotels,\n",
" 2,\n",
" atLeast)\n",
"\n",
"for hotel in hotels:\n",
" if 'GSIHOTEL' in hotel: # Do not check the example hotel\n",
" continue\n",
" props = g.predicates(hotel)\n",
" test('Each hotel has all required properties',\n",
" props,\n",
" list(schema[i] for i in ['description', 'email', 'logo', 'priceRange']),\n",
" func=containsAll)\n",
"\n",
"reviews = set(g.subjects(RDF.type, schema['Review']))\n",
"test('At least 3 reviews are loaded',\n",
" reviews,\n",
" 3,\n",
" atLeast)\n",
"\n",
"for review in reviews:\n",
" props = g.predicates(review)\n",
" test('Each review has all required properties',\n",
" props,\n",
" list(schema[i] for i in ['itemReviewed', 'reviewBody', 'reviewRating']),\n",
" func=containsAll)\n",
" ratings = list(g.objects(review, schema['reviewRating']))\n",
" for rating in ratings:\n",
" value = g.value(rating, schema['ratingValue'])\n",
" test('The review should have ratings', value)\n",
"\n",
"authors = set(g.objects(None, schema['author']))\n",
"for author in authors:\n",
" for prop in g.predicates(author, None):\n",
" if 'name' in str(prop).lower():\n",
" break\n",
"else:\n",
" assert \"At least a reviewer has a name (surname, givenName...)\"\n",
"\n",
"print('All tests passed. Congratulations!')\n",
"print()\n",
"print('Now you can try to add the optional properties')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Explore existing data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The goal of this exercise is to explore and compare annotations from existing websites.\n",
"\n",
"Semantic annotations are very useful on the web, because they allow `robots` to extract information about resources, and how they relate to other resources.\n",
"\n",
"For example, `schema.org` annotations on a website allow Google to show summaries and useful information (e.g. price and location of a hotel) in their results.\n",
"A similar technology powers their knowledge graph and the \"related search\". i.e. when you look for a famous actor, it will first show you their filmography, and a list of related actors.\n",
"\n",
"The information has to be provided using the official standards (RDF), to comply with the 3rd principle of linked data.\n",
"\n",
"To follow the 4<sup>th</sup> principle of linked data, the annotations should include links to known sources (e.g. DBpedia) whenever possible."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us explore some semantic annotations from popular websites.\n",
"\n",
"First, start with hotel reviews and websites. Here are some examples:\n",
"\n",
"* TripAdvisor hotels\n",
"* Trivago\n",
"* Kayak\n",
"* Specific hotel reviews\n",
"\n",
"\n",
"These are just two examples:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_data('http://www.hotellasalve.com/')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_data('https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel')"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "a29112f51cc3299c7cae27841feb7410",
"grade": false,
"grade_id": "cell-9bf9c7d7516fae75",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"Once you've extracted and analyzed different sources, answer the following questions:\n",
"\n",
"\n",
"### Questions:\n",
"\n",
"What type of data do they offer?"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "2a7a6ab7d69f7ca5db64233128260045",
"grade": true,
"grade_id": "cell-17508ecf96884653",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "531928c9e3b8462baddd4d700c240995",
"grade": false,
"grade_id": "cell-d36826d6323c96e8",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"What vocabularyes and ontologies do they use?"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "f100004eceae0c8159ade9d713af47e7",
"grade": true,
"grade_id": "cell-17508ecf96884655",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the similarities between sites"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "4c03ad45eb1234cadccab2b468a69123",
"grade": true,
"grade_id": "answer-similarities",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the biggest differences"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "6ccc2db2be4826a146a6c34bc54f00de",
"grade": true,
"grade_id": "cell-17508ecf96884657",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "33e1ec78415c85a795e86211d88316c2",
"grade": false,
"grade_id": "cell-5f922dc14ad3236a",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"Are all properties from Exercise 1 given by the websites? What's missing?"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"nbgrader": {
"checksum": "e0b4d9f1a2dfe5a7ab835f7349aa3796",
"grade": true,
"grade_id": "answer-missing",
"locked": false,
"points": 0,
"schema_version": 1,
"solution": true
}
},
"source": [
"# YOUR ANSWER HERE"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "26eb04e562aa6c7d29efa8318982a337",
"grade": false,
"grade_id": "cell-7a3c1553c4d6a9b7",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"## Optional\n",
"\n",
"There is nothing special about review sites.\n",
"You can get information about any website.\n",
"\n",
"Verify this running checking:\n",
"\n",
"* News sites: e.g. https://edition.cnn.com/\n",
"* CMS: e.g. http://www.etsit.upm.es\n",
"* Twitter profiles: e.g. https://www.twitter.com/cif\n",
"* Mastodon (a Twitter alternative) profiles: e.g. https://mastodon.social/@Gargron/\n",
"* Twitter status pages: e.g. http://mobile.twitter.com/TBLInternetBot/status/1054438951237312514\n",
"* Mastodon (a Twitter alternative) status pages: e.g. https://mastodon.social/@Gargron/101202440923902326\n",
"* Wikipedia entries: e.g. https://es.wikipedia.org/wiki/Tim_Berners-Lee\n",
"* Facebook groups: e.g. https://www.facebook.com/universidadpolitecnicademadrid/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print_data('https://mastodon.social/@Gargron')"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"checksum": "cffc12120c51a7d994063f66d788570a",
"grade": false,
"grade_id": "cell-ec8df1a53c3d3f23",
"locked": true,
"schema_version": 1,
"solution": false
}
},
"source": [
"# Useful resources\n",
"\n",
"* TTL validator: http://ttl.summerofcode.be/\n",
"* RDF-turtle specification: https://www.w3.org/TR/turtle/\n",
"* Schema.org documentation: https://schema.org\n",
"* Wikipedia entry on the Turtle syntax: https://en.wikipedia.org/wiki/Turtle_(syntax)\n",
"* RDFLib, the most popular python library for RDF (we use it in the tests): https://rdflib.readthedocs.io/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bibliography\n",
"\n",
"* W3C website on Linked Data: https://www.w3.org/wiki/LinkedData\n",
"* W3C website on RDF: https://www.w3.org/RDF/\n",
"* Turtle W3C recommendation: https://www.w3.org/TR/turtle/"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
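
The test cells in the notebook query the graph with triple patterns such as `g.subjects(RDF.type, schema['Hotel'])` and `g.objects(None, schema['author'])`, where an omitted or `None` position matches anything. The sketch below illustrates that idea with plain Python tuples instead of RDFLib. The `match` function and the `TRIPLES` list are hypothetical names introduced only for this illustration; RDFLib's real implementation is indexed and far more capable.

```python
EX = "http://example.org/"

# Toy data mirroring the Timmy example from the notebook (illustrative only).
TRIPLES = [
    (EX + "Timmy", EX + "hasDog", EX + "Tobby"),
    (EX + "Timmy", EX + "age", 6),
]


def match(triples, s=None, p=None, o=None):
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    pattern = (s, p, o)
    for triple in triples:
        if all(want is None or want == got
               for want, got in zip(pattern, triple)):
            yield triple


# "Which dogs does Timmy have?" (subject and predicate fixed, object free)
dogs = [o for _, _, o in match(TRIPLES, s=EX + "Timmy", p=EX + "hasDog")]

# "Who has a dog?" (only the predicate fixed)
owners = [s for s, _, _ in match(TRIPLES, p=EX + "hasDog")]

print(dogs)
print(owners)
```

Because every position is a full URI, the pattern for `http://example.org/hasDog` would not match a different property such as `http://yourwebsite.com/hasDog`, which is the point the notebook makes about unambiguous identifiers.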

148
rdf/helpers.py Normal file

@ -0,0 +1,148 @@
import sys
import operator
import types

from future.standard_library import install_aliases
install_aliases()

from urllib import request, parse
from rdflib import Graph, term, Namespace, BNode
from lxml import etree

import IPython

# Enable Turtle syntax highlighting for cells that start with %%ttl
js = "IPython.CodeCell.options_default.highlight_modes['magic_turtle'] = {'reg':[/^%%ttl/]};"
IPython.core.display.display_javascript(js, raw=True)

from IPython.core.magic import (register_line_magic, register_cell_magic,
                                register_line_cell_magic)
from IPython.display import HTML, display, Image, Markdown

schema = Namespace('http://schema.org/')

# Graphs parsed by the %%ttl magic, indexed by exercise name
DEFINITIONS = {}


def solution(exercise='default'):
    if exercise not in DEFINITIONS:
        raise Exception('Solution for {} not found'.format(exercise))
    return DEFINITIONS[exercise]


@register_cell_magic
def ttl(line, cell):
    '''
    TTL magic command for ipython. It can be used in a cell like this:

    ```
    %%ttl

    ... Your TTL definition ...
    ```

    The definition will be parsed with RDFLib and stored in the DEFINITIONS
    dictionary, under the name given after %%ttl (or 'default').
    This definition can then be used for evaluation.
    '''
    g = Graph()
    msg = '''Error on line {line}
Reason: {reason}
If you don\'t know what this error means, try an online validator: http://ttl.summerofcode.be/
'''
    global DEFINITIONS
    key = line or 'default'
    try:
        DEFINITIONS[key] = g.parse(data=cell,
                                   format="ttl")
    except SyntaxError as ex:
        # Syntax errors raised by the Turtle parser carry the line and the reason
        return Markdown(msg.format(line=ex.lines, reason=ex._why))
    except Exception as ex:
        return Markdown(msg.format(line='?', reason=ex))
    return Markdown('File loaded!')


def extract_data(url):
    g = Graph()
    try:
        g.parse(url, format='rdfa')
    except Exception:
        print('Could not get rdfa data', file=sys.stderr)
    try:
        g.parse(url, format='microdata')
    except Exception:
        print('Could not get microdata', file=sys.stderr)

    def sanitize_triple(t):
        """Function to remove bad URIs from the graph that would otherwise
        make the serialization fail."""
        def sanitize_triple_item(item):
            if isinstance(item, term.URIRef) and ' ' in item:
                return term.URIRef(parse.quote(str(item)))
            return item

        return (sanitize_triple_item(t[0]),
                sanitize_triple_item(t[1]),
                sanitize_triple_item(t[2]))

    with request.urlopen(url) as response:
        # Get all json-ld objects embedded in the html file
        html = response.read().decode('utf-8', errors='ignore')
        parser = etree.XMLParser(recover=True)
        root = etree.fromstring(html.encode(), parser=parser)
        if root is not None and len(root):
            for jsonld in root.findall(".//script[@type='application/ld+json']"):
                g.parse(data=jsonld.text, publicID=BNode(), format='json-ld')

    fixedgraph = Graph()
    fixedgraph += [sanitize_triple(s) for s in g]

    # print(g.serialize(format='turtle').decode('utf-8', errors='ignore'))
    return fixedgraph
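

def turtle(g):
    data = g.serialize(format='turtle')
    if isinstance(data, bytes):
        # Older RDFLib versions return bytes; RDFLib 6 and later return str
        data = data.decode('utf-8', errors='ignore')
    return Markdown('''
Results:

```turtle
{}
```
'''.format(data))


def print_data(url):
    g = extract_data(url)
    return turtle(g)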


def test(description, got, expected=None, func=None):
    # Generators (e.g. g.subjects(...)) are consumed so they can be checked
    if isinstance(got, types.GeneratorType):
        got = set(got)
    try:
        if expected is None:
            func = func or operator.truth
            expected = True
            assert func(got)
        else:
            func = func or operator.eq
            assert func(got, expected)
    except AssertionError:
        print('Test failed: {}'.format(description), file=sys.stderr)
        print('\tExpected: {}'.format(expected), file=sys.stderr)
        print('\tGot: {}'.format(got), file=sys.stderr)
        raise Exception('Test failed: {}'.format(description))


def atLeast(lst, number):
    return len(set(lst)) >= number


def containsAll(lst, other):
    for i in other:
        if i not in lst:
            print('{} not found'.format(i), file=sys.stderr)
            return False
    return True
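
A note on `sanitize_triple_item` in `extract_data`: it percent-encodes URIs that contain spaces using `parse.quote` with its default arguments. The short sketch below, using only the standard library, shows what that call does to a full URI. The example URI is made up for illustration, and the `safe=":/"` variant is only one possible alternative, not something the helper currently does.

```python
from urllib.parse import quote

# Hypothetical URI with a space, similar to what a badly formed page might emit
uri = "http://example.org/my hotel"

# Default behaviour: only "/" is left untouched, so ":" is encoded as well
default_quoted = quote(uri)

# Keeping ":" safe preserves the scheme separator
kept_scheme = quote(uri, safe=":/")

print(default_quoted)  # http%3A//example.org/my%20hotel
print(kept_scheme)     # http://example.org/my%20hotel
```

With the default arguments the result can be serialized, but it no longer looks like an HTTP URI. Whether that matters depends on how the extracted graph is used afterwards; a wider `safe` set would also need to consider characters such as `?` and `#`.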