diff --git a/lod/SPARQL.ipynb b/lod/SPARQL.ipynb index aad8f51..3e0f071 100644 --- a/lod/SPARQL.ipynb +++ b/lod/SPARQL.ipynb @@ -1872,7 +1872,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.7.2" } }, "nbformat": 4, diff --git a/rdf/RDF.ipynb b/rdf/RDF.ipynb new file mode 100644 index 0000000..67803f0 --- /dev/null +++ b/rdf/RDF.ipynb @@ -0,0 +1,875 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "1fba29f718bbaa14890b305223712474", + "grade": false, + "grade_id": "cell-2bd9e19ffed99f81", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "
\n", + "
\n", + "

Course Notes for Learning Intelligent Systems

\n", + "

Department of Telematic Engineering Systems

\n", + "
Universidad Politécnica de Madrid
\n", + "
\n", + " \"UPM\"\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "59c5cb46c9d722f691206e766e5af557", + "grade": false, + "grade_id": "cell-51338a0933103db9", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "# Introduction\n", + "\n", + "The goal of this exercise is to understand the usefulness of semantic annotation and the Linked Open Data initiative, by solving a practical use case.\n", + "\n", + "The student will achieve the goal through:\n", + "\n", + "* Analyzing the sequence of tasks required to generate and publish semantic data\n", + "* Extending their knowledge using the set of additional documents and specifications\n", + "* Creating a partial semantic definition using the Turtle format\n", + "\n", + "\n", + "# Objectives\n", + "\n", + "The main objective is to learn how annotations can be unified on the web, by following the Linked Data principles.\n", + "\n", + "\n", + "These concepts will be applied in a practical use case: obtaining a Graph of information about hotels and reviews about them.\n", + "\n", + "\n", + "# Tools\n", + "\n", + "This notebook is self-contained, but it requires some python libraries.\n", + "To install them, simply run the following line" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "387f9c38b548f29b56ae5ef5ae76fd4f", + "grade": false, + "grade_id": "cell-d7f1ea9c021693b8", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "outputs": [], + "source": [ + "!pip install --user -r requirements.txt" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Linked Data, RDF and Turtle\n", + "\n", + "\n", + "The term [Linked Data](https://www.w3.org/wiki/LinkedData) refers to a set of best practices for publishing structured data on the Web.\n", + "These principles have been coined by Tim 
Berners-Lee in the design issue note Linked Data.\n", + "The principles are:\n", + "\n", + "1. Use URIs as names for things\n", + "2. Use HTTP URIs so that people can look up those names\n", + "3. When someone looks up a URI, provide useful information\n", + "4. Include links to other URIs, so that they can discover more things\n", + "\n", + "[RDF](https://www.w3.org/RDF/) is a standard model for data interchange on the Web.\n", + "It formalizes some concepts behind Linked Data into a specification, which can be used to develop applications and store information.\n", + "\n", + "Explaining RDF is out of the scope of this notebook.\n", + "The [resources section](#Useful-resources) contains some links if you wish to learn about RDF.\n", + "\n", + "The main idea behind RDF is that information is encoded in the form of triples:\n", + "\n", + "```turtle\n", + "<subject> <predicate> <object>\n", + "```\n", + "\n", + "Each of these (`<subject>`, `<predicate>` and `<object>`) should be unique identifiers.\n", + "\n", + "For example, to say Timmy is a 7-year-old whose dog is Tobby, we would write:\n", + "\n", + "```turtle\n", + " \n", + " 7\n", + "```\n", + "\n", + "Note that we are not referring to \"any Timmy\", but to a *very specific* Timmy.\n", + "We could learn more about this particular boy using that URI.\n", + "The same goes for the dog, and for the concept of \"having a dog\", which we unambiguously encode as ``.\n", + "This concept may be described as taking care of a dog, for example, whereas a different property `` could be described as being the legal owner of the dog.\n", + "\n", + "\n", + "RDF can be used to embed annotations in many places, including HTML documents, using any compatible format.\n", + "The options include RDFa, XML, JSON-LD and [Turtle](https://www.w3.org/TR/turtle/).\n", + "\n", + "\n", + "In the exercises, we will be using Turtle notation, because it is very readable.\n", + "\n", + "Here's an example of a document in Turtle, taken from the Turtle specification:\n", + "\n", + 
"```turtle\n", + "@base .\n", + "@prefix rdf: .\n", + "@prefix rdfs: .\n", + "@prefix foaf: .\n", + "@prefix rel: .\n", + "\n", + "<#green-goblin>\n", + " rel:enemyOf <#spiderman> ;\n", + " a foaf:Person ; # in the context of the Marvel universe\n", + " foaf:name \"Green Goblin\" .\n", + "\n", + "<#spiderman>\n", + " rel:enemyOf <#green-goblin> ;\n", + " a foaf:Person ;\n", + " foaf:name \"Spiderman\", \"Человек-паук\"@ru .\n", + "```\n", + "\n", + "\n", + "The second exercise will show you how to extract this information from any website." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Vocabularies and schema.org\n", + "\n", + "Concepts (predicates, types, etc.) can be defined in vocabularies.\n", + "These vocabularies can be reused in several applications.\n", + "In the example above, we used the concept of person from an external vocabulary (`foaf:Person`, i.e. http://xmlns.com/foaf/0.1/Person).\n", + "That way, we do not need to redefine the concept of Person in every application.\n", + "There are several well known vocabularies, such as:\n", + "\n", + "* Dublin core, for metadata: http://dublincore.org/\n", + "* FOAF (Friend-of-a-friend) for social networks: http://www.foaf-project.org/\n", + "* SIOC for online communities: https://www.w3.org/Submission/sioc-spec/\n", + "\n", + "Using the same vocabularies also makes it easier to automatically process and classify information.\n", + "\n", + "\n", + "That was the motivation behind Schema.org, a collaboration between Google, Microsoft, Yahoo and Yandex.\n", + "They aim to provide schemas for structured data annotation of Web sites, e-mails, etc., which can be leveraged by search engines and other automated processes.\n", + "\n", + "They rely on RDF for representation, and provide a set of common vocabularies that can be shared by every web developer.\n", + "\n", + "\n", + "There are thousands of properties in the schema.org vocabulary, and they offer a very comprehensive 
documentation.\n", + "\n", + "As an example, this is the documentation for hotels:\n", + "\n", + "* List of properties for the Hotel type: https://schema.org/Hotel\n", + "* Documentation for hotels: https://schema.org/docs/hotels.html\n", + "\n", + "\n", + "You can use the documentation to find properties (e.g. `checkinTime`), as well as the expected type of each property (e.g. `DateTime`)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "fe9a246ba580c71385e9b83d414a1216", + "grade": false, + "grade_id": "cell-a1b60daabb1a9d00", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "# Exercises" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "63879c425ec11742c95c728a578d109e", + "grade": false, + "grade_id": "cell-d9289e96b2b0f265", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Instructions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "e0b6464bce9263fb35543acf4acb31da", + "grade": false, + "grade_id": "cell-bb418e9bae1fef1a", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "First of all, run the line below.\n", + "It will import everything you need for the exercises." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "bf98cea45f42e3d0f1ab158693b40da7", + "grade": false, + "grade_id": "cell-4a1b60bd9974bbb1", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "outputs": [], + "source": [ + "from helpers import *\n", + "from rdflib import term, RDF, Namespace" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "3e23398d5277f2db2b3b5fb84f9623d6", + "grade": true, + "grade_id": "cell-da88c2f8170436fe", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "You have to fill in the parts marked:\n", + "\n", + "```\n", + "# YOUR ANSWER HERE\n", + "```\n", + "\n", + "To make sure everything is working, try the following example.\n", + "The solution is:\n", + "\n", + "```turtle\n", + "@prefix foaf: .\n", + "@prefix geo: .\n", + "@prefix rdf: .\n", + "\n", + " \n", + " a foaf:Person;\n", + " foaf:interest ;\n", + " foaf:based_near [\n", + " geo:lat \"34.0736111\" ;\n", + " geo:lon \"-118.3994444\"\n", + " ] .\n", + "```\n", + "\n", + "Fill in the answer and run the test code." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "69182e8fadb9c9751f76786e0fcb8803", + "grade": false, + "grade_id": "cell-808cfcbf3891f39f", + "locked": false, + "schema_version": 1, + "solution": true + } + }, + "outputs": [], + "source": [ + "%%ttl example\n", + "\n", + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "5982ca82090e267401af135ca1f371a8", + "grade": true, + "grade_id": "cell-23e61b9f48d597fc", + "locked": true, + "points": 1, + "schema_version": 1, + "solution": false + } + }, + "outputs": [], + "source": [ + "g = solution('example')\n", + "test('Some triples have been loaded',\n", + " len(g))\n", + "test('A person has been defined',\n", + " g.subjects(RDF.type, term.URIRef('http://xmlns.com/foaf/0.1/Person')))\n", + "print('All tests passed. Well done!')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "a64acf02625b48b3c65b6e1bc1ba6c1a", + "grade": false, + "grade_id": "cell-e73f1933742f7ab3", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Exercise 1: Definition of a Hotel" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will define some basic information about a hotel, and some reviews.\n", + "This should be the same type of information that some aggregators (e.g. 
TripAdvisor) offer on their websites.\n", + "\n", + "Namely, you need to define at least two hotels (you may add more), with the following information:\n", + "* Description\n", + "* Address\n", + "* Contact information\n", + "* City and country (location)\n", + "* Email\n", + "* Logo\n", + "* Opening hours\n", + "* Price range\n", + "* Amenities (optional)\n", + "* Geolocation (optional)\n", + "* Images (optional)\n", + "\n", + "You should also add at least three reviews about hotels, with the following information:\n", + "* Name of the user that reviewed the hotel\n", + "* Rating\n", + "* Date\n", + "* Replies by other users (optional)\n", + "* Aspects rated in each review (cleanliness, staff, etc.) (optional)\n", + "* Information about the user (name, surname, date the account was created) (optional)\n", + "\n", + "\n", + "You can check any hotel website for inspiration, like this [review of a hotel on TripAdvisor](https://www.tripadvisor.es/Hotel_Review-g1437655-d1088667-Reviews-Hotel_Spa_La_Salve-Torrijos_Province_of_Toledo_Castile_La_Mancha.html)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To make sure we are following Principles 1 and 2, we should use URIs that can be queried.\n", + "For the sake of this exercise, you have two options:\n", + " \n", + "* Use the made-up `http://example/sitc/` as the base for our URIs.\n", + "Hence, the URIs of our hotels will look like this: `http://example/sitc/my-fancy-hotel`.\n", + "These URIs cannot be queried, **and should not be used in real annotations**, but we will see how to fix that in a future exercise.\n", + "* Use [blank nodes](https://en.wikipedia.org/wiki/Blank_node) (e.g. 
`_:my-fancy-hotel`), which cannot be used by other people, but can be re-used in your annotations.\n", + "\n", + "\n", + "We will use the vocabularies defined in https://schema.org e.g.:\n", + " * https://schema.org/Review defines properties about reviews\n", + " * https://schema.org/Hotel defines properties about hotels\n", + " \n", + "\n", + "Your definition has to be included in the following cell.\n", + "\n", + "**Tip**: Define the schema prefix first, to avoid repeating ``." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "44f8be14db3d3e42b5b85f0485206346", + "grade": false, + "grade_id": "definition", + "locked": false, + "schema_version": 1, + "solution": true + } + }, + "outputs": [], + "source": [ + "%%ttl hotel\n", + "\n", + "@prefix rdf: .\n", + "@prefix rdfs: .\n", + "@prefix sitc: .\n", + "\n", + "\n", + " a ;\n", + " \"This is just an example to get you started.\" .\n", + "\n", + "\n", + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "4f54963163a64f46058c86be139e5543", + "grade": true, + "grade_id": "definition-tests", + "locked": true, + "points": 10, + "schema_version": 1, + "solution": false + } + }, + "outputs": [], + "source": [ + "g = solution('hotel')\n", + "test('Some triples are loaded',\n", + " len(g))\n", + "\n", + "hotels = set(g.subjects(RDF.type, schema['Hotel']))\n", + "test('At least 2 hotels are loaded',\n", + " hotels,\n", + " 2,\n", + " atLeast)\n", + "\n", + "for hotel in hotels:\n", + " if 'GSIHOTEL' in hotel: # Do not check the example hotel\n", + " continue\n", + " props = g.predicates(hotel)\n", + " test('Each hotel has all required properties',\n", + " props,\n", + " list(schema[i] for i in ['description', 'email', 'logo', 'priceRange']),\n", + " func=containsAll)\n", + "\n", + "reviews = set(g.subjects(RDF.type, 
schema['Review']))\n", + "test('At least 3 reviews are loaded',\n", + " reviews,\n", + " 3,\n", + " atLeast)\n", + "\n", + "for review in reviews:\n", + " props = g.predicates(review)\n", + " test('Each review has all required properties',\n", + " props,\n", + " list(schema[i] for i in ['itemReviewed', 'reviewBody', 'reviewRating']),\n", + " func=containsAll)\n", + " ratings = list(g.objects(review, schema['reviewRating']))\n", + " for rating in ratings:\n", + " value = g.value(rating, schema['ratingValue'])\n", + " test('The review should have ratings', value)\n", + "\n", + "authors = set(g.objects(None, schema['author']))\n", + "named = False\n", + "for author in authors:\n", + "    for prop in g.predicates(author, None):\n", + "        if 'name' in str(prop).lower():\n", + "            named = True\n", + "test('At least a reviewer has a name (surname, givenName...)', named)\n", + "\n", + "print('All tests passed. Congratulations!')\n", + "print()\n", + "print('Now you can try to add the optional properties')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exercise 2: Explore existing data" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The goal of this exercise is to explore and compare annotations from existing websites.\n", + "\n", + "Semantic annotations are very useful on the web, because they allow `robots` to extract information about resources, and how they relate to other resources.\n", + "\n", + "For example, `schema.org` annotations on a website allow Google to show summaries and useful information (e.g. price and location of a hotel) in their results.\n", + "A similar technology powers their knowledge graph and the \"related search\", i.e. 
when you look for a famous actor, it will first show you their filmography, and a list of related actors.\n", + "\n", + "The information has to be provided using the official standards (RDF), to comply with the 3rd principle of linked data.\n", + "\n", + "To follow the 4th principle of linked data, the annotations should include links to known sources (e.g. DBpedia) whenever possible." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let us explore some semantic annotations from popular websites.\n", + "\n", + "First, start with hotel reviews and websites. Here are some examples:\n", + "\n", + "* TripAdvisor hotels\n", + "* Trivago\n", + "* Kayak\n", + "* Specific hotel reviews\n", + "\n", + "\n", + "These are just two examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_data('http://www.hotellasalve.com/')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_data('https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "a29112f51cc3299c7cae27841feb7410", + "grade": false, + "grade_id": "cell-9bf9c7d7516fae75", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "Once you've extracted and analyzed different sources, answer the following questions:\n", + "\n", + "\n", + "### Questions:\n", + "\n", + "What type of data do they offer?" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "2a7a6ab7d69f7ca5db64233128260045", + "grade": true, + "grade_id": "cell-17508ecf96884653", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "531928c9e3b8462baddd4d700c240995", + "grade": false, + "grade_id": "cell-d36826d6323c96e8", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "What vocabularies and ontologies do they use?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "f100004eceae0c8159ade9d713af47e7", + "grade": true, + "grade_id": "cell-17508ecf96884655", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What are the similarities between sites?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "3d9ad086580ee27d93395dac8c16551d", + "grade": true, + "grade_id": "cell-30797c9ac87cc7e1", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What are the similarities between sites?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "4c03ad45eb1234cadccab2b468a69123", + "grade": true, + "grade_id": "answer-similarities", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What are the biggest differences?" + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "6ccc2db2be4826a146a6c34bc54f00de", + "grade": true, + "grade_id": "cell-17508ecf96884657", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "33e1ec78415c85a795e86211d88316c2", + "grade": false, + "grade_id": "cell-5f922dc14ad3236a", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "Are all properties from Exercise 1 given by the websites? What's missing?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "nbgrader": { + "checksum": "e0b4d9f1a2dfe5a7ab835f7349aa3796", + "grade": true, + "grade_id": "answer-missing", + "locked": false, + "points": 0, + "schema_version": 1, + "solution": true + } + }, + "source": [ + "# YOUR ANSWER HERE" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "26eb04e562aa6c7d29efa8318982a337", + "grade": false, + "grade_id": "cell-7a3c1553c4d6a9b7", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "## Optional\n", + "\n", + "There is nothing special about review sites.\n", + "You can get information about any website.\n", + "\n", + "Verify this running checking:\n", + "\n", + "* News sites: e.g. https://edition.cnn.com/\n", + "* CMS: e.g. http://www.etsit.upm.es\n", + "* Twitter profiles: e.g. https://www.twitter.com/cif\n", + "* Mastodon (a Twitter alternative) profiles: e.g. https://mastodon.social/@Gargron/\n", + "* Twitter status pages: e.g. http://mobile.twitter.com/TBLInternetBot/status/1054438951237312514\n", + "* Mastodon (a Twitter alternative) status pages: e.g. https://mastodon.social/@Gargron/101202440923902326\n", + "* Wikipedia entries: e.g. 
https://es.wikipedia.org/wiki/Tim_Berners-Lee\n", + "* Facebook groups: e.g. https://www.facebook.com/universidadpolitecnicademadrid/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print_data('https://mastodon.social/@Gargron')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "deletable": false, + "editable": false, + "nbgrader": { + "checksum": "cffc12120c51a7d994063f66d788570a", + "grade": false, + "grade_id": "cell-ec8df1a53c3d3f23", + "locked": true, + "schema_version": 1, + "solution": false + } + }, + "source": [ + "# Useful resources\n", + "\n", + "* TTL validator: http://ttl.summerofcode.be/\n", + "* RDF-turtle specification: https://www.w3.org/TR/turtle/\n", + "* Schema.org documentation: https://schema.org\n", + "* Wikipedia entry on the Turtle syntax: https://en.wikipedia.org/wiki/Turtle_(syntax)\n", + "* RDFLib, the most popular python library for RDF (we use it in the tests): https://rdflib.readthedocs.io/" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Bibliography\n", + "\n", + "* W3C website on Linked Data: https://www.w3.org/wiki/LinkedData\n", + "* W3C website on RDF: https://www.w3.org/RDF/\n", + "* Turtle W3C recommendation: https://www.w3.org/TR/turtle/" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/rdf/helpers.py b/rdf/helpers.py new file mode 100644 index 0000000..24c2f4d --- /dev/null +++ b/rdf/helpers.py @@ -0,0 +1,148 @@ +import sys +import operator +import types +from future.standard_library import install_aliases +install_aliases() + +from 
urllib import request, parse +from rdflib import Graph, term, Namespace, BNode +from lxml import etree + +import IPython +js = "IPython.CodeCell.options_default.highlight_modes['magic_turtle'] = {'reg':[/^%%ttl/]};" +IPython.core.display.display_javascript(js, raw=True) + + +from IPython.core.magic import (register_line_magic, register_cell_magic, + register_line_cell_magic) +from IPython.display import HTML, display, Image, Markdown + + +schema = Namespace('http://schema.org/') + +DEFINITIONS = {} + +def solution(exercise='default'): + if exercise not in DEFINITIONS: + raise Exception('Solution for {} not found'.format(exercise)) + return DEFINITIONS[exercise] + + +@register_cell_magic +def ttl(line, cell): + ''' + TTL magic command for ipython. It can be used in a cell like this: + + ``` + %%ttl + + ... Your TTL definition ... + + ``` + The definition will be loaded into a DEFINITION variable, using RDFlib. + This definition can then be used for evaluation. + ''' + g = Graph() + msg = '''Error on line {line} + +Reason: {reason} + +If you don\'t know what this error means, try an online validator: http://ttl.summerofcode.be/ +''' + global DEFINITIONS + key = line or 'default' + try: + DEFINITIONS[key] = g.parse(data=cell, + format="ttl") + except SyntaxError as ex: + return Markdown(msg.format(line=ex.lines, reason=ex._why)) + except Exception as ex: + return Markdown(msg.format(line='?', reason=ex)) + return Markdown('File loaded!') + + return HTML('Loaded!') #HTML('{}'.format(cell)) + + +def extract_data(url): + g = Graph() + try: + g.parse(url, format='rdfa') + except Exception: + print('Could not get rdfa data', file=sys.stderr) + try: + g.parse(url, format='microdata') + except Exception: + print('Could not get microdata', file=sys.stderr) + + + def sanitize_triple(t): + """Function to remove bad URIs from the graph that would otherwise + make the serialization fail.""" + def sanitize_triple_item(item): + if isinstance(item, term.URIRef) and ' ' in item: + 
return term.URIRef(parse.quote(str(item))) + return item + + return (sanitize_triple_item(t[0]), + sanitize_triple_item(t[1]), + sanitize_triple_item(t[2])) + + + with request.urlopen(url) as response: + # Get all json-ld objects embedded in the html file + html = response.read().decode('utf-8', errors='ignore') + parser = etree.XMLParser(recover=True) + root = etree.fromstring(html.encode(), parser=parser) + if root is not None and len(root): + for jsonld in root.findall(".//script[@type='application/ld+json']"): + g.parse(data=jsonld.text, publicID=BNode(), format='json-ld') + + + fixedgraph = Graph() + fixedgraph += [sanitize_triple(s) for s in g] + +# print(g.serialize(format='turtle').decode('utf-8', errors='ignore')) + return fixedgraph + +def turtle(g): + return Markdown(''' +Results: + +```turtle +{} +``` +'''.format(g.serialize(format='turtle').decode('utf-8', errors='ignore'))) + +def print_data(url): + g = extract_data(url) + return turtle(g) + + + +def test(description, got, expected=None, func=None): + if isinstance(got, types.GeneratorType): + got = set(got) + try: + if expected is None: + func = func or operator.truth + expected = True + assert func(got) + else: + func = func or operator.eq + assert func(got, expected) + except AssertionError: + print('Test failed: {}'.format(description), file=sys.stderr) + print('\tExpected: {}'.format(expected), file=sys.stderr) + print('\tGot: {}'.format(got), file=sys.stderr) + raise Exception('Test failed: {}'.format(description)) + + +def atLeast(lst, number): + return len(set(lst))>=number + +def containsAll(lst, other): + for i in other: + if i not in lst: + print('{} not found'.format(i), file=sys.stderr) + return False + return True \ No newline at end of file diff --git a/lod/requirements.txt b/rdf/requirements.txt similarity index 100% rename from lod/requirements.txt rename to rdf/requirements.txt