<header style="width:100%;position:relative">
  <div style="width:80%;float:right;">
    <h1>Course Notes for Learning Intelligent Systems</h1>
    <h3>Department of Telematic Engineering Systems</h3>
    <h5>Universidad Politécnica de Madrid</h5>
  </div>
        <img style="width:15%;" src="../logo.jpg" alt="UPM" />
</header>

# Introduction

The goal of this exercise is to understand the usefulness of semantic annotation and the Linked Open Data initiative, by solving a practical use case.

The student will achieve the goal through:

* Analyzing the sequence of tasks required to generate and publish semantic data
* Extending their knowledge using the set of additional documents and specifications
* Creating a partial semantic definition using the Turtle format


# Objectives

The main objective is to learn how annotations can be unified on the web, by following the Linked Data principles.


These concepts will be applied in a practical use case: obtaining a graph of information about hotels and their reviews.


# Tools

This notebook is self-contained, but it requires some Python libraries.
To install them, run the following line:

In [2]:
!pip install --user -r requirements.txt



# Linked Data, RDF and Turtle


The term [Linked Data](https://www.w3.org/wiki/LinkedData) refers to a set of best practices for publishing structured data on the Web.
These principles were coined by Tim Berners-Lee in his design issue note on Linked Data.
The principles are:

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information
4. Include links to other URIs, so that they can discover more things
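
As a rough illustration of principles 2 and 3, the sketch below prepares an HTTP request that asks for a machine-readable (Turtle) description of a resource. The DBpedia URI is only an example, and whether a given server honours the `Accept` header is an assumption; nothing is sent over the network until `urlopen(request)` is called.

```python
from urllib.request import Request

# Example Linked Data URI (DBpedia). Any HTTP URI that names a thing would do.
uri = 'http://dbpedia.org/resource/Tim_Berners-Lee'

# Ask for Turtle instead of HTML through content negotiation.
# Assumption: the server supports the text/turtle media type.
request = Request(uri, headers={'Accept': 'text/turtle'})

print(request.full_url)
print(request.get_header('Accept'))

# To actually look up the URI (requires network access):
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       print(response.read().decode('utf-8')[:500])
```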

[RDF](https://www.w3.org/RDF/) (Resource Description Framework) is a standard model for data interchange on the Web.
It formalizes some concepts behind Linked Data into a specification, which can be used to develop applications and store information.

Explaining RDF is out of the scope of this notebook.
The [resources section](#Useful-resources) contains some links if you wish to learn about RDF.

The main idea behind RDF is that information is encoded in the form of triples:

```turtle
<subject> <predicate> <object>
```

Each of these (`<subject>`, `<predicate>` and `<object>`) should be a unique identifier, although the object may also be a literal value such as a number or a string.

For example, to say that Timmy is a 7-year-old whose dog is Tobby, we would write:

```turtle
<http://example.org/Timmy>  <http://example.org/hasDog> <http://example.org/Tobby> .
<http://example.org/Timmy>  <http://example.org/age> 7 .
```

Note that we are not referring to "any Timmy", but to a *very specific* Timmy.
We could learn more about this particular boy using that URI.
The same goes for the dog, and for the concept of "having a dog", which we unambiguously encode as `<http://example.org/hasDog>`.
This concept may be described as taking care of a dog, for example, whereas a different property `<http://yourwebsite.com/hasDog>` could be described as being the legal owner of the dog.
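
To make the triple model more concrete, here is a minimal plain-Python sketch that stores the two statements about Timmy as tuples and looks them up by pattern. The `match` helper is hypothetical and only illustrates the idea; real RDF libraries offer a richer API.

```python
EX = 'http://example.org/'

# Each statement is one (subject, predicate, object) tuple.
triples = {
    (EX + 'Timmy', EX + 'hasDog', EX + 'Tobby'),
    (EX + 'Timmy', EX + 'age', 7),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that agree with every position that is not None."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or subject == s)
        and (predicate is None or predicate == p)
        and (obj is None or obj == o)
    ]


# Which dogs does this specific Timmy have?
dogs = match(triples, subject=EX + 'Timmy', predicate=EX + 'hasDog')
print(dogs)
```

Because every position is a full URI, a statement using `http://yourwebsite.com/hasDog` would simply not match the pattern above.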


RDF can be used to embed annotations in many places, including HTML documents, using any compatible format.
The options include RDFa, RDF/XML, JSON-LD and [Turtle](https://www.w3.org/TR/turtle/).


In the exercises, we will be using Turtle notation, because it is very readable.

Here is an example of a document in Turtle, taken from the Turtle specification:

```turtle
@base <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rel: <http://www.perceive.net/schemas/relationship/> .

<#green-goblin>
    rel:enemyOf <#spiderman> ;
    a foaf:Person ;    # in the context of the Marvel universe
    foaf:name "Green Goblin" .

<#spiderman>
    rel:enemyOf <#green-goblin> ;
    a foaf:Person ;
    foaf:name "Spiderman", "Человек-паук"@ru .
```


The second exercise will show you how to extract this information from any website.

As you can observe in these examples, Turtle defines several ways to specify IRIs in a document. Please consult the specification for further details. As an overview, IRIs can be:
 * *relative IRIs*: IRIs resolved relative to the current base IRI. Thus, you should define a base IRI (`@base <http://example.org/> .`) and then use relative IRIs (e.g. `<#spiderman>`). The resulting IRI is `<http://example.org/#spiderman>`.
 * *prefixed names*: a prefixed name (e.g. `foaf:Person`) is transformed into an IRI by concatenating the IRI of the prefix (`@prefix foaf: <http://xmlns.com/foaf/0.1/> .`) and the local part of the prefixed name (e.g. `Person`). So, the resulting IRI is `<http://xmlns.com/foaf/0.1/Person>`.
 * *absolute IRIs*: an already resolved IRI, e.g. `<http://example.com/Auto>`.
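
The two resolution rules can be imitated with a few lines of standard-library Python. This is a simplified sketch, not a Turtle parser: `resolve_relative` and `expand_prefixed` are hypothetical helper names, and `urljoin` is used as an approximation of the IRI resolution that Turtle prescribes.

```python
from urllib.parse import urljoin

BASE = 'http://example.org/'
PREFIXES = {'foaf': 'http://xmlns.com/foaf/0.1/'}


def resolve_relative(base, relative_iri):
    """Resolve a relative IRI such as '#spiderman' against the base IRI."""
    return urljoin(base, relative_iri)


def expand_prefixed(name, prefixes):
    """Concatenate the prefix IRI and the local part of a prefixed name."""
    prefix, local = name.split(':', 1)
    return prefixes[prefix] + local


print(resolve_relative(BASE, '#spiderman'))      # http://example.org/#spiderman
print(expand_prefixed('foaf:Person', PREFIXES))  # http://xmlns.com/foaf/0.1/Person
```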

# Vocabularies and schema.org

Concepts (predicates, types, etc.) can be defined in vocabularies.
These vocabularies can be reused in several applications.
In the example above, we used the concept of person from an external vocabulary (`foaf:Person`, i.e. http://xmlns.com/foaf/0.1/Person).
That way, we do not need to redefine the concept of Person in every application.
There are several well known vocabularies, such as:

* Dublin Core, for metadata: http://dublincore.org/
* FOAF (Friend-of-a-friend) for social networks: http://www.foaf-project.org/
* SIOC for online communities: https://www.w3.org/Submission/sioc-spec/

Using the same vocabularies also makes it easier to automatically process and classify information.


That was the motivation behind Schema.org, a collaboration between Google, Microsoft, Yahoo and Yandex.
They aim to provide schemas for structured data annotation of Web sites, e-mails, etc., which can be leveraged by search engines and other automated processes.

They rely on RDF for representation, and provide a set of common vocabularies that can be shared by every web developer.


There are over a thousand properties in the schema.org vocabulary, and they offer very comprehensive documentation.

As an example, this is the documentation for hotels:

* List of properties for the Hotel type: https://schema.org/Hotel
* Documentation for hotels: https://schema.org/docs/hotels.html


You can use the documentation to find properties (e.g. `checkinTime`), as well as the expected type of each property (e.g. `DateTime`).

# Exercises

## Instructions

First of all, run the line below.
It will import everything you need for the exercises.

In [3]:
from helpers import *
from rdflib import term, RDF, Namespace

You have to fill in the parts marked:

```
# YOUR ANSWER HERE
```

To make sure everything is working, try the following example.
The solution is:

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://purl.org/net/bsletten> 
    a foaf:Person;
    foaf:interest <http://www.w3.org/2000/01/sw/>;
    foaf:based_near [
        geo:lat "34.0736111" ;
        geo:lon "-118.3994444"
   ] .
```

Fill in the answer and run the test code.

The `%%ttl` line is a so-called cell magic, a command that passes the contents of the cell to a function. You can read more at https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics

In [4]:
%%ttl example

# YOUR ANSWER HERE

Error on line ?

Reason: No plugin registered for (ttl, <class 'rdflib.parser.Parser'>)

If you don't know what this error means, try an online validator: http://ttl.summerofcode.be/


In [None]:
g = solution('example')
test('Some triples have been loaded',
     len(g))
test('A person has been defined',
     g.subjects(RDF.type, term.URIRef('http://xmlns.com/foaf/0.1/Person')))
print('All tests passed. Well done!')

## Exercise 1: Definition of a Hotel

We will define some basic information about a hotel, and some reviews.
This should be the same type of information that some aggregators (e.g. TripAdvisor) offer on their websites.

Namely, you need to define at least two hotels (you may add more), with the following information:
* Description
* Address
* Contact information
* City and country (location)
* Email
* Logo
* Opening hours
* Price range
* Amenities (optional)
* Geolocation (optional)
* Images (optional)

You should also add at least three reviews about hotels, with the following information:
* Name of the user that reviewed the Hotel
* Rating
* Date
* Replies by other users (optional)
* Aspects rated in each review (cleanliness, staff, etc...) (optional)
* Information about the user (name, surname, date the account was created) (optional)


You can check any hotel website for inspiration, like this [review of a hotel in TripAdvisor](https://www.tripadvisor.es/Hotel_Review-g1437655-d1088667-Reviews-Hotel_Spa_La_Salve-Torrijos_Province_of_Toledo_Castile_La_Mancha.html).

To make sure we are following Principles 1 and 2, we should use URIs that can be queried.
For the sake of this exercise, you can use the made-up `http://example/sitc/` as the base for your URIs.
Hence, the URIs of your hotels will look like this: `http://example/sitc/my-fancy-hotel`.
These URIs cannot be queried, **and should not be used in real annotations**, but we will see how to fix that in a future exercise.


We will use the vocabularies defined in https://schema.org, e.g.:

* https://schema.org/Review defines properties about reviews
* https://schema.org/Hotel defines properties about hotels

Your definition has to be included in the following cell.

So, your task is:
* Search the schema.org vocabulary for the properties needed to represent the attributes of both reviews and hotels.
* Write two resources of type Hotel and three resources of type Review.
* Check that your syntax is correct, by executing your code in the cell below.

**Tip**: Define the schema prefix first, to avoid repeating `<http://schema.org/...>`.

In [None]:
%%ttl hotel

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sitc: <http://example/sitc/> .


<http://example/sitc/GSIHOTEL> a <http://schema.org/Hotel> ;
         <http://schema.org/description> "This is just an example to get you started." .


# YOUR ANSWER HERE

In [None]:
g = solution('hotel')
test('Some triples are loaded',
     len(g))

hotels = set(g.subjects(RDF.type, schema['Hotel']))
test('At least 2 hotels are loaded',
     hotels,
     2,
     atLeast)

for hotel in hotels:
    if 'GSIHOTEL' in hotel:  # Do not check the example hotel
        continue
    props = g.predicates(hotel)
    test('Each hotel has all required properties',
         props,
         list(schema[i] for i in ['description', 'email', 'logo', 'priceRange']),
         func=containsAll)

reviews = set(g.subjects(RDF.type, schema['Review']))
test('At least 3 reviews are loaded',
     reviews,
     3,
     atLeast)

for review in reviews:
    props = g.predicates(review)
    test('Each review has all required properties',
         props,
         list(schema[i] for i in ['itemReviewed', 'reviewBody', 'reviewRating']),
         func=containsAll)
    ratings = list(g.objects(review, schema['reviewRating']))
    for rating in ratings:
        value = g.value(rating, schema['ratingValue'])
        test('The review should have ratings', value)

authors = set(g.objects(None, schema['author']))
# At least one reviewer must have a name-like property (name, givenName, familyName...)
has_name = any('name' in str(prop).lower()
               for author in authors
               for prop in g.predicates(author, None))
assert has_name, 'At least one reviewer should have a name (familyName, givenName...)'

print('All tests passed. Congratulations!')
print()
print('Now you can try to add the optional properties')

## Exercise 2: Explore existing data

The goal of this exercise is to explore and compare annotations from existing websites.

Semantic annotations are very useful on the web, because they allow `robots` to extract information about resources, and how they relate to other resources.

For example, `schema.org` annotations on a website allow Google to show summaries and useful information (e.g. price and location of a hotel) in their results.
A similar technology powers their knowledge graph and the "related search". For example, when you look for a famous actor, it will first show you their filmography, and a list of related actors.

The information has to be provided using the official standards (RDF), to comply with the 3rd principle of linked data.

To follow the 4<sup>th</sup> principle of linked data, the annotations should include links to known sources (e.g. DBpedia) whenever possible.

Let us explore some semantic annotations from popular websites.

First, start with hotel reviews and websites. Here are some examples:

* TripAdvisor hotels
* Trivago
* Kayak
* Specific hotel reviews


The two cells below show the annotations extracted from two specific hotel websites:

In [5]:
print_data('http://www.hotellasalve.com/')

Could not get rdfa data



Results:

```turtle
@prefix ns1: <http://purl.org/dc/terms/> .
@prefix ns2: <http://www.w3.org/ns/rdfa#> .
@prefix ns3: <http://www.w3.org/2006/http#> .
@prefix ns4: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.hotellasalve.com/> ns4:item () .

[] a ns2:Error ;
    ns1:date "2019-02-13T17:42:13.310135"^^xsd:dateTime ;
    ns1:description "__init__() got an unexpected keyword argument 'encoding'" ;
    ns2:context [ a ns3:Request ;
            ns3:requestURI "http://www.hotellasalve.com/" ],
        [ a ns3:Response ;
            ns3:responseCode <http://www.w3.org/2006/http#400> ] .


```


In [6]:
print_data('https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel')

Could not get rdfa data
https://photos.mandarinoriental.com/is/content/MandarinOriental/RZMAD - Madrid/Logos/hotel-ritz-hotel-logo-SVG.svg does not look like a valid URI, trying to serialize this will break.
tel:+34 91 701 67 67 does not look like a valid URI, trying to serialize this will break.



Results:

```turtle
@prefix ns1: <http://schema.org/> .
@prefix ns2: <http://www.w3.org/2006/http#> .
@prefix ns3: <http://purl.org/dc/terms/> .
@prefix ns4: <http://www.w3.org/ns/rdfa#> .
@prefix ns5: <http://www.w3.org/ns/md#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel> ns5:item ( [ a ns1:Hotel ;
                ns1:address [ a ns1:PostalAddress ;
                        ns1:addressCountry "Spain"@en ;
                        ns1:addressLocality "Madrid"@en ;
                        ns1:postalCode "28014"@en ;
                        ns1:streetAddress "Plaza de la Lealtad 5"@en ] ;
                ns1:description "Experience our 5 Star hotel in central Madrid, Retiro Park offering luxurious rooms and suites, fine dining, private spa, meeting and wedding facilities."@en ;
                ns1:email <mailto:reservations@mohg.com> ;
                ns1:image <https%3A//photos.mandarinoriental.com/is/content/MandarinOriental/RZMAD%20-%20Madrid/Logos/hotel-ritz-hotel-logo-SVG.svg> ;
                ns1:name "Hotel Ritz, Madrid"@en ;
                ns1:tel <tel%3A%2B34%2091%20701%2067%2067> ;
                ns1:url <https://www.google.com/maps/place/Hotel+Ritz,+Madrid/@40.4156097,-3.6946249,773m/data=!3m2!1e3!4b1!4m5!3m4!1s0xd42288329bef061:0xb9bba45ac90e2184!8m2!3d40.4156056!4d-3.6924362>,
                    <https://www.mandarinoriental.com/> ] ) ;
    ns4:usesVocabulary ns1: .

[] a ns4:Error ;
    ns3:date "2019-02-13T17:43:52.577508"^^xsd:dateTime ;
    ns3:description "__init__() got an unexpected keyword argument 'encoding'" ;
    ns4:context [ a ns2:Request ;
            ns2:requestURI "https://www.mandarinoriental.com/madrid/hotel-ritz/luxury-hotel" ],
        [ a ns2:Response ;
            ns2:responseCode <http://www.w3.org/2006/http#400> ] .


```


Once you've extracted and analyzed different sources, answer the following questions:


### Questions:

What type of data do they offer?

# YOUR ANSWER HERE

What vocabularies and ontologies do they use?

# YOUR ANSWER HERE

What are the similarities between sites?

# YOUR ANSWER HERE

What are the biggest differences?

# YOUR ANSWER HERE

Are all properties from Exercise 1 given by the websites? What's missing?

# YOUR ANSWER HERE

## Optional

There is nothing special about review sites.
You can get information about any website.

Verify this by checking:

* News sites: e.g. https://edition.cnn.com/
* CMS: e.g. http://www.etsit.upm.es
* Twitter profiles: e.g. https://www.twitter.com/cif
* Mastodon (a Twitter alternative) profiles: e.g. https://mastodon.social/@Gargron/
* Twitter status pages: e.g. http://mobile.twitter.com/TBLInternetBot/status/1054438951237312514
* Mastodon (a Twitter alternative) status pages: e.g. https://mastodon.social/@Gargron/101202440923902326
* Wikipedia entries: e.g. https://es.wikipedia.org/wiki/Tim_Berners-Lee
* Facebook groups: e.g. https://www.facebook.com/universidadpolitecnicademadrid/

In [None]:
print_data('https://mastodon.social/@Gargron')

# Useful resources

* TTL validator: http://ttl.summerofcode.be/
* RDF-turtle specification: https://www.w3.org/TR/turtle/
* Schema.org documentation: https://schema.org
* Wikipedia entry on the Turtle syntax: https://en.wikipedia.org/wiki/Turtle_(syntax)
* RDFLib, the most popular Python library for RDF (we use it in the tests): https://rdflib.readthedocs.io/

# Bibliography

* W3C website on Linked Data: https://www.w3.org/wiki/LinkedData
* W3C website on RDF: https://www.w3.org/RDF/
* Turtle W3C recommendation: https://www.w3.org/TR/turtle/