Lesson Goals
This lesson explains why many cultural institutions are adopting graph databases, and how researchers can access these data through the query language called SPARQL.
Contents

- Lesson Goals
- Graph Databases, RDF, and Linked Open Data
- Real-world queries
- Working with SPARQL results
- Further reading
Graph Databases, RDF, and Linked Open Data
Many cultural institutions now offer access to their collections information through web Application Programming Interfaces (APIs). While these APIs are a powerful way to access individual records in a machine-readable manner, they are not ideal for cultural heritage data because they are structured to work for a predetermined set of queries. For example, a museum may have information on donors, artists, artworks, exhibitions, and provenance, but its web API may offer only object-wise retrieval, making it difficult or impossible to search for associated data about donors, artists, provenance, etc. This structure is great if you come looking for information about particular objects. However, it makes it difficult to aggregate information about every artist or donor that happens to be described in the dataset.

RDF databases are well-suited to expressing complex relationships between many entities, like people, places, events, and concepts tied to individual objects. These databases are often referred to as “graph” databases because they structure information as a graph or network, where a set of resources, or nodes, are connected together by edges that describe the relationships between each resource.

Because RDF databases support the use of URLs (weblinks), they can be made available online and linked to other databases, hence the term “Linked Open Data”. Major art collections including the British Museum, Europeana, the Smithsonian American Art Museum, and the Yale Center for British Art have published their collections data as LOD. The Getty Vocabulary Program has also released its series of authoritative databases on geographic place names, terms for describing art and architecture, and variant spellings of artist names as LOD.

SPARQL is the language used to query these databases. This language is particularly powerful because it does not presuppose the perspectives that users will bring to the data. A query about objects and a query about donors are basically equivalent to such a database. Unfortunately, many tutorials on SPARQL use extremely simplified data models that don’t resemble the more complex datasets released by cultural heritage institutions. This tutorial gives a crash course on SPARQL using a dataset that a humanist might actually find in the wilds of the Internet. In this tutorial, we will learn how to query the British Museum Linked Open Data collection.

RDF in brief

RDF represents information in a series of three-part “statements” that comprise a subject, a predicate, and an object, e.g.:

<The Nightwatch> <was created by> <Rembrandt van Rijn> .

(Note that just like any good sentence, they each have a period at the end.)
Here, the subject <The Nightwatch> and the object <Rembrandt van Rijn> can be thought of as two nodes of the graph, with the predicate <was created by> defining an edge between them. (Technically, <was created by> can, in other queries, be treated as an object or subject itself, but that is beyond the scope of this tutorial.)
A pseudo-RDF database might contain interrelated statements like these:

...
<The Nightwatch> <was created by> <Rembrandt van Rijn> .
<The Nightwatch> <was created in> <1642> .
<The Nightwatch> <has medium> <oil on canvas> .
<Rembrandt van Rijn> <was born in> <1606> .
<Rembrandt van Rijn> <has nationality> <Dutch> .
<Johannes Vermeer> <has nationality> <Dutch> .
<Woman with a Balance> <was created by> <Johannes Vermeer> .
<Woman with a Balance> <has medium> <oil on canvas> .
...

If we were to visualize these statements as nodes and edges within a network graph, it would appear like so:

A traditional relational database might split attributes about artworks and attributes about artists into separate tables. In an RDF/graph database, all these data points belong to the same interconnected graph, which allows users maximum flexibility in deciding how they wish to query it.
Searching RDF with SPARQL

SPARQL lets us translate heavily interlinked graph data into normalized, tabular data with rows and columns that you can open in programs like Excel, or import into a visualization suite such as plot.ly or Palladio.

It is useful to think of a SPARQL query as a Mad Lib: a set of sentences with blanks in them. The database will take this query and find every set of matching statements that correctly fill in those blanks, returning the matching values to us as a table. Take this SPARQL query:

SELECT ?painting
WHERE {
  ?painting <has medium> <oil on canvas> .
}

?painting in this query stands in for the node (or nodes) that the database will return. On receiving this query, the database will search for all values of ?painting that properly complete the RDF statement <has medium> <oil on canvas> .

When the query runs against the full database, it looks for the subjects, predicates, and objects that match this statement, while excluding the rest of the data:

And our results might look like this table:

| painting |
|---|
| The Nightwatch |
| Woman with a Balance |
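
The fill-in-the-blank matching described above can be sketched in a few lines of Python. This is only a toy illustration and not how a real RDF store works: the `triples` list and the `match_subjects` helper are hypothetical names invented for this sketch, and the statements are the plain-text ones from our pseudo-RDF database.

```python
# Toy illustration only: the pseudo-RDF statements from above,
# stored as (subject, predicate, object) tuples.
triples = [
    ("The Nightwatch", "was created by", "Rembrandt van Rijn"),
    ("The Nightwatch", "was created in", "1642"),
    ("The Nightwatch", "has medium", "oil on canvas"),
    ("Rembrandt van Rijn", "was born in", "1606"),
    ("Rembrandt van Rijn", "has nationality", "Dutch"),
    ("Johannes Vermeer", "has nationality", "Dutch"),
    ("Woman with a Balance", "was created by", "Johannes Vermeer"),
    ("Woman with a Balance", "has medium", "oil on canvas"),
]


def match_subjects(statements, predicate, obj):
    """Return every subject that completes `?x <predicate> <obj> .`"""
    return [s for (s, p, o) in statements if p == predicate and o == obj]


# Equivalent in spirit to: ?painting <has medium> <oil on canvas> .
paintings = match_subjects(triples, "has medium", "oil on canvas")
print(paintings)
```

Every statement that does not fit the pattern is simply ignored, which mirrors how the query excludes the rest of the data.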
What makes RDF and SPARQL powerful is the ability to create complex queries that reference many variables at a time. For example, we could search our pseudo-RDF database for paintings by any artist who is Dutch:

SELECT ?artist ?painting
WHERE {
  ?artist <has nationality> <Dutch> .
  ?painting <was created by> ?artist .
}

Here we’ve introduced a second variable, ?artist. The RDF database will return all matching combinations of ?artist and ?painting that fulfill both of these statements.

| artist | painting |
|---|---|
| Rembrandt van Rijn | The Nightwatch |
| Johannes Vermeer | Woman with a Balance |
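
To make the idea of "matching combinations" concrete, here is a rough Python sketch of how two patterns that share a variable can be resolved together. It is an assumption-laden toy, not a real SPARQL engine: `is_variable`, `match_pattern`, and `query` are hypothetical helpers, and only a subset of the pseudo-RDF statements is included.

```python
# Toy illustration only: hypothetical helpers, not a real SPARQL engine.
triples = [
    ("The Nightwatch", "was created by", "Rembrandt van Rijn"),
    ("The Nightwatch", "has medium", "oil on canvas"),
    ("Rembrandt van Rijn", "has nationality", "Dutch"),
    ("Johannes Vermeer", "has nationality", "Dutch"),
    ("Woman with a Balance", "was created by", "Johannes Vermeer"),
]


def is_variable(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, statement, bindings):
    """Extend `bindings` so `pattern` matches `statement`, or return None."""
    new_bindings = dict(bindings)
    for term, value in zip(pattern, statement):
        if is_variable(term):
            # A variable must keep the same value across all patterns.
            if new_bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new_bindings


def query(patterns, statements):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for statement in statements:
                extended = match_pattern(pattern, statement, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


results = query(
    [
        ("?artist", "has nationality", "Dutch"),
        ("?painting", "was created by", "?artist"),
    ],
    triples,
)
for row in results:
    print(row["?artist"], "|", row["?painting"])
```

Because ?artist appears in both patterns, a painting is only kept when its creator is one of the artists already matched by the first pattern.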
URIs and Literals
So far, we have been looking at a toy representation of RDF that uses easy-to-read text. However, RDF is primarily stored as URIs (Uniform Resource Identifiers) that separate conceptual entities from their plain-English (or other language!) labels. (Note that a URL, or Uniform Resource Locator, is a URI for a resource that is accessible on the web.) In real RDF, our original statement:

<The Nightwatch> <was created by> <Rembrandt van Rijn> .

would more likely look something like this:

<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/creator> <http://dbpedia.org/resource/Rembrandt> .

N.B. the Rijksmuseum has not (yet) built their own Linked Data site, so the URI in this query is just for demo purposes.
In order to get the human-readable version of the information represented by each of these URIs, what we’re really doing is just retrieving more RDF statements. Even the predicate in that statement has its own literal label:

<http://data.rijksmuseum.nl/item/8909812347> <http://purl.org/dc/terms/title> "The Nightwatch" .

<http://purl.org/dc/terms/creator> <http://www.w3.org/2000/01/rdf-schema#label> "was created by" .

<http://dbpedia.org/resource/Rembrandt> <http://xmlns.com/foaf/0.1/name> "Rembrandt van Rijn" .

You will notice that, unlike the URIs in the query that are surrounded by <>, the objects of these statements are just strings of text within quotation marks, known as literals. Literals are unlike URIs in that they represent values, rather than references. For example, <http://dbpedia.org/resource/Rembrandt> represents an entity that may reference (and be referenced by) any number of other statements (say, birth dates, students, or family members), while the text string "Rembrandt van Rijn" stands only for itself. Literals do not point to other nodes in the graph, and they can only ever be objects in an RDF statement. Other literal values in RDF include dates and numbers.
See the predicates in these statements, with domain names like purl.org, w3.org, and xmlns.com? These are some of the many providers of ontologies that help standardize the way we describe relationships between bits of information like “title”, “label”, “creator”, or “name”. The more RDF/LOD that you work with, the more of these providers you’ll find.
URIs can become unwieldy when composing SPARQL queries, which is why we’ll use prefixes. These are shortcuts that allow us to skip typing out entire long URIs. For example, remember that predicate for retrieving the title of the Nightwatch, <http://purl.org/dc/terms/title>? With a prefix, we just need to type dct:title instead, and the same shortcut works for any other predicate in that purl.org namespace. dct: stands in for http://purl.org/dc/terms/, and title just gets pasted onto the end of this link.
For example, with the prefix PREFIX rkm: <http://data.rijksmuseum.nl/item/> added to the start of our SPARQL query, <http://data.rijksmuseum.nl/item/8909812347> becomes rkm:8909812347 instead.
Be aware that, because prefixes can be arbitrarily assigned with whatever abbreviations you like, different endpoints may use slightly different prefixes for the same namespace (e.g. dct vs. dcterms for <http://purl.org/dc/terms/>).
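
The "pasting" that a prefix performs can be sketched in Python as a simple string substitution. This is a minimal illustration under stated assumptions: the `PREFIXES` table and the `expand` helper are hypothetical names, and a real SPARQL parser applies stricter rules about which characters may follow the colon.

```python
# Hypothetical prefix table; the abbreviations themselves are arbitrary.
PREFIXES = {
    "dct": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Swap the prefix before the first colon for its full namespace URI."""
    prefix, local_name = prefixed_name.split(":", 1)
    return prefixes[prefix] + local_name


print(expand("dct:title"))
print(expand("skos:prefLabel"))
```

Renaming the key "dct" to "dcterms" would change what you type in a query, but not the full URI that the database ultimately sees.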
Terms to review

- SPARQL - SPARQL Protocol and RDF Query Language - The language used to query RDF graph databases.
- RDF - Resource Description Framework - A method for structuring data as a graph or network of connected statements, rather than a series of tables.
- LOD - Linked Open Data - LOD is RDF data published online with dedicated URIs in such a manner that developers can reliably reference it.
- statement - Sometimes also called a “triple”, an RDF statement is a quantum of knowledge comprising a subject, predicate, and object.
- URI - Uniform Resource Identifier - A string of characters for identifying a resource. RDF statements use URIs to link various resources together. A URL, or uniform resource locator, is a type of URI that points to resources on the web.
- literal - Some objects in RDF statements do not refer to other resources with a URI, but instead convey a value, such as text ("Rembrandt van Rijn"), a number (5), or a date (1606-06-15). These are known as literals.
- prefix - In order to simplify SPARQL queries, a user may specify prefixes that act as abbreviations for full URIs. These abbreviations, or QNames, are also used in namespaced XML documents.

Real-world queries
All the statements for one object

Let’s start our first query using the British Museum SPARQL endpoint. A SPARQL endpoint is a web address that accepts SPARQL queries and returns results. The BM endpoint is like many others: if you navigate to it in a web browser, it presents you with a text box for composing queries.

When starting to explore a new RDF database, it helps to look at the relationships that stem from a single example object.

(For each of the following queries, click on the “Run query” link below to see the results. You can then run it as is, or modify it before requesting the results. Remember when editing the query before running to uncheck the ‘Include inferred’ box.)

SELECT ?p ?o
WHERE {
  <http://collection.britishmuseum.org/id/object/PPA82633> ?p ?o .
}

By calling SELECT ?p ?o we’re asking the database to return the values of ?p and ?o as described in the WHERE {} command. This query returns every statement for which our example artwork, <http://collection.britishmuseum.org/id/object/PPA82633>, is the subject. ?p is in the middle position of the RDF statement in the WHERE {} command, so it returns any predicates matching this statement, while ?o in the final position returns all objects. Though I have named them ?p and ?o here, as you will see below we can name these variables anything we like. Indeed, it will be useful to give them meaningful names for the complex queries that follow!
Note: depending on how the British Museum has configured their SPARQL endpoint when you read this lesson, instead of seeing “prefixed” versions of the URLs (e.g. thes:x8577) you may see the full version http://collection.britishmuseum.org/id/thesauri/x8577. As noted in the discussion of prefixes above, this still represents the same URI.
The BM endpoint formats the results table with hyperlinks for every variable that is itself an RDF node, so by clicking on any one of these links you can shift to seeing all the predicates and objects for that newly-selected node. Note that BM automatically includes a wide range of SPARQL prefixes in its queries, so you will find many hyperlinks are displayed in their abbreviated versions; if you mouse over them your browser will display their unabbreviated URIs.

Let’s find out how they store the object type information: look for the predicate bmo:PX_object_type (highlighted in the figure above) and click on the link for thes:x8577 to navigate to the node describing the particular object type “print”:

You’ll note how this node has a plain-text label, as well as ties to related artwork type nodes within the database.
Complex queries

To find other objects of the same type with the preferred label “print”, we can call this query:

PREFIX bmo: <http://www.researchspace.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?object
WHERE {

  # Search for all values of ?object that have a given "object type"
  ?object bmo:PX_object_type ?object_type .

  # That object type should have the label "print"
  ?object_type skos:prefLabel "print" .
}
LIMIT 10

Run query / See a user-generated query
Remember that, because "print" here is a literal, we enclose it within quotation marks in our query. When you include literals in a SPARQL query, the database will only return exact matches for those values.

Note that, because ?object_type is not present in the SELECT command, it will not show up in the results table. However, it is essential to structuring our query, because it connects the dots from ?object to the label "print".
FILTER
In the previous query, we searched for an exact match for the object type with the text label “print”. However, often we want to match literal values that fall within a certain range, such as dates. For this, we’ll use the FILTER command.

To find URIs for all the prints in the BM created between 1580 and 1600, we’ll need to first figure out where the database stores dates in relationship to the object node, and then add references to those dates in our query. Similar to the way that we followed a single link to determine an object type, we must hop through several nodes to find the production dates associated with a given object:

PREFIX bmo: <http://www.researchspace.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ecrm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Return object links and creation date
SELECT ?object ?date
WHERE {

  # We'll use our previous command to search only for
  # objects of type "print"
  ?object bmo:PX_object_type ?object_type .
  ?object_type skos:prefLabel "print" .

  # We need to link through several nodes to find the
  # creation date associated with an object
  ?object ecrm:P108i_was_produced_by ?production .
  ?production ecrm:P9_consists_of ?date_node .
  ?date_node ecrm:P4_has_time-span ?timespan .
  ?timespan ecrm:P82a_begin_of_the_begin ?date .

  # As you can see, we need to connect quite a few dots
  # to get to the date node! Now that we have it, we can
  # filter our results. Because we are filtering by date,
  # we must attach the tag ^^xsd:date after our date strings.
  # This tag tells the database to interpret the string
  # "1580-01-01" as the date 1 January 1580.

  FILTER(?date >= "1580-01-01"^^xsd:date &&
         ?date <= "1600-01-01"^^xsd:date)
}

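
The role of the ^^xsd:date tag can be loosely mirrored in Python, where a string is likewise converted into a date value before a range comparison is applied. The sketch below is only an analogy: the rows are made up for illustration (they are not real BM records) and the object names are hypothetical.

```python
from datetime import date

# Made-up rows standing in for (?object, ?date) results; not real BM data.
rows = [
    ("object-a", "1575-06-01"),
    ("object-b", "1580-01-01"),
    ("object-c", "1593-11-20"),
    ("object-d", "1601-03-15"),
]

# Like "1580-01-01"^^xsd:date, these turn plain strings into date values.
start = date.fromisoformat("1580-01-01")
end = date.fromisoformat("1600-01-01")

# Equivalent in spirit to:
# FILTER(?date >= "1580-01-01"^^xsd:date && ?date <= "1600-01-01"^^xsd:date)
in_range = [
    (obj, d) for (obj, d) in rows
    if start <= date.fromisoformat(d) <= end
]
print(in_range)
```

Note that both bounds are inclusive, just as the >= and <= operators are in the FILTER above.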
Aggregation
So far we have only used the SELECT command to return a table of objects. However, SPARQL allows us to do more advanced analysis such as grouping, counting, and sorting.

Say we would like to keep looking at objects made between 1580 and 1600, but we want to understand how many objects of each type the BM has in its collections. Instead of limiting our results to objects of type “print”, we will instead use COUNT to tally our search results by type.

PREFIX bmo: <http://www.researchspace.org/ontology/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX ecrm: <http://www.cidoc-crm.org/cidoc-crm/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?type (COUNT(?type) as ?n)
WHERE {
  # We still need to indicate the ?object_type variable,
  # however we will not require it to match "print" this time

  ?object bmo:PX_object_type ?object_type .
  ?object_type skos:prefLabel ?type .

  # Once again, we will also filter by date
  ?object ecrm:P108i_was_produced_by ?production .
  ?production ecrm:P9_consists_of ?date_node .
  ?date_node ecrm:P4_has_time-span ?timespan .
  ?timespan ecrm:P82a_begin_of_the_begin ?date .
  FILTER(?date >= "1580-01-01"^^xsd:date &&
         ?date <= "1600-01-01"^^xsd:date)
}
# The GROUP BY command designates the variable to tally by,
# and the ORDER BY DESC() command sorts the results by
# descending number.
GROUP BY ?type
ORDER BY DESC(?n)

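
For readers more familiar with Python than with SPARQL, the grouping, counting, and sorting steps correspond roughly to what `collections.Counter` does. This is a loose analogy rather than a description of how the endpoint computes its answer, and the rows below are invented for illustration.

```python
from collections import Counter

# Made-up (?object, ?type) rows; the labels and counts are illustrative only.
rows = [
    ("object-a", "print"),
    ("object-b", "drawing"),
    ("object-c", "print"),
    ("object-d", "medal"),
    ("object-e", "print"),
    ("object-f", "drawing"),
]

# GROUP BY ?type together with COUNT(?type) as ?n
counts = Counter(object_type for (_obj, object_type) in rows)

# ORDER BY DESC(?n)
for object_type, n in counts.most_common():
    print(object_type, n)
```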
Linking multiple SPARQL endpoints
Up until now, we have constructed queries that look for patterns in one dataset alone. In the ideal world envisioned by Linked Open Data advocates, multiple databases can be interlinked to allow very complex queries dependent on knowledge present in different locations. However, this is easier said than done, and many endpoints (the BM’s included) do not yet reference outside authorities.

One endpoint that does, however, is Europeana’s. They have created links between the objects in their database and records about individuals in DBpedia and VIAF, places in GeoNames, and concepts in the Getty Art & Architecture Thesaurus. SPARQL allows you to insert SERVICE statements that instruct the database to “phone a friend” and run a portion of the query on an outside dataset, using the results to complete the query on the local dataset. While this lesson will not go into the data models in Europeana and DBpedia in depth, the following query illustrates how a SERVICE statement works. You may run it yourself by copying and pasting the query text into the Europeana endpoint.

PREFIX edm: <http://www.europeana.eu/schemas/edm/>
PREFIX ore: <http://www.openarchives.org/ore/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX rdaGr2: <http://rdvocab.info/ElementsGr2/>

# Find all ?object related by some ?property to an ?agent born in a
# ?dutch_city
SELECT ?object ?property ?agent ?dutch_city
WHERE {
  ?proxy ?property ?agent .
  ?proxy ore:proxyFor ?object .

  ?agent rdf:type edm:Agent .
  ?agent rdaGr2:placeOfBirth ?dutch_city .

  # ?dutch_city is defined by having "Netherlands" as its broader
  # country in DBpedia. The SERVICE statement asks
  # http://dbpedia.org/sparql which cities have the country
  # "Netherlands". The answers to that sub-query will then be
  # used to finish off our original query about objects in the
  # Europeana database

  SERVICE <http://dbpedia.org/sparql> {
    ?dutch_city dbo:country dbr:Netherlands .
  }
}
# This query can potentially return a lot of objects, so let's
# just request the first 100 in order to speed up the search
LIMIT 100

An interlinked query like this means that we can ask Europeana questions about its objects that rely on information about geography (what cities are in the Netherlands?) that Europeana does not need to store and maintain itself. In the future, more cultural LOD will hopefully link to authority databases like the Getty’s Union List of Artist Names, allowing, for example, the British Museum to outsource biographical data to the more complete resources at the Getty.

Working with SPARQL results

Having constructed and run a query… what do we do with the results? Many endpoints, like the British Museum’s, offer a web-based browser that returns human-readable results. However, SPARQL endpoints are designed to return structured data to be used by other programs.
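
As a rough sketch of what "used by other programs" can mean, the Python below prepares the kind of web request a program might send to an endpoint. It only builds the address and does not contact any server. It assumes the endpoint accepts the query text in a `query` parameter of a GET request, as the SPARQL protocol describes, and it reuses the BM JSON address given in this lesson; whether that address still responds is not something this sketch checks. `build_request_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Address taken from this lesson; treat it as an assumption that it is
# still reachable and still accepts a `query` parameter.
ENDPOINT = "http://collection.britishmuseum.org/sparql.json"

QUERY = """
SELECT ?p ?o
WHERE {
  <http://collection.britishmuseum.org/id/object/PPA82633> ?p ?o .
}
"""


def build_request_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_request_url(ENDPOINT, QUERY)
print(url)
```

A program could then fetch this address (for instance with `urllib.request.urlopen`) and parse the structured response, instead of a person reading a results page in the browser.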
Export results to CSV

In the top right corner of the results page for the BM endpoint, you will find links for both JSON and XML downloads. Other endpoints may also offer the option for a CSV/TSV download; however, this option is not always available. The JSON and XML output from a SPARQL endpoint contain not only the values returned from the SELECT statement, but also additional metadata about variable types and languages.

Parsing the XML version of this output may be done with a tool like Beautiful Soup (see its Programming Historian lesson) or OpenRefine. To quickly convert JSON results from a SPARQL endpoint into a tabular format, I recommend the free command line utility jq. (For a tutorial on using command line programs, see “Introduction to the Bash Command Line”.) The following command will convert the JSON results format returned by the endpoint into a CSV file, which you may load into your preferred program for further analysis and visualization:

jq -r '.head.vars as $fields | ($fields | @csv), (.results.bindings[] | [.[$fields[]].value] | @csv)' sparql.json > sparql.csv

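
If you would rather stay in Python than install jq, the same flattening can be sketched with the standard library. This is a minimal sketch, not an equivalent of the jq command in every detail (for example, it quotes fields only when needed): the sample JSON is a tiny made-up example of the head.vars and results.bindings layout, and `sparql_json_to_csv` is a hypothetical helper name.

```python
import csv
import io
import json

# A tiny, made-up example of the SPARQL JSON results layout
# (head.vars plus results.bindings); the values are illustrative only.
sparql_json = """
{
  "head": {"vars": ["type", "n"]},
  "results": {"bindings": [
    {"type": {"type": "literal", "value": "print"},
     "n": {"type": "literal", "value": "3"}},
    {"type": {"type": "literal", "value": "drawing"},
     "n": {"type": "literal", "value": "2"}}
  ]}
}
"""


def sparql_json_to_csv(text):
    """Flatten SPARQL JSON results into CSV text, one row per binding."""
    data = json.loads(text)
    fields = data["head"]["vars"]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(fields)
    for binding in data["results"]["bindings"]:
        # A variable can be unbound in a given row, hence the fallback.
        writer.writerow([binding.get(f, {}).get("value", "") for f in fields])
    return buffer.getvalue()


csv_text = sparql_json_to_csv(sparql_json)
print(csv_text)
```

In practice you would read the text from the downloaded file (called sparql.json in the jq example) and write the returned string to a .csv file.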
Export results to Palladio
The popular data exploration platform Palladio can directly load data from a SPARQL endpoint. On the “Create a new project” screen, a link at the bottom to “Load data from a SPARQL endpoint (beta)” will provide you a field to enter the endpoint address, and a box for the query itself. Depending on the endpoint, you may need to specify the file output type in the endpoint address; for example, to load data from the BM endpoint you must use the address http://collection.britishmuseum.org/sparql.json. Try pasting in the aggregation query we used above to count artworks by type and clicking on “Run query”. Palladio should display a preview table.

After previewing the data returned by the endpoint, click on the “Load data” button at the bottom of the screen to begin manipulating it. (See this Programming Historian lesson for a more in-depth tutorial on Palladio.) For example, we might make a query that returns links to the images of prints made between 1580 and 1600, and render that data as a grid of images sorted by date:

Note that Palladio is designed to work with relatively small amounts of data (on the order of hundreds or thousands of rows, not tens of thousands), so you may have to use the LIMIT command that we used when querying the Europeana endpoint to reduce the number of results that you get back, just to keep the software from freezing.
Further reading
In this tutorial we got a look at the structure of LOD as well as a real-life example of how to write SPARQL queries for the British Museum’s database. You also learned how to use aggregation commands in SPARQL to group, count, and sort results rather than simply list them.

There are even more ways to modify these queries, such as introducing OPTIONAL and UNION statements (for describing conditional queries), CONSTRUCT statements (for inferring new links based on defined rules), full-text searching, or doing other mathematical operations more complex than counting. For a more complete rundown of the commands available in SPARQL, see these links:

Both the Europeana and Getty Vocabularies LOD sites also offer extensive, and quite complex, example queries which can be good sources for understanding how to search their data: