mirror of
https://github.com/balkian/balkian.github.com.git
synced 2025-08-24 12:42:20 +00:00
post on efficient collaboration
This commit is contained in:
104
content/post/bridging-rdf-json-ld-and-dataclasses.md
Normal file
104
content/post/bridging-rdf-json-ld-and-dataclasses.md
Normal file
@@ -0,0 +1,104 @@
|
||||
---
|
||||
title: "Bridging RDF, JSON-LD and Dataclasses"
|
||||
description:
|
||||
date: 2025-02-26T23:22:59+01:00
|
||||
image:
|
||||
math:
|
||||
license:
|
||||
hidden: false
|
||||
comments: false
|
||||
draft: false
|
||||
categories:
|
||||
- programming
|
||||
tags:
|
||||
- rdf
|
||||
- json-ld
|
||||
- pydantic
|
||||
- python
|
||||
---
|
||||
|
||||
In the RDF world, data is expressed as a collection of triples.
|
||||
These triples can contain IRIs that may or may not be accessible or valid.
|
||||
And the use of these IRIs may or may not adhere to a vocabulary.
|
||||
Checking the validity of the IRIs and the semantics of the triples is an additional step.
|
||||
|
||||
## The `rdflib` way
|
||||
|
||||
`rdflib` only models IRIs, values and namespaces.
|
||||
Developers need to be cognisant of the URIs they are using, and the vocabularies being used.
|
||||
Prior to version 2.0, senpy followed a very similar model.
|
||||
It had a base class to represent a generic node.
|
||||
Each instance then gets its own automatically generated id, and will act like a normal dictionary, whose keys and values will be serialized as a JSON-LD dictionary.
|
||||
Multiple subclasses were also included to model specific types of node, mostly to provide convenience methods for the given subtype.
|
||||
Here is an example of a subclass, `Entity`.
|
||||
|
||||
```python
|
||||
entry = Entry()
|
||||
|
||||
entry['vocab:property'] = 25
|
||||
|
||||
print(entry.jsonld())
|
||||
```
|
||||
|
||||
Would print something like this:
|
||||
|
||||
```json
|
||||
{
|
||||
"@id": ":Entry_202505....",
|
||||
"@type": "prefix:Entity",
|
||||
"vocab:property": 25
|
||||
}
|
||||
```
|
||||
|
||||
Producing correct triples using this model requires using the vocabularies and URIs properly, with little to no tooling to enforce it.
|
||||
This poses a big problem for a tool like Senpy, which aims to make it easier for professionals without a background in RDF to build and consume semantic NLP ser
|
||||
If an attribute is not a URI and is not included in the global JSON-LD context, it will not generate a triple in the final graph.
|
||||
Moreover, there is way to enforce that the vocabularies and the
|
||||
|
||||
Pros:
|
||||
* Flexible/extensible
|
||||
* Lightweight. This is mostly JSON-LD in Python's clothing.
|
||||
* Naturally maps to both `rdflib` and writing `json-ld`
|
||||
|
||||
Cons:
|
||||
* Discoverability. Documentation and examples are needed to know which attributes to use
|
||||
* Error-prone. It is easy to misuse a property, or introduce typos
|
||||
* Tight coupling with semantics/RDF. One needs to know a thing or two about RDF, especially if new vocabularies or annotations need to be used.
|
||||
|
||||
## The object-oriented way
|
||||
|
||||
An obvious alternative to this problem in an object-oriented language like python is to use classes to represent our data model.
|
||||
These classes can define the specific attributes available, and typing annotations can serve both as a guide for the developer, and as a means to automatically
|
||||
validate objects at runtime.
|
||||
There are tools like [pydantic](https://pydantic.dev/) that make this process very simple.
|
||||
Then, we only need to define how your models should be serialized into JSON-LD.
|
||||
We can thoroughly test this serialization to ensure that the resulting object is correct and produces the right RDF graph.
|
||||
Going back to our previous example, we could define an Entry class as a dataclass, and define all the possible types of annotations as attributes.
|
||||
|
||||
This model works great when all the possible attributes are known ahead of time.
|
||||
But it starts to break when the model provided is not comprehensive enough, or customers of your library need to provide their own ad-hoc annotations / attribut
|
||||
es.
|
||||
This could be solved by encouring consumers of our library to define their own subclasses whenever they need to add new attributes.
|
||||
This works perfectly fine for serialization, but it breaks if your library needs to automatically deserialize these subclasses.
|
||||
It also breaks if different parts of the code need to add their own attributes on the same data at the same time.
|
||||
This was precisely the case of `senpy`, where entities are annotated by different plugins, each providing a different set of annotations.
|
||||
|
||||
Pros:
|
||||
* Discoverability. All possible attributes are known ahead of time, including their possible types.
|
||||
* Decoupling from RDF. Developers only need to know about the dataclasses provided. The mapping to the RDF world is already encoded in the dataclass.
|
||||
|
||||
Cons:
|
||||
* Rigidity. Adding new types of annotations requires modifying the models, in the main module.
|
||||
* Polymorphism.
|
||||
|
||||
|
||||
## A hybrid approach
|
||||
|
||||
Whichever solution is chosen in the end, it needs to:
|
||||
|
||||
* Make it easy and error-proof to add the most common types of annotations
|
||||
* Allow for additional annotations/attributes to be added
|
||||
* Allow for upgrades in the future. i.e., converting the most common custom annotations into built-in ones
|
||||
* Allow for deserialization of custom types
|
||||
* Allow multiple consumers to add their own annotations
|
||||
|
Reference in New Issue
Block a user