post on efficient collaboration

2026-01-01 18:08:15 +00:00 · 2025-03-06 01:45:41 +01:00
parent 1d1f044d80
commit ec2e9ea7cc
6 changed files with 583 additions and 2 deletions
--- a/content/post/bridging-rdf-json-ld-and-dataclasses.md
+++ b/content/post/bridging-rdf-json-ld-and-dataclasses.md
@@ -0,0 +1,104 @@
+---
+title: "Bridging RDF, JSON-LD and Dataclasses"
+description: 
+date: 2025-02-26T23:22:59+01:00
+image: 
+math: 
+license: 
+hidden: false
+comments: false
+draft: false
+categories:
+    - programming
+tags:
+    - rdf
+    - json-ld
+    - pydantic
+    - python
+---
+
+In the RDF world, data is expressed as a collection of triples.
+These triples can contain IRIs that may or may not be accessible or valid.
+And the use of these IRIs may or may not adhere to a vocabulary.
+Checking the validity of the IRIs and the semantics of the triples is an additional step.
+
+## The `rdflib` way
+ 
+`rdflib` only models IRIs, values and namespaces.
+Developers need to be cognisant of the URIs they are using, and the vocabularies being used.
+Prior to version 2.0, senpy followed a very similar model.
+It had a base class to represent a generic node.
+Each instance got its own automatically generated id and acted like a normal dictionary, whose keys and values were serialized as a JSON-LD dictionary.
+Multiple subclasses were also included to model specific types of node, mostly to provide convenience methods for the given subtype.
+Here is an example of a subclass, `Entry`.
+
+```python
+entry = Entry()
+
+entry['vocab:property'] = 25
+
+print(entry.jsonld())
+```
+
+This would print something like:
+
+```json
+{
+ "@id": ":Entry_202505....",
+ "@type": "prefix:Entity",
+ "vocab:property": 25
+}
+```
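
A minimal sketch of how such a dictionary-based model could be built, using only the standard library. This is not senpy's actual code: the `Node` base class, the `prefix` attribute and the id format are assumptions made for illustration, and the `@type` is simply derived from the class name.

```python
import json
import uuid


class Node(dict):
    """Hypothetical generic node: a dict with an auto-generated @id."""

    prefix = "prefix"  # assumed vocabulary prefix, a placeholder

    def __init__(self, **attrs):
        super().__init__(**attrs)
        name = type(self).__name__
        # Each instance gets its own identifier unless one was provided.
        self.setdefault("@id", f":{name}_{uuid.uuid4().hex}")
        self.setdefault("@type", f"{self.prefix}:{name}")

    def jsonld(self):
        # Keys and values are serialized as they are, with no validation.
        return json.dumps(self, indent=1, sort_keys=True)


class Entry(Node):
    """Subclass for a specific type of node."""


entry = Entry()
entry["vocab:property"] = 25
print(entry.jsonld())
```

Nothing in this sketch checks that `vocab:property` exists in any vocabulary, which is the weakness discussed next.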
+
+Producing correct triples using this model requires using the vocabularies and URIs properly, with little to no tooling to enforce it.
+This poses a big problem for a tool like Senpy, which aims to make it easier for professionals without a background in RDF to build and consume semantic NLP ser
+If an attribute is not a URI and is not included in the global JSON-LD context, it will not generate a triple in the final graph.
+Moreover, there is no way to enforce that the vocabularies and the 
+
+Pros:
+* Flexible/extensible
+* Lightweight. This is mostly JSON-LD in Python's clothing.
+* Naturally maps to both `rdflib` and writing `json-ld`
+
+Cons:
+* Discoverability. Documentation and examples are needed to know which attributes to use
+* Error-prone. It is easy to misuse a property, or introduce typos
+* Tight coupling with semantics/RDF. One needs to know a thing or two about RDF, especially if new vocabularies or annotations need to be used.
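
To illustrate the error-prone point above, here is a rough sketch of a check that flags suspicious keys in a dictionary-based node. The `CONTEXT` contents and the `check_keys` helper are hypothetical, and the rules only approximate what a JSON-LD processor does when there is no default vocabulary: plain terms missing from the context produce no triple, while keys with an unknown prefix are read as IRIs with an unusual scheme.

```python
# Hypothetical global context; the prefix and term below are placeholders.
CONTEXT = {
    "vocab": "http://example.org/vocab#",
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
}


def check_keys(node, context=CONTEXT):
    """Return (keys likely dropped, keys with an unknown prefix)."""
    dropped, unknown_prefix = [], []
    for key in node:
        if key.startswith("@") or key in context:
            continue  # JSON-LD keyword, or a term defined in the context
        prefix, sep, suffix = key.partition(":")
        if not sep:
            dropped.append(key)  # plain term without a mapping: no triple
        elif prefix not in context and not suffix.startswith("//"):
            unknown_prefix.append(key)  # treated as an IRI with an odd scheme
    return dropped, unknown_prefix


node = {"@id": ":Entry_1", "vocab:property": 25, "propety": 3, "vocba:property": 4}
print(check_keys(node))  # (['propety'], ['vocba:property'])
```

Both typos pass silently through the dictionary model; only an explicit check like this one, or a test on the resulting graph, would reveal them.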
+
+## The object-oriented way
+ 
+An obvious alternative in an object-oriented language like Python is to use classes to represent our data model.
+These classes can define the specific attributes available, and typing annotations can serve both as a guide for the developer, and as a means to automatically validate objects at runtime.
+There are tools like [pydantic](https://pydantic.dev/) that make this process very simple.
+Then, we only need to define how our models should be serialized into JSON-LD.
+We can thoroughly test this serialization to ensure that the resulting object is correct and produces the right RDF graph.
+Going back to our previous example, we could define an Entry class as a dataclass, and define all the possible types of annotations as attributes.
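
A minimal sketch of what such a class might look like, using a standard-library dataclass instead of pydantic to keep the example self-contained. The attribute names (`text`, `language`), the `vocab:` terms and the `TERMS` mapping are illustrative assumptions, not senpy's real model.

```python
import uuid
from dataclasses import dataclass, field
from typing import ClassVar, Optional


@dataclass
class Entry:
    text: str
    language: Optional[str] = None
    id: str = field(default_factory=lambda: f":Entry_{uuid.uuid4().hex}")

    # The mapping from attributes to vocabulary terms lives in one place.
    TERMS: ClassVar[dict] = {"text": "vocab:text", "language": "vocab:language"}

    def __post_init__(self):
        # Plain dataclasses do not validate types at runtime;
        # a library such as pydantic would take care of this check.
        if not isinstance(self.text, str):
            raise TypeError("text must be a string")

    def jsonld(self):
        doc = {"@id": self.id, "@type": "vocab:Entry"}
        for attr, term in self.TERMS.items():
            value = getattr(self, attr)
            if value is not None:
                doc[term] = value
        return doc


print(Entry(text="hello", language="en").jsonld())
```

Developers only deal with `text` and `language`; the RDF terms are an implementation detail of `jsonld()`.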
+
+This model works great when all the possible attributes are known ahead of time.
+But it starts to break when the model provided is not comprehensive enough, or consumers of your library need to provide their own ad-hoc annotations / attributes.
+This could be solved by encouraging consumers of our library to define their own subclasses whenever they need to add new attributes.
+This works perfectly fine for serialization, but it breaks if your library needs to automatically deserialize these subclasses.
+It also breaks if different parts of the code need to add their own attributes on the same data at the same time.
+This was precisely the case of `senpy`, where entities are annotated by different plugins, each providing a different set of annotations.
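
The deserialization problem can be sketched as follows. The example assumes the library keeps a registry from `@type` values to classes; `REGISTRY`, `register`, `from_jsonld` and the `plugin:SentimentEntry` type are all hypothetical names used only to show the failure mode.

```python
from dataclasses import dataclass

REGISTRY = {}  # maps an @type value to the class able to load it


def register(cls):
    REGISTRY[cls.type_name] = cls
    return cls


@register
@dataclass
class Entry:
    type_name = "vocab:Entry"  # class attribute, not a dataclass field
    text: str = ""


def from_jsonld(doc):
    """Choose the class from @type and build an instance."""
    cls = REGISTRY[doc["@type"]]  # KeyError for types never registered
    return cls(text=doc.get("vocab:text", ""))


print(from_jsonld({"@type": "vocab:Entry", "vocab:text": "hi"}))

try:
    # A subclass defined by a consumer that this process has never imported.
    from_jsonld({"@type": "plugin:SentimentEntry", "vocab:text": "hi"})
except KeyError as err:
    print("cannot deserialize unknown type:", err)
```

Serializing a custom subclass is easy, but the receiving side can only rebuild the types it already knows about.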
+
+Pros:
+* Discoverability. All possible attributes are known ahead of time, including their possible types.
+* Decoupling from RDF. Developers only need to know about the dataclasses provided. The mapping to the RDF world is already encoded in the dataclass.
+
+Cons:
+* Rigidity. Adding new types of annotations requires modifying the models, in the main module.
+* Polymorphism. 
+                                                                                     
+                                                                                                          
+## A hybrid approach
+
+Whichever solution is chosen in the end, it needs to:                                                                 
+                                                                                                                      
+* Make it easy and error-proof to add the most common types of annotations                                            
+* Allow for additional annotations/attributes to be added                                                             
+* Allow for upgrades in the future, i.e., converting the most common custom annotations into built-in ones
+* Allow for deserialization of custom types                                                               
+* Allow multiple consumers to add their own annotations
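
One possible shape that would meet these requirements, offered only as a sketch and not as the design the post settles on: typed attributes for the common cases, plus an open dictionary for everything else. The `extra` field and the `pluginA:`/`pluginB:` terms are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str                                  # built-in, typed attribute
    extra: dict = field(default_factory=dict)  # ad-hoc annotations, keyed by term

    def jsonld(self):
        return {"@type": "vocab:Entry", "vocab:text": self.text, **self.extra}

    @classmethod
    def from_jsonld(cls, doc):
        known = {"@type", "vocab:text"}
        # Anything that is not a built-in attribute is kept as a custom annotation.
        extra = {k: v for k, v in doc.items() if k not in known}
        return cls(text=doc["vocab:text"], extra=extra)


entry = Entry(text="hello")
entry.extra["pluginA:sentiment"] = 0.8  # added by one consumer
entry.extra["pluginB:emotion"] = "joy"  # added by another consumer
print(Entry.from_jsonld(entry.jsonld()) == entry)
```

Promoting a popular custom annotation to a built-in one would then mean adding a typed field and moving its term into the set of known keys.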
+