A Distributed and Extensible Relational Data Model

In this post, I will describe the design of a relational data model
  • that can be distributed (available, partition tolerant, eventual consistent).
  • that can be dynamically extended by the applications operating on it.
  • where the cardinality of a relation is decided at query time.
  • where relations are dynamically typed and queries are statically typed.
  • where all past values of relations are kept in the data model, to make it possible to revisit them at a later time.
  • where hashes are used to reference values, which make them location independent and makes it possible to store values redundant.
  • that is built on simple principles and easy to implement using conventional technology.


In a relational data model, it is the associations between the entities (objects, things) that we are interested in. They are collected in relations. This representation is complementary to the object-oriented (hierarchical, document) data model, where objects (records, documents) contain references to other objects. One advantage of the relational data model is that it doesn't impose any particular access pattern, as relations can be navigated in any direction [1].

The entities in the relational data model we are about to construct are represented by unique identifiers, which are used to track the entities across multiple relations. An entity is defined by the relations it is part of. It has no other attributes or properties in addition to this.

To illustrate this, imagine that we want to keep track of the persons which appears in the photos in our digital photo collection. For this we need the table appears_inEE.

CREATE TABLE appears_inEE (e1 uuid, e2 uuid);

The table is a binary relation between 2 entities indicated by the letters 'EE' at the end of the name.

Here we can stop for a while and remeber that a table is not a list of objects, structs or records. Instead, it represents a relation between things.

As we can see, there is nothing in the table above that mentions persons or photos. Any type of entities can be related in this table. This could cause us trouble further on when constructing queries that should only give us persons and photos. Therefore, we will also create tables that represent the unary identity relation for each type of entity.

CREATE TABLE personE (e uuid);
CREATE TABLE photoE (e uuid);

So far, we have shown how to relate entities to form a graph. To go beyond the pure graph and fill the model with some content, we need to be able to relate entities with values (attributes) as well. A value is a sequence of bytes which is given meaning through its type. A sequence of bytes can be hashed to give us a uniform reference to any value [2].

We go back to the persons and photos, and decide that a person should have a name encoded in utf8 (text/plain) and that a photo should be associated with a jpeg file (image/jpeg). A SHA256 hash in hex format is 64 characters long.

CREATE TABLE nameEV (e uuid, v char(64));
CREATE TABLE imageEV (e uuid, v char(64));

CREATE TABLE text_plainV (v char(64));
CREATE TABLE image_jpegV (v char(64));

Still, we have no actual values in the model, just references to them. This is solved by adding a relation between the hash and the value.

CREATE TABLE text_value (v char(64), t text);
CREATE TABLE binary_value (v char(64), b bytea);

There will be one value relation for each native database type we want to use.

Large binary data may not be suitable to store in the database. It can also be more practical to fetch an image or document from a simple web server instead of the database. It depends on the system architecture. For these values, we introduce a relation to associate the value with an URL. The URL is itself a value with type text/plain.

CREATE TABLE locationVV (v1 char(64), v2 char(64));

The mapping of mime types to database types and whether values of a certain type are stored in the database or not is knowledge that needs to be shared among the various parts in the system that interacts with the relational model.

The arrangement to use hashes as references to values allow us to store values redundant, duplicated at several locations. Typically, the locationVV relation is populated by crawlers that find and track values stored in archives, on the web or local.

A legitimate concern with the proposed data model is the many joins required to put together a record with all values and associations for an entity. The overall performance of the data model may not be as good as a more complex relational model, but we are willing to trade performance for simplicity in this case. The goal is to allow the data model to grow dynamically, using just a few core concepts. To simplify the construction of queries, the plan is to design a specialized query language that can take advantage of the symmetries in the data model.

We observe that the relation appears_inEE has the cardinality many-to-many. How other cardinalities should be represented will be addressed later. First, we will take a step back and see how we can create and update relations.

A Distributed Event Log

Using the log at LinkedIn [3][4] as a source of inspiration, we will use a log of events as the original source of information flowing into the database. The log is kept in a directory in the filesystem. A log entry is one file, containing a JSON array of objects. Each event can either associate or dissociate entities and values using named relations.

    "e": "daa8665c-df7c-4f09-959f-27f71eb65884",
    "r": "personE",
    "o": "associate",
    "t": "2015-01-27T15:28:50.445170+00:00"
    "v": "245536acb08b0958f1aa78d391d5efc773e60fa2c39d942318aa04d6085bef40",
    "t": "Marcus",
    "r": "text_value"
   "v": "245536acb08b0958f1aa78d391d5efc773e60fa2c39d942318aa04d6085bef40",
    "r": "text_plainV",
    "o": "associate",
    "t": "2015-01-27T15:28:50.445170+00:00"
    "e": "daa8665c-df7c-4f09-959f-27f71eb65884",
    "v": "245536acb08b0958f1aa78d391d5efc773e60fa2c39d942318aa04d6085bef40",
    "r": "nameEV",
    "o": "associate",
    "t": "2015-01-27T15:28:50.445170+00:00"
    "e": "0b7a6027-756d-4f5e-ae1d-e36c593b5e95",
    "r": "photoE",
    "o": "associate",
    "t": "2015-01-27T15:28:50.445170+00:00"
    "e1": "daa8665c-df7c-4f09-959f-27f71eb65884",
    "e2": "0b7a6027-756d-4f5e-ae1d-e36c593b5e95",
    "r": "appears_inEE",
    "o": "associate",
    "t": "2015-01-27T15:28:50.445170+00:00"

With a modest amount of mental effort, it is possible to map this notation to the relations defined earlier. JSON attributes e, e1, e2 are entities and v, v1, v2 are values. The first log entry that introduces a new relation will effectively create the corresponding table, indexes and views, allowing the schema to grow dynamically. The JSON attribute 'r' names the relation in question. The JSON attribute 'o' names the operation to perform on the relation, where 'associate' adds the association to the relation. The counteraction is 'dissociate', which removes the association from the relation. The log entry is fixed in time with the JSON attribute 't'. (We can also see that values are logged using a simpler relation without time and operation. The hash of a value simply exists as an immutable fact.) All of these attributes come together to create an append-only relation in the database.

CREATE TABLE appears_inEEall (e1 uuid, e2 uuid, o text, t timestamp);

From this, we derive the appears_inEE relation. The plan is to include a pair (e1', e2') only if the number of rows with o='associate' for the pair is more common than the number of rows with o='dissociate'. Expressed alternatively, there must be more log entries that want to associate e1' and e2', than there are log entries that want to dissociate the two.

CREATE VIEW appears_inEE AS (
   SELECT e1, e2 FROM (
      SELECT DISTINCT ON (e1, e2) e1, e2, o FROM appears_inEEall
         GROUP BY o, e1, e2
         ORDER BY e1, e2, COUNT(o) DESC, o DESC) AS c
   WHERE o='associate');

The query groups the related entity pair with the operation in the inner SELECT and counts how many times each combination occurs. It then sorts them in descending order to get the most frequently occurring combination at the top. When the number of 'associate' and 'dissociate' operations are equal for a pair, 'dissociate' will end up on the top by a descending sort. It then picks the top row for each pair with DISTINCT ON (e1, e2). As a final touch, the outer SELECT will only keep the rows where the operation is 'associate'.
The reason for this arrangement is to let the database be distributable while trying to reach eventual consistency. We want to distribute the database to get data redundancy rather than load balancing. By synchronizing the event log directory between the participating nodes on the file level [5][6], changes from different nodes are merged. Just make sure the file names of the log entries are unique, for example a hash of the contents. I know I am out on thin ice here, but I believe this corresponds to an Available and Partition tolerant system in the context of the CAP theorem. The distributed system as a whole does not guarantee Consistency, even though it will eventually (sometimes) be consistent. The limitation of this arrangement is that the nodes must be cooperating and can't compete for limited resources. You need distributed transactions or transactions on a common write node to be able to do that.

Another benefit is the ability to go back in time, by creating a view like appears_inEE with a time constraint. Yet another benefit is the possibility for an application to read the entire history of how a relation has evolved to become what it is today, and make it possible to undo changes.

The data doesn't have to be merged into a single log. It is possible to partition the log locally and let the data be merged in the database. You can partition on where the data comes from, who is the author, what domain it belongs to or any other reason. This makes it easy to incorporate, discard and recombine information from different sources in the local database.

There is one subtle consequence of merging logs from different nodes, and that is that the history of a node may be altered. If you are offering the ability to move back in time, then this may be important to take into consideration. It is the timestamp in the log record that causes this, that events not necessarily happens in chronological order at a node, but it is also this timestamp that make the relations consistent across all nodes. There is another sequence of timestamps available that records events in chronological order for a node and that is the individual log file creation times. It may be better to use this timeline when moving back in time, but remember that it is not consistent across nodes, but unique to the node.


So far we have defined a many-to-many relation. We would also like to define one-to-many and one-to-one relations. This can be done using the recorded timestamp to only select the most recent association. We will begin with a redefinition of the many-to-many relation.

CREATE VIEW appears_inEEcnn AS (
   SELECT e1, e2, t FROM (
      SELECT DISTINCT ON (e1, e2) e1, e2, o, max(t) as t FROM appears_inEEall
         GROUP BY o, e1, e2
         ORDER BY e1, e2, COUNT(o) DESC, o DESC) AS c
   WHERE o='associate');

CREATE VIEW appears_inEEc1n AS (
   SELECT DISTINCT ON (e2) e1, e2, t FROM appears_inEEcnn
   ORDER BY e1, t DESC);

CREATE VIEW appears_inEEcn1 AS (
   SELECT DISTINCT ON (e1) e1, e2, t FROM appears_inEEcnn
   ORDER BY e2, t DESC);

CREATE VIEW appears_inEEc11 AS (
   SELECT e1, e2, t from appears_inEEc1n
   SELECT e1, e2, t from appears_inEEcn1);

Worth mentioning here is if a relation is used in different contexts, for example the relation portrait_ofEE which relates both families and photos, and persons and photos, the relations with lower cardinality are type agnostic, meaning that a photo can only be associated one-to-one with either a person or a family.

Another observation is that for one-to-many relations, when we use the 'associate' operation to overwrite the side with cardinality one, the previous association will be pushed away into a stack of old associations. Later, when the current association is 'dissociated', the association on top of the stack will reappear. An example would be a person associated with the name 'Per' and later corrected with a new association with the name 'Pär'. When 'Pär' is dissociated, 'Per' will reappear as the associated name. To avoid this, the application will have to dissociate the old name when the name changes. This procedure also makes the many-to-many relation identical to the one-to-many relation.

To undo the effects of a log entry, just change 'associate' to 'dissociate' and vice versa to form the inverse log entry.

Case Study: the Family Album

Below is the auto generated schema for a relational model representing a family album.

Wrapping Up

We have designed a distributed relational data model, dynamically extensible by users through a replicated event log. Relations associate entities, identified by UUID, with other entities and/or values, identified by their hash. Values are stored in remote or local archives and are accessed using URLs. Values can also be stored in the event log and be cached in the data model. Users can access historical values of a relation, alongside many-to-many, one-to-many and one-to-one interpretations of the current value of the relation.

The architecture supports a number of variations:
  • Log Reader and File Crawler can work over the web.
  • Log synchronization must not be bidirectional.
  • Data sets from different domains can easily be merged.
  • It is always possible to drop and recreate the relational database from the Event Log. This may take a long time though, if there are many events.
And going into unknown territory, we might ponder:
  • How do we merge different uuids that represent the same thing? (The aka20 relation!)
  • How far will it scale? Are there any bottlenecks?
  • What would a domain specific query language look like? Compiled to SQL.
  • Can the Web Application itself be stored in the Archive? Can it be modified in the GUI? Can it compile replacement parts of itself from source at runtime? Integrated make. Can the compiler and the build system also be stored in the archive and be modifiable? Can we undo if the ability to change the application breaks due to changes to its development environment? What if undo also breaks? Which interpreter do we target?

Post Scriptum

And when I have dwelled on a problem for a long time and reached a solution, I look out on the world and find that the problem was already solved by others, I just couldn't recognize it before I understood it within my own frame of reference. Entity systems [7] are similarly designed, but the database in a game needs to be denormalized to be fast and can't have all the dynamic properties of my solution. I also remember there was a guy in the nineties arguing against OOP on Usenet. He promoted tabular programming. Now I realize he really was on to something.


[1] Out of the Tar Pit, Ben Moseley and Peter Marks
A text in line with my own thoughts on how to design information systems. In particular, I found section 7 particularly familiar, where they describe how to turn informal requirements into formal requirements, which are then executed. I have written about this earlier here and here.

[2] The Mess we're in, Joe Armstrong
36 minutes into the video, Joe introduces the idea to use the hashes as references to values.

[3] The Log: What every software engineer should know about real-time data's unifying abtraction, Jay Kreps
Inspiring article about the usefulness of the log.

[4] Turning the Databse Inside-out with Apache Samsa, Martin Kleppman
More thoughts around the log and materialized views.

[5] http://syncthing.net/
File replication software.

[6] http://www.getsync.com/
File replication software.

[7] Entity-Component-System
Solving the mess with inheritance.

No comments:

Wifi Remote Control

I have started to automate my home and thought that a dedicated remote control would be useful as an alternative to web pages and phone apps...