Below is a snapshot of my understanding of Automerge and related projects as of 2024-07-25.
Preliminaries
Automerge Itself
- Op-based CRDT
- Requires the entire history in order to materialize a view
- On concurrent change, it performs a topsort first to linearise the history (see the sketch after this list)
- This probably lends itself nicely to compression
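As a rough mental model (not necessarily Automerge’s actual internals): materializing a view amounts to topologically sorting the op DAG and then applying the ops in that order. A minimal sketch with hypothetical Op/OpId types:
// Rough sketch: each op carries the ids of its causal parents (deps).
type OpId = string;
interface Op { id: OpId; deps: OpId[]; payload?: unknown }

// Kahn's algorithm: linearise the op DAG so every op comes after its deps.
// Assumes the entire history is present (per the bullet above).
function topsort(ops: Op[]): Op[] {
  const byId = new Map<OpId, Op>();
  const indegree = new Map<OpId, number>();
  const children = new Map<OpId, OpId[]>();
  for (const op of ops) {
    byId.set(op.id, op);
    indegree.set(op.id, op.deps.length);
    for (const dep of op.deps) children.set(dep, [...(children.get(dep) ?? []), op.id]);
  }
  // Ops with no pending deps are ready; sort them so concurrent ops are
  // tie-broken deterministically and every replica linearises the same way.
  const ready = ops.filter((op) => op.deps.length === 0).map((op) => op.id).sort();
  const linear: Op[] = [];
  while (ready.length > 0) {
    const id = ready.shift()!;
    linear.push(byId.get(id)!);
    for (const child of children.get(id) ?? []) {
      const remaining = indegree.get(child)! - 1;
      indegree.set(child, remaining);
      if (remaining === 0) ready.push(child);
    }
    ready.sort();
  }
  return linear;
}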
Nested Docs
- Documents can be linked to each other via an automerge: URL (a small example follows this list)
- This is similar to a WNFS soft link
- Each doc acts as a “unit of change”, but I’m still a bit fuzzy on what that means
- In WNFS we used this distinction as an access control boundary
- I’m increasingly of the opinion that we should do the same here for the access control project
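A minimal sketch of what the linking looks like, assuming the automerge-repo API (Repo, DocHandle.change, DocHandle.url, Repo.find); the exact calls may differ by version, so treat this as illustrative:
import { Repo } from "@automerge/automerge-repo";

// In-memory repo; a real app would also pass network adapters and storage.
const repo = new Repo({ network: [] });

// The child doc is its own unit of change, with its own history.
const child = repo.create<{ title: string }>();
child.change((doc) => { doc.title = "child doc"; });

// The parent only stores the child's automerge: URL (a soft link),
// not the child's content or history.
const parent = repo.create<{ links: string[] }>();
parent.change((doc) => { doc.links = [child.url]; }); // e.g. "automerge:..."

// Following the link is a separate lookup, like resolving a WNFS soft link.
const resolved = repo.find(child.url);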
Compression
- Compression on Automerge docs is performed by run-length encoding (RLE) over a columnar layout
- RLE is not always more efficient(!), and there is no hard rule on how much to chunk by (a toy example follows this list)
- i.e. longer runs are not always preferred
- The Automerge Binary Document Format doesn’t have runs of changes today, but it’s expected to be added
- Maybe this is the Compressed Change Chunk, but I’m not 100% certain
- Runs are not generally directly appendable
- Discovery of the content requires unpacking the document
- This means that two compressed chunks with overlapping content can be discovered to be concurrent if you expose the parent metadata, but you won’t know if one is contained in the other (which may be helpful for subsumption, see also the Leveled Sync Proposal below)
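A toy illustration of the “longer runs are not always preferred” point; this is plain RLE over a single column, not Automerge’s actual column encoding:
// Toy run-length encoder: encodes a column as [value, runLength] pairs.
function rle<T>(column: T[]): Array<[T, number]> {
  const runs: Array<[T, number]> = [];
  for (const value of column) {
    const last = runs[runs.length - 1];
    if (last && last[0] === value) last[1] += 1;
    else runs.push([value, 1]);
  }
  return runs;
}

rle(["a", "a", "a", "a"]); // [["a", 4]]: one pair for four values, a clear win
rle(["a", "b", "a", "b"]); // [["a",1],["b",1],["a",1],["b",1]]: four pairs, larger than the raw column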
Sync System
- Order-of-magnitude (OOM) estimate of the scale:
- 100k docs
- Some 10ks of ops/doc
- Total: ~ 1B ops to keep in sync
- If kept in a tree, tuning the branching factor changes the number of round trips pretty significantly (especially since latency is usually the bottleneck), but the wider the tree, the more redundant metadata needs to be synced (quick arithmetic after this list)
- Binary tree ~ 30 RT
- Hex tree ~ 8 RT
- Practical Rateless IBLT scales with the delta between replica states:
- This is near optimal, often ~1 RT for reconciliation
- …but is much more complex, depending on concepts like invertible Bloom filters, erasure coding, and peel decoding
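Quick arithmetic behind the round-trip numbers, assuming one tree level per round trip over ~1B leaves:
// Depth of a b-ary tree over n leaves, i.e. round trips if each level costs one RT.
const roundTrips = (n: number, branching: number) =>
  Math.ceil(Math.log(n) / Math.log(branching));

roundTrips(1e9, 2);  // 30 (binary tree)
roundTrips(1e9, 16); // 8  (hex tree)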
Primary Sync Proposal
- Builds a Merkle Search Tree on top of the linear history
- When syncing multiple documents:
- Each document is first topsorted
- They are all laid end-to-end
- A Merkle Search Tree is built out of the elements
These diagrams would be better with Graphviz, but we have Mermaid, so imagine that there are blank nodes and whatnot. Also ignore that the doc numbers are backwards :P (Mermaid is not the best at layout.)
flowchart
subgraph doc1
op_1a
op_1b
end
subgraph doc2
op_2a
op_2b
op_2c
end
subgraph doc3
op_3c
op_3b
op_3a
op_3d
end
op_2b --> op_2c
op_2b --> op_3d
op_2b --> op_3b
op_3b --> op_3c
op_3b --> op_3a
op_3d --> op_2a
op_2c --> op_1a
op_2c --> op_1b
doc1 ~~~ doc2 ~~~ doc3
op_3b ~~~ op_3d
- The intention above is that this remains in sequence if you perform an in-order traversal.
- My understanding is that this does not break cleanly at document boundaries.
- I think that the lack of document awareness means that you have to explore all child branches, despite the fact that we know that if you don’t have a parent you definitely don’t have the child node (see the sketch below).
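To make the document-boundary point concrete: in an MST, the layer an element lands in is a function of its hash alone, so ops from different docs interleave freely in the same subtrees. A sketch of the layer assignment, assuming the usual SHA-256 / base-16 construction (which may not be exactly what the proposal uses):
import { createHash } from "node:crypto";

// MST layer = number of leading zero hex digits of the element's hash.
// Nothing here depends on which document the op came from, so tree
// boundaries and document boundaries are unrelated.
function mstLayer(opId: string): number {
  const hex = createHash("sha256").update(opId).digest("hex");
  let layer = 0;
  while (hex[layer] === "0") layer++;
  return layer;
}

mstLayer("doc1/op_1a"); // independent of the "doc1/" prefix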
Idea for Taking Advantage of History
Essentially: use an MST for doc heads, and skip lists for the histories. The above example doesn’t have enough history to really show off the skip lists, but here it is translated (a sketch of the pointer structure follows the diagram).
flowchart
subgraph doc1
op_1b --> op_1a
end
subgraph doc2
op_2a
op_2b --> op_2a
op_2c --> op_2a
end
subgraph doc3
op_3a
op_3b
op_3c
op_3d
op_3d --> op_3c --> op_3b --> op_3a
op_3d -.->|skip| op_3b
end
doc2node --> doc1node
doc2node --> doc3node
doc1node -.-> op_1b
doc2node -.-> op_2b
doc2node -.-> op_2c
doc3node -.-> op_3d
Here’s the relevant video section1 about how we did the skip list part for searching history.
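A minimal sketch of the skip-list shape, with hypothetical HistoryNode fields (the real structure would live in the doc’s change graph): each op keeps a dense pointer to its predecessor plus occasional longer skip pointers, so seeking back through history is roughly logarithmic rather than a linear walk.
// Hypothetical history node: a dense "prev" link plus a sparse "skip" link.
interface HistoryNode {
  seq: number;        // position in the topsorted history
  prev?: HistoryNode; // immediate predecessor
  skip?: HistoryNode; // jumps further back, e.g. to seq - 2^k
}

// Walk back from the head to a target position, taking a skip whenever it
// does not overshoot the target; otherwise fall back to the dense link.
function seek(head: HistoryNode, target: number): HistoryNode | undefined {
  let node: HistoryNode | undefined = head;
  while (node && node.seq > target) {
    node = node.skip && node.skip.seq >= target ? node.skip : node.prev;
  }
  return node && node.seq === target ? node : undefined;
}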
Idea for Efficient Document Store Merge
This is just a stray thought:
Much like (my assumptions about the) Leveled Sync Proposal, it may make sense to add blank nodes to make documents break at natural tree boundaries (i.e. so that documents don’t share inner nodes up to some depth). This could make merging docs very efficient because you would only need to build the top portion of the structure, and not change the structure in relation to the sibling document changes (a sketch of the padding follows the diagram).
flowchart
subgraph doc1
op_1a
op_1b
x11[blank]
x12[blank]
x13[blank]
end
subgraph doc2
op_2a
op_2b
op_2c
end
subgraph doc3
op_3b
op_3a
op_3c
op_3d
x31[blank]
end
op_2b --> op_3b
op_2b --> op_2a
op_2b --> op_2c
op_3b --> op_3a
op_3b --> op_3d
op_3b --> op_3c
op_2b --> op_1a --> op_1b
op_1a -.-> x11
op_1a -.-> x12
op_1a -.-> x13
op_3b -.-> x31
I don’t know if this is really a huge improvement, but it may make it more efficient to maintain doc sync for users that want nonidentical-but-overlapping collections of docs. A downside of this approach is that filling the remaining space with blank nodes artificially inflates the size of the set, making it deeper, but that overhead “should” become smaller as the doc size increases.
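A sketch of the padding idea, assuming a fixed branching factor: round each document’s leaf count up to the next power of the fan-out with blank nodes, so no two documents share an inner node below the collection level.
// Pad a document's leaves with blanks until it fills a subtree of its own,
// so inner nodes are not shared with sibling documents below the collection level.
function padToSubtree<T>(leaves: T[], branching: number, blank: T): T[] {
  let size = 1;
  while (size < leaves.length) size *= branching; // next power of the fan-out
  return [...leaves, ...Array(size - leaves.length).fill(blank)];
}

padToSubtree(["op_1a", "op_1b"], 4, "blank").length;        // 4: two blanks added
padToSubtree(["a", "b", "c", "d", "e"], 4, "blank").length; // 16: the inflation mentioned above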
Leveled Sync Proposal
I think that this was the original idea. There is also a proposal to use leveled compression inside a Merkle structure that’s related to (but not exactly) an MST.
flowchart
subgraph doc1
op_1a
op_1b
op_1c
op_1d
op_1e
run1_abc --> op_1a
run1_abc --> op_1b
run1_abc --> op_1c
run1_de --> op_1d
run1_de --> op_1e
run1_abcde --> run1_abc
run1_abcde --> run1_de
end
subgraph doc2
op_2a
op_2b
op_2c
run2_abc --> op_2a
run2_abc --> op_2b
run2_abc --> op_2c
end
subgraph doc3
op_3a
op_3b
op_3c
op_3d
run3_ab --> op_3a
run3_ab --> op_3b
run3_cd --> op_3c
run3_cd --> op_3d
run3_abcd --> run3_ab
run3_abcd --> run3_cd
end
run12 --> run1_abcde
run12 --> run2_abc
run123 --> run12
run123 --> run3_abcd
I think that in this version, it doesn’t make sense to have runs end on the start of the next document (it’s unlikely that you’ll need to grab the very end of one doc and the very start of the next). In the diagram above, we have the doc runs isolated until we get to the collection level. It may actually make sense for collection sync (if you know the collection that the user wants in advance) to keep generations of runs around so that runs of changes of related documents can get synced together, but that’s fairly half-baked at this stage.
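A sketch of how the run hierarchy above could be built per document, assuming fixed-size chunking at each level and SHA-256 node hashes; the real proposal presumably picks run boundaries by compression level or content rather than a fixed count:
import { createHash } from "node:crypto";

const hashRun = (children: string[]): string =>
  createHash("sha256").update(children.join("|")).digest("hex");

// Build levels of run nodes over a doc's (already compressed) op hashes,
// `span` children per run, until a single doc-level run remains.
function buildRuns(opHashes: string[], span: number): string[][] {
  const levels: string[][] = [opHashes];
  while (levels[levels.length - 1].length > 1) {
    const below = levels[levels.length - 1];
    const above: string[] = [];
    for (let i = 0; i < below.length; i += span) {
      above.push(hashRun(below.slice(i, i + span))); // e.g. run1_abc over op_1a..op_1c
    }
    levels.push(above);
  }
  return levels; // ops at the bottom, the doc-level run at the top
}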
There’s certainly a space/time tradeoff in this version. It’s very efficient to get (at least close to) the level of granularity that you need at runtime on static data, but you need to store the same changes multiple times. Thanks to RLE, this will be less than holding onto full copies of the changes, but there’s certainly some amplification.
We did something vaguely similar in WNFS, which we called a “skeleton”: a materialized view of the subdirectory structure. This made many things MUCH faster at runtime, but came at the cost of being brittle and (aside from SNARKing) impossible to use in a trustless setting (it would need to be BFT).