Systematize types for documents and links between them #318

Closed
epatters opened this issue Jan 6, 2025 · 7 comments · Fixed by #369
Labels
backend Backend, including web server and database strategic Design/architecture work and prior discussion required

Comments

@epatters
Member

epatters commented Jan 6, 2025

We now have several different kinds of documents, such as model documents, diagram documents, and analysis documents. All of these documents are JSON-able and stored as document refs in the database. Moreover, there can be links/references/foreign keys between them. For example, a diagram references the model that it is a diagram in. These links need not be at the top level of the JSON. Soon models will be able to import other models using special cells in the notebook.

In preparation for the v0.2 release, I would like to better systematize the notions of document and link in the hopes of avoiding breaking changes or ad hoc workarounds later.

In the original Node backend, @olynch had a notion of extern references in documents. I removed it when I rewrote the backend in Rust (#211) because we weren't using it for anything yet and so I wasn't sure whether/how I should re-implement it. However, it persists in the frontend type defs as ExternRef. I'd like to bring something like this back.

Working design

Certain special JSON keys, prefixed by the @ symbol, will be recognized by the backend and hoisted into a typed graph (actually, I would rather think of it as a model of some double theory) in the SQL DB. The objects will be documents:

interface Document<T extends string> {
  "@id": UUID;
  "@type": T;
}

Before a document is stored in a JSONB column, these special keys will be stripped and stored in their own columns.
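
A minimal sketch of this stripping step, assuming that only the top-level "@id" and "@type" keys are hoisted and that everything else goes into the JSONB column. The names `splitDocument` and `StoredDocument` are hypothetical, and since the actual backend is written in Rust, this TypeScript is illustrative only:

```typescript
// Hypothetical helper, not part of the CatColab codebase.
type UUID = string;

interface StoredDocument {
  id: UUID; // value for a dedicated "id" column (assumed column name)
  type: string; // value for a dedicated "type" column (assumed column name)
  content: Record<string, unknown>; // remaining JSON for the JSONB column
}

// Separate the special top-level keys from the rest of the document.
function splitDocument(doc: Record<string, unknown>): StoredDocument {
  const { "@id": id, "@type": type, ...content } = doc;
  if (typeof id !== "string" || typeof type !== "string") {
    throw new Error('document must have string "@id" and "@type" keys');
  }
  return { id, type, content };
}

const stored = splitDocument({
  "@id": "00000000-0000-0000-0000-000000000001",
  "@type": "diagram",
  name: "Evan's diagram",
});
console.log(stored.content); // only the non-special top-level keys remain
```

Whether the special keys inside nested links are also stripped from the stored JSON is not settled by the text above; this sketch leaves nested values untouched.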

The arrows will be links:

interface Link<T extends string> {
  "@repo": string;
  "@id": UUID;
  "@type": T;
}

The database will have a new links table into which the special keys are hoisted. Note that @type here is the type of the link, not of the target document.

For example, the document type for a diagram will look like:

interface DiagramDocument extends Document<"diagram"> {
  name: string;
  inModel: Link<"in_model">;
  notebook: Notebook<DiagramJudgment>;
}

with a typical instance:

const diagram: DiagramDocument = {
  "@id": [...],
  "@type": "diagram",
  name: "Evan's diagram",
  inModel: {
    "@repo": "next.catcolab.org",
    "@id": [...],
    "@type": "in_model",
  },
  notebook: [...]
};

Prior art

People have obviously thought about linking JSON before. In fact, I was rather annoyed that this wasn't properly standardized a decade ago. The best known candidates are:

  • JSON-LD (v1.0, v1.1): The encoding above is inspired by JSON-LD, but I'm not using it because JSON-LD tries to do two unrelated things: (1) allow JSON documents to be unambiguously linked and (2) use JSON as a surface syntax for RDF. The former is basically what I want but the latter is incoherent because the hierarchical structure of JSON and the graph structure of RDF are fundamentally incompatible. The spec has unacceptable assertions such as: "Unless otherwise specified, arrays are unordered in JSON-LD."
  • JSON Reference: very simple and used in JSON Schema, but too simple for my purposes since there is no provision for typing links.
@epatters epatters added backend Backend, including web server and database strategic Design/architecture work and prior discussion required labels Jan 6, 2025
@epatters epatters moved this to Backlog in CatColab v0.2 Jan 6, 2025
@olynch
Collaborator

olynch commented Jan 6, 2025

For compositional modeling, we will want to:

  1. Link to other documents from notebook cells.
  2. Durably link to other documents (e.g., link to specific snapshots rather than UUIDs). Durable linking is essential for cross-team collaboration in the same way that software versioning is; if all you have is UUID refs then that's essentially treating all of catcolab like a monorepo, where changes to documents must be centrally coordinated.

Do you have an idea of how this fits in with your proposal for links?

@KevinDCarlson
Collaborator

I hadn't seen this when working on #332 but was thinking of doing a much less sophisticated version while working on that, so I'm glad it's coming to the fore from multiple directions.

  • You don't think a Document should have any more fields? I guess Analysis notebooks don't currently have names, although I'm not sure I like that... Permissioning info should be common across all documents, though, shouldn't it?

  • How thought-out is the idea to let diagrams point to models in an arbitrary different repo? That's thrilling but kind of scary.

  • Re Owen's 2), should a link just go to a snapshot instead of a document? But snapshots don't have UUIDs. Should they? If that's tempting, we ought to change it this week.

  • It seems like for Owen's 1), we'll just stick links inside of cells; you already mention you don't expect them to be at top-level. However, I don't understand quite what the Links table will look like as a link doesn't seem to have a source right now. If it gets a source UUID referring to potentially either a cell or a document, that could be interesting. (But do cells have UUIDs yet?)

@epatters epatters moved this from Ready to In progress in CatColab v0.2 Jan 29, 2025
@epatters
Member Author

epatters commented Jan 29, 2025

Thanks both for these comments! I'm getting back to this now as I seek to pin down and then implement these changes before the v0.2 release.

Responding to @olynch's comments:

For compositional modeling, we will want to:

  1. Link to other documents from notebook cells.

Absolutely. My thinking for putting the @ symbol before id was that a link to another document can occur anywhere in the source document, such as in a notebook cell, and we want the backend to be able to recognize and process such links without having to know the document schema. This will be familiar to you, Owen, because your original design worked the same way, except that IIRC you used __extern__ instead of @id. (Bike-shedding here, I'm using the latter because JSON-LD does and also it's a bit shorter, while still standing out from "normal" keys in a JSON object.)
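
A rough sketch of how links could be recognized anywhere in a document without knowing its schema, assuming that an object counts as a link when it carries string-valued "@id", "@repo", and "@type" keys. The helper names and the cell shapes in the example are hypothetical:

```typescript
type UUID = string;

interface Link {
  "@id": UUID;
  "@repo": string;
  "@type": string;
}

// Assumption: a link is any object with string "@id", "@repo", and "@type" keys.
// A document itself has no "@repo" key, so it is not mistaken for a link.
function isLink(value: unknown): value is Link {
  if (typeof value !== "object" || value === null) return false;
  const obj = value as Record<string, unknown>;
  return (
    typeof obj["@id"] === "string" &&
    typeof obj["@repo"] === "string" &&
    typeof obj["@type"] === "string"
  );
}

// Walk arbitrary JSON and gather every link, wherever it is nested.
function collectLinks(value: unknown, found: Link[] = []): Link[] {
  if (isLink(value)) {
    found.push(value);
  } else if (Array.isArray(value)) {
    value.forEach((item) => collectLinks(item, found));
  } else if (typeof value === "object" && value !== null) {
    Object.values(value).forEach((item) => collectLinks(item, found));
  }
  return found;
}

// Example document with a link buried inside a (made-up) notebook cell.
const exampleDoc = {
  "@id": "00000000-0000-0000-0000-000000000001",
  "@type": "model",
  notebook: {
    cells: [
      { tag: "rich-text", content: "no links here" },
      {
        tag: "import",
        target: {
          "@id": "00000000-0000-0000-0000-000000000002",
          "@repo": "next.catcolab.org",
          "@type": "import",
        },
      },
    ],
  },
};

const links = collectLinks(exampleDoc);
console.log(links.length); // 1
```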

  2. Durably link to other documents (e.g., link to specific snapshots rather than UUIDs). Durable linking is essential for cross-team collaboration in the same way that software versioning is; if all you have is UUID refs then that's essentially treating all of catcolab like a monorepo, where changes to documents must be centrally coordinated.

Thanks for this reminder. This is very important but was omitted in my first attempt in the OP. I will address this in my next attempt below.

@epatters
Member Author

epatters commented Jan 29, 2025

Responding to @KevinDCarlson's comments:

You don't think a Document should have any more fields? I guess Analysis notebooks don't currently have names, although I'm not sure I like that... Permissioning info should be common across all documents, though, shouldn't it?

I'm on board with adding a name/title to the base Document type, both because I agree that documents, including analyses, should always have the possibility of getting human-readable names, and also because these names can then be processed by the backend (such as in your #347) without relying on tacit assumptions about whether the field is called name or title or whatever.

I can't think of any other user-specified "universal" fields for documents. Can you?

How thought-out is the idea to let diagrams point to models in an arbitrary different repo? That's thrilling but kind of scary.

It is not at all thought out. I was thinking that, even before getting to cross-repo references, it would be useful to have this in conjunction with your import/export functionality so that there is a record of where stuff came from. E.g., if you try to import a diagram pointing at a model in catcolab.org into next.catcolab.org, you can get an informative error instead of it failing to find the model without any hint as to why it's missing.
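
As a sketch of what such an informative error might look like, assuming the importer knows the name of the repository it is importing into. The function `describeRepoMismatch` and the wording of the message are hypothetical:

```typescript
interface Link {
  "@id": string;
  "@repo": string;
  "@type": string;
}

// Hypothetical check run on each link while importing into `currentRepo`.
// Returns null when the link is local, otherwise a human-readable explanation.
function describeRepoMismatch(link: Link, currentRepo: string): string | null {
  if (link["@repo"] === currentRepo) return null;
  return (
    `Link of type "${link["@type"]}" targets document ${link["@id"]} ` +
    `in repository "${link["@repo"]}", but the import is into "${currentRepo}".`
  );
}

const modelLink: Link = {
  "@id": "00000000-0000-0000-0000-000000000002",
  "@repo": "catcolab.org",
  "@type": "in_model",
};

// Importing into a different repository yields an explanation instead of
// an unexplained "model not found" failure.
console.log(describeRepoMismatch(modelLink, "next.catcolab.org"));
```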

Re Owen's 2), should a link just go to a snapshot instead of a document? But snapshots don't have UUIDs. Should they? If that's tempting, we ought to change it this week.

See my updated proposal below.

It seems like for Owen's 1), we'll just stick links inside of cells; you already mention you don't expect them to be at top-level.

Right.

However, I don't understand quite what the Links table will look like as a link doesn't seem to have a source right now. If it gets a source UUID referring to potentially either a cell or a document, that could be interesting. (But do cells have UUIDs yet?)

I was thinking that the source of the link is implicitly the document that contains the link declaration, while the target of the link is the document explicitly given in the link. Just like links in the World Wide Web!
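
To make the implicit source concrete, here is a small sketch of how rows for a links table might be built, with the source filled in from the containing document. The `LinkRow` shape and its column names are assumptions, not the actual schema:

```typescript
type UUID = string;

interface Link {
  "@id": UUID;
  "@repo": string;
  "@type": string;
}

// Hypothetical row shape for the links table.
interface LinkRow {
  source: UUID; // the document containing the link declaration
  target: UUID; // the document named explicitly in the link
  targetRepo: string;
  linkType: string;
}

// The source never appears in the link itself; it is supplied by the container.
function toLinkRows(sourceId: UUID, links: Link[]): LinkRow[] {
  return links.map((link) => ({
    source: sourceId,
    target: link["@id"],
    targetRepo: link["@repo"],
    linkType: link["@type"],
  }));
}

const rows = toLinkRows("00000000-0000-0000-0000-000000000001", [
  {
    "@id": "00000000-0000-0000-0000-000000000002",
    "@repo": "next.catcolab.org",
    "@type": "in_model",
  },
]);
console.log(rows);
```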

@epatters
Member Author

OK, here is v2 of my proposal:

interface Document<T extends string> {
  /** Unique identifier of the document ref in the database. */
  "@id": UUID;

  /** Type of the document, such as "model" or "analysis". */
  "@type": T;

  /** Human-readable name of document. */
  name?: string;
}

/** A link between documents. */
interface Link<T extends string> {
  /** Unique identifier of the target document. */
  "@id": UUID;

  /** Repository to which the target document belongs. */
  "@repo": string;

  /**
   * Version of the target document.
   *
   * If null, refers to the head snapshot of the document and thus the linked
   * document is "live."
   */
  "@version": string | null;

  /** Type of the link, such as "diagramIn" or "analysisOf". */
  "@type": T;
}

At this stage, I am imagining that versions are a thing that will be (optionally) attached to snapshots of a document, but I'm not committing to what a version will be, besides being representable as a string. It could be another UUID, it could be a version number, or it could be a "SemVer" string if we think that even makes sense in this context.
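
One way the "@version" field could be interpreted when following a link, sketched under the assumption that each document keeps a head snapshot plus a list of snapshots, some of which carry an optional version string. The types `Snapshot` and `DocumentHistory` and the function `resolveSnapshot` are hypothetical:

```typescript
// Hypothetical shapes; versions are opaque strings, as in the proposal.
interface Snapshot {
  version?: string; // optionally attached to a snapshot
  content: unknown;
}

interface DocumentHistory {
  head: Snapshot;
  snapshots: Snapshot[];
}

// null means "live": follow the head snapshot. Otherwise look up the version.
function resolveSnapshot(history: DocumentHistory, version: string | null): Snapshot {
  if (version === null) return history.head;
  const found = history.snapshots.find((s) => s.version === version);
  if (found === undefined) {
    throw new Error(`no snapshot with version "${version}"`);
  }
  return found;
}

const history: DocumentHistory = {
  head: { content: { name: "latest draft" } },
  snapshots: [{ version: "v1", content: { name: "first release" } }],
};

console.log(resolveSnapshot(history, null) === history.head); // true
console.log(resolveSnapshot(history, "v1").version); // "v1"
```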

@KevinDCarlson
Collaborator

What about publicity and permissions fields on a general document?

Also, it seems a little funny that the document type and link types are just any stringy thing. Wouldn’t it be nicer to have them be terms of an enum we extend as needed, or is that inapplicable for some reason here? Are we anticipating user-generated new whole types of docs or links? I suppose if we implement a theory of document graphs, maybe users will be modifying this on their own, but it feels a little far-fetched…

Otherwise I feel good!

@epatters
Member Author

epatters commented Jan 29, 2025

What about publicity and permissions fields on a general document?

This information is stored in the database outside the JSON blob comprising the document, so there's no need to put it into the document itself. If anything, it's risky to do so since any information stored in two places can easily become inconsistent.

Also, it seems a little funny that the document type and link types are just any stringy thing. Wouldn’t it be nicer to have them be terms of an enum we extend as needed, or is that inapplicable for some reason here? Are we anticipating user-generated new whole types of docs or links? I suppose if we implement a theory of document graphs, maybe users will be modifying this on their own, but it feels a little far-fetched…

Good point, and I might do something like that while resolving this issue if it feels natural. Fortunately, though, whether the type fields are arbitrary strings or a TypeScript enum of allowed strings is a matter solely concerning the frontend type defs and does not affect how the data is stored as JSON in the DB. So I'm much less worried about pinning that down right now since we can change it later without touching the data itself.
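
A small sketch of why this choice stays on the frontend: a union of string literals (one possible stand-in for an enum) serializes to exactly the same JSON as a plain string. The name `DocumentType`, the guard `isDocumentType`, and the particular set of allowed values are assumptions for illustration:

```typescript
// Hypothetical closed set of document types, expressed as a string-literal union.
type DocumentType = "model" | "diagram" | "analysis";

const DOCUMENT_TYPES: readonly string[] = ["model", "diagram", "analysis"];

// Runtime check for data coming back from the DB, where types are plain strings.
function isDocumentType(value: string): value is DocumentType {
  return DOCUMENT_TYPES.indexOf(value) !== -1;
}

// Typing the field loosely or strictly does not change the serialized data.
const loose = { "@type": "diagram" as string };
const strict = { "@type": "diagram" as DocumentType };
console.log(JSON.stringify(loose) === JSON.stringify(strict)); // true

console.log(isDocumentType("diagram")); // true
console.log(isDocumentType("spreadsheet")); // false
```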
