This issue is the home for the design discussion that came out of PR #1196's review.
Background
PR #1196 introduces the corpus-mutation extension system with a single entry point: a script defines a transform_corpus(corpus) global, and the host calls it. This is the simplest possible shape (let's call it Option 1) and it's what shipped in that PR.
During review, Alan raised the question of whether this shape is appropriate, particularly as we add:
- More lifecycle hooks (e.g., before_extract, after_render).
- Extensions that register multiple capabilities from one file (a corpus transform and a Handlebars helper).
- Extensions that bundle non-code files (templates, assets, configuration).
- Extension enabling/disabling.
The PR ships Option 1. This issue captures the trade-off analysis and lays out the ladder of richer options we can add as needs arise.
The four entry-point patterns
Option 1: Reserved function names
Script defines a function with a known name (transform_corpus); the host calls it.
Examples: pytest, Sphinx.
Trade-off: minimal syntax. The host owns the names, so one file can expose many different known hooks, but a script can't add several capabilities of the same kind under names it picks itself.
My evaluation: right starting point for Mr. Docs, and it scales further than it looks. pytest recognizes dozens of pytest_* hooks, and a single conftest.py (or one Sphinx setup()) routinely defines many of them together. So "many capabilities in one file" is not a reason to leave this rung; the only thing it genuinely can't do is let a script name its own capabilities.
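A minimal sketch of how a host might discover reserved names, written in Python purely for illustration (it says nothing about the scripting language or API that PR #1196 actually uses). `RESERVED_HOOKS`, `load_script`, and `discover_hooks` are hypothetical names; of the hook names, only transform_corpus ships in the PR, and the other two are borrowed from the lifecycle examples in the Background.

```python
from types import ModuleType

# Hypothetical hook list: only transform_corpus exists in PR #1196.
RESERVED_HOOKS = ("before_extract", "transform_corpus", "after_render")


def load_script(source: str, name: str = "extension") -> ModuleType:
    """Execute the extension source in a fresh module namespace."""
    module = ModuleType(name)
    exec(compile(source, name, "exec"), module.__dict__)
    return module


def discover_hooks(module: ModuleType) -> dict:
    """Collect every reserved-name callable the script happens to define."""
    hooks = {}
    for hook_name in RESERVED_HOOKS:
        fn = getattr(module, hook_name, None)
        if callable(fn):
            hooks[hook_name] = fn
    return hooks


# One file, two different reserved hooks, no registration ceremony.
SCRIPT = '''
def transform_corpus(corpus):
    corpus["symbols"] = [s for s in corpus["symbols"] if not s.startswith("detail::")]
    return corpus

def after_render(page):
    return page.strip()
'''

hooks = discover_hooks(load_script(SCRIPT))
corpus = hooks["transform_corpus"]({"symbols": ["foo", "detail::bar"]})
```

The script can define as many of the host's names as it likes, but it has no way to introduce a name of its own: anything outside `RESERVED_HOOKS` is simply ignored.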
Option 2: Top-level registration calls
Script calls host.register_*(fn) in top-level code; the host stores the registration and invokes the callback at the right time.
Examples: Darktable, LLVM/Clang plugins.
Trade-off: the script passes a name, so it can add capabilities the host never pre-named, like several generators with author-chosen names from one file. The price: the host must run the script just to learn those names, then keep each registered callback alive until it calls it. (GDB's pretty-printers work this way: register_pretty_printer stores the callable, and GDB invokes it later, once per value.)
My evaluation: not needed yet. Reserved names already give us "many capabilities per file", so that's not the reason to climb. The real trigger is wanting scripts to name their own capabilities, most concretely one extension that adds several named generators. Until that's a concrete need, paying the run-at-discovery and keep-alive cost is premature.
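To make the cost concrete, here is a rough sketch of the registration shape, again in Python for illustration only. The `Host` class and its methods are assumptions; register_generator is the name-bearing call this issue floats as a possibility, not an existing API.

```python
class Host:
    """Hypothetical host object handed to the script."""

    def __init__(self):
        self._generators = {}

    def register_generator(self, name, fn):
        if name in self._generators:
            raise ValueError(f"generator {name!r} already registered")
        # The host must keep the callback alive until it is invoked.
        self._generators[name] = fn

    def generator_names(self):
        return sorted(self._generators)

    def generate(self, name, corpus):
        return self._generators[name](corpus)


# The script picks its own names and registers them in top-level code.
SCRIPT = '''
def to_markdown(corpus):
    return "\\n".join(f"- `{s}`" for s in corpus["symbols"])

def to_plain(corpus):
    return ", ".join(corpus["symbols"])

host.register_generator("markdown", to_markdown)
host.register_generator("plain", to_plain)
'''

host = Host()
# Nothing is known about the extension until its top-level code has run.
exec(SCRIPT, {"host": host})
```

Before the `exec` call, `host.generator_names()` is empty: the host cannot list what an extension offers without executing it, which is the run-at-discovery cost described above.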
Option 3: Reserved register function + event emitter
Scripts export one reserved name (register); inside, it subscribes to host events.
Example: Antora.
Trade-off: single reserved name + familiar event pattern; adds an emitter abstraction layer.
My evaluation: probably not the right rung for Mr. Docs. Antora's pattern fits a pipeline with many extension points throughout the build; we have fewer. The emitter abstraction is overhead for our shape. We could skip Option 3 entirely and go from Option 2 straight to Option 4 if and when needed, which is why the ladder below has no rung for it.
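For comparison, the emitter shape could look roughly like this. It is a Python sketch under assumed names: the `Emitter` class and the `corpus_ready` and `page_rendered` events are invented for illustration and are not taken from Antora or from Mr. Docs.

```python
from collections import defaultdict


class Emitter:
    """Hypothetical event emitter owned by the host."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def on(self, event, fn):
        self._listeners[event].append(fn)

    def emit(self, event, payload):
        # Listeners mutate the payload in place; the host gets it back.
        for fn in self._listeners[event]:
            fn(payload)
        return payload


# The script exposes one reserved name and subscribes inside it.
SCRIPT = '''
def register(events):
    events.on("corpus_ready", lambda corpus: corpus["symbols"].sort())
    events.on("page_rendered", lambda page: page.setdefault("footer", "generated"))
'''

namespace = {}
exec(SCRIPT, namespace)

events = Emitter()
namespace["register"](events)  # the only name the host looks up

corpus = events.emit("corpus_ready", {"symbols": ["b", "a"]})
page = events.emit("page_rendered", {"body": "..."})
```

The extra layer is visible here: the host needs an emitter, an event vocabulary, and payload conventions before a script can do anything that a reserved function name would give it directly.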
Option 4: Manifest + accompanying code
An extension is a directory: a manifest file (JSON/YAML) declares the extension name and capabilities; one or more accompanying files contain the actual logic.
Example: Claude Code skills (Markdown frontmatter + body).
Trade-off: most expressive; supports paired helpers, auxiliary files, enable/disable, configurable extensions. Requires the most infrastructure.
My evaluation: the right answer once we want enable/disable, named extensions, auxiliary files, or configurable extensions. Heaviest but most expressive. The natural top of the ladder.
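A small sketch of what the manifest buys, in Python for illustration. The manifest keys (`name`, `entry`, `hooks`), the file names, and `load_extension` are all hypothetical; the issue does not define a manifest schema.

```python
import json
import tempfile
from pathlib import Path


def load_extension(directory, enabled):
    """Read the manifest first; run code only for enabled extensions."""
    manifest = json.loads((directory / "manifest.json").read_text())
    if manifest["name"] not in enabled:
        return None  # disabled: the host never executes the script
    namespace = {}
    source = (directory / manifest["entry"]).read_text()
    exec(compile(source, manifest["entry"], "exec"), namespace)
    hooks = {name: namespace[name] for name in manifest["hooks"]}
    return {"name": manifest["name"], "hooks": hooks, "root": directory}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # The extension is a directory: a manifest plus the code it points to.
    (root / "manifest.json").write_text(json.dumps({
        "name": "hide-detail",
        "entry": "main.py",
        "hooks": ["transform_corpus"],
    }))
    (root / "main.py").write_text(
        "def transform_corpus(corpus):\n"
        "    return [s for s in corpus if not s.startswith('detail::')]\n"
    )

    loaded = load_extension(root, enabled={"hide-detail"})
    skipped = load_extension(root, enabled=set())
    result = loaded["hooks"]["transform_corpus"](["foo", "detail::bar"])
```

Unlike registration calls, the host learns the extension's name and capabilities from data alone, so enable/disable can be decided before any script code runs, and `root` gives auxiliary files a well-defined home.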
The ladder
The options aren't mutually exclusive. They form a complexity ladder:
| Rung | Pattern | What you get |
| --- | --- | --- |
| 1 | Reserved name (Option 1) | Many fixed-name hooks per file, no ceremony, but the host owns the names |
| 2 | Registration calls (Option 2) | Script-chosen names: one extension adds several capabilities under names it picks (e.g., multiple named generators) |
| 3 | Manifest + code (Option 4) | Shared files: an extension is a directory bundling code, helpers, and assets |
| ... | ... | enable/disable, configuration schemas, ... |
PR #1196 ships rung 1. Higher rungs land as concrete use cases surface.
Future questions to settle here
These came up in the PR review. They are not blocking PR #1196 but should inform the ladder above.
- Paired helpers: should one extension file expose both a corpus transform and a Handlebars helper? Reserved names already allow this (pytest and Sphinx both put many different hooks in one file), so it does not force rung 2; it just needs a second reserved name.
- Auxiliary files: should an extension be a directory with assets/templates/config, not just a script? This forces rung 3.
- Enable/disable: how do users opt individual extensions in or out? Likely needs a config-side knob and probably an extension name (which forces a manifest).
- Registering generators: should one extension add several named output formats (e.g., a Markdown generator)? This is the real case that needs script-chosen names, so it's what would justify rung 2 (a name-bearing register_generator) or a manifest that lists them, not the reserved-name rung.
- Invariant safety: we all seem to agree that extensions should not break invariants; but some features require breaking them. As real use cases land, this tension will need a concrete resolution (tighter allowlist, opt-in unsafe mutations, post-hoc validation, etc.).
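As one illustration of the post-hoc validation idea from the last bullet, a host could run a transform on a copy and reject the result if a check fails. This is a Python sketch under invented assumptions: the corpus layout (a `symbols` map whose entries carry a `parent` id) and the single invariant checked here are hypothetical, not Mr. Docs' real data model.

```python
import copy


def check_invariants(corpus):
    """Return a list of violations; an empty list means consistent."""
    symbols = corpus["symbols"]
    problems = []
    for sym_id, sym in symbols.items():
        parent = sym.get("parent")
        if parent is not None and parent not in symbols:
            problems.append(f"{sym_id}: parent {parent} does not exist")
    return problems


def apply_checked(transform, corpus):
    """Run the transform on a copy and validate before accepting it."""
    candidate = copy.deepcopy(corpus)
    result = transform(candidate)
    if result is None:  # tolerate transforms that mutate in place
        result = candidate
    problems = check_invariants(result)
    if problems:
        raise ValueError("; ".join(problems))
    return result


corpus = {"symbols": {"ns": {"parent": None}, "ns::f": {"parent": "ns"}}}


def drop_namespace(c):
    del c["symbols"]["ns"]  # leaves ns::f with a dangling parent
    return c


try:
    apply_checked(drop_namespace, corpus)
    rejected = False
except ValueError:
    rejected = True
```

Because the transform only ever sees a deep copy, a rejected extension leaves the original corpus untouched; the trade-off is the cost of copying and re-validating on every call.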