| Key | Value |
|---|---|
| commissioned-by | spacedock@template |
| entity-type | experiment |
| entity-label | experiment |
| entity-label-plural | experiments |
| id-style | slug |
| state | $inline |
| stages | |

This repository is a public Spacedock workflow template for running repeatable scientific research loops. It helps a team turn broad research directions into falsifiable hypotheses, review each protocol, run a pilot before the full experiment, analyze evidence, and preserve lessons for future work.
Use this template when your project needs disciplined experiment tracking across ideas, protocol review, execution, analysis, and conclusion. It is intentionally domain-neutral: adapt the executor commands to your lab, simulator, benchmark, survey, notebook, or evaluation harness.
Reference this public README URL when invoking spacedock:commission:
https://raw.githubusercontent.com/spacedock-dev/research-workflow-template/main/README.md
Example prompt:

    Commission a new Spacedock workflow using this public workflow template:
    https://raw.githubusercontent.com/spacedock-dev/research-workflow-template/main/README.md
    Adapt it to my project. Keep the concept -> ideate -> expanded path for
    research directions and the hypothesis -> propose -> pilot -> full -> analyze ->
    conclude path for experiments. Preserve the one-independent-variable rule,
    proposal review, pilot gate, artifact-level attribution, and durable learning
    logs. My research area is: <brief description>. Put the generated workflow in:
    docs/research.

- One independent variable per hypothesis. Change one protocol element at a time: treatment, prompt, model, reagent, parameter, dataset slice, measurement method, or analysis rule.
- Fixed controls. Hold controls, sampling plan, runtime, instrument setup, inclusion criteria, randomization, and scoring method constant unless the hypothesis is specifically about one of them.
- Pilot before full. A focused pilot checks whether the intervention fires, whether safety and validity checks pass, and whether the full run is worth the cost.
- Clean audit before score. Do not trust a result until provenance, coverage, and exclusion checks are clean.
- Evidence over assertion. Credit an effect only when the intervention reached the committed artifact or measured system.
- Learning is an artifact. The experiment entity is the source of truth. Durable cross-experiment lessons belong in a learning log; workflow changes belong in a workflow-refinement log.
| Role | Responsibility |
|---|---|
| Captain | Owns research strategy and approves gates. |
| First officer | Runs the Spacedock workflow, dispatches workers, advances state, and owns waits. |
| Ensign | Performs scoped work: ideation, protocol authoring, pilot execution, analysis, and artifact reads. |
| Gatekeeper | Reviews proposed protocols before pilot execution. |
| Executor | Runs the project-specific experiment, simulation, benchmark, study, or analysis job. |
Two entity kinds share this workflow directory:
- Concept (`exp<NNNN>-<slug>.md`, `kind: concept`) is a research direction. It follows `concept -> ideate -> expanded`.
- Hypothesis (`exp<NNNN>-<slug>.md`, `kind: hypothesis`) is one testable protocol change. It follows:

      hypothesis -> propose -> pilot -> full -> analyze -> conclude
      |
      +-> hypothesis (revisable flaw)
      +-> conclude (cleanly falsified)

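
As a minimal sketch, the two happy paths can be encoded as a transition table that a helper script consults before advancing an entity. The names `TRANSITIONS` and `can_advance` are hypothetical and not part of Spacedock, and the side branches for a revisable flaw or a clean falsification are deliberately left out of this sketch.

```python
# Happy-path stage order for each entity kind, copied from the paths above.
CONCEPT_PATH = ["concept", "ideate", "expanded"]
HYPOTHESIS_PATH = ["hypothesis", "propose", "pilot", "full", "analyze", "conclude"]


def linear_transitions(path):
    """Map each stage to the set holding its single happy-path successor."""
    return {current: {nxt} for current, nxt in zip(path, path[1:])}


# Hypothetical lookup table: entity kind -> stage -> allowed next stages.
TRANSITIONS = {
    "concept": linear_transitions(CONCEPT_PATH),
    "hypothesis": linear_transitions(HYPOTHESIS_PATH),
}


def can_advance(kind, current, target):
    """Return True if `target` directly follows `current` for this kind."""
    return target in TRANSITIONS.get(kind, {}).get(current, set())
```

A caller might check `can_advance("hypothesis", "pilot", "full")` before updating `status`, and treat any other move as needing an explicit gate decision.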
This workflow uses id-style: slug, so the filename slug is the Spacedock
identity. The exp<NNNN> prefix is part of the slug, not a separate generated
frontmatter id.
- Concepts and hypotheses share one `exp<NNNN>` slug prefix space.
- Do not set a separate `id:` field in new entities; the slug is the id.
- Use folder form (`exp<NNNN>-<slug>/index.md`) only when evidence becomes too large for a single markdown file.

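
The identity rules above could be checked with a small helper like the one below. This is a sketch under stated assumptions: it treats `<NNNN>` as exactly four digits and `<slug>` as lowercase kebab-case, and it reads "one prefix space" as meaning an `exp<NNNN>` number should not be reused. Neither `entity_id` nor `duplicate_prefixes` is a Spacedock function.

```python
import re
from collections import Counter

# ASSUMPTION: four-digit number, lowercase kebab-case slug, file or folder form.
ENTITY_PATH = re.compile(
    r"^(exp(\d{4})-[a-z0-9]+(?:-[a-z0-9]+)*)(?:\.md|/index\.md)$"
)


def entity_id(path):
    """Return the slug identity for an entity path, or None if it does not match."""
    match = ENTITY_PATH.match(path)
    return match.group(1) if match else None


def duplicate_prefixes(paths):
    """Return the exp numbers claimed by more than one entity path."""
    counts = Counter()
    for path in paths:
        match = ENTITY_PATH.match(path)
        if match:
            counts[match.group(2)] += 1
    return sorted(number for number, count in counts.items() if count > 1)
```

Both the single-file form and the folder form resolve to the same identity, so moving an entity into a folder does not change its id.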
| Field | Type | Description |
|---|---|---|
| `title` | string | Human-readable title. |
| `status` | enum | `concept`, `ideate`, `expanded`, `hypothesis`, `propose`, `pilot`, `full`, `analyze`, `conclude`. |
| `kind` | enum | `concept` or `hypothesis`. |
| `source` | string | Where the entity came from. |
| `started` / `completed` | ISO 8601 | Start and terminal dates. |
| `verdict` | enum | `PASSED`, `REJECTED`, or `INCONCLUSIVE` at terminal state. |
| `score` | number | Optional priority from 0.0 to 1.0. |
| `worktree` | string | Optional working directory if the experiment needs one. |

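
The field table can be turned into a lightweight frontmatter check. The sketch below assumes the frontmatter has already been parsed into a plain dict by whatever YAML loader the project uses; `frontmatter_problems` is a hypothetical helper, and treating `title` as required is an assumption rather than something the table states.

```python
STATUSES = {
    "concept", "ideate", "expanded",
    "hypothesis", "propose", "pilot", "full", "analyze", "conclude",
}
KINDS = {"concept", "hypothesis"}
VERDICTS = {"PASSED", "REJECTED", "INCONCLUSIVE"}


def frontmatter_problems(fm):
    """Return a list of human-readable problems found in a parsed frontmatter dict."""
    problems = []
    if not fm.get("title"):
        problems.append("title is missing")  # ASSUMPTION: title is required
    if fm.get("status") not in STATUSES:
        problems.append("status is not a known stage")
    if fm.get("kind") not in KINDS:
        problems.append("kind must be concept or hypothesis")
    verdict = fm.get("verdict")
    if verdict is not None and verdict not in VERDICTS:
        problems.append("verdict must be PASSED, REJECTED, or INCONCLUSIVE")
    if fm.get("status") == "conclude" and verdict is None:
        problems.append("conclude requires a verdict")
    score = fm.get("score")
    if score is not None:
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append("score must be a number from 0.0 to 1.0")
    if "id" in fm:
        problems.append("do not set a separate id; the slug is the id")
    return problems
```

An empty list means the entity passes these checks; it does not mean the entity body is complete.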
At concept, a broad research direction is filed.
- Inputs: prior findings, literature gaps, failed experiments, reviewer questions, operator hunches, or a task-gap ranking.
- Outputs: a concept entity with `## Direction`, expected value, and known constraints.
- Good: concrete enough to generate falsifiable hypotheses.
- Bad: "improve results" without a suspected mechanism.
At ideate, an ensign reads the concept, prior learnings, current baseline protocol, and available evidence, then writes 2-5 hypothesis entities.
- Inputs: concept entity, prior conclusions, baseline method, constraints, and available executor surface.
- Outputs: hypotheses with one independent variable, named target outcomes, controls, acceptance criteria, and expected artifact signatures.
- Good: each hypothesis can be falsified by a pilot.
- Bad: one large hypothesis containing several unrelated interventions.
At expanded, the concept has produced candidate hypotheses and no longer needs active work.
- Inputs: concept entity and generated hypothesis list.
- Outputs: concept body updated with links to the generated hypotheses.
- Good: later readers can see how the direction branched.
- Bad: marking a concept expanded without creating or linking hypotheses.
At hypothesis, the entity is a queued, fully formed hypothesis.
Each hypothesis should include:
- `## Hypothesis` with the falsifiable claim and the single change.
- `## Independent variable` naming exactly what changes.
- `## Held constant` naming controls and invariants.
- `## Target outcomes` naming primary and secondary outcomes.
- `## Acceptance criteria` with pass/fail thresholds and audit requirements.
- `## Risk and validity notes` for leakage, confounds, safety, and cost.

At propose, the ensign authors the protocol package, then a gatekeeper reviews it. The captain makes the final call unless autonomous approval is explicitly enabled for a clean happy path.
- Inputs: hypothesis entity, baseline protocol, prior learnings, and domain constraints.
- Outputs: protocol diff, pilot plan, full-run plan, frozen or versioned execution artifacts, and a `## Gatekeeper review` block.
- Good: the protocol is mechanically executable and changes only the declared independent variable.
- Bad: hidden control changes, missing audit path, leakage, undeclared safety risk, or vague success criteria.
At pilot, a focused pilot checks whether the intervention is real, measurable, and worth a full run.
- Inputs: frozen or versioned pilot protocol.
- Outputs: pilot run directory, audit/provenance check, outcome delta versus baseline, artifact-level evidence, and a go/revise/reject recommendation.
- Good: the pilot exercises the changed behavior without damaging controls or canaries.
- Bad: advancing to full because the result "looks promising" without clean audit and attribution.
At full, run the full experiment using the same protocol that passed pilot, with only the declared sample-size or coverage expansion.
- Inputs: approved pilot protocol and full-run plan.
- Outputs: full run directory, raw results, audit/provenance report, and headline score or effect estimate.
- Good: pilot and full differ only in declared coverage or sample size.
- Bad: changing method, controls, or scoring between pilot and full.
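
One way to enforce the pilot-versus-full rule is to diff the two protocols and flag anything outside the declared expansion. The sketch below assumes each protocol can be represented as a flat dict; the key names `sample_size` and `coverage` and the helper `undeclared_changes` are hypothetical stand-ins for whatever the project's protocol format uses.

```python
# ASSUMPTION: these are the only keys allowed to differ between pilot and full.
DECLARED_EXPANSION_KEYS = frozenset({"sample_size", "coverage"})

_MISSING = object()  # distinguishes an absent key from an explicit None value


def undeclared_changes(pilot, full, allowed=DECLARED_EXPANSION_KEYS):
    """Return the sorted protocol keys that differ but are not declared expansions."""
    keys = set(pilot) | set(full)
    return sorted(
        key
        for key in keys
        if key not in allowed
        and pilot.get(key, _MISSING) != full.get(key, _MISSING)
    )


pilot = {"treatment": "B", "scoring": "exact-match", "sample_size": 20}
full = {"treatment": "B", "scoring": "exact-match", "sample_size": 500}
print(undeclared_changes(pilot, full))  # an empty list means only coverage changed
```

A non-empty result names the fields that changed between pilot and full and would need to be reverted or re-proposed.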
At analyze, interpret the full experiment quantitatively and mechanistically.
- Inputs: full run artifacts, audit report, baseline comparison, target outcomes, controls, and canaries.
- Outputs: analysis answering result, attribution, moved controls, remaining confounds, and recommended next step.
- Good: conclusions distinguish confirmed effect, noise, underpowered signal, confound, and infrastructure failure.
- Bad: treating a score delta as truth without checking mechanism and audit.
At conclude, write the verdict and archive or promote.
- Inputs: analysis, acceptance criteria, audit result, and follow-up routing.
- Outputs: final verdict, evidence summary, caveats, mechanism, next action, and one-line durable learning.
- Good: a future researcher can tell why the hypothesis was accepted, rejected, revised, or marked inconclusive.
- Bad: terminal state without a verdict, evidence pointer, or follow-up.
At propose, review the protocol before pilot execution. Each rule receives
PASS, WARN, or FAIL; unevaluable rules are FAIL with evidence naming
what was missing.
| Rule | Check |
|---|---|
| G1 single independent variable | The proposed protocol changes exactly the variable named in the hypothesis. |
| G2 provenance and leakage guard | The protocol avoids holdout answers, hidden labels, future observations, and unauthorized references. |
| G3 controls held constant | Baseline, control group, sample definition, scoring, environment, randomization, and runtime remain fixed. |
| G4 focused pilot | The pilot includes target cases, stable controls, and canaries for broad changes. |
| G5 reproducibility | Execution artifacts are frozen, versioned, or immutable enough to rerun. |
| G6 measurement validity | Outcomes measure the claim and are not self-anchored. |
| G7 actionability | The executor can mechanically run the proposed change and observe whether it fired. |
| G8 safety, ethics, and cost | Risks are declared and bounded. |
| G9 analysis before results | Acceptance criteria, exclusions, and statistical tests are written before execution. |
| G10 follow-up routing | The proposal names how outcomes route: advance, revise, reject, or escalate. |
Recommendation:
- `APPROVE` when no rules fail.
- `REVISE` when failures are mechanical and fixable without changing the hypothesis.
- `REJECT` when a failure compromises integrity, leakage guard, controls, safety, or the declared scientific claim.

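
The recommendation rule can be sketched as a small function over per-rule results. This is illustrative only: `RuleResult`, its `fixable` flag, and `recommend` are hypothetical names, and deciding whether a failure is mechanical and fixable remains the gatekeeper's judgment, which the sketch simply records as a boolean.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    rule: str             # rule id, for example "G1"
    outcome: str          # "PASS", "WARN", or "FAIL"
    evidence: str = ""    # what the reviewer saw, or what was missing
    fixable: bool = True  # ASSUMPTION: reviewer marks mechanical, fixable failures


def recommend(results):
    """Map gate rule results to APPROVE, REVISE, or REJECT."""
    failures = [r for r in results if r.outcome == "FAIL"]
    if not failures:
        return "APPROVE"  # warnings alone do not block approval
    if all(r.fixable for r in failures):
        return "REVISE"   # every failure is mechanical and fixable
    return "REJECT"       # at least one failure compromises the proposal
```

For example, a review with one `WARN` and no `FAIL` yields `APPROVE`, while a single non-fixable `FAIL` yields `REJECT` regardless of the other rules.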
Spacedock manages workflow state; the project supplies the command that runs the experiment. Both tiers call one command shape:

    ./scripts/run-experiment <hypothesis-id> --tier pilot --out runs/<hypothesis-id>/pilot
    ./scripts/run-experiment <hypothesis-id> --tier full --out runs/<hypothesis-id>/full

Each run writes `meta.json`, `protocol.md`, `results.json`, `audit.json`, `logs/`, and `artifacts/` under `--out`. Long-running runs that outlive an agent turn use a detached launcher and a done sentinel.
See EXECUTOR.md for the full contract: required output schemas,
the pilot-vs-full rule, the detached-run handle directory, and example
foreground and detached wrappers. EXECUTOR.md is the source of truth.
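
As a rough illustration of the command shape and output layout described above, a project could build the argument list and verify a finished run directory as follows. The helpers `run_command` and `missing_outputs` are hypothetical; they check only that the named files and directories exist, not the schemas, which EXECUTOR.md defines.

```python
from pathlib import Path

REQUIRED_FILES = ("meta.json", "protocol.md", "results.json", "audit.json")
REQUIRED_DIRS = ("logs", "artifacts")


def run_command(hypothesis_id, tier):
    """Build the argv for one tier, following the command shape in this README."""
    if tier not in ("pilot", "full"):
        raise ValueError(f"unknown tier: {tier!r}")
    out_dir = f"runs/{hypothesis_id}/{tier}"
    return ["./scripts/run-experiment", hypothesis_id, "--tier", tier, "--out", out_dir]


def missing_outputs(out_dir):
    """Return the expected files and directories that are absent under out_dir."""
    out = Path(out_dir)
    missing = [name for name in REQUIRED_FILES if not (out / name).is_file()]
    missing += [name + "/" for name in REQUIRED_DIRS if not (out / name).is_dir()]
    return missing
```

The argument list can be passed to `subprocess.run` for a foreground run; an empty result from `missing_outputs` is a necessary condition for a clean audit, not a sufficient one.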
Concept:

    ---
    title: <research direction>
    status: concept
    kind: concept
    source:
    started:
    completed:
    verdict:
    score:
    worktree:
    ---
    ## Direction
    <theme, rationale, constraints, and why this direction may improve the target outcome>

Hypothesis:

    ---
    title: <one-line hypothesis>
    status: hypothesis
    kind: hypothesis
    source:
    started:
    completed:
    verdict:
    score:
    worktree:
    ---
    ## Hypothesis
    <falsifiable claim>
    ## Independent variable
    <the one thing that changes>
    ## Held constant
    <controls, runtime, sampling, scoring, inclusion criteria, environment>
    ## Target outcomes
    <primary, secondary, controls/canaries>
    ## Acceptance criteria
    <thresholds, audit requirements, attribution requirements>
    ## Gatekeeper review
    ## Pilot result
    ## Run result
    ## Analysis
    ## Failure Review
    ## Follow-up Routing
    ## Verdict

Use a self-learning log for portable scientific lessons:

    # Self-Learning Log
    ## Concluded Experiments
    - **exp<NNNN> - PASSED/REJECTED/INCONCLUSIVE.** One-line lesson with the mechanism,
      caveat, and evidence pointer.

Use a workflow-refinement log for changes to this workflow's structure:

    # Workflow-Refinement Log
    ## exp<NNNN> - <title>
    - layer: <which workflow layer changed>
    - refinement type: new-stage | reorder | replace | new-protocol | gate-rule | other
    - finding: <what happened across the pilot/full run>
    - learning: <transferable workflow lesson>
    - bears-on: <related experiment ids>
    - evidence: <entity section / run dir / artifact pointer>
    - status: open | adopted-into-workflow | rejected-as-written

- Keep this README free of private paths, private benchmark names, and machine-specific commands.
- Prefer stable branch or versioned tag URLs when sharing the template.
- If the template changes incompatibly, publish a new versioned URL instead of silently changing old behavior.