jankneumann · Jun 5, 2026 · May 26, 2026
diff --git a/openspec/changes/seed-sentinel-security-eval/constitution.md b/openspec/changes/seed-sentinel-security-eval/constitution.md
@@ -0,0 +1,94 @@
+# Sentinel Constitution
+
+> Adapted from the Cisco [foundry-security-spec](https://github.com/CiscoDevNet/foundry-security-spec)
+> constitution v0.2.0. These principles are **inviolable** — they encode production
+> failures and their fixes. Any requirement, design, or implementation in the
+> `sentinel-security-eval` capability that contradicts a principle is wrong, except
+> where the **Deviations** section below records an explicit, mitigated exception.
+
+## Principles
+
+### I. Evidence Over Assertion
+Findings are backed by mechanically verifiable evidence with resolving code citations, never by model confidence alone. A verdict without a satisfied evidence gate is not a finding.
+
+### II. Surface Only What Survives
+Only findings that pass triage gates reach operators. Unvetted detections stay in internal storage; the human reviewer's queue is a privilege earned by surviving the gate.
+
+### III. Liveness By Heartbeat, Never By Clock
+Agent health is determined by recent heartbeat activity, not wall-clock runtime. Absence of heartbeat, not elapsed time, signals a dead agent and triggers claim release. Wall-clock time may rotate sessions, but it never reclaims work.
+
+### IV. Claims Are Atomic And Mortal
+Concurrent agents receive different work units (atomic claiming). A dead agent's claims release automatically within a bounded heartbeat-stale window, with no operator intervention.
+
+### V. The Provider Is The Rate Arbiter
+The system adapts to upstream provider backpressure (HTTP 429, quota errors) rather than enforcing static internal rate caps. Backoff is shared across all agents calling a provider. *(See Deviation D-1 regarding multi-provider operation.)*
+
+### VI. Coverage Before Yield
+Auto-stop requires **both** coverage-complete **and** yield-below-threshold. Low yield before coverage is complete continues the run; coverage-complete with nonzero yield resets the yield timer.
+
+### VII. Exploited Means Demonstrated
+The `exploited` flag requires independent, clean-room reproduction of headline impact on a live testbed. Agent self-verification, argument, or inference never qualifies.
+
+### VIII. Fingerprints Are Stable Under Edit
+Finding identity derives from code structure — `(normalized_path, symbol, vulnerability_class)` — not from text position (line number, snippet hash). The same finding survives edits to the surrounding file.
+
+### IX. Sandbox By Infrastructure, Not By Prompt
+The runtime environment (container, gateway, firewall, security groups) enforces network and filesystem boundaries. Prompt-level rules are defense-in-depth only, never the boundary itself.
+
+### X. The Operator Outranks Every Agent
+Operator configuration is authoritative. Agent consensus, peer messages, and self-suggestions are hints. The operator's hard-rules and config always win.
+
+### XI. Persist Atomically
+Artifacts read by multiple components update by complete write-then-atomic-replace, never delete-then-write. No reader ever observes partial state.
+
+## Deviations
+
+This adoption of Sentinel records the following **explicit, mitigated** exceptions to the
+constitution. They exist because this repository's platform is multi-vendor by design.
+A deviation is legitimate only while its mitigation holds: either remove the deviation or
+strengthen the mitigation, but never let it lapse silently.
+
+### D-1 — Multi-vendor LLM routing (exception to Principle V and foundry §11.2)
+
+**The invariant being relaxed.** Foundry assumes a single LLM provider so that a finding's
+verdict is *reproducible*: re-run the triage, get the same answer. Principle V also frames
+"the provider" (singular) as the rate arbiter.
+
+**What Sentinel does instead.** Sentinel reuses this repository's existing **multi-vendor
+routing** (Claude, Codex, and other configured vendors). Verdicts may therefore be produced
+by different models across runs, which weakens bit-for-bit reproducibility.
+
+**Why.** The platform's core value is vendor diversity and cross-checking. Rather than treat
+multi-vendor operation as a reproducibility *liability* to tolerate, Sentinel treats it as a
+*consensus mechanism* — the same way this repository already synthesizes vendor-diverse code
+reviews (`parallel-infrastructure`'s `ConsensusSynthesizer`). A verdict corroborated across
+calibrated vendors is more stable and more defensible than a single provider's verdict.
+
+The governing rule that makes this sound: **never place raw outputs from different vendors on
+one shared scale.** Inconsistency comes from cross-vendor scale-mixing, not from multi-vendor
+operation itself. Each vendor must be internally consistent; results are calibrated first and
+only then combined through synthesis.
+
+**Mitigations (binding):**
+1. **Verdict-provenance** — every verdict records the vendor, model, and rule/corpus version
+   that produced it (see `sentinel-security-eval` requirement "Verdict Provenance"). A verdict
+   without provenance is invalid.
+2. **Within-vendor consistency** — a given verdict and its severity are produced by one vendor
+   applying the rubric uniformly, so each vendor's scale is self-consistent. Raw outputs from
+   different vendors are never compared or merged on a shared scale before calibration.
+3. **Cross-vendor calibration** — before results from different vendors are combined, their
+   scales are calibrated to a common reference so that, e.g., one vendor's CVSS band maps to
+   another's. Calibration is owned configuration, not per-run model whim.
+4. **Principled synthesis** — per-vendor verdicts are integrated via the consensus model
+   (`confirmed` / `unconfirmed` / `disagreement`, with per-vendor dispositions recorded),
+   reusing the same `ConsensusSynthesizer` substrate as code review. The synthesized consensus
+   verdict — not a lone vendor's — is what reaches the Reporter (see `sentinel-security-eval`
+   requirement "Multi-Vendor Verdict Consensus and Calibration").
+5. **Shared, per-provider backoff** — Principle V's rate-arbiter behavior is preserved
+   *per provider*: backoff state is shared across all agents calling the same provider, so
+   the multi-vendor fan-out does not rediscover each provider's limit N times.
+
+**Residual risk (accepted).** Stability now rests on calibration quality rather than on a
+single provider. Mis-calibration between vendors is the residual risk; it is mitigated by
+treating calibration as owned, versioned configuration and by surfacing cross-vendor
+`disagreement` (rather than silently averaging it) for human attention.
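Reviewer note on `constitution.md`, Principle XI (Persist Atomically): the write-then-atomic-replace rule can be illustrated with a minimal Python sketch. It is not code from this change. The function name `write_json_atomically` and the JSON artifact format are assumptions made for illustration; only the standard-library calls (`tempfile.mkstemp`, `os.fsync`, `os.replace`) are real.

```python
import json
import os
import tempfile


def write_json_atomically(path: str, payload: dict) -> None:
    """Hypothetical helper: readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created next to the target so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the swap
        os.replace(tmp_path, path)  # rename over the previous artifact
    except BaseException:
        # On any failure, drop the temp file; the previous artifact is untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

The point of the design is that the target path is never deleted or truncated: a failed write leaves the last good artifact in place, which is the property that delete-then-write cannot offer.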
diff --git a/openspec/changes/seed-sentinel-security-eval/contracts/README.md b/openspec/changes/seed-sentinel-security-eval/contracts/README.md
@@ -0,0 +1,15 @@
+# Contracts — Seed Sentinel Security-Evaluation Capability
+
+This is a **seed-only spec change**: it authors governance + specification artifacts and
+introduces no runnable interfaces. The contract sub-types were evaluated as follows.
+
+| Sub-type | Applicable? | Why |
+|---|---|---|
+| OpenAPI | No | The seed adds no API endpoints. Sentinel reuses `agent-coordinator`'s existing MCP/HTTP surfaces; concrete eval endpoints are authored in roadmap implementation changes. |
+| Database | No | No schema in the seed. The finding store / coverage checklist schemas are authored when their roles are implemented (roadmap). |
+| Events | No | No new events in the seed. |
+| Type stubs | No | Nothing to generate without OpenAPI/DB contracts. |
+
+**No contracts applicable for this change.** When `/plan-roadmap` decomposes the seed,
+each role-implementation change introduces its own contracts (e.g., a finding-store DB
+contract for the Detector/Triager, OpenAPI for the Reporter's publish surface).
diff --git a/openspec/changes/seed-sentinel-security-eval/design.md b/openspec/changes/seed-sentinel-security-eval/design.md
@@ -0,0 +1,64 @@
+# Design — Sentinel Security-Evaluation Seed
+
+## Context
+
+This change seeds Cisco's foundry-security-spec into the repo as the `sentinel-security-eval` capability. The defining design decision (Gate-1 Approach A + clarification record) is that Sentinel's roles **map onto the coordinator/worker/validator infrastructure that already exists**, rather than standing up a parallel security-eval stack. This document records that mapping, the boundary between the seed and the roadmap, and the analysis behind the one accepted deviation.
+
+## D1 — Role → existing-infrastructure binding
+
+Each foundry role binds to an existing capability or primitive. The seed spec adds Sentinel-specific *behavior*; the *substrate* is reused.
+
+| Foundry role | Binds onto | Nature of binding |
+|---|---|---|
+| **Orchestrator** | `agent-coordinator` (the long-running coordinator service) + `Agent Orchestration` requirement | Sentinel's two lanes (lifecycle/conversational) are a specialization of existing orchestration; no new orchestrator process. |
+| **Indexer** | `codebase-analysis` capability + `docs/architecture-analysis/` artifacts | Structural index reuses existing codebase-analysis machinery; query interface is the new surface. |
+| **Cartographer** | `codebase-analysis` (architecture summary) extended with security-context documents | Net-new security documents; reuses the analysis substrate. |
+| **Detector** | `evaluation-framework` / `gen-eval-framework` (generation/evaluation passes) | Detection rules + exploratory hunting are new; the eval-pass execution model is reused. |
+| **Triager** | `evaluation-framework` + `agent-coordinator` Verification Gateway/Policies | Verdict assignment + evidence gate are new; verification-tier plumbing is reused. |
+| **Validator** | `live-service-testing` capability | Clean-room reproduction against a testbed maps directly onto live-service testing. |
+| **Coverage-Guide** | `roadmap-orchestration` (goal→checklist decomposition) + `observability` | Checklist derivation reuses decomposition patterns; coverage status feeds observability. |
+| **Reporter** | `observability` + `merge-pull-requests`/GitHub issue tooling | Per-finding reports + rollup reuse observability and GitHub publishing surfaces. |
+| **Coordination substrate** | `agent-coordinator` **Work Queue** + **Heartbeat and Dead Agent Detection** | Atomic claiming, `open/blocked/closed` states, heartbeat liveness, auto-block — all already specified; Sentinel depends on them (no `MODIFIED` delta in the seed). |
+| **Multi-vendor verdict consensus** | `parallel-infrastructure` **`ConsensusSynthesizer`** (`consensus_synthesizer.py`) | The same `confirmed`/`unconfirmed`/`disagreement` synthesis the repo uses for vendor-diverse code review is reused to synthesize per-vendor security verdicts (see D3). |
+
+**Why dependency, not `MODIFIED`:** the seed introduces no change to coordinator behavior — it consumes the queue and heartbeat as-is. Authoring `MODIFIED` deltas now would mean inventing extensions (e.g., eval-role claim tags) before they're concrete. Those belong to roadmap implementation changes, which will copy the exact existing requirement text and modify it.
+
+## D2 — Seed ↔ roadmap boundary
+
+| In the seed (this change) | Deferred to `/plan-roadmap` |
+|---|---|
+| Vendored `constitution.md` + Deviations | Wiring each role into runnable agents |
+| `sentinel-security-eval` capability spec (roles, lifecycle, governance, policy) | Indexer parser, Detector rule corpus, Triager investigation loop |
+| Role→infra binding table (this doc) | Testbed provisioning + Validator reproduction harness |
+| Verdict-provenance requirement | Dashboard/feed implementation (FR-120–FR-124) |
+| Recording the 5 extensions as candidates | Adopting any extension role |
+
+The seed is **spec + governance only**. `openspec validate --strict` is the acceptance test; no role logic ships here.
+
+## D3 — Deviation analysis (multi-vendor consensus vs. single-provider)
+
+Foundry §11.2 and Constitution Principle V assume a single LLM provider so that verdicts are reproducible. This repo is multi-vendor by design. Rather than tolerate multi-vendor operation as a reproducibility liability, Sentinel treats it as a **consensus mechanism**, the same way the repo already synthesizes vendor-diverse code reviews. The governing rule: inconsistency comes from *mixing raw cross-vendor outputs on one scale*, not from multi-vendor operation per se. So we make each vendor internally consistent, calibrate across vendors, then synthesize.
+
+- **The pipeline:** *within-vendor consistency* (each vendor applies the rubric uniformly) → *cross-vendor calibration* (owned, versioned config maps vendor scales to a common reference) → *principled synthesis* (`confirmed`/`unconfirmed`/`disagreement` via `parallel-infrastructure`'s `ConsensusSynthesizer`, per-vendor dispositions recorded). The synthesized consensus verdict — not a lone vendor's — is what the Reporter publishes.
+- **What we gain:** a consensus verdict corroborated across calibrated vendors is *more* stable and defensible than a single provider's verdict; cross-vendor `disagreement` becomes a first-class signal surfaced to humans (e.g., `needs-review`).
+- **What we keep:** the rate-arbiter behavior of Principle V, preserved *per provider* (shared per-provider backoff).
+- **Provenance:** the **Verdict Provenance** requirement records vendor, model, and corpus version plus, for synthesized verdicts, each participating vendor's disposition and the consensus status. Re-run comparison is consensus-aware, so a change in the participating-vendor set is not mistaken for a target regression (relevant to foundry SC-005 dedup).
+- **Residual risk (accepted):** stability now rests on *calibration quality* rather than on a single provider. Mis-calibration between vendors is the residual risk, mitigated by treating calibration as owned, versioned configuration and by surfacing (not averaging) disagreement. When no calibration exists for a vendor pair, Sentinel falls back to single-vendor verdicts with provenance rather than fabricating a cross-vendor consensus.
+
+## D4 — Deferred extensions (adopt-when preconditions)
+
+Recorded so the roadmap can pick them up with the right gating:
+
+| Extension | Adopt when | Do not adopt when |
+|---|---|---|
+| Deep-Tester | A stable testbed exists and findings need PoC binaries | No testbed |
+| Variant-Hunter | A vector store, semantic embeddings, and a true-positive corpus all exist | Any of those is missing (true for the seed — no vector store) |
+| Attack-Mapper | Reviewers ask about chaining and evaluations are more than two quarters old | First build |
+| Remediator | A code-review process for AI changes and merge gating are mature | Reporter output is not yet trusted |
+| Self-Improver | The rule corpus has measured gaps with examples | Day one |
+
+## D5 — Risks
+
+- **Spec size:** one large capability spec (~21 requirements). Mitigated by the roadmap decomposing it into per-role implementation changes.
+- **Binding drift:** the design table is documentation, not enforced by spec structure. Mitigated by roadmap changes authoring concrete `MODIFIED` deltas when they extend coordinator behavior.
+- **Deviation creep:** the multi-vendor exception could erode reproducibility further. Mitigated by the binding verdict-provenance requirement and the residual-risk note.
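Reviewer note on `design.md`, D3: the consistency, calibration, and synthesis pipeline can be sketched roughly as below. This is an illustration under stated assumptions and does not mirror the real `ConsensusSynthesizer` API. The `CALIBRATION` table, the vendor names, the `VendorVerdict` fields, and the exact rules for choosing `confirmed`, `unconfirmed`, or `disagreement` are all hypothetical; only the three status names and the single-vendor fallback come from the design text.

```python
from dataclasses import dataclass

# Hypothetical calibration config: (vendor, raw severity label) -> common scale.
# The design treats this as owned, versioned configuration; entries are made up.
CALIBRATION = {
    ("vendor_a", "critical"): "high",
    ("vendor_a", "major"): "medium",
    ("vendor_b", "sev1"): "high",
    ("vendor_b", "sev2"): "medium",
}


@dataclass(frozen=True)
class VendorVerdict:
    vendor: str
    model: str
    is_vulnerable: bool
    raw_severity: str


def synthesize(verdicts: list[VendorVerdict]) -> dict:
    """Calibrate each vendor's verdict, then classify the set (assumed rules)."""
    dispositions = []
    for v in verdicts:
        # Raw labels are mapped to the common scale before any comparison.
        severity = CALIBRATION.get((v.vendor, v.raw_severity))
        dispositions.append({
            "vendor": v.vendor,
            "model": v.model,
            "is_vulnerable": v.is_vulnerable,
            "severity": severity,
            "calibrated": severity is not None,
        })
    # Uncalibrated verdicts are recorded for provenance but never compared.
    calibrated = [d for d in dispositions if d["calibrated"]]
    if len(calibrated) < 2:
        status = "unconfirmed"  # assumed single-vendor fallback
    elif len({(d["is_vulnerable"], d["severity"]) for d in calibrated}) == 1:
        status = "confirmed"
    else:
        status = "disagreement"  # surfaced for human review, not averaged
    return {"status": status, "dispositions": dispositions}
```

In this sketch a verdict from a vendor with no calibration entry still appears in `dispositions`, so provenance is kept, but it cannot contribute to a `confirmed` status.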
diff --git a/openspec/changes/seed-sentinel-security-eval/handoffs/plan-1.json b/openspec/changes/seed-sentinel-security-eval/handoffs/plan-1.json
@@ -0,0 +1,64 @@
+{
+  "schema_version": 1,
+  "written_at": "2026-05-26T11:48:57.057275+00:00",
+  "coordinator_error": {
+    "error_type": "rpc_failed",
+    "message": "HTTP Error 422: Unprocessable Entity",
+    "http_status": null
+  },
+  "payload": {
+    "agent_name": "claude_code",
+    "session_id": null,
+    "summary": "Seeded Cisco foundry-security-spec as the Sentinel security-evaluation capability via OpenSpec (not spec-kit). All ~35 foundry clarification markers were resolved interactively before authoring; the seed vendors an adapted constitution and hardens the spec into WHEN/THEN deltas, deferring role implementation to /plan-roadmap.",
+    "completed_work": [],
+    "in_progress": [],
+    "decisions": [
+      {
+        "title": "Name the system 'Sentinel'",
+        "rationale": "Operator-owned name per foundry \u00a70.",
+        "capability": "sentinel-security-eval",
+        "supersedes": null
+      },
+      {
+        "title": "Map 8 roles onto existing coordinator",
+        "rationale": "Reuse agent-coordinator/worker/validator infra rather than a parallel stack.",
+        "capability": "sentinel-security-eval",
+        "supersedes": null
+      },
+      {
+        "title": "Approach A: single cohesive capability",
+        "rationale": "Cohesive, reviewable, archivable; bindings in design.md; no MODIFIED deltas in seed.",
+        "capability": "sentinel-security-eval",
+        "supersedes": null
+      },
+      {
+        "title": "Accept multi-vendor deviation D-1",
+        "rationale": "Reuse repo multi-vendor routing; mitigate with verdict-provenance requirement.",
+        "capability": "sentinel-security-eval",
+        "supersedes": null
+      }
+    ],
+    "next_steps": [
+      "After Gate 2 approval: wire constitution reference into openspec/project.md (task 2.1).",
+      "Run /plan-roadmap against the seed to decompose role implementation."
+    ],
+    "relevant_files": [
+      {
+        "path": "openspec/changes/seed-sentinel-security-eval/proposal.md",
+        "description": "approved proposal"
+      },
+      {
+        "path": "openspec/changes/seed-sentinel-security-eval/constitution.md",
+        "description": "vendored constitution + Deviation D-1"
+      },
+      {
+        "path": "openspec/changes/seed-sentinel-security-eval/specs/sentinel-security-eval/spec.md",
+        "description": "hardened seed spec"
+      },
+      {
+        "path": "openspec/changes/seed-sentinel-security-eval/design.md",
+        "description": "role->infra binding table"
+      }
+    ]
+  }
+}