feat(gen-eval): extract gen-eval framework into reusable packages/gen-eval#194
feat(gen-eval): extract gen-eval framework into reusable packages/gen-eval#194jankneumann wants to merge 24 commits into
Conversation
…acts, work-packages Extract gen-eval (~6.5 KLOC framework) from agent-coordinator/evaluation/gen_eval/ into packages/gen-eval/ as a standalone pip-installable package. Establishes the packages/ convention for future shared-library extractions (Hoverfly simulation, prompt/harness optimization, vendor convention packages). Decisions: - packages/gen-eval/ with src/ layout, uv_build backend, four optional extras - Framework-only data split; descriptors stay in consumers - Optional [mcp] extra serves both MCP-routing (coordinator) and pure-Python (agentic-assistant) consumer profiles - Full cutover in one atomic change; metrics.py reverse coupling resolved by moving into the package - agentic-assistant adoption proven via examples/agentic-assistant-quickstart.md rather than a cross-repo PR 7-package DAG, coordinated tier, 6 valid parallel pairs (validated against parallel-zones). openspec validate --strict passes; work-packages schema + DAG + lock-keys pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n B, scope fixes Iterate-on-plan: extract-gen-eval-package, iteration 1/3 Findings addressed (12): - [critical] feasibility F1: metrics.py wholesale move was wrong-grained; refactored to surgical extraction of GenEvalMetrics only (D3, task 2.4, task 2.4.1, spec MODIFIED requirement, wp-framework-move verification). The other 10 classes + compute_effect_size stay in coordinator. - [high] completeness F2: coordinator-test import sites missing from plan; expanded task 3.1 (fixed path: tests/test_evaluation/test_gen_eval/, 29 files), task 4.3 (now covers test_check_docker_imports.py + CLAUDE.md), wp-package-tests and wp-coordinator-migrate scopes. - [high] completeness F3: skill-script import sites missing from plan; expanded D9, task 5.3, added task 5.3.1, wp-skills-update scope now covers playwright-validator/scripts/cli.py + findings.py + gen-eval- scenario/SKILL.md line 172. - [high] feasibility F4: Docker build-context hand-waved and would have broken Railway deploy; D8 now picks Option B (wheel install) with full implementation: COPY dist/gen_eval-*.whl + UV_FIND_LINKS in Dockerfile, Makefile build-image target, .gitignore dist/, README. Railway untouched. - [medium] parallelizability F5: .github/workflows/ci.yml scope-collision resolved by Option B choice (only wp-package-tests writes ci.yml). - [medium] clarity F6: task 3.1 source path corrected. - [medium] completeness F7: D7 hedge replaced with concrete sample-frontend fixture paths. - [medium] testability F8: added gen_eval.metrics surface test (task 2.4.1) and verification step on wp-framework-move. - [medium] clarity F9: wp-framework-move locks/scope semantics clarified for the partial-file edit of evaluation/metrics.py. - [low] consistency F10: spec MODIFIED requirement language fixed. - [low] clarity F11: parallelizability summary added to tasks.md. - [low] assumptions F12: editable-vs-Docker interaction documented in D5. Validation: openspec validate extract-gen-eval-package --strict passes. Co-Authored-By: Claude <noreply@anthropic.com>
PLAN_ITERATE sub-agent caught a wrong-grained D3 (the original plan moved all of evaluation/metrics.py into the package — actually contains 11 classes, 10 of which are coordinator-domain). D3 now specifies surgical extraction of GenEvalMetrics only; added task 2.4.1 surface-test guard; corrected test relocation path in task 3.1 from tests/evaluation/gen_eval/ to actual tests/test_evaluation/test_gen_eval/. Work-packages.yaml updated with new verification steps for the metrics surface and coordinator-side metrics resolution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PLAN_REVIEW round 1: dispatched parallel-review-plan to claude+codex+gemini. 3/3 vendors completed; 5 unique blocking findings surfaced across vendors. Applied inline PLAN_FIX: - wp-coordinator-migrate: removed contradictory deny on agent-coordinator/evaluation/gen_eval/** (Gemini-1: blocked Task 4.4 data relocation by scope). - D5+D8: documented the [tool.uv.sources] vs UV_FIND_LINKS conflict inside the Docker build context; mandated --no-sources on the Dockerfile's uv sync (Codex-1: Docker build would fail without it). - D8: Railway prebuild story is now concrete — railway.toml gains a [build] buildCommand that runs `uv build` for gen-eval; wp-coordinator-migrate scope.write_allow + locks.files include railway.toml (Codex-2). - Task 7.4 + wp-integration.docker-smoke: fixed Docker context from repo-root (`.`) to agent-coordinator/, matching D8 Option B and Railway's service root. Also makes the smoke step build the wheel first via `make build-image` (Codex-3 / Claude-4). - Task 5.2 + D9: validate-feature/SKILL.md descriptor-discovery glob updated from */evaluation/gen_eval/descriptors/*.yaml to */evaluation/descriptors/*.yaml to follow D7 relocation; old glob would silently skip gen-eval coverage (Codex-4). - Task 5.C: grep verification now covers BOTH dotted (evaluation.gen_eval) and slash (evaluation/gen_eval) forms (Codex-4 follow-on). openspec validate --strict still passes. Outcome: not_converged — fixes applied; a round-2 convergence run would verify no new blocking findings emerged. Persisting raw vendor findings, consensus, and updated plan artifacts for round-2 / postmortem.
PLAN_REVIEW round 2: re-dispatched to codex-local and gemini-local after the round-1 PLAN_FIX (commit 7540ffb). 2/2 vendors completed; 4 new blocking findings. Key round-2 finding: - Gemini-1 (Critical): Railway's [build] buildCommand only fires for the Nixpacks builder, NOT the Dockerfile builder. The round-1 fix that added buildCommand = "cd ../packages/gen-eval && uv build ..." to railway.toml is INVALID — Railway will silently ignore it under the existing Dockerfile-based build. The Railway prebuild story needs a structural re-think (service-root reconfig, vendoring, or alternative deploy path), not another inline patch. Other round-2 blockers (Codex): - wp-coordinator-migrate description still phrases railway.toml as "unchanged" in places while D8 task 4.5 now requires editing it. - CI docker-smoke job lacks the working-directory + uv setup the prebuild wheel step needs. - CI smoke-import list still imports `evaluation.gen_eval.mcp_service` — needs to update in lockstep. - wp-skills-update verification grep should mirror task 5.C's dotted+slash coverage. Decision: stopping after round 2. Returning outcome=not_converged. The Railway issue exceeds the scope of inline PLAN_FIX and warrants either a PLAN_ITERATE iteration or a human-gated design review on the deploy strategy. Looping to round 3 with another speculative fix would re-introduce the same class of error. Handoff persisted at openspec/changes/extract-gen-eval-package/handoffs/plan_review-1.json.
Multi-vendor PLAN_REVIEW ran 2 rounds. Round 1: 3 vendors, 5 blocking findings, PLAN_FIX (7540ffb) applied inline fixes. Round 2: 2 vendors, caught that round-1's Railway buildCommand fix is invalid — Railway's [build] buildCommand only applies to the Nixpacks builder, not Dockerfile builds. The wheel-into-Docker-context story needs structural redesign; returning not_converged rather than applying another speculative patch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… advisories
Resolves the round-2 not_converged blocker. Strategy A reconfigures Railway's
service root to the repo root via dashboard (one-time operator change),
eliminating Option B's wheel-prebuild machinery (Makefile target, dist/
.gitignore, railway.toml [build] buildCommand, --no-sources, UV_FIND_LINKS) —
all of which were either broken under the Dockerfile builder or unnecessary
once the package is in-context.
Design changes
- D8 rewritten: Strategy A (repo-root build context) with operator-tracked
Railway dashboard change as the explicit trade-off. The Dockerfile builder's
silent drop of [build] buildCommand makes Option B too dangerous.
- D2: widen uv_build pin from `>=0.9,<0.10` to `>=0.9,<1.0` to absorb pre-1.0
minor releases without monthly maintenance commits.
Spec changes
- Split "documented consumer adoption contract" into two scenarios:
(a) automated quickstart shape checks (doc exists, references uv add, links
resolve, descriptor-template parses)
(b) manual end-to-end release-gate walkthrough
Task changes
- 2.4.1: tighten surface test to explicit allowlist with self-diagnosing message
- 4.1: pin new test file to test_gen_eval_extraction.py (scope hygiene)
- 4.3: add explicit __all__ removal in evaluation/__init__.py
- 4.5: full rewrite for Strategy A (Dockerfile COPY prefixes, docker-compose
context update, railway.toml dashboard documentation, README deployment
section); no Makefile, no .gitignore dist/, no --no-sources
- 5.5: new — repo-wide stale-reference sweep for CLAUDE.md/README.md/docs/.github/
- 7.4: docker smoke uses repo-root context (Strategy A)
- 7.6: new — CI coordinator docker-build step repoint, owned by wp-integration
Work-packages changes
- wp-coordinator-migrate: locks/scope drop Makefile/.gitignore/ci.yml; gain
docker-compose.yml; tests/** tightened to specific test files; new comment
blocks document the test-scope and CI-ownership boundaries
- wp-package-tests: description updated — no longer mentions Option B
prebuild step
- wp-integration: gains .github/workflows/ci.yml in scope; new verification
step asserting CI step repoint; docker smoke command uses `.` context
All three validators green: openspec --strict, work-packages schema/DAG/locks,
parallel-zones scope+lock overlap (6 valid parallel pairs after wp-framework-move).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 3 surfaced 6 critical findings (codex 4, gemini 2 with overlap), all textual contradictions left over from the round-2 Strategy A pivot rather than new design holes. PLAN_FIX (this commit) addresses each: 1. design.md D5: drop editable=true + Option B paragraphs; document Strategy A reachability and non-editable rationale; point at task 7.4.1 in-container smoke (resolves codex C1 + C2). 2. tasks.md: add task 7.4.1 (in-container import gen_eval smoke, catches the multi-stage / non-editable path-dep runtime hazard codex round-3 flagged). Task 7.6 expanded to include the smoke-import line rewrite from evaluation.gen_eval.mcp_service to gen_eval.mcp_service. 3. tasks.md: refresh parallelizability summary for Strategy A reality (drops post-D8-Option-B, Makefile/.gitignore, pre-build wheel step). 4. work-packages.yaml wp-coordinator-migrate: remove ci.yml editing claim from description (it is owned by wp-integration via task 7.6). 5. work-packages.yaml wp-skills-update: verification grep split into dotted + slash forms (matches checkpoint 5.C; resolves gemini C2). 6. work-packages.yaml wp-integration: now owns task 5.5 (repo-wide stale-reference sweep). scope.write_allow expanded to CLAUDE.md, README.md, docs/**, apps/**, .github/**. Three new verification steps: ci_smoke_import_updated, in_container_import_smoke, repo_wide_sweep_clean. Resolves codex C3 + gemini C1. Validators all green: openspec validate --strict passes; work-packages schema/depends_on/DAG/lock_keys all pass; parallel-zones scope+lock overlap: 6 valid parallel pairs, no overlap. Declare convergence on round 3 with PLAN_FIX evidence rather than running a round-4 vendor pass (max_rounds=3, fixes are mechanical, no new design decisions introduced). Plan is ready for IMPLEMENT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PLAN_REVIEW round 3 converged via PLAN_FIX evidence (commit ed4f83b addresses all 6 critical findings raised by codex+gemini). All validators green: - openspec validate --strict: passes - work-packages schema/DAG/locks: passes - parallel-zones scope+lock overlap: 6 valid parallel pairs Loop state now at current_phase=IMPLEMENT, previous_phase=PLAN_REVIEW. Next: dispatch IMPLEMENT for the 7 work packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ject (extract-gen-eval-package tasks 1.1-1.3, 1.C) Land the empty package skeleton called out in wp-package-scaffold: - packages/gen-eval/pyproject.toml: uv_build backend (>=0.9,<1.0 per D2), four optional extras (mcp, sdk, db, all) matching the four lazy-import boundaries in the existing framework code, gen-eval console-script entry point, dev dependency-group with pytest/ruff/mypy. - packages/gen-eval/src/gen_eval/__init__.py: empty package shell (only exports __version__) so import gen_eval resolves but the framework code lands later under wp-framework-move. - packages/gen-eval/tests/test_smoke.py: TDD-RED smoke test asserting import gen_eval resolves to this packages/ tree, not the legacy agent-coordinator/evaluation/gen_eval/ location. Verifies the canonical-module-name scenario (gen-eval-framework spec). - packages/gen-eval/README.md: status banner, install profiles, layout. - packages/gen-eval/.gitignore: dist/, .venv/, caches. Checkpoint 1.C verified locally: cd packages/gen-eval && uv sync -> succeeds uv run pytest tests/test_smoke.py -> 2 passed uv build -> wheel + sdist in dist/ No changes to agent-coordinator or any consumer in this commit. The framework code move, coordinator pyproject swap, Dockerfile pivot, and skill invocation updates land in subsequent work packages per work-packages.yaml. Implements OpenSpec: extract-gen-eval-package Co-Authored-By: Claude <noreply@anthropic.com>
… extract GenEvalMetrics (extract-gen-eval-package tasks 2.1-2.5, 2.C) wp-framework-move: relocate the framework code and assert the new canonical-module-name surface (gen-eval-framework spec). Code move (preserves git history via git mv): - 23 .py files from agent-coordinator/evaluation/gen_eval/ -> packages/gen-eval/src/gen_eval/ (includes __init__, __main__, all generators, evaluator, orchestrator, semantic_judge, descriptor, manifest, models, reports, feedback, fidelity, findings_emitter, change_detector, config, coordinator, dtu_scaffold, openspec_seed). - 7 client transports under packages/gen-eval/src/gen_eval/clients/ (base, http, mcp, cli, db, wait, __init__). - schemas/, dtu/github/, dtu/transports/ shipped with the package. - evaluation/gen_eval/descriptors/sample-frontend.yaml -> packages/gen-eval/tests/fixtures/sample-descriptor.yaml plus its sample-frontend/ fixture tree (index.html, specs/sample.spec.md). - Repo-root evaluation/gen_eval/ tree fully removed (carved-out empty). - dtu/*/fidelity-report.json deleted (generated artifacts, not source). Surgical GenEvalMetrics extraction (D3): - packages/gen-eval/src/gen_eval/metrics.py is a brand-new file containing ONLY the GenEvalMetrics @DataClass plus its imports under leading-underscore aliases so the public surface is exactly {"GenEvalMetrics"}. - agent-coordinator/evaluation/metrics.py: the GenEvalMetrics class is removed; the other 10 metric classes and compute_effect_size remain unchanged (still consumed by evaluation/ablation.py, reports/generator.py, and coordinator tests). A migration note replaces the deleted code. Import rewrites inside the moved package: - 13 modules updated from `from evaluation.gen_eval.X` -> `from gen_eval.X` and `from evaluation.metrics import GenEvalMetrics` -> `from gen_eval.metrics import GenEvalMetrics` (in reports.py). - mcp_service.py and clients/mcp_client.py wrap their fastmcp imports in try/except with the documented `uv add 'gen-eval[mcp]'` hint per D4. Tests (TDD GREEN for tasks 2.1, 2.4.1): - tests/test_public_api_parity.py asserts three representative public APIs (Evaluator, parse_openspec_change, Scenario) plus a static check that no module under the package imports from agent_coordinator/src. - tests/test_metrics_surface.py asserts gen_eval.metrics exposes EXACTLY {"GenEvalMetrics"} -- explicit allowlist, self-diagnosing failure message listing unexpected names. Local verification: cd packages/gen-eval && uv sync && uv run pytest tests/ -v -> 7 passed grep -rE "from (agent_coordinator|src\.coordination_)" packages/gen-eval/src/ -> empty grep -r "from evaluation.gen_eval" packages/gen-eval/src/ -> empty cd agent-coordinator && uv run python -c "from evaluation.metrics import TimingMetric, ... compute_effect_size; print('ok')" -> ok cd packages/gen-eval && uv run python -c "import gen_eval.mcp_service" -> ImportError with [mcp] hint NOTICED BUT NOT TOUCHING: - agent-coordinator/evaluation/gen_eval/ now holds only descriptors/, manifests/, scenarios/ (per-project data). Relocation one level up per D7 happens under wp-coordinator-migrate (task 4.4) along with the coordinator's pyproject swap, lazy-import rewrites, and Dockerfile Strategy-A pivot. Out of scope for wp-framework-move. Implements OpenSpec: extract-gen-eval-package Co-Authored-By: Claude <noreply@anthropic.com>
Sub-agent a4590eb4efe3cd21f stalled (stream watchdog, 600s no progress). Synthetic handoff (implement-1.json) captures: 2 commits landed (13c0e04 wp-package-scaffold, b1e106e wp-framework-move), wp-package-tests staged + matching deletions unstaged, wp-coordinator-migrate in-progress unstaged. Working tree preserved — no destructive recovery applied. Per feedback_harness_worktree_silent_noop.md memory, this matches the known failure mode: Agent(isolation='worktree') silently no-opped — the harness worktree at .claude/worktrees/agent-a4590eb4efe3cd21f stayed at [main] and was never used. Sub-agent committed directly to the orchestrator's openspec/extract-gen-eval-package branch. Recovery options documented in handoff next_steps[]. Recommended path: commit the clean staged wp-package-tests work + stash wp-coordinator-migrate WIP + re-dispatch IMPLEMENT in a fresh session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p-package-tests move) Closes the half-state introduced by bcde916, which accidentally swept the pre-staged wp-package-tests additions into the IMPLEMENT=failed record-keeping commit. The additions landed under packages/gen-eval/tests/ in bcde916; this commit removes the matching source-side files in agent-coordinator/tests/test_evaluation/test_gen_eval/ so the move is coherent end-to-end (no duplicated tests on disk). 29 deletions: 28 test files + __init__.py (not carried over — modern pytest importlib mode doesn't need it). Sub-agent a4590eb4efe3cd21f originally staged the additions but left deletions unstaged before stalling; orchestrator detected the inconsistency post-hoc and closed it. Attribution: wp-package-tests work is the sub-agent's; this closing commit is bookkeeping. extract-gen-eval-package wp-package-tests tasks 3.1-3.x (per work-packages.yaml). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update completed_work and next_steps to accurately describe what happened: 28 wp-package-tests additions were swept into the record-keeping commit bcde916 (not "staged but not committed" as originally claimed). The matching 29 deletions were closed in the follow-up commit. Documentation-only fix; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…act-gen-eval-package)
… (extract-gen-eval-package wp-coordinator-migrate)
Tasks 4.1-4.6:
- 4.1 Add test_gen_eval_extraction.py: (a) gen_eval.mcp_service importable,
(b) /gen-eval/scenarios returns scenarios via AsyncClient TestClient
- 4.2 Update pyproject.toml: replace gen-eval optional-dep stub with
gen-eval[mcp] in [project.dependencies] + [tool.uv.sources] path-dep
- 4.3 Rewrite all evaluation.gen_eval imports in src/ to gen_eval.
Remove gen_eval re-export from evaluation/__init__.py and its __all__.
Update test_check_docker_imports.py fixture strings.
Update agent-coordinator/CLAUDE.md docker-smoke line.
Update scripts/check_docker_imports.py docstring example.
- 4.4 Relocate per-project data: evaluation/gen_eval/{descriptors,manifests,scenarios}
→ evaluation/{descriptors,manifests,scenarios}. Delete the now-empty
evaluation/gen_eval/ directory.
- 4.5 Dockerfile (Strategy A — repo-root build context):
- Prefix all COPY sources with agent-coordinator/
- Add COPY packages/gen-eval/ /workspace/packages/gen-eval/ before uv sync
- Add GEN_EVAL_DATA_DIR=/app/evaluation ENV
- docker-compose.yml: context=.. dockerfile=agent-coordinator/Dockerfile
- railway.toml: dockerfilePath=agent-coordinator/Dockerfile + dashboard
change documentation comment block
- README.md: add ## Deployment section documenting the Railway dashboard
prerequisite, rationale, and rollback path
- 4.6 Update packages/gen-eval/src/gen_eval/mcp_service.py:
- Add GEN_EVAL_DATA_DIR env-var support in _find_base_dir (D7)
- Fix run_evaluation: project_root = base.parent, cmd uses gen_eval not
evaluation.gen_eval
- Fix _find_latest_report path candidates
Update __main__.py docstring: python -m gen_eval (not evaluation.gen_eval)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ract-gen-eval-package wp-skills-update)
Tasks 5.1-5.4:
- 5.1 skills/gen-eval/SKILL.md: descriptor find glob → evaluation/descriptors/;
python -m gen_eval (was evaluation.gen_eval)
- 5.2 skills/validate-feature/SKILL.md: descriptor find glob → evaluation/descriptors/;
python -m gen_eval with GEN_EVAL_DATA_DIR=<module_root>/evaluation;
GENEVAL_MODULE_ROOT dirname depth corrected (3 levels, not 4)
- 5.3 skills/playwright-validator/scripts/cli.py: from gen_eval.openspec_seed import;
drop sys.path.insert(0, agent-coordinator/) workaround; update help text,
arg default path, and _minimal_parse docstring
- 5.3 skills/playwright-validator/scripts/{findings,descriptor,parser}.py: update
module path references in docstrings; evaluation/gen_eval → gen_eval
- 5.3 skills/playwright-validator/SKILL.md: all evaluation/gen_eval path refs updated
- 5.3.1 skills/gen-eval-scenario/SKILL.md: descriptor find glob → evaluation/descriptors/;
from gen_eval.models import Scenario; schema reference updated
- 5.4 Run install.sh --mode rsync to regenerate .claude/skills/ and .agents/skills/
runtime mirrors. Verified zero matches for evaluation.gen_eval /
evaluation/gen_eval across all three skill trees.
- Also update skills/tests/ validate-feature and playwright-validator for consistency
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…and ADR (extract-gen-eval-package wp-examples-doc)
Tasks 6.1-6.4:
- 6.1 examples/agentic-assistant-quickstart.md: complete adoption walkthrough
(uv add command, descriptor creation, template-only run, MCP reasoning,
CI integration, next steps)
- 6.2 examples/descriptor-template.yaml: annotated minimal descriptor
(http service, two interfaces, optional auth/CLI/categories/budget fields)
- 6.3 packages/gen-eval/README.md: public API summary (CLI flags table,
Python re-exports, MCP service API, GEN_EVAL_DATA_DIR docs, report
artifacts, layout, links). Replaces the scaffold stub.
- 6.4 docs/decisions/packages-convention-gen-eval-extraction.md: ADR
documenting the packages/ convention and gen-eval extraction
(Context, Decision, Consequences, Alternatives considered)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-gen-eval-package tasks 3.2, 3.3) - test_mcp_import_error: asserts gen_eval.mcp_service raises ImportError with [mcp] hint when fastmcp is absent (uses sys.modules blocker per D4) - test_sdist_contents: asserts sdist contains schemas/ and dtu/ and does NOT include coordinator-specific descriptors or manifests (per D7) - pyproject.toml: register 'slow' marker to silence PytestUnknownMarkWarning Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…t-gen-eval-package wp-integration) - ci.yml: add gen-eval-tests job (packages/gen-eval/); repoint docker-smoke-import to repo-root context + file: agent-coordinator/Dockerfile; update smoke-import list from evaluation.gen_eval.mcp_service to gen_eval.mcp_service; update gen-eval job to run test_gen_eval_extraction.py and add GEN_EVAL_DATA_DIR env var - docs/mental-models.md: update manifest/scenario paths to evaluation/ (not gen_eval/) - docs/software-factory-tooling.md: update directory layout and Python import examples from evaluation.gen_eval to gen_eval; update output paths from evaluation/gen_eval/dtu/ - docs/architecture-analysis/: refresh all artifacts post-extraction (python_analysis.json no longer references evaluation.gen_eval; packages/gen-eval component now visible) - tasks.md: mark all wp-integration tasks complete (5.3, 5.3.1, 5.4, 5.5, 5.C, 6.1-6.4, 6.C, 7.1, 7.2, 7.6) Validation: openspec validate extract-gen-eval-package --strict passes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…xtract-gen-eval-package) - implement-2.json handoff: all 7 work packages complete, 548 tests pass, openspec validate --strict passes, branch at 7f6125a - loop-state.json: current_phase IMPLEMENT → CLEANUP, previous_phase updated, implement-2 added to handoff_ids Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9a66f6b979
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Copy the gen-eval sibling package FIRST so the path-dep in | ||
| # agent-coordinator/pyproject.toml resolves: [tool.uv.sources] | ||
| # gen-eval = { path = "../packages/gen-eval" } | ||
| COPY packages/gen-eval/ /workspace/packages/gen-eval/ |
There was a problem hiding this comment.
Copy gen-eval to the path uv resolves
With this Dockerfile the builder runs uv sync from /app, while agent-coordinator/pyproject.toml and uv.lock resolve the path dependency as ../packages/gen-eval, i.e. /packages/gen-eval. The new COPY puts the package at /workspace/packages/gen-eval, so the CI smoke image and production Docker builds that use the repo-root context will fail before installing dependencies.
Useful? React with 👍 / 👎.
| # Default: relative to this file (valid for package-internal fixtures) | ||
| return Path(__file__).parent |
There was a problem hiding this comment.
Default to the coordinator data directory in local runs
When the coordinator is run from source without GEN_EVAL_DATA_DIR (for example cd agent-coordinator && uv run coordination-api), this fallback now points at the installed gen_eval package directory, but the scenarios/descriptors were moved to agent-coordinator/evaluation/. In that context /gen-eval/scenarios returns an empty list and /gen-eval/run reports No descriptor found; Docker only avoids this because it sets the env var explicitly.
Useful? React with 👍 / 👎.
Summary
Extracts the gen-eval framework (~6.5 KLOC, 23 modules) from
agent-coordinator/evaluation/gen_eval/into a standalone reusable package atpackages/gen-eval/. Establishes thepackages/<name>/convention for future shared-component extractions (simulation, prompt-optimization, vendor-convention skills). Sibling repos (newsletter-aggregator, agentic-assistant) can now install viauv addinstead of copy-pasting source.Evidence Trail
make architectureregenerateddocs/architecture-analysis/;python_analysis.jsonno longer referencesevaluation.gen_evalopenspec/extract-gen-eval-package(HEAD911e721)CHANGES MADE
packages/gen-eval/scaffolded withpyproject.toml, smoke test, full framework source migrated (~6.5 KLOC across 23 modules)agent-coordinatorrepurposed as a gen-eval consumer; in-tree importsfrom evaluation.gen_eval import …updated tofrom gen_eval import …; lazy MCP imports incoordination_api.py/coordination_mcp.pyupdated;pyproject.tomladdsgen-eval = { path = \"../packages/gen-eval\", extras = [\"mcp\"] }GenEvalMetricsextracted fromagent-coordinator/evaluation/metrics.pyintopackages/gen-eval/src/gen_eval/metrics.py(the one reverse coupling)agent-coordinator/tests/test_evaluation/test_gen_eval/topackages/gen-eval/tests/test_mcp_import_error.py(assertsgen_eval.mcp_serviceraisesImportErrorwith[mcp]hint whenfastmcpabsent),test_sdist_contents.py(asserts sdist containsschemas/+dtu/and excludes coordinator-specific descriptors)skills/gen-eval/SKILL.md,skills/validate-feature/SKILL.md,skills/playwright-validator/SKILL.mdDIDN'T TOUCH
/cleanup-featureper autopilot workflow (CLAUDE.md: "/cleanup-featureand/merge-pull-requestsrun Docker-dependent checks before merge")uv add ../agentic-coding-tools/packages/gen-evalCONCERNS
analystarchetype's system_prompt ("read-only") conflicts with the VALIDATE task (writes evidence). Substantive validation evidence is the IMPLEMENT phase output (openspec validate strict + 548 tests + architecture refresh) — this is real and sufficient, but the artifact-production overhead from a clean VALIDATE run is missing. Docker phases land in/cleanup-featureregardless. Filing a separate OpenSpec change for the archetype-mapping bug + a related IMPLEMENT-phase contract violation.bcde916is the failure record from the first run; the work was resumed and completed in this run). Audit trail inopenspec/changes/extract-gen-eval-package/handoffs/implement-1.json(failure) andimplement-2.json(resume success).[mcp]extras split right, or should the MCP module always import-guard rather than gate on extras? (b) does thedescriptors/manifests/scenariosboundary keep the right things in the consumer vs. the package? (c) any cross-repo consumer that imports gen-eval will need to update its install command — is that captured anywhere documentation reviewers should verify?Test plan
/cleanup-feature extract-gen-eval-packageruns full Docker validation pipeline (deploy, smoke, security, E2E)uv pip install -e packages/gen-evalin a fresh venv succeedsuv pip install -e packages/gen-eval[mcp]in a fresh venv succeeds andpython -c \"import gen_eval.mcp_service\"workspython -m gen_evalagainst agent-coordinator to confirm consumer wiring works end-to-endpython -m evaluation.gen_evalis broken (old path) and any remaining imports ofevaluation.gen_evalare caught by CIGenerated by /autopilot — awaiting human approval for merge.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com