Skip to content

feat(evaluator): publish_to_intake entrypoint (D9)#456

Merged
SandyChapman merged 2 commits into
mainfrom
aalgo-290-publish-to-intake/schapman
Jun 26, 2026
Merged

feat(evaluator): publish_to_intake entrypoint (D9)#456
SandyChapman merged 2 commits into
mainfrom
aalgo-290-publish-to-intake/schapman

Conversation

@SandyChapman

@SandyChapman SandyChapman commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

What

Adds publish_to_intake (D9, AALGO-290) — the explicit, post-run consumer of AgentEvalResult that writes a completed agent evaluation to Intake. Builds on the D8 mapping module (#443).

report = await publish_to_intake(
    result, platform=client, experiment_id="my-exp",
    workspace=..., agent_name=..., model_name=...,
)

Per trial (concurrent, bounded by a semaphore): POST /ingest/atif → resolve the root span via the traces query-back (design §3.5 option 1) → one POST /evaluator-results per metric output. Returns a PublishReport (per-trial session/span ids + result counts).

Design notes

  • No bespoke HTTP client. All shapes come from the D8 mapping (*CreateParams TypedDicts); all wire calls go through the generated SDK intake resources (client.intake.ingest.atif, .traces, .evaluator_results). It imports the SDK client types, never the Intake service.
  • References an existing Experiment; never creates one — the caller pre-creates the group/experiment via the Experiments SDK (exist_ok=True). Ingest 400s on an unknown experiment and we surface it.
  • Agent identity is a parameter (agent_name/agent_version/model_name) because it lives on the run target, which AgentEvalResult doesn't carry (§3.9 feat(evaluator): port metric output protocol #6).
  • Raises on HTTP/validation failure; the local bundle is never touched.

Testing

  • Unit (tests/intake/test_publish.py, fake async client): round-trip, multi-trial, score-less trial, workspace resolution, span-resolution failure, ingest-failure propagation.
  • Live integration (tests/integration/test_publish_to_intake.py): stands up ClickHouse (Docker) + auth,entities,intake via session fixtures, runs the real round-trip, and asserts every field read back through the public Intake API — experiment_context.{experiment_id,test_case_id}, root_span_id, and each evaluator-result name/data_type/value/string_value/session_id/span_id. Marked integration (excluded from the unit suite; skips when Docker is absent). Verified end-to-end locally (~18s).

ruff, ty, and 28 unit tests green.

Notes / follow-ups

  • Re-publish is not idempotent (/evaluator-results mints a fresh id per call — ask X1); MVP stance is write-once per result.
  • The integration test spawns the real platform rather than the in-process testcontainers/ASGI style used elsewhere — works, but heavier; can be aligned later if CI flakes.
  • D3/D4/D5 were not hard dependencies of D9; whether to keep publish.py inline or refactor into adapter layers is a separate, open discussion (tracked, not decided here).

Refs: AALGO-290.

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Added an evaluator-to-Intake publish flow that returns a publish report including published trials and skipped items.
    • Added filtering and reporting for non-finite metric outputs and FAILED scoring outputs.
    • Added trace span resolution to correctly associate published rows back to each trial.
  • Bug Fixes
    • Improved robustness for missing workspace values, unresolvable trace spans, and ingestion/publish partial failures.
  • Tests
    • Expanded unit tests for multi-trial publishing, empty-score behavior, and skip reasons.
    • Added end-to-end coverage using a live Intake service (with an optional integration test marker).

publish_to_intake(result, *, platform, experiment_id, workspace=None, ...) is the
explicit, post-run consumer of AgentEvalResult. Per trial it POSTs the ATIF
trajectory, resolves the root span via the traces query-back (§3.5 option 1),
then POSTs one evaluator-result per metric output. All request shapes come from
the D8 mapping module; wire calls go through the generated SDK intake resources.

- Async, bounded-concurrent across trials (asyncio.gather + semaphore).
- References an existing Experiment; never creates one (caller pre-creates via
  the Experiments SDK). Agent identity is taken as arguments since it lives on
  the run target, which AgentEvalResult does not carry (§3.9 #6).
- Returns a PublishReport (per-trial session/span ids + result counts).
- Raises on HTTP/validation failure; the local bundle is never touched.

Unit-tested with a fake async client (round-trip, multi-trial, score-less
trial, workspace resolution, span-resolution failure, ingest-failure
propagation). Manually validated end-to-end against live Intake + ClickHouse.

Refs: AALGO-290

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
@SandyChapman SandyChapman requested review from a team as code owners June 25, 2026 12:41
@github-actions github-actions Bot added the feat label Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds skipped-output handling to intake mapping, publishes eval results with partial failure reporting, and expands unit and live integration coverage.

Changes

Intake publish flow

Layer / File(s) Summary
Score mapping and skips
plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py
score_to_evaluator_results now returns rows plus skipped outputs, filters failed scores and non-finite metric values, and records skip reasons.
Publish orchestration
plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py
publish_to_intake adds publish reporting models, resolves the root span from Intake traces, publishes trials concurrently, and raises PublishError with partial results on failure.
Mapping tests
plugins/nemo-evaluator/tests/intake/test_mapping.py
Test helpers and assertions now cover row extraction, diagnostic comments, skipped outputs, and non-finite value handling.
Publish tests and integration harness
plugins/nemo-evaluator/pyproject.toml, plugins/nemo-evaluator/tests/intake/test_publish.py, plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py
Fake platform publish tests cover success, workspace resolution, missing span, ingest failure, skipped outputs, and partial failures; integration coverage adds the pytest marker, live services setup, and end-to-end publish checks.

Possibly related PRs

Suggested reviewers

  • ngoncharenko
  • arpitsardhana
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 18.87% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Clearly names the main change: adding the publish_to_intake entrypoint for evaluator intake publishing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch aalgo-290-publish-to-intake/schapman

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py (1)

138-138: 🩺 Stability & Availability | 🔵 Trivial

Use asyncio.TaskGroup here. asyncio.gather lets other trial publishes keep running after the first exception, so a failed POST can leave orphaned in-flight writes. Python 3.11+ is already required, so structured cancellation is available.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py` at line 138, The
trial publishing fan-out in publish() still uses asyncio.gather, which allows
other _publish_trial tasks to keep running after one fails. Replace that gather
call with asyncio.TaskGroup so all in-flight trial publishes are structurally
cancelled on the first exception, and update the surrounding logic in
publish/_publish_trial to collect results from the TaskGroup without changing
the existing publish flow.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py`:
- Line 138: The trial publishing fan-out in publish() still uses asyncio.gather,
which allows other _publish_trial tasks to keep running after one fails. Replace
that gather call with asyncio.TaskGroup so all in-flight trial publishes are
structurally cancelled on the first exception, and update the surrounding logic
in publish/_publish_trial to collect results from the TaskGroup without changing
the existing publish flow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d476ccbf-3112-47f9-952f-ec1099a6f6e9

📥 Commits

Reviewing files that changed from the base of the PR and between 15a7e44 and 78b2a3d.

📒 Files selected for processing (4)
  • plugins/nemo-evaluator/pyproject.toml
  • plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py
  • plugins/nemo-evaluator/tests/intake/test_publish.py
  • plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor
Suite Lines Covered Line Rate Branch Rate
Unit Tests 21322/27924 76.4% 61.4%
Integration Tests 12354/26693 46.3% 19.8%

Comment thread plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py Outdated
Comment thread plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py

@arpitsardhana arpitsardhana left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be potential of intake corruption of some trials of experiment get persisted while other did not

…ke (D9)

Stands up ClickHouse (Docker) + the platform (auth,entities,intake) via session
fixtures, creates an Experiment, runs publish_to_intake, and asserts the full
round-trip back through the public Intake API: experiment_context propagation
(experiment_id + test_case_id), root-span resolution, and every evaluator-result
field/data_type (NUMERIC/BOOLEAN/TEXT, including false->0.0).

Marked 'integration' (registered in the plugin pyproject), so it runs under
make test-integration / -m integration and is excluded from the unit suite; it
skips cleanly when Docker is unavailable, mirroring the intake conftest's
_docker_available() gate. Verified end-to-end locally (~18s).

Refs: AALGO-290

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Sandy Chapman <schapman@nvidia.com>
@SandyChapman SandyChapman force-pushed the aalgo-290-publish-to-intake/schapman branch from 78b2a3d to 69e2ac1 Compare June 26, 2026 13:22

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py (1)

135-160: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

skipped leaks omissions from failed trials into the report.

skipped is shared across trials and extended mid-trial (Line 160). If a trial raises after appending some omissions (e.g. eval-create fails on a later score), those entries persist and surface in report.skipped, even though the trial is reported as failed. Collect omissions locally and merge only from succeeded trials.

Proposed fix: return skipped from each trial
-    semaphore = asyncio.Semaphore(max_concurrency)
-    skipped: list[SkippedScore] = []
-
-    async def _publish_trial(trial: AgentEvalTrial) -> PublishedTrial:
+    semaphore = asyncio.Semaphore(max_concurrency)
+
+    async def _publish_trial(trial: AgentEvalTrial) -> tuple[PublishedTrial, list[SkippedScore]]:
         async with semaphore:
             ...
             written = 0
+            trial_skipped: list[SkippedScore] = []
             for score in scores_by_trial.get(trial.id, []):
                 rows, omitted = mapping.score_to_evaluator_results(score, session_id=session_id, span_id=span_id)
                 for row in rows:
                     row["workspace"] = resolved_workspace
                     await platform.intake.evaluator_results.create(**row)
                     written += 1
-                skipped.extend(SkippedScore(trial_id=trial.id, name=item.name, reason=item.reason) for item in omitted)
+                trial_skipped.extend(SkippedScore(trial_id=trial.id, name=item.name, reason=item.reason) for item in omitted)
 
-            return PublishedTrial(...)
+            return PublishedTrial(...), trial_skipped

Then accumulate skipped only from PublishedTrial outcomes when unpacking outcomes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py` around lines 135
- 160, The `skipped` list in `_publish_trial` is currently shared across trials
and gets extended before the trial fully succeeds, which can leak omissions from
failed trials into the final report. Move omission collection to a
per-trial/local variable in `_publish_trial`, return it with `PublishedTrial`,
and only merge skipped entries into the top-level `skipped` list when unpacking
successful `outcomes` so failed trials do not contribute to `report.skipped`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py`:
- Around line 135-160: The `skipped` list in `_publish_trial` is currently
shared across trials and gets extended before the trial fully succeeds, which
can leak omissions from failed trials into the final report. Move omission
collection to a per-trial/local variable in `_publish_trial`, return it with
`PublishedTrial`, and only merge skipped entries into the top-level `skipped`
list when unpacking successful `outcomes` so failed trials do not contribute to
`report.skipped`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f12023e-d2f2-4f77-a8ff-aec5aa41af5c

📥 Commits

Reviewing files that changed from the base of the PR and between 78b2a3d and 69e2ac1.

📒 Files selected for processing (6)
  • plugins/nemo-evaluator/pyproject.toml
  • plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py
  • plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py
  • plugins/nemo-evaluator/tests/intake/test_mapping.py
  • plugins/nemo-evaluator/tests/intake/test_publish.py
  • plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py
✅ Files skipped from review due to trivial changes (1)
  • plugins/nemo-evaluator/pyproject.toml

@arpitsardhana arpitsardhana self-requested a review June 26, 2026 17:30
@SandyChapman SandyChapman added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit 09174c0 Jun 26, 2026
53 checks passed
@SandyChapman SandyChapman deleted the aalgo-290-publish-to-intake/schapman branch June 26, 2026 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants