feat(evaluator): publish_to_intake entrypoint (D9) by SandyChapman · Pull Request #456 · NVIDIA-NeMo/nemo-platform

SandyChapman · 2026-06-25T12:41:50Z

What

Adds publish_to_intake (D9, AALGO-290) — the explicit, post-run consumer of AgentEvalResult that writes a completed agent evaluation to Intake. Builds on the D8 mapping module (#443).

report = await publish_to_intake(
    result, platform=client, experiment_id="my-exp",
    workspace=..., agent_name=..., model_name=...,
)

Per trial (concurrent, bounded by a semaphore): POST /ingest/atif → resolve the root span via the traces query-back (design §3.5 option 1) → one POST /evaluator-results per metric output. Returns a PublishReport (per-trial session/span ids + result counts).

Design notes

No bespoke HTTP client. All shapes come from the D8 mapping (*CreateParams TypedDicts); all wire calls go through the generated SDK intake resources (client.intake.ingest.atif, .traces, .evaluator_results). It imports the SDK client types, never the Intake service.
References an existing Experiment; never creates one — the caller pre-creates the group/experiment via the Experiments SDK (exist_ok=True). Ingest 400s on an unknown experiment and we surface it.
Agent identity is a parameter (agent_name/agent_version/model_name) because it lives on the run target, which AgentEvalResult doesn't carry (§3.9 feat(evaluator): port metric output protocol #6).
Raises on HTTP/validation failure; the local bundle is never touched.

Testing

Unit (tests/intake/test_publish.py, fake async client): round-trip, multi-trial, score-less trial, workspace resolution, span-resolution failure, ingest-failure propagation.
Live integration (tests/integration/test_publish_to_intake.py): stands up ClickHouse (Docker) + auth,entities,intake via session fixtures, runs the real round-trip, and asserts every field read back through the public Intake API — experiment_context.{experiment_id,test_case_id}, root_span_id, and each evaluator-result name/data_type/value/string_value/session_id/span_id. Marked integration (excluded from the unit suite; skips when Docker is absent). Verified end-to-end locally (~18s).

ruff, ty, and 28 unit tests green.

Notes / follow-ups

Re-publish is not idempotent (/evaluator-results mints a fresh id per call — ask X1); MVP stance is write-once per result.
The integration test spawns the real platform rather than the in-process testcontainers/ASGI style used elsewhere — works, but heavier; can be aligned later if CI flakes.
D3/D4/D5 were not hard dependencies of D9; whether to keep publish.py inline or refactor into adapter layers is a separate, open discussion (tracked, not decided here).

Refs: AALGO-290.

🤖 Generated with Claude Code

Summary by CodeRabbit

New Features
- Added an evaluator-to-Intake publish flow that returns a publish report including published trials and skipped items.
- Added filtering and reporting for non-finite metric outputs and FAILED scoring outputs.
- Added trace span resolution to correctly associate published rows back to each trial.
Bug Fixes
- Improved robustness for missing workspace values, unresolvable trace spans, and ingestion/publish partial failures.
Tests
- Expanded unit tests for multi-trial publishing, empty-score behavior, and skip reasons.
- Added end-to-end coverage using a live Intake service (with an optional integration test marker).

publish_to_intake(result, *, platform, experiment_id, workspace=None, ...) is the explicit, post-run consumer of AgentEvalResult. Per trial it POSTs the ATIF trajectory, resolves the root span via the traces query-back (§3.5 option 1), then POSTs one evaluator-result per metric output. All request shapes come from the D8 mapping module; wire calls go through the generated SDK intake resources. - Async, bounded-concurrent across trials (asyncio.gather + semaphore). - References an existing Experiment; never creates one (caller pre-creates via the Experiments SDK). Agent identity is taken as arguments since it lives on the run target, which AgentEvalResult does not carry (§3.9 #6). - Returns a PublishReport (per-trial session/span ids + result counts). - Raises on HTTP/validation failure; the local bundle is never touched. Unit-tested with a fake async client (round-trip, multi-trial, score-less trial, workspace resolution, span-resolution failure, ingest-failure propagation). Manually validated end-to-end against live Intake + ClickHouse. Refs: AALGO-290 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Sandy Chapman <schapman@nvidia.com>

coderabbitai · 2026-06-25T12:47:27Z

📝 Walkthrough

Walkthrough

Adds skipped-output handling to intake mapping, publishes eval results with partial failure reporting, and expands unit and live integration coverage.

Changes

Intake publish flow

Layer / File(s)	Summary
Score mapping and skips `plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py`	`score_to_evaluator_results` now returns rows plus skipped outputs, filters failed scores and non-finite metric values, and records skip reasons.
Publish orchestration `plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py`	`publish_to_intake` adds publish reporting models, resolves the root span from Intake traces, publishes trials concurrently, and raises `PublishError` with partial results on failure.
Mapping tests `plugins/nemo-evaluator/tests/intake/test_mapping.py`	Test helpers and assertions now cover row extraction, diagnostic comments, skipped outputs, and non-finite value handling.
Publish tests and integration harness `plugins/nemo-evaluator/pyproject.toml`, `plugins/nemo-evaluator/tests/intake/test_publish.py`, `plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py`	Fake platform publish tests cover success, workspace resolution, missing span, ingest failure, skipped outputs, and partial failures; integration coverage adds the pytest marker, live services setup, and end-to-end publish checks.

Possibly related PRs

NVIDIA-NeMo/nemo-platform#443: Changes the intake mapping path that publish_to_intake now consumes, including skipped-output handling.

Suggested reviewers

ngoncharenko
arpitsardhana

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 18.87% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Clearly names the main change: adding the publish_to_intake entrypoint for evaluator intake publishing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch aalgo-290-publish-to-intake/schapman

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

🧹 Nitpick comments (1)

plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py (1)
138-138: 🩺 Stability & Availability | 🔵 Trivial

Use asyncio.TaskGroup here. asyncio.gather lets other trial publishes keep running after the first exception, so a failed POST can leave orphaned in-flight writes. Python 3.11+ is already required, so structured cancellation is available.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py` at line 138, The
trial publishing fan-out in publish() still uses asyncio.gather, which allows
other _publish_trial tasks to keep running after one fails. Replace that gather
call with asyncio.TaskGroup so all in-flight trial publishes are structurally
cancelled on the first exception, and update the surrounding logic in
publish/_publish_trial to collect results from the TaskGroup without changing
the existing publish flow.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py`:
- Line 138: The trial publishing fan-out in publish() still uses asyncio.gather,
which allows other _publish_trial tasks to keep running after one fails. Replace
that gather call with asyncio.TaskGroup so all in-flight trial publishes are
structurally cancelled on the first exception, and update the surrounding logic
in publish/_publish_trial to collect results from the TaskGroup without changing
the existing publish flow.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d476ccbf-3112-47f9-952f-ec1099a6f6e9

📥 Commits

Reviewing files that changed from the base of the PR and between 15a7e44 and 78b2a3d.

📒 Files selected for processing (4)

plugins/nemo-evaluator/pyproject.toml
plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py
plugins/nemo-evaluator/tests/intake/test_publish.py
plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py

github-actions · 2026-06-25T12:54:55Z

Suite	Lines Covered	Line Rate	Branch Rate
Unit Tests	21322/27924	76.4%	61.4%
Integration Tests	12354/26693	46.3%	19.8%

arpitsardhana

There can be potential of intake corruption of some trials of experiment get persisted while other did not

…ke (D9) Stands up ClickHouse (Docker) + the platform (auth,entities,intake) via session fixtures, creates an Experiment, runs publish_to_intake, and asserts the full round-trip back through the public Intake API: experiment_context propagation (experiment_id + test_case_id), root-span resolution, and every evaluator-result field/data_type (NUMERIC/BOOLEAN/TEXT, including false->0.0). Marked 'integration' (registered in the plugin pyproject), so it runs under make test-integration / -m integration and is excluded from the unit suite; it skips cleanly when Docker is unavailable, mirroring the intake conftest's _docker_available() gate. Verified end-to-end locally (~18s). Refs: AALGO-290 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Sandy Chapman <schapman@nvidia.com>

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py (1)

135-160: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

skipped leaks omissions from failed trials into the report.

skipped is shared across trials and extended mid-trial (Line 160). If a trial raises after appending some omissions (e.g. eval-create fails on a later score), those entries persist and surface in report.skipped, even though the trial is reported as failed. Collect omissions locally and merge only from succeeded trials.

Proposed fix: return skipped from each trial

-    semaphore = asyncio.Semaphore(max_concurrency)
-    skipped: list[SkippedScore] = []
-
-    async def _publish_trial(trial: AgentEvalTrial) -> PublishedTrial:
+    semaphore = asyncio.Semaphore(max_concurrency)
+
+    async def _publish_trial(trial: AgentEvalTrial) -> tuple[PublishedTrial, list[SkippedScore]]:
         async with semaphore:
             ...
             written = 0
+            trial_skipped: list[SkippedScore] = []
             for score in scores_by_trial.get(trial.id, []):
                 rows, omitted = mapping.score_to_evaluator_results(score, session_id=session_id, span_id=span_id)
                 for row in rows:
                     row["workspace"] = resolved_workspace
                     await platform.intake.evaluator_results.create(**row)
                     written += 1
-                skipped.extend(SkippedScore(trial_id=trial.id, name=item.name, reason=item.reason) for item in omitted)
+                trial_skipped.extend(SkippedScore(trial_id=trial.id, name=item.name, reason=item.reason) for item in omitted)
 
-            return PublishedTrial(...)
+            return PublishedTrial(...), trial_skipped

Then accumulate skipped only from PublishedTrial outcomes when unpacking outcomes.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py` around lines 135
- 160, The `skipped` list in `_publish_trial` is currently shared across trials
and gets extended before the trial fully succeeds, which can leak omissions from
failed trials into the final report. Move omission collection to a
per-trial/local variable in `_publish_trial`, return it with `PublishedTrial`,
and only merge skipped entries into the top-level `skipped` list when unpacking
successful `outcomes` so failed trials do not contribute to `report.skipped`.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py`:
- Around line 135-160: The `skipped` list in `_publish_trial` is currently
shared across trials and gets extended before the trial fully succeeds, which
can leak omissions from failed trials into the final report. Move omission
collection to a per-trial/local variable in `_publish_trial`, return it with
`PublishedTrial`, and only merge skipped entries into the top-level `skipped`
list when unpacking successful `outcomes` so failed trials do not contribute to
`report.skipped`.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f12023e-d2f2-4f77-a8ff-aec5aa41af5c

📥 Commits

Reviewing files that changed from the base of the PR and between 78b2a3d and 69e2ac1.

📒 Files selected for processing (6)

plugins/nemo-evaluator/pyproject.toml
plugins/nemo-evaluator/src/nemo_evaluator/intake/mapping.py
plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py
plugins/nemo-evaluator/tests/intake/test_mapping.py
plugins/nemo-evaluator/tests/intake/test_publish.py
plugins/nemo-evaluator/tests/integration/test_publish_to_intake.py

✅ Files skipped from review due to trivial changes (1)

plugins/nemo-evaluator/pyproject.toml

SandyChapman requested review from a team as code owners June 25, 2026 12:41

github-actions Bot added the feat label Jun 25, 2026

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

arpitsardhana reviewed Jun 25, 2026

View reviewed changes

Comment thread plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py Outdated

arpitsardhana reviewed Jun 25, 2026

View reviewed changes

Comment thread plugins/nemo-evaluator/src/nemo_evaluator/intake/publish.py

arpitsardhana reviewed Jun 25, 2026

View reviewed changes

SandyChapman force-pushed the aalgo-290-publish-to-intake/schapman branch from 78b2a3d to 69e2ac1 Compare June 26, 2026 13:22

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

arpitsardhana self-requested a review June 26, 2026 17:30

arpitsardhana approved these changes Jun 26, 2026

View reviewed changes

SandyChapman added this pull request to the merge queue Jun 26, 2026

Merged via the queue into main with commit 09174c0 Jun 26, 2026
53 checks passed

SandyChapman deleted the aalgo-290-publish-to-intake/schapman branch June 26, 2026 17:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(evaluator): publish_to_intake entrypoint (D9)#456

feat(evaluator): publish_to_intake entrypoint (D9)#456
SandyChapman merged 2 commits into
mainfrom
aalgo-290-publish-to-intake/schapman

SandyChapman commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

github-actions Bot commented Jun 25, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

arpitsardhana left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

SandyChapman commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Design notes

Testing

Notes / follow-ups

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arpitsardhana left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SandyChapman commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading

github-actions Bot commented Jun 25, 2026 •

edited

Loading