
Track cached, reasoning, and cost usage in v1 #1704

Open
xeophon wants to merge 1 commit into feat/nano-as-v1 from v1/usage-telemetry
xeophon (Member) commented Jun 16, 2026

Overview

Add provider-reported usage telemetry to v1 traces and surface it consistently across evaluation, training serialization, persistence, and the rich dashboard.

Details

  • Extend Usage with cached input tokens, reasoning tokens, and provider-reported cost, with per-call aggregation across a rollout.
  • Preserve response usage on sampled message nodes so it survives wire and disk serialization.
  • Map OpenAI Chat Completions, OpenAI Responses, and Anthropic usage details into the shared v1 representation using their SDK models.
  • Keep provider usage separate from renderer-derived training sequence lengths while carrying cache and reasoning details through the training and legacy bridges.
  • Show cached input, reasoning tokens, and accumulated cost alongside token counts in the rich dashboard.
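The per-call aggregation described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: the field names come from the description, but the dataclass layout, the definitions of `input_tokens` and `total_tokens`, and the handling of a missing cost are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Usage:
    """Hypothetical stand-in for the v1 Usage type (field names from the PR text)."""

    prompt_tokens: int = 0        # assumed: uncached input only
    completion_tokens: int = 0
    cached_input_tokens: int = 0  # assumed: disjoint from prompt_tokens
    reasoning_tokens: int = 0     # assumed: a subset of completion_tokens
    cost: Optional[float] = None  # provider-reported USD; None if not reported

    @property
    def input_tokens(self) -> int:
        # Assumption: total input is uncached plus cached input.
        return self.prompt_tokens + self.cached_input_tokens

    @property
    def total_tokens(self) -> int:
        # Reasoning tokens are not added again; they are already in completion_tokens.
        return self.input_tokens + self.completion_tokens

    @classmethod
    def aggregate(cls, usages: Iterable["Usage"]) -> "Usage":
        """Sum usage across the model calls of one rollout."""
        usages = list(usages)
        costs = [u.cost for u in usages if u.cost is not None]
        return cls(
            prompt_tokens=sum(u.prompt_tokens for u in usages),
            completion_tokens=sum(u.completion_tokens for u in usages),
            cached_input_tokens=sum(u.cached_input_tokens for u in usages),
            reasoning_tokens=sum(u.reasoning_tokens for u in usages),
            # Assumption: cost stays None when no call reported one.
            cost=sum(costs) if costs else None,
        )
```

Keeping `cost` as `None` rather than `0.0` when nothing was reported lets a consumer distinguish "free" from "unknown", which matters if some providers never report cost.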

Note

Medium Risk
Changes how prompt_tokens is interpreted per provider and persists usage on traces, which can shift displayed totals and anything consuming stored usage—though scope is telemetry/display rather than core rollout logic.

Overview
Extends v1 usage telemetry so provider-reported cache reads, reasoning-token subsets, and optional USD cost flow from dialect parsers into traces, training serialization, and the eval rich dashboard.

Usage gains cached_input_tokens, reasoning_tokens, cost, plus aggregate and input_tokens so totals treat cached input as a disjoint bucket. OpenAI Chat, Responses, and Anthropic dialects map SDK usage details into that shape (notably prompt_tokens is uncached input for OpenAI; Anthropic prompt_tokens includes cache-creation tokens). Per-turn usage on MessageNode is serialized on wire/disk; Trace.usage / Branch.usage roll up per model call. The train client round-trips cache/reasoning into synthetic chat completions; the legacy bridge reads the new fields from v0 dicts. The eval dashboard shows cached, reasoning, and cost beside token counts.
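To make the per-provider `prompt_tokens` semantics concrete, here is a rough sketch of the two mappings using plain dicts. The PR itself maps SDK usage models, so the function names and the dict-based input are assumptions; the provider field names (`prompt_tokens_details.cached_tokens`, `completion_tokens_details.reasoning_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) are the ones the OpenAI Chat Completions and Anthropic Messages APIs report.

```python
from typing import Any, Mapping


def from_openai_chat(usage: Mapping[str, Any]) -> dict[str, int]:
    """Hypothetical mapper: OpenAI's prompt_tokens includes cached tokens,
    so the cached portion is subtracted to get uncached input."""
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}
    cached = prompt_details.get("cached_tokens") or 0
    return {
        "prompt_tokens": (usage.get("prompt_tokens") or 0) - cached,
        "completion_tokens": usage.get("completion_tokens") or 0,
        "cached_input_tokens": cached,
        "reasoning_tokens": completion_details.get("reasoning_tokens") or 0,
    }


def from_anthropic(usage: Mapping[str, Any]) -> dict[str, int]:
    """Hypothetical mapper: Anthropic's input_tokens already excludes cache
    traffic, so cache-creation tokens are added and cache reads kept apart."""
    creation = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    return {
        "prompt_tokens": (usage.get("input_tokens") or 0) + creation,
        "completion_tokens": usage.get("output_tokens") or 0,
        "cached_input_tokens": read,
        # Assumption: no separate reasoning count is mapped in this sketch.
        "reasoning_tokens": 0,
    }
```

In both cases `prompt_tokens + cached_input_tokens` recovers the full input size, which is what makes totals comparable across providers.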

Reviewed by Cursor Bugbot for commit 81e1fc7.

Note

Track cached, reasoning, and cost usage fields across v1 dialects, traces, and dashboard

  • Extends the Usage type with cached_input_tokens, reasoning_tokens, cost, and an aggregate classmethod; total_tokens now includes cached input tokens.
  • Updates OpenAI Chat, Responses, and Anthropic dialect translators to populate the new fields from provider-reported usage details.
  • Adds a Trace.usage computed property that aggregates Usage across all model call nodes in the trace.
  • Updates the eval dashboard to display cached tokens, reasoning tokens, and USD cost per row; hides token counts when both prompt and completion are zero.
  • MessageNode.usage is no longer excluded from serialization, so per-node usage is now retained in trace output.
  • Behavioral Change: prompt_tokens in OpenAI Chat and Responses dialects now reflects uncached input tokens only; Anthropic prompt_tokens now includes cache-creation tokens.
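The dashboard behavior in the list above (extra columns per row, token counts hidden when both prompt and completion are zero) might look roughly like the sketch below. The function name, separator, placeholder, and number formatting are all assumptions for illustration; the actual dashboard code is not shown in this PR page.

```python
from typing import Optional


def format_usage_cell(
    prompt_tokens: int,
    completion_tokens: int,
    cached_input_tokens: int = 0,
    reasoning_tokens: int = 0,
    cost: Optional[float] = None,
) -> str:
    """Hypothetical row formatter for a usage cell in the eval dashboard."""
    parts = []
    # Hide the token counts when both prompt and completion are zero.
    if prompt_tokens or completion_tokens:
        parts.append(f"{prompt_tokens:,} in / {completion_tokens:,} out")
    if cached_input_tokens:
        parts.append(f"{cached_input_tokens:,} cached")
    if reasoning_tokens:
        parts.append(f"{reasoning_tokens:,} reasoning")
    # Cost is shown only when the provider reported one.
    if cost is not None:
        parts.append(f"${cost:.4f}")
    return " | ".join(parts) if parts else "-"
```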

Macroscope summarized 81e1fc7.

xeophon force-pushed the v1/usage-telemetry branch from 980e068 to 81e1fc7 on June 16, 2026 12:41
xeophon changed the base branch from codex/v1-prime-config to feat/nano-as-v1 on June 16, 2026 12:41

macroscopeapp Bot commented Jun 16, 2026


Approvability

Verdict: Needs human review

This PR adds new token tracking fields and changes the semantic meaning of prompt_tokens (now excludes cached tokens) across multiple dialects and core types. The persistence behavior of usage data also changes from transient to persisted. These are meaningful runtime behavior changes that warrant review.


chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 980e06833e


provider/SDK enum skew (e.g. a value the pinned `openai` rejects)."""

model_config = ConfigDict(extra="allow")
usage: ResponseUsage | None = None


P2: Keep Responses usage parsing permissive

If an openai_responses endpoint returns a usage object containing only aggregate counts (input_tokens, output_tokens, total_tokens), this typed field causes OpenAIResponse.model_validate(raw) in EvalClient.get_response to validate against the SDK's ResponseUsage, whose nested token-detail objects are required. The previous extra-allow/dict path accepted such responses and treated missing details as 0; now the rollout fails before parse_response is reached. Keep usage permissive, or normalize missing detail fields before validation.
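One way to apply the "normalize before validation" option is to fill in the missing detail objects on the raw payload first. The sketch below works on plain dicts; the helper name is hypothetical, and defaulting the details to 0 is an assumption that mirrors the earlier behavior described in this comment. The detail field names (`input_tokens_details.cached_tokens`, `output_tokens_details.reasoning_tokens`) are the ones the OpenAI Responses API reports.

```python
from typing import Any, Mapping, Optional


def normalize_responses_usage(
    raw_usage: Optional[Mapping[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: add zeroed token-detail objects when a provider
    returns only aggregate counts, without mutating the input."""
    if raw_usage is None:
        return None
    usage = dict(raw_usage)

    input_details = dict(usage.get("input_tokens_details") or {})
    input_details.setdefault("cached_tokens", 0)
    usage["input_tokens_details"] = input_details

    output_details = dict(usage.get("output_tokens_details") or {})
    output_details.setdefault("reasoning_tokens", 0)
    usage["output_tokens_details"] = output_details

    return usage
```

A caller would run this on `raw["usage"]` before handing `raw` to the typed model, so strict validation sees complete detail objects while provider-supplied values are left untouched.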


mikasenghaas (Member) left a comment

nice, i like the token breakdown. im not sure i like the cost breakdown just yet. afaiu, the current pi inference cost is not actually accurate bc we dont account for cached tokens? at least this made the cost of running benches so far CRAZY HIGH. maybe this is resolved on pi inference now? also, can we test this against some of the common apis, like oai/ant at least, maybe also deepseek/kimi/minimax etc.?
