Track cached, reasoning, and cost usage in v1 #1704
980e068 to 81e1fc7
Approvability
Verdict: Needs human review
This PR adds new token tracking fields and changes the semantic meaning of
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 980e06833e
    provider/SDK enum skew (e.g. a value the pinned `openai` rejects)."""

    model_config = ConfigDict(extra="allow")
    usage: ResponseUsage | None = None
Keep Responses usage parsing permissive
When an `openai_responses` endpoint returns a `usage` object with only aggregate counts such as `input_tokens`, `output_tokens`, and `total_tokens`, this typed field makes `OpenAIResponse.model_validate(raw)` in `EvalClient.get_response` validate against the SDK `ResponseUsage`, whose nested token-detail objects are required. The previous extra-allow/dict path accepted those responses and treated missing details as 0, but now the rollout fails before `parse_response`. Keep `usage` permissive, or normalize missing detail fields before validation.
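The second option in this suggestion (normalizing before validation) could look roughly like the sketch below. It is not code from the PR: `normalize_responses_usage` is a hypothetical helper name, and the sketch assumes the raw response is a plain dict whose `usage` uses the Responses API field names `input_tokens_details.cached_tokens` and `output_tokens_details.reasoning_tokens`.

```python
from __future__ import annotations

import copy
from typing import Any

# Nested detail objects and the counter each one is expected to carry.
_DETAIL_DEFAULTS = {
    "input_tokens_details": {"cached_tokens": 0},
    "output_tokens_details": {"reasoning_tokens": 0},
}


def normalize_responses_usage(raw: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `raw` whose usage carries both detail objects.

    Hypothetical helper: missing or null detail fields are filled with 0,
    values the provider did report are left untouched.
    """
    out = copy.deepcopy(raw)
    usage = out.get("usage")
    if not isinstance(usage, dict):
        return out  # no usage reported, nothing to fill in
    for key, defaults in _DETAIL_DEFAULTS.items():
        details = usage.get(key)
        if not isinstance(details, dict):
            details = {}
        # Reported values win over the zero defaults.
        usage[key] = {**defaults, **details}
    return out
```

Calling this on `raw` before `OpenAIResponse.model_validate(raw)` would let aggregate-only usage objects pass a strict nested schema while still reading as zero cached and zero reasoning tokens.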
mikasenghaas left a comment
nice, i like the token breakdown. i'm not sure i like the cost breakdown just yet. afaiu, the current pi inference cost is not actually accurate bc we don't account for cached tokens? at least this made the cost of running benches so far CRAZY HIGH. maybe this is resolved on pi inference now? also, can we test this against some of the common apis, like oai/ant at least, maybe also deepseek/kimi/minimax etc.?
Overview
Add provider-reported usage telemetry to v1 traces and surface it consistently across evaluation, training serialization, persistence, and the rich dashboard.
Details
`Usage` with cached input tokens, reasoning tokens, and provider-reported cost, with per-call aggregation across a rollout.
Note
Medium Risk
Changes how `prompt_tokens` is interpreted per provider and persists usage on traces, which can shift displayed totals and anything consuming stored usage, though scope is telemetry/display rather than core rollout logic.
Overview
Extends v1 usage telemetry so provider-reported cache reads, reasoning-token subsets, and optional USD cost flow from dialect parsers into traces, training serialization, and the eval rich dashboard.
`Usage` gains `cached_input_tokens`, `reasoning_tokens`, `cost`, plus `aggregate` and `input_tokens` so totals treat cached input as a disjoint bucket. OpenAI Chat, Responses, and Anthropic dialects map SDK usage details into that shape (notably `prompt_tokens` is uncached input for OpenAI; Anthropic `prompt_tokens` includes cache-creation tokens). Per-turn usage on `MessageNode` is serialized on wire/disk; `Trace.usage` / `Branch.usage` roll up per model call. The train client round-trips cache/reasoning into synthetic chat completions; the legacy bridge reads the new fields from v0 dicts. The eval dashboard shows cached, reasoning, and cost beside token counts.
Reviewed by Cursor Bugbot for commit 81e1fc7.
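To make the "disjoint bucket" accounting concrete, here is a minimal sketch of a usage record with the fields named above. It is an illustrative stand-in, not the PR's `Usage` class: the dataclass layout, the `completion_tokens` field name, and the rule that `cost` stays `None` unless at least one call reported it are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Usage:
    """Illustrative stand-in for the PR's Usage type (assumed layout)."""

    prompt_tokens: int = 0        # uncached input tokens
    completion_tokens: int = 0    # output tokens, reasoning included
    cached_input_tokens: int = 0  # cache reads, disjoint from prompt_tokens
    reasoning_tokens: int = 0     # subset of completion_tokens
    cost: float | None = None     # provider-reported USD, if any

    @property
    def input_tokens(self) -> int:
        # Cached and uncached input are separate buckets, so they add.
        return self.prompt_tokens + self.cached_input_tokens

    @property
    def total_tokens(self) -> int:
        # Reasoning tokens are already counted inside completion_tokens.
        return self.input_tokens + self.completion_tokens

    @classmethod
    def aggregate(cls, usages: Iterable[Usage]) -> Usage:
        """Sum per-call usage; cost is None if no call reported one."""
        usages = list(usages)
        costs = [u.cost for u in usages if u.cost is not None]
        return cls(
            prompt_tokens=sum(u.prompt_tokens for u in usages),
            completion_tokens=sum(u.completion_tokens for u in usages),
            cached_input_tokens=sum(u.cached_input_tokens for u in usages),
            reasoning_tokens=sum(u.reasoning_tokens for u in usages),
            cost=sum(costs) if costs else None,
        )
```

Under these assumptions, a rollout-level rollup such as `Trace.usage` would amount to calling `Usage.aggregate` over the usage of every model call in the trace.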
Note
Track cached, reasoning, and cost usage fields across v1 dialects, traces, and dashboard
- `Usage` type with `cached_input_tokens`, `reasoning_tokens`, `cost`, and an `aggregate` classmethod; `total_tokens` now includes cached input tokens.
- `Trace.usage` computed property that aggregates `Usage` across all model call nodes in the trace.
- `MessageNode.usage` is no longer excluded from serialization, so per-node usage is now retained in trace output.
- `prompt_tokens` in OpenAI Chat and Responses dialects now reflects uncached input tokens only; Anthropic `prompt_tokens` now includes cache-creation tokens.

Macroscope summarized 81e1fc7.
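The per-provider `prompt_tokens` semantics summarized above can be sketched as two small mapping functions over raw usage dicts. These are not the PR's dialect parsers: `from_openai_chat`, `from_anthropic`, and `_nested` are hypothetical names, the functions return plain dicts rather than the PR's `Usage` type, and the sketch assumes OpenAI's reported `prompt_tokens` includes cached tokens while Anthropic's `input_tokens` excludes both cache reads and cache creation. Leaving `reasoning_tokens` at 0 for Anthropic is likewise a choice of this sketch.

```python
from __future__ import annotations

from typing import Any


def _nested(usage: dict[str, Any], outer: str, inner: str) -> int:
    """Read usage[outer][inner], treating missing or null values as 0."""
    details = usage.get(outer) or {}
    return int(details.get(inner) or 0)


def from_openai_chat(usage: dict[str, Any]) -> dict[str, int]:
    """Map a Chat Completions usage dict (hypothetical helper)."""
    cached = _nested(usage, "prompt_tokens_details", "cached_tokens")
    return {
        # Subtract cache reads so prompt_tokens means uncached input only.
        "prompt_tokens": int(usage.get("prompt_tokens") or 0) - cached,
        "cached_input_tokens": cached,
        "completion_tokens": int(usage.get("completion_tokens") or 0),
        "reasoning_tokens": _nested(
            usage, "completion_tokens_details", "reasoning_tokens"
        ),
    }


def from_anthropic(usage: dict[str, Any]) -> dict[str, int]:
    """Map an Anthropic Messages usage dict (hypothetical helper)."""
    fresh = int(usage.get("input_tokens") or 0)
    created = int(usage.get("cache_creation_input_tokens") or 0)
    return {
        # Cache-creation tokens count as prompt tokens; cache reads do not.
        "prompt_tokens": fresh + created,
        "cached_input_tokens": int(usage.get("cache_read_input_tokens") or 0),
        "completion_tokens": int(usage.get("output_tokens") or 0),
        "reasoning_tokens": 0,  # assumption of this sketch
    }
```

Both functions yield the same keys, so downstream totals can add `prompt_tokens` and `cached_input_tokens` without double counting regardless of which provider produced the numbers.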