
Conversation

@lukeramsden (Contributor) commented Aug 30, 2025

Summary

This PR adds comprehensive cached input token tracking to the BAML Usage reporting system. The Collector and Usage now track and report cached input tokens alongside the existing input and output tokens.

Changes Made

  • Core Usage struct: Added cached_input_tokens: Option<i64> field to track cached tokens (see the struct sketch after this list)
  • LLM Provider Integration: Implemented cached token extraction for all supported providers:
    • Anthropic: Extracts from cache_read_input_tokens
    • OpenAI: Extracts from input_tokens_details.cached_tokens
    • Google/Vertex: Uses cached_content_token_count field
    • AWS Bedrock: Set to None (the SDK version BAML currently uses does not expose cached token counts, and upgrading is blocked by a dependency issue; see Cargo.toml)
  • Token Aggregation: Updated all token aggregation logic in Collector and FunctionLog to sum cached tokens
  • Language Bindings: Added cached token support to all client libraries:
    • TypeScript: usage.cachedInputTokens
    • Python: usage.cached_input_tokens
    • Go: usage.CachedInputTokens()
    • Ruby: usage.cached_input_tokens
  • RPC Integration: Updated RPC types and converters to include cached token data
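
As a point of reference, the core change can be pictured with the minimal sketch below. The field names follow the PR description, but the derives and any other fields on the real Usage struct in the BAML engine are assumptions, not the actual source.

// Rough sketch of the core Usage struct after this PR (illustrative, not BAML's code).
#[derive(Debug, Clone, Default)]
pub struct Usage {
    pub input_tokens: Option<i64>,
    pub output_tokens: Option<i64>,
    /// New in this PR: input tokens served from the provider's prompt cache.
    /// None when the provider (e.g. AWS Bedrock here) does not report them.
    pub cached_input_tokens: Option<i64>,
}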

Test Plan

  • Core library compilation verified
  • All provider response handlers updated with cached token extraction
  • Language binding interfaces expanded with cached token accessors
  • Token aggregation logic preserves cached token counts across multiple calls (see the summing sketch after this list)
  • RPC serialization includes cached token data
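
To illustrate the aggregation behaviour called out above, the sketch below sums cached tokens across several calls' Usage values (using the struct sketched earlier) while keeping Option semantics: the total stays None only when no call reported cached tokens. This is an illustration of the intent, not the actual code in Collector or FunctionLog.

// Sketch: Option-aware sum of cached tokens across multiple calls.
fn sum_cached_tokens<'a>(usages: impl IntoIterator<Item = &'a Usage>) -> Option<i64> {
    usages
        .into_iter()
        .filter_map(|u| u.cached_input_tokens)
        .fold(None, |acc, n| Some(acc.unwrap_or(0) + n))
}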

Technical Notes

  • Cached tokens are tracked separately from input/output tokens for better cost analysis (see the cost sketch after this list)
  • Provider-specific token extraction handles cases where cached token data is unavailable
  • All changes are backward compatible with existing Usage API
  • Language bindings maintain consistent naming conventions across all supported languages
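
To make the cost-analysis point concrete, here is a hedged sketch built on the Usage sketch above. The estimate_cost helper, the hypothetical rates, and the assumption that input_tokens already includes the cached portion are all illustrative and not part of BAML.

// Hypothetical cost estimate: cached input tokens are typically billed at a
// discount, so tracking them separately lets callers price calls more accurately.
fn estimate_cost(usage: &Usage, input_rate: f64, cached_rate: f64, output_rate: f64) -> f64 {
    let cached = usage.cached_input_tokens.unwrap_or(0);
    // Assumption: input_tokens includes the cached portion, so subtract it out.
    let uncached = usage.input_tokens.unwrap_or(0).saturating_sub(cached);
    let output = usage.output_tokens.unwrap_or(0);
    uncached as f64 * input_rate + cached as f64 * cached_rate + output as f64 * output_rate
}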

Closes #2349


Important

Add cached input token tracking to BAML Usage reporting, updating core structures, provider integrations, token aggregation, language bindings, and tests.

  • Behavior:
    • Added cached_input_tokens field to LLMUsage in events.rs, trace_event.rs, and mod.rs to track cached tokens.
    • Implemented cached token extraction for providers: Anthropic (from cache_read_input_tokens), OpenAI (from input_tokens_details.cached_tokens), Google/Vertex (from cached_content_token_count), and AWS Bedrock (set to None).
    • Updated token aggregation logic in storage.rs and llm_response_to_log_event.rs to sum cached tokens.
  • Language Bindings:
    • Added cached token support to TypeScript (native.d.ts), Python (log_collector.rs), Go (rawobjects_public.go), and Ruby (log_collector.rs).
  • RPC Integration:
    • Updated RPC types and converters in trace_data.rs to include cached token data.
  • Tests:
    • Added tests in test_collector.py and collector.test.ts to verify cached token tracking for various providers and scenarios.

This description was created by Ellipsis for 8fa77ed.

- Add `cached_input_tokens` field to core Usage struct
- Update LLMUsage and LLMCompleteResponseMetadata to include cached tokens
- Implement token extraction for all LLM providers:
  - Anthropic: Sum of cache_read_input_tokens + cache_creation_input_tokens (see the extraction sketch after this list)
  - OpenAI: Extract from input_tokens_details.cached_tokens
  - Google/Vertex: Use cached_content_token_count field
  - AWS: Set to None (no cached token support)
- Update token aggregation logic in Collector and FunctionLog
- Add cached token support to all language bindings:
  - TypeScript: Add cachedInputTokens getter
  - Python: Add cached_input_tokens property
  - Go: Add CachedInputTokens() method
  - Ruby: Add cached_input_tokens method
- Update RPC types and converters for cached token reporting
- Add some integration tests for cached token tracking
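
Because the commit notes name the exact Anthropic fields, the extraction can be sketched as below against a deserialized usage JSON object. The real handler works on BAML's typed response structs, so the serde_json-based shape here is an assumption for illustration only.

use serde_json::Value;

// Sketch: derive cached input tokens from an Anthropic response's "usage" object
// by summing cache_read_input_tokens and cache_creation_input_tokens. Returning
// None when neither field is present keeps "unknown" distinct from zero.
fn anthropic_cached_input_tokens(usage: &Value) -> Option<i64> {
    let read = usage.get("cache_read_input_tokens").and_then(Value::as_i64);
    let created = usage.get("cache_creation_input_tokens").and_then(Value::as_i64);
    match (read, created) {
        (None, None) => None,
        (r, c) => Some(r.unwrap_or(0) + c.unwrap_or(0)),
    }
}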

vercel bot commented Aug 30, 2025

@lukeramsden is attempting to deploy a commit to the Boundary Team on Vercel.

A member of the Team first needs to authorize it.

@ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed everything up to 8fa77ed in 2 minutes and 9 seconds.
  • Reviewed 1233 lines of code in 26 files
  • Skipped 0 files when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. integ-tests/typescript/tests/providers/caching.test.ts:4
  • Draft comment:
    Good integration tests covering various caching scenarios for each provider. The tests verify that for providers such as OpenAI and Gemini, the 'cachedInputTokens' field is defined and aggregated correctly, while for AWS it remains null. This meets the requirement to track cached input tokens.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. integ-tests/typescript/tests/providers/caching.test.ts:17
  • Draft comment:
    Using expect(...).toBeDefined() for cached tokens is appropriate. Consider also verifying that if a provider supports caching, the value is a number rather than just defined.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%. The comment misunderstands the API - cachedInputTokens is intentionally nullable, not always a number. The current test is actually more correct as it allows for both valid states (number when cached, null when not cached). The suggestion would make the test more restrictive than the actual API contract. Could there be value in having separate test cases - one for verifying the field exists, and another for type checking when we know caching occurred? While separate test cases could be valuable, that would be a larger refactoring. The current approach correctly tests the API contract where the field can be null. The comment should be deleted as it suggests a change that would make the tests incorrect by not allowing valid null values.
3. integ-tests/typescript/tests/providers/caching.test.ts:37
  • Draft comment:
    The test calculates totalCachedTokens using a reduce. Good approach to verify that the collector aggregates cached tokens correctly. Ensure that using the nullish operator (|| 0) is consistent with the intended type.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 50% <= threshold 50%. The comment starts with a purely informative statement about the test using a reduce function, which is not allowed. However, it ends with a suggestion to ensure that the use of the nullish operator is consistent with the intended type, which is a valid comment as it asks for confirmation on a specific implementation detail.
4. integ-tests/typescript/tests/providers/caching.test.ts:74
  • Draft comment:
    The tests for repeated large content for both OpenAI and Gemini are comprehensive. One suggestion is to encapsulate repeated logic for making calls and verifying caching into helper functions to reduce duplication and improve maintainability.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50% None
5. integ-tests/typescript/tests/providers/caching.test.ts:115
  • Draft comment:
    The test for Google/Vertex caching correctly checks that the cached tokens field is defined and that the collector-level usage matches. It confirms that caching is provider-specific.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
6. integ-tests/typescript/tests/providers/caching.test.ts:124
  • Draft comment:
    The test for Vertex AI streaming with cached token tracking uses async streaming and validates that the log type is 'stream'. It is good that you verify that cached tokens are captured even during streaming. Consider also asserting that the stream's chunks are non-empty if appropriate.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. The comment is suggesting additional validation that's not central to the test's purpose. The test is specifically focused on cached token tracking during streaming operations. The final response length check is already sufficient for ensuring the stream produced content. Adding chunk-level validation would be redundant and distract from the main purpose. Perhaps checking chunk content could catch subtle streaming-specific issues that might be missed by only checking the final response. The test's purpose is to verify cached token tracking works with streaming, not to validate streaming functionality itself. The final response check is sufficient as a sanity check that streaming worked. Delete the comment. It suggests adding validation that's not relevant to the test's core purpose of verifying cached token tracking during streaming operations.
7. integ-tests/typescript/tests/providers/caching.test.ts:204
  • Draft comment:
    The mixed providers test is solid, ensuring that when providers with caching (OpenAI, Gemini) are mixed with one without caching (AWS), the aggregated cached tokens equal the sum from providers with caching only. This confirms that the aggregation logic in usage reporting is working as intended.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_Aw1igTGWk7ymd8uD

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


🔒 Entelligence AI Vulnerability Scanner

No security vulnerabilities found!

Your code passed our comprehensive security analysis.

📊 Files Analyzed: 3 files


Comment on lines 405 to 417

 pub fn __repr__(&self) -> String {
     format!(
-        "Usage(input_tokens={}, output_tokens={})",
+        "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
         self.inner
             .input_tokens
             .map_or_else(|| "None".to_string(), |v| v.to_string()),
         self.inner
             .output_tokens
             .map_or_else(|| "None".to_string(), |v| v.to_string()),
         self.inner
             .cached_input_tokens
             .map_or_else(|| "None".to_string(), |v| v.to_string())
     )


correctness: __repr__ method for Usage will print cached_input_tokens as the third argument, but if self.inner.input_tokens or self.inner.output_tokens is None, the output will be misaligned and misleading.

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

In engine/language_client_python/src/types/log_collector.rs, lines 405-417, the __repr__ method for Usage prints input_tokens, output_tokens, and cached_input_tokens. However, if any of these are None, the output may be misleading or misaligned. Please ensure that each field is printed in the correct order and that None values are clearly represented, so the output always matches the format Usage(input_tokens=..., output_tokens=..., cached_input_tokens=...).
📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

-pub fn __repr__(&self) -> String {
-    format!(
-        "Usage(input_tokens={}, output_tokens={})",
-        "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
-        self.inner
-            .input_tokens
-            .map_or_else(|| "None".to_string(), |v| v.to_string()),
-        self.inner
-            .output_tokens
-            .map_or_else(|| "None".to_string(), |v| v.to_string()),
-        self.inner
-            .cached_input_tokens
-            .map_or_else(|| "None".to_string(), |v| v.to_string())
-    )
+format!(
+    "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
+    self.inner
+        .input_tokens
+        .map_or_else(|| "None".to_string(), |v| v.to_string()),
+    self.inner
+        .output_tokens
+        .map_or_else(|| "None".to_string(), |v| v.to_string()),
+    self.inner
+        .cached_input_tokens
+        .map_or_else(|| "None".to_string(), |v| v.to_string())
+)

Comment on lines 384 to 396

 pub fn to_s(&self) -> String {
     format!(
-        "Usage(input_tokens={}, output_tokens={})",
+        "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
         self.inner
             .input_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string()),
         self.inner
             .output_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string()),
         self.inner
             .cached_input_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string())
     )


correctness: to_s method will panic at runtime if self.inner is None, as it assumes self.inner is always present.

🤖 AI Agent Prompt for Cursor/Windsurf

📋 Copy this prompt to your AI coding assistant (Cursor, Windsurf, etc.) to get help fixing this issue

In engine/language_client_ruby/ext/ruby_ffi/src/types/log_collector.rs, lines 384-396, the `to_s` method assumes `self.inner` is always present, which can cause a panic if it is None. Update the method to safely handle the case where `self.inner` is None, returning a string with all fields as 'null' in that case.
📝 Committable Code Suggestion

‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Suggested change

-pub fn to_s(&self) -> String {
-    format!(
-        "Usage(input_tokens={}, output_tokens={})",
-        "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
-        self.inner
-            .input_tokens
-            .map_or_else(|| "null".to_string(), |v| v.to_string()),
-        self.inner
-            .output_tokens
-            .map_or_else(|| "null".to_string(), |v| v.to_string()),
-        self.inner
-            .cached_input_tokens
-            .map_or_else(|| "null".to_string(), |v| v.to_string())
-    )
+pub fn to_s(&self) -> String {
+    match &self.inner {
+        Some(inner) => format!(
+            "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
+            inner.input_tokens.map_or_else(|| "null".to_string(), |v| v.to_string()),
+            inner.output_tokens.map_or_else(|| "null".to_string(), |v| v.to_string()),
+            inner.cached_input_tokens.map_or_else(|| "null".to_string(), |v| v.to_string())
+        ),
+        None => "Usage(input_tokens=null, output_tokens=null, cached_input_tokens=null)".to_string(),
+    }
+}

Comment on lines 307 to 319

 pub fn to_string(&self) -> String {
     format!(
-        "Usage(input_tokens={}, output_tokens={})",
+        "Usage(input_tokens={}, output_tokens={}, cached_input_tokens={})",
         self.inner
             .input_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string()),
         self.inner
             .output_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string()),
         self.inner
             .cached_input_tokens
             .map_or_else(|| "null".to_string(), |v| v.to_string())
     )


performance: to_string method uses chained .map_or_else for each token field, but this is not a significant performance issue given the small, fixed number of fields and negligible runtime impact.

@aaronvg (Contributor) commented Sep 4, 2025

checking this now

Successfully merging this pull request may close these issues: [feat] Cached tokens reported in Usage