Conversation

@zhao-oai
Collaborator

No description provided.

@zhao-oai
Collaborator Author

@codex review please

@chatgpt-codex-connector
Contributor

Codex Review: Didn't find any major issues. 🚀

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@zhao-oai
Collaborator Author

@codex what are the performance implications of this PR

@chatgpt-codex-connector
Contributor

Note

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@zhao-oai
Collaborator Author

@codex what are the performance implications of this PR

@chatgpt-codex-connector
Contributor

Note

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@zhao-oai
Collaborator Author

@codex hi

@chatgpt-codex-connector
Contributor

Note

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@zhao-oai
Collaborator Author

@codex what are the performance implications of this PR

@chatgpt-codex-connector
Contributor

Note

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@zhao-oai
Collaborator Author

@codex hello

@chatgpt-codex-connector
Contributor

Note

You have reached your Codex usage limits. You can see your limits in the Codex usage dashboard.

@zhao-oai
Collaborator Author

@codex review

@chatgpt-codex-connector
Contributor

Codex Review: Didn't find any major issues. Another round soon, please!

@zhao-oai
Collaborator Author

@codex what are the performance implications of this PR

@chatgpt-codex-connector
Contributor

Key locations

Performance analysis

Request setup

  • Every call now builds a tokenizer (Tokenizer::for_model) and walks the entire messages payload to approximate prompt-token counts before the request is submitted (codex-rs/core/src/chat_completions.rs:47-87 and :121-128).
    • tiktoken_rs::get_bpe_from_model is reasonably fast when the model table is cached, but constructing a tokenizer still allocates and can add several hundred microseconds of CPU per request. On cold paths, or when falling back to Tokenizer::try_default, the cost climbs because the BPE tables must be loaded.
    • The pre-scan token counting is O(total characters in the serialized prompt). Large conversation histories incur proportional CPU time because each string is re-tokenized purely for accounting; this is extra work beyond the existing JSON assembly. A rough sketch follows this list.
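
For concreteness, here is a minimal sketch of what such a pre-request estimation pass could look like, built directly on tiktoken_rs. The ChatMessage shape and the estimate_prompt_tokens name are assumptions for illustration, not the PR's actual code:

```rust
// Illustrative sketch only: approximate prompt tokens before the request is
// sent. The message shape and function name are hypothetical.
use tiktoken_rs::get_bpe_from_model;

struct ChatMessage {
    role: String,
    content: String,
}

fn estimate_prompt_tokens(model: &str, messages: &[ChatMessage]) -> Option<usize> {
    // Building the BPE is the per-request setup cost noted above; it allocates
    // even when the underlying model table is cached.
    let bpe = get_bpe_from_model(model).ok()?;

    // O(total characters): each string is re-encoded purely for accounting,
    // on top of the JSON assembly the client already performs.
    let mut total = 0usize;
    for msg in messages {
        total += bpe.encode_with_special_tokens(&msg.role).len();
        total += bpe.encode_with_special_tokens(&msg.content).len();
    }
    Some(total)
}
```

The cost scales with the byte length of the role and content strings, which is exactly the O(total characters) behavior called out above.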

Streaming path

  • While consuming the SSE stream, every emitted assistant chunk, reasoning chunk, tool-call identifier, name, and argument fragment is re-tokenized to update the heuristic counters (codex-rs/core/src/chat_completions.rs:592-712); see the sketch after this list.
    • This adds a Tokenizer::count call per delta. For tightly chunked providers (dozens or hundreds of deltas per turn) the additional CPU can be noticeable; each call runs a full BPE encode on the fragment even though the text is already being buffered for output.
    • Because the heuristic is skipped only when tokenizer construction fails (rare in practice), most streams will pay this per-delta cost.
  • The stream still performs the same number of allocations for downstream events; the only extra work at completion time is cloning the computed usage into the Completed event, which is negligible compared with the tokenization work (codex-rs/core/src/chat_completions.rs:525-779).
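
A comparable sketch of the per-delta accounting on the streaming side; StreamTokenCounter and on_delta are hypothetical names standing in for the PR's heuristic, which the review describes as one Tokenizer::count call per fragment:

```rust
// Illustrative sketch only: count output tokens as SSE deltas arrive.
// The type and method names are hypothetical.
use std::sync::Arc;
use tiktoken_rs::CoreBPE;

struct StreamTokenCounter {
    tokenizer: Arc<CoreBPE>,
    completion_tokens: i64,
}

impl StreamTokenCounter {
    // Called for every assistant, reasoning, or tool-call fragment; the full
    // BPE encode per delta is where the extra streaming CPU goes.
    fn on_delta(&mut self, fragment: &str) {
        self.completion_tokens +=
            self.tokenizer.encode_with_special_tokens(fragment).len() as i64;
    }
}
```

With dozens or hundreds of deltas per turn, each carrying only a few tokens, the fixed overhead of one encode call per fragment dominates, which is why tightly chunked providers feel this most.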

Memory & concurrency

  • The heuristic holds an Arc<Tokenizer> plus three i64 counters, so the steady-state memory footprint per stream grows by only a few dozen bytes (codex-rs/core/src/chat_completions.rs:39-105); see the sizing sketch below. No additional concurrency primitives were introduced, so there is no locking contention; the only cost is extra CPU time on the task that handles the stream.
  • The added unit test runs only in CI and does not affect production performance, but it does validate that the Completed event now carries a usage estimate (codex-rs/core/tests/chat_completions_sse.rs:188-229).
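
A back-of-the-envelope check of that footprint; which three counters the PR keeps is an assumption here, and the field names are illustrative:

```rust
// Illustrative sketch only: the per-stream state is roughly one Arc pointer
// plus three signed 64-bit counters.
use std::sync::Arc;
use tiktoken_rs::CoreBPE;

struct UsageHeuristic {
    tokenizer: Arc<CoreBPE>, // 8 bytes: a shared pointer, not a copy of the BPE tables
    prompt_tokens: i64,      // 8 bytes
    completion_tokens: i64,  // 8 bytes
    reasoning_tokens: i64,   // 8 bytes
}

fn main() {
    // Prints 32 on a 64-bit target: a few dozen bytes of steady-state memory
    // per stream, in line with the estimate above.
    println!("{} bytes", std::mem::size_of::<UsageHeuristic>());
}
```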

Net effect

  • Costs: Extra CPU both before the request (prompt re-tokenization) and during streaming (per-delta BPE counting). These costs scale with prompt size and the number of streamed chunks, so high-traffic deployments should expect a modest increase in CPU utilization.
  • Benefits: Downstream consumers receive an approximate TokenUsage without waiting for provider-side accounting, which can eliminate separate follow-up calls or heuristics elsewhere in the stack, potentially offsetting work in other services (codex-rs/core/src/chat_completions.rs:525-779).
  • Overall: The change trades additional CPU on the chat-streaming task for richer metadata in the Completed event. Monitor throughput-critical nodes to ensure the added tokenization does not push them over latency or utilization budgets.

View task →
