docs(streaming): add section on token usage tracking #1282

Merged · 8 commits · Jul 29, 2025

48 changes: 48 additions & 0 deletions docs/user-guides/advanced/streaming.md
@@ -73,6 +73,54 @@ print(result)

For the complete working example, check out this [demo script](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/scripts/demo_streaming.py).

## Token Usage Tracking

When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting the `stream_usage` parameter to `True` for the underlying LLM model. This feature:

- Provides token usage statistics even when streaming responses.
- Is supported primarily by OpenAI, AzureOpenAI, and other compatible providers; the NVIDIA NIM provider supports it by default.
- Passes the `stream_usage` parameter safely to the LLM provider; if the provider you use doesn't support it, the parameter is ignored (see the sketch below).
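
Under the hood, this is equivalent to setting the flag on the LangChain chat model yourself. The following is a minimal sketch with `langchain-openai` (the model name is illustrative, and `OPENAI_API_KEY` is assumed to be set in the environment):

```python
from langchain_openai import ChatOpenAI

# stream_usage=True asks the provider to include usage metadata in the stream;
# this is the flag NeMo Guardrails sets automatically when streaming is enabled.
llm = ChatOpenAI(model="gpt-4o-mini", stream_usage=True)

aggregate = None
for chunk in llm.stream("Hello!"):
    # Message chunks can be summed; the aggregate carries the usage metadata.
    aggregate = chunk if aggregate is None else aggregate + chunk

print(aggregate.usage_metadata)
# e.g. {'input_tokens': 9, 'output_tokens': 10, 'total_tokens': 19}
```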

### Version Requirements

For optimal token usage tracking with streaming, ensure you're using recent versions of the LangChain packages (a quick version check is sketched below):

- `langchain-openai >= 0.1.0` for basic streaming token support (minimum requirement)
- `langchain-openai >= 0.2.0` for enhanced features and stability
- `langchain >= 0.2.14` and `langchain-core >= 0.2.14` for universal token counting support

```{note}
The NeMo Guardrails toolkit requires `langchain-openai >= 0.1.0` as an optional dependency, which provides basic streaming token usage support. For enhanced features and stability, consider upgrading to `langchain-openai >= 0.2.0` in your environment.
```
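
To check which versions are installed in your environment, a quick sketch using the Python standard library:

```python
from importlib.metadata import version

# Print the installed versions of the relevant packages (assumes they are installed).
for pkg in ("langchain", "langchain-core", "langchain-openai"):
    print(f"{pkg}: {version(pkg)}")
```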

### Accessing Token Usage Information

You can access token usage statistics through the detailed logging capabilities of the NeMo Guardrails toolkit. Use the `log` generation option to capture comprehensive information about LLM calls, including token usage:

```python
response = rails.generate(messages=messages, options={
"log": {
"llm_calls": True,
"activated_rails": True
}
})

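# Each entry in response.log.llm_calls corresponds to one LLM call made during generation.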
for llm_call in response.log.llm_calls:
print(f"Task: {llm_call.task}")
print(f"Total tokens: {llm_call.total_tokens}")
print(f"Prompt tokens: {llm_call.prompt_tokens}")
print(f"Completion tokens: {llm_call.completion_tokens}")
```

Alternatively, you can use the `explain()` method to get a summary of token usage:

```python
info = rails.explain()
info.print_llm_calls_summary()
```

For more information about streaming token usage support across different providers, refer to the [LangChain documentation on token usage tracking](https://python.langchain.com/docs/how_to/chat_token_usage_tracking/#streaming). For detailed information about accessing generation logs and token usage, see the [Generation Options](generation-options.md#detailed-logging-information) and [Detailed Logging](../detailed-logging/README.md) documentation.

### Server API

To make a call to the NeMo Guardrails Server in streaming mode, set the `stream` parameter to `True` inside the JSON body. For example, to get the completion for a chat session using the `/v1/chat/completions` endpoint:
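
A minimal sketch of such a request using `requests`, assuming a local server on port 8000 and a guardrails configuration with id `config` (both illustrative):

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",  # illustrative local server URL
    json={
        "config_id": "config",  # illustrative configuration id
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

# The completion is streamed back chunk by chunk.
for line in response.iter_lines():
    if line:
        print(line.decode())
```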