
Commit 7f0f46e

docs(streaming): add section on token usage tracking (#1282)


docs/user-guides/advanced/streaming.md

Lines changed: 48 additions & 0 deletions
@@ -141,6 +141,54 @@ async for chunk in app.stream_async(

This feature enables seamless integration of NeMo Guardrails with any streaming LLM or token source while maintaining all the safety features of output rails.

## Token Usage Tracking

When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting the `stream_usage` parameter to `True` on the underlying LLM model. This feature:

- Provides token usage statistics even when streaming responses.
- Is supported primarily by OpenAI, AzureOpenAI, and similar providers; the NVIDIA NIM provider supports it by default.
- Is safe to pass to any LLM provider: if the provider you use doesn't support it, the parameter is ignored (see the sketch after this list).
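To illustrate what this parameter does at the LangChain level, here is a minimal sketch that sets `stream_usage=True` directly on a LangChain chat model (the model name is illustrative and an `OPENAI_API_KEY` is assumed; NeMo Guardrails performs the equivalent setup for you when streaming is enabled):

```python
from langchain_openai import ChatOpenAI

# stream_usage=True asks the provider to attach token counts to the stream.
llm = ChatOpenAI(model="gpt-4o-mini", stream_usage=True)

for chunk in llm.stream("Write one sentence about guardrails."):
    # With stream_usage enabled, the final chunk carries usage_metadata;
    # providers that do not support the parameter simply ignore it.
    if chunk.usage_metadata is not None:
        print(chunk.usage_metadata)
```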
### Version Requirements

For optimal token usage tracking with streaming, ensure you're using recent versions of LangChain packages:

- `langchain-openai >= 0.1.0` for basic streaming token support (minimum requirement)
- `langchain-openai >= 0.2.0` for enhanced features and stability
- `langchain >= 0.2.14` and `langchain-core >= 0.2.14` for universal token counting support

```{note}
The NeMo Guardrails toolkit requires `langchain-openai >= 0.1.0` as an optional dependency, which provides basic streaming token usage support. For enhanced features and stability, consider upgrading to `langchain-openai >= 0.2.0` in your environment.
```
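If you are unsure which versions are installed in your environment, a quick check with Python's `importlib.metadata` (a convenience sketch, not part of the toolkit) looks like this:

```python
from importlib.metadata import PackageNotFoundError, version

# The packages and minimum versions listed above.
for pkg in ("langchain-openai", "langchain", "langchain-core"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```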
### Accessing Token Usage Information

You can access token usage statistics through the detailed logging capabilities of the NeMo Guardrails toolkit. Use the `log` generation option to capture comprehensive information about LLM calls, including token usage:

```python
response = rails.generate(messages=messages, options={
    "log": {
        "llm_calls": True,
        "activated_rails": True
    }
})

# Each entry in llm_calls describes one LLM invocation made during generation.
for llm_call in response.log.llm_calls:
    print(f"Task: {llm_call.task}")
    print(f"Total tokens: {llm_call.total_tokens}")
    print(f"Prompt tokens: {llm_call.prompt_tokens}")
    print(f"Completion tokens: {llm_call.completion_tokens}")
```
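If you need a single figure for the whole response, you can aggregate the per-call counts shown above (a small sketch; missing counts are treated as zero):

```python
total = sum(call.total_tokens or 0 for call in response.log.llm_calls)
print(f"Total tokens across all LLM calls: {total}")
```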
Alternatively, you can use the `explain()` method to get a summary of token usage:

```python
info = rails.explain()
info.print_llm_calls_summary()
```

For more information about streaming token usage support across different providers, refer to the [LangChain documentation on token usage tracking](https://python.langchain.com/docs/how_to/chat_token_usage_tracking/#streaming). For detailed information about accessing generation logs and token usage, see the [Generation Options](generation-options.md#detailed-logging-information) and [Detailed Logging](../detailed-logging/README.md) documentation.
### Server API

To make a call to the NeMo Guardrails Server in streaming mode, you have to set the `stream` parameter to `True` inside the JSON body. For example, to get the completion for a chat session using the `/v1/chat/completions` endpoint:
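A minimal sketch of such a request using Python's `requests` library follows; the server address and the `my_config` configuration name are assumptions, so adjust them to your deployment:

```python
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "config_id": "my_config",  # assumed name of your guardrails configuration
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

# The server streams chunks of the completion as they are generated.
for line in response.iter_lines():
    if line:
        print(line.decode())
```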
