This feature enables seamless integration of NeMo Guardrails with any streaming LLM or token source while maintaining all the safety features of output rails.
## Token Usage Tracking
When streaming is enabled, NeMo Guardrails automatically enables token usage tracking by setting the `stream_usage` parameter to `True` for the underlying LLM model. This feature:
- Provides token usage statistics even when streaming responses.
- Is primarily supported by OpenAI, AzureOpenAI, and other providers. The NVIDIA NIM provider supports it by default.
- Passes the `stream_usage` parameter safely to the LLM provider; if the provider you use does not support it, the parameter is ignored (a sketch of the equivalent LangChain-level behavior follows this list).
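
For reference, the snippet below is a rough sketch of what this corresponds to at the LangChain level: the chat model is created with `stream_usage=True`, so streamed chunks carry usage metadata. The model name is a placeholder, and the exact fields in `usage_metadata` depend on your provider and LangChain version.

```python
from langchain_openai import ChatOpenAI

# Roughly what NeMo Guardrails configures for you when streaming is enabled:
# the chat model is asked to include usage metadata in streamed chunks.
llm = ChatOpenAI(model="gpt-4o", stream_usage=True)  # model name is a placeholder

aggregate = None
for chunk in llm.stream("Hello!"):
    # Message chunks can be added together; the aggregate keeps the combined
    # content and, with stream_usage=True, the accumulated token counts.
    aggregate = chunk if aggregate is None else aggregate + chunk

print(aggregate.usage_metadata)  # e.g. input_tokens, output_tokens, total_tokens
```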
### Version Requirements
For optimal token usage tracking with streaming, ensure you're using recent versions of LangChain packages:
- `langchain-openai >= 0.1.0` for basic streaming token support (minimum requirement)
- `langchain-openai >= 0.2.0` for enhanced features and stability
- `langchain >= 0.2.14` and `langchain-core >= 0.2.14` for universal token counting support
```{note}
The NeMo Guardrails toolkit requires `langchain-openai >= 0.1.0` as an optional dependency, which provides basic streaming token usage support. For enhanced features and stability, consider upgrading to `langchain-openai >= 0.2.0` in your environment.
```
### Accessing Token Usage Information
You can access token usage statistics through the detailed logging capabilities of the NeMo Guardrails toolkit. Use the `log` generation option to capture comprehensive information about LLM calls, including token usage:
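
The sketch below assumes a `rails` instance created from your own configuration directory; the field names on the logged LLM calls (such as `total_tokens`) may vary slightly across toolkit versions.

```python
from nemoguardrails import LLMRails, RailsConfig

config = RailsConfig.from_path("./config")  # path is a placeholder
rails = LLMRails(config)

# Request detailed logs, including per-call LLM statistics.
response = rails.generate(
    messages=[{"role": "user", "content": "Hello!"}],
    options={"log": {"llm_calls": True}},
)

# Each logged LLM call carries token usage statistics.
for llm_call in response.log.llm_calls:
    print(llm_call.task, llm_call.prompt_tokens, llm_call.completion_tokens, llm_call.total_tokens)
```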
Alternatively, you can use the `explain()` method to get a summary of token usage:
```python
info = rails.explain()
info.print_llm_calls_summary()
```
For more information about streaming token usage support across different providers, refer to the [LangChain documentation on token usage tracking](https://python.langchain.com/docs/how_to/chat_token_usage_tracking/#streaming). For detailed information about accessing generation logs and token usage, see the [Generation Options](generation-options.md#detailed-logging-information) and [Detailed Logging](../detailed-logging/README.md) documentation.
### Server API
To make a call to the NeMo Guardrails Server in streaming mode, you have to set the `stream` parameter to `True` inside the JSON body. For example, to get the completion for a chat session using the `/v1/chat/completions` endpoint:
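
As a rough sketch of such a request (the server URL and `config_id` below are placeholders for your own deployment, and the exact shape of the streamed chunks depends on the server version):

```python
import requests

# Placeholder URL and config_id; adjust for your own NeMo Guardrails Server deployment.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "config_id": "my_config",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

# Print the streamed completion chunks as they arrive.
for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
    print(chunk, end="", flush=True)
```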