feat(streaming): support external async token generators #1286
Base branch: `develop`
@@ -71,7 +71,73 @@ result = await app.generate_async(
print(result)
```

For the complete working example, check out this [demo script](https://github.com/NVIDIA/NeMo-Guardrails/tree/develop/examples/scripts/demo_streaming.py).
### Using External Token Generators

You can also provide your own async generator that yields tokens, which is useful when:

- You want to use a different LLM provider that has its own streaming API
- You have pre-generated responses that you want to stream through guardrails
- You want to implement custom token generation logic
- You want to test your output rails or their configuration in streaming mode without relying on an LLM, which produces stochastic outputs (a test-style sketch appears later in this section)
To use an external generator, pass it to the `generator` parameter of `stream_async`:

```python
from typing import AsyncIterator

from nemoguardrails import LLMRails

# `config` is assumed to be a `RailsConfig` loaded earlier in this guide
app = LLMRails(config)


async def my_token_generator() -> AsyncIterator[str]:
    # this could be from OpenAI, Anthropic, or any other source
    tokens = ["Hello", " ", "world", "!"]
    for token in tokens:
        yield token


# use the external generator with guardrails;
# `history` is the list of chat messages, e.g. [{"role": "user", "content": "Hi"}]
async for chunk in app.stream_async(
    messages=history,
    generator=my_token_generator()
):
    print(f"CHUNK: {chunk}")
```

When using an external generator:

- The internal LLM generation is completely bypassed
- Output rails are still applied if configured
- The generator should yield string tokens
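
Because the token source is fully deterministic, an external generator makes it straightforward to exercise streaming output rails in tests (the testing use case listed above). The following is only an illustrative sketch, not part of this PR: it assumes `pytest` with `pytest-asyncio` installed and a hypothetical `config/with_output_rails` directory that enables streaming output rails.

```python
import pytest
from typing import AsyncIterator

from nemoguardrails import LLMRails, RailsConfig


async def fixed_tokens() -> AsyncIterator[str]:
    # deterministic token stream; no LLM call is involved
    for token in ["This ", "is ", "a ", "fixed ", "response."]:
        yield token


@pytest.mark.asyncio
async def test_output_rails_on_external_stream():
    # hypothetical config directory with streaming output rails enabled
    config = RailsConfig.from_path("config/with_output_rails")
    app = LLMRails(config)

    chunks = []
    async for chunk in app.stream_async(
        messages=[{"role": "user", "content": "Hi"}],
        generator=fixed_tokens(),
    ):
        chunks.append(chunk)

    # something should have been streamed back through the output rails
    assert "".join(chunks)
```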

Example with a real LLM API:

```python
from typing import AsyncIterator

from openai import AsyncOpenAI

from nemoguardrails import LLMRails, RailsConfig


async def openai_streaming_generator(messages) -> AsyncIterator[str]:
    """Example using OpenAI's streaming API (openai>=1.0 async client)."""
    client = AsyncOpenAI()

    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        stream=True,
    )

    # Yield tokens as they arrive
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content


config = RailsConfig.from_path("config/with_output_rails")
app = LLMRails(config)

messages = [{"role": "user", "content": "Tell me a story"}]

async for chunk in app.stream_async(
    messages=messages,
    generator=openai_streaming_generator(messages)
):
    # output rails will be applied to these chunks
    print(chunk, end="", flush=True)
```

This feature enables seamless integration of NeMo Guardrails with any streaming LLM or token source while maintaining all the safety features of output rails.
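
For instance, the pre-generated response use case listed earlier can be handled with a small wrapper generator. This is a minimal sketch only: the chunk size, the `config/with_output_rails` path, and the cached answer are illustrative assumptions, not part of this PR.

```python
import asyncio
from typing import AsyncIterator

from nemoguardrails import LLMRails, RailsConfig


async def replay_response(text: str, chunk_size: int = 8) -> AsyncIterator[str]:
    # stream a pre-generated response in small string chunks
    for i in range(0, len(text), chunk_size):
        yield text[i:i + chunk_size]


async def main():
    # hypothetical config directory with streaming output rails enabled
    config = RailsConfig.from_path("config/with_output_rails")
    app = LLMRails(config)

    cached_answer = "NeMo Guardrails adds programmable rails around LLM applications."
    async for chunk in app.stream_async(
        messages=[{"role": "user", "content": "What is NeMo Guardrails?"}],
        generator=replay_response(cached_answer),
    ):
        print(chunk, end="", flush=True)


asyncio.run(main())
```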

### Server API
@@ -1053,8 +1053,21 @@ def stream_async(
    options: Optional[Union[dict, GenerationOptions]] = None,
    state: Optional[Union[dict, State]] = None,
    include_generation_metadata: Optional[bool] = False,
    generator: Optional[AsyncIterator[str]] = None,
) -> AsyncIterator[str]:
    """Simplified interface for getting directly the streamed tokens from the LLM."""

    # if an external generator is provided, use it directly
    if generator:
        if self.config.rails.output.streaming.enabled:
            return self._run_output_rails_in_streaming(
                streaming_handler=generator,
                messages=messages,
                prompt=prompt,
            )
        else:
            return generator

    self.explain_info = self._ensure_explain_info()

    streaming_handler = StreamingHandler(

Review comment (on the new early-return branch): Can this still be used with explain / generation options?

Review reply: Corrected a typo and also changed the wording. Is this the correct usage, i.e., with output rails applied only to predefined / given assistant responses?