
feat: implement parallel streaming output rails execution #1263


Merged
2 commits merged into develop from feat/parallel-output-rails-streaming on Jul 21, 2025

Conversation

Pouyanpi
Collaborator

@Pouyanpi Pouyanpi commented Jul 4, 2025

This PR implements streaming support for parallel output rails execution:

  • Add _run_output_rails_in_parallel_streaming method to run output rails concurrently
  • Use asyncio tasks to execute multiple rails simultaneously during streaming
  • Implement early termination when any rail blocks content to optimize performance
  • Register the new action in the runtime dispatcher
  • Add proper error handling and cancellation for robust parallel execution
  • Avoid full flow state management issues that can occur with hide_prev_turn logic during streaming
  • Add comprehensive tests for parallel streaming functionality

requires #1259
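The concurrency pattern the bullets above describe (run every output rail as an asyncio task, stop early as soon as one blocks, and clean up the rest) can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code; `run_rail` and `check_chunk_parallel` are hypothetical names, and a real rail would be an LLM-backed action rather than a substring check:

```python
import asyncio

# Hypothetical stand-in for one output rail checking a streamed chunk.
async def run_rail(name: str, chunk: str) -> bool:
    await asyncio.sleep(0.01)          # simulate an LLM-backed check
    return "unsafe" not in chunk       # True means the rail allows the chunk

async def check_chunk_parallel(rail_names: list[str], chunk: str) -> bool:
    """Run all rails concurrently; return False as soon as any rail blocks."""
    tasks = [asyncio.create_task(run_rail(name, chunk)) for name in rail_names]
    try:
        for finished in asyncio.as_completed(tasks):
            if not await finished:     # first blocking rail wins
                return False
        return True
    finally:
        # Early termination: cancel whatever is still running, then await
        # the tasks so no "Task was destroyed but it is pending!" warnings
        # are emitted. cancel() is a no-op on tasks that already finished.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)

print(asyncio.run(check_chunk_parallel(["safety", "self_check"], "a safe chunk")))
```

Returning on the first blocking result is what gives the latency win reported below: the slowest rail no longer gates every chunk when another rail has already decided to block.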

@Pouyanpi Pouyanpi self-assigned this Jul 4, 2025
@Pouyanpi Pouyanpi added this to the v0.15.0 milestone Jul 4, 2025
@Pouyanpi Pouyanpi added the enhancement New feature or request label Jul 4, 2025
@Pouyanpi Pouyanpi marked this pull request as ready for review July 7, 2025 12:05
@Pouyanpi Pouyanpi requested a review from Copilot July 7, 2025 12:05
Copilot

This comment was marked as outdated.

@Pouyanpi Pouyanpi requested a review from tgasser-nv July 9, 2025 15:18
@Pouyanpi Pouyanpi force-pushed the feat/parallel-output-rails-streaming branch from eb420fa to c33a0b7 Compare July 9, 2025 15:34
@Pouyanpi Pouyanpi requested a review from Copilot July 9, 2025 15:35

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds parallel streaming capabilities to the output rails execution path, improving throughput and allowing early termination when any rail blocks content.

  • Introduce a parallel flag in the OutputRails config and wire it through the LLM pipeline.
  • Implement a new _run_output_rails_in_parallel_streaming action that runs rails concurrently with early-stop on block, and register it at runtime.
  • Extend the test suite with comprehensive scenarios covering allowed flows, blocking, performance, error handling, and default behavior.

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Summary per file:

  • nemoguardrails/rails/llm/config.py: Added the parallel flag to the OutputRails config.
  • nemoguardrails/rails/llm/llmrails.py: Integrated parallel streaming logic: context prep, dispatch call, and error handling.
  • nemoguardrails/colang/v1_0/runtime/runtime.py: Implemented the _run_output_rails_in_parallel_streaming action.
  • nemoguardrails/colang/runtime.py: Registered the new parallel streaming action in the dispatcher.
  • tests/test_parallel_streaming_output_rails.py: Added extensive tests covering parallel streaming, blocking, and performance.
Comments suppressed due to low confidence (1)

nemoguardrails/rails/llm/llmrails.py:334

  • After canceling pending tasks, await them (for example via await asyncio.gather(*tasks, return_exceptions=True)) to ensure proper cleanup and avoid Task was destroyed but it is pending! warnings.
            )
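The cleanup this suppressed comment recommends is a generic asyncio pattern and easy to show in isolation (this sketch is not the PR's actual code):

```python
import asyncio

async def cancel_and_drain() -> list:
    tasks = [asyncio.create_task(asyncio.sleep(10)) for _ in range(3)]
    for task in tasks:
        task.cancel()
    # Awaiting the cancelled tasks gives them a chance to actually finish
    # cancelling; return_exceptions=True turns each CancelledError into a
    # returned result instead of re-raising, so no "Task was destroyed but
    # it is pending!" warnings are logged at shutdown.
    return await asyncio.gather(*tasks, return_exceptions=True)

results = asyncio.run(cancel_and_drain())
assert all(isinstance(r, asyncio.CancelledError) for r in results)
```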

Comment on lines +1364 to +1370
context.update(context_message["content"])


Copilot AI Jul 9, 2025


Updating the context dict with a string (context_message["content"]) will not merge its characters into the dict; dict.update treats a non-mapping argument as an iterable of key/value pairs, so a plain string raises a ValueError. Instead, assign the content under a specific key or ensure the updated value is a mapping.

Suggested change
context.update(context_message["content"])
if isinstance(context_message.get("content"), dict):
    context.update(context_message["content"])
else:
    log.warning("context_message['content'] is not a dictionary and will be ignored.")
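As a quick standalone check of the behavior flagged here (plain Python, independent of the PR): update() with a string does not quietly merge characters into the dict — it raises, because each one-character element cannot unpack as a key/value pair.

```python
context: dict = {}

try:
    # A str is treated as an iterable of key/value pairs; elements like 'b'
    # have length 1, so CPython raises ValueError instead of merging them.
    context.update("blocked")
except ValueError:
    pass

assert context == {}

# Updating with an actual mapping works as intended.
context.update({"bot_message": "blocked"})
assert context == {"bot_message": "blocked"}
```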


@@ -1346,6 +1346,32 @@ def _get_latest_user_message(
return message
return {}

def _prepare_context_for_parallel_rails(

Copilot AI Jul 9, 2025


[nitpick] Defining helper functions (_prepare_context_for_parallel_rails, _create_events_for_chunk) inside a method can make the code harder to navigate; consider extracting them to module-level helpers.


Collaborator

@tgasser-nv tgasser-nv left a comment


This looks good to me. Can you run a local "integration test" where you enable streaming and run parallel input and output rails through the 3 Nemoguard NVCF models, to check that the responses make sense and everything works correctly? Please paste the outputs, latency measurements, etc. into the PR comments.


# cancel remaining tasks
for pending_task in tasks:
if not pending_task.done():
Collaborator


If we try and cancel a task that's already done, will that throw an exception? A task could finish in between the pending_task.done() and pending_task.cancel() calls. We need to be able to cancel a task that's already done and no Exceptions to be thrown

Collaborator Author


If we try and cancel a task that's already done, will that throw an exception?

Calling cancel() on a task that is already done or cancelled will not throw an exception; it simply returns False and has no effect, so there is no need to guard cancel() with a try/except for this reason.

But this is a good point that we might want to document.

A task could finish in between the pending_task.done() and pending_task.cancel() calls

It is indeed a race condition, but that is the expected behavior; we can log a warning, though:

For example

  1. rails A, B, C are running in parallel
  2. rail A completes and says "blocked"
  3. while we're processing A's result rail B also completes and says "blocked"
  4. We cancel rail C (which is still pending)
  5. We only report that A blocked, not B

based on our assumptions:

  • not a bug: multiple rails might detect violations, but we only need to know that at least one violation occurred
  • expected behavior: first violation detected stops the stream
  • correct outcome: content is blocked regardless of which rail detected it first

So for content-modifying rails one should use sequential execution; for read-only validation rails, parallel execution is fine.
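The claim above is easy to verify in isolation: cancel() on a completed task returns False and raises nothing. A minimal check:

```python
import asyncio

async def cancel_done_task() -> bool:
    task = asyncio.create_task(asyncio.sleep(0))
    await task  # the task is now done
    # cancel() on a finished task is a harmless no-op: it returns False
    # and never raises, so no try/except guard is needed around it.
    return task.cancel()

assert asyncio.run(cancel_done_task()) is False
```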

@Pouyanpi
Collaborator Author

Pouyanpi commented Jul 10, 2025

PERFORMANCE COMPARISON SUMMARY

Config 1

config:

  • Stream first: False
  • Chunk size: 200
  • Context size: 50

Output Rails:

  • content safety check output $model=content_safety
  • self check output

Latency Improvements (First Chunk):
Test 1: 2.488s → 1.535s (38.3% faster, 1.62x speedup)
Test 2: 2.255s → 0.956s (57.6% faster, 2.36x speedup)
Test 3: 2.097s → 0.567s (73.0% faster, 3.70x speedup)

Total Duration Comparison:
Test 1: 2.489s (sequential) vs 2.117s (parallel)
Test 2: 2.256s (sequential) vs 1.525s (parallel)
Test 3: 2.099s (sequential) vs 1.097s (parallel)

Rail Execution Details:

  • Total LLM calls: 30
  • Rails executed:
    • content_safety_check_output $model=content_safety: 12 call(s)
    • general: 6 call(s)
    • self_check_output: 12 call(s)

Config 2 (lower chunk size, more output rails invocations)

config:

  • Streaming enabled: True
  • Stream first: False
  • Chunk size: 30
  • Context size: 5

Output Rails:

  • content safety check output $model=content_safety
  • self check output

Latency Improvements (First Chunk):
Test 1: 2.015s → 0.690s (65.7% faster, 2.92x speedup)
Test 2: 1.605s → 0.586s (63.5% faster, 2.74x speedup)
Test 3: 1.547s → 0.366s (76.3% faster, 4.23x speedup)

Total Duration Comparison:
Test 1: 3.925s (sequential) vs 2.202s (parallel)
Test 2: 4.507s (sequential) vs 1.162s (parallel)
Test 3: 1.547s (sequential) vs 0.940s (parallel)

Rail Execution Details:

  • Total LLM calls: 44
  • Rails executed:
    • content_safety_check_output $model=content_safety: 19 call(s)
    • general: 6 call(s)
    • self_check_output: 19 call(s)

Base automatically changed from fix/streaming-outputrails to develop July 10, 2025 11:17
- Add _run_output_rails_in_parallel_streaming method to run output rails concurrently
- Use asyncio tasks to execute multiple rails simultaneously during streaming
- Implement early termination when any rail blocks content to optimize performance
- Register the new action in the runtime dispatcher
- Add proper error handling and cancellation for robust parallel execution
- Avoid full flow state management issues that can occur with hide_prev_turn logic during streaming
- Add comprehensive tests for parallel streaming functionality

parallel rails accept flow params dict
@Pouyanpi Pouyanpi force-pushed the feat/parallel-output-rails-streaming branch from c33a0b7 to 4cb2762 Compare July 10, 2025 12:27
@codecov-commenter

codecov-commenter commented Jul 10, 2025

Codecov Report

Attention: Patch coverage is 79.12088% with 19 lines in your changes missing coverage. Please review.

Project coverage is 69.83%. Comparing base (ef97795) to head (2c2469d).

Files with missing lines Patch % Lines
nemoguardrails/colang/v1_0/runtime/runtime.py 69.76% 13 Missing ⚠️
nemoguardrails/rails/llm/llmrails.py 86.66% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1263      +/-   ##
===========================================
+ Coverage    69.78%   69.83%   +0.04%     
===========================================
  Files          161      161              
  Lines        16057    16135      +78     
===========================================
+ Hits         11206    11268      +62     
- Misses        4851     4867      +16     
Flag Coverage Δ
python 69.83% <79.12%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
nemoguardrails/colang/runtime.py 88.57% <100.00%> (+0.69%) ⬆️
nemoguardrails/rails/llm/config.py 90.32% <100.00%> (+0.01%) ⬆️
nemoguardrails/rails/llm/llmrails.py 88.88% <86.66%> (-0.04%) ⬇️
nemoguardrails/colang/v1_0/runtime/runtime.py 82.47% <69.76%> (-2.50%) ⬇️

Collaborator

@tgasser-nv tgasser-nv left a comment


Thanks for adding the local performance test values, this looks great.

@Pouyanpi Pouyanpi merged commit bfe81f1 into develop Jul 21, 2025
17 checks passed
@Pouyanpi Pouyanpi deleted the feat/parallel-output-rails-streaming branch July 21, 2025 15:07
Labels
enhancement New feature or request status: in review
3 participants