feat(tau2): Add Tau2 agentic RL training example with proxy server#892

Merged

garrett4wade merged 28 commits into main from mzy/tau2-proxy on Feb 5, 2026

Conversation

@nuzant
Collaborator

@nuzant commented on Feb 4, 2026

Description

Add a complete Tau2 agentic RL training example that demonstrates multi-turn agent training using the OpenAI-compatible proxy server.

Key additions:

  • examples/tau2/ - Complete training example for Tau2 agentic RL
    • agent.py - Agent implementation with the airline environment task (a simplified sketch of the agent loop follows this list)
    • train.py - Training script with GRPO workflow
    • utils.py - Utility functions for reward computation and data processing
    • config_1.7b_airline.yaml - Config for 1.7B model training
    • config_8b_airline.yaml - Config for 8B model training
    • README.md - Comprehensive documentation
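
For orientation, the multi-turn loop in agent.py can be pictured roughly as the sketch below. It is not the PR's actual code: the base URL, the "policy" model name, the turn cap, and the run_tool stub are hypothetical stand-ins for what the Tau2AgentWorkflow and the proxy rollout server actually provide.

```python
import json

from openai import AsyncOpenAI


def run_tool(name: str, args: dict) -> dict:
    # Placeholder for the Tau2 environment's tool executor.
    return {"status": "ok", "tool": name, "args": args}


async def run_episode(base_url: str, tools: list[dict], system_prompt: str, user_msg: str):
    # The proxy exposes an OpenAI-compatible endpoint, so a plain AsyncOpenAI
    # client can drive the multi-turn rollout.
    client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_msg},
    ]
    for _ in range(16):  # hypothetical cap on the number of agent turns
        resp = await client.chat.completions.create(
            model="policy", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            break  # plain assistant reply: hand control back to the user simulator
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
    return messages
```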

Test additions:

  • Add integration tests for tau2 example in test_examples.py

Notes:

  1. The examples use tree training by default, which can significantly accelerate training steps without degrading training performance (a rough illustration of the token savings follows these notes).
  2. This PR is based on PR #775 ("Fix Proxy Mode with extra_body support, enforcing max_tokens, switching to individual mode, and fix prompt template and tool call parsing") and PR #740 ("[WIP] example: tau2-bench training"). Big thanks to @samjia2000 and @dhh1995.
  3. Reward curves and larger-scale examples are still in progress and will be added in future PRs.
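
To make note 1 concrete: tree training packs sibling trajectories that share a prompt/turn prefix so the shared tokens are processed once instead of once per trajectory. A rough, purely illustrative calculation with made-up token counts:

```python
# Illustrative only: compare the tokens processed with and without
# deduplicating a shared prefix across a GRPO group of trajectories.
shared_prefix_tokens = 3000                   # multi-turn context shared by the group (made up)
unique_suffix_tokens = [400, 350, 500, 450]   # per-trajectory continuations (made up)

flat_total = sum(shared_prefix_tokens + s for s in unique_suffix_tokens)
tree_total = shared_prefix_tokens + sum(unique_suffix_tokens)

print(f"flat packing: {flat_total} tokens")   # 13700
print(f"tree packing: {tree_total} tokens")   # 4700
print(f"~{flat_total / tree_total:.1f}x fewer tokens in the forward/backward pass")
```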

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Generated with Claude Code

@gemini-code-assist
Contributor

Summary of Changes

Hello @nuzant, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system by integrating a full-fledged example for training agentic Reinforcement Learning models within the Tau2 benchmark. This example demonstrates multi-turn agent training facilitated by an OpenAI-compatible proxy server. Alongside this new feature, the PR delivers crucial bug fixes to improve the stability and correctness of FSDP, the OpenAI proxy, and the Slurm scheduler, ensuring a more robust training environment.

Highlights

  • New Tau2 Agentic RL Training Example: Introduced a complete example for multi-turn agent training using the Tau2 benchmark and an OpenAI-compatible proxy server.
  • Bug Fixes: Addressed issues in FSDP vocab logits calculation, OpenAI proxy server timeout/async generation, OpenAI cache reward tracking, and Slurm scheduler quoting.
  • Integration Tests: Added new integration tests specifically for the Tau2 example to ensure its functionality.


Changelog
  • areal/engine/fsdp_engine.py
    • Removed outdated comments regarding gather_packed_tree_vocab_stats for tree training.
  • areal/experimental/openai/cache.py
    • Enhanced the export_interactions method to filter out incomplete interactions from the cache, preventing warning spam and ensuring data integrity.
  • areal/experimental/openai/proxy/proxy_rollout_server.py
    • Implemented a _warn_once utility to deduplicate recurring warning messages, improving log clarity, and applied it to various warning scenarios (a rough sketch follows this changelog).
  • areal/models/tree_attn/tree.py
    • Added TYPE_CHECKING for BlockMask and explicitly set dtype=torch.int32 for torch.tril_indices to prevent type inference issues (a small sketch also follows this changelog).
  • areal/scheduler/slurm.py
    • Added a debug log statement to display the srun_cmd for better troubleshooting.
  • areal/tests/test_examples.py
    • Included a new test_tau2 integration test for the Tau2 airline domain training, which involves launching an SGLang user LLM server and running the training process.
    • Refined example output logging to skip empty lines.
  • examples/tau2/README.md
    • Added a comprehensive README detailing the Tau2 agent training example, its architecture, prerequisites, configuration, and usage instructions for both single-node and multi-node setups.
  • examples/tau2/agent.py
    • Introduced Tau2AgentWorkflow, an AReaL workflow for running Tau2 customer service simulations using an OpenAI-compatible proxy, including Tau2Runner for managing simulation logic.
  • examples/tau2/config_1.7b_airline.yaml
    • Provided a configuration file for small-scale Tau2 airline domain training using a 1.7B model.
  • examples/tau2/config_8b_airline.yaml
    • Provided a configuration file for multi-node Slurm-based Tau2 airline domain training using an 8B model.
  • examples/tau2/train.py
    • Implemented the main training script for the Tau2 benchmark, handling dataset creation and orchestrating the PPOTrainer with the new Tau2AgentWorkflow.
  • examples/tau2/utils.py
    • Introduced utility dataclasses for Tau2 environment and PPO configurations.
    • Patched tau2.utils.llm_utils.get_response_cost to suppress noisy LiteLLM warnings for unmapped models.
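
As a concrete illustration of the _warn_once change flagged in the changelog above, a deduplicating warning helper can be sketched roughly as follows; the logger name and storage choice are assumptions, and the actual implementation in the PR may differ:

```python
import logging

logger = logging.getLogger("proxy_rollout_server")
_warned: set[str] = set()


def _warn_once(msg: str) -> None:
    # Emit each distinct warning message only once so a condition that fires
    # on every request does not spam the logs.
    if msg in _warned:
        return
    _warned.add(msg)
    logger.warning(msg)
    # If messages embed per-request details, keying on a fixed condition name
    # (or capping the set size) keeps memory bounded, which is what the
    # review's unbounded-storage comment below is about.
```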
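
Similarly, for the tree.py dtype change noted above, the pattern of pinning the index dtype explicitly looks like this minimal sketch (the block size and mask usage are made up):

```python
import torch

n = 8  # example block size
# Explicit dtype=torch.int32 keeps the index tensor's dtype stable instead of
# relying on default integer type inference.
idx = torch.tril_indices(n, n, dtype=torch.int32)
mask = torch.zeros(n, n, dtype=torch.bool)
mask[idx[0].long(), idx[1].long()] = True  # lower-triangular mask built from the indices
```
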
Activity
  • A new feature for agentic RL training using the Tau2 benchmark has been implemented.
  • Several bug fixes were introduced to enhance system stability and logging.
  • Comprehensive documentation for the new example has been added.
  • New integration tests were developed to validate the Tau2 training example.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive example for Tau2 agentic RL training, along with several bug fixes and improvements. However, two medium-severity vulnerabilities were identified: sensitive information logging in the Slurm scheduler (full commands with potential API keys) and a potential memory exhaustion (DoS) in the proxy rollout server due to unbounded log message storage. Additionally, the review suggests improving test stability by replacing time.sleep with polling, correcting documentation issues in the README, and refining exception handling and logging practices for better efficiency and maintainability.
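
On the test-stability suggestion (polling instead of fixed time.sleep waits), a generic helper along these lines could be used in the integration tests; the helper name, timeout values, and the health-check URL in the usage comment are hypothetical:

```python
import time


def wait_until(predicate, timeout: float = 120.0, interval: float = 2.0) -> None:
    # Poll until predicate() is truthy instead of sleeping a fixed duration,
    # so the test proceeds as soon as the server is ready and fails fast with
    # a clear error when it never becomes ready.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.0f}s")


# Usage (hypothetical health check for the SGLang user LLM server):
# wait_until(lambda: requests.get("http://localhost:8000/health").ok, timeout=300)
```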

@nuzant changed the title from "feat(tau2): Add Tau2 agentic RL training example with OpenAI proxy" to "feat(tau2): Add Tau2 agentic RL training example with proxy server" on Feb 4, 2026
@nuzant added the safe-to-test (Ready to run unit-tests in a PR.) label on Feb 4, 2026
nuzant and others added 18 commits February 4, 2026 19:23
- Add config_types.py for custom experiment config (Tau2ExperimentConfig)
- Add 7B model configuration (config_7b.yaml)
- Implement lazy attention mask creation for tree training in FSDP engine
- Fix controller mode import with dynamic PYTHONPATH handling
- Silence verbose logging in tree attention module
- Update slurm scheduler with shlex.quote for robust shell escaping (see the sketch after this commit list)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
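
To illustrate the shlex.quote commit above, the usual pattern is to quote each argument before joining the srun command line; the arguments shown are placeholders rather than the scheduler's actual flags:

```python
import shlex

# Placeholder arguments; the real scheduler builds these from the experiment config.
args = [
    "srun",
    "--job-name", "tau2-airline",
    "--export", "ALL,WANDB_MODE=offline",
    "bash", "-c", "python train.py --config config_8b_airline.yaml",
]
srun_cmd = " ".join(shlex.quote(a) for a in args)
# Each argument is quoted, so spaces, commas, and shell metacharacters cannot
# split the command or inject additional ones.
print(srun_cmd)
```
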
@nuzant added and removed the safe-to-test (Ready to run unit-tests in a PR.) label on Feb 4, 2026
@nuzant temporarily deployed to AReaL-unittests with GitHub Actions on February 4, 2026 at 11:44 (now inactive)
Comment on lines +62 to +71
```bash
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-72B \
--host 0.0.0.0 \
--port 8000 \
--tool-call-parser qwen25 \
--chat-template ./qwen3_nonthinking.jinja \
--dp-size 2 \
--tp-size 4
```
Collaborator


Can we instead use RolloutController to launch the servers in the training script? Requiring two separate commands may increase verbosity.

Collaborator Author


I am not sure we have a way to gracefully do this right now. There are two options:

  1. Just use the rollout controller to launch the servers and collect their addresses from name resolve. However, the rollout controller would then launch servers at multiple addresses, and we would need to change the agent workflow to distribute user requests among them (see the sketch below).

  2. Use proxy for user requests as well. This seems to be an elegant solution, but our current implementation does not support multiple proxy endpoints in a single agent workflow run.

I think we should open a new PR to implement option 2 and change the example then.
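
For reference on option 1, distributing the user-simulator requests across addresses collected from name resolve could be as simple as cycling clients; everything here (addresses, model name, the user_turn helper) is hypothetical:

```python
from itertools import cycle

from openai import AsyncOpenAI

# Hypothetical addresses collected from name resolve after the rollout
# controller launches the user-LLM servers.
addresses = ["http://10.0.0.1:8000/v1", "http://10.0.0.2:8000/v1"]
clients = cycle(AsyncOpenAI(base_url=addr, api_key="EMPTY") for addr in addresses)


async def user_turn(messages: list[dict]) -> str:
    # Round-robin the user-simulator requests across the available servers.
    client = next(clients)
    resp = await client.chat.completions.create(model="user-sim", messages=messages)
    return resp.choices[0].message.content
```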

Collaborator


Okay. We can just implement a router in rollout controller.

@nuzant
Collaborator Author

@nuzant commented on Feb 5, 2026

Update: Replaced litellm.acompletion with the AsyncOpenAI chat completion call, due to a bug in litellm.acompletion that raises ConnectionError and unexpectedly discards some trajectories.
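
For reference, the replacement described here is the plain AsyncOpenAI chat-completions call; the base URL, model id, and max_tokens below are placeholders, not the exact values used in the example:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def user_llm_reply(messages: list[dict], tools: list[dict] | None = None):
    # Direct AsyncOpenAI call instead of litellm.acompletion; the request and
    # response shapes stay OpenAI-compatible, so the surrounding workflow code
    # is unchanged.
    kwargs = {"tools": tools} if tools else {}
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-72B",  # placeholder model id served by the user LLM server
        messages=messages,
        max_tokens=1024,
        **kwargs,
    )
    return resp.choices[0].message
```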

Collaborator

@garrett4wade left a comment


LGTM

@garrett4wade merged commit 9f19f64 into main on Feb 5, 2026
1 check passed
@garrett4wade deleted the mzy/tau2-proxy branch on February 5, 2026 at 08:12