feat(tau2): Add Tau2 agentic RL training example with proxy server#892

Merged

garrett4wade merged 28 commits into main from mzy/tau2-proxy on Feb 5, 2026

Conversation

@nuzant
Collaborator

@nuzant commented on Feb 4, 2026

Description

Add a complete Tau2 agentic RL training example that demonstrates multi-turn agent training using the OpenAI-compatible proxy server.

Key additions:

  • examples/tau2/ - Complete training example for Tau2 agentic RL
    • agent.py - Agent implementation with the airline environment task (a simplified sketch of the agent loop follows this list)
    • train.py - Training script with GRPO workflow
    • utils.py - Utility functions for reward computation and data processing
    • config_1.7b_airline.yaml - Config for 1.7B model training
    • config_8b_airline.yaml - Config for 8B model training
    • README.md - Comprehensive documentation
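
For orientation, the multi-turn loop in agent.py can be pictured roughly as the sketch below. It is not the PR's actual code: the base URL, the "policy" model name, the turn cap, and the run_tool stub are hypothetical stand-ins for what the Tau2AgentWorkflow and the proxy rollout server actually provide.

```python
import json

from openai import AsyncOpenAI


def run_tool(name: str, args: dict) -> dict:
    # Placeholder for the Tau2 environment's tool executor.
    return {"status": "ok", "tool": name, "args": args}


async def run_episode(base_url: str, tools: list[dict], system_prompt: str, user_msg: str):
    # The proxy exposes an OpenAI-compatible endpoint, so a plain AsyncOpenAI
    # client can drive the multi-turn rollout.
    client = AsyncOpenAI(base_url=base_url, api_key="EMPTY")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_msg},
    ]
    for _ in range(16):  # hypothetical cap on the number of agent turns
        resp = await client.chat.completions.create(
            model="policy", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg.model_dump(exclude_none=True))
        if not msg.tool_calls:
            break  # plain assistant reply: hand control back to the user simulator
        for call in msg.tool_calls:
            result = run_tool(call.function.name, json.loads(call.function.arguments))
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
            )
    return messages
```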

Test additions:

  • Add integration tests for tau2 example in test_examples.py

Notes:

  1. The examples use tree training by default, which can significantly accelerate training steps without degrading training performance (a rough illustration of the token savings follows these notes).
  2. This PR is based on PR #775 ("Fix Proxy Mode with extra_body support, enforcing max_tokens, switching to individual mode, and fix prompt template and tool call parsing") and PR #740 ("[WIP] example: tau2-bench training"). Big thanks to @samjia2000 and @dhh1995.
  3. Reward curves and larger-scale examples are still in progress and will be added in future PRs.
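
To make note 1 concrete: tree training packs sibling trajectories that share a prompt/turn prefix so the shared tokens are processed once instead of once per trajectory. A rough, purely illustrative calculation with made-up token counts:

```python
# Illustrative only: compare the tokens processed with and without
# deduplicating a shared prefix across a GRPO group of trajectories.
shared_prefix_tokens = 3000                   # multi-turn context shared by the group (made up)
unique_suffix_tokens = [400, 350, 500, 450]   # per-trajectory continuations (made up)

flat_total = sum(shared_prefix_tokens + s for s in unique_suffix_tokens)
tree_total = shared_prefix_tokens + sum(unique_suffix_tokens)

print(f"flat packing: {flat_total} tokens")   # 13700
print(f"tree packing: {tree_total} tokens")   # 4700
print(f"~{flat_total / tree_total:.1f}x fewer tokens in the forward/backward pass")
```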

Related Issue

N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A

Additional Context

Generated with Claude Code

@gemini-code-assist
Contributor

Summary of Changes

Hello @nuzant, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system by integrating a full-fledged example for training agentic Reinforcement Learning models within the Tau2 benchmark. This example demonstrates multi-turn agent training facilitated by an OpenAI-compatible proxy server. Alongside this new feature, the PR delivers crucial bug fixes to improve the stability and correctness of FSDP, the OpenAI proxy, and the Slurm scheduler, ensuring a more robust training environment.

Highlights

  • New Tau2 Agentic RL Training Example: Introduced a complete example for multi-turn agent training using the Tau2 benchmark and an OpenAI-compatible proxy server.
  • Bug Fixes: Addressed issues in FSDP vocab logits calculation, OpenAI proxy server timeout/async generation, OpenAI cache reward tracking, and Slurm scheduler quoting.
  • Integration Tests: Added new integration tests specifically for the Tau2 example to ensure its functionality.


Changelog
  • areal/engine/fsdp_engine.py
    • Removed outdated comments regarding gather_packed_tree_vocab_stats for tree training.
  • areal/experimental/openai/cache.py
    • Enhanced the export_interactions method to filter out incomplete interactions from the cache, preventing warning spam and ensuring data integrity.
  • areal/experimental/openai/proxy/proxy_rollout_server.py
    • Implemented a _warn_once utility to deduplicate recurring warning messages, improving log clarity, and applied it to various warning scenarios (a rough sketch follows this changelog).
  • areal/models/tree_attn/tree.py
    • Added TYPE_CHECKING for BlockMask and explicitly set dtype=torch.int32 for torch.tril_indices to prevent type inference issues (a small sketch also follows this changelog).
  • areal/scheduler/slurm.py
    • Added a debug log statement to display the srun_cmd for better troubleshooting.
  • areal/tests/test_examples.py
    • Included a new test_tau2 integration test for the Tau2 airline domain training, which involves launching an SGLang user LLM server and running the training process.
    • Refined example output logging to skip empty lines.
  • examples/tau2/README.md
    • Added a comprehensive README detailing the Tau2 agent training example, its architecture, prerequisites, configuration, and usage instructions for both single-node and multi-node setups.
  • examples/tau2/agent.py
    • Introduced Tau2AgentWorkflow, an AReaL workflow for running Tau2 customer service simulations using an OpenAI-compatible proxy, including Tau2Runner for managing simulation logic.
  • examples/tau2/config_1.7b_airline.yaml
    • Provided a configuration file for small-scale Tau2 airline domain training using a 1.7B model.
  • examples/tau2/config_8b_airline.yaml
    • Provided a configuration file for multi-node Slurm-based Tau2 airline domain training using an 8B model.
  • examples/tau2/train.py
    • Implemented the main training script for the Tau2 benchmark, handling dataset creation and orchestrating the PPOTrainer with the new Tau2AgentWorkflow.
  • examples/tau2/utils.py
    • Introduced utility dataclasses for Tau2 environment and PPO configurations.
    • Patched tau2.utils.llm_utils.get_response_cost to suppress noisy LiteLLM warnings for unmapped models.
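
As a concrete illustration of the _warn_once change flagged in the changelog above, a deduplicating warning helper can be sketched roughly as follows; the logger name and storage choice are assumptions, and the actual implementation in the PR may differ:

```python
import logging

logger = logging.getLogger("proxy_rollout_server")
_warned: set[str] = set()


def _warn_once(msg: str) -> None:
    # Emit each distinct warning message only once so a condition that fires
    # on every request does not spam the logs.
    if msg in _warned:
        return
    _warned.add(msg)
    logger.warning(msg)
    # If messages embed per-request details, keying on a fixed condition name
    # (or capping the set size) keeps memory bounded, which is what the
    # review's unbounded-storage comment below is about.
```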
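
Similarly, for the tree.py dtype change noted above, the pattern of pinning the index dtype explicitly looks like this minimal sketch (the block size and mask usage are made up):

```python
import torch

n = 8  # example block size
# Explicit dtype=torch.int32 keeps the index tensor's dtype stable instead of
# relying on default integer type inference.
idx = torch.tril_indices(n, n, dtype=torch.int32)
mask = torch.zeros(n, n, dtype=torch.bool)
mask[idx[0].long(), idx[1].long()] = True  # lower-triangular mask built from the indices
```
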
Activity
  • A new feature for agentic RL training using the Tau2 benchmark has been implemented.
  • Several bug fixes were introduced to enhance system stability and logging.
  • Comprehensive documentation for the new example has been added.
  • New integration tests were developed to validate the Tau2 training example.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive example for Tau2 agentic RL training, along with several bug fixes and improvements. However, two medium-severity vulnerabilities were identified: sensitive information logging in the Slurm scheduler (full commands with potential API keys) and a potential memory exhaustion (DoS) in the proxy rollout server due to unbounded log message storage. Additionally, the review suggests improving test stability by replacing time.sleep with polling, correcting documentation issues in the README, and refining exception handling and logging practices for better efficiency and maintainability.
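
On the test-stability suggestion (polling instead of fixed time.sleep waits), a generic helper along these lines could be used in the integration tests; the helper name, timeout values, and the health-check URL in the usage comment are hypothetical:

```python
import time


def wait_until(predicate, timeout: float = 120.0, interval: float = 2.0) -> None:
    # Poll until predicate() is truthy instead of sleeping a fixed duration,
    # so the test proceeds as soon as the server is ready and fails fast with
    # a clear error when it never becomes ready.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout:.0f}s")


# Usage (hypothetical health check for the SGLang user LLM server):
# wait_until(lambda: requests.get("http://localhost:8000/health").ok, timeout=300)
```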

@nuzant changed the title from "feat(tau2): Add Tau2 agentic RL training example with OpenAI proxy" to "feat(tau2): Add Tau2 agentic RL training example with proxy server" on Feb 4, 2026
@nuzant added the safe-to-test (Ready to run unit-tests in a PR.) label on Feb 4, 2026
nuzant and others added 18 commits February 4, 2026 19:23
- Add config_types.py for custom experiment config (Tau2ExperimentConfig)
- Add 7B model configuration (config_7b.yaml)
- Implement lazy attention mask creation for tree training in FSDP engine
- Fix controller mode import with dynamic PYTHONPATH handling
- Silence verbose logging in tree attention module
- Update slurm scheduler with shlex.quote for robust shell escaping (see the sketch after this commit list)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
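
To illustrate the shlex.quote commit above, the usual pattern is to quote each argument before joining the srun command line; the arguments shown are placeholders rather than the scheduler's actual flags:

```python
import shlex

# Placeholder arguments; the real scheduler builds these from the experiment config.
args = [
    "srun",
    "--job-name", "tau2-airline",
    "--export", "ALL,WANDB_MODE=offline",
    "bash", "-c", "python train.py --config config_8b_airline.yaml",
]
srun_cmd = " ".join(shlex.quote(a) for a in args)
# Each argument is quoted, so spaces, commas, and shell metacharacters cannot
# split the command or inject additional ones.
print(srun_cmd)
```
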
@nuzant added and removed the safe-to-test (Ready to run unit-tests in a PR.) label on Feb 4, 2026
@nuzant temporarily deployed to AReaL-unittests with GitHub Actions on February 4, 2026 at 11:44 (now inactive)
Comment on lines +62 to +71
```bash
python3 -m sglang.launch_server \
--model-path Qwen/Qwen2.5-72B \
--host 0.0.0.0 \
--port 8000 \
--tool-call-parser qwen25 \
--chat-template ./qwen3_nonthinking.jinja \
--dp-size 2 \
--tp-size 4
```
Collaborator


Can we instead use RolloutController to launch the servers in the training script? Requiring two separate commands may increase verbosity.

Collaborator Author


I am not sure we have a way to gracefully do this right now. There are two options:

  1. Just use the rollout controller to launch the servers and collect their addresses from name resolve. However, the rollout controller would then launch servers at multiple addresses, and we would need to change the agent workflow to distribute user requests among them (see the sketch below).

  2. Use proxy for user requests as well. This seems to be an elegant solution, but our current implementation does not support multiple proxy endpoints in a single agent workflow run.

I think we should open a new PR to implement option 2 and change the example then.
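
For reference on option 1, distributing the user-simulator requests across addresses collected from name resolve could be as simple as cycling clients; everything here (addresses, model name, the user_turn helper) is hypothetical:

```python
from itertools import cycle

from openai import AsyncOpenAI

# Hypothetical addresses collected from name resolve after the rollout
# controller launches the user-LLM servers.
addresses = ["http://10.0.0.1:8000/v1", "http://10.0.0.2:8000/v1"]
clients = cycle(AsyncOpenAI(base_url=addr, api_key="EMPTY") for addr in addresses)


async def user_turn(messages: list[dict]) -> str:
    # Round-robin the user-simulator requests across the available servers.
    client = next(clients)
    resp = await client.chat.completions.create(model="user-sim", messages=messages)
    return resp.choices[0].message.content
```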

Collaborator


Okay. We can just implement a router in rollout controller.

@nuzant
Collaborator Author

@nuzant commented on Feb 5, 2026

Update: Replaced litellm.acompletion with the AsyncOpenAI chat completion call, due to a bug in litellm.acompletion that raises ConnectionError and unexpectedly discards some trajectories.
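
For reference, the replacement described here is the plain AsyncOpenAI chat-completions call; the base URL, model id, and max_tokens below are placeholders, not the exact values used in the example:

```python
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def user_llm_reply(messages: list[dict], tools: list[dict] | None = None):
    # Direct AsyncOpenAI call instead of litellm.acompletion; the request and
    # response shapes stay OpenAI-compatible, so the surrounding workflow code
    # is unchanged.
    kwargs = {"tools": tools} if tools else {}
    resp = await client.chat.completions.create(
        model="Qwen/Qwen2.5-72B",  # placeholder model id served by the user LLM server
        messages=messages,
        max_tokens=1024,
        **kwargs,
    )
    return resp.choices[0].message
```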

Collaborator

@garrett4wade left a comment


LGTM

@garrett4wade merged commit 9f19f64 into main on Feb 5, 2026
1 check passed
@garrett4wade deleted the mzy/tau2-proxy branch on February 5, 2026 at 08:12