Resets recurrent state after episode termination in RSL-RL play.py
#3838
Conversation
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation pipeline where recurrent policy networks (e.g., LSTMs, GRUs) retained hidden states across episode boundaries during evaluation. The evaluation script (play.py) was missing the necessary logic to reset recurrent states when episodes terminated, causing residual memory from previous episodes to contaminate the policy's behavior at the start of new episodes. The fix captures the dones tensor from the environment step and passes it to policy_nn.reset(), which conditionally resets recurrent states for terminated environments. This brings the evaluation loop into parity with the training loop, which already handled this correctly. The changes include appropriate version bumping (patch increment to 0.47.2) and changelog documentation.
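For context, a minimal sketch of the evaluation loop described above (not the verbatim play.py code; `env`, `policy`, and `policy_nn` are stand-ins for the wrapped RSL-RL environment, the inference policy, and its underlying actor-critic module):

```python
# Sketch only: illustrates the reset-on-termination pattern, not the exact play.py change.
import torch


def run_eval_loop(env, policy, policy_nn, max_steps: int = 1000) -> None:
    """Roll out a policy, clearing recurrent memory whenever episodes terminate."""
    obs = env.get_observations()
    for _ in range(max_steps):
        with torch.inference_mode():
            actions = policy(obs)
        # Capture the dones tensor that the old script discarded.
        obs, _, dones, _ = env.step(actions)
        # Reset hidden states only for environments whose episode just ended;
        # for feedforward policies this is expected to be a no-op.
        policy_nn.reset(dones)
```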
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added logic to capture dones tensor and call policy_nn.reset(dones) to properly reset recurrent states between episodes during evaluation |
| source/isaaclab/config/extension.toml | 5/5 | Incremented patch version from 0.47.1 to 0.47.2 to reflect the bug fix |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added release notes for version 0.47.2 documenting the recurrent policy state reset fix |
Confidence score: 4/5
- This PR addresses a clear bug with a focused fix that aligns with established training behavior, making it safe to merge with minimal risk.
- Score is 4/5 because while the fix is correct and necessary, there's a minor concern: the reset() call may not be implemented or tested for all policy architectures, and the change assumes policy_nn always has a reset() method that accepts a dones tensor - this could potentially cause issues with non-recurrent policies or custom architectures if not handled properly.
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that the reset() method is available on all policy types used in evaluation, particularly ensuring non-recurrent policies handle this gracefully.
Sequence Diagram
sequenceDiagram
participant User
participant main
participant AppLauncher
participant gym
participant env
participant policy
participant policy_nn
User->>main: Execute play.py
main->>AppLauncher: Launch simulation app
main->>gym: gym.make(task, cfg, render_mode)
gym->>env: Create environment
main->>env: wrap with RslRlVecEnvWrapper
main->>main: Load runner and checkpoint
main->>main: Get inference policy
main->>main: Extract policy neural network
main->>env: get_observations()
env->>main: Return obs
loop While simulation running
main->>policy: policy(obs)
policy->>main: Return actions
main->>env: env.step(actions)
env->>main: Return obs, rewards, dones, info
Note over main,policy_nn: NEW: Reset recurrent states for terminated episodes
main->>policy_nn: policy_nn.reset(dones)
policy_nn->>main: Reset complete
end
main->>env: env.close()
main->>User: Simulation complete
3 files reviewed, no comments
@ClemensSchwarke can you take a look at this? I think the change looks good to me.
Thanks!
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation script (play.py) where recurrent neural network policies (LSTM/GRU) were not properly resetting their hidden states between episodes. The codebase uses the RSL-RL framework for reinforcement learning, where training and evaluation follow different code paths. During training, episode resets are handled automatically, but the evaluation script was missing this logic. The fix captures the dones signal from env.step() and calls policy_nn.reset(dones) to clear recurrent states when episodes terminate. This ensures each episode starts with a clean memory state, preventing information leakage across episode boundaries. Version metadata was updated from 0.47.2 to 0.47.3 to reflect this patch-level bug fix.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset logic after step, capturing dones and calling policy_nn.reset(dones) |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 for bug fix release |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry documenting the recurrent state reset fix for version 0.47.3 |
Confidence score: 4/5
- This PR is safe to merge with minimal risk, addressing a specific bug in recurrent policy evaluation
- Score reflects a well-targeted fix with appropriate version bumping and documentation, though minor concerns exist about the interaction between policy_nn.reset() and the environment's reset mechanism, and potential edge cases with partial resets in vectorized environments
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure the policy_nn.reset(dones) call is compatible with all policy network types and vectorized environment configurations
Sequence Diagram
sequenceDiagram
participant User
participant play.py as play.py Script
participant AppLauncher
participant Gym as Gymnasium
participant Env as Environment
participant Wrapper as RslRlVecEnvWrapper
participant Runner as OnPolicyRunner/DistillationRunner
participant Policy as policy_nn (Actor-Critic)
participant SimApp as Simulation App
User->>play.py: Execute play.py with task config
play.py->>AppLauncher: Initialize AppLauncher(args_cli)
AppLauncher->>SimApp: Launch Omniverse app
play.py->>play.py: Parse hydra config for env_cfg & agent_cfg
play.py->>play.py: Determine checkpoint path (resume_path)
play.py->>Gym: gym.make(task, cfg=env_cfg, render_mode)
Gym->>Env: Create environment instance
play.py->>Wrapper: RslRlVecEnvWrapper(env)
play.py->>Runner: Create OnPolicyRunner/DistillationRunner
play.py->>Runner: runner.load(resume_path)
Runner->>Policy: Load policy checkpoint
play.py->>Runner: policy = runner.get_inference_policy()
play.py->>Env: obs = env.get_observations()
loop Simulation Loop (while simulation_app.is_running())
play.py->>Policy: actions = policy(obs)
play.py->>Env: obs, _, dones, _ = env.step(actions)
Note over play.py,Policy: Bug Fix: Reset recurrent states on termination
play.py->>Policy: policy_nn.reset(dones)
alt Video Recording Enabled
play.py->>play.py: Increment timestep counter
alt timestep == video_length
play.py->>play.py: Break loop
end
end
end
play.py->>Env: env.close()
play.py->>SimApp: simulation_app.close()
3 files reviewed, 2 comments
Greptile Overview
Greptile Summary
This PR addresses a critical bug in the RSL-RL evaluation workflow where recurrent policy networks (LSTM/GRU) were not resetting hidden states after episode terminations during inference. The fix modifies play.py to capture the dones signal from env.step() and explicitly call policy_nn.reset(dones) to clear recurrent memory for terminated episodes. This ensures proper episode boundary handling during policy evaluation, preventing residual memory from previous episodes from affecting subsequent rollouts. The changes are minimal and focused: capturing the previously ignored third return value (dones) from env.step() and adding a single reset call. The fix aligns with standard RL evaluation practices where recurrent states should be cleared at episode boundaries, similar to how the training workflow in train.py likely handles resets through the OnPolicyRunner wrapper.
PR Description Notes:
- The PR checklist indicates tests were added, but no test files appear in the changed files list
- Documentation checkbox is unchecked despite this being a behavioral fix that may warrant documentation
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after episode termination by capturing dones and calling policy_nn.reset(dones) |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry for version 0.47.3 documenting the recurrent policy reset fix |
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it fixes a clear bug in recurrent policy evaluation with a straightforward solution
- Score reflects that while the fix is correct, the unconditional call to reset() on every step may cause issues if feedforward policies don't implement a no-op reset method, and the PR claims tests were added but they're not visible in the changed files
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that all policy types in the codebase properly handle the reset() method call
Sequence Diagram
sequenceDiagram
participant User
participant PlayScript as play.py
participant Env as Environment
participant Policy as Policy Network
participant Simulation as SimulationApp
User->>PlayScript: Execute play.py
PlayScript->>Env: gym.make(task)
PlayScript->>Env: get_observations()
Env-->>PlayScript: obs
loop While simulation running
PlayScript->>Policy: policy(obs)
Policy-->>PlayScript: actions
PlayScript->>Env: step(actions)
Env-->>PlayScript: obs, _, dones, _
PlayScript->>Policy: reset(dones)
Note over Policy: Reset recurrent states<br/>for terminated episodes
alt Video recording enabled
PlayScript->>PlayScript: Check if video complete
alt Video length reached
PlayScript->>PlayScript: break
end
end
end
PlayScript->>Env: close()
PlayScript->>Simulation: close()
3 files reviewed, no comments
Looks good, thanks a lot!
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review. The developer has addressed previous feedback by updating the CHANGELOG category from "Changed" to "Fixed" (as appropriate for a bug fix), ensuring proper semantic versioning with the patch version increment to 0.47.3. The core fix remains targeted and minimal: capturing the dones signal from env.step() and passing it to policy_nn.reset(dones) to clear recurrent states selectively for terminated episodes. This ensures recurrent policies (LSTM/GRU) don't carry residual memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.3 entry documenting the recurrent state reset bug fix under "Fixed" section |
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Captures dones from env.step() and calls policy_nn.reset(dones) to reset recurrent states after episode termination |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning for bug fix |
Confidence score: 4/5
- This PR is safe to merge with only minor considerations regarding non-recurrent policy performance
- Score reflects that the core bug fix is correct and properly documented, but one previous concern about calling reset unconditionally on feedforward policies remains unaddressed in the visible changes
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure policy_nn.reset() is either a no-op for feedforward policies or has negligible performance impact
3 files reviewed, no comments
Greptile Overview
Greptile Summary
This PR fixes a bug in the RSL-RL evaluation workflow where recurrent policy hidden states were not being reset between episode terminations. The fix modifies play.py to capture the dones tensor from env.step() and pass it to policy_nn.reset(dones), ensuring that recurrent networks (LSTMs, GRUs) start each new episode with clean hidden states rather than carrying over residual memory from previous episodes. This aligns the evaluation behavior with the standard training loop pattern where policies are reset on episode boundaries. The change also includes a changelog entry documenting this fix as version 0.47.4.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added capture of dones from env.step() and call to policy_nn.reset(dones) to reset recurrent states on episode termination |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.4 changelog entry documenting the recurrent policy reset bug fix |
Confidence score: 4/5
- This PR addresses a legitimate bug with a targeted fix that follows standard RL patterns for handling recurrent policies
- Score reflects that while the fix is correct, calling reset unconditionally on every step for all policy types may have minor performance implications, and the changelog entry uses "Fixed" correctly but has a minor date inconsistency (shows October instead of current date)
- The main file requiring attention is play.py, to verify that calling reset on every step is acceptable for all policy architectures in the codebase
Sequence Diagram
sequenceDiagram
participant User
participant play.py
participant gym.make
participant RslRlVecEnvWrapper
participant OnPolicyRunner/DistillationRunner
participant policy
participant policy_nn
participant env
User->>play.py: Run play.py script
play.py->>gym.make: Create environment
gym.make-->>play.py: Return env
play.py->>RslRlVecEnvWrapper: Wrap environment
RslRlVecEnvWrapper-->>play.py: Return wrapped env
play.py->>OnPolicyRunner/DistillationRunner: Initialize runner
play.py->>OnPolicyRunner/DistillationRunner: load(resume_path)
play.py->>OnPolicyRunner/DistillationRunner: get_inference_policy()
OnPolicyRunner/DistillationRunner-->>play.py: Return policy
play.py->>env: get_observations()
env-->>play.py: Return obs
loop While simulation running
play.py->>policy: policy(obs)
policy-->>play.py: Return actions
play.py->>env: step(actions)
env-->>play.py: Return obs, rewards, dones, info
Note over play.py,policy_nn: Bug Fix: Reset recurrent states on termination
play.py->>policy_nn: reset(dones)
Note over policy_nn: Resets hidden states for<br/>terminated episodes
end
play.py->>env: close()
2 files reviewed, 1 comment
- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
style: calling reset() on every step (even when dones is all False) may have overhead - consider only calling when dones.any() is True
@greptile-apps Does that make much difference? dones.any() adds overhead too, and for MLPs reset() is a no-op, so it might actually be slower.
You make a good point. Let me check what reset() actually does in the policy network. You're absolutely right. For MLPs, reset() is typically a no-op (just returns immediately), so calling dones.any() adds an extra operation that might not be worth it. The overhead of checking dones.any() could actually be comparable to or greater than just calling reset() unconditionally, especially since:
- dones.any() requires a reduction operation across all environments
- For non-recurrent policies, reset() is essentially free (just returns)
- Even for recurrent policies, reset() with an all-False mask should be fast
The current implementation is actually cleaner and the performance impact is likely negligible either way. I'll withdraw my comment - unconditionally calling reset() is the right approach here.
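For illustration, the two variants under discussion might look like this (hypothetical helper names; neither is the exact code in the PR, and the relative cost depends on the policy class and device):

```python
import torch


def reset_every_step(policy_nn, dones: torch.Tensor) -> None:
    # What the PR does: call reset unconditionally; a no-op for MLP policies.
    policy_nn.reset(dones)


def reset_only_if_needed(policy_nn, dones: torch.Tensor) -> None:
    # Suggested alternative: skip the call when no episode terminated.
    # Note that dones.any() is itself a reduction, and evaluating it as a
    # Python bool forces a device-to-host sync for CUDA tensors.
    if dones.any():
        policy_nn.reset(dones)
```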
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes address prior review feedback by correcting the changelog category from "Changed" to "Fixed" to properly reflect that this is a bug fix rather than a feature change. The core fix remains unchanged: capturing the dones tensor from env.step() and calling policy_nn.reset(dones) after each step to clear recurrent hidden states when episodes terminate. This ensures recurrent policies (LSTM, GRU) don't carry over memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Updated changelog category from "Changed" to "Fixed" for the recurrent policy reset bug fix entry |
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after env.step() by capturing dones tensor and calling policy_nn.reset(dones) |
Confidence score: 4/5
- This PR is safe to merge with low risk as it addresses a critical bug in recurrent policy evaluation
- Score reflects that the fix correctly resolves the memory persistence issue and changelog now properly categorizes this as a bug fix, though minor performance optimization opportunities remain from previous reviews (calling reset unconditionally on every step even when no episodes terminate, and potential impacts on feedforward policies)
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify the reset mechanism works correctly across both recurrent and feedforward policy types in production scenarios
2 files reviewed, no comments
can you please run
@kellyguo11 Thank you for pointing it out. I ran
Greptile Overview
Greptile Summary
This PR fixes recurrent policy evaluation by adding a reset() call after episode termination. The change addresses issue #3837 where recurrent state persisted between episodes during evaluation, causing inconsistent policy behavior.
Key changes:
- Captures dones output from env.step() (previously ignored) in play.py:187
- Calls policy_nn.reset(dones) to reset recurrent states for terminated episodes in play.py:189
- Version bumped to 0.47.6 with appropriate changelog entry
Issue identified:
- The reset(dones) signature needs verification - the RSL-RL exporter shows reset() with no parameters (exporter.py:103), which may cause a runtime TypeError if the actual policy class doesn't accept the dones argument
Confidence Score: 2/5
- This PR requires verification before merging due to potential signature mismatch
- The fix addresses a real bug, but there's a critical concern about the reset(dones) method signature - the codebase's own exporter shows reset() with no parameters, suggesting a potential runtime TypeError. This needs testing or verification that RSL-RL v3.0.1+ supports this signature before merge.
- scripts/reinforcement_learning/rsl_rl/play.py requires verification that the policy reset method accepts the dones parameter
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 3/5 | Added reset call with dones argument to reset recurrent policy states after episode termination, but potential signature mismatch with RSL-RL reset() method |
| source/isaaclab/config/extension.toml | 5/5 | Version bumped from 0.47.5 to 0.47.6 for bug fix release |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry for recurrent state reset fix in version 0.47.6 |
Sequence Diagram
sequenceDiagram
participant Main as Main Loop
participant Policy as Policy Network
participant Env as Environment
Main->>Policy: policy(obs)
Policy-->>Main: actions
Main->>Env: env.step(actions)
Env-->>Main: obs, _, dones, _
Note over Main,Policy: NEW: Reset on termination
Main->>Policy: policy_nn.reset(dones)
Note over Policy: Resets recurrent state<br/>for terminated episodes
3 files reviewed, 1 comment
- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
logic: verify that the reset() method in RSL-RL ActorCritic/Policy classes accepts a dones parameter - the exporter in source/isaaclab_rl/isaaclab_rl/rsl_rl/exporter.py:103 shows reset() with no parameters, which may cause a TypeError at runtime
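To make the concern concrete, here is a hypothetical stub (not RSL-RL's actual classes) showing why a no-argument reset() would break here and how a signature with an optional mask avoids it:

```python
from typing import Optional

import torch


class RecurrentPolicyStub:
    """Hypothetical recurrent policy used only to illustrate the signature question."""

    def __init__(self, num_envs: int, hidden_dim: int) -> None:
        self.hidden_state = torch.zeros(num_envs, hidden_dim)

    def reset(self, dones: Optional[torch.Tensor] = None) -> None:
        # Accepting an optional mask keeps both call styles valid:
        # reset() clears all hidden states, reset(dones) clears only finished envs.
        if dones is None:
            self.hidden_state.zero_()
        else:
            self.hidden_state[dones.bool()] = 0.0


# A policy whose reset() takes no arguments would raise a TypeError when the
# evaluation loop calls policy_nn.reset(dones) - the mismatch flagged above.
```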
Description
This PR fixes an issue in recurrent policy evaluation where the recurrent state was not being reset after an episode termination. The missing reset caused residual memory to persist between episodes. The fix ensures that reset() is now called during evaluation in play.py for policy networks, including recurrent ones.

Fixes #3837
Type of change
Screenshots
N/A
Checklist
- I have run the pre-commit checks with ./isaaclab.sh --format
- I have updated the changelog and the corresponding version in the config/extension.toml file
- I have added my name to the CONTRIBUTORS.md or my name already exists there