Resets recurrent state after episode termination in RSL-RL play.py
#3838
Conversation
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation pipeline where recurrent policy networks (e.g., LSTMs, GRUs) retained hidden states across episode boundaries during evaluation. The evaluation script (play.py) was missing the necessary logic to reset recurrent states when episodes terminated, causing residual memory from previous episodes to contaminate the policy's behavior at the start of new episodes. The fix captures the dones tensor from the environment step and passes it to policy_nn.reset(), which conditionally resets recurrent states for terminated environments. This brings the evaluation loop into parity with the training loop, which already handled this correctly. The changes include appropriate version bumping (patch increment to 0.47.2) and changelog documentation.
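For context, a minimal sketch of the evaluation loop described above (not the verbatim play.py code; `env`, `policy`, and `policy_nn` are stand-ins for the wrapped RSL-RL environment, the inference policy, and its underlying actor-critic module):

```python
# Sketch only: illustrates the reset-on-termination pattern, not the exact play.py change.
import torch


def run_eval_loop(env, policy, policy_nn, max_steps: int = 1000) -> None:
    """Roll out a policy, clearing recurrent memory whenever episodes terminate."""
    obs = env.get_observations()
    for _ in range(max_steps):
        with torch.inference_mode():
            actions = policy(obs)
        # Capture the dones tensor that the old script discarded.
        obs, _, dones, _ = env.step(actions)
        # Reset hidden states only for environments whose episode just ended;
        # for feedforward policies this is expected to be a no-op.
        policy_nn.reset(dones)
```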
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added logic to capture dones tensor and call policy_nn.reset(dones) to properly reset recurrent states between episodes during evaluation |
| source/isaaclab/config/extension.toml | 5/5 | Incremented patch version from 0.47.1 to 0.47.2 to reflect the bug fix |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added release notes for version 0.47.2 documenting the recurrent policy state reset fix |
Confidence score: 4/5
- This PR addresses a clear bug with a focused fix that aligns with established training behavior, making it safe to merge with minimal risk.
- Score is 4/5 because while the fix is correct and necessary, there's a minor concern: the reset() call may not be implemented or tested for all policy architectures, and the change assumes policy_nn always has a reset() method that accepts a dones tensor - this could potentially cause issues with non-recurrent policies or custom architectures if not handled properly.
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that the reset() method is available on all policy types used in evaluation, particularly ensuring non-recurrent policies handle this gracefully.
Sequence Diagram
sequenceDiagram
participant User
participant main
participant AppLauncher
participant gym
participant env
participant policy
participant policy_nn
User->>main: Execute play.py
main->>AppLauncher: Launch simulation app
main->>gym: gym.make(task, cfg, render_mode)
gym->>env: Create environment
main->>env: wrap with RslRlVecEnvWrapper
main->>main: Load runner and checkpoint
main->>main: Get inference policy
main->>main: Extract policy neural network
main->>env: get_observations()
env->>main: Return obs
loop While simulation running
main->>policy: policy(obs)
policy->>main: Return actions
main->>env: env.step(actions)
env->>main: Return obs, rewards, dones, info
Note over main,policy_nn: NEW: Reset recurrent states for terminated episodes
main->>policy_nn: policy_nn.reset(dones)
policy_nn->>main: Reset complete
end
main->>env: env.close()
main->>User: Simulation complete
3 files reviewed, no comments
@ClemensSchwarke can you take a look at this? I think the change looks good to me.
Thanks!
Greptile Overview
Greptile Summary
This PR fixes a critical bug in the RSL-RL evaluation script (play.py) where recurrent neural network policies (LSTM/GRU) were not properly resetting their hidden states between episodes. The codebase uses the RSL-RL framework for reinforcement learning, where training and evaluation follow different code paths. During training, episode resets are handled automatically, but the evaluation script was missing this logic. The fix captures the dones signal from env.step() and calls policy_nn.reset(dones) to clear recurrent states when episodes terminate. This ensures each episode starts with a clean memory state, preventing information leakage across episode boundaries. Version metadata was updated from 0.47.2 to 0.47.3 to reflect this patch-level bug fix.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset logic after step, capturing dones and calling policy_nn.reset(dones) |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 for bug fix release |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry documenting the recurrent state reset fix for version 0.47.3 |
Confidence score: 4/5
- This PR is safe to merge with minimal risk, addressing a specific bug in recurrent policy evaluation
- Score reflects a well-targeted fix with appropriate version bumping and documentation, though minor concerns exist about the interaction between policy_nn.reset() and the environment's reset mechanism, and potential edge cases with partial resets in vectorized environments
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure the policy_nn.reset(dones) call is compatible with all policy network types and vectorized environment configurations
Sequence Diagram
sequenceDiagram
participant User
participant play.py as play.py Script
participant AppLauncher
participant Gym as Gymnasium
participant Env as Environment
participant Wrapper as RslRlVecEnvWrapper
participant Runner as OnPolicyRunner/DistillationRunner
participant Policy as policy_nn (Actor-Critic)
participant SimApp as Simulation App
User->>play.py: Execute play.py with task config
play.py->>AppLauncher: Initialize AppLauncher(args_cli)
AppLauncher->>SimApp: Launch Omniverse app
play.py->>play.py: Parse hydra config for env_cfg & agent_cfg
play.py->>play.py: Determine checkpoint path (resume_path)
play.py->>Gym: gym.make(task, cfg=env_cfg, render_mode)
Gym->>Env: Create environment instance
play.py->>Wrapper: RslRlVecEnvWrapper(env)
play.py->>Runner: Create OnPolicyRunner/DistillationRunner
play.py->>Runner: runner.load(resume_path)
Runner->>Policy: Load policy checkpoint
play.py->>Runner: policy = runner.get_inference_policy()
play.py->>Env: obs = env.get_observations()
loop Simulation Loop (while simulation_app.is_running())
play.py->>Policy: actions = policy(obs)
play.py->>Env: obs, _, dones, _ = env.step(actions)
Note over play.py,Policy: Bug Fix: Reset recurrent states on termination
play.py->>Policy: policy_nn.reset(dones)
alt Video Recording Enabled
play.py->>play.py: Increment timestep counter
alt timestep == video_length
play.py->>play.py: Break loop
end
end
end
play.py->>Env: env.close()
play.py->>SimApp: simulation_app.close()
3 files reviewed, 2 comments
Greptile Overview
Greptile Summary
This PR addresses a critical bug in the RSL-RL evaluation workflow where recurrent policy networks (LSTM/GRU) were not resetting hidden states after episode terminations during inference. The fix modifies play.py to capture the dones signal from env.step() and explicitly call policy_nn.reset(dones) to clear recurrent memory for terminated episodes. This ensures proper episode boundary handling during policy evaluation, preventing residual memory from previous episodes from affecting subsequent rollouts. The changes are minimal and focused: capturing the previously ignored third return value (dones) from env.step() and adding a single reset call. The fix aligns with standard RL evaluation practices where recurrent states should be cleared at episode boundaries, similar to how the training workflow in train.py likely handles resets through the OnPolicyRunner wrapper.
PR Description Notes:
- The PR checklist indicates tests were added, but no test files appear in the changed files list
- Documentation checkbox is unchecked despite this being a behavioral fix that may warrant documentation
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after episode termination by capturing dones and calling policy_nn.reset(dones) |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry for version 0.47.3 documenting the recurrent policy reset fix |
Confidence score: 4/5
- This PR is safe to merge with minimal risk as it fixes a clear bug in recurrent policy evaluation with a straightforward solution
- Score reflects that while the fix is correct, the unconditional call to reset() on every step may cause issues if feedforward policies don't implement a no-op reset method, and the PR claims tests were added but they're not visible in the changed files
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify that all policy types in the codebase properly handle the reset() method call
Sequence Diagram
sequenceDiagram
participant User
participant PlayScript as play.py
participant Env as Environment
participant Policy as Policy Network
participant Simulation as SimulationApp
User->>PlayScript: Execute play.py
PlayScript->>Env: gym.make(task)
PlayScript->>Env: get_observations()
Env-->>PlayScript: obs
loop While simulation running
PlayScript->>Policy: policy(obs)
Policy-->>PlayScript: actions
PlayScript->>Env: step(actions)
Env-->>PlayScript: obs, _, dones, _
PlayScript->>Policy: reset(dones)
Note over Policy: Reset recurrent states<br/>for terminated episodes
alt Video recording enabled
PlayScript->>PlayScript: Check if video complete
alt Video length reached
PlayScript->>PlayScript: break
end
end
end
PlayScript->>Env: close()
PlayScript->>Simulation: close()
3 files reviewed, no comments
Looks good, thanks a lot!
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review. The developer has addressed previous feedback by updating the CHANGELOG category from "Changed" to "Fixed" (as appropriate for a bug fix), ensuring proper semantic versioning with the patch version increment to 0.47.3. The core fix remains targeted and minimal: capturing the dones signal from env.step() and passing it to policy_nn.reset(dones) to clear recurrent states selectively for terminated episodes. This ensures recurrent policies (LSTM/GRU) don't carry residual memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.3 entry documenting the recurrent state reset bug fix under "Fixed" section |
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Captures dones from env.step() and calls policy_nn.reset(dones) to reset recurrent states after episode termination |
| source/isaaclab/config/extension.toml | 5/5 | Bumped patch version from 0.47.2 to 0.47.3 following semantic versioning for bug fix |
Confidence score: 4/5
- This PR is safe to merge with only minor considerations regarding non-recurrent policy performance
- Score reflects that the core bug fix is correct and properly documented, but one previous concern about calling reset unconditionally on feedforward policies remains unaddressed in the visible changes
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to ensure policy_nn.reset() is either a no-op for feedforward policies or has negligible performance impact
3 files reviewed, no comments
Greptile Overview
Greptile Summary
This PR fixes a bug in the RSL-RL evaluation workflow where recurrent policy hidden states were not being reset between episode terminations. The fix modifies play.py to capture the dones tensor from env.step() and pass it to policy_nn.reset(dones), ensuring that recurrent networks (LSTMs, GRUs) start each new episode with clean hidden states rather than carrying over residual memory from previous episodes. This aligns the evaluation behavior with the standard training loop pattern where policies are reset on episode boundaries. The change also includes a changelog entry documenting this fix as version 0.47.4.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added capture of dones from env.step() and call to policy_nn.reset(dones) to reset recurrent states on episode termination |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added version 0.47.4 changelog entry documenting the recurrent policy reset bug fix |
Confidence score: 4/5
- This PR addresses a legitimate bug with a targeted fix that follows standard RL patterns for handling recurrent policies
- Score reflects that while the fix is correct, calling reset unconditionally on every step for all policy types may have minor performance implications, and the changelog entry uses "Fixed" correctly but has a minor date inconsistency (shows October instead of current date)
- The main file requiring attention is play.py, to verify that calling reset on every step is acceptable for all policy architectures in the codebase
Sequence Diagram
sequenceDiagram
participant User
participant play.py
participant gym.make
participant RslRlVecEnvWrapper
participant OnPolicyRunner/DistillationRunner
participant policy
participant policy_nn
participant env
User->>play.py: Run play.py script
play.py->>gym.make: Create environment
gym.make-->>play.py: Return env
play.py->>RslRlVecEnvWrapper: Wrap environment
RslRlVecEnvWrapper-->>play.py: Return wrapped env
play.py->>OnPolicyRunner/DistillationRunner: Initialize runner
play.py->>OnPolicyRunner/DistillationRunner: load(resume_path)
play.py->>OnPolicyRunner/DistillationRunner: get_inference_policy()
OnPolicyRunner/DistillationRunner-->>play.py: Return policy
play.py->>env: get_observations()
env-->>play.py: Return obs
loop While simulation running
play.py->>policy: policy(obs)
policy-->>play.py: Return actions
play.py->>env: step(actions)
env-->>play.py: Return obs, rewards, dones, info
Note over play.py,policy_nn: Bug Fix: Reset recurrent states on termination
play.py->>policy_nn: reset(dones)
Note over policy_nn: Resets hidden states for<br/>terminated episodes
end
play.py->>env: close()
2 files reviewed, 1 comment
- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
style: calling reset() on every step (even when dones is all False) may have overhead - consider only calling when dones.any() is True
@greptile-apps Does that make much difference? dones.any() adds overhead too, and for MLPs reset() is a no-op, so it might actually be slower.
You make a good point. Let me check what reset() actually does in the policy network. You're absolutely right. For MLPs, reset() is typically a no-op (just returns immediately), so calling dones.any() adds an extra operation that might not be worth it. The overhead of checking dones.any() could actually be comparable to or greater than just calling reset() unconditionally, especially since:
- dones.any() requires a reduction operation across all environments
- For non-recurrent policies, reset() is essentially free (just returns)
- Even for recurrent policies, reset() with an all-False mask should be fast
The current implementation is actually cleaner and the performance impact is likely negligible either way. I'll withdraw my comment - unconditionally calling reset() is the right approach here.
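For illustration, the two variants under discussion might look like this (hypothetical helper names; neither is the exact code in the PR, and the relative cost depends on the policy class and device):

```python
import torch


def reset_every_step(policy_nn, dones: torch.Tensor) -> None:
    # What the PR does: call reset unconditionally; a no-op for MLP policies.
    policy_nn.reset(dones)


def reset_only_if_needed(policy_nn, dones: torch.Tensor) -> None:
    # Suggested alternative: skip the call when no episode terminated.
    # Note that dones.any() is itself a reduction, and evaluating it as a
    # Python bool forces a device-to-host sync for CUDA tensors.
    if dones.any():
        policy_nn.reset(dones)
```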
Greptile Overview
Greptile Summary
This review covers only the changes made since the last review, not the entire PR. The latest changes address prior review feedback by correcting the changelog category from "Changed" to "Fixed" to properly reflect that this is a bug fix rather than a feature change. The core fix remains unchanged: capturing the dones tensor from env.step() and calling policy_nn.reset(dones) after each step to clear recurrent hidden states when episodes terminate. This ensures recurrent policies (LSTM, GRU) don't carry over memory between episodes during evaluation in the RSL-RL play.py workflow.
Important Files Changed
| Filename | Score | Overview |
|---|---|---|
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Updated changelog category from "Changed" to "Fixed" for the recurrent policy reset bug fix entry |
| scripts/reinforcement_learning/rsl_rl/play.py | 4/5 | Added recurrent state reset after env.step() by capturing dones tensor and calling policy_nn.reset(dones) |
Confidence score: 4/5
- This PR is safe to merge with low risk as it addresses a critical bug in recurrent policy evaluation
- Score reflects that the fix correctly resolves the memory persistence issue and changelog now properly categorizes this as a bug fix, though minor performance optimization opportunities remain from previous reviews (calling reset unconditionally on every step even when no episodes terminate, and potential impacts on feedforward policies)
- Pay close attention to scripts/reinforcement_learning/rsl_rl/play.py to verify the reset mechanism works correctly across both recurrent and feedforward policy types in production scenarios
2 files reviewed, no comments
can you please run
@kellyguo11 Thank you for pointing it out. I ran
Greptile Overview
Greptile Summary
This PR fixes recurrent policy evaluation by adding a reset() call after episode termination. The change addresses issue #3837 where recurrent state persisted between episodes during evaluation, causing inconsistent policy behavior.
Key changes:
- Captures dones output from env.step() (previously ignored) in play.py:187
- Calls policy_nn.reset(dones) to reset recurrent states for terminated episodes in play.py:189
- Version bumped to 0.47.6 with appropriate changelog entry
Issue identified:
- The reset(dones) signature needs verification - the RSL-RL exporter shows reset() with no parameters (exporter.py:103), which may cause a runtime TypeError if the actual policy class doesn't accept the dones argument
Confidence Score: 2/5
- This PR requires verification before merging due to potential signature mismatch
- The fix addresses a real bug, but there's a critical concern about the reset(dones) method signature - the codebase's own exporter shows reset() with no parameters, suggesting a potential runtime TypeError. This needs testing or verification that RSL-RL v3.0.1+ supports this signature before merge.
- scripts/reinforcement_learning/rsl_rl/play.py requires verification that the policy reset method accepts the dones parameter
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| scripts/reinforcement_learning/rsl_rl/play.py | 3/5 | Added reset call with dones argument to reset recurrent policy states after episode termination, but potential signature mismatch with RSL-RL reset() method |
| source/isaaclab/config/extension.toml | 5/5 | Version bumped from 0.47.5 to 0.47.6 for bug fix release |
| source/isaaclab/docs/CHANGELOG.rst | 5/5 | Added changelog entry for recurrent state reset fix in version 0.47.6 |
Sequence Diagram
sequenceDiagram
participant Main as Main Loop
participant Policy as Policy Network
participant Env as Environment
Main->>Policy: policy(obs)
Policy-->>Main: actions
Main->>Env: env.step(actions)
Env-->>Main: obs, _, dones, _
Note over Main,Policy: NEW: Reset on termination
Main->>Policy: policy_nn.reset(dones)
Note over Policy: Resets recurrent state<br/>for terminated episodes
3 files reviewed, 1 comment
- obs, _, _, _ = env.step(actions)
+ obs, _, dones, _ = env.step(actions)
+ # reset recurrent states for episodes that have terminated
+ policy_nn.reset(dones)
logic: verify that the reset() method in RSL-RL ActorCritic/Policy classes accepts a dones parameter - the exporter in source/isaaclab_rl/isaaclab_rl/rsl_rl/exporter.py:103 shows reset() with no parameters, which may cause a TypeError at runtime
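To make the concern concrete, here is a hypothetical stub (not RSL-RL's actual classes) showing why a no-argument reset() would break here and how a signature with an optional mask avoids it:

```python
from typing import Optional

import torch


class RecurrentPolicyStub:
    """Hypothetical recurrent policy used only to illustrate the signature question."""

    def __init__(self, num_envs: int, hidden_dim: int) -> None:
        self.hidden_state = torch.zeros(num_envs, hidden_dim)

    def reset(self, dones: Optional[torch.Tensor] = None) -> None:
        # Accepting an optional mask keeps both call styles valid:
        # reset() clears all hidden states, reset(dones) clears only finished envs.
        if dones is None:
            self.hidden_state.zero_()
        else:
            self.hidden_state[dones.bool()] = 0.0


# A policy whose reset() takes no arguments would raise a TypeError when the
# evaluation loop calls policy_nn.reset(dones) - the mismatch flagged above.
```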
Description
This PR fixes an issue in recurrent policy evaluation where the recurrent state was not being reset after an episode termination. The missing reset caused residual memory to persist between episodes. The fix ensures that reset() is now called during evaluation in play.py for policy networks, including recurrent ones.

Fixes #3837
Type of change
Screenshots
N/A
Checklist
- I have run the pre-commit checks with ./isaaclab.sh --format
- I have updated the changelog and the corresponding version in the config/extension.toml file
- I have added my name to the CONTRIBUTORS.md or my name already exists there