Environment seeding

Hi, thanks for your great work!

I have a question regarding reproducibility of environment initializations.

If I set a seed and specify the number of environments I want to create, is it possible to always obtain the same initial conditions for those environments?

The reason I’m asking is that I’m comparing a baseline against my own implementation, and for a fair evaluation I’d like them to start under identical conditions. At least for an evaluation phase of $N$ steps (without resets in between), it would be helpful to ensure both runs see the same setup.

Is there currently a way to “copy” or reproduce the initial conditions, or would I need to implement a deterministic reset mechanism myself?
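To make the question concrete, here is a minimal sketch of the kind of deterministic reset I have in mind. It uses only the Python standard library, and `make_initial_states`, `base_seed`, and `state_dim` are hypothetical names for illustration, not part of the library's API. The idea is that each environment index gets its own random stream derived from a shared base seed, so two separate runs with the same seed and the same number of environments would start from identical states.

```python
import random
from typing import List


def make_initial_states(
    base_seed: int, n_envs: int, state_dim: int = 2
) -> List[List[float]]:
    """Hypothetical helper: one reproducible initial state per environment."""
    states = []
    for env_idx in range(n_envs):
        # Derive a separate, reproducible stream for each environment index
        # so that env i always receives the same draw for a given base seed.
        rng = random.Random(base_seed * 10_007 + env_idx)
        states.append([rng.uniform(-1.0, 1.0) for _ in range(state_dim)])
    return states


# Two independent "runs" (e.g. baseline vs. my implementation) with the same seed.
baseline_init = make_initial_states(base_seed=0, n_envs=4)
mine_init = make_initial_states(base_seed=0, n_envs=4)
print(baseline_init == mine_init)
```

If something equivalent already exists for the real environments, I would rather use that than maintain my own reset logic.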

Thanks again!

Here is an example of my code:
```python
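from typing import List, Tuple

import cv2
import numpy as np
import torch
from torchrl.envs import TransformedEnv

# Note: _iqm_and_iqrstd_1d is a helper defined elsewhere in my code (not shown).


def rollout_policy(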
    env: TransformedEnv,
    policy: torch.nn.Module,
    steps: int,
    video_path: str | None = None,
) -> Tuple[List[float], List[float]]:
    """
    Generic rollout function for both learned and heuristic policies.
    Runs `policy` in `env` for `steps` timesteps, returns mean/std reward traces.
    """

    def save_video(frames: List[np.ndarray], filename: str, fps: int = 30):
        if not frames:
            print("[warning] no frames to write", filename)
            return
        h, w, _ = frames[0].shape
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        if not filename.endswith(".mp4"):
            filename += ".mp4"
        vout = cv2.VideoWriter(filename, fourcc, fps, (w, h))
        for fr in frames:
            vout.write(fr)
        vout.release()

    # video setup
    frames = []
    callback = None
    if video_path is not None:
        callback = lambda e, td: frames.append(e.render(mode="rgb_array"))

    with torch.no_grad():
        td = env.rollout(
            max_steps=steps,
            policy=policy,
            callback=callback,
            break_when_any_done=False,
            auto_cast_to_device=True,
        )

    # rewards: [n_envs, T, n_agents, 1] (time is dim 1, matching the loop below)
    rewards = td.get(("next",) + env.reward_key, None)
    if rewards is None:
        rewards = td.get(env.reward_key, None)
    if rewards is None:
        raise RuntimeError("No rewards found in rollout")

    mean_trace, std_trace = [], []
    for t in range(rewards.shape[1]):  # loop over timesteps
        r_t = rewards[:, t, :]  # [n_envs, n_agents]
        r_tot = r_t.mean(dim=1).squeeze(-1)  # [n_envs]
        m, s = _iqm_and_iqrstd_1d(r_tot.cpu().numpy())
        mean_trace.append(float(m))
        std_trace.append(float(s))

    if video_path is not None:
        save_video(frames, video_path, fps=30)

    return mean_trace, std_trace
```


Environment seeding #163

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Environment seeding #163

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions