
Fixed eval.py on MPS #702

Merged
merged 5 commits into huggingface:main from mps_eval_bug_fix on Feb 13, 2025

Conversation

IliaLarchenko
Contributor

What this does

I encountered a very weird bug while training/evaluating a model on the mps device: unexpectedly, observation.state values become completely random float numbers that mess up the whole evaluation.

It turned out to happen after this line:

observation = {key: observation[key].to(device, non_blocking=True) for key in observation}

non_blocking=True is not supported by MPS devices and can result in random numbers. There are multiple issues related to this in different libraries. While writing this, I also found two open issues here with the same problem:

#475
#496

This fix should solve them both.

My fix is simple: use non_blocking=True only on CUDA devices.

.to(device, non_blocking=True) is also used in train.py, though I don't see the same issue in training. This is probably because the error happens only with non-contiguous tensors, which is not the case for training (but I didn't look into it; maybe train.py also requires this fix).
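For reference, here is a minimal sketch of the pattern (assuming device is a torch.device, so device.type is available; this is an illustration, not necessarily the exact merged diff):

observation = {
    # Only request an asynchronous host-to-device copy on CUDA;
    # MPS does not handle non_blocking=True reliably.
    key: observation[key].to(device, non_blocking=device.type == "cuda")
    for key in observation
}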

How it was tested

It can be hard to reproduce, as the error appears randomly, but just try to train or evaluate any model on an MPS device. (I used the Pusht dataset, but in the other issues it happened with Aloha.)
For example, try to evaluate the diffusion_pusht model on mps: https://huggingface.co/lerobot/diffusion_pusht

python lerobot/scripts/eval.py \
    --policy.path=lerobot/diffusion_pusht \
    --env.type=pusht \
    --eval.n_episodes=10 \
    --eval.batch_size=10 \
    --device=mps \
    --use_amp=false

It fails without the fix (either silently, by getting a 0 success rate, or by returning NaN at some step) but works well with the fix.
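If you want a more minimal check outside of the lerobot scripts, a hypothetical snippet like the one below (not part of this PR) can be used; whether the corruption actually shows up depends on the PyTorch version and hardware:

import torch

# Compare a blocking and a non-blocking copy of a non-contiguous CPU tensor
# to the MPS device; the non-blocking copy is the one suspected to be corrupted.
if torch.backends.mps.is_available():
    device = torch.device("mps")
    cpu_tensor = torch.randn(1024, 1024)[:, ::2]  # slicing makes the view non-contiguous
    blocking = cpu_tensor.to(device)
    non_blocking = cpu_tensor.to(device, non_blocking=True)
    print("values match:", torch.equal(blocking.cpu(), non_blocking.cpu()))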

@Cadene
Collaborator

Cadene commented Feb 10, 2025

OMG. We knew there was a bug but we didn't know why. Thanks!

@IliaLarchenko
Contributor Author

OMG. We knew there was a bug but we didn't know why. Thanks!

Yes, it was tough to find. It silently and randomly replaces some numbers in the tensor with completely meaningless ones, and the training/validation can fail only a few steps later. I literally debugged it line by line to find the issue.

@aliberts (Collaborator) left a comment


Before | After comparison video: episode_1.mp4

I think this is the one-liner fix of the year.
Congrats @IliaLarchenko, this one was bugging us for a while! 🤗

@aliberts aliberts merged commit c574eb4 into huggingface:main Feb 13, 2025
6 checks passed
@IliaLarchenko IliaLarchenko deleted the mps_eval_bug_fix branch February 14, 2025 02:07
@jgrizou

jgrizou commented Feb 14, 2025

Thanks @IliaLarchenko for this fix. I tested it on pusht with device = "mps". The policy now works with the fix but would not work before.

That is also likely affecting:

  • 2_evaluate_pretrained_policy.py (the default policy lerobot/diffusion_pusht was not working, same as in the example above)

  • 3_train_policy.py and train.py (training initially failed to produce a working policy). A more thorough check is needed, but I replaced all the non_blocking=True with non_blocking=device.type == "cuda".

I also tried to add this at the top of the eval and train files:

import os
os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'

as per https://www.reddit.com/r/pytorch/comments/1c3kwwg/how_do_i_fix_the_mps_notimplemented_error_for_m1/

It seems to help, but it is hard to tell. The pusht policy I trained is not as good as the lerobot/pusht one.

@jgrizou

jgrizou commented Feb 14, 2025

@aliberts snap, I started writing this post yesterday; have a look into:

for key in batch:
    if isinstance(batch[key], torch.Tensor):
        batch[key] = batch[key].to(device, non_blocking=device.type == "cuda")

Maybe more. And I'm unsure of the effect in the training stack, as it is harder to analyse.
