@ajscalers, have you tried running this example on other GPUs? Many libraries enable options such as gradient checkpointing by default, which can substantially reduce memory usage. This is not a bug; for this example, please enable that option.
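To see why gradient checkpointing matters here, a back-of-envelope estimate of activation memory helps. The sketch below assumes a hypothetical 8B-class decoder (32 layers, hidden size 4096) and an assumed ~10 stored tensors per layer without checkpointing; these multipliers are illustrative, not measurements from optimum-habana.

```python
# Back-of-envelope activation-memory estimate for a hypothetical 8B-class
# decoder model. All shape and multiplier choices below are assumptions
# for illustration, not measured values.

BATCH = 64    # per-device batch size from the report
SEQ = 512     # sequence length from the report
HIDDEN = 4096 # assumed hidden size for an 8B-class model
LAYERS = 32   # assumed layer count
BYTES = 2     # bf16

# One layer-boundary activation tensor: batch x seq x hidden in bf16.
boundary = BATCH * SEQ * HIDDEN * BYTES

# Without checkpointing, every intermediate tensor of every layer is kept
# for the backward pass; assume ~10 boundary-sized tensors per layer.
TENSORS_PER_LAYER = 10  # assumption
no_ckpt_gib = LAYERS * TENSORS_PER_LAYER * boundary / 2**30

# With full gradient checkpointing, only one boundary tensor per layer is
# kept; the rest are recomputed during the backward pass.
ckpt_gib = LAYERS * boundary / 2**30

print(f"activations without checkpointing: ~{no_ckpt_gib:.0f} GiB")
print(f"activations with checkpointing:    ~{ckpt_gib:.0f} GiB")
```

Under these assumptions, checkpointing trades a second forward computation per layer for roughly a 10x reduction in stored activations, which is often the difference between fitting in HBM and an out-of-memory error at this batch size.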
Below are the results on Gaudi2 (Gaudi3 has 128 GB of HBM2e memory, roughly 33% more than the 96 GB available on Gaudi2).
System Info
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
docker run -it --runtime habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --entrypoint /bin/bash vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:1.19.0-561
This throws an out-of-memory error on Gaudi3 with 8 devices.
Expected behavior
This should run to completion: an 8B model with a 512 sequence length and a batch size of 64 or more can run on other accelerators with similar per-device memory.
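The per-device budget can be sanity-checked with a rough estimate of training-state memory for an 8B-parameter model under AdamW with mixed precision (bf16 weights and gradients, fp32 master weights and optimizer moments). The bytes-per-parameter breakdown is a standard estimate, not a number measured on Gaudi, and it excludes activations:

```python
# Rough training-state memory for an 8B-parameter model with AdamW in
# mixed precision. Byte counts per parameter are conventional estimates,
# not measurements; activation memory is excluded.

PARAMS = 8e9

bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 4  # fp32 master weights
    + 4  # Adam first moment (m)
    + 4  # Adam second moment (v)
)

total_gb = PARAMS * bytes_per_param / 1e9  # replicated on every device (plain DDP)
per_device_gb = total_gb / 8               # fully sharded (ZeRO-3 style) over 8 devices

print(f"replicated training state: ~{total_gb:.0f} GB per device")
print(f"fully sharded training state: ~{per_device_gb:.0f} GB per device")
```

Under these assumptions the replicated state alone is about 128 GB per device, at the limit of Gaudi3's HBM before any activations, while full sharding across 8 devices brings it down to about 16 GB, which is why the sharding strategy and checkpointing settings, not the hardware, usually decide whether this configuration fits.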