
Any idea to run on 8 V100-32G GPUs? #89

Open
civat opened this issue Feb 14, 2025 · 5 comments

@civat

civat commented Feb 14, 2025

I tried (a rough sketch of these settings is at the end of this comment):

  1. offloading (optimizer state and parameters)
  2. gradient checkpointing
  3. reducing the max prompt length (256 now)
  4. per_device_train_batch_size = 1
  5. num_generations = 1

OOM still occurs.
Any ideas on how to solve this? (No A100 available right now.)

Thank you very much!
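For reference, here is a minimal sketch of the settings above, assuming a TRL-style GRPOConfig/GRPOTrainer interface (the repo's VL-specific trainer may differ). The DeepSpeed JSON path, model id, and identifiers like my_reward_fn and my_dataset are placeholders:

```python
# Sketch only: names, paths, and the model id are assumptions, not the repo's exact script.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo-v100",       # placeholder
    per_device_train_batch_size=1,         # item 4
    gradient_checkpointing=True,           # item 2
    max_prompt_length=256,                 # item 3
    num_generations=1,                     # item 5 (recent TRL versions may require >1)
    fp16=True,                             # V100 has no bf16 support
    deepspeed="ds_zero3_offload.json",     # item 1: ZeRO-3 with optimizer/parameter CPU offload
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-VL-2B-Instruct",     # placeholder model id
    reward_funcs=my_reward_fn,             # hypothetical reward function
    args=training_args,
    train_dataset=my_dataset,              # hypothetical dataset
)
trainer.train()
```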

@Quinn777

Use LoRA.
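A minimal sketch of that suggestion, assuming the policy model is loaded with Hugging Face Transformers and wrapped with PEFT LoRA adapters (the target_modules list is an assumption; adjust it to the projection names in your checkpoint):

```python
# Sketch only: train small low-rank adapters instead of the full weights,
# so most parameters need no gradients or optimizer state.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: trainable share should be small
```

If the training script uses TRL's GRPOTrainer, passing the LoraConfig through its peft_config argument may be the cleaner route.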

@m-Just

m-Just commented Feb 16, 2025

Freezing the vision encoder saves a lot of memory.
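A sketch of that, assuming a Qwen2VLForConditionalGeneration checkpoint where the vision tower is exposed as model.visual (the attribute name is an assumption; check your model):

```python
# Sketch only: freeze the vision encoder so it holds no gradients
# and contributes no optimizer state.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

for param in model.visual.parameters():   # "visual" is the assumed vision-tower attribute
    param.requires_grad_(False)
model.visual.eval()                        # also disables dropout in the frozen part
```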

@happyz123456789

Could you tell me how you solved it?

@civat
Author

civat commented Feb 22, 2025

> Could you tell me how you solved it?

Use PyTorch's torch.utils.checkpoint

Most parts of the full model do not support gradient checkpointing even when gradient_checkpointing=True is set in the config file, so you need to wrap the expensive computations in modeling_qwen2_vl.py with torch.utils.checkpoint.
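A minimal sketch of that kind of edit, with a hypothetical stack of transformer blocks standing in for one of the expensive computations in modeling_qwen2_vl.py (which block to wrap depends on where the memory actually goes):

```python
# Sketch only: recompute each block's activations during backward
# instead of storing them, trading compute for memory.
import torch
from torch.utils.checkpoint import checkpoint

def forward_blocks(hidden_states, blocks):
    """Run a stack of transformer blocks, checkpointing each one while training."""
    for block in blocks:
        if torch.is_grad_enabled():
            # use_reentrant=False is the recommended non-reentrant variant
            hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)
    return hidden_states
```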

After that, the training memory at the first step is about 15 GB, which is a huge reduction.
But there is a new issue: I found that the memory keeps increasing over the training steps, and OOM occurs at the 16th step. I do not know why; I am trying to profile the memory usage (see the logging sketch below).
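For that kind of profiling, a simple per-step log of the CUDA allocator statistics can already show whether something accumulates across steps (the helper below is a generic sketch, not tied to any particular trainer):

```python
# Sketch only: print allocated/reserved/peak CUDA memory once per training step.
import torch

def log_cuda_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"step {step}: allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Call once per training step; optionally reset the peak counter between steps:
# torch.cuda.reset_peak_memory_stats(device)
```

If the allocated memory itself keeps growing, something is likely being retained together with its computation graph; torch.cuda.memory_summary() gives a more detailed breakdown.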

@happyz123456789


Thank you very much!
