
Any idea to run on 8 V100-32G GPUs? #89

Open
civat opened this issue Feb 14, 2025 · 5 comments

@civat

civat commented Feb 14, 2025

I tried (a rough sketch of these settings is at the end of this comment):

  1. offloading (optimizer state and parameters)
  2. gradient checkpointing
  3. reducing the max prompt length (256 now)
  4. per_device_train_batch_size = 1
  5. num_generations = 1

OOM still occurs.
Any ideas on how to solve this? (No A100 available right now.)

Thank you very much!
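For reference, here is a minimal sketch of the settings above, assuming a TRL-style GRPOConfig/GRPOTrainer interface (the repo's VL-specific trainer may differ). The DeepSpeed JSON path, model id, and identifiers like my_reward_fn and my_dataset are placeholders:

```python
# Sketch only: names, paths, and the model id are assumptions, not the repo's exact script.
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen2-vl-grpo-v100",       # placeholder
    per_device_train_batch_size=1,         # item 4
    gradient_checkpointing=True,           # item 2
    max_prompt_length=256,                 # item 3
    num_generations=1,                     # item 5 (recent TRL versions may require >1)
    fp16=True,                             # V100 has no bf16 support
    deepspeed="ds_zero3_offload.json",     # item 1: ZeRO-3 with optimizer/parameter CPU offload
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-VL-2B-Instruct",     # placeholder model id
    reward_funcs=my_reward_fn,             # hypothetical reward function
    args=training_args,
    train_dataset=my_dataset,              # hypothetical dataset
)
trainer.train()
```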

@Quinn777

Use LoRA.
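A minimal sketch of that suggestion, assuming the policy model is loaded with Hugging Face Transformers and wrapped with PEFT LoRA adapters (the target_modules list is an assumption; adjust it to the projection names in your checkpoint):

```python
# Sketch only: train small low-rank adapters instead of the full weights,
# so most parameters need no gradients or optimizer state.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: trainable share should be small
```

If the training script uses TRL's GRPOTrainer, passing the LoraConfig through its peft_config argument may be the cleaner route.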

@m-Just

m-Just commented Feb 16, 2025

Freezing the vision encoder saves a lot of memory.
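A sketch of that, assuming a Qwen2VLForConditionalGeneration checkpoint where the vision tower is exposed as model.visual (the attribute name is an assumption; check your model):

```python
# Sketch only: freeze the vision encoder so it holds no gradients
# and contributes no optimizer state.
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct", torch_dtype="auto"
)

for param in model.visual.parameters():   # "visual" is the assumed vision-tower attribute
    param.requires_grad_(False)
model.visual.eval()                        # also disables dropout in the frozen part
```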

@happyz123456789

Could you tell me how you solved it?

@civat
Author

civat commented Feb 22, 2025

> Could you tell me how you solved it?

Use PyTorch's torch.utils.checkpoint

Most parts of the full model do not support gradient checkpointing even when gradient_checkpointing=True is set in the config file, so you need to wrap the expensive computations in modeling_qwen2_vl.py with torch.utils.checkpoint.
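A minimal sketch of that kind of edit, with a hypothetical stack of transformer blocks standing in for one of the expensive computations in modeling_qwen2_vl.py (which block to wrap depends on where the memory actually goes):

```python
# Sketch only: recompute each block's activations during backward
# instead of storing them, trading compute for memory.
import torch
from torch.utils.checkpoint import checkpoint

def forward_blocks(hidden_states, blocks):
    """Run a stack of transformer blocks, checkpointing each one while training."""
    for block in blocks:
        if torch.is_grad_enabled():
            # use_reentrant=False is the recommended non-reentrant variant
            hidden_states = checkpoint(block, hidden_states, use_reentrant=False)
        else:
            hidden_states = block(hidden_states)
    return hidden_states
```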

After that, the training memory at the first step is about 15 GB, which is a huge reduction.
But there is a new issue: I found that the memory keeps increasing over the training steps, and OOM occurs at the 16th step. I do not know why; I am trying to profile the memory usage (see the logging sketch below).
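For that kind of profiling, a simple per-step log of the CUDA allocator statistics can already show whether something accumulates across steps (the helper below is a generic sketch, not tied to any particular trainer):

```python
# Sketch only: print allocated/reserved/peak CUDA memory once per training step.
import torch

def log_cuda_memory(step, device=0):
    allocated = torch.cuda.memory_allocated(device) / 2**30
    reserved = torch.cuda.memory_reserved(device) / 2**30
    peak = torch.cuda.max_memory_allocated(device) / 2**30
    print(f"step {step}: allocated={allocated:.2f} GiB "
          f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")

# Call once per training step; optionally reset the peak counter between steps:
# torch.cuda.reset_peak_memory_stats(device)
```

If the allocated memory itself keeps growing, something is likely being retained together with its computation graph; torch.cuda.memory_summary() gives a more detailed breakdown.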

@happyz123456789


Thank you very much!
