
Qwen2.5-VL-3B OOM #107

Open
Liuziyu77 opened this issue Feb 17, 2025 · 4 comments
@Liuziyu77
torchrun --nnodes $NNODES --nproc_per_node $GPUS_PER_NODE --node_rank $SLURM_NODEID --master_addr $(scontrol show hostname $SLURM_NODELIST | head -n1) --master_port ${MASTER_PORT} ./src/open_r1/grpo.py \
    --output_dir ${SAVE_PATH}  \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --dataset_name ${DATA_PATH} \
    --deepspeed ./local_scripts/zero3.json \
    --max_prompt_length 1024 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 1 \
    --run_name Qwen2-VL-2B-Debug \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8

When I try to train Qwen2.5-VL-3B on 8×A100 with the command above, it runs out of memory. Why is that?
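
For reference: with GRPO, --num_generations 8 means every prompt is expanded into 8 sampled completions, so the per-device memory during generation and the loss pass is far larger than --per_device_train_batch_size 1 suggests, and --gradient_checkpointing is disabled here. One common first mitigation is ZeRO-3 CPU offload. Below is a minimal sketch using DeepSpeed's standard ZeRO-3 config schema; the file name zero3_offload.json is hypothetical (this is not the repo's actual zero3.json), and the root cause here is not confirmed.

import json

# Hypothetical path; point --deepspeed at wherever you write this file.
CONFIG_PATH = "./local_scripts/zero3_offload.json"

zero3_offload = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Offload optimizer state and parameters to CPU RAM to cut GPU memory.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    # "auto" lets the HF Trainer fill these in from its own arguments.
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

with open(CONFIG_PATH, "w") as f:
    json.dump(zero3_offload, f, indent=2)

Enabling --gradient_checkpointing true and lowering --num_generations or --max_pixels in the command above are the other obvious knobs to try.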

@wnzhyee

wnzhyee commented Feb 17, 2025

Same problem. Judging from the logs, it may be caused by flash_attention_2 not being used correctly, but I don't know how to change it.
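
A minimal sanity check for that hypothesis, assuming a transformers version with Qwen2.5-VL support and the flash-attn package installed: load the model the way the trainer does and inspect which attention implementation was actually selected, since a silent fallback would use a more memory-hungry attention path.

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# flash-attn requires fp16/bf16; loading in fp32 would disable it.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Should print "flash_attention_2"; "sdpa" or "eager" means it fell back.
print(model.config._attn_implementation)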

@lzk9508
Copy link

lzk9508 commented Feb 20, 2025


Same problem.

@YEXINGZHE54

Same problem. Why does it consume all the memory even for Qwen2-VL-2B-Instruct?

@tcy6

tcy6 commented Feb 23, 2025

Same question. I suspect there may be a bug somewhere in the code; with settings this small, it shouldn't OOM.
