support prompt_logp_compute_kv_cache in no vllm trainer #82

Open · wants to merge 8 commits into base: main

Conversation


@Yukino256 commented Feb 13, 2025

Solves this issue: #71. The code is mainly copied and modified from andyl98:grpo-vram-optimization.

In my test, GRPO runs at least 3x faster, without OOM, on the Qwen2VL-7B model:
[screenshot]

Since I have never successfully run the vLLM version, I can't modify the vllm_trainer code.
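For context, the core idea (judging from the traceback discussed later in this thread, and from andyl98:grpo-vram-optimization) is a _get_per_token_logps that takes num_logits_to_keep and mini_batch_size, so only the completion logits are materialized and only a few sequences are scored per forward pass. A minimal sketch, assuming one image per sequence and a transformers version whose Qwen2-VL forward accepts num_logits_to_keep; this is not the PR's exact code:

import torch
import torch.nn.functional as F

def _get_per_token_logps(model, input_ids, attention_mask, pixel_values,
                         image_grid_thw, num_logits_to_keep, mini_batch_size):
    # Qwen2-VL packs all image patches into one tensor; grid_thw gives the
    # patch count per image (t * h * w). Assumes one image per sequence.
    patches_per_image = image_grid_thw.prod(dim=-1)
    patch_offsets = F.pad(patches_per_image.cumsum(0), (1, 0))  # leading 0

    all_logps = []
    for i in range(0, input_ids.size(0), mini_batch_size):
        j = min(i + mini_batch_size, input_ids.size(0))
        logits = model(
            input_ids=input_ids[i:j],
            attention_mask=attention_mask[i:j],
            pixel_values=pixel_values[patch_offsets[i]:patch_offsets[j]],
            image_grid_thw=image_grid_thw[i:j],
            # +1 so we keep the position that predicts the first scored token
            num_logits_to_keep=num_logits_to_keep + 1,
        ).logits[:, :-1, :]  # drop the last position; it predicts beyond the sequence
        targets = input_ids[i:j, -num_logits_to_keep:]  # completion tokens
        logps = logits.log_softmax(dim=-1)
        all_logps.append(torch.gather(logps, 2, targets.unsqueeze(-1)).squeeze(-1))
    return torch.cat(all_logps, dim=0)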

My test script is (the torchrun launcher prefix is omitted here; see the full script later in this thread):

src/open_r1/grpo.py \
--deepspeed local_scripts/zero3.json \
--output_dir="${OUTPUT_DIR}" \
--model_name_or_path="${MODEL_PATH}" \
--dataset_name="${DATA_PATH}" \
--max_prompt_length 8192 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--logging_steps 1 \
--bf16 \
--report_to wandb \
--gradient_checkpointing false \
--attn_implementation flash_attention_2 \
--max_pixels 2359296 \
--save_total_limit 8 \
--num_train_epochs 2 \
--run_name Qwen2-VL-2B-8k \
--save_steps 100 \
--save_only_model true

You can also add:
--logit_computation_mini_batch_size X
if your trl package is up to date.
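Smaller values lower the peak memory of the per-token log-prob computation at some cost in throughput. For example, appended to the command above (the value 2 is purely illustrative):

--logit_computation_mini_batch_size 2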

@chenllliang (Member)

Hi, thanks for your contribution. Can you provide a more detailed running-time comparison, and a performance comparison on GEOQA and CLEVR?

@Yukino256 (Author)

> geoqa

Hi, thank you. I will try it as soon as possible 🤗

@Yukino256 (Author) commented Feb 14, 2025

> Hi, thanks for your contribution. Can you provide a more detailed running-time comparison, and a performance comparison on GEOQA and CLEVR?

@chenllliang
Hi, I'm sorry, I made a mistake: I used the old code in which the batch-decoding error was not yet fixed, so the speed-up does not in fact exist.

Using Qwen2-VL-7B-Instruct for the test:

On the GEOQA dataset, the raw code runs at 105 s/it and my changed code at 114 s/it on 8x A800 80G.
However, the raw code hits an OOM error in the second iteration:
[screenshot]
while my changed code indeed runs well without OOM:
[screenshot]

My test script is:

export DEBUG_MODE="true"
export LOG_PATH="./debug_log_GEOQA.txt"

OUTPUT_DIR=/grpo-result-7b-RAW-GEOQA
MODEL_PATH=/Qwen2-VL-7B-Instruct
DATA_PATH=/PKUGEOQA_R1V_Train_8K


set -x
set -e
set -u

export LANG=en_US.UTF-8
export NCCL_NET_GDR_LEVEL=2
export NCCL_IB_GID_INDEX=3
export NCCL_IB_DISABLE=0
export NCCL_IB_RETRY_CNT=7
export CUDA_LAUNCH_BLOCKING=0
export NCCL_DEBUG=info


export WANDB_BASE_URL=https://api.wandb.ai
export WANDB_PROJECT=r1-test
export WANDB_API_KEY="xxxxxx"
WANDB_RUN_NAME=GEOQA-RAW
wandb login $WANDB_API_KEY

NUM_GPUS_PER_NODE=$(nvidia-smi -L | wc -l)

torchrun --nnodes=$WORLD_SIZE --nproc_per_node=$NUM_GPUS_PER_NODE --node_rank=$RANK --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
src/open_r1/grpo.py \
--deepspeed local_scripts/zero3.json \
--output_dir="${OUTPUT_DIR}" \
--model_name_or_path="${MODEL_PATH}" \
--dataset_name="${DATA_PATH}" \
--max_prompt_length 8192 \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 8 \
--num_generations 8 \
--logging_steps 1 \
--bf16 \
--report_to wandb \
--gradient_checkpointing false \
--attn_implementation flash_attention_2 \
--max_pixels 2359296 \
--save_total_limit 8 \
--num_train_epochs 10 \
--run_name $WANDB_RUN_NAME \
--save_steps 100 \
--save_only_model true

@ZCMax commented Feb 14, 2025

[rank0]:   File "/mnt/petrelfs/zhuchenming/R1-V/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py", line 470, in compute_loss
[rank0]:     per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)
[rank0]: TypeError: Qwen2VLGRPOTrainer._get_per_token_logps() missing 2 required positional arguments: 'num_logits_to_keep' and 'mini_batch_size'

After updating to your commit, an error seems to occur.

@Yukino256 (Author) commented Feb 14, 2025

> [rank0]:   File "/mnt/petrelfs/zhuchenming/R1-V/src/open-r1-multimodal/src/open_r1/trainer/grpo_trainer.py", line 470, in compute_loss
> [rank0]:     per_token_logps = self._get_per_token_logps(model, prompt_completion_ids, attention_mask, pixel_values, image_grid_thw)
> [rank0]: TypeError: Qwen2VLGRPOTrainer._get_per_token_logps() missing 2 required positional arguments: 'num_logits_to_keep' and 'mini_batch_size'
>
> After updating to your commit, an error seems to occur.

Hello! It seems the code was not updated on your side? I changed that line into multiple lines; maybe git didn't sync properly?
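For reference, the TypeError just means the call site in compute_loss still uses the old five-argument form; with the updated code it should pass the two new arguments, roughly like this (a sketch; the argument names come from the traceback above, and the comments are assumptions):

per_token_logps = self._get_per_token_logps(
    model,
    prompt_completion_ids,
    attention_mask,
    pixel_values,
    image_grid_thw,
    num_logits_to_keep,   # e.g. the number of completion tokens to score
    mini_batch_size,      # e.g. from --logit_computation_mini_batch_size
)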

@ZCMax commented Feb 14, 2025

Sorry, but I think you didn't update the code in the right place ~

@ZCMax commented Feb 14, 2025

I also have another question: I set mini_batch_size to 1, and my prompts are around 2100 tokens, but it still OOMs on 8x A100 80G.

@Yukino256 (Author)

> I also have another question: I set mini_batch_size to 1, and my prompts are around 2100 tokens, but it still OOMs on 8x A100 80G.

Hello, I'm looking into the code error. As for the OOM error, is --deepspeed local_scripts/zero3.json added? 😭😭
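(For anyone who doesn't have that file: a minimal sketch of what a local_scripts/zero3.json for this setup typically looks like under the HF Trainer + DeepSpeed integration; the repo's actual config may differ.)

{
    "bf16": { "enabled": "auto" },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}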

@Yukino256 Yukino256 closed this Feb 14, 2025
@Yukino256 Yukino256 reopened this Feb 14, 2025
@Yukino256 (Author)

@ZCMax Hello! The bugs should be fixed now! I think running 7B with zero3 is OK! 🥲🥲
[screenshot]

@CAOANJIA

> @ZCMax Hello! The bugs should be fixed now! I think running 7B with zero3 is OK! 🥲🥲 [screenshot]

A question: once zero3 is added, multi-node training seems to hang?

@Yukino256 (Author) commented Feb 19, 2025

> @ZCMax Hello! The bugs should be fixed now! I think running 7B with zero3 is OK! 🥲🥲 [screenshot]
>
> A question: once zero3 is added, multi-node training seems to hang?

Hello! As far as I can tell, the original code currently can't run multi-node at all; I've only ever run it on a single node with 8 GPUs. It seems they haven't implemented multi-node yet?
See #57

@rrustlee

@Yukino256 Have you run into a similar error? It comes from inside utils.py and should be caused by o3; it doesn't seem to affect the results, but I'm still a bit puzzled.
[screenshot]

@Yukino256 (Author)

> @Yukino256 Have you run into a similar error? It comes from inside utils.py and should be caused by o3; it doesn't seem to affect the results, but I'm still a bit puzzled. [screenshot]

I get this with the original source code as well, so it shouldn't be something introduced by my code.
