
(Training qwen2.5-VL-7B-Instruct) AssertionError: Input and cos/sin must have the same dtype, got torch.float16 and torch.bfloat16 #105

Open · six-finger opened this issue Feb 17, 2025 · 8 comments

@six-finger

Bash file: (screenshots attached)

Log: (screenshots attached)

@lky-violet

Hello, when I switched the model from Qwen2.5-VL-3B-Instruct to Qwen2-VL-2B-Instruct, the error was resolved. I suspect it might be due to differences in model precision?
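
If it really is a precision mismatch, one way to check (a minimal sketch, not from the original report; it assumes a transformers build with Qwen2.5-VL support, and the exact class name may differ between versions) is to load the checkpoint with an explicit dtype and see what the parameters actually end up as:

```python
# Minimal sketch: force a single dtype end to end and confirm what was loaded.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,              # load weights in bf16
    attn_implementation="flash_attention_2",
)
print(next(model.parameters()).dtype)        # expect torch.bfloat16
```

If the weights come back as float16 while training runs with `--bf16`, that would line up with the float16 vs. bfloat16 mismatch in the assertion.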

@six-finger (Author)

> Hello, when I switched the model from Qwen2.5-VL-3B-Instruct to Qwen2-VL-2B-Instruct, the error was resolved. I suspect it might be due to differences in model precision?

This issue appears to be due to changes in the transformers library version. A similar issue (huggingface/transformers#36188) points to a specific transformers commit (f7a3c62), but after installing that commit I encountered a new error:

(screenshot of the new error attached)
@weizhepei

+1 Same issue when using this script:

```bash
CUDA_VISIBLE_DEVICES="1,2,3,4,5,6,7" torchrun --nproc_per_node="7" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir $OUTPUT_DIR \
    --model_name_or_path $QWEN_PATH \
    --dataset_name $HF_DATASET \
    --max_prompt_length 512 \
    --max_completion_length 1024 \
    --temperature 1.0 \
    --num_generations 4 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name $RUN_NAME \
    --save_steps 100 \
    --save_only_model true \
    --deepspeed local_scripts/zero3.json
```

Lib versions:

```
flash-attn                2.7.4.post1              pypi_0    pypi
r1-v                      0.1.0                     dev_0    <develop>
transformers              4.50.0.dev0              pypi_0    pypi
vllm                      0.7.2                    pypi_0    pypi
```

@TobiasLee Any pointers on this issue? 👀

@robinjoe93

This may be a DeepSpeed error. I ran the command without `--deepspeed local_scripts/zero3.json` and it works.

@lky-violet

> This may be a DeepSpeed error. I ran the command without `--deepspeed local_scripts/zero3.json` and it works.

I tried your method and removed `--deepspeed local_scripts/zero3.json`. I only have 4 A100 GPUs, but when I run the code with `export CUDA_VISIBLE_DEVICES="0,1,6,7"`, it fails with **CUDA out of memory. Tried to allocate 30.00 MiB. GPU 3 has a total capacity of 79.15 GiB**. What should I do?

@robinjoe93

> > This may be a DeepSpeed error. I ran the command without `--deepspeed local_scripts/zero3.json` and it works.
>
> I tried your method and removed `--deepspeed local_scripts/zero3.json`. I only have 4 A100 GPUs, but when I run the code with `export CUDA_VISIBLE_DEVICES="0,1,6,7"`, it fails with **CUDA out of memory. Tried to allocate 30.00 MiB. GPU 3 has a total capacity of 79.15 GiB**. What should I do?

Decrease `max_prompt_length`, `num_generations`, and `max_completion_length`.
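
For reference, a sketch of what the reduced settings would look like (this assumes grpo.py forwards the CLI flags to TRL's GRPOConfig, which is how these training scripts are typically wired; the values below are only illustrative):

```python
# Illustrative values only: smaller sequence and generation budgets to cut peak memory.
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/qwen2_5_vl_grpo",  # hypothetical path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    max_prompt_length=256,                 # down from 512
    max_completion_length=512,             # down from 1024
    num_generations=2,                     # down from 4
    bf16=True,
)
```

Lowering `num_generations` and `max_completion_length` tends to help most, since GRPO samples several completions per prompt and those generations dominate peak memory.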

@Syazvinski

Temporary fix:

```
pip install git+https://github.com/huggingface/transformers.git@8ee50537fe7613b87881cd043a85971c85e99519
```
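
After installing the pinned commit, a quick sanity check (not part of the original fix) that the training environment actually imports that build:

```python
# Sanity check: confirm which transformers build the environment imports.
import transformers

print(transformers.__version__)  # a git install should report a *.dev0 version
print(transformers.__file__)     # shows the install location the import resolves to
```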

@llliuxiao

> Temporary fix: pip install git+https://github.com/huggingface/transformers.git@8ee50537fe7613b87881cd043a85971c85e99519

It works!
