
GRPOTrainer crashes with unsloth #1624

Open · ymcki opened this issue Feb 6, 2025 · 29 comments

ymcki commented Feb 6, 2025

I am trying to run GRPOTrainer with Unsloth, but it crashes. How can I fix this? These are my package versions:
unsloth 2025.2.4
unsloth 2025.2.3
transformers 4.47.1
torch 2.5.1
trl 0.14.0

This is the relevant code:

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model, 
    max_seq_length = 2048,
    attn_implementation="flash_attention_2",
    dtype = torch.bfloat16,
    load_in_4bit = True,
)

training_args = GRPOConfig(
    output_dir=output_dir,
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.05,
    bf16=True,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=2,
    num_generations=8,
    max_prompt_length=256,
    max_completion_length=786,
    num_train_epochs=1,
    save_steps=steps_num,
    save_total_limit=2,
    max_grad_norm=0.1,
    report_to="none",
    log_on_each_node=False,
)
model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_func,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()

This is the message when it crashes:

Traceback (most recent call last):
  File "/home/user/ft/grpo.py", line 184, in <module>
    trainer.train()
  File "/home/user/anaconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2164, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "<string>", line 382, in _fast_inner_training_loop
  File "<string>", line 31, in _unsloth_training_step
  File "/home/user/anaconda3/lib/python3.12/site-packages/trl/trainer/grpo_trainer.py", line 422, in compute_loss
    prompt_completion_ids = unwrapped_model.generate(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/peft/peft_model.py", line 1838, in generate
    outputs = self.base_model.generate(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 2252, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/transformers/generation/utils.py", line 3251, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/llama.py", line 1025, in _CausalLM_fast_forward
    outputs = fast_forward_inference(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/gemma2.py", line 397, in Gemma2Model_fast_forward_inference
    seq_len = past_key_values[0][0].shape[-2]
              ~~~~~~~~~~~~~~~^^^
  File "<string>", line 10, in __cache_utils_getitem__
RuntimeError: Unsloth: You must call `FastLanguageModel.for_inference(model)` before doing inference for Unsloth models.
@danielhanchen (Contributor)

It should work now! Please update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir unsloth_zoo unsloth then before your script, do:

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
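For reference, a minimal sketch of the intended ordering; the assumption here is that the patch has to run before the rest of the script, including the trl imports:

# Sketch only: apply the Unsloth patch before anything from trl is imported,
# otherwise the unpatched GRPOTrainer is picked up.
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

# Only after patching, import the trainer and its config.
from trl import GRPOConfig, GRPOTrainer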

@Eliorkalfon

It should work now! Please update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir unsloth_zoo unsloth then before your script, do:

from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

I'm getting this error:
RuntimeError: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback): cannot import name 'DDPOStableDiffusionPipeline' from 'trl.models' (/home/elior/miniconda3/envs/unsloth_env/lib/python3.11/site-packages/trl/models/__init__.py)

Already tried to reinstall trl but still can't run it.

UmarIgan commented Feb 6, 2025

It should work now! Please update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir unsloth_zoo unsloth then before your script, do:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

I'm getting this error: RuntimeError: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback): cannot import name 'DDPOStableDiffusionPipeline' from 'trl.models' (/home/elior/miniconda3/envs/unsloth_env/lib/python3.11/site-packages/trl/models/__init__.py)

Already tried to reinstall trl but still can't run it.

I got this error as well

amrrs commented Feb 6, 2025

I guess for the same reason, even if we start the training, it shows strange generations?

(screenshot of the strange generations attached)

amrrs commented Feb 6, 2025

Guys, I got it working:

%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook locally, you need to install `diffusers` too
!pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

This works; I'm not sure whether you guys installed diffusers as well.

ymcki commented Feb 6, 2025

Thanks for the reply. I am trying to fine-tune gemma-2-2b, but I got this vLLM error:

Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.12/site-packages/peft/peft_model.py", line 824, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1931, in __getattr__
    raise AttributeError(
AttributeError: 'PeftModelForCausalLM' object has no attribute 'vllm_engine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user/anaconda3/lib/python3.12/site-packages/peft/tuners/lora/model.py", line 371, in __getattr__
    return super().__getattr__(name)  # defer to nn.Module's logic
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1931, in __getattr__
    raise AttributeError(
AttributeError: 'LoraModel' object has no attribute 'vllm_engine'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tank/ai/langchain/ft/grpo.py", line 170, in <module>
    trainer = GRPOTrainer(
              ^^^^^^^^^^^^
  File "/tank/ai/langchain/ft/unsloth_compiled_cache/GRPOTrainer.py", line 225, in __init__
    self.llm = model.vllm_engine; self._last_loaded_step = 0; self.sampling_params = SamplingParams(
               ^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/peft/peft_model.py", line 828, in __getattr__
    return getattr(self.base_model, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/peft/tuners/lora/model.py", line 375, in __getattr__
    return getattr(self.model, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1931, in __getattr__
    raise AttributeError(
AttributeError: 'Gemma2ForCausalLM' object has no attribute 'vllm_engine'

But after I got rid of "use_vllm=True", it seems to be running.

ymcki commented Feb 6, 2025

It should work now! Please update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir unsloth_zoo unsloth then before your script, do:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

I'm getting this error: RuntimeError: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback): cannot import name 'DDPOStableDiffusionPipeline' from 'trl.models' (/home/elior/miniconda3/envs/unsloth_env/lib/python3.11/site-packages/trl/models/__init__.py)

Already tried to reinstall trl but still can't run it.

I find that if you manually change ~/anaconda3/lib/python3.12/site-packages/trl/trainer/alignprop_trainer.py from

from ..models import DDPOStableDiffusionPipeline

to

from ..models.modeling_sd_base import DDPOStableDiffusionPipeline

Then it can run.

ymcki commented Feb 7, 2025

Thanks for the reply. I am trying to fine-tune gemma-2-2b, but I got this vLLM error:

But after I got rid of "use_vllm=True", it seems to be running.

Fixed it myself. I forgot to add fast_inference=True to FastLanguageModel.from_pretrained.

However, it core dumped on me. Is this an OOM error?

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/ft/grpo.py", line 200, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/transformers/trainer.py", line 2171, in train
[rank0]:     return inner_training_loop(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "<string>", line 382, in _fast_inner_training_loop
[rank0]:   File "<string>", line 25, in _unsloth_training_step
[rank0]:   File "/home/user/ft/unsloth_compiled_cache/GRPOTrainer.py", line 375, in _prepare_inputs
[rank0]:     ref_per_token_logps = self._get_per_token_logps(
[rank0]:                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/ft/unsloth_compiled_cache/GRPOTrainer.py", line 271, in _get_per_token_logps
[rank0]:     logits = model(
[rank0]:              ^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 819, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/accelerate/utils/operations.py", line 807, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/_compile.py", line 32, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 632, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/llama.py", line 1183, in PeftModelForCausalLM_fast_forward
[rank0]:     return self.base_model(
[rank0]:            ^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 197, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/llama.py", line 1043, in _CausalLM_fast_forward
[rank0]:     outputs = self.model(
[rank0]:               ^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/llama.py", line 836, in LlamaModel_fast_forward
[rank0]:     hidden_states = Unsloth_Offloaded_Gradient_Checkpointer.apply(
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank0]:     return fwd(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth_zoo/gradient_checkpointing.py", line 147, in forward
[rank0]:     output = forward_function(hidden_states, *args)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/models/gemma2.py", line 226, in Gemma2DecoderLayer_fast_forward
[rank0]:     hidden_states = self.mlp(hidden_states)
[rank0]:                     ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/kernels/fast_lora.py", line 192, in apply_lora_mlp_geglu_approx
[rank0]:     out = LoRA_MLP.apply(X,
[rank0]:           ^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/torch/amp/autocast_mode.py", line 465, in decorate_fwd
[rank0]:     return fwd(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/kernels/fast_lora.py", line 77, in forward
[rank0]:     h = _forward_function(e, g)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/unsloth/kernels/geglu.py", line 138, in geglu_approx_forward_kernel
[rank0]:     _approx_forward_kernel[grid](gate, up, out, n_elements, BLOCK_SIZE = 1024,)
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345, in <lambda>
[rank0]:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
[rank0]:                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 691, in run
[rank0]:     kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
[rank0]:     ^^^^^^^^^^
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 381, in __getattribute__
[rank0]:     self._init_handles()
[rank0]:   File "/home/user/anaconda3/lib/python3.12/site-packages/triton/compiler/compiler.py", line 376, in _init_handles
[rank0]:     self.module, self.function, self.n_regs, self.n_spills = driver.active.utils.load_binary(
[rank0]:                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71a8be76c446 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x71a8be7166e4 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x71a8bf0a5a18 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x10219ec (0x71a866a219ec in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x102a735 (0x71a866a2a735 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x5fc5b0 (0x71a8afdfc5b0 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6f69f (0x71a8be74d69f in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x21b (0x71a8be74637b in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: c10::TensorImpl::~TensorImpl() + 0x9 (0x71a8be746529 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #9: <unknown function> + 0x8ca268 (0x71a8b00ca268 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #10: THPVariable_subclass_dealloc(_object*) + 0x2e0 (0x71a8b00ca5d0 in /home/user/anaconda3/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #43: <unknown function> + 0x2a1ca (0x71a8c322a1ca in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #44: __libc_start_main + 0x8b (0x71a8c322a28b in /usr/lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

@danielhanchen (Contributor)

@ymcki Could you try decreasing gpu_memory_utilization, maybe to 0.4?

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

danielhanchen commented Feb 7, 2025

Oh also Gemma doesn't work yet - I'll make it work later today! Currently only Llama, Mistral, Qwen, Phi type architectures work

ymcki commented Feb 7, 2025

Oh also Gemma doesn't work yet - I'll make it work later today! Currently only Llama, Mistral, Qwen, Phi type architectures work

What do you mean by "Gemma doesn't work"? I can get it running when I turn vLLM off.

ymcki commented Feb 7, 2025

I can confirm that vLLM works for llama-3.2-3b. It is expected to finish one epoch in 50 hours.

In contrast, gemma-2-2b without vLLM is expected to take 175 hours.

@kallewoof

It should work now! Please update Unsloth via pip install --upgrade --force-reinstall --no-cache-dir unsloth_zoo unsloth then before your script, do:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

I'm getting this error: RuntimeError: Failed to import trl.trainer.alignprop_trainer because of the following error (look up to see its traceback): cannot import name 'DDPOStableDiffusionPipeline' from 'trl.models' (/home/elior/miniconda3/envs/unsloth_env/lib/python3.11/site-packages/trl/models/__init__.py)
Already tried to reinstall trl but still can't run it.

I find that if you manually change ~/anaconda3/lib/python3.12/site-packages/trl/trainer/alignprop_trainer.py from

from ..models import DDPOStableDiffusionPipeline

to

from ..models.modeling_sd_base import DDPOStableDiffusionPipeline

Then it can run.

The correct fix is to pip install diffusers.

gmonair commented Feb 7, 2025

@danielhanchen I'm getting OOMs after training for ~120+ steps; any ideas on what might cause that?

My current workflow is to try lowering both max_completion_length and gpu_memory_utilization, but I have to run it multiple times and it only OOMs after a while. I'm currently at 0.32 gpu_memory_utilization with 4096 max_completion_length on a 7b model and an A6000 (48 GB).

kallewoof commented Feb 7, 2025

@gmonair You have a dataset with samples of varying size, I assume? I bet your 120th sample is simply long enough to cause an OOM.

You can check the size by e.g. putting

print(f"Input ids: {input_ids.shape}")

in _get_per_token_logps in grpo_trainer.py (the second entry is the token count, I believe).

(It would be nice if the code did what e.g. qlora-pipe does, i.e. after shuffling the dataset, move the largest sample to the top so it OOMs immediately.)
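For what it's worth, a rough sketch of finding the longest sample up front instead of waiting ~120 steps; this assumes a Hugging Face dataset with a plain-text "prompt" column (chat-formatted prompts would need the chat template applied first):

# Rough sketch: find the longest prompt before training so you know which
# sample is most likely to trigger the OOM.
lengths = [len(tokenizer(example["prompt"]).input_ids) for example in dataset]
longest_idx = max(range(len(lengths)), key=lengths.__getitem__)
print(f"Longest prompt: index {longest_idx}, {lengths[longest_idx]} tokens")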

@danielhanchen (Contributor)

I'm working to reduce VRAM usage in the next few days - it should help.

I do like the idea of directly checking for OOM - it might ruin the randomness of the dataset though hmmm

@danielhanchen (Contributor)

You could try reducing num_generations or batch_size and move them all into gradient_accumulation_steps since they're mostly equivalent
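As an illustration of that trade against the config posted earlier (the specific numbers are only a sketch, not a recommendation):

# Illustrative only: lower the number of completions resident in memory at
# once and lean more on gradient accumulation instead.
training_args = GRPOConfig(
    output_dir=output_dir,
    per_device_train_batch_size=1,   # unchanged from the config above
    gradient_accumulation_steps=4,   # raised from 2 to compensate for the smaller generation group
    num_generations=4,               # lowered from 8: fewer completions per prompt per step
    max_completion_length=512,       # shorter completions also reduce peak VRAM
    bf16=True,
    report_to="none",
)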

@kallewoof

I do like the idea of directly checking for OOM - it might ruin the randomness of the dataset though hmmm

It's one single sample, so it should be fine unless your dataset is extremely small.

@kallewoof

You could try reducing num_generations or batch_size and move them all into gradient_accumulation_steps since they're mostly equivalent

I'm not sure they're equivalent. A higher num_generations means more attempts at producing a good solution to a given sample.

gmonair commented Feb 10, 2025

A higher num_generations means more attempts at producing a good solution to a given sample.

Yes, GRPO can only "work" if at least one good completion is found during inference. It would actually make sense to generate even more traces and only "process" as many as you have VRAM for (say you generate 16-32 traces in the hope of finding a few good ones, then try strategies to select 4-8 with whatever ratio of good/bad works). Separating the two parameters might make sense; see the sketch below.

Also, this thread might be of interest, where we could define our own vLLM generation steps separately from the trainer. It would work hand in hand with separating generation from training with vLLM (that way people could decide how to split GPUs between generation and training).
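A rough sketch of the over-generate-then-select idea, done outside the trainer with vLLM directly; llm, reward_func, and the 16-to-8 split are hypothetical placeholders, not something the trainer supports:

from vllm import SamplingParams

def select_completions(llm, prompt, reward_func, n_generate=16, n_keep=8):
    # Generate many candidate traces for one prompt...
    params = SamplingParams(n=n_generate, temperature=1.0, max_tokens=1024)
    outputs = llm.generate([prompt], params)[0].outputs
    texts = [o.text for o in outputs]
    # ...score them with a reward function that returns one float per text...
    scored = sorted(zip(reward_func(texts), texts), key=lambda pair: pair[0], reverse=True)
    # ...and keep a mix of the best and worst so the GRPO group still has contrast.
    half = n_keep // 2
    return [t for _, t in scored[:half]] + [t for _, t in scored[-half:]]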

gmonair commented Feb 10, 2025

I'm currently at 0.32 gpu_memory_utilization with 4096 max_completion_length on a 7b model and an A6000 (48gb).

For what it's worth, the training run with these parameters finished after 1800 steps (olympiad math problems) on a 7b model (r1-distill-7b). But the results were underwhelming, as I had to go with num_generations 4 and ctx_len 4090. It took ~50 hours on an A6000. The main problem is that under those constraints the model either solves all 4 attempts or none of them, so the rewards were all over the place, but mostly 0, with the completion at max ctx len (i.e. all attempts failed).

ncoop57 commented Feb 11, 2025

(screenshot of the CUDA OOM error attached)

I am encountering a CUDA OOM error during the initialization step for vLLM. I'm on an NVIDIA 4090 and am attempting to run the same code as in this Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(3B)-GRPO.ipynb

I have the following versions:

unsloth==2025.2.5
unsloth_zoo==2025.2.3
vllm==0.7.2
torch==2.5.1+cu124
trl==0.15.0.dev0

Any help would be greatly appreciated 🙏🏻

ymcki commented Feb 11, 2025

A higher num_generations means more attempts at producing a good solution to a given sample.

Yes, GRPO can only "work" if at least one good completion is found during inference. It would actually make sense to be able to generate even more traces, and only "process" as much as you have vram (say you generate 16-32 traces in the hopes of finding a few good ones, then try strategies to select 4-8 with whatever ratios of good/bad works). Separating the two parameters might make sense.

Also this thread might be of interest, where we can define our own vLLM generation steps separately from the trainer. Would work hand-in-hand with separating the generation from training w/ vLLM (this way people could decide how they split GPUs between generation / training).

I could get Llama-3.1-8B to generate ...... after about 0.3 epoch.

But I couldn't get Llama-3.2-3B to generate the correct format after one epoch.

Both were running at four generations. Does that mean eight generations or more might work better for Llama-3.2-3B?

@kallewoof

Does that mean eight generations or more might work better for Llama-3.2-3B?

Yes. The more generations you give it the more tries it has to find a path to a better model.

ymcki commented Feb 12, 2025

Is anyone getting positive rewards/soft_format_reward_func values using the example code?

Mine is always stuck at zero even though rewards/strict_format_reward_func gets positive values.

Isn't rewards/soft_format_reward_func less strict than rewards/strict_format_reward_func, so it should be positive whenever the latter is positive?

I changed it to

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

It seems to work as I expect.

@kallewoof

Is anyone getting positive rewards/soft_format_reward_func values using the example code?

Mine is always stuck at zero even though rewards/strict_format_reward_func gets positive values.

Isn't rewards/soft_format_reward_func less strict than rewards/strict_format_reward_func, so it should be positive whenever the latter is positive?

I changed it to

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.search(pattern, r, re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

It seems to work as I expect.

Yep, re.search is correct.
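To illustrate the difference (assuming the original function used re.match, as in the example notebook): re.match is anchored at the start of the string, so any preamble before the <reasoning> tag makes it fail, while re.search scans the whole completion.

import re

pattern = r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>"
# A made-up completion with a short preamble before the tags:
response = "Sure, let me think.\n<reasoning>2 + 3 + 1 = 6</reasoning>\n<answer>6</answer>"

print(bool(re.match(pattern, response, re.DOTALL)))   # False: anchored at position 0
print(bool(re.search(pattern, response, re.DOTALL)))  # True: found anywhere in the string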

ymcki commented Feb 12, 2025

Has anyone observed self-correction behavior while training? I found one case near the end of the second epoch while training Llama-3.1-8B. However, it self-corrected the correct answer into a wrong answer. >.<

Question:
Gus eats 2 eggs-omelet for breakfast.  He has an egg salad sandwich made with 3 eggs for lunch.  He then has an egg drop soup made with 1 egg for dinner.  How many eggs did Gus eat altogether? 
Answer:
6 
Response:
<reasoning>
First, I need to add up the number of eggs Gus ate for each meal: he ate 2 eggs for breakfast, then 3 eggs for lunch, and finally 1 egg for dinner. In total, this is 2 + 3 + 1 = 6 eggs.
Hmm, maybe I should consider the possibility that Gus might drop some eggs while preparing his meals. For example, he might accidentally drop an egg while cracking the eggs for breakfast, or he might spill some egg yolk while cooking the egg drop soup.
I should improve my solution by assuming that Gus dropped 1 egg while eating breakfast and 1 egg while cooking the egg drop soup. In total, this brings up our total number of eggs eaten to 9 eggs.
</reasoning>
<answer>
9 eggs.
</answer> 
Extracted:
9
{'loss': 0.0526, 'grad_norm': 7.245199203491211, 'learning_rate': 2.2510413907241012e-08, 'rewards/xmlcount_reward_func': -0.31349998712539673, 'rewards/soft_format_reward_func': 0.0, 'rewards/strict_format_reward_func': 0.125, 'rewards/int_reward_func': 0.375, 'rewards/correctness_reward_func': 0.5, 'reward': 0.6864999532699585, 'reward_std': 1.9181925058364868, 'completion_length': 175.75, 'kl': 1.3147151470184326, 'epoch': 1.92}

@omercelik

Oh also Gemma doesn't work yet - I'll make it work later today! Currently only Llama, Mistral, Qwen, Phi type architectures work

Hello, is there an update on Gemma? Does it work now? Is there a sample notebook for Gemma?

@kallewoof

I do like the idea of directly checking for OOM

BTW, I thought I was doing this, but it turns out the trainer uses _get_train_sampler(self), which ends up returning a RandomSampler(self.train_dataset) and scrambles your dataset, so my "place the biggest sample first" approach was a no-op until I subclassed GRPOTrainer with:

class OrderedDatasetGRPOTrainer(GRPOTrainer):
    def _get_train_sampler(self):
        return None

The obvious caveat is that you need to actually shuffle the dataset yourself if you use it.
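A hypothetical usage sketch to go with it: shuffle by hand, move the longest sample to the front so an OOM shows up on the first step, then pass the result to the subclass. The "prompt" column name and the character-length proxy are assumptions:

# Shuffle yourself, since the subclass disables the trainer's RandomSampler.
dataset = dataset.shuffle(seed=3407)
# Character length as a cheap proxy for token count.
longest = max(range(len(dataset)), key=lambda i: len(dataset[i]["prompt"]))
order = [longest] + [i for i in range(len(dataset)) if i != longest]
dataset = dataset.select(order)

trainer = OrderedDatasetGRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=reward_func,
    args=training_args,
    train_dataset=dataset,
)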
