GRPOTrainer crashes with unsloth #1624
It should work now! Please update Unsloth, then apply the patch:

```python
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)
```
I'm getting this error. I already tried reinstalling trl, but I still can't run it.
I got this error as well.
Guys, I got it working. This works for me; not sure what you guys installed.
Thanks for the reply. I am trying to fine-tune gemma-2-2b, but I got a vLLM error. After I removed "use_vllm=True", it seems to be running.
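For reference, a minimal sketch of where that flag lives, assuming trl's `GRPOConfig` (the other arguments here are illustrative, not taken from the thread):

```python
from trl import GRPOConfig

# Sketch: toggling vLLM-backed generation in the trainer config.
# With use_vllm=False the trainer falls back to regular HF generation,
# which is slower but sidesteps the vLLM initialization error above.
training_args = GRPOConfig(
    output_dir = "outputs",
    use_vllm = False,      # set True only once vLLM initializes cleanly for your model
    num_generations = 4,   # completions sampled per prompt
)
```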
I find that if you manually change ~/anaconda3/lib/python3.12/site-packages/trl/trainer/alignprop_trainer.py from `from ..models import DDPOStableDiffusionPipeline` to `from ..models.modeling_sd_base import DDPOStableDiffusionPipeline`, then it can run.
Fixed it myself. I forgot to add `fast_inference=True` to `FastLanguageModel.from_pretrained`. However, it core dumped on me. Is this an OOM error?
@ymcki Could you try decreasing `gpu_memory_utilization`?

```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True,            # False for LoRA 16bit
    fast_inference = True,          # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,   # Reduce if out of memory
)
```
Oh, also Gemma doesn't work yet - I'll make it work later today! Currently only Llama, Mistral, Qwen, and Phi type architectures work.
What do you mean by "Gemma doesn't work"? I get it running when I turn vLLM off.
I can confirm that vLLM works for llama-3.2-3b. It is expected to finish one epoch in 50 hrs. In contrast, gemma-2-2b without vLLM is expected to finish in 175 hrs.
The correct fix is to
@danielhanchen I'm getting OOMs after training for ~120+ steps, any ideas on what might cause that? My current workflow is to try lowering both
@gmonair You have a varied-size dataset, I assume? I bet your 120th sample is simply long enough to cause an OOM. You can check the size by e.g. putting `print(f"Input ids: {input_ids.shape}")` in grpo_trainer.py. (It would be nice if the code did what e.g. qlora-pipe does, i.e. after shuffling the dataset, move the largest sample to the top so it OOMs immediately.)
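A rough sketch of that qlora-pipe-style trick (a hypothetical helper, assuming a Hugging Face `datasets.Dataset` that already has tokenized `input_ids`; not part of any library):

```python
from datasets import Dataset

# Hypothetical helper: shuffle, then move the longest sample to the front
# so an OOM caused by sequence length shows up on the very first step.
def shuffle_longest_first(ds: Dataset, seed: int = 42) -> Dataset:
    ds = ds.shuffle(seed=seed)
    lengths = [len(ids) for ids in ds["input_ids"]]
    longest = max(range(len(lengths)), key=lengths.__getitem__)
    order = [longest] + [i for i in range(len(ds)) if i != longest]
    return ds.select(order)
```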
I'm working to reduce VRAM usage in the next few days - it should help. I do like the idea of directly checking for OOM - it might ruin the randomness of the dataset though, hmmm.
You could try reducing
It's one single sample, so it should be fine unless your dataset is extremely small.
I'm not sure they're equivalent. A higher
Yes, GRPO can only "work" if at least one good completion is found during inference. It would actually make sense to generate even more traces and only "process" as many as you have VRAM for (say you generate 16-32 traces in the hope of finding a few good ones, then try strategies to select 4-8 with whatever ratio of good/bad works). Separating the two parameters might make sense. Also, this thread might be of interest, where we can define our own vLLM generation steps separately from the trainer. It would work hand-in-hand with separating generation from training with vLLM (this way people could decide how they split GPUs between generation and training).
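A rough sketch of that oversample-then-select idea (the function name, reward function, and selection strategy are illustrative, not part of trl's GRPOTrainer API):

```python
from vllm import LLM, SamplingParams

# Illustrative only: generate many candidate traces, keep a small mixed subset.
def oversample_and_select(llm: LLM, prompt: str, reward_fn, n_generate: int = 32, n_keep: int = 8):
    params = SamplingParams(n=n_generate, temperature=0.8, max_tokens=512)
    outputs = llm.generate([prompt], params)[0].outputs   # n_generate candidate completions
    ranked = sorted(outputs, key=lambda o: reward_fn(o.text), reverse=True)
    # keep some good and some bad completions so GRPO still sees contrast within the group
    return ranked[: n_keep // 2] + ranked[-(n_keep // 2):]
```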
For what it's worth, the training run with these parameters finished after 1800 steps (olympiad math problems) on a 7B model (r1-distill-7b). But the results were underwhelming, as I had to go with
I am encountering a CUDA OOM error during the initialization step for vLLM. I'm on an NVIDIA 4090 and attempting to run the same code as in this Colab: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen2.5_(3B)-GRPO.ipynb I have the following versions:
Any help would be greatly appreciated 🙏🏻
I could get Llama-3.1-8B to generate ...... after about 0.3 epochs. But I couldn't get Llama-3.2-3B to generate the correct format after one epoch. Both were running at four generations. Does that mean eight generations or more might work better for Llama-3.2-3B?
Yes. The more generations you give it, the more tries it has to find a path to a better model.
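If the lever in question is the generations-per-prompt count, that is a one-line change in the trainer config (a sketch assuming trl's `GRPOConfig`; other arguments elided):

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir = "outputs",
    num_generations = 8,   # up from 4: more sampled completions per prompt per step
    # ... other arguments unchanged
)
```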
Is anyone getting positive rewards/soft_format_reward_func using the example code? Mine is always stuck at zero even though rewards/strict_format_reward_func gets positive values. Isn't rewards/soft_format_reward_func less strict than rewards/strict_format_reward_func, such that it should be positive whenever the latter is? I changed it, and it now seems to work as I expect.
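The exact change isn't shown in the thread; purely as an illustration, here is one way such a soft format check could be written to tolerate newlines inside the tags (the function name, reward value, and completion structure follow the style of the example code, but this is a guess, not the commenter's edit):

```python
import re

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    # re.search plus re.DOTALL lets the pattern match anywhere in the response and
    # lets "." span newlines, one common reason a "soft" check scores 0 while a
    # strict, newline-aware pattern still matches.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    return [0.5 if re.search(pattern, r, flags=re.DOTALL) else 0.0 for r in responses]
```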
Yep.
Has anyone observed self-correction behavior while training? I found one case near the end of the second epoch while training Llama-3.1-8B. However, it self-corrected the correct answer into a wrong answer. >.<
Hello, is there an update on Gemma? Does it work now? Is there a sample notebook for Gemma?
BTW, I thought I was doing this, but it turns out the trainer uses its own (shuffling) sampler, so I ended up subclassing it:

```python
class OrderedDatasetGRPOTrainer(GRPOTrainer):
    def _get_train_sampler(self):
        return None
```

Obvious caveat that you need to actually shuffle the dataset yourself if you use it.
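For example, a sketch of how that subclass might be used (argument values are illustrative, assuming trl's GRPOTrainer keyword arguments):

```python
dataset = dataset.shuffle(seed=3407)   # shuffle once up front, since the sampler no longer does

trainer = OrderedDatasetGRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [soft_format_reward_func],  # illustrative
    args = training_args,                      # a GRPOConfig
    train_dataset = dataset,
)
trainer.train()
```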
I am trying to run GRPOTrainer with unsloth, but it crashes. How do I fix this? These are the installed versions:
unsloth 2025.2.4
unsloth 2025.2.3
transformers 4.47.1
torch 2.5.1
trl 0.14.0
This is the relevant code:
This is the message when it crashes: