updated kaggle-Gemma3_(4B) notebook #18
Conversation
It works fine in my case without the specification. Can you share the error, maybe?
Like I mentioned, it won't run since the tokenizing is happening twice.

Without specification:

Error later on:

---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-19-ada16cbe0bb7> in <cell line: 2>()
1 from trl import SFTTrainer, SFTConfig
----> 2 trainer = SFTTrainer(
3 model = model,
4 tokenizer = tokenizer,
5 train_dataset = dataset,
/usr/local/lib/python3.10/dist-packages/unsloth/trainer.py in new_init(self, *args, **kwargs)
201 kwargs["args"] = config
202 pass
--> 203 original_init(self, *args, **kwargs)
204 pass
205 return new_init
/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func, **kwargs)
1008 fix_zero_training_loss(model, tokenizer, train_dataset)
1009
-> 1010 super().__init__(
1011 model = model,
1012 args = args,
/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py in wrapped_func(*args, **kwargs)
170 warnings.warn(message, FutureWarning, stacklevel=2)
171
--> 172 return func(*args, **kwargs)
173
174 return wrapped_func
/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
458 preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
459 if preprocess_dataset:
--> 460 train_dataset = self._prepare_dataset(
461 train_dataset, processing_class, args, args.packing, formatting_func, "train"
462 )
/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in _prepare_dataset(self, dataset, processing_class, args, packing, formatting_func, dataset_name)
704
705 if bos_token is not None:
--> 706 if test_text.startswith(bos_token) or bos_token in chat_template:
707 add_special_tokens = False
708 print("Unsloth: We found double BOS tokens - we shall remove one automatically.")
AttributeError: 'int' object has no attribute 'startswith'

With specification:
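To make the failure mode concrete, here is a minimal sketch (not code from the notebook; the checkpoint name is only illustrative) of why the traceback ends with an AttributeError on an int:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; the notebook loads its own Gemma-3 4B tokenizer.
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")

messages = [{"role": "user", "content": "Hello!"}]

# Default is tokenize=True, so the result is a list of token IDs (ints).
ids = tokenizer.apply_chat_template(messages)
# With tokenize=False the result stays a plain string.
text = tokenizer.apply_chat_template(messages, tokenize=False)

print(type(ids), type(ids[0]))  # list, int  -> a later .startswith() call fails
print(type(text))               # str        -> .startswith() works as expected
```

If the dataset's text column is built with the default (tokenize=True), it holds token IDs rather than strings, which is consistent with the AttributeError in the traceback, and the trainer would also tokenize the data a second time.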
@Erland366 @naimur-29 He is right that tokenize=False should be used here. That's the recommended practice to avoid duplication. Check here and here. Might explain some of the errors people are getting. It also depends on what you're passing to the Trainer. @naimur-29 do you have a readable copy of the ipynb notebook?
Just double-checked this. That's not the issue.
Set tokenize=False in tokenizer.apply_chat_template
It won't run otherwise, since the tokenizing happens twice. I ran into this minor issue when running the notebook today.
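A minimal sketch of the suggested change. It assumes the notebook's usual setup: `tokenizer` and `model` are loaded earlier, and `dataset` has a "conversations" column in chat-message format (these names are assumptions based on typical Unsloth notebooks, not copied from this PR):

```python
from trl import SFTTrainer, SFTConfig

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    # tokenize=False keeps the output as plain strings; SFTTrainer will
    # tokenize the "text" column itself, so tokenization happens only once.
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in convos
    ]
    return {"text": texts}

dataset = dataset.map(formatting_prompts_func, batched=True)

# Same trainer construction as in the traceback above, now fed string data.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(dataset_text_field="text"),
)
```

With this setup the chat template only formats the conversations into strings, and the single round of tokenization is left to the trainer.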