updated kaggle-Gemma3_(4B) notebook #18

Open · wants to merge 1 commit into main
Conversation

@naimur-29 commented Mar 22, 2025

Set tokenize=False in tokenizer.apply_chat_template.
It won't run otherwise, since tokenization happens twice. I ran into this minor issue when running the notebook today.

def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize = False) # set tokenize=False here
    return { "text" : texts }
pass

tokenizer = get_chat_template(
    tokenizer,
    tokenize = False, # and here (also in the inference cell and the two cells after it)
    chat_template = "gemma-3",
)
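For context, a minimal sketch (using the plain transformers API, with an illustrative model name) of what the flag changes: tokenize=True, the transformers default, returns token IDs, while tokenize=False returns the formatted string that the trainer's "text" column expects.

# Sketch: what tokenize= changes in transformers' apply_chat_template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # illustrative checkpoint
messages = [{"role": "user", "content": "Hello!"}]

text = tokenizer.apply_chat_template(messages, tokenize=False)
print(type(text))  # <class 'str'> -- a formatted prompt string

ids = tokenizer.apply_chat_template(messages, tokenize=True)  # the default
print(type(ids))   # <class 'list'> -- token IDs (ints)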

@Erland366 (Collaborator) commented
It works fine in my case without that setting. Could you share the error?

@naimur-29 (Author) commented Mar 27, 2025

> It works fine in my case without that setting. Could you share the error?

As I mentioned, it won't run because tokenization happens twice: with the default tokenize=True, the mapped "text" column ends up holding token IDs instead of strings.

Without tokenize=False:

[screenshot of the notebook cell omitted]

Error later on:

[screenshot of the traceback omitted; text version below]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-ada16cbe0bb7> in <cell line: 2>()
      1 from trl import SFTTrainer, SFTConfig
----> 2 trainer = SFTTrainer(
      3     model = model,
      4     tokenizer = tokenizer,
      5     train_dataset = dataset,

/usr/local/lib/python3.10/dist-packages/unsloth/trainer.py in new_init(self, *args, **kwargs)
    201             kwargs["args"] = config
    202         pass
--> 203         original_init(self, *args, **kwargs)
    204     pass
    205     return new_init

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func, **kwargs)
   1008         fix_zero_training_loss(model, tokenizer, train_dataset)
   1009 
-> 1010         super().__init__(
   1011             model = model,
   1012             args = args,

/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py in wrapped_func(*args, **kwargs)
    170                 warnings.warn(message, FutureWarning, stacklevel=2)
    171 
--> 172             return func(*args, **kwargs)
    173 
    174         return wrapped_func

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
    458         preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
    459         if preprocess_dataset:
--> 460             train_dataset = self._prepare_dataset(
    461                 train_dataset, processing_class, args, args.packing, formatting_func, "train"
    462             )

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in _prepare_dataset(self, dataset, processing_class, args, packing, formatting_func, dataset_name)
    704 
    705             if bos_token is not None:
--> 706                 if test_text.startswith(bos_token) or bos_token in chat_template:
    707                     add_special_tokens = False
    708                     print("Unsloth: We found double BOS tokens - we shall remove one automatically.")

AttributeError: 'int' object has no attribute 'startswith'
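In other words, a rough sketch of the failure (the values below are illustrative, not taken from the real dataset): with tokenize=True the "text" column holds token IDs, so the trainer's double-BOS check ends up calling .startswith on an int.

# Hedged sketch of the failure mode; the token IDs here are made up.
example_text = [2, 106, 1645]     # list of ints instead of a formatted string
test_text = example_text[0]       # the trainer's check reaches an int here
bos_token = "<bos>"
test_text.startswith(bos_token)   # AttributeError: 'int' object has no attribute 'startswith'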

With tokenize=False:

[screenshot of the successful run omitted]

@rolandtannous (Contributor) commented Apr 20, 2025

@Erland366 @naimur-29 He's right that tokenize=False should be used here; that's the recommended practice to avoid double tokenization. Check here and here. It might explain some of the errors people are getting.

It also depends on what you're passing to the Trainer. @naimur-29, do you have a readable copy of the .ipynb notebook?
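For reference, a rough sketch of that recommended flow (the dataset, model, and config values are placeholders, not the notebook's exact setup): format the conversations to plain strings in the map step, and let SFTTrainer do the only tokenization pass.

# Sketch of the "tokenize once" pattern; names and hyperparameters are illustrative.
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

dataset = dataset.map(apply_chat_template, batched=True)

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # "text" column now holds strings, tokenized once by the trainer
    args=SFTConfig(dataset_text_field="text", max_seq_length=2048),
)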

@rolandtannous (Contributor) commented
> @Erland366 @naimur-29 He's right that tokenize=False should be used here; that's the recommended practice to avoid double tokenization. Check here and here. It might explain some of the errors people are getting.
>
> It also depends on what you're passing to the Trainer. @naimur-29, do you have a readable copy of the .ipynb notebook?

Just double-checked this; that's not the issue. In Unsloth's custom apply_chat_template method, tokenize=False is already the default value, so there's no need to set it explicitly.
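One way to verify that default (a sketch, assuming the tokenizer has already been wrapped by Unsloth's get_chat_template):

# Sketch: check the default of the tokenize parameter on the patched method.
import inspect

sig = inspect.signature(tokenizer.apply_chat_template)
print(sig.parameters["tokenize"].default)  # expected: False, per the comment above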
