updated kaggle-Gemma3_(4B) notebook #18

Open · wants to merge 1 commit into main
Conversation

@naimur-29 commented Mar 22, 2025

Set tokenize=False in tokenizer.apply_chat_template.
It won't run otherwise, since tokenization happens twice. I ran into this minor issue when running the notebook today.

def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize = False) # set tokenize=False here
    return { "text" : texts }
pass

tokenizer = get_chat_template(
    tokenizer,
    tokenize = False, # and here (also in the inference cell and the two cells after it)
    chat_template = "gemma-3",
)
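For context, a minimal sketch (using the plain transformers API, with an illustrative model name) of what the flag changes: tokenize=True, the transformers default, returns token IDs, while tokenize=False returns the formatted string that the trainer's "text" column expects.

# Sketch: what tokenize= changes in transformers' apply_chat_template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-4b-it")  # illustrative checkpoint
messages = [{"role": "user", "content": "Hello!"}]

text = tokenizer.apply_chat_template(messages, tokenize=False)
print(type(text))  # <class 'str'> -- a formatted prompt string

ids = tokenizer.apply_chat_template(messages, tokenize=True)  # the default
print(type(ids))   # <class 'list'> -- token IDs (ints)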

@Erland366 (Collaborator) commented
It works fine in my case without that setting. Could you share the error?

@naimur-29 (Author) commented Mar 27, 2025

> It works fine in my case without that setting. Could you share the error?

As I mentioned, it won't run because tokenization happens twice: with the default tokenize=True, the mapped "text" column ends up holding token IDs instead of strings.

Without tokenize=False:

[screenshot of the notebook cell omitted]

Error later on:

[screenshot of the traceback omitted; text version below]

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-ada16cbe0bb7> in <cell line: 2>()
      1 from trl import SFTTrainer, SFTConfig
----> 2 trainer = SFTTrainer(
      3     model = model,
      4     tokenizer = tokenizer,
      5     train_dataset = dataset,

/usr/local/lib/python3.10/dist-packages/unsloth/trainer.py in new_init(self, *args, **kwargs)
    201             kwargs["args"] = config
    202         pass
--> 203         original_init(self, *args, **kwargs)
    204     pass
    205     return new_init

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func, **kwargs)
   1008         fix_zero_training_loss(model, tokenizer, train_dataset)
   1009 
-> 1010         super().__init__(
   1011             model = model,
   1012             args = args,

/usr/local/lib/python3.10/dist-packages/transformers/utils/deprecation.py in wrapped_func(*args, **kwargs)
    170                 warnings.warn(message, FutureWarning, stacklevel=2)
    171 
--> 172             return func(*args, **kwargs)
    173 
    174         return wrapped_func

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in __init__(self, model, args, data_collator, train_dataset, eval_dataset, processing_class, compute_loss_func, compute_metrics, callbacks, optimizers, optimizer_cls_and_kwargs, preprocess_logits_for_metrics, peft_config, formatting_func)
    458         preprocess_dataset = args.dataset_kwargs is None or not args.dataset_kwargs.get("skip_prepare_dataset", False)
    459         if preprocess_dataset:
--> 460             train_dataset = self._prepare_dataset(
    461                 train_dataset, processing_class, args, args.packing, formatting_func, "train"
    462             )

/kaggle/working/unsloth_compiled_cache/UnslothSFTTrainer.py in _prepare_dataset(self, dataset, processing_class, args, packing, formatting_func, dataset_name)
    704 
    705             if bos_token is not None:
--> 706                 if test_text.startswith(bos_token) or bos_token in chat_template:
    707                     add_special_tokens = False
    708                     print("Unsloth: We found double BOS tokens - we shall remove one automatically.")

AttributeError: 'int' object has no attribute 'startswith'
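In other words, a rough sketch of the failure (the values below are illustrative, not taken from the real dataset): with tokenize=True the "text" column holds token IDs, so the trainer's double-BOS check ends up calling .startswith on an int.

# Hedged sketch of the failure mode; the token IDs here are made up.
example_text = [2, 106, 1645]     # list of ints instead of a formatted string
test_text = example_text[0]       # the trainer's check reaches an int here
bos_token = "<bos>"
test_text.startswith(bos_token)   # AttributeError: 'int' object has no attribute 'startswith'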

With tokenize=False:

[screenshot of the successful run omitted]

@rolandtannous (Contributor) commented Apr 20, 2025

@Erland366 @naimur-29 He's right that tokenize=False should be used here; that's the recommended practice to avoid double tokenization. Check here and here. It might explain some of the errors people are getting.

It also depends on what you're passing to the Trainer. @naimur-29, do you have a readable copy of the .ipynb notebook?
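For reference, a rough sketch of that recommended flow (the dataset, model, and config values are placeholders, not the notebook's exact setup): format the conversations to plain strings in the map step, and let SFTTrainer do the only tokenization pass.

# Sketch of the "tokenize once" pattern; names and hyperparameters are illustrative.
def apply_chat_template(examples):
    texts = tokenizer.apply_chat_template(examples["conversations"], tokenize=False)
    return {"text": texts}

dataset = dataset.map(apply_chat_template, batched=True)

from trl import SFTTrainer, SFTConfig
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,  # "text" column now holds strings, tokenized once by the trainer
    args=SFTConfig(dataset_text_field="text", max_seq_length=2048),
)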

@rolandtannous (Contributor) commented
> @Erland366 @naimur-29 He's right that tokenize=False should be used here; that's the recommended practice to avoid double tokenization. Check here and here. It might explain some of the errors people are getting.
>
> It also depends on what you're passing to the Trainer. @naimur-29, do you have a readable copy of the .ipynb notebook?

Just double-checked this; that's not the issue. In Unsloth's custom apply_chat_template method, tokenize=False is already the default value, so there's no need to set it explicitly.
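One way to verify that default (a sketch, assuming the tokenizer has already been wrapped by Unsloth's get_chat_template):

# Sketch: check the default of the tokenize parameter on the patched method.
import inspect

sig = inspect.signature(tokenizer.apply_chat_template)
print(sig.parameters["tokenize"].default)  # expected: False, per the comment above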
