
Conversation

@shubhamjain0594

What does this PR do?

It seems that during the addition of the assistant_mask_only mechanism, the signature was modified to remove attention_mask from the required columns. I believe this behaviour is wrong, since models generally need attention_mask as a parameter to work. This PR adds it back. It is especially useful when training with a preprocessed, tokenized dataset.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

Member

@qgallouedec qgallouedec left a comment


Thank you for your contribution.
We don't actually need "attention_mask" in the signature, because:

  • we don't get it from tokenization, so there is no "attention_mask" column anyway;
  • the collator doesn't use the attention mask, but builds it from the input_ids.

So, unless I'm mistaken, this PR can be closed.
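For context, the second point can be sketched in plain Python (a hypothetical illustration of the idea, not TRL's actual collator code): a padding collator can derive the attention mask from input_ids alone, which is why the "attention_mask" column does not need to survive column filtering.

```python
def collate(batch, pad_token_id=0):
    """Pad variable-length input_ids and build the attention mask from them.

    Illustrative sketch; pad_token_id and field names are assumptions.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        # Real tokens get 1, padding gets 0 -- no stored mask required.
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```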

@shubhamjain0594
Author

shubhamjain0594 commented Nov 5, 2025

We use the tokenizer with encode_plus instead of just encode, which does return the attention mask:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
# Unlike encode, encode_plus returns a dict that includes "attention_mask".
tokenizer.encode_plus("Hello, world!")

To give you some context on why we need it: we have overridden the evaluation loop in SFTTrainer so that we can evaluate by generating the complete sequence instead of just predicting the next token. To do this, we pass a preprocessed dataset and a custom evaluation data collator that simply pads the input text and uses attention_mask for generation during evaluation. But the remove-unused-columns step strips attention_mask from the data, which leads to an error in loss calculation.
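To make the setup concrete, here is a minimal sketch of such an evaluation collator (hypothetical code, not our actual implementation; pad_token_id and field names are assumptions). It left-pads so that generation continues from the real tokens, and it keeps the attention_mask it was given:

```python
def eval_collate(batch, pad_token_id=0):
    """Left-pad input_ids for generation-based evaluation.

    Sketch of the kind of custom eval collator described above.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        # Left padding: generate() should continue after the real tokens.
        input_ids.append([pad_token_id] * pad + ids)
        attention_mask.append([0] * pad + [1] * len(ids))
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

If the attention_mask column has already been dropped upstream, a collator like this (or one that relies on a preprocessed mask) breaks.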

@shubhamjain0594
Author

shubhamjain0594 commented Nov 6, 2025

Another place this can be used: testing the robustness of models to small perturbations. During dataset preprocessing, we modify attention_mask to add zeros at random positions to see how the model responds.
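For illustration, the perturbation step looks roughly like this (a hypothetical sketch of the idea, not our production preprocessing code):

```python
import random

def perturb_attention_mask(attention_mask, drop_prob=0.1, seed=0):
    """Zero out random positions of an attention mask to probe robustness.

    Illustrative sketch; drop_prob and the seeding scheme are assumptions.
    """
    rng = random.Random(seed)
    return [0 if m == 1 and rng.random() < drop_prob else m
            for m in attention_mask]
```

Since the perturbed mask is produced at preprocessing time, it must survive until the forward pass, which is exactly what column filtering prevents.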

@qgallouedec
Member

Ok, thanks for the clarification, so this requirement originates from your customization. In your case, I think the easiest fix is:

from trl import SFTTrainer as _SFTTrainer

class SFTTrainer(_SFTTrainer):
    def _set_signature_columns_if_needed(self):
        if self._signature_columns is None:
            self._signature_columns = [
                "input_ids",
                "labels",
                "attention_mask",
                "seq_lengths",
                "completion_mask",
                "assistant_masks",
            ]

@shubhamjain0594
Author

Thanks @qgallouedec for the tip. This is the workaround I have right now.

Though it does feel that this is not compatible with the following usage of SFTTrainer: someone passes a preprocessed dataset and a custom data_collator, and sets skip_prepare_dataset=True. In that case you would expect everything to still work.

@qgallouedec
Member

Would it work if you just pass remove_unused_columns=False in the config?
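A minimal sketch of that suggestion (config fragment; remove_unused_columns is inherited from transformers.TrainingArguments, and output_dir here is a placeholder):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="out",
    # Keep attention_mask (and any other extra columns) in the batch.
    remove_unused_columns=False,
)
```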

@shubhamjain0594
Author

@qgallouedec, I think this might work. I will test it and close the PR if it works. Thank you :)

@qgallouedec
Member

I'll close this PR, feel free to re-open if needed :)
