Add attention_mask to signature_columns #4459
Conversation
qgallouedec
left a comment
Thank you for your contribution.
We don't actually need "attention_mask" in the signature, because:
- we don't get it from tokenization, so there is no "attention_mask" column anyway;
- the collator doesn't use the attention mask, but builds it from the input_ids.
So, unless I'm mistaken, this PR can be closed.
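For context, here is a minimal sketch of what "builds it from the input_ids" means — a collator pads the batch and derives the attention mask from which positions hold real tokens. This is hypothetical illustration code, not TRL's actual collator:

```python
def collate(batch, pad_token_id=0):
    """Pad a batch of variable-length examples and derive the attention mask.

    `batch` is a list of dicts with an "input_ids" list each. The attention
    mask is 1 for real tokens and 0 for padding, so no "attention_mask"
    column is needed in the dataset.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    input_ids, attention_mask = [], []
    for ex in batch:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```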
We use the tokenizer with … To give you some context on why we need it: we have overridden the evaluation loop in SFTTrainer so that we can evaluate by generating the complete sequence instead of just predicting the next token. For this we pass a preprocessed dataset and a custom evaluation data collator that simply adds padding to the input text and uses …
Another place this can be used: to test the robustness of models to small perturbations. During dataset preprocessing, we modify …
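As an illustration of that idea, here is a hedged sketch of a preprocessing step that perturbs the attention mask by randomly dropping attended positions. The function name and `drop_prob` parameter are assumptions for illustration, not details from this thread:

```python
import random

def perturb_attention_mask(example, drop_prob=0.1, seed=None):
    """Randomly zero out a fraction of attended positions in the mask.

    Simulates small input perturbations: positions with mask value 1 are
    flipped to 0 with probability `drop_prob`; padding positions (already 0)
    are left unchanged.
    """
    rng = random.Random(seed)
    mask = [
        0 if (m == 1 and rng.random() < drop_prob) else m
        for m in example["attention_mask"]
    ]
    return {**example, "attention_mask": mask}
```

Such a function could be applied with `dataset.map(...)`, which is why the modified `attention_mask` column then needs to survive the trainer's signature-based column filtering.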
Ok, thanks for the clarification, so this is a requirement that originates from the customization. In your case I think the easiest is to do:

```python
from trl import SFTTrainer as _SFTTrainer

class SFTTrainer(_SFTTrainer):
    def _set_signature_columns_if_needed(self):
        if self._signature_columns is None:
            self._signature_columns = ["input_ids", "labels", "attention_mask", "seq_lengths", "completion_mask", "assistant_masks"]
```
Thanks @qgallouedec for the tip. This is the workaround I have right now. Though it does feel that this is not compatible with the following usage of SFT: someone passes a preprocessed dataset, a custom data_collator, and sets …
Would it work if you just pass …?
@qgallouedec, I think this might work. I'll test it and close the PR if it works. Thank you :)
I'll close this PR, feel free to re-open if needed :) |
What does this PR do?
It seems that during the addition of the assistant_mask_only mechanism, the signature was modified to remove `attention_mask` from the required columns. I believe this behaviour is wrong, as models generally need `attention_mask` as a parameter to work correctly, so this PR adds it back. It is especially useful when training with a preprocessed, tokenized dataset.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.