Issue encountered
I noticed that the greedy_until function in TransformersModel uses excessive padding. In my case, I have a test set where the largest input has 27k tokens but most inputs are under 8k tokens. The current implementation uses max_context_continuation_size_allowed as the max_length in the tokenizer, which corresponds to the number of tokens of the largest sample in the entire dataset plus the maximum number of output tokens. This unnecessarily increases evaluation time.
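For illustration, here is a minimal, standalone sketch (not lighteval code) of why a dataset-wide max_length is wasteful: with a Hugging Face tokenizer, padding="max_length" pads every batch out to the global maximum, while padding="longest" pads only to the longest sample in the current batch. The gpt2 tokenizer and the concrete lengths below are placeholders, not the values lighteval uses.

# Minimal sketch: padding every batch to the dataset-wide maximum wastes compute
# compared to padding each batch only to its own longest sample.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the illustration
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default

batch = ["a short prompt", "another fairly short prompt"]
dataset_wide_max = 27_000  # length of the single longest sample in the whole test set

# Current behaviour: every batch is padded out to the dataset-wide maximum.
padded_to_dataset_max = tokenizer(
    batch, padding="max_length", max_length=dataset_wide_max,
    truncation=True, return_tensors="pt",
)

# Per-batch behaviour: pad only up to the longest sample in this batch.
padded_to_batch_max = tokenizer(batch, padding="longest", return_tensors="pt")

print(padded_to_dataset_max["input_ids"].shape)  # (2, 27000)
print(padded_to_batch_max["input_ids"].shape)    # (2, <longest prompt in this batch>)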
Solution/Feature
Instead of using max_context_continuation_size_allowed when tokenizing the batch contexts, it would be better to use something like this (untested):
largest_sample_in_batch = len(batch[0].tokenized_context)
max_generation_size = batch[0].generation_size if batch[0].generation_size else self.max_length - largest_sample_in_batch
max_length = min(largest_sample_in_batch + max_generation_size, self.max_length)
tokenized = self.tokenizer(
    ...
    max_length=max_length,  # Only this needs to change
    ...
).to(self.device)
The calculations are essentially the same as the ones already done in the code; the only difference is that max_length is determined from the first sample in the batch rather than the first sample in the entire dataset.
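For clarity, here is a self-contained sketch of the per-batch computation above. The Request dataclass and the max_model_length parameter are stand-ins for lighteval's actual request objects and the model's self.max_length; they are illustrative only, not the real API.

# Hedged, self-contained sketch of the proposed per-batch max_length computation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    tokenized_context: list[int]    # token ids of the prompt (hypothetical field name)
    generation_size: Optional[int]  # requested number of new tokens, if any


def batch_max_length(batch: list[Request], max_model_length: int) -> int:
    """max_length for tokenizing this batch, assuming it is sorted longest-first."""
    largest_sample_in_batch = len(batch[0].tokenized_context)
    max_generation_size = (
        batch[0].generation_size
        if batch[0].generation_size
        else max_model_length - largest_sample_in_batch
    )
    # Never exceed the model's context window.
    return min(largest_sample_in_batch + max_generation_size, max_model_length)


# Example: a batch whose longest prompt has 6,000 tokens and asks for 256 new
# tokens only needs a max_length of 6,256, not the dataset-wide 27k+ tokens.
batch = [Request(tokenized_context=list(range(6_000)), generation_size=256),
         Request(tokenized_context=list(range(4_500)), generation_size=256)]
print(batch_max_length(batch, max_model_length=32_768))  # 6256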
If you think this makes sense, I could open a pull request.