I have some pre-tokenized text that wasn't tokenized using BPE, so I'm passing `word` to the `SentenceAligner` constructor. The texts are rather long but can be split into sentences; this input data raised the question of the expected maximum number of tokens. I'd like to add some safeguards for future arbitrary inputs, both to ensure the model isn't returning incorrect results and to catch length issues further up in my data pipeline.

I see in the BERT model documentation that the maximum combined length of the two sets of tokens is 512: https://huggingface.co/google-bert/bert-base-multilingual-cased#preprocessing.

But when running with the `word` token type, I see:

```
IndexError: index 510 is out of bounds for axis 0 with size 510
```

When running with `bpe`, however, it succeeds (though perhaps the output is invalid). I added a check on the combined length, but even for lower values it fails with a similar error. Is there perhaps some lower length limit due to preprocessing done by simalign? And does the `bpe` token method somehow get around this?
If the limit should be enforced, perhaps simalign should have this as a precondition.
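For context, a safeguard of the kind described above could be sketched like this. Everything here is hypothetical: `check_pair_length` and its defaults are not part of simalign's API, and the budget of 3 special tokens (`[CLS] src [SEP] tgt [SEP]`) assumes the two sentences are packed into a single BERT input. The key point is that pre-tokenized words expand into subwords, so counting raw words is not enough:

```python
def check_pair_length(src_words, tgt_words, tokenize, max_len=512, num_specials=3):
    """Guard a pre-tokenized sentence pair against a model's length limit.

    `tokenize` maps one word to its subword pieces (e.g. `tokenizer.tokenize`
    from HuggingFace transformers); each word may expand to several subwords,
    which is why checking the word count alone can pass while the model's
    subword input still overflows.
    """
    total = (
        sum(len(tokenize(w)) for w in src_words)
        + sum(len(tokenize(w)) for w in tgt_words)
        + num_specials  # assumed [CLS]/[SEP]/[SEP] overhead for a packed pair
    )
    if total > max_len:
        raise ValueError(
            f"sentence pair is {total} subword tokens, over the {max_len} limit; "
            "split the texts into shorter sentences first"
        )
    return total
```

With a real tokenizer this would be called as e.g. `check_pair_length(src, tgt, tokenizer.tokenize)` using the tokenizer for `bert-base-multilingual-cased`; how many special tokens simalign actually inserts, and whether it truncates internally, is an assumption to be confirmed against its source.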