Description
I have some pre-tokenized text that wasn't tokenized using BPE, so I'm passing `word` as the token type to the `SentenceAligner` constructor. The texts are rather long and can be split into sentences, but this input raised the question of the expected maximum token length. I'd like to add some safeguards for future arbitrary inputs, both to ensure the model isn't returning incorrect results and to catch length issues further up in my data pipeline.
I see in the BERT model documentation that the maximum combined length of the two sets of tokens is 512: https://huggingface.co/google-bert/bert-base-multilingual-cased#preprocessing.
But when running with the `word` token type, I see:
`IndexError: index 510 is out of bounds for axis 0 with size 510`
When running with `bpe`, however, it succeeds (though perhaps the output is invalid). I added a check for the combined length, but even for lower values it fails with a similar error to the one above. Is there perhaps some lower length limit due to preprocessing done by simalign? And does the `bpe` token type somehow get around this?
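For context, the check I added is roughly along these lines. Counting mBERT subword pieces with the Hugging Face tokenizer is my own assumption about how the effective length is computed; the 512 budget and the allowance for special tokens are guesses, not something I've confirmed against simalign's internals.

```python
from transformers import AutoTokenizer

# Assumption: simalign runs mBERT under the hood, so I estimate the effective
# sequence length by counting subword pieces with the same tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

MAX_COMBINED = 512  # documented mBERT limit; special tokens likely reduce the real budget

def combined_subword_length(src_tokens, trg_tokens):
    """Count BERT subword pieces across both pre-tokenized sentences."""
    return sum(len(tokenizer.tokenize(word)) for word in src_tokens + trg_tokens)

def check_pair(src_tokens, trg_tokens):
    """Reject sentence pairs whose combined subword length exceeds the assumed limit."""
    n = combined_subword_length(src_tokens, trg_tokens)
    if n > MAX_COMBINED:
        raise ValueError(f"Pair too long for the model: {n} subword tokens > {MAX_COMBINED}")
    return n
```

Even with this in place, pairs that come in under the limit by my count still trigger the error with the `word` token type.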
If the limit needs to be respected, perhaps simalign should enforce it as an explicit precondition.