
maximum length of inputs #46

Open
joprice opened this issue Jan 15, 2025 · 0 comments
joprice commented Jan 15, 2025

I have some pre-tokenized text that wasn't tokenized with BPE, so I'm passing "word" as the token type to the SentenceAligner constructor. The texts are rather long and can be split into sentences, but this input raised the question of the expected maximum input length in tokens. I'd like to add safeguards for future arbitrary inputs, both to ensure the model isn't returning incorrect results and to catch length issues further up in my data pipeline.

I see in the BERT model documentation that the maximum combined length of the two token sequences is 512: https://huggingface.co/google-bert/bert-base-multilingual-cased#preprocessing.

But when running with the "word" token type, I see:

IndexError: index 510 is out of bounds for axis 0 with size 510

When running with "bpe", however, it succeeds (though perhaps the output is invalid). I added a check on the combined length, but even for values below 512 it fails with a similar error to the one above. Is there some lower effective limit due to preprocessing done by simalign? And does the "bpe" token type somehow avoid it?

If the limit should be enforced, perhaps simalign should have this as a precondition.
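For reference, the kind of guard I have in mind upstream looks roughly like this. It's only a sketch: `check_alignment_input` is a hypothetical helper (not part of simalign), and it assumes the effective budget is 510 combined tokens, i.e. 512 minus the [CLS] and [SEP] special tokens, which would match the `size 510` in the IndexError above.

```python
# Hypothetical pre-check, not part of simalign: reject token lists that
# would exceed the model's sequence limit before calling the aligner.
MAX_COMBINED_TOKENS = 510  # assumption: 512 minus [CLS] and [SEP]

def check_alignment_input(src_tokens, trg_tokens, limit=MAX_COMBINED_TOKENS):
    """Raise ValueError if the combined inputs may exceed the model's limit.

    Caveat: with word-level input, each word may still expand into several
    BPE subwords inside the model, so this check is necessary but not
    sufficient -- the true subword count is only known after tokenization.
    """
    combined = len(src_tokens) + len(trg_tokens)
    if combined > limit:
        raise ValueError(
            f"combined input length {combined} exceeds the assumed "
            f"model limit of {limit} tokens; split the texts into sentences"
        )
    return combined

# Short inputs pass through; over-long ones fail fast instead of producing
# an IndexError (or silently questionable alignments) inside the model.
check_alignment_input(["This", "is", "fine", "."], ["Das", "ist", "gut", "."])
```

If simalign enforced the limit itself, a precondition like this in the aligner would turn the IndexError into an actionable message.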
