Description
I have some pre-tokenized text that wasn't tokenized using BPE, so I'm passing `word` as the token type to the `SentenceAligner` constructor. The texts are rather long and can be split into sentences, but this input raised the question of the expected maximum token length. I'd like to add some safeguards for future arbitrary inputs, both to ensure the model isn't returning incorrect results and to catch length issues further up in my data pipeline.
I see in the BERT model documentation that the maximum combined length of the two sets of tokens is 512: https://huggingface.co/google-bert/bert-base-multilingual-cased#preprocessing.
But when running with the `word` token type, I see:
`IndexError: index 510 is out of bounds for axis 0 with size 510`
When running with `bpe`, however, it succeeds (though perhaps the output is invalid). I added a check for the combined length, but even for lower values it fails with a similar error to the one above. Is there perhaps some lower length limit due to preprocessing done by simalign? And does the `bpe` token type somehow get around this?
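For context, the check I added is roughly along these lines. Counting mBERT subword pieces with the Hugging Face tokenizer is my own assumption about how the effective length is computed; the 512 budget and the allowance for special tokens are guesses, not something I've confirmed against simalign's internals.

```python
from transformers import AutoTokenizer

# Assumption: simalign runs mBERT under the hood, so I estimate the effective
# sequence length by counting subword pieces with the same tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

MAX_COMBINED = 512  # documented mBERT limit; special tokens likely reduce the real budget

def combined_subword_length(src_tokens, trg_tokens):
    """Count BERT subword pieces across both pre-tokenized sentences."""
    return sum(len(tokenizer.tokenize(word)) for word in src_tokens + trg_tokens)

def check_pair(src_tokens, trg_tokens):
    """Reject sentence pairs whose combined subword length exceeds the assumed limit."""
    n = combined_subword_length(src_tokens, trg_tokens)
    if n > MAX_COMBINED:
        raise ValueError(f"Pair too long for the model: {n} subword tokens > {MAX_COMBINED}")
    return n
```

Even with this in place, pairs that come in under the limit by my count still trigger the error with the `word` token type.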
If the limit needs to be respected, perhaps simalign should enforce it as an explicit precondition.