
Suggestion for doing coref for longer sequences? #33

Open
ehsan-soe opened this issue Jan 9, 2020 · 1 comment

Comments


ehsan-soe commented Jan 9, 2020

Hi,

First, thank you for providing this valuable resource.
According to Table 4 of the BERT paper, performance declines for long sequences (length 1152+).
If I want to do coref on my dataset, in which the average sequence length is 1500+, do you suggest using SpanBERT on my data as it is, or is it better to segment the data into pieces of length 512?
Of course, both options have drawbacks in that they negatively affect the performance of the pretrained model, but which approach do you suggest?

mandarjoshi90 (Owner) commented

Thanks for your interest, Ehsan. I'm not sure I understand the choices you're thinking about. The pre-trained SpanBERT model can only encode documents up to 512 tokens in a single instance. We handle longer documents by splitting them into non-overlapping chunks (we could not get overlapping chunks to work better), and encoding each independently using BERT. So span pairs in different chunks are only connected via the MLPs. I'm not sure what alternative you're referring to.
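
For illustration, here is a minimal Python sketch of the non-overlapping chunking described above. The function name `split_into_segments` and the fixed-length splitting are assumptions for this example and not the repo's API; the actual preprocessing may also respect sentence boundaries when forming segments.

```python
from typing import List

def split_into_segments(token_ids: List[int], max_segment_len: int = 512) -> List[List[int]]:
    """Split a tokenized document into non-overlapping segments of at most
    max_segment_len subword tokens. Each segment would then be encoded
    independently, as described above."""
    return [
        token_ids[start:start + max_segment_len]
        for start in range(0, len(token_ids), max_segment_len)
    ]

# Example: a 1500-token document becomes three non-overlapping segments.
doc = list(range(1500))  # stand-in for subword token ids
segments = split_into_segments(doc)
print([len(s) for s in segments])  # [512, 512, 476]
```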
