
Suggestion for doing coref for longer sequences? #33

Open
ehsan-soe opened this issue Jan 9, 2020 · 1 comment

Comments


ehsan-soe commented Jan 9, 2020

Hi,

First, thank you for providing this valuable resource.
According to Table 4 of the BERT paper, performance declines for long sequences (length 1152+).
If I want to do coref on my dataset, in which the average sequence length is 1500+, do you suggest using SpanBERT on my data as it is, or is it better to segment the data into pieces of length 512?
Of course, both options have drawbacks in that they negatively affect the performance of the pretrained model, but which approach do you suggest?

mandarjoshi90 (Owner) commented

Thanks for your interest, Ehsan. I'm not sure I understand the choices you're thinking about. The pre-trained SpanBERT model can only encode documents up to 512 tokens in a single instance. We handle longer documents by splitting them into non-overlapping chunks (we could not get overlapping chunks to work better), and encoding each independently using BERT. So span pairs in different chunks are only connected via the MLPs. I'm not sure what alternative you're referring to.
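
For illustration, here is a minimal Python sketch of the non-overlapping chunking described above. The function name `split_into_segments` and the fixed-length splitting are assumptions for this example and not the repo's API; the actual preprocessing may also respect sentence boundaries when forming segments.

```python
from typing import List

def split_into_segments(token_ids: List[int], max_segment_len: int = 512) -> List[List[int]]:
    """Split a tokenized document into non-overlapping segments of at most
    max_segment_len subword tokens. Each segment would then be encoded
    independently, as described above."""
    return [
        token_ids[start:start + max_segment_len]
        for start in range(0, len(token_ids), max_segment_len)
    ]

# Example: a 1500-token document becomes three non-overlapping segments.
doc = list(range(1500))  # stand-in for subword token ids
segments = split_into_segments(doc)
print([len(s) for s in segments])  # [512, 512, 476]
```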
