How much data is common between the two OSCAR versions? #19

Open
ibraheem-moosa opened this issue Feb 17, 2022 · 2 comments

Comments

@ibraheem-moosa

How much data is shared between the two versions? Do they overlap in time? Is the new version a superset of the earlier version?

Thanks in advance!

@Uinelj
Member

Uinelj commented Feb 17, 2022

Hello and thanks for your question :)

The short answer is: We don't know.

In our papers we ran basic word-occurrence counts (notably comparing 21.09 with the upcoming 22.01 corpus), which suggest that the corpus possibly retains information about events, but we haven't checked the number of duplicate documents between versions.

You may find a partial answer by looking into the overlap between CommonCrawl dumps, if any work has been done on that.

@pjox Should we look into these kinds of stats?

@pjox
Member

pjox commented Feb 17, 2022

I think we can do this for future versions. However, I think it will be extremely difficult to do for the 2019 and 21.09 versions, as we didn't have any document integrity for these. With the new schema (https://arxiv.org/abs/2201.06642) and the metadata, this will be much easier, and it is indeed something that we can explore in the future.
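
For anyone who wants a rough estimate in the meantime, here is a minimal sketch of an exact-match overlap count between two versions. It is not a method the maintainers have described: it assumes each version can be read as an iterable of whole-document strings (which, per the comment above, may not hold for 2019 and 21.09), and the names `fingerprint` and `overlap_stats` are hypothetical. It only catches documents that are identical after whitespace and case normalization, so near-duplicates are missed.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a document after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_stats(old_docs, new_docs):
    """Count documents shared by two corpus versions (exact match only).

    Both arguments are assumed to be iterables of document strings.
    """
    old = {fingerprint(doc) for doc in old_docs}
    new = {fingerprint(doc) for doc in new_docs}
    shared = old & new
    return {
        "old": len(old),
        "new": len(new),
        "shared": len(shared),
        # Fraction of the older version that reappears in the newer one.
        "shared_fraction_of_old": len(shared) / len(old) if old else 0.0,
        # True only if every old document is also present in the new version.
        "new_is_superset": old <= new,
    }
```

Keeping only hashes in memory makes this workable for fairly large inputs, but for corpora at OSCAR's scale the sets would likely need to be sharded or replaced with an on-disk structure.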
