How much data is common between the two OSCAR versions? #19

ibraheem-moosa · 2022-02-17T10:56:20Z

How much data is shared between the two versions? Do they overlap in time? Is the new version a superset of the earlier version?

Thanks in advance!

Uinelj · 2022-02-17T12:12:37Z

Hello and thanks for your question :)

The short answer is: We don't know.

We have run basic word-occurrence counts in our papers (especially 21.09 vs. the upcoming 22.01 corpus), which suggest that the corpus retains information about events, but we haven't checked the number of duplicate documents between versions.

You may find part of an answer by looking into the overlap between CommonCrawl dumps, if there is any work on that.

@pjox Should we look into this type of stats?

pjox · 2022-02-17T13:17:16Z

I think we can do this for future versions. However, I think it will be extremely difficult to do for the 2019 and 21.09 versions, as we didn't have any document integrity for these. With the new schema (https://arxiv.org/abs/2201.06642) and the metadata, this will be much easier, and it is indeed something that we can explore in the future.
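
For anyone who wants a rough number in the meantime, here is a minimal sketch of exact-duplicate counting between two versions. It is not part of any OSCAR tooling: `doc_fingerprint` and `overlap_stats` are hypothetical names, and it assumes each version can be read as an iterable of text units. For versions without document integrity, those units would probably have to be lines rather than whole documents, and near-duplicates are not detected at all.

```python
import hashlib


def doc_fingerprint(text: str) -> str:
    """Hash a text unit after collapsing whitespace and lowercasing.

    The normalization is an assumption; a stricter or looser scheme
    would change what counts as a duplicate.
    """
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_stats(old_docs, new_docs):
    """Compare two iterables of text units by exact fingerprint."""
    old_hashes = {doc_fingerprint(d) for d in old_docs}
    new_hashes = {doc_fingerprint(d) for d in new_docs}
    shared = old_hashes & new_hashes
    return {
        "old_unique": len(old_hashes),
        "new_unique": len(new_hashes),
        "shared": len(shared),
        # Fraction of the older version that reappears in the newer one.
        "share_of_old": len(shared) / len(old_hashes) if old_hashes else 0.0,
        # True only if every unit of the older version is in the newer one.
        "is_superset": old_hashes <= new_hashes,
    }


if __name__ == "__main__":
    old = ["Hello  world", "foo"]
    new = ["hello world", "bar", "baz"]
    print(overlap_stats(old, new))
```

Holding every fingerprint in memory will not scale to a full corpus, so on real data this would likely need to run per language or use an on-disk or probabilistic set.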
