We have conducted basic word occurrence counts in papers (especially in 21.09 vs. the upcoming 22.01 corpus) showing that the corpus possibly retains information about events, but we haven't checked the number of duplicate documents between versions.
You may find a partial answer by looking into the overlaps between CommonCrawl dumps, if there is some work on that.
I think we can do this for future versions. However, it will be extremely difficult to do for the 2019 and 21.09 versions, as we didn't have any document integrity for these. With the new schema (https://arxiv.org/abs/2201.06642) and the metadata, this will be much easier, and it is indeed something that we can explore in the future.
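For illustration only, here is a minimal sketch of how document-level overlap between two versions could be measured once documents are kept whole. It is not something we have run on the corpus. The function names (`doc_fingerprint`, `overlap_stats`) and the normalization (lowercasing, collapsing whitespace) are assumptions made for this example, and exact hashing only catches exact duplicates after normalization, not near-duplicates.

```python
import hashlib


def doc_fingerprint(text: str) -> str:
    """Hash a document after light normalization (hypothetical choice:
    lowercase and collapse whitespace) so trivial formatting differences
    do not hide an otherwise identical document."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def overlap_stats(old_docs, new_docs):
    """Compare two iterables of document strings by fingerprint."""
    old = {doc_fingerprint(d) for d in old_docs}
    new = {doc_fingerprint(d) for d in new_docs}
    shared = old & new
    return {
        "old": len(old),
        "new": len(new),
        "shared": len(shared),
        # Fraction of the older version that reappears in the newer one.
        "share_of_old": len(shared) / len(old) if old else 0.0,
        # True only if every old document is also present in the new version.
        "new_is_superset": old <= new,
    }


if __name__ == "__main__":
    old_version = ["Hello  world", "foo bar"]
    new_version = ["hello world", "baz"]
    print(overlap_stats(old_version, new_version))
```

For real dumps the fingerprints would need to be streamed from disk rather than held as Python lists, and a near-duplicate method would be needed to catch pages that changed slightly between crawls.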
How much data is shared between the two versions? Do they overlap in time? Is the new version a superset of the earlier version?
Thanks in advance!