OSCAR-2109 huggingface datasets are misaligned and truncated #18
Hi @Uinelj, I am still dealing with alignment issues between metadata and text. I run:
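For reference, loading one of the affected subsets looks roughly like this (the config name `deduplicated_fi` is only an example, `use_auth_token=True` assumes gated access has already been granted, and the `text`/`meta` field names are assumed from the loading script discussed below):

```python
from datasets import load_dataset

# Example only: any multi-part subset such as deduplicated_fi shows the issue.
dataset = load_dataset(
    "oscar-corpus/OSCAR-2109",
    "deduplicated_fi",
    split="train",
    use_auth_token=True,  # OSCAR-2109 is gated, so an HF token is needed
)

# Spot-check one record's text against its metadata.
print(dataset[0]["text"][:200])
print(dataset[0]["meta"])
```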
I'll look into it today.
Running the whole process again (with the cache removed). I'm looking for ways of using a local loading script in order to compare matches and find eventual mismatches.
Thanks a lot @Uinelj, I might have had a problem with caching then. I will try again! I'll update you on my results.
If you're using the updated version with the sorted parts, you still need to process long enough that you get to at least the second part, or you won't see most of the remaining misalignments. For smaller languages (whatever language has more than one part but is still quick to download), clone the dataset repo locally, then load the dataset from the local path instead of the name. To test the patch, just apply the patch above to the local copy of the loading script. It's possible that you can also provide the patched script locally somehow and still load the data from the remote repo, but I couldn't quickly figure out how to do it.
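A minimal sketch of that local workflow (the clone path and config name are placeholders; this assumes the repo was cloned with git-lfs so the data files are present, and that the local `OSCAR-2109.py` has been patched):

```python
from datasets import load_dataset

# Assumes something like:
#   git lfs install
#   git clone https://huggingface.co/datasets/oscar-corpus/OSCAR-2109 /data/OSCAR-2109
# with the loading script in /data/OSCAR-2109 patched locally.
dataset = load_dataset(
    "/data/OSCAR-2109",     # local path instead of "oscar-corpus/OSCAR-2109"
    "deduplicated_mk",      # example config; use whichever subset you are checking
    split="train",
)
```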
@adrianeboyd, thanks! The problem is that I cannot use a local loading script on my end.
The first patch (that is, sorting the text and metadata files so that they are in sync) has already been merged into the current loading script. The second one, suggested by @adrianeboyd, is not yet merged because I've been trying to find a rapid way of debugging the issue and checking the proposed fix. Once I have that, I'll happily merge the code.
@Uinelj, so sorry to bug you about this again, but do you have an approximate timeline for when this will be done? I unfortunately have a deadline on my project and might need to think about alternatives if there is no chance of having a correctly aligned version of OSCAR available via the HF API.
I'll allocate more time to this in the coming week.
Hello @norakassner, @albertvillanova has pushed a fix in huggingface/datasets#3910. When the PR is merged, the issue should be fixed! 🎉
I haven't tested the newlines fix (I don't think it will help because of how the dataset script is reading the files). The offset fix can be applied as a diff against `OSCAR-2109.py`:

```diff
diff --git a/OSCAR-2109.py b/OSCAR-2109.py
index c2c94595..53e66dc3 100644
--- a/OSCAR-2109.py
+++ b/OSCAR-2109.py
@@ -397,8 +397,8 @@ class Oscar2109(datasets.GeneratorBasedBuilder):
     def _generate_examples(self, metadata_and_text_files):
         """This function returns the examples in the raw (text) form by iterating on all the files."""
         id_ = 0
-        offset = 0
         for meta_path, text_path in metadata_and_text_files:
+            offset = 0
             logger.info("generating examples from = %s", text_path)
             with gzip.open(open(text_path, "rb"), "rt", encoding="utf-8") as text_f:
                 with gzip.open(open(meta_path, "rb"), "rt", encoding="utf-8") as meta_f:
```

I will try to test the newlines fix in the next day or two.
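For context on the newlines issue: with Python's default universal-newline handling, a stray carriage return inside a document is treated as an extra line break, which throws off any line-based offsets. A small self-contained demo (independent of the actual OSCAR files):

```python
import gzip
import io

# One "line" by \n-counting that happens to contain a stray carriage return.
raw = "first half\rsecond half\n".encode("utf-8")
compressed = io.BytesIO(gzip.compress(raw))

# Default text mode (newline=None) uses universal newlines: "\r" becomes a break.
with gzip.open(compressed, "rt", encoding="utf-8") as f:
    print(len(f.readlines()))  # 2

compressed.seek(0)

# Restricting line breaks to "\n" keeps the document on a single line.
with gzip.open(compressed, "rt", encoding="utf-8", newline="\n") as f:
    print(len(f.readlines()))  # 1
```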
Hi, in relation to the bugs:
CC: @Uinelj @pjox @norakassner
The issues should now be fixed:
@albertvillanova Thank you! I removed the cache and downloaded the dataset again. Somehow, I still have the exact same misalignment as reported earlier: #18 (comment). Maybe I missed removing some important cache? I checked whether it is using the newest OSCAR-2109.py version; this is the case. It also accessed the most recent download. I am not sure what is going on. Any idea?
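Worth noting: the `datasets` library caches loading scripts separately from the data (under `~/.cache/huggingface/modules/datasets_modules`), so clearing only `~/.cache/huggingface/datasets` can leave an old `OSCAR-2109.py` in use. One way to rule caching out (the config name is just an example):

```python
from datasets import load_dataset

# force_redownload ignores any previously prepared/cached copy of this config.
dataset = load_dataset(
    "oscar-corpus/OSCAR-2109",
    "deduplicated_mk",            # example config
    split="train",
    download_mode="force_redownload",
    use_auth_token=True,
)

# If an old script is suspected, also delete the cached module directory, e.g.:
#   rm -rf ~/.cache/huggingface/modules/datasets_modules/datasets/*OSCAR-2109*
```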
Copied from: huggingface/datasets#3704
As mentioned in the comments, potentially related to: #15
The only way that I got a simple `wc -w` on the raw texts from git-lfs in the repo at https://huggingface.co/datasets/oscar-corpus/OSCAR-2109 to exactly match `wc -w` on all the texts exported from the loaded dataset was to fix all three issues mentioned below, plus not stripping all trailing whitespace. Just pairing the text/meta filenames was not sufficient.

Describe the bug
The `oscar-corpus/OSCAR-2109` data appears to be misaligned and truncated by the dataset builder for subsets that contain more than one part and for cases where the texts contain non-unix newlines.

Steps to reproduce the bug
A few examples, although I'm not sure how deterministic the particular (mis)alignment is in various configurations:
For `deduplicated_fi`, all exported raw texts from the dataset are 17GB rather than 20GB as reported in the data splits overview table. The token count with `wc -w` for the raw texts is 2,067,556,874 rather than the expected 2,357,264,196 from the data splits table.

For `deduplicated_no`, all exported raw texts contain 624,040,887 rather than the expected 776,354,517 tokens.

For `deduplicated_mk`, it is 122,236,936 rather than 134,544,934 tokens.

I'm not expecting the `wc -w` counts to line up exactly with the data splits table, but for comparison the `wc -w` count for `deduplicated_mk`
on the raw texts is 134,545,424.

Issues

- The text and metadata part files are not reliably paired/sorted, so a text part can be read against the wrong metadata part.
- The line offset is not reset for each text/metadata file pair, so the alignment drifts after the first part.
- Texts containing non-unix newlines are split into extra lines, which shifts the line-based offsets and truncates documents.
Expected results
All texts from the OSCAR release are extracted according to the metadata and aligned with the correct metadata.
Fixes
These are not necessarily the exact fixes/checks you may want to use (I didn't test all languages or do any cross-platform testing, and I'm not sure all the details are compatible with streaming); however, to highlight the issues:
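A rough sketch of the three fixes/checks (not the exact patch from the original issue): the generator structure and `metadata_and_text_files` follow the diff above, while the metadata field names `offset` and `nb_sentences`, 0-based per-part offsets, and blank-line document separators are assumptions about the OSCAR 21.09 layout.

```python
import gzip
import json


def generate_examples(metadata_and_text_files):
    """Illustrative only: sorted pairing, per-part offset reset, LF-only line
    handling, and the blank-line/EOF assertions discussed in this thread."""
    id_ = 0
    # Fix 1: sort so that metadata part N is always read with text part N.
    for meta_path, text_path in sorted(metadata_and_text_files):
        # Fix 2: the line offset must restart at 0 for every part.
        offset = 0
        # Fix 3: newline="\n" keeps stray "\r" characters inside documents
        # from being counted as extra line breaks.
        with gzip.open(text_path, "rt", encoding="utf-8", newline="\n") as text_f, \
                gzip.open(meta_path, "rt", encoding="utf-8") as meta_f:
            for raw_meta in meta_f:
                meta = json.loads(raw_meta)
                # Alignment check, assuming 0-based offsets relative to each part.
                assert meta["offset"] == offset, (meta_path, meta["offset"], offset)
                lines = []
                for _ in range(meta["nb_sentences"]):
                    line = text_f.readline()
                    assert line.endswith("\n"), "unexpected EOF inside a document"
                    lines.append(line)
                # Strip only the final newline, not all trailing whitespace.
                text = "".join(lines).rstrip("\n")
                # Documents are separated by a blank line; at EOF there is none.
                separator = text_f.readline()
                assert separator in ("\n", ""), separator
                offset += meta["nb_sentences"] + 1
                yield id_, {"id": id_, "text": text, "meta": meta}
                id_ += 1
```

The blank-line and EOF assertions are mainly there so that any misalignment fails loudly instead of producing a spuriously plausible text.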
I've tested this with a number of smaller deduplicated languages with 1-20 parts, and the resulting datasets looked correct in terms of word count and size when compared to the data splits table and raw texts; the text/metadata alignments were also correct in all my spot checks. However, there are many, many languages I didn't test, and I'm not sure there aren't any texts containing blank lines in the corpus, for instance. For the cases I tested, the assertions related to blank lines and EOF made it easier to verify that the text and metadata were aligned as intended, since there would be little chance of spurious alignments of variable-length texts across so much data.
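As a rough illustration of the kind of word-count comparison used above (not the original verification code; the config name and output path are placeholders, and the `text` field is assumed from the loading script):

```python
from datasets import load_dataset

# Export the loaded texts and count words roughly the way `wc -w` does on the
# concatenated raw *.txt.gz files for the same subset from the git-lfs repo.
dataset = load_dataset("oscar-corpus/OSCAR-2109", "deduplicated_mk",
                       split="train", use_auth_token=True)

total_words = 0
with open("deduplicated_mk_export.txt", "w", encoding="utf-8") as out_f:
    for example in dataset:
        out_f.write(example["text"] + "\n")
        # str.split() with no arguments splits on whitespace runs, much like wc -w.
        total_words += len(example["text"].split())

print(total_words)
```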