dataset seems broken in community-oscar? #25

Open
wenlai-lavine opened this issue Dec 8, 2024 · 0 comments

wenlai-lavine commented Dec 8, 2024

Hi,

I am trying to load the latest version of oscar-corpus/community-oscar and simply count the number of examples, but the dataset seems to be broken.

  • Simple test case
from datasets import load_dataset
ds = load_dataset('oscar-corpus/community-oscar', data_files='data/2024-22/af_meta/*.jsonl.zst', split='train', streaming=True)
count = 0
for _ in ds:
    count += 1
print(count)
  • Another test case: I also tried decompressing the zst file to get the jsonl file and loading that instead, but I still get the same error
from datasets import load_dataset
ds = load_dataset('json', data_files='DATA_PATH/af_meta.jsonl', split='train', streaming=True)
count = 0
for _ in ds:
    count += 1
print(count)
  • Error:
/arrow/cpp/src/arrow/array/data.cc:185:  Check failed: (off) <= (length) Slice offset (1062) greater than array length (1059)
Aborted (core dumped)
  • Other test cases

    • I also tried other data files, such as en_meta_part_1.jsonl.zst, and got the same error with all of them.
    • I also tried another version: loading OSCAR-2301 with the same code works without any problem
    from datasets import load_dataset
    ds = load_dataset('oscar-corpus/OSCAR-2301', language="af", split='train', streaming=True)
    count = 0
    for _ in ds:
        count += 1
    print(count)
    
    • There's no problem if I directly load the decompressed json file using json.load()
  • Python packages

    • python==3.10
    • datasets==3.1.0 [latest version]
    • pyarrow==18.1.0 [latest version]
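
To cross-check the count without going through Arrow at all, a minimal sketch along the following lines could be used. It assumes the decompressed file is newline-delimited JSON (one object per line). The function name `count_jsonl_records` is hypothetical and uses only the standard library. It also tallies the distinct sets of top-level keys, purely as a diagnostic; whether schema differences have anything to do with the Arrow failure above is not established.

```python
import json
from collections import Counter
from pathlib import Path


def count_jsonl_records(path):
    """Count records in a newline-delimited JSON file (hypothetical helper).

    Returns (record_count, key_sets), where key_sets maps each sorted tuple
    of top-level keys to the number of records that have exactly those keys.
    """
    record_count = 0
    key_sets = Counter()
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)  # raises ValueError on a malformed line
            record_count += 1
            if isinstance(record, dict):
                key_sets[tuple(sorted(record))] += 1
    return record_count, key_sets
```

Calling `count_jsonl_records('DATA_PATH/af_meta.jsonl')` (the placeholder path from the second test case) returns the total together with the key-set tally, which gives a reference number to compare against once the `datasets` loader works.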

@Uinelj Could anyone help with this? Thanks!

pjox self-assigned this on Jan 5, 2025