Open
Labels
bug (Something isn't working)
Description
This code fails:
from datasets import Dataset
ds = Dataset.from_json(path_to_file)
ds.data.validate()
It raises:
ArrowInvalid: Column 2: In chunk 1: Invalid: Struct child array #3 invalid: Invalid: Length spanned by list offsets (4064) larger than values array (length 4063)
This causes many issues for @TevenLeScao:
- map fails because it cannot rewrite the invalid Arrow arrays:
~/Desktop/hf/datasets/src/datasets/arrow_writer.py in write_examples_on_file(self)
    438 if all(isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) for row in self.current_examples):
    439     arrays = [row[0][col] for row in self.current_examples]
--> 440     batch_examples[col] = array_concat(arrays)
    441 else:
    442     batch_examples[col] = [
~/Desktop/hf/datasets/src/datasets/table.py in array_concat(arrays)
   1885
   1886     if not _is_extension_type(array_type):
-> 1887         return pa.concat_arrays(arrays)
   1888
   1889     def _offsets_concat(offsets):
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowIndexError: array slice would exceed array length
- to_dict() segfaults ⚠️
/Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/array/data.cc:99: Check failed: (off) <= (length) Slice offset greater than array length
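For context, the ArrowInvalid message above points at the invariant of Arrow's list layout: the offsets of a list array must not span more elements than its child values array holds. The sketch below restates that invariant in pure Python. It is an illustration only, not pyarrow's validation code, and `check_list_offsets` is a hypothetical helper name.

```python
def check_list_offsets(offsets, values_length):
    """Raise ValueError if the offsets cannot index a values array of this length.

    Hypothetical helper: it mimics the list-layout invariant and is not
    the check that pyarrow itself runs.
    """
    if not offsets:
        raise ValueError("a list array needs at least one offset")
    # List i is values[offsets[i]:offsets[i + 1]], so offsets must never decrease.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets decrease from {prev} to {cur}")
    spanned = offsets[-1] - offsets[0]
    if spanned > values_length:
        raise ValueError(
            f"Length spanned by list offsets ({spanned}) "
            f"larger than values array (length {values_length})"
        )


# Consistent: two lists that together cover exactly 3 values.
check_list_offsets([0, 2, 3], values_length=3)

# Inconsistent, mirroring the off-by-one in the report (4064 vs 4063).
try:
    check_list_offsets([0, 4064], values_length=4063)
except ValueError as err:
    print(err)
```

An array that breaks this invariant would also explain the two failures listed above, since slicing or concatenating it reads past the end of the values buffer.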
To reproduce: unzip the archive and run the above code using sanity_oscar_en.jsonl
sanity_oscar_en.jsonl.zip
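The unzip step can be sketched with the standard library alone. The sketch assumes the attachment is saved in the working directory; `extract_jsonl` and `count_json_lines` are hypothetical helper names. The line-by-line parse is an optional sanity check to rule out malformed JSON before suspecting the Arrow reader.

```python
import json
import zipfile
from pathlib import Path


def extract_jsonl(zip_path, out_dir):
    """Extract the archive and return the paths of the .jsonl files it contained."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        names = archive.namelist()
    # Skip the resource-fork entries that macOS adds to some archives.
    return [
        out_dir / name
        for name in names
        if name.endswith(".jsonl") and not name.startswith("__MACOSX/")
    ]


def count_json_lines(path):
    """Parse every non-empty line as JSON and return how many lines parsed."""
    count = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises json.JSONDecodeError on a malformed line
                count += 1
    return count


# Example usage (assumes the archive from this issue is in the current directory):
# paths = extract_jsonl("sanity_oscar_en.jsonl.zip", "data")
# path_to_file = str(paths[0])
# print(count_json_lines(path_to_file))
```

If every line parses, the input file itself is well-formed JSON Lines, and `path_to_file` can be passed to the failing code above.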
PS: reading the file with pandas and converting it to Arrow works, though note that the dataset lives in RAM in this case:
import pandas as pd
ds = Dataset.from_pandas(pd.read_json(path_to_file, lines=True))
ds.data.validate()