Invalid Arrow data from JSONL #5531

@lhoestq

This code fails:

from datasets import Dataset

path_to_file = "sanity_oscar_en.jsonl"  # extracted from the attached archive
ds = Dataset.from_json(path_to_file)
ds.data.validate()  # raises ArrowInvalid

raises

ArrowInvalid: Column 2: In chunk 1: Invalid: Struct child array #3 invalid: Invalid: Length spanned by list offsets (4064) larger than values array (length 4063)
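For readers unfamiliar with the message: a list array stores its items in one flat values array plus an offsets array, and list `i` covers `values[offsets[i]:offsets[i + 1]]`. The error says the offsets reach one item past the end of the values. Below is a minimal pure-Python sketch of that invariant. The helper name `check_list_offsets` is hypothetical, and this illustrates the rule only; it is not Arrow's actual validation code.

```python
from typing import Sequence


def check_list_offsets(offsets: Sequence[int], values_length: int) -> None:
    """Raise ValueError if `offsets` cannot describe lists over `values_length` items."""
    if len(offsets) == 0:
        raise ValueError("offsets must contain at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    # Offsets must never decrease: each list ends where the next one starts.
    for prev, cur in zip(offsets, offsets[1:]):
        if cur < prev:
            raise ValueError(f"offsets decrease: {prev} -> {cur}")
    # The range covered by the offsets must fit inside the values array.
    spanned = offsets[-1] - offsets[0]
    if spanned > values_length:
        raise ValueError(
            f"Length spanned by list offsets ({spanned}) "
            f"larger than values array (length {values_length})"
        )


# Three lists over four values: [v0, v1], [], [v2, v3]
check_list_offsets([0, 2, 2, 4], values_length=4)  # passes silently
```

With a final offset of 4064 and only 4063 values, this check fails in the same way as the reported error.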

This causes many issues for @TevenLeScao:

  • map fails because it cannot rewrite the invalid Arrow arrays

    ~/Desktop/hf/datasets/src/datasets/arrow_writer.py in write_examples_on_file(self)
        438             if all(isinstance(row[0][col], (pa.Array, pa.ChunkedArray)) for row in self.current_examples):
        439                 arrays = [row[0][col] for row in self.current_examples]
    --> 440                 batch_examples[col] = array_concat(arrays)
        441             else:
        442                 batch_examples[col] = [
    
    ~/Desktop/hf/datasets/src/datasets/table.py in array_concat(arrays)
    1885 
    1886     if not _is_extension_type(array_type):
    -> 1887         return pa.concat_arrays(arrays)
    1888 
    1889     def _offsets_concat(offsets):
    
    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.concat_arrays()
    
    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
    
    ~/.virtualenvs/hf-datasets/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
    
    ArrowIndexError: array slice would exceed array length
  • to_dict() segfaults ⚠️

    /Users/runner/work/crossbow/crossbow/arrow/cpp/src/arrow/array/data.cc:99:  Check failed: (off) <= (length) Slice offset greater than array length

To reproduce: unzip the attached archive (sanity_oscar_en.jsonl.zip) and run the code above on sanity_oscar_en.jsonl.

PS: reading the file with pandas and then converting to Arrow does work (note that the dataset lives in RAM in this case):

import pandas as pd
from datasets import Dataset

ds = Dataset.from_pandas(pd.read_json(path_to_file, lines=True))
ds.data.validate()

Labels: bug
