[Python] ParquetFileFragment.to_batches() hangs forever #45214

Open
lhoestq opened this issue Jan 9, 2025 · 0 comments

lhoestq commented Jan 9, 2025

In the datasets library we use ParquetFileFragment.to_batches() to stream batches of data while applying filters file by file. We create the fragments from file-like objects because files can be local or remote.

However, @AlexKoff88 reported in huggingface/datasets#7357 that for some datasets, such as phiyodr/InpaintCOCO, this causes the code to hang.

I managed to make a reproducible example:

wget https://huggingface.co/datasets/phiyodr/InpaintCOCO/resolve/c56e31947190173d2d6373c4833b0a9889ff6eee/data/test-00000-of-00003.parquet

file info:

  • size: 300MB
  • 5 row groups of <=100 rows each
  • see all the parquet metadata here
  • contains nested and binary types for images (not sure if relevant)

import pyarrow.dataset as ds

file = "test-00000-of-00003.parquet"
with open(file, "rb") as f:
    parquet_fragment = ds.ParquetFileFormat().make_fragment(f)
    for record_batch in parquet_fragment.to_batches():
        print(len(record_batch))  # 100
        break  # hangs forever

Environment:

  • python 3.12.2
  • pyarrow 18.1.0
  • macbook pro m2

The code hangs on most runs, but occasionally it terminates normally.

The issue appears when running the code as a Python script, but doesn't appear in Google Colab or in IPython.
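Since the hang is intermittent and only shows up when the code runs as a script, it may help to measure how often it happens. Below is a minimal sketch, using only the standard library, that runs a script in a fresh interpreter with a timeout and counts the runs that did not finish. The names run_with_timeout and hang_rate, the script path and the 60-second timeout are assumptions for illustration, not part of the original report.

```python
import subprocess
import sys


def run_with_timeout(script_path: str, timeout_s: float = 60.0) -> str:
    """Run a script in a fresh interpreter and classify the outcome.

    Returns "ok" (exit code 0), "error" (non-zero exit code) or
    "hang" (the process was still running after timeout_s seconds).
    """
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "ok" if result.returncode == 0 else "error"


def hang_rate(script_path: str, runs: int = 10, timeout_s: float = 60.0) -> float:
    """Fraction of runs that were classified as a hang."""
    outcomes = [run_with_timeout(script_path, timeout_s) for _ in range(runs)]
    return outcomes.count("hang") / runs
```

For example, with the reproduction saved as repro.py (a hypothetical file name), hang_rate("repro.py", runs=20) would give a rough estimate of how often the script fails to exit.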

The issue also appears with eltorio/ROCOv2-radiology, which also contains binary types. It doesn't seem to appear with datasets like AI-MO/NuminaMath-CoT, which don't contain binary types.

In the original datasets issue, this message was also reported:

Fatal Python error: PyGILState_Release: thread state 0x7fa1f409ade0 must be current when releasing
Python runtime state: finalizing (tstate=0x0000000000ad2958)

Thread 0x00007fa33d157740 (most recent call first):
  <no Python frame>

Component(s)

Parquet, Python

@lhoestq lhoestq changed the title [Python] ParquetFragment.to_batches() hangs forever [Python] ParquetFileFragment.to_batches() hangs forever Jan 9, 2025