
Potential deadlocks in TPC-H benchmark suite #1658

Open
hendrikmakait opened this issue Feb 3, 2025 · 3 comments
Labels: stability (work related to stability)

Comments

@hendrikmakait (Member)

Looking at CI failures over the last two months, I noticed several instances of tests timing out. The timeouts appear to be caused by deadlocks: Grafana always shows a single task in processing on the scheduler, but no corresponding task on the workers. These are the affected instances:

Based on Grafana activity, I suspect that this might be related to P2P but that's just a guess at this point.
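For reference, the symptom above can be cross-checked from a client by comparing what the scheduler reports as processing with the call stacks of tasks actually executing on the workers. A minimal sketch, assuming a reachable scheduler (the address below is hypothetical):

    from distributed import Client

    client = Client("tcp://scheduler-address:8786")  # hypothetical scheduler address

    # Keys the scheduler currently considers "processing", grouped by worker.
    print(client.processing())

    # Call stacks of tasks actually executing on worker threads right now.
    # Empty output here while processing() is non-empty matches the symptom above:
    # the scheduler thinks a task is running, but no worker is executing it.
    print(client.call_stack())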

@hendrikmakait added the stability label on Feb 3, 2025
@hendrikmakait (Member Author)

The Grafana dashboard misled me since it filters out executing tasks. When this fails, there is indeed a task executing on the worker. On my reproducing cluster, it's this one:

Key: ('readparquetpyarrowfs-fused-operation-a1013f0a679c620b3944185a7bcb851c', 109)
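A call stack for a specific key such as this one can also be pulled through the client; a minimal sketch, again with a hypothetical scheduler address:

    from distributed import Client

    client = Client("tcp://scheduler-address:8786")  # hypothetical scheduler address
    key = ("readparquetpyarrowfs-fused-operation-a1013f0a679c620b3944185a7bcb851c", 109)
    print(client.call_stack(keys=[key]))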

Callstack:

    Key: ('readparquetpyarrowfs-fused-operation-a1013f0a679c620b3944185a7bcb851c', 109)
    File "/opt/coiled/env/lib/python3.12/threading.py", line 1032, in _bootstrap
      self._bootstrap_inner()
    File "/opt/coiled/env/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
      self.run()
    File "/opt/coiled/env/lib/python3.12/threading.py", line 1012, in run
      self._target(*self._args, **self._kwargs)
    File "/opt/coiled/env/lib/python3.12/site-packages/distributed/threadpoolexecutor.py", line 58, in _worker
      task.run()
    File "/opt/coiled/env/lib/python3.12/site-packages/distributed/_concurrent_futures_thread.py", line 65, in run
      result = self.fn(*self.args, **self.kwargs)
    File "/opt/coiled/env/lib/python3.12/site-packages/distributed/utils.py", line 1508, in <lambda>
      executor, lambda: context.run(func, *args, **kwargs)
    File "/opt/coiled/env/lib/python3.12/site-packages/distributed/worker.py", line 2946, in _run_task
      msg = _run_task_simple(task, data, time_delay)
    File "/opt/coiled/env/lib/python3.12/site-packages/distributed/worker.py", line 2982, in _run_task_simple
      result = task(data)
    File "/opt/coiled/env/lib/python3.12/site-packages/dask/_task_spec.py", line 651, in __call__
      return self.func(*new_argspec)
    File "/opt/coiled/env/lib/python3.12/site-packages/dask_expr/_expr.py", line 3799, in _execute_internal_graph
      res = execute_graph(internal_tasks, cache=cache, keys=[outkey])
    File "/opt/coiled/env/lib/python3.12/site-packages/dask/_task_spec.py", line 786, in execute_graph
      cache[key] = node(cache)
    File "/opt/coiled/env/lib/python3.12/site-packages/dask/_task_spec.py", line 650, in __call__
      return self.func(*new_argspec, **kwargs)
    File "/opt/coiled/env/lib/python3.12/site-packages/dask_expr/io/io.py", line 185, in _load_multiple_files
      table = pa.concat_tables(tables, promote_options="permissive")
    File "/opt/coiled/env/lib/python3.12/site-packages/dask_expr/io/io.py", line 177, in <genexpr>
      ReadParquetPyarrowFS._fragment_to_table(
    File "/opt/coiled/env/lib/python3.12/site-packages/dask_expr/io/parquet.py", line 1195, in _fragment_to_table
      return fragment.to_table(

@hendrikmakait (Member Author)

Possibly related: apache/arrow#40019
