Investigate TPC-H q4 hanging when not enough memory is allocated #1523
Comments
The query blocked because we don't have enough blocking threads configured for the tokio runtime. In the merge phase, each spill file is wrapped by a stream backed by a blocking thread (see read_spill_as_stream), so we spawn at least 183 blocking threads when there are 183 spill files of spilled data to merge. The default number of blocking threads is 10, which makes the query hang indefinitely. Tuning …
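As a minimal sketch (not Comet's actual code), the tokio runtime builder lets you size the blocking-thread pool when the runtime is constructed. With the default described above (10 blocking threads), 183 spill streams that each park a blocking thread will exhaust the pool and stall the merge; the `worker_threads` value below is an arbitrary placeholder.

```rust
use tokio::runtime::Runtime;

// Hypothetical helper: build a multi-threaded runtime with a larger
// blocking-thread pool than the default discussed in this issue.
fn build_runtime(max_blocking_threads: usize) -> std::io::Result<Runtime> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(4)                          // CPU-bound async workers (placeholder value)
        .max_blocking_threads(max_blocking_threads) // pool used by spawn_blocking / spill streams
        .enable_all()
        .build()
}
```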
Thanks for debugging this @Kontinuation. Related to this, we currently create a new tokio runtime per plan. I do wonder if we should just have a global tokio runtime for the executor where we could allocate a higher number of threads that could be shared. Do you have an opinion on that?
I filed an issue in DataFusion: apache/datafusion#15323
I prefer reusing a global tokio runtime for running all comet physical plans within the same process. The current runtime-per-plan approach spawns a needlessly large number of threads. We can also have a larger default for max blocking threads, and those blocking threads can be better utilized by concurrently running queries. Having a global tokio runtime may prevent us from re-configuring the number of worker threads and blocking threads in an active Spark context using …
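A rough sketch of the "one global runtime per process" idea discussed above, using `std::sync::OnceLock`; the names and the blocking-thread count are hypothetical, not Comet's actual implementation. All plans would share this runtime instead of each creating their own, so worker and blocking threads are pooled across concurrent queries.

```rust
use std::sync::OnceLock;
use tokio::runtime::Runtime;

// Process-wide runtime, lazily initialized on first use.
static GLOBAL_RUNTIME: OnceLock<Runtime> = OnceLock::new();

fn global_runtime() -> &'static Runtime {
    GLOBAL_RUNTIME.get_or_init(|| {
        tokio::runtime::Builder::new_multi_thread()
            .max_blocking_threads(512) // larger shared default (illustrative value)
            .enable_all()
            .build()
            .expect("failed to build global tokio runtime")
    })
}
```

As the comment notes, the trade-off is that once the global runtime is initialized, its worker and blocking thread counts are fixed for the life of the process and cannot be re-tuned from an active Spark context.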
Here is an old PR that switched to using a global tokio runtime. I closed the PR because I could not find a good justification for it at the time. Perhaps we should try this again and see if it helps with this issue.
I filed #1590 for switching to a global tokio runtime.
Describe the bug
During benchmarking, I found that TPC-H q4 "hangs" indefinitely in the sort-merge join when not much memory is allocated. I would expect the operator to be slow and spill, but it seems to be in some kind of deadlock instead, with the stats never changing except for the "total time for joining".
Steps to reproduce
No response
Expected behavior
No response
Additional context
No response