-
Notifications
You must be signed in to change notification settings - Fork 1.5k
General framework to decorrelate the subqueries #5492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
According to the discussions in this issue, i think we can list the following items to support a subqueries decorrelation framework:
|
|
I think a crucial starting point is to transform all correlated exprs into |
In case others haven't heard, @irenjj is working on additional subquery support as part of a Google Summer of Code Project (where @jayzhan211 and I are helping mentor). Perhaps @suibianwanwank and @duongcongtoai would be interested in being involved too |
cool, actually in this PR: #16016 i'm trying to introduce a unified structure to do decorrelation for the following usecases:
The parts that are not implemented are left with unimplemented! errors. I'm trying to make the PR mergable, so everyone can continue to contribute in parallel, but my current goal is to implement enough to satisfy existing slt tests |
From what I can see, this ticket (a general framework to decorrelate subqueries) is going to be the core of any subsequent improvements for subqueries, so I suggest we pool our efforts together to focus on this first. I suggest, we all first review the PR from @duongcongtoai |
I recommend we also do some research on existing systems. Can someone provide links to existing implementations in other systems?
|
For DuckDB: basic logic is in |
Thanks to @irenjj remind, I found this based on your input: https://github.com/duckdb/duckdb/blob/main/src/planner/binder/query_node/plan_subquery.cpp. If I understand correctly, this should be DuckDB's logic for converting subQuery to DependentJoin. |
In this draft, dependent join detection is implemented inside this function. But i wonder if creating an explicit In duckdb PR he mentioned:
I can relates this to the idea in the 2nd paper
which corresponds to this part of the draft PR, and is still within the detail implementation of a specific optimizor |
There are one thing we surely know that should be implemented: detect which nodes in the LogicalPlan AST is a dependent join node. However, we don't need to create a new LogicalPlan type like => We can keep this concept internally to the decorelation framework |
In DuckDB, the final logical plan generated is |
Thank you everyone for your opinions. Looks like my implementation is trying to wrap everything inside a single optimizor, which is hard to follow and reduces space for collaborative work, and since we can use duckdb's implementation detail, the next steps can be the followings:
|
In this case, we may need at least 2 new optimizor passes 🤔 |
Thanks @duongcongtoai ! This is a very good idea! I also think we can start with simple unnest, we may need to introduce some DuckDB structures: new logical plan/expr( |
From what i understand, duckdb does not directly differentiate between simple unnest vs general unnest right? they achieve the simplified query plan thanks to the optimizor of |
Yep, It should be that general unnest includes the simple unnest case.
Sorry, I couldn't find the |
Yes Nice, then let's follow this approach of duckdb, to me it also makes the code cleaner |
I don't fully follow what this is proposing Are you proposing to add new |
Update: DelimScan is a physical optimization to retrieve unique column values across multiple tables efficiently (i.e some correlated subqueries require retrieving values from I may be able to work on some of this story by this week: detect all dependent join nodes (actually my previous PR will not go to waste and can reuse alot of stuff for this) After this implementation we will have a Logical Plan similar to this of duckdb
|
@duongcongtoai do you think we should have a feature branch for this? I think that as you work on detecting correlated queries(dependent join) you will end up creating a logical plan that actually can't be executed because we currently do not support Dependent Join (and seems like we dont plan to, as correlated can now be completed decorrelated (if I read the paper right.)). This will necessitate you to either put decorrelation logic in the same PR or implement Dummy Dependent Join which I think will leave some existing queries broken. You will either end up with a long PR or broken queries making it difficult to review/merge. Alternatively you can try putting this code behind feature gates, though I am not sure how that would work out or if that would be easier than having feature branch. Let me know If I am wrong, or if you have other plans. |
@logan-keede I expect to implement an optimizor that does 2 things (splitted into 2 PRs)
In the first PR i don't expect to integrate this optimizor to the main flow yet, the behavior is tested through in-code test instead of sqllogictests => This PR can be merged to main without breaking existing behavior In the second PR is where all integration happens, and we can integrate this optimizor into the mainbranch and test them with sqllogictest. But as you mention a lot of sqllogictests will be broken if we completely deprecate the followings
So i think we devide the 2nd phase into 2 more subphases: 2nd phase
Should any queries that cannot be decorrelated by the previous 2 optimizor, the new optimizor will come into play In this phase alot of work can be done in parallel to support different complex usecases, because more and more complex subqueries are supported ( 3rd phase |
totally forgot that was possible.😅 Also this is a very nice write up! Puts a lot of things into perspective. |
This sounds like a nice plan. Perhaps in parallel different people could work on migrating logic from the existing rules into the new optimizer rule and removing the old one |
Hi @duongcongtoai , I created a tracking issue and broke down the task of building dependent joins for better collaboration: #16173. There might be some points I didn't consider, and if you have already implemented them, feel free to let me know so I can update the task list. I also implemented a draft simple rule to handle correlated scalar subqueries in WHERE clauses: #16174. |
PLease let me know when you have PRs ready for review or other items that I can try and help with |
This PR is ready for review, |
Hi everyone, i have 2 pending PRs which support this story, please help me take a look
After merging them we can start implement the steps for decorrelation |
There is also one thing i want to highlight, is that in DuckDB, a SubqueryExpr may result into 2 output expr after decorrelation, this is because they want to support this query |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In the current DataFusion, it has very limited support for correlated subqueries. It can only decorrelate the (NOT) IN/Exists predicate subqueries to Semi/Anti Joins. Even in the simplest IN/Exists cases, if the correlated expressions are not in the Filter/Join conditions, the current decorrelate rules will not support them.
In the paper "Unnesting Arbitrary Queries" by T. Neumann; A. Kemper
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). It raise a mechanism to unnest arbitrary queries. This was already implemented by the Hyper DB:
For example:
select * from orders
where 1 in (select 1 from part left join (select l_partkey from lineitem where o_orderkey = 2) lineitem on p_partkey = lineitem.l_partkey)
https://hyper-db.de/interface.html#
Both SparkSQL and PostgreSQL do not support decorrelate such kind of queries.
Describe the solution you'd like
Describe alternatives you've considered
Additional context
The text was updated successfully, but these errors were encountered: