Push down topk through join #21621

Open
SubhamSinghal wants to merge 4 commits into apache:main from SubhamSinghal:push-down-topk-through-join
Conversation

@SubhamSinghal
Contributor

Which issue does this PR close?

#11900

Rationale for this change

When a query has ORDER BY <cols> LIMIT N on top of an outer join and all sort columns come from the preserved side,
DataFusion currently runs the full join first, then sorts and limits. Because an outer join emits every preserved-side row at least once, the top N output rows can only come from the top N preserved-side rows, so we can push a copy of the Sort(fetch=N) to the preserved input, reducing the number of rows entering the join.

Before:

  Sort: t1.b ASC, fetch=3
    Left Join: t1.a = t2.a
      Scan: t1     ← scans ALL rows
      Scan: t2

After:

  Sort: t1.b ASC, fetch=3
    Left Join: t1.a = t2.a
      Sort: t1.b ASC, fetch=3  ← pushed down
        Scan: t1               ← only top-3 rows enter join
      Scan: t2
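
The equivalence of the two plans can be sketched with a toy model. This is only an illustration, not DataFusion code: tables are plain slices, `t1` rows are `(a, b)` pairs, `t2` holds only join keys, and every function name here is hypothetical.

```rust
// Toy model: rows of t1 are (a, b); rows of t2 are just the join key a.
// None of these names are DataFusion APIs; they only illustrate the rewrite.

/// LEFT JOIN t1 with t2 on t1.a = t2.a, emitting t1's `b` once per output row.
/// An unmatched t1 row still yields exactly one output row.
fn left_join_b(t1: &[(i32, i32)], t2: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    for &(a, b) in t1 {
        let matches = t2.iter().filter(|&&k| k == a).count();
        // Outer join: keep the row even when there is no match.
        for _ in 0..matches.max(1) {
            out.push(b);
        }
    }
    out
}

/// ORDER BY b ASC LIMIT n over a list of sort keys.
fn top_k(mut keys: Vec<i32>, n: usize) -> Vec<i32> {
    keys.sort();
    keys.truncate(n);
    keys
}

/// Plan "Before": join everything, then sort and limit.
fn plan_before(t1: &[(i32, i32)], t2: &[i32], n: usize) -> Vec<i32> {
    top_k(left_join_b(t1, t2), n)
}

/// Plan "After": keep only the top-n t1 rows by b, join, then sort and limit again.
fn plan_after(t1: &[(i32, i32)], t2: &[i32], n: usize) -> Vec<i32> {
    let mut pruned = t1.to_vec();
    pruned.sort_by_key(|&(_, b)| b);
    pruned.truncate(n);
    // The top-level sort stays: a pruned row may match several t2 rows,
    // so the join can still emit more than n rows.
    top_k(left_join_b(&pruned, t2), n)
}
```

In this model both plans return the same sort keys, while `plan_after` feeds at most `n` rows of `t1` into the join.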

What changes are included in this PR?

A new logical optimizer rule PushDownTopKThroughJoin that:

  1. Matches Sort with fetch = Some(N) (TopK)
  2. Looks through an optional Projection to find a Join
  3. Checks that the join type is LEFT or RIGHT and that the join has no non-equijoin filter
  4. Verifies all sort expression columns come from the preserved side
  5. Inserts a copy of the Sort(fetch=N) on the preserved child
  6. Keeps the top-level sort for correctness
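
The eligibility checks in steps 1, 3 and 4 can be sketched roughly as follows. The types below (`JoinKind`, `Side`, `SortOverJoin`) are simplified stand-ins invented for this sketch; the actual rule operates on DataFusion's `LogicalPlan`, and its code may be structured differently.

```rust
// Hypothetical, simplified types; DataFusion's real plan and join types differ.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum Side {
    Left,
    Right,
}

/// A Sort sitting on top of a Join (after looking through any Projection, step 2).
struct SortOverJoin<'a> {
    fetch: Option<usize>,          // Some(n) means the Sort is a TopK
    join_kind: JoinKind,
    has_non_equi_filter: bool,     // join has a filter beyond the equijoin keys
    sort_column_sides: &'a [Side], // join input each sort column comes from
}

/// Returns the join input that should receive a copy of Sort(fetch=n), if any.
fn pushdown_target(plan: &SortOverJoin) -> Option<(Side, usize)> {
    // Step 1: only a Sort with a fetch qualifies.
    let n = plan.fetch?;
    // Step 3: LEFT or RIGHT join only, and no non-equijoin filter.
    if plan.has_non_equi_filter {
        return None;
    }
    let preserved = match plan.join_kind {
        JoinKind::Left => Side::Left,
        JoinKind::Right => Side::Right,
        _ => return None,
    };
    // Step 4: every sort column must come from the preserved side.
    if plan.sort_column_sides.iter().all(|&s| s == preserved) {
        // Step 5 happens at the caller, which also keeps the top-level Sort (step 6).
        Some((preserved, n))
    } else {
        None
    }
}
```

Returning the preserved side together with `n` keeps the decision separate from the plan rewrite, which makes each condition easy to test in isolation.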

Are these changes tested?

Yes, through unit tests.

Are there any user-facing changes?

No API changes.

github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels on Apr 14, 2026
} else {
    &join.right
};
if matches!(preserved_child.as_ref(), LogicalPlan::Sort(_)) {
Member

This condition looks a bit broad.
If the child has no fetch limit, or a larger fetch limit than the current one, then pushing down the current Sort with its fetch limit would still be beneficial, no?
The optimization should be skipped only if the Sort expr is different or its fetch limit is non-zero but smaller than the current one.
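
One possible reading of the narrower condition suggested here, as a sketch only: `ChildSort` and `should_push` are hypothetical names, sort expressions are modelled as strings, and treating an equal fetch as "skip" is an assumption of this sketch rather than something stated in the comment.

```rust
/// Hypothetical summary of a Sort already present on the preserved child.
/// `exprs` stands in for the sort expressions; real code would compare
/// the plan's sort expressions instead of strings.
struct ChildSort<'a> {
    exprs: &'a [&'a str],
    fetch: Option<usize>,
}

/// Decide whether to push Sort(exprs, fetch = n) onto the preserved child.
fn should_push(child: Option<&ChildSort>, exprs: &[&str], n: usize) -> bool {
    match child {
        // The child is not a Sort: pushing adds a new limit.
        None => true,
        // The child sorts on different expressions: skip.
        Some(c) if c.exprs != exprs => false,
        Some(c) => match c.fetch {
            // Same ordering but no limit: pushing adds one.
            None => true,
            // Same ordering: push only if it tightens the existing limit.
            // (Skipping on an equal fetch is an assumption of this sketch.)
            Some(f) => f > n,
        },
    }
}
```

Compared with skipping whenever the child is any Sort, this keeps the optimization for an unlimited child Sort and for one with a looser limit.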

DROP TABLE t1;

statement ok
DROP TABLE t2;
\ No newline at end of file
Member

Suggested change
DROP TABLE t2;
DROP TABLE t2;

nit: end the file with a newline to make it Unix-friendly.

02)--Left Join: t1.a = t2.x
03)----TableScan: t1 projection=[a, b]
04)----TableScan: t2 projection=[x]

Member

Please add tests for:

  1. LEFT/RIGHT SEMI/ANTI JOINs
  2. LEFT and/or RIGHT JOIN with ORDER BY t1.b, t2.y LIMIT 3 (multiple sort columns from both sides)
  3. LIMIT N OFFSET M
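
To illustrate why the first case deserves a test, here is a toy sketch (not DataFusion code; all names are hypothetical, `t1` rows are `(a, b)` pairs and `t2` holds only join keys). A semi join can drop rows from its left input, so limiting that input before the join can change the result.

```rust
/// LEFT SEMI JOIN: keep a t1 row (a, b) only if some t2 key equals a; emit b.
fn left_semi_join_b(t1: &[(i32, i32)], t2: &[i32]) -> Vec<i32> {
    t1.iter()
        .filter(|&&(a, _)| t2.contains(&a))
        .map(|&(_, b)| b)
        .collect()
}

/// ORDER BY b ASC LIMIT n over a list of sort keys.
fn top_k(mut keys: Vec<i32>, n: usize) -> Vec<i32> {
    keys.sort();
    keys.truncate(n);
    keys
}

/// Correct plan: semi join first, then sort and limit.
fn semi_then_top_k(t1: &[(i32, i32)], t2: &[i32], n: usize) -> Vec<i32> {
    top_k(left_semi_join_b(t1, t2), n)
}

/// Incorrect plan: limit t1 to its top-n rows first, then semi join.
fn top_k_then_semi(t1: &[(i32, i32)], t2: &[i32], n: usize) -> Vec<i32> {
    let mut pruned = t1.to_vec();
    pruned.sort_by_key(|&(_, b)| b);
    pruned.truncate(n);
    top_k(left_semi_join_b(&pruned, t2), n)
}
```

With `t1 = [(1, 10), (2, 20), (3, 30), (4, 40)]`, `t2 = [3, 4]` and `n = 2`, the correct plan yields `[30, 40]`, while the pruned plan yields nothing because the two smallest `b` values belong to rows that have no match.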
