[SPARK-55848][SQL][3.5] Fix incorrect dedup results with SPJ partial clustering by naveenp2708 · Pull Request #54852 · apache/spark

naveenp2708 · 2026-03-17T06:16:40Z

What changes were proposed in this pull request?

Backport of the SPARK-55848 fix to branch-3.5. This is the same fix that was merged into branch-4.1 via #54751 and into branch-4.0 via #54851.

The fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then inserts the necessary Exchange.
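
To illustrate the shape of that check, here is a minimal sketch in plain Scala. It is a simplified model, not the patch itself: the `Distribution` types, the `satisfies` method, and the `needsExchange` helper are hypothetical stand-ins that only borrow names from Spark, and the real `satisfies0()` handles more cases and works on expressions rather than strings.

```scala
object SatisfiesSketch {
  // Simplified stand-ins; these are NOT Spark's real classes or signatures.
  sealed trait Distribution
  case object UnspecifiedDistribution extends Distribution
  final case class ClusteredDistribution(clustering: Seq[String]) extends Distribution

  final case class KeyGroupedPartitioning(
      keys: Seq[String],
      isPartiallyClustered: Boolean = false) {

    def satisfies(required: Distribution): Boolean = required match {
      // ClusteredDistribution is checked first: a partially clustered layout may
      // place rows with the same key in several tasks, so it never satisfies it.
      case _: ClusteredDistribution if isPartiallyClustered => false
      // Otherwise the partition keys must all be among the required clustering keys.
      case ClusteredDistribution(clustering) =>
        keys.nonEmpty && keys.forall(clustering.contains)
      case UnspecifiedDistribution => true
    }
  }

  // Hypothetical planner step standing in for EnsureRequirements:
  // an Exchange (shuffle) is needed whenever the requirement is not satisfied.
  def needsExchange(p: KeyGroupedPartitioning, required: Distribution): Boolean =
    !p.satisfies(required)

  def main(args: Array[String]): Unit = {
    val required = ClusteredDistribution(Seq("id"))
    println(needsExchange(KeyGroupedPartitioning(Seq("id")), required))
    println(needsExchange(KeyGroupedPartitioning(Seq("id"), isPartiallyClustered = true), required))
  }
}
```

In this model, only the partially clustered partitioning forces a shuffle for a `ClusteredDistribution` requirement; a requirement with no clustering constraint is still satisfied either way.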

Why are the changes needed?

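Storage-partitioned join (SPJ) with partial clustering can split a single partition across multiple tasks. Post-join dedup operations (dropDuplicates, Window row_number) that assume all rows for a key sit in one task then produce incorrect results.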
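
The failure mode can be reproduced without Spark. The sketch below is a toy model in plain Scala collections: each inner `Seq` stands for the rows one task sees, and `dedupWithinTask` is a hypothetical stand-in for a dedup operator that runs per task without a shuffle. The data and names are invented for illustration.

```scala
object PartialClusteringDemo {
  // (key, value) rows after a join; key 1 is the skewed key.
  val rows: Seq[(Int, String)] = Seq((1, "a"), (1, "b"), (1, "c"), (2, "d"))

  // Dedup by key within a single task, keeping one row per key.
  def dedupWithinTask(task: Seq[(Int, String)]): Seq[(Int, String)] =
    task.groupBy(_._1).values.map(_.head).toSeq

  // Fully clustered: every row of a key lands in the same task.
  val clustered: Seq[Seq[(Int, String)]] =
    Seq(rows.filter(_._1 == 1), rows.filter(_._1 == 2))

  // Partially clustered: the rows of key 1 are split across two tasks.
  val partiallyClustered: Seq[Seq[(Int, String)]] =
    Seq(Seq((1, "a"), (1, "b")), Seq((1, "c")), Seq((2, "d")))

  // Each task dedups independently; results are concatenated with no shuffle.
  def dedup(tasks: Seq[Seq[(Int, String)]]): Seq[(Int, String)] =
    tasks.flatMap(dedupWithinTask)

  def main(args: Array[String]): Unit = {
    println(dedup(clustered).map(_._1).sorted)          // List(1, 2)
    println(dedup(partiallyClustered).map(_._1).sorted) // List(1, 1, 2)
  }
}
```

With the split layout, key 1 survives dedup twice, which is the kind of duplicate a shuffle on the dedup keys would prevent.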

Does this PR introduce any user-facing change?

Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.

How was this patch tested?

Three regression tests added. All 53 tests pass locally.

Was this patch authored or co-authored using generative AI tooling?

No.

naveenp2708 · 2026-03-17T06:17:01Z

@peter-toth Backport to branch-3.5 as requested. Same fix as #54751 (4.1) and #54851 (4.0).

peter-toth · 2026-03-17T10:30:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala

-      required match {
-        case c @ ClusteredDistribution(requiredClustering, requireAllClusterKeys, _) =>
-          if (requireAllClusterKeys) {
-            // Checks whether this partitioning is partitioned on exactly same clustering keys of


Same as #54851 (comment).

peter-toth

LGTM, just minor request.

…clustering

When SPJ partial clustering splits a partition across multiple tasks, post-join dedup operators (dropDuplicates, Window row_number) produce incorrect results because KeyGroupedPartitioning.satisfies0() incorrectly reports satisfaction of ClusteredDistribution.

This fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then inserts the necessary Exchange.

Plain SPJ joins without dedup are unaffected.

Closes apache#54378

naveenp2708 · 2026-03-17T23:07:48Z

@peter-toth Restoring the deleted comments

peter-toth reviewed Mar 17, 2026

peter-toth approved these changes Mar 17, 2026

naveenp2708 force-pushed the spark-55848-fix-branch-3.5 branch from 766fcbb to 1b3f1ac Compare March 17, 2026 23:05

szehon-ho approved these changes Mar 18, 2026

naveenp2708 wants to merge 1 commit into apache:branch-3.5 from naveenp2708:spark-55848-fix-branch-3.5
