[SPARK-55848][SQL][3.5] Fix incorrect dedup results with SPJ partial clustering#54852
Open
naveenp2708 wants to merge 1 commit into apache:branch-3.5
Author
@peter-toth Backport to branch-3.5 as requested. Same fix as #54751 (4.1) and #54851 (4.0).
peter-toth
reviewed
Mar 17, 2026
required match {
  case c @ ClusteredDistribution(requiredClustering, requireAllClusterKeys, _) =>
    if (requireAllClusterKeys) {
      // Checks whether this partitioning is partitioned on exactly same clustering keys of
peter-toth
approved these changes
Mar 17, 2026
Contributor
peter-toth
left a comment
LGTM, just minor request.
…clustering

When SPJ partial clustering splits a partition across multiple tasks, post-join dedup operators (dropDuplicates, Window row_number) produce incorrect results because KeyGroupedPartitioning.satisfies0() incorrectly reports satisfaction of ClusteredDistribution.

This fix adds an isPartiallyClustered flag to KeyGroupedPartitioning and restructures satisfies0() to check ClusteredDistribution first, returning false when partially clustered. EnsureRequirements then inserts the necessary Exchange. Plain SPJ joins without dedup are unaffected.

Closes apache#54378
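The check described in the commit message can be sketched roughly as follows. This is a simplified model, not the Spark source: the types below are hypothetical stand-ins that share names with Spark's classes but represent clustering keys as plain strings, and the method is named `satisfies` rather than `satisfies0`. Only the ordering of the check and the effect of the `isPartiallyClustered` flag are taken from the description above.

```scala
// Hypothetical, simplified stand-ins for Spark's distribution types.
sealed trait Distribution
case class ClusteredDistribution(clustering: Seq[String]) extends Distribution
case object UnspecifiedDistribution extends Distribution

// Hypothetical model of a key-grouped partitioning with the new flag.
case class KeyGroupedPartitioning(
    keys: Seq[String],
    isPartiallyClustered: Boolean = false) {

  def satisfies(required: Distribution): Boolean = required match {
    // ClusteredDistribution is handled first. A partially clustered layout may
    // spread rows with the same key over several tasks, so it is rejected here.
    case ClusteredDistribution(clustering) =>
      !isPartiallyClustered && keys.nonEmpty && keys.forall(clustering.contains)
    // Assumption for this sketch: no requirement means always satisfied.
    case UnspecifiedDistribution => true
  }
}
```

In this model, a `false` result is the signal that a planner rule such as `EnsureRequirements` would act on by adding an Exchange, which is why a plain join that never asks for a `ClusteredDistribution` is unaffected.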
766fcbb to 1b3f1ac
Author
@peter-toth Restoring the deleted comments.
szehon-ho
approved these changes
Mar 18, 2026
What changes were proposed in this pull request?
Backport fix for SPARK-55848 to branch-3.5. Same fix as merged in branch-4.1 via #54751 and branch-4.0 via #54851.
The fix adds an `isPartiallyClustered` flag to `KeyGroupedPartitioning` and restructures `satisfies0()` to check `ClusteredDistribution` first, returning `false` when partially clustered. `EnsureRequirements` then inserts the necessary Exchange.
Why are the changes needed?
SPJ with partial clustering produces incorrect results for post-join dedup operations (dropDuplicates, Window row_number).
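To make the failure mode concrete, here is a small plain-Scala simulation that does not use Spark at all. The object name, the helper `dedupPerTask`, and the sample rows are hypothetical and chosen only for illustration. It models a dedup operator that runs independently inside each task, which is what happens when no Exchange regroups the rows first.

```scala
object PartialClusteringDedup {
  // Deduplicate by key independently within each task, keeping the first row
  // seen for every key. Nothing is compared across tasks.
  def dedupPerTask(tasks: Seq[Seq[(Int, String)]]): Seq[(Int, String)] =
    tasks.flatMap(rows => rows.groupBy(_._1).values.map(_.head))

  def main(args: Array[String]): Unit = {
    // All rows for key 1 are handled by a single task.
    val clustered = Seq(Seq((1, "a"), (1, "b")))
    // The same rows split across two tasks, as partial clustering may do.
    val partiallyClustered = Seq(Seq((1, "a")), Seq((1, "b")))

    println(dedupPerTask(clustered).size)          // one row survives for key 1
    println(dedupPerTask(partiallyClustered).size) // two rows survive for key 1
  }
}
```

When the rows for a key are split, each task sees only its own slice, so a duplicate survives. Regrouping the rows by key before the dedup step removes the problem, which is the role the inserted Exchange plays in the fix.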
Does this PR introduce any user-facing change?
Yes. Queries using SPJ with partial clustering followed by dedup operations will now return correct results.
How was this patch tested?
Three regression tests added. All 53 tests pass locally.
Was this patch authored or co-authored using generative AI tooling?
No.