
feat: Replace current hash join implementation with Grace hash join [experimental]#3632

Draft
andygrove wants to merge 9 commits into apache:main from andygrove:ghj

Conversation

@andygrove (Member) commented Mar 5, 2026

Which issue does this PR close?

Research towards #2545

Follows on from earlier research in #3564

Rationale for this change

Comet's current experimental hash join operator is not suitable for use in production because it has no spilling support and will OOM if the build side is too large. For example, it was not possible to run the TPC-DS benchmarks with hash joins enabled prior to this PR.

This PR replaces it with an experimental Grace hash join operator which does have spilling.

Benchmark Results

| Benchmark | SMJ  | Hash Join | Grace Hash Join |
|-----------|------|-----------|-----------------|
| TPC-H     | 278s | 208s      | 250s            |
| TPC-DS    | 704s | OOM       | 664s            |

What changes are included in this PR?

  • Replace SortMergeJoinExec with ShuffledHashJoinExec via RewriteJoin rule (removes input sorts), executed natively as GraceHashJoinExec
  • Hash-partition both build and probe sides into N buckets (default 16) using a prefix-sum algorithm for cache-friendly O(rows) partitioning
  • Fast path: when build side fits in memory and no spilling occurred, skip partitioning overhead — single HashJoinExec with streaming probe
  • Slow path: merge adjacent partitions into ~32 MB groups, join sequentially with a per-partition HashJoinExec
  • Spill to disk via Arrow IPC with 1 MB buffered writes; streaming reads via SpillReaderExec with batch coalescing (~8192 rows)
  • Recursive repartitioning (up to depth 3 / 4096 effective partitions) when individual partitions exceed hash table memory
  • Cooperative memory management: single spillable MemoryReservation during partitioning; aggressive "spill all" strategy on probe-side memory pressure to avoid thrashing between concurrent instances
  • Supports all join types: Inner, Left, Right, Full, LeftSemi, LeftAnti, LeftMark, RightSemi, RightAnti, RightMark
  • Configurable SMJ replacement guard (maxBuildSize) to keep sort-merge join when both sides are large
  • Fast path threshold divided by executor cores to bound per-task memory
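The prefix-sum partitioning step described above can be sketched roughly as follows. This is a minimal illustration using plain `u64` hash values and `Vec`s; the actual operator works on Arrow record batches, and the function name `partition_by_hash` is invented here, not Comet's API:

```rust
/// Partition rows into `num_partitions` buckets in O(rows) using a
/// counting/prefix-sum pass, so the scatter step writes sequentially
/// within each bucket (cache-friendly). Sketch only: real code would
/// hash column values and move Arrow batches, not raw u64s.
fn partition_by_hash(rows: &[u64], num_partitions: usize) -> Vec<Vec<u64>> {
    // 1. Compute the target bucket for each row. A real implementation
    //    would use a proper hash function; identity modulo N suffices here.
    let buckets: Vec<usize> = rows
        .iter()
        .map(|&r| (r as usize) % num_partitions)
        .collect();

    // 2. Count rows per bucket.
    let mut counts = vec![0usize; num_partitions];
    for &b in &buckets {
        counts[b] += 1;
    }

    // 3. Exclusive prefix sum of the counts gives each bucket's start
    //    offset in a single flat output buffer.
    let mut offsets = vec![0usize; num_partitions];
    for i in 1..num_partitions {
        offsets[i] = offsets[i - 1] + counts[i - 1];
    }

    // 4. Scatter rows into the flat buffer in one pass.
    let mut out = vec![0u64; rows.len()];
    let mut cursors = offsets.clone();
    for (i, &r) in rows.iter().enumerate() {
        let b = buckets[i];
        out[cursors[b]] = r;
        cursors[b] += 1;
    }

    // Slice the flat buffer back into per-partition Vecs.
    (0..num_partitions)
        .map(|p| out[offsets[p]..offsets[p] + counts[p]].to_vec())
        .collect()
}
```

The two-pass count-then-scatter shape is what makes the partitioning O(rows) with no per-row allocation, which matters when every build and probe batch must be split before spilling.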

How are these changes tested?

GHJ supports all join types with either build side, so remove the canBuildShuffledHashJoinLeft/Right checks and the LeftAnti/LeftSemi BuildRight guard (apache#457, apache#2667).

GHJ supports all join types with either build side, but the intermediate ShuffledHashJoinExec node is validated by Spark before CometExecRule replaces it. Use the optimal build side when allowed, otherwise fall back to whichever side Spark permits.
andygrove and others added 4 commits March 5, 2026 12:02
- HashJoin test now matches CometGraceHashJoinExec and checks GHJ
  metrics (no build_mem_used, adds spill_count)
- BroadcastHashJoin test removes build_mem_used assertion since
  the native side does not report this metric
- Remove dead CometHashJoinExec case class (createExec already
  produces CometGraceHashJoinExec)
Skip build-side hash partitioning when the fast path threshold is set.
Instead of always computing hashes and splitting every build batch into
N partitions (only to collect them back together for the fast path),
buffer the build side directly. When the build fits in memory and is
under the threshold, feed it straight to HashJoinExec with zero
partitioning overhead.

Falls back to the partitioned slow path on memory pressure or when the
build exceeds the threshold.

Also fix CometConf fastPathThreshold type from intConf to longConf to
support values > 2 GB without integer overflow, and remove a duplicate
config line in the benchmark TOML.

~4% improvement on both TPC-H and TPC-DS benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
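The fast-path decision in the commit above can be sketched as follows. This is a hedged illustration: the threshold constant, the `BuildStrategy` enum, and `choose_strategy` are all invented names, and the 1 GB default is an assumption, not a value from this PR (the PR only states that the threshold is divided by executor cores to bound per-task memory):

```rust
/// Illustrative fast-path threshold (assumed value; Comet's actual
/// default comes from CometConf's fastPathThreshold, a longConf so it
/// can exceed 2 GB without integer overflow).
const FAST_PATH_THRESHOLD: usize = 1 << 30; // 1 GB

enum BuildStrategy {
    /// Build side fits in memory: feed it straight to HashJoinExec
    /// with zero partitioning overhead.
    FastPath,
    /// Build side too large: hash-partition both sides and join
    /// partition groups sequentially.
    PartitionedSlowPath,
}

/// Buffer build batches directly (no hashing, no splitting). If the
/// accumulated size stays under the per-task budget, take the fast
/// path; otherwise fall back to partitioned execution.
fn choose_strategy(build_batch_sizes: &[usize], executor_cores: usize) -> BuildStrategy {
    // Divide the threshold by executor cores to bound per-task memory,
    // since each core may run a concurrent join task.
    let per_task_budget = FAST_PATH_THRESHOLD / executor_cores.max(1);
    let mut buffered = 0usize;
    for &sz in build_batch_sizes {
        buffered += sz;
        if buffered > per_task_budget {
            return BuildStrategy::PartitionedSlowPath;
        }
    }
    BuildStrategy::FastPath
}
```

In the real operator the fallback is also triggered by memory pressure (a failed reservation), not only by the size threshold.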