Skip to content

[SPARK-55147][SS] Scope timestamp range for time-interval join retrieval in V4 state format#54879

Open
nicholaschew11 wants to merge 1 commit intoapache:masterfrom
nicholaschew11:SPARK-55147-range-scan-v4
Open

[SPARK-55147][SS] Scope timestamp range for time-interval join retrieval in V4 state format#54879
nicholaschew11 wants to merge 1 commit intoapache:masterfrom
nicholaschew11:SPARK-55147-range-scan-v4

Conversation

@nicholaschew11
Copy link
Contributor

@nicholaschew11 nicholaschew11 commented Mar 18, 2026

What changes were proposed in this pull request?

This PR improves the retrieval operation in the V4 stream-stream join state manager to scope the timestamp range for time-interval joins. Instead of scanning all timestamps for a given key during prefix scan, V4 now extracts constant interval offsets from the join condition and computes a (minTs, maxTs) range per input row, enabling the prefix scan to skip entries before minTs and terminate early past maxTs.

  • Add scanRangeOffsets and computeTimestampRange to OneSideHashJoiner, using StreamingJoinHelper.getStateValueWatermark(eventWatermark=0) to extract interval bounds from the join condition
  • Add timestampRange parameter to getJoinedRows in the state manager trait, V4 implementation, and V1-V3 base class (ignored by V1-V3)
  • Add getValuesInRange to KeyWithTsToValuesStore that filters by range and stops early past the upper bound
  • getValues now delegates to getValuesInRange(Long.MinValue, Long.MaxValue)

Why are the changes needed?

For time-interval joins, the V4 state format stores values indexed by (key, timestamp). Without range scoping, retrieving matches requires scanning all timestamps for a key via prefix scan, even though the join condition constrains matching to a specific time window. With this change, the scan is bounded to only the relevant timestamp range, reducing I/O proportionally to the ratio of the interval width to the total timestamp span in state.

Does this PR introduce any user-facing change?

No. V4 state format is experimental and gated behind spark.sql.streaming.join.stateFormatV4.enabled.

How was this patch tested?

New unit tests in SymmetricHashJoinStateManagerEventTimeInValueSuite:

  • getJoinedRows with timestampRange: boundary conditions, exact matches, empty ranges, full range
  • timestampRange with multiple values per timestamp: multiple values at the same timestamp

Existing V4 join suites (Inner, Outer, FullOuter, LeftSemi) all pass.

Was this patch authored or co-authored using generative AI tooling?

Yes. (Claude Opus 4.6)

@nicholaschew11 nicholaschew11 force-pushed the SPARK-55147-range-scan-v4 branch from 7935015 to bde2f50 Compare March 18, 2026 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant