deps: [DO NOT MERGE] Start testing what will be DataFusion 54.0#3916
Draft
mbutrovich wants to merge 19 commits intoapache:mainfrom
Draft
deps: [DO NOT MERGE] Start testing what will be DataFusion 54.0#3916mbutrovich wants to merge 19 commits intoapache:mainfrom
mbutrovich wants to merge 19 commits intoapache:mainfrom
Conversation
Contributor
Author
|
Added a test for the Spark SQL failure for FULL join with filter on NULL. Testing the fix now, bumped the dependency for DataFusion to test apache/datafusion#21660 |
github-merge-queue bot
pushed a commit
to apache/datafusion
that referenced
this pull request
Apr 16, 2026
…er evaluates to NULL (#21660) ## Which issue does this PR close? - Closes #. ## Rationale for this change While enabling `SortMergeJoinExec` with filters in [DataFusion Comet](https://github.com/apache/datafusion-comet), I hit a correctness bug in DataFusion's `SortMergeJoinExec` for full outer joins with nullable filter columns. The bug was originally surfaced via the [SPARK-43113](https://issues.apache.org/jira/browse/SPARK-43113) reproducer. When a join filter expression evaluates to `NULL` (e.g., `l.b < (r.b + 1)` where `r.b` is `NULL`), the full outer join treats the row pair as **matched** instead of **unmatched**. Per SQL semantics, `NULL` in a boolean filter context is `false` (not satisfied), so the rows should be emitted as separate unmatched rows. The bug has been present since filtered full outer join support was added to SMJ in #12764 / #13369. It was never caught because: 1. The join fuzz tests generate filter column data with `Int32Array::from_iter_values()`, which never produces `NULL` values. 2. No existing unit test or sqllogictest exercised a full outer join filter that evaluates to `NULL`. ## What changes are included in this PR? **Root cause:** The full outer join code path had a special case that preserved raw `NULL` values from the filter expression result (`pre_mask`) instead of converting them to `false` via `prep_null_mask_filter` like left/right outer joins do. This caused two problems: 1. **Streamed (left) side:** In `get_corrected_filter_mask()`, `NULL` entries in `filter_mask` are treated as "pass through" (for pre-null-joined rows from `append_nulls()`). But when the filter expression itself returns `NULL`, those entries also appear as `NULL` in the mask — and get incorrectly treated as matched. This produced wrong join output (matched rows instead of unmatched). 2. **Buffered (right) side:** `BooleanArray::value()` was called on `NULL` positions in `pre_mask` to update `FilterState`. At NULL positions, the values buffer contains a deterministic but semantically meaningless result (computed from the default zero-storage of NULL inputs). For some rows this value happens to be `true`, which incorrectly marks unmatched buffered rows as `SomePassed` and silently drops them from the output. **Fix:** Remove the full outer join special case in `materializing_stream.rs`. All outer join types now uniformly use the null-corrected `mask` (where `NULL` → `false` via `prep_null_mask_filter`) for both deferred filtering metadata and `FilterState` tracking. Semi/anti/mark joins are unaffected — they use `BitwiseSortMergeJoinStream` which already converts NULLs to `false`. **Tests:** - New unit test `join_full_null_filter_result` reproducing the SPARK-43113 scenario with a nullable right-side column. - Modified `make_staggered_batches_i32` in `join_fuzz.rs` to inject ~10% `NULL` values into the filter column (`x`), so the fuzz tests exercise `NULL` filter handling across all join types. ## Are these changes tested? Yes. - New unit test (`join_full_null_filter_result`) directly reproduces the bug. - Existing 57 SMJ unit tests all pass. - All 41 join fuzz tests pass with the new nullable filter column data, including `test_full_join_1k_filtered` which compares `HashJoinExec` vs `SortMergeJoinExec` and would have caught this bug if the fuzz data had included `NULL`s. - Will run 100 iterations of the fuzz tests overnight to shake out any remaining nondeterministic issues. - Testing in Comet CI (all Spark SQL tests) apache/datafusion-comet#3916 ## Are there any user-facing changes? Full outer sort-merge joins with filters involving nullable columns now produce correct results. Previously, rows where the filter evaluated to `NULL` were incorrectly included as matched; they are now correctly emitted as unmatched (null-joined) rows.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Starting to test what will be DataFusion 54.0 to see if that helps find issues in DataFusion when the RC comes around. Lately we haven't gotten all the work done in time in Comet to test an RC, and find issues after the .0 release and have to wait for at least .1 to bump.