[SPARK-56033][SQL] Support whole-stage codegen for `ArrayTransform` by LuciferYang · Pull Request #54864 · apache/spark

LuciferYang · 2026-03-17T12:32:28Z

What changes were proposed in this pull request?

This PR adds code generation support to ArrayTransform (the SQL transform() function) and its underlying HOF (Higher-Order Function) infrastructure (NamedLambdaVariable, LambdaFunction), removing CodegenFallback from these three expressions so that queries using transform() can participate in whole-stage code generation.

Background

All 11 higher-order functions in Spark SQL currently extend CodegenFallback. CodegenFallback still executes correctly, but it has a key limitation: the supportCodegen check in CollapseCodegenStages (defined in WholeStageCodegenExec.scala) returns false when it finds a CodegenFallback expression, so the entire operator pipeline, not just the HOF, falls back from whole-stage codegen. Even the surrounding non-HOF expressions in the same stage therefore lose the benefits of codegen.
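The effect of that check can be pictured with a toy model. The sketch below is not Spark code: `Expr`, `Fallback`, `Leaf`, `Add`, `HigherOrder` and `supportsCodegen` are hypothetical names that only mimic a tree walk which rejects a whole set of expressions as soon as one of them is marked as a fallback.

```scala
object FallbackCheckSketch {
  // Hypothetical stand-ins for Catalyst expressions; these are not Spark classes.
  sealed trait Expr { def children: Seq[Expr] }
  trait Fallback // marker trait, playing the role of CodegenFallback

  final case class Leaf(name: String) extends Expr {
    val children: Seq[Expr] = Nil
  }
  final case class Add(left: Expr, right: Expr) extends Expr {
    val children: Seq[Expr] = Seq(left, right)
  }
  // A higher-order function that has no codegen implementation yet.
  final case class HigherOrder(arg: Expr) extends Expr with Fallback {
    val children: Seq[Expr] = Seq(arg)
  }

  // True if the expression or any of its descendants is marked as a fallback.
  def containsFallback(e: Expr): Boolean =
    e.isInstanceOf[Fallback] || e.children.exists(containsFallback)

  // All expressions are rejected together if any single one falls back.
  def supportsCodegen(exprs: Seq[Expr]): Boolean =
    !exprs.exists(containsFallback)
}
```

In this model, adding a single `HigherOrder` expression next to an otherwise codegen-friendly `Add` flips `supportsCodegen` to false for both, which is the coverage loss described above.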

Design

Lambda variable binding mechanism: CodegenContext gains a new lambdaVariableMap: Map[ExprId, ExprCode] and a withLambdaVariableBindings save/restore helper, following the established currentVars/INPUT_ROW pattern. The enclosing HOF registers bindings for its lambda parameters before generating the lambda body, and NamedLambdaVariable.doGenCode looks up its binding and emits a direct reference to the bound variable, with no extra indirection.
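A minimal sketch of the save/restore idea, written as plain Scala rather than against Spark's CodegenContext. `Ctx`, `VarId`, `VarCode` and `lookup` are hypothetical simplifications (Spark's real key and value types are ExprId and ExprCode), and merging the new bindings into the existing map is an assumption about how nested lambdas might be handled, not something the PR description states.

```scala
object LambdaBindingSketch {
  // Hypothetical, simplified stand-ins for ExprId and ExprCode.
  type VarId = Long
  final case class VarCode(isNull: String, value: String)

  final class Ctx {
    // Bindings visible to the lambda body that is currently being generated.
    var lambdaVariableMap: Map[VarId, VarCode] = Map.empty

    // Install extra bindings while `body` runs, then restore the previous map,
    // so bindings of an inner lambda never leak into the enclosing scope.
    def withLambdaVariableBindings[T](bindings: Map[VarId, VarCode])(body: => T): T = {
      val saved = lambdaVariableMap
      lambdaVariableMap = saved ++ bindings // assumption: inner scopes see outer bindings
      try body finally lambdaVariableMap = saved
    }

    // What a lambda variable would emit: the bound variable name, if one is registered.
    def lookup(id: VarId): Option[String] = lambdaVariableMap.get(id).map(_.value)
  }
}
```

The `try`/`finally` keeps the map consistent even if generating the body throws, which is the main reason to wrap the mutation in a helper instead of assigning the field directly.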

Mutable state fields: lambda variable values are stored in class fields created with ctx.addMutableState() rather than in local variables, because Expression.reduceCodeSize() may extract the lambda body code into separate private methods, where local loop variables would be out of scope.
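The scoping argument can be shown with a hand-written analogue of a generated class. This is only an illustration in Scala (the real generated code is Java source), and `GeneratedTransform`, `elem` and `lambdaBody` are made-up names.

```scala
object MutableStateSketch {
  // Hand-written analogue of a generated class; all names are illustrative only.
  final class GeneratedTransform {
    // Class field holding the current element, the role played by ctx.addMutableState().
    private var elem: Int = 0

    // Imagine this method was split out of the loop to keep method bodies small.
    // It can still read `elem` because it is a field, not a variable local to the loop.
    private def lambdaBody(): Int = elem + 1

    def apply(input: Array[Int]): Array[Int] = {
      val out = new Array[Int](input.length)
      var i = 0
      while (i < input.length) {
        elem = input(i) // write the lambda variable before evaluating the body
        out(i) = lambdaBody()
        i += 1
      }
      out
    }
  }
}
```

Had `elem` been declared inside the `while` loop, `lambdaBody()` would not compile, which mirrors the failure mode that extraction into private methods would otherwise cause.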

AtomicReference dual-write: when the lambda body contains CodegenFallback sub-expressions (e.g., ArrayFilter, which does not have codegen yet), the generated loop also writes to the AtomicReference on NamedLambdaVariable, so that eval() calls from the fallback sub-expressions read the correct value. A static check (function.exists(_.isInstanceOf[CodegenFallback])) skips these writes when the lambda body is fully codegen'd, avoiding unnecessary boxing overhead.
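A rough sketch of the dual-write pattern in plain Scala. `LambdaVar`, `Loop` and `needsBoxedWrite` are hypothetical names; `needsBoxedWrite` stands in for the result of the static check, and the `body` function stands in for the generated or interpreted lambda body.

```scala
import java.util.concurrent.atomic.AtomicReference

object DualWriteSketch {
  // Boxed slot that interpreted (fallback) sub-expressions read from.
  final class LambdaVar { val value = new AtomicReference[Any]() }

  // `needsBoxedWrite` stands in for the static check on the lambda body.
  final class Loop(v: LambdaVar, needsBoxedWrite: Boolean) {
    private var elem: Int = 0 // fast path: primitive field read by generated code

    def run(input: Array[Int])(body: Int => Int): Array[Int] =
      input.map { x =>
        elem = x
        if (needsBoxedWrite) v.value.set(x) // boxing happens only when a fallback needs it
        body(elem)
      }
  }
}
```

A body that ignores its argument and reads `v.value.get()` instead behaves like a fallback sub-expression calling eval(): it only sees the right element when the second write is enabled.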

Graceful fallback: when no binding is registered (e.g., in GenerateMutableProjection paths), NamedLambdaVariable.doGenCode falls back to eval() via references[] and emits a logWarning for diagnostic purposes.

Why are the changes needed?

Queries using transform() currently disable whole-stage codegen for the entire stage, reducing codegen coverage for the operator pipeline. This PR re-enables whole-stage codegen for stages containing transform().

More importantly, this establishes the reusable HOF codegen infrastructure (lambdaVariableMap, withLambdaVariableBindings, and the AtomicReference dual-write pattern) that other higher-order functions (ArrayFilter, ArrayExists, ArrayAggregate, MapFilter, etc.) can adopt incrementally to further expand whole-stage codegen coverage.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests: HigherOrderFunctionsSuite, WholeStageCodegenSuite, DataFrameFunctionsSuite all pass.
New unit test: LambdaFunction.doGenCode throws SparkException when bindings are missing.
New integration test: 9 WholeStageCodegenSuite scenarios covering:
- Basic transform(array(1,2,3), x -> x+1)
- Nested transform(transform(arr, x -> x+1), y -> y*2)
- Transform with index variable (x, i) -> x + i
- Nullable elements array(1, null, 3)
- Empty array
- Nested CodegenFallback HOF (filter inside transform)
- Null array argument
- Non-primitive types: struct and string
Benchmark: HigherOrderFunctionBenchmark added to measure transform performance across element types (int, string, struct, nullable), nested transforms, and mixed codegen/fallback scenarios.
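For readers less familiar with transform(), the results the scenarios above are expected to produce can be modelled in plain Scala. This is a sketch of the SQL semantics only, not of the test code in the PR: `Arr` and this `transform` helper are hypothetical, with None standing for SQL NULL and the index starting at 0.

```scala
object TransformSemanticsSketch {
  // Plain-Scala model: None stands for SQL NULL, for elements and for the array itself.
  type Arr = Option[Seq[Option[Int]]]

  // Apply `f` to every (element, index) pair; a NULL array stays NULL.
  def transform(arr: Arr)(f: (Option[Int], Int) => Option[Int]): Arr =
    arr.map(_.zipWithIndex.map { case (x, i) => f(x, i) })

  // Lambda bodies matching the listed scenarios; NULL inputs propagate to NULL outputs.
  val plusOne: (Option[Int], Int) => Option[Int] = (x, _) => x.map(_ + 1)
  val plusIndex: (Option[Int], Int) => Option[Int] = (x, i) => x.map(_ + i)
}
```

Nesting works by composition in this model, so `transform(transform(arr)(plusOne))((y, _) => y.map(_ * 2))` corresponds to the nested scenario.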

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

dongjoon-hyun · 2026-03-17T15:00:53Z

...e/src/test/scala/org/apache/spark/sql/execution/benchmark/HigherOrderFunctionBenchmark.scala

+ *        "benchmarks/HigherOrderFunctionBenchmark-results.txt".
+ * }}}
+ */
+object HigherOrderFunctionBenchmark extends SqlBasedBenchmark {


Shall we spin off and merge HigherOrderFunctionBenchmark first, @LuciferYang?

OK, I'll spin off HigherOrderFunctionBenchmark into a separate PR.

But here's the thing: if HigherOrderFunctionBenchmark is submitted first, its control group also runs with codegen off. So if the current PR cannot be merged, the previously merged HigherOrderFunctionBenchmark may not serve much purpose. What's your opinion on this, @dongjoon-hyun?

Kimahriman · 2026-03-17T20:18:42Z

#34558

LuciferYang · 2026-03-18T02:41:11Z

> #34558

Sorry, I didn't notice your PR. We can work on advancing your PR now.

Kimahriman · 2026-03-18T02:46:16Z

> > #34558
>
> Sorry, I didn't notice your PR. We can work on advancing your PR now.

Thanks! It'd be great to finally get that in. It's fairly similar to your approach, and we've been using it internally for several years now.

LuciferYang added 2 commits March 17, 2026 20:29

init

3c630f6

add benchmark and result

d0afaf0

dongjoon-hyun reviewed Mar 17, 2026


LuciferYang marked this pull request as draft March 18, 2026 02:53

LuciferYang mentioned this pull request Mar 18, 2026

[SPARK-37019][SQL] Add codegen support to array higher-order functions #34558

Open

LuciferYang wants to merge 2 commits into apache:master from LuciferYang:SPARK-56033


3 participants
