Skip to content

[SPARK-56033][SQL] Support whole-stage codegen for ArrayTransform#54864

Draft
LuciferYang wants to merge 2 commits intoapache:masterfrom
LuciferYang:SPARK-56033
Draft

[SPARK-56033][SQL] Support whole-stage codegen for ArrayTransform#54864
LuciferYang wants to merge 2 commits intoapache:masterfrom
LuciferYang:SPARK-56033

Conversation

@LuciferYang
Copy link
Contributor

@LuciferYang LuciferYang commented Mar 17, 2026

What changes were proposed in this pull request?

This PR adds code generation support to ArrayTransform (the SQL transform() function) and its underlying HOF (Higher-Order Function) infrastructure (NamedLambdaVariable, LambdaFunction), removing CodegenFallback from these three expressions so that queries using transform() can participate in whole-stage code generation.

Background

All 11 higher-order functions in Spark SQL currently extend CodegenFallback. While CodegenFallback still executes correctly, it has a key limitation: WholeStageCodegenExec.supportCodegen returns false when any CodegenFallback expression is found, causing the entire operator pipeline — not just the HOF — to fall back from whole-stage codegen. This means even surrounding non-HOF expressions in the same stage lose the benefits of codegen.

Design

Lambda variable binding mechanism — A new lambdaVariableMap: Map[ExprId, ExprCode] in CodegenContext with a withLambdaVariableBindings save/restore helper (following the established currentVars/INPUT_ROW pattern). The enclosing HOF registers lambda parameter bindings before generating the lambda body; NamedLambdaVariable.doGenCode looks up its binding to emit zero-overhead variable references.

Mutable state fields — Lambda variable values use ctx.addMutableState() (class fields) instead of local variables, because Expression.reduceCodeSize() may extract lambda body code into separate private methods where local loop variables would be out of scope.

AtomicReference dual-write — When the lambda body contains CodegenFallback sub-expressions (e.g., ArrayFilter which hasn't been given codegen yet), the generated loop also writes to the AtomicReference on NamedLambdaVariable, so that eval() calls from fallback sub-expressions read the correct value. A static check (function.exists(_.isInstanceOf[CodegenFallback])) skips these writes when the lambda body is fully codegen'd, avoiding unnecessary boxing overhead.

Graceful fallbackNamedLambdaVariable.doGenCode falls back to eval() via references[] when no binding is registered (e.g., in GenerateMutableProjection paths), with a logWarning for diagnostic purposes.

Why are the changes needed?

Queries using transform() currently disable whole-stage codegen for the entire stage, reducing codegen coverage for the operator pipeline. This PR re-enables whole-stage codegen for stages containing transform().

More importantly, this establishes the reusable HOF codegen infrastructure (lambdaVariableMap, withLambdaVariableBindings, and the AtomicReference dual-write pattern) that other higher-order functions (ArrayFilter, ArrayExists, ArrayAggregate, MapFilter, etc.) can adopt incrementally to further expand whole-stage codegen coverage.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Existing tests: HigherOrderFunctionsSuite, WholeStageCodegenSuite, DataFrameFunctionsSuite all pass.
  • New unit test: LambdaFunction.doGenCode throws SparkException when bindings are missing.
  • New integration test: 9 WholeStageCodegenSuite scenarios covering:
    • Basic transform(array(1,2,3), x -> x+1)
    • Nested transform(transform(arr, x -> x+1), y -> y*2)
    • Transform with index variable (x, i) -> x + i
    • Nullable elements array(1, null, 3)
    • Empty array
    • Nested CodegenFallback HOF (filter inside transform)
    • Null array argument
    • Non-primitive types: struct and string
  • Benchmark: HigherOrderFunctionBenchmark added to measure transform performance across element types (int, string, struct, nullable), nested transforms, and mixed codegen/fallback scenarios.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.6

* "benchmarks/HigherOrderFunctionBenchmark-results.txt".
* }}}
*/
object HigherOrderFunctionBenchmark extends SqlBasedBenchmark {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we spin-off and merge HigherOrderFunctionBenchmark first, @LuciferYang ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll spin off HigherOrderFunctionBenchmark into a separate pr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here's the thing: if you submit the HigherOrderFunctionBenchmark first, the control group is actually also with codegen off. So if the current pr cannot be merged, then the previously merged HigherOrderFunctionBenchmark may not serve much purpose. What's your opinion on this? @dongjoon-hyun

@Kimahriman
Copy link
Contributor

#34558

@LuciferYang
Copy link
Contributor Author

#34558

Sorry, I didn't notice your PR. We can work on advancing your pr now.

@Kimahriman
Copy link
Contributor

#34558

Sorry, I didn't notice your PR. We can work on advancing your pr now.

Thanks! It be great to finally get that in. It's fairly similar to your approach, and we've been using it internally for several years now

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants