[SPARK-56033][SQL] Support whole-stage codegen for ArrayTransform #54864
LuciferYang wants to merge 2 commits into apache:master
 * "benchmarks/HigherOrderFunctionBenchmark-results.txt".
 * }}}
 */
object HigherOrderFunctionBenchmark extends SqlBasedBenchmark {
Shall we spin off HigherOrderFunctionBenchmark and merge it first, @LuciferYang?
OK, I'll spin off HigherOrderFunctionBenchmark into a separate PR.
One concern, though: if HigherOrderFunctionBenchmark is submitted first, its control group would effectively also run with codegen off. So if the current PR cannot be merged, the previously merged HigherOrderFunctionBenchmark may not serve much purpose. What's your opinion on this, @dongjoon-hyun?
Sorry, I didn't notice your PR. We can work on advancing your PR now.
Thanks! It'd be great to finally get that in. It's fairly similar to your approach, and we've been using it internally for several years now.
What changes were proposed in this pull request?
This PR adds code generation support to ArrayTransform (the SQL transform() function) and its underlying HOF (Higher-Order Function) infrastructure (NamedLambdaVariable, LambdaFunction), removing CodegenFallback from these three expressions so that queries using transform() can participate in whole-stage code generation.
Background
All 11 higher-order functions in Spark SQL currently extend CodegenFallback. While CodegenFallback still executes correctly, it has a key limitation: WholeStageCodegenExec.supportCodegen returns false when any CodegenFallback expression is found, causing the entire operator pipeline, not just the HOF, to fall back from whole-stage codegen. This means even surrounding non-HOF expressions in the same stage lose the benefits of codegen.
Design
Lambda variable binding mechanism: A new lambdaVariableMap: Map[ExprId, ExprCode] in CodegenContext with a withLambdaVariableBindings save/restore helper (following the established currentVars/INPUT_ROW pattern). The enclosing HOF registers lambda parameter bindings before generating the lambda body; NamedLambdaVariable.doGenCode looks up its binding to emit zero-overhead variable references.
Mutable state fields: Lambda variable values use ctx.addMutableState() (class fields) instead of local variables, because Expression.reduceCodeSize() may extract lambda body code into separate private methods where local loop variables would be out of scope.
AtomicReference dual-write: When the lambda body contains CodegenFallback sub-expressions (e.g., ArrayFilter, which hasn't been given codegen yet), the generated loop also writes to the AtomicReference on NamedLambdaVariable, so that eval() calls from fallback sub-expressions read the correct value. A static check (function.exists(_.isInstanceOf[CodegenFallback])) skips these writes when the lambda body is fully codegen'd, avoiding unnecessary boxing overhead.
Graceful fallback: NamedLambdaVariable.doGenCode falls back to eval() via references[] when no binding is registered (e.g., in GenerateMutableProjection paths), with a logWarning for diagnostic purposes.
Why are the changes needed?
Queries using transform() currently disable whole-stage codegen for the entire stage, reducing codegen coverage for the operator pipeline. This PR re-enables whole-stage codegen for stages containing transform().
More importantly, this establishes the reusable HOF codegen infrastructure (lambdaVariableMap, withLambdaVariableBindings, and the AtomicReference dual-write pattern) that other higher-order functions (ArrayFilter, ArrayExists, ArrayAggregate, MapFilter, etc.) can adopt incrementally to further expand whole-stage codegen coverage.
Does this PR introduce any user-facing change?
No.
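The change is internal to code generation. As a rough illustration of the save/restore binding helper described under Design, here is a minimal, self-contained Scala sketch. It is not the code from this PR: MiniCodegenContext, the simplified ExprCode, the lookup method, and the use of Long in place of ExprId are all hypothetical stand-ins chosen so the example runs without Spark on the classpath.

```scala
// Hypothetical stand-in for Spark's ExprCode: only the generated variable name is kept.
final case class ExprCode(value: String)

// Hypothetical stand-in for CodegenContext; lambda variables are keyed by a Long
// here instead of Spark's ExprId so the sketch has no Spark dependency.
final class MiniCodegenContext {
  // Maps a lambda variable's id to the generated name holding its current value.
  private var lambdaVariableMap: Map[Long, ExprCode] = Map.empty

  // Installs `bindings` on top of the current map while `body` runs,
  // then restores the previous map even if `body` throws.
  def withLambdaVariableBindings[T](bindings: Map[Long, ExprCode])(body: => T): T = {
    val saved = lambdaVariableMap
    lambdaVariableMap = saved ++ bindings
    try body finally lambdaVariableMap = saved
  }

  // None models the "no binding registered" case mentioned under Design.
  def lookup(exprId: Long): Option[ExprCode] = lambdaVariableMap.get(exprId)
}

object BindingDemo {
  def main(args: Array[String]): Unit = {
    val ctx = new MiniCodegenContext
    // The enclosing HOF registers its lambda parameter before generating the body.
    val seenInside = ctx.withLambdaVariableBindings(Map(1L -> ExprCode("lambdaVar_x_0"))) {
      ctx.lookup(1L).map(_.value)
    }
    println(seenInside)      // binding is visible while the lambda body is generated
    println(ctx.lookup(1L))  // binding is gone once the helper returns
  }
}
```

Restoring the saved map in a finally block keeps nested HOFs from leaking bindings into each other, which is the property the currentVars/INPUT_ROW style of save/restore is meant to provide.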
How was this patch tested?
HigherOrderFunctionsSuite, WholeStageCodegenSuite, DataFrameFunctionsSuite all pass.
LambdaFunction.doGenCode throws SparkException when bindings are missing.
WholeStageCodegenSuite scenarios covering:
  transform(array(1,2,3), x -> x+1)
  transform(transform(arr, x -> x+1), y -> y*2)
  (x, i) -> x + i
  array(1, null, 3)
  CodegenFallback HOF (filter inside transform)
HigherOrderFunctionBenchmark added to measure transform performance across element types (int, string, struct, nullable), nested transforms, and mixed codegen/fallback scenarios.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Sonnet 4.6
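For readers unfamiliar with transform(), the WholeStageCodegenSuite scenarios listed under "How was this patch tested?" can be modeled in plain Scala. This is only a sketch of the expected SQL semantics, not the test code from the PR: TransformModel and its helpers are hypothetical names, and null array elements are modeled as None under the usual SQL rule that null + 1 is null.

```scala
object TransformModel {
  // SQL arrays may contain nulls, so an element is modeled as Option[Int].
  type Elem = Option[Int]

  // transform(arr, x -> f(x)): apply the lambda to every element.
  def transform(arr: Seq[Elem])(f: Elem => Elem): Seq[Elem] = arr.map(f)

  // transform(arr, (x, i) -> f(x, i)): the lambda also receives the 0-based index.
  def transformWithIndex(arr: Seq[Elem])(f: (Elem, Int) => Elem): Seq[Elem] =
    arr.zipWithIndex.map { case (x, i) => f(x, i) }

  def main(args: Array[String]): Unit = {
    val arr = Seq[Elem](Some(1), Some(2), Some(3))

    // transform(array(1,2,3), x -> x+1)
    println(transform(arr)(x => x.map(_ + 1)))

    // transform(transform(arr, x -> x+1), y -> y*2)
    println(transform(transform(arr)(x => x.map(_ + 1)))(y => y.map(_ * 2)))

    // (x, i) -> x + i
    println(transformWithIndex(arr)((x, i) => x.map(_ + i)))

    // array(1, null, 3): the null element stays null after x -> x+1
    println(transform(Seq[Elem](Some(1), None, Some(3)))(x => x.map(_ + 1)))
  }
}
```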