Add MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU #29080
Copilot wants to merge 3 commits
Thanks, this looks like the right shape overall. The fp16 CPU kernel computes in fp32, uses ORT's temp allocator, and the typed registration matches the existing QuickGelu schema/fusion path. I left one test-coverage suggestion for the new chunked execution path.
@copilot, please update the test (see inline comments). Also check whether there is a related graph optimization that needs to be enabled for the fusion to work on CPU.
TEST_F(ActivationOpTest, QuickGelu_fp16) {
  std::vector<float> input_values{-1.0f, 0.0f, 1.0f, 2.5f, -2.5f, 5.0f, -5.0f, 0.3f};
  std::vector<int64_t> dims{static_cast<int64_t>(input_values.size())};
Could we make this test cross the 4096-element chunk boundary used by the new QuickGelu<MLFloat16>::Compute() specialization? With only eight values, the test covers the math branches but would not catch a bug in task partitioning or the final partial chunk. A 4097-element case, or a small multiple plus remainder, would exercise the multi-task/tail path as well.
Done in e7fb2db. The fp16 test now uses 8205 = 2 * 4096 + 13 elements (seeded with the original corner values plus a varied ramp), so it crosses the 4096-element chunk boundary and exercises the multi-task path plus a final partial/tail chunk.
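To make the chunk arithmetic above concrete, here is a minimal sketch of how a fixed chunk size splits an element count into tasks. `SplitIntoChunks` is a hypothetical helper written for illustration, not code from this PR or from ORT's thread pool; only the 4096-element chunk size and the 8205-element test size come from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not ORT code): split [0, total) into half-open ranges
// of at most `chunk` elements. The last range holds the remainder (the tail).
std::vector<std::pair<std::size_t, std::size_t>> SplitIntoChunks(std::size_t total,
                                                                 std::size_t chunk) {
  assert(chunk > 0);
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t begin = 0; begin < total; begin += chunk) {
    // Clamp the end of the final range so it never runs past `total`.
    const std::size_t end = (total - begin < chunk) ? total : begin + chunk;
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

Under these assumptions, `SplitIntoChunks(8205, 4096)` yields two full ranges and one 13-element tail starting at offset 8192, whereas the original 8-element input yields a single range and never reaches the multi-task or tail code.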
#define REGISTER_QUICKGELU_KERNEL(data_type) \
  ONNX_OPERATOR_TYPED_KERNEL_EX( \
      QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \
Suggested change (whitespace only):
      QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \
Applied in e7fb2db: the backslashes in REGISTER_QUICKGELU_KERNEL are now aligned.
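For readers unfamiliar with per-type registration macros, the general pattern can be sketched in standalone C++. Everything here is hypothetical and for illustration only: `ToyQuickGelu`, `Registry`, `REGISTER_TOY_KERNEL`, and the `Half` stand-in for MLFloat16 are invented names, and the macro does not reproduce what ONNX_OPERATOR_TYPED_KERNEL_EX expands to.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a 16-bit float storage type (not MLFloat16).
struct Half {
  std::uint16_t bits;
};

// Toy kernel template; each registered element type gets its own specialization.
template <typename T>
struct ToyQuickGelu {
  static const char* TypeName();
};

// Toy registry of registered element-type names (function-local static avoids
// initialization-order problems).
std::vector<std::string>& Registry() {
  static std::vector<std::string> names;
  return names;
}

// One macro invocation per element type, with the continuation backslashes
// aligned in a single column.
#define REGISTER_TOY_KERNEL(data_type)                      \
  template <>                                               \
  const char* ToyQuickGelu<data_type>::TypeName() {         \
    return #data_type;                                      \
  }                                                         \
  static const bool toy_registered_##data_type =            \
      (Registry().push_back(#data_type), true)

REGISTER_TOY_KERNEL(float);
REGISTER_TOY_KERNEL(Half);

// Usage check: both element types are present once static initialization ran.
void CheckRegistrations() {
  assert(Registry().size() == 2);
}
```

The point of the typed form is that adding a second element type is a one-line change at the registration site, which matches the float + MLFloat16 switch described in this PR.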
Description
On the CPU EP, QuickGelu was registered for float only, so on fp16 graphs the Swish/SiLU activation (x * Sigmoid(alpha*x)) ran as unfused, cast-wrapped Sigmoid + Mul, a significant regression on ARMv8.2-A (Cortex-A76). The QuickGeluFusion pass is already dtype-aware; the missing piece was an fp16 kernel to target.

- contrib_ops/cpu/activations.cc: Added QuickGelu<MLFloat16>::Compute. It converts fp16→fp32 (MlasConvertHalfToFloatBufferInParallel), runs the existing fused float path (MlasComputeSilu for alpha==1, else scaled MlasComputeLogistic + MlasEltwiseMul), and converts back fp32→fp16. This keeps the activation fused into one kernel while computing in fp32, giving a graceful fallback that is correct on CPUs without native fp16 (x86 unaffected). Registration switched to a typed macro for float + MLFloat16.
- contrib_ops/cpu/cpu_contrib_kernels.cc: Updated the class-name declaration and BuildKernelCreateInfo entry from non-typed to typed float + MLFloat16.
- test/contrib_ops/activation_op_test.cc: Added QuickGelu_fp16 covering alpha = 1.702, 1.0 (SiLU), -1.702 on the CPU EP.

The MSDomain QuickGelu schema already permits tensor(float16), so no schema change was needed.

Motivation and Context
fp16 inference on ARM64 edge/mobile is attractive for model size/memory, but the unfused fp16 Swish dominated latency (~118 ms of Sigmoid + Mul vs ~28 ms fused QuickGelu in the reported BirdNET v2.4 profile on RPi5), erasing the fp16 benefit. The only prior workaround was keeping the activation in fp32 during conversion. CUDA EP already registers QuickGelu for MLFloat16; this brings CPU to parity.
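
As a reference for the math the kernel fuses, here is a minimal fp32 sketch of y = x * Sigmoid(alpha * x). It is not the PR's implementation: plain loops stand in for the MLAS routines named in the description, the fp16 conversion steps are omitted, and `QuickGeluReference` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference only: y[i] = x[i] * sigmoid(alpha * x[i]), computed in fp32.
// With alpha == 1 this is SiLU/Swish.
std::vector<float> QuickGeluReference(const std::vector<float>& x, float alpha) {
  assert(std::isfinite(alpha));
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const float z = alpha * x[i];                     // scale the input
    const float s = 1.0f / (1.0f + std::exp(-z));     // logistic (sigmoid)
    y[i] = x[i] * s;                                  // element-wise multiply
  }
  return y;
}
```

A reference like this is useful for generating expected values in a test. Because sigmoid(z) + sigmoid(-z) = 1, the outputs for alpha and -alpha sum to x, which gives a cheap consistency check for the negative-alpha case covered by the new test.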