Add MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU #29080

Draft
Copilot wants to merge 3 commits into main from copilot/fix-float16-swish-silu-fusion

Copilot AI commented Jun 16, 2026

Description

On the CPU EP, QuickGelu was registered for float only, so on fp16 graphs the Swish/SiLU activation (x * Sigmoid(alpha*x)) ran as unfused Sigmoid + Mul (cast-wrapped) — a significant regression on ARMv8.2-A (Cortex-A76). The QuickGeluFusion pass is already dtype-aware; the missing piece was an fp16 kernel to target.

  • contrib_ops/cpu/activations.cc — Added QuickGelu<MLFloat16>::Compute. It converts fp16→fp32 (MlasConvertHalfToFloatBufferInParallel), runs the existing fused float path (MlasComputeSilu for alpha==1, else scaled MlasComputeLogistic + MlasEltwiseMul), and converts back fp32→fp16. This keeps the activation fused into one kernel while computing in fp32, giving a graceful fallback that is correct on CPUs without native fp16 (x86 unaffected). Registration switched to a typed macro for float + MLFloat16.
  • contrib_ops/cpu/cpu_contrib_kernels.cc — Updated the class-name declaration and BuildKernelCreateInfo entry from non-typed to typed float + MLFloat16.
  • test/contrib_ops/activation_op_test.cc — Added QuickGelu_fp16 covering alpha = 1.702, 1.0 (SiLU), -1.702 on the CPU EP.

The MSDomain QuickGelu schema already permits tensor(float16), so no schema change was needed.
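The convert, compute, convert-back flow described in the first bullet can be illustrated with a small standalone sketch. This is not the PR's code and does not use MLAS: `HalfToFloat`, `FloatToHalf`, `QuickGeluFloat`, and `QuickGeluHalf` are hypothetical helpers written for illustration, fp16 values are held as raw `uint16_t` bit patterns, and the conversion assumes IEEE 754 binary16 with the default round-to-nearest-even mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: decode an IEEE 754 binary16 bit pattern to float (exact).
inline float HalfToFloat(std::uint16_t h) {
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
  const std::uint32_t exp = (h >> 10) & 0x1Fu;
  const std::uint32_t mant = h & 0x3FFu;
  if (exp == 0) {  // zero or subnormal: value is mant * 2^-24
    const float v = std::ldexp(static_cast<float>(mant), -24);
    return sign ? -v : v;
  }
  const std::uint32_t bits =
      (exp == 31) ? (sign | 0x7F800000u | (mant << 13))            // inf or NaN
                  : (sign | ((exp + 112u) << 23) | (mant << 13));  // rebias 15 -> 127
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Hypothetical helper: encode float as binary16, rounding to nearest even.
inline std::uint16_t FloatToHalf(float f) {
  std::uint32_t x;
  std::memcpy(&x, &f, sizeof(x));
  const std::uint32_t sign = (x >> 16) & 0x8000u;
  x &= 0x7FFFFFFFu;
  if (x > 0x7F800000u) return static_cast<std::uint16_t>(sign | 0x7E00u);   // NaN
  if (x >= 0x477FF000u) return static_cast<std::uint16_t>(sign | 0x7C00u);  // overflow to inf
  if (x < 0x38800000u) {  // below the smallest normal half (2^-14)
    float a;
    std::memcpy(&a, &x, sizeof(a));
    const auto m = static_cast<std::uint32_t>(std::nearbyint(std::ldexp(a, 24)));
    return static_cast<std::uint16_t>(sign | m);
  }
  x += 0xFFFu + ((x >> 13) & 1u);  // round to nearest, ties to even
  return static_cast<std::uint16_t>(sign | ((x - 0x38000000u) >> 13));
}

// QuickGelu in fp32: x * Sigmoid(alpha * x). alpha == 1 gives SiLU/Swish.
inline float QuickGeluFloat(float x, float alpha) {
  return x / (1.0f + std::exp(-alpha * x));
}

// fp16 in, fp16 out, with all arithmetic done in fp32.
inline std::vector<std::uint16_t> QuickGeluHalf(const std::vector<std::uint16_t>& input,
                                                float alpha) {
  std::vector<std::uint16_t> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    output[i] = FloatToHalf(QuickGeluFloat(HalfToFloat(input[i]), alpha));
  }
  return output;
}
```

Because the arithmetic runs in fp32, the only fp16 rounding happens once at the output, which is why this style of fallback stays correct on CPUs without native fp16 arithmetic.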

Motivation and Context

fp16 inference on ARM64 edge/mobile is attractive for model size/memory, but the unfused fp16 Swish dominated latency (~118 ms of Sigmoid+Mul vs ~28 ms fused QuickGelu in the reported BirdNET v2.4 profile on RPi5), erasing the fp16 benefit. The only prior workaround was keeping the activation in fp32 during conversion. CUDA EP already registers QuickGelu for MLFloat16; this brings CPU to parity.

Copilot AI changed the title from "[WIP] Fix float16 Swish/SiLU fusion to QuickGelu on CPU" to "Add MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU" on Jun 16, 2026
Copilot AI requested a review from tianleiwu on June 16, 2026 18:24

tianleiwu left a comment

Thanks, this looks like the right shape overall. The fp16 CPU kernel computes in fp32, uses ORT's temp allocator, and the typed registration matches the existing QuickGelu schema/fusion path. I left one test-coverage suggestion for the new chunked execution path.

@copilot, please update the test (see inline comments). Also check whether any related graph optimization needs to be enabled for the fusion to apply on CPU.


TEST_F(ActivationOpTest, QuickGelu_fp16) {
  std::vector<float> input_values{-1.0f, 0.0f, 1.0f, 2.5f, -2.5f, 5.0f, -5.0f, 0.3f};
  std::vector<int64_t> dims{static_cast<int64_t>(input_values.size())};

Could we make this test cross the 4096-element chunk boundary used by the new QuickGelu<MLFloat16>::Compute() specialization? With only eight values, the test covers the math branches but would not catch a bug in task partitioning or the final partial chunk. A 4097-element case, or a small multiple plus remainder, would exercise the multi-task/tail path as well.

Done in e7fb2db. The fp16 test now uses 8205 = 2 * 4096 + 13 elements (seeded with the original corner values plus a varied ramp), so it crosses the 4096-element chunk boundary and exercises the multi-task path plus a final partial/tail chunk.
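The arithmetic behind the 8205-element choice can be checked with a short sketch. How the kernel actually partitions work is not shown in this conversation, so `SplitIntoChunks` below is a hypothetical helper that simply assumes fixed-size chunks of at most 4096 elements with one shorter tail chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, total) into consecutive [begin, end) ranges
// holding at most `chunk` elements each. The last range may be shorter.
inline std::vector<std::pair<std::size_t, std::size_t>> SplitIntoChunks(std::size_t total,
                                                                        std::size_t chunk) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (chunk == 0) {
    return ranges;  // nothing sensible to do with a zero chunk size
  }
  for (std::size_t begin = 0; begin < total; begin += chunk) {
    ranges.emplace_back(begin, std::min(begin + chunk, total));
  }
  return ranges;
}
```

Under that assumption, 8205 elements yield two full chunks plus a 13-element tail, while the original eight-element input fits in a single chunk and never reaches the multi-task or tail handling.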

github-actions bot left a comment

You can commit the suggested changes from lintrunner.


#define REGISTER_QUICKGELU_KERNEL(data_type) \
  ONNX_OPERATOR_TYPED_KERNEL_EX( \
      QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \

Suggested change
QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \
QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \

Applied in e7fb2db — the backslashes in REGISTER_QUICKGELU_KERNEL are now aligned.


2 participants