Add MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU by Copilot · Pull Request #29080 · microsoft/onnxruntime

Copilot · 2026-06-16T17:32:07Z

Description

On the CPU EP, QuickGelu was registered for float only, so on fp16 graphs the Swish/SiLU activation (x * Sigmoid(alpha*x)) ran as unfused Sigmoid + Mul (cast-wrapped) — a significant regression on ARMv8.2-A (Cortex-A76). The QuickGeluFusion pass is already dtype-aware; the missing piece was an fp16 kernel to target.

contrib_ops/cpu/activations.cc — Added QuickGelu<MLFloat16>::Compute. It converts fp16→fp32 (MlasConvertHalfToFloatBufferInParallel), runs the existing fused float path (MlasComputeSilu for alpha==1, else scaled MlasComputeLogistic + MlasEltwiseMul), and converts back fp32→fp16. This keeps the activation fused into one kernel while computing in fp32, giving a graceful fallback that is correct on CPUs without native fp16 (x86 unaffected). Registration switched to a typed macro for float + MLFloat16.
contrib_ops/cpu/cpu_contrib_kernels.cc — Updated the class-name declaration and BuildKernelCreateInfo entry from non-typed to typed float + MLFloat16.
test/contrib_ops/activation_op_test.cc — Added QuickGelu_fp16 covering alpha = 1.702, 1.0 (SiLU), -1.702 on the CPU EP.

The MSDomain QuickGelu schema already permits tensor(float16), so no schema change was needed.
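For reference, the math the fused kernel evaluates can be sketched in plain fp32 as below. This is a minimal illustration and not the code from the PR: `QuickGeluRef` and `QuickGeluRefBuffer` are hypothetical names, the fp16↔fp32 conversion steps are omitted, and nothing here calls MLAS.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference QuickGelu in fp32: y = x * sigmoid(alpha * x).
// Written as x / (1 + exp(-alpha * x)), which is the same expression.
// With alpha == 1 this is Swish/SiLU.
// Hypothetical helper for illustration; not an ORT or MLAS function.
inline float QuickGeluRef(float x, float alpha) {
  return x / (1.0f + std::exp(-alpha * x));
}

// Applies the reference elementwise to a buffer.
// Assumption: in the kernel described above, the input would first be
// converted from fp16 to fp32 and the result converted back afterwards.
inline std::vector<float> QuickGeluRefBuffer(const std::vector<float>& input,
                                             float alpha) {
  std::vector<float> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    output[i] = QuickGeluRef(input[i], alpha);
  }
  return output;
}
```

A scalar reference like this is mainly useful for generating expected values in a test; it says nothing about how the vectorized path is implemented.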

Motivation and Context

fp16 inference on ARM64 edge/mobile is attractive for model size/memory, but the unfused fp16 Swish dominated latency (~118 ms of Sigmoid+Mul vs ~28 ms fused QuickGelu in the reported BirdNET v2.4 profile on RPi5), erasing the fp16 benefit. The only prior workaround was keeping the activation in fp32 during conversion. CUDA EP already registers QuickGelu for MLFloat16; this brings CPU to parity.

tianleiwu

Thanks, this looks like the right shape overall. The fp16 CPU kernel computes in fp32, uses ORT's temp allocator, and the typed registration matches the existing QuickGelu schema/fusion path. I left one test-coverage suggestion for the new chunked execution path.

@copilot, please update the test (see inline comments). Also check whether some related graph optimization needs to be enabled for the fusion to apply on CPU.

tianleiwu · 2026-06-16T21:34:51Z


+TEST_F(ActivationOpTest, QuickGelu_fp16) {
+  std::vector<float> input_values{-1.0f, 0.0f, 1.0f, 2.5f, -2.5f, 5.0f, -5.0f, 0.3f};
+  std::vector<int64_t> dims{static_cast<int64_t>(input_values.size())};


Could we make this test cross the 4096-element chunk boundary used by the new QuickGelu<MLFloat16>::Compute() specialization? With only eight values, the test covers the math branches but would not catch a bug in task partitioning or the final partial chunk. A 4097-element case, or a small multiple plus remainder, would exercise the multi-task/tail path as well.

Done in e7fb2db. The fp16 test now uses 8205 = 2 * 4096 + 13 elements (seeded with the original corner values plus a varied ramp), so it crosses the 4096-element chunk boundary and exercises the multi-task path plus a final partial/tail chunk.
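The arithmetic behind the 8205-element choice can be sketched as follows. This assumes simple fixed-size partitioning with a 4096-element chunk, as described in the review comment; `SplitIntoChunks` is a hypothetical helper, and the real kernel's task dispatch may differ.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Chunk size taken from the review discussion above.
constexpr std::size_t kChunkSize = 4096;

// Returns half-open [begin, end) ranges that cover `count` elements.
// Hypothetical helper for illustration; chunk_size must be nonzero.
inline std::vector<std::pair<std::size_t, std::size_t>> SplitIntoChunks(
    std::size_t count, std::size_t chunk_size = kChunkSize) {
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  for (std::size_t begin = 0; begin < count; begin += chunk_size) {
    // The last range is clamped to `count`, producing the partial tail chunk.
    ranges.emplace_back(begin, std::min(begin + chunk_size, count));
  }
  return ranges;
}
```

Under this assumption, 8205 elements yield two full chunks and a 13-element tail starting at offset 8192, whereas the original 8-element input yields a single partial chunk and never reaches the multi-task path.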

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2026-06-17T00:42:50Z

+
+#define REGISTER_QUICKGELU_KERNEL(data_type)                                            \
+  ONNX_OPERATOR_TYPED_KERNEL_EX(                                                        \
+      QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider,                       \


Suggested change

QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \

QuickGelu, kMSDomain, 1, data_type, kCpuExecutionProvider, \

Applied in e7fb2db — the backslashes in REGISTER_QUICKGELU_KERNEL are now aligned.

Initial plan

01223ed

Copilot AI assigned Copilot and tianleiwu Jun 16, 2026

Copilot started work on behalf of tianleiwu June 16, 2026 17:32

Register MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU

1cf6ed8

Copilot AI changed the title ~~[WIP] Fix float16 Swish/SiLU fusion to QuickGelu on CPU~~ Add MLFloat16 QuickGelu CPU kernel for fused fp16 Swish/SiLU Jun 16, 2026

Copilot finished work on behalf of tianleiwu June 16, 2026 18:24

Copilot AI requested a review from tianleiwu June 16, 2026 18:24

tianleiwu reviewed Jun 16, 2026

github-actions Bot reviewed Jun 17, 2026

Address review: align kernel macro and exercise 4096-chunk boundary in fp16 test

e7fb2db

Copilot wants to merge 3 commits into main from copilot/fix-float16-swish-silu-fusion
