Conversation
noemotiovon
left a comment
Some of my own thoughts — hope they’re helpful.
```python
    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
    multi_block_launch = False
else:
    softmax_multi_block_forward_kernel[(n_rows,)](
```
In Triton kernel implementations on NPU, it’s best for the first dimension of the grid to be num_cores.
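For illustration, a launch along those lines might look like the sketch below. It reuses `num_cores`, `num_programs`, and the argument list already visible in this diff, but the multi-block kernel's exact parameter list is an assumption on my part:

```python
# Sketch only: make the grid's first dimension the (capped) core count rather
# than n_rows, so each program stays resident on one core and grid-strides
# over rows. `num_cores` is assumed to be queried from the NPU elsewhere,
# as it already is in this diff.
num_programs = min(num_cores, n_rows)
softmax_multi_block_forward_kernel[(num_programs,)](
    y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE=BLOCK_SIZE
)
```

The kernel body then needs a row loop strided by `tl.num_programs(0)` to cover all `n_rows`; a fuller sketch of that loop appears further down in this thread.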
```python
        BLOCK_SIZE=BLOCK_SIZE,
    )
else:
    softmax_multi_block_backward_kernel[(n_rows,)](
```
```python
num_programs = min(num_cores, n_rows)

if n_cols <= BLOCK_SIZE:
    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
```
I don't think we need two separate kernels. Also, launching one program per row is not optimal: the UB (unified buffer) clearly isn't being fully utilized.
I believe we should use a single unified kernel with two nested loops inside: the outer loop iterates over rows and the inner loop iterates over columns. If one program can process multiple rows, it can make better use of the UB (see the sketch below).
However, this might introduce some additional overhead, so we still need to benchmark whether it actually improves performance.
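To make the shape concrete, here is a minimal Triton sketch of such a unified kernel, assuming fp32 inputs and the same parameter names as the launches quoted above. The kernel name and the three-pass (max / sum / normalize) structure are illustrative, not the PR's actual code:

```python
import triton
import triton.language as tl


@triton.jit
def softmax_unified_kernel(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_rows, n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Outer loop: grid-stride over rows. Launched with a grid of
    # (min(num_cores, n_rows),), each program keeps its core busy across
    # many rows instead of there being one program per row.
    for row in range(tl.program_id(0), n_rows, tl.num_programs(0)):
        row_in = in_ptr + row * in_row_stride
        row_out = out_ptr + row * out_row_stride

        # Inner loop, pass 1: running max over column tiles.
        row_max = -float("inf")
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            x = tl.load(row_in + cols, mask=cols < n_cols, other=-float("inf"))
            row_max = tl.maximum(row_max, tl.max(x, axis=0))

        # Inner loop, pass 2: sum of exp(x - max) over column tiles.
        # Padding with -inf contributes exp(-inf) = 0 to the sum.
        row_sum = 0.0
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            x = tl.load(row_in + cols, mask=cols < n_cols, other=-float("inf"))
            row_sum += tl.sum(tl.exp(x - row_max), axis=0)

        # Inner loop, pass 3: normalize and store.
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            mask = cols < n_cols
            x = tl.load(row_in + cols, mask=mask, other=-float("inf"))
            tl.store(row_out + cols, tl.exp(x - row_max) / row_sum, mask=mask)
```

A single-pass online-softmax variant would cut the per-tile loads from three to two at the cost of extra bookkeeping; either way, as noted above, only a benchmark can tell whether the loop overhead pays off on the NPU.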
Agreed; a multi-block kernel plus a grid-stride loop over rows by `num_programs` should work.
Thanks for your suggestions.
Summary
This PR adds a Softmax implementation for NPU. It includes a single-block forward kernel for small column sizes, as well as a multi-block kernel for large column sizes to avoid NPU UB (unified buffer) overflow.
Testing Done
Test done with

```
python -m pytest test/transformers/test_softmax.py
```

Hardware Type: Atlas 800I A2 (32G)

- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence