
[NPU] Add softmax implementation #1087

Open
lowdy1 wants to merge 2 commits into linkedin:main from lowdy1:softmax_npu


Contributor

lowdy1 commented Feb 9, 2026

Summary

This PR adds a Softmax implementation for NPU. It includes a single-block forward kernel for small column counts and a multi-block kernel for large column counts, which avoids overflowing the NPU unified buffer (UB).

Testing Done

Tested with python -m pytest test/transformers/test_softmax.py
Hardware Type: Atlas 800I A2 (32G)

  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

Contributor

noemotiovon left a comment


Some thoughts of my own; I hope they're helpful.

    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
    multi_block_launch = False
else:
    softmax_multi_block_forward_kernel[(n_rows,)](
Contributor


In Triton kernel implementations on NPU, it’s best for the first dimension of the grid to be num_cores.
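As a rough illustration of the grid-sizing advice above, here is a device-free Python sketch. It does not call Triton or any NPU API. The helper names `launch_grid` and `rows_for_program` are hypothetical, and the core count of 4 is a placeholder that real code would query from the device. The idea is that the grid's first dimension is capped at the core count and each program then takes several rows, rather than launching one program per row.

```python
def launch_grid(n_rows, num_cores):
    """Return a 1-D grid whose first dimension is capped at the core count.

    Hypothetical helper: mirrors `num_programs = min(num_cores, n_rows)`.
    """
    return (min(num_cores, n_rows),)


def rows_for_program(pid, num_programs, n_rows):
    """Rows handled by program `pid` when rows are dealt out round-robin."""
    return list(range(pid, n_rows, num_programs))


if __name__ == "__main__":
    num_cores = 4  # placeholder value; an assumption for this sketch
    n_rows = 10
    (num_programs,) = launch_grid(n_rows, num_cores)
    for pid in range(num_programs):
        print(pid, rows_for_program(pid, num_programs, n_rows))
```

With a grid of `(n_rows,)`, by contrast, the number of programs grows with the input instead of matching the hardware.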

        BLOCK_SIZE=BLOCK_SIZE,
    )
else:
    softmax_multi_block_backward_kernel[(n_rows,)](
Contributor


Ditto: the first dimension of this grid should also be num_cores rather than n_rows.

num_programs = min(num_cores, n_rows)

if n_cols <= BLOCK_SIZE:
    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
Contributor


I don’t think we need two separate kernels. Also, the current approach of launching the kernel row by row is not optimal. It’s clear that the UB (unified buffer) isn’t being fully utilized.

I believe we should use a single unified kernel, with two nested loops inside: the outer loop iterates over rows, and the inner loop iterates over columns. If one kernel can process multiple rows, it can make better use of the UB.

However, this might introduce some additional overhead, so we still need to benchmark and see whether it actually improves performance.
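The two-loop structure described above can be sketched in plain Python as an emulation of a single program. This is not the PR's kernel and uses no Triton or NPU calls. The names `softmax_rows_blocked` and `softmax` are hypothetical, and the running-max / rescaled-sum ("online") formulation used for the column pass is an assumption chosen for this sketch, not necessarily what the PR does.

```python
import math


def softmax_rows_blocked(x, pid, num_programs, block_size, out):
    """Emulate one program of a unified softmax kernel on a list of rows."""
    n_rows = len(x)
    for row in range(pid, n_rows, num_programs):  # outer loop: rows
        n_cols = len(x[row])
        # Pass 1: running max and rescaled running sum, one column block at a time.
        row_max, row_sum = -math.inf, 0.0
        for start in range(0, n_cols, block_size):  # inner loop: column blocks
            block = x[row][start:start + block_size]
            new_max = max(row_max, max(block))
            # Rescale the sum accumulated so far to the new maximum.
            row_sum = row_sum * math.exp(row_max - new_max) + sum(
                math.exp(v - new_max) for v in block
            )
            row_max = new_max
        # Pass 2: normalize and write the output, again block by block.
        for start in range(0, n_cols, block_size):
            for col in range(start, min(start + block_size, n_cols)):
                out[row][col] = math.exp(x[row][col] - row_max) / row_sum


def softmax(x, num_programs=2, block_size=3):
    """Run every emulated program in turn; on a device they would run in parallel."""
    out = [[0.0] * len(r) for r in x]
    for pid in range(num_programs):
        softmax_rows_blocked(x, pid, num_programs, block_size, out)
    return out
```

Because only one column block is held at a time, the per-program working set is bounded by the block size regardless of n_cols. Whether the extra loop overhead pays off is still a question for benchmarking.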

Collaborator


Agreed. A multi-block kernel plus a grid-stride loop over rows, stepping by num_programs, should work.

Contributor Author


Thanks for your suggestions.


3 participants