Conversation
noemotiovon
left a comment
Some of my own thoughts — hope they’re helpful.
```python
    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
    multi_block_launch = False
else:
    softmax_multi_block_forward_kernel[(n_rows,)](
```
In Triton kernel implementations on NPU, it’s best for the first dimension of the grid to be num_cores.
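For illustration, a launch along those lines might look like the sketch below. It reuses `num_cores`, `num_programs`, and the argument list already visible in this diff, but the multi-block kernel's exact parameter list is an assumption on my part:

```python
# Sketch only: make the grid's first dimension the (capped) core count rather
# than n_rows, so each program stays resident on one core and grid-strides
# over rows. `num_cores` is assumed to be queried from the NPU elsewhere,
# as it already is in this diff.
num_programs = min(num_cores, n_rows)
softmax_multi_block_forward_kernel[(num_programs,)](
    y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE=BLOCK_SIZE
)
```

The kernel body then needs a row loop strided by `tl.num_programs(0)` to cover all `n_rows`; a fuller sketch of that loop appears further down in this thread.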
```python
        BLOCK_SIZE=BLOCK_SIZE,
    )
else:
    softmax_multi_block_backward_kernel[(n_rows,)](
```
```python
num_programs = min(num_cores, n_rows)

if n_cols <= BLOCK_SIZE:
    softmax_forward_kernel[(num_programs,)](y2d, x2d, x2d.stride(0), y2d.stride(0), n_rows, n_cols, BLOCK_SIZE)
```
I don't think we need two separate kernels. Also, launching one program per row is not optimal: the UB (unified buffer) clearly isn't being fully utilized.
I believe we should use a single unified kernel with two nested loops inside: the outer loop iterates over rows and the inner loop iterates over columns. If one program can process multiple rows, it can make better use of the UB (see the sketch below).
However, this might introduce some additional overhead, so we still need to benchmark whether it actually improves performance.
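To make the shape concrete, here is a minimal Triton sketch of such a unified kernel, assuming fp32 inputs and the same parameter names as the launches quoted above. The kernel name and the three-pass (max / sum / normalize) structure are illustrative, not the PR's actual code:

```python
import triton
import triton.language as tl


@triton.jit
def softmax_unified_kernel(
    out_ptr, in_ptr,
    in_row_stride, out_row_stride,
    n_rows, n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    # Outer loop: grid-stride over rows. Launched with a grid of
    # (min(num_cores, n_rows),), each program keeps its core busy across
    # many rows instead of there being one program per row.
    for row in range(tl.program_id(0), n_rows, tl.num_programs(0)):
        row_in = in_ptr + row * in_row_stride
        row_out = out_ptr + row * out_row_stride

        # Inner loop, pass 1: running max over column tiles.
        row_max = -float("inf")
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            x = tl.load(row_in + cols, mask=cols < n_cols, other=-float("inf"))
            row_max = tl.maximum(row_max, tl.max(x, axis=0))

        # Inner loop, pass 2: sum of exp(x - max) over column tiles.
        # Padding with -inf contributes exp(-inf) = 0 to the sum.
        row_sum = 0.0
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            x = tl.load(row_in + cols, mask=cols < n_cols, other=-float("inf"))
            row_sum += tl.sum(tl.exp(x - row_max), axis=0)

        # Inner loop, pass 3: normalize and store.
        for col_off in range(0, n_cols, BLOCK_SIZE):
            cols = col_off + tl.arange(0, BLOCK_SIZE)
            mask = cols < n_cols
            x = tl.load(row_in + cols, mask=mask, other=-float("inf"))
            tl.store(row_out + cols, tl.exp(x - row_max) / row_sum, mask=mask)
```

A single-pass online-softmax variant would cut the per-tile loads from three to two at the cost of extra bookkeeping; either way, as noted above, only a benchmark can tell whether the loop overhead pays off on the NPU.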
Agreed; a multi-block kernel plus a grid-stride loop over rows by `num_programs` should work.
Thanks for your suggestions.
Summary
This PR adds a Softmax implementation for NPU. It includes a single-block forward kernel for small column sizes, as well as a multi-block kernel for large column sizes to avoid NPU UB (unified buffer) overflow.
Testing Done
Test done with

```
python -m pytest test/transformers/test_softmax.py
```

Hardware Type: Atlas 800I A2 (32G)

- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence