[WIP] Support fused neighborhood attention for NPU #1034
Hailey-Zh wants to merge 3 commits into linkedin:main
Conversation
Thank you! Could you also attach the benchmark results and keep comments in English?
This is currently a draft and there are still a few outstanding issues to resolve. I will make sure to include the benchmark results and switch all comments to English in the final version.
Since we're currently focused on the Ascend CI and this kernel is still not working, I was wondering whether you have the bandwidth to keep working on it. If you'd like, we could also help move it forward.
We'll keep working on it.
Summary
This PR introduces support for Fused Neighborhood Attention (FNA) optimized specifically for NPU architectures. The implementation focuses on memory efficiency and hardware affinity to prevent performance bottlenecks. Key modifications include:
Grid Dimension Refactoring: Adjusted the attention grid to a 2D structure. This change optimizes thread-block mapping and prevents User Buffer (UB) overflow, ensuring each block's working set fits within the NPU's local memory constraints (see the grid-shape sketch after this list).
NPU-Affinity Softmax: Refactored the softmax tiling and grid dimensions to align with the NPU's compute-unit sizes, maximizing throughput and reducing synchronization overhead (see the tiled-softmax sketch after this list).
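For illustration, here is a minimal Python sketch of how a 2D launch shape with a UB-budget check might be derived. Every name here (fna_grid_2d, UB_BYTES_PER_CORE, the tile sizes) is a hypothetical placeholder, not the PR's actual code.

```python
import math

# Assumed per-core User Buffer capacity; real Ascend parts differ,
# this constant is purely illustrative.
UB_BYTES_PER_CORE = 192 * 1024

def fna_grid_2d(batch, heads, seq_len, head_dim, tile_q=64, dtype_bytes=2):
    """Derive a (grid_x, grid_y) launch shape for a fused neighborhood
    attention kernel.

    grid_x walks query tiles along the sequence; grid_y collapses the
    (batch, head) pair into a single axis, keeping the grid 2D
    instead of 3D.
    """
    # Rough per-block UB footprint: Q tile + K tile + V tile + accumulator,
    # each tile_q x head_dim elements (a deliberately coarse estimate).
    per_block_bytes = 4 * tile_q * head_dim * dtype_bytes
    assert per_block_bytes <= UB_BYTES_PER_CORE, (
        f"tile_q={tile_q} would overflow the UB; shrink the query tile"
    )
    grid_x = math.ceil(seq_len / tile_q)  # query tiles along the sequence
    grid_y = batch * heads                # fused batch/head axis
    return grid_x, grid_y

print(fna_grid_2d(batch=2, heads=8, seq_len=4096, head_dim=64))  # -> (64, 16)
```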
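Likewise, a sketch of the softmax-tiling idea: process each row in chunks whose length is rounded up to a multiple of an assumed vector width, using the standard online (running max/sum) rescaling so only one chunk is resident at a time. VECTOR_LANES and tiled_softmax are assumptions for illustration; the actual kernel runs on the NPU, not NumPy.

```python
import numpy as np

VECTOR_LANES = 64  # assumed vector width of one NPU compute unit

def tiled_softmax(row: np.ndarray, tile: int = 256) -> np.ndarray:
    # Round the tile up to the vector width so every chunk maps cleanly
    # onto whole compute units.
    tile = ((tile + VECTOR_LANES - 1) // VECTOR_LANES) * VECTOR_LANES
    m, s = -np.inf, 0.0  # running max and rescaled exp-sum
    for start in range(0, row.size, tile):
        chunk = row[start:start + tile]
        m_new = max(m, float(chunk.max()))
        # Rescale the previous sum to the new max before adding this chunk.
        s = s * np.exp(m - m_new) + float(np.exp(chunk - m_new).sum())
        m = m_new
    return np.exp(row - m) / s  # second pass normalizes with global stats

x = np.random.randn(1000).astype(np.float32)
ref = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(tiled_softmax(x), ref, atol=1e-5)
```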
Details
Testing Done
make test to ensure correctness
make checkstyle to ensure code style
make test-convergence to ensure convergence