
Conversation

GuoningHuang commented Sep 24, 2025

  1. Implement a FlashAttention kernel in MLIR using affine loops + vector ops (a minimal structural sketch follows below).
  2. Replace the existing Attention → TOSA lowering path with the Attention → FlashAttention vectorized kernel.
  3. Targeting the single op, performance was compared against the original TOSA lowering; the new kernel achieves ~3× speedup on CPU.
  4. Achieves ~1.17× speedup over the whole decode phase, consistent with the performance measured in issue [TASK] Locate Performance Bottlenecks in DeepSeek R1 Decode Subgraph #604.
    [screenshot: decode-phase speedup]
    Baseline: [screenshot]
    Applied to the decode phase: [screenshot]

Applied to both the prefill and decode phases:
[screenshot]
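A minimal sketch of what an affine-loop + vector-op FlashAttention score pass can look like. This is not the PR's kernel: the function name, shapes (128 keys, head dimension 64), and 16-lane vector width are illustrative assumptions, and the accumulation of the output against V is elided; only the online-softmax recurrence over the key dimension is shown.

```mlir
// Illustrative sketch only (assumed shapes and names), not the PR's kernel.
func.func @flash_attn_scores(%q: memref<64xf32>, %k: memref<128x64xf32>) -> (f32, f32) {
  %zero = arith.constant 0.0 : f32
  %neg_inf = arith.constant 0xFF800000 : f32   // -inf: initial running max
  // One pass over the keys carries a running maximum %m and running sum %l,
  // i.e. the online-softmax recurrence FlashAttention relies on.
  %res:2 = affine.for %j = 0 to 128 iter_args(%m = %neg_inf, %l = %zero) -> (f32, f32) {
    // Dot product q . k_j, vectorized 16 lanes at a time.
    %dot = affine.for %d = 0 to 64 step 16 iter_args(%acc = %zero) -> (f32) {
      %qv = vector.load %q[%d] : memref<64xf32>, vector<16xf32>
      %kv = vector.load %k[%j, %d] : memref<128x64xf32>, vector<16xf32>
      %mul = arith.mulf %qv, %kv : vector<16xf32>
      %red = vector.reduction <add>, %mul : vector<16xf32> into f32
      %next = arith.addf %acc, %red : f32
      affine.yield %next : f32
    }
    // Online-softmax update: rescale the old sum by exp(m - m_new),
    // then add exp(score - m_new).
    %m_new = arith.maximumf %m, %dot : f32
    %diff_old = arith.subf %m, %m_new : f32
    %corr = math.exp %diff_old : f32
    %diff_new = arith.subf %dot, %m_new : f32
    %p = math.exp %diff_new : f32
    %l_scaled = arith.mulf %l, %corr : f32
    %l_new = arith.addf %l_scaled, %p : f32
    affine.yield %m_new, %l_new : f32, f32
  }
  return %res#0, %res#1 : f32, f32
}
```

The single pass keeps only the running maximum and running sum, so the full score row never has to be materialized, which is what removes the intermediate tensors produced by the TOSA lowering path.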

zhanghb97 (Member) commented:
  • Please format your code and install the pre-commit tool.
  • No obvious performance improvement observed.
  • Please use vector.load/store instead of vector.transfer_read/write, as the latter sometimes cannot guarantee generation of optimal hardware code.
  • Can the splat inside the loop be moved outside? You can verify whether it affects performance.

Before: [screenshot]

After: [screenshot]
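For reference, the last two suggestions applied to a toy loop (hypothetical names and sizes, not code from this PR) look roughly like this:

```mlir
// Before: the broadcast of %scale is recreated on every iteration and the
// accesses go through vector.transfer_read/write.
func.func @scale_before(%buf: memref<1024xf32>, %scale: f32) {
  %pad = arith.constant 0.0 : f32
  affine.for %i = 0 to 1024 step 16 {
    %scale_vec = vector.splat %scale : vector<16xf32>   // splat inside the loop
    %row = vector.transfer_read %buf[%i], %pad : memref<1024xf32>, vector<16xf32>
    %scaled = arith.mulf %row, %scale_vec : vector<16xf32>
    vector.transfer_write %scaled, %buf[%i] : vector<16xf32>, memref<1024xf32>
  }
  return
}

// After: the splat is hoisted out of the loop and plain vector.load/store
// are used, following the review suggestions.
func.func @scale_after(%buf: memref<1024xf32>, %scale: f32) {
  %scale_vec = vector.splat %scale : vector<16xf32>     // computed once
  affine.for %i = 0 to 1024 step 16 {
    %row = vector.load %buf[%i] : memref<1024xf32>, vector<16xf32>
    %scaled = arith.mulf %row, %scale_vec : vector<16xf32>
    vector.store %scaled, %buf[%i] : memref<1024xf32>, vector<16xf32>
  }
  return
}
```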

zhanghb97 added the "enhancement" (New feature or request) and "format issue" labels on Nov 14, 2025
GuoningHuang (Author) commented Nov 14, 2025


Hi, thanks for the feedback.
I've addressed the issues:

  • The code has been reformatted and the pre-commit tool is now installed and working properly.
  • I have replaced the vector.transfer_read/write operations with vector.load/store as suggested; you can see this in next-flash-attention-vec.mlir.
  • As for the splat inside the loop, I have moved some of the splats outside the loop, and this brings measurable performance improvements on the single operator, as shown below:
    [screenshot: single-op timing]
  • After these optimizations, I still get the same result as before: a performance improvement in the decode phase compared with the unoptimized version on my machine.
    Before: [screenshot]
    After: [screenshot]
  • I guess the performance improvement may vary across machines, since memory bandwidth and per-core compute capability differ between systems, which can lead to different performance bottlenecks. Below is my machine:
    [screenshot: machine specs]

GuoningHuang (Author) commented Nov 15, 2025

@zhanghb97
I implemented a new FlashAttention version featuring tiled KV processing and vectorization.
The implementation is available in next-flash-attention-vec-tiled.mlir.
Based on theoretical analysis and my previous experiments, I think this version may be close to the performance ceiling on CPU.
It achieves better cache utilization than the previous one and delivers small but measurable performance improvements, as shown below (a structural sketch of the tiling follows the screenshots):
Before: [screenshot]

After: [screenshot]
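A structural sketch of the KV-tiled variant, with assumed sizes (128 keys, tile size 32, head dimension 64) and names rather than the actual contents of next-flash-attention-vec-tiled.mlir; the V accumulation is again elided to keep it short:

```mlir
// Illustrative sketch only (assumed shapes and names), not the tiled kernel itself.
func.func @flash_attn_scores_tiled(%q: memref<64xf32>, %k: memref<128x64xf32>) -> (f32, f32) {
  %zero = arith.constant 0.0 : f32
  %neg_inf = arith.constant 0xFF800000 : f32   // -inf: initial running max
  // Outer loop walks the KV sequence tile by tile; the online-softmax state
  // (running max, running sum) is carried across tiles.
  %res:2 = affine.for %tile = 0 to 128 step 32
      iter_args(%m0 = %neg_inf, %l0 = %zero) -> (f32, f32) {
    // Inner loop reuses the current K tile while it is cache-resident.
    %t:2 = affine.for %jj = 0 to 32 iter_args(%m = %m0, %l = %l0) -> (f32, f32) {
      %j = arith.addi %tile, %jj : index
      // q . k_j, 16 lanes at a time.
      %dot = affine.for %d = 0 to 64 step 16 iter_args(%acc = %zero) -> (f32) {
        %qv = vector.load %q[%d] : memref<64xf32>, vector<16xf32>
        %kv = vector.load %k[%j, %d] : memref<128x64xf32>, vector<16xf32>
        %mul = arith.mulf %qv, %kv : vector<16xf32>
        %red = vector.reduction <add>, %mul : vector<16xf32> into f32
        %next = arith.addf %acc, %red : f32
        affine.yield %next : f32
      }
      // Same online-softmax update as the untiled version.
      %m_new = arith.maximumf %m, %dot : f32
      %diff_old = arith.subf %m, %m_new : f32
      %corr = math.exp %diff_old : f32
      %diff_new = arith.subf %dot, %m_new : f32
      %p = math.exp %diff_new : f32
      %l_scaled = arith.mulf %l, %corr : f32
      %l_new = arith.addf %l_scaled, %p : f32
      affine.yield %m_new, %l_new : f32, f32
    }
    affine.yield %t#0, %t#1 : f32, f32
  }
  return %res#0, %res#1 : f32, f32
}
```

The outer loop touches each K tile once, so the tile stays cache-resident for its inner iterations, while the running max/sum carried across tiles keep the result identical to the untiled kernel.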

GuoningHuang (Author) commented Nov 16, 2025

Here is a test result with a longer prompt:
Before: [screenshot]
After, KV tile size = 32: [screenshot]
After, KV tile size = 64: [screenshot]

zhanghb97 (Member) commented Nov 17, 2025

@GuoningHuang It looks pretty good. Please resolve the conflicts, then I'll pull it down to test the performance.

