
Conversation

GuoningHuang commented Sep 24, 2025

  1. Implement a FlashAttention kernel in MLIR using affine loops + vector ops (a minimal structural sketch follows below).
  2. Replace the existing Attention → TOSA lowering path with the Attention → FlashAttention vectorized kernel.
  3. Targeting the single op, performance was compared against the original TOSA lowering; the new kernel achieves ~3× speedup on CPU.
  4. Achieves ~1.17× speedup over the whole decode phase, consistent with the performance measured in issue [TASK] Locate Performance Bottlenecks in DeepSeek R1 Decode Subgraph #604.
    [screenshot: decode-phase speedup]
    Baseline: [screenshot]
    Applied to the decode phase: [screenshot]

Applied to both the prefill and decode phases:
[screenshot]
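A minimal sketch of what an affine-loop + vector-op FlashAttention score pass can look like. This is not the PR's kernel: the function name, shapes (128 keys, head dimension 64), and 16-lane vector width are illustrative assumptions, and the accumulation of the output against V is elided; only the online-softmax recurrence over the key dimension is shown.

```mlir
// Illustrative sketch only (assumed shapes and names), not the PR's kernel.
func.func @flash_attn_scores(%q: memref<64xf32>, %k: memref<128x64xf32>) -> (f32, f32) {
  %zero = arith.constant 0.0 : f32
  %neg_inf = arith.constant 0xFF800000 : f32   // -inf: initial running max
  // One pass over the keys carries a running maximum %m and running sum %l,
  // i.e. the online-softmax recurrence FlashAttention relies on.
  %res:2 = affine.for %j = 0 to 128 iter_args(%m = %neg_inf, %l = %zero) -> (f32, f32) {
    // Dot product q . k_j, vectorized 16 lanes at a time.
    %dot = affine.for %d = 0 to 64 step 16 iter_args(%acc = %zero) -> (f32) {
      %qv = vector.load %q[%d] : memref<64xf32>, vector<16xf32>
      %kv = vector.load %k[%j, %d] : memref<128x64xf32>, vector<16xf32>
      %mul = arith.mulf %qv, %kv : vector<16xf32>
      %red = vector.reduction <add>, %mul : vector<16xf32> into f32
      %next = arith.addf %acc, %red : f32
      affine.yield %next : f32
    }
    // Online-softmax update: rescale the old sum by exp(m - m_new),
    // then add exp(score - m_new).
    %m_new = arith.maximumf %m, %dot : f32
    %diff_old = arith.subf %m, %m_new : f32
    %corr = math.exp %diff_old : f32
    %diff_new = arith.subf %dot, %m_new : f32
    %p = math.exp %diff_new : f32
    %l_scaled = arith.mulf %l, %corr : f32
    %l_new = arith.addf %l_scaled, %p : f32
    affine.yield %m_new, %l_new : f32, f32
  }
  return %res#0, %res#1 : f32, f32
}
```

The single pass keeps only the running maximum and running sum, so the full score row never has to be materialized, which is what removes the intermediate tensors produced by the TOSA lowering path.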

zhanghb97 (Member) commented:
  • Please format your code and install the pre-commit tool.
  • No obvious performance improvement observed.
  • Please use vector.load/store instead of vector.transfer_read/write, as the latter sometimes cannot guarantee generation of optimal hardware code.
  • Can the splat inside the loop be moved outside? You can verify whether it affects performance.

Before: [screenshot]

After: [screenshot]
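For reference, the last two suggestions applied to a toy loop (hypothetical names and sizes, not code from this PR) look roughly like this:

```mlir
// Before: the broadcast of %scale is recreated on every iteration and the
// accesses go through vector.transfer_read/write.
func.func @scale_before(%buf: memref<1024xf32>, %scale: f32) {
  %pad = arith.constant 0.0 : f32
  affine.for %i = 0 to 1024 step 16 {
    %scale_vec = vector.splat %scale : vector<16xf32>   // splat inside the loop
    %row = vector.transfer_read %buf[%i], %pad : memref<1024xf32>, vector<16xf32>
    %scaled = arith.mulf %row, %scale_vec : vector<16xf32>
    vector.transfer_write %scaled, %buf[%i] : vector<16xf32>, memref<1024xf32>
  }
  return
}

// After: the splat is hoisted out of the loop and plain vector.load/store
// are used, following the review suggestions.
func.func @scale_after(%buf: memref<1024xf32>, %scale: f32) {
  %scale_vec = vector.splat %scale : vector<16xf32>     // computed once
  affine.for %i = 0 to 1024 step 16 {
    %row = vector.load %buf[%i] : memref<1024xf32>, vector<16xf32>
    %scaled = arith.mulf %row, %scale_vec : vector<16xf32>
    vector.store %scaled, %buf[%i] : memref<1024xf32>, vector<16xf32>
  }
  return
}
```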

zhanghb97 added the "enhancement" (New feature or request) and "format issue" labels on Nov 14, 2025
GuoningHuang (Author) commented Nov 14, 2025


Hi, thanks for the feedback.
I've addressed the issues:

  • The code has been reformatted and the pre-commit tool is now installed and working properly.
  • I have replaced the vector.transfer_read/write operations with vector.load/store as suggested; you can see this in next-flash-attention-vec.mlir.
  • As for the splat inside the loop, I have moved some of the splats outside the loop, and this brings measurable performance improvements on the single operator, as shown below:
    [screenshot: single-op timing]
  • After these optimizations, I still get the same result as before: a performance improvement in the decode phase compared with the unoptimized version on my machine.
    Before: [screenshot]
    After: [screenshot]
  • I guess the performance improvement may vary across machines, since memory bandwidth and per-core compute capability differ between systems, which can lead to different performance bottlenecks. Below is my machine:
    [screenshot: machine specs]

GuoningHuang (Author) commented Nov 15, 2025

@zhanghb97
I implemented a new FlashAttention version featuring tiled KV processing and vectorization.
The implementation is available in next-flash-attention-vec-tiled.mlir.
Based on theoretical analysis and my previous experiments, I think this version may be close to the performance ceiling on CPU.
It achieves better cache utilization than the previous one and delivers small but measurable performance improvements, as shown below (a structural sketch of the tiling follows the screenshots):
Before: [screenshot]

After: [screenshot]
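A structural sketch of the KV-tiled variant, with assumed sizes (128 keys, tile size 32, head dimension 64) and names rather than the actual contents of next-flash-attention-vec-tiled.mlir; the V accumulation is again elided to keep it short:

```mlir
// Illustrative sketch only (assumed shapes and names), not the tiled kernel itself.
func.func @flash_attn_scores_tiled(%q: memref<64xf32>, %k: memref<128x64xf32>) -> (f32, f32) {
  %zero = arith.constant 0.0 : f32
  %neg_inf = arith.constant 0xFF800000 : f32   // -inf: initial running max
  // Outer loop walks the KV sequence tile by tile; the online-softmax state
  // (running max, running sum) is carried across tiles.
  %res:2 = affine.for %tile = 0 to 128 step 32
      iter_args(%m0 = %neg_inf, %l0 = %zero) -> (f32, f32) {
    // Inner loop reuses the current K tile while it is cache-resident.
    %t:2 = affine.for %jj = 0 to 32 iter_args(%m = %m0, %l = %l0) -> (f32, f32) {
      %j = arith.addi %tile, %jj : index
      // q . k_j, 16 lanes at a time.
      %dot = affine.for %d = 0 to 64 step 16 iter_args(%acc = %zero) -> (f32) {
        %qv = vector.load %q[%d] : memref<64xf32>, vector<16xf32>
        %kv = vector.load %k[%j, %d] : memref<128x64xf32>, vector<16xf32>
        %mul = arith.mulf %qv, %kv : vector<16xf32>
        %red = vector.reduction <add>, %mul : vector<16xf32> into f32
        %next = arith.addf %acc, %red : f32
        affine.yield %next : f32
      }
      // Same online-softmax update as the untiled version.
      %m_new = arith.maximumf %m, %dot : f32
      %diff_old = arith.subf %m, %m_new : f32
      %corr = math.exp %diff_old : f32
      %diff_new = arith.subf %dot, %m_new : f32
      %p = math.exp %diff_new : f32
      %l_scaled = arith.mulf %l, %corr : f32
      %l_new = arith.addf %l_scaled, %p : f32
      affine.yield %m_new, %l_new : f32, f32
    }
    affine.yield %t#0, %t#1 : f32, f32
  }
  return %res#0, %res#1 : f32, f32
}
```

The outer loop touches each K tile once, so the tile stays cache-resident for its inner iterations, while the running max/sum carried across tiles keep the result identical to the untiled kernel.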

GuoningHuang (Author) commented Nov 16, 2025

Here is a test result with a longer prompt:
Before: [screenshot]
After, KV tile size = 32: [screenshot]
After, KV tile size = 64: [screenshot]

zhanghb97 (Member) commented Nov 17, 2025

@GuoningHuang It looks pretty good. Please resolve the conflicts, then I'll pull it down to test the performance.

