perf: Use 2WG pipeline design for MLA implementation on Hopper #952

Open · wants to merge 16 commits into `main`
Conversation

@yzh119 (Collaborator) commented Mar 17, 2025

This PR implements #892.

Per our benchmarks, the 2WG pipeline (FlashMLA's design) is faster than our current 3WG pipeline on Hopper. While it is still under investigation where the gap comes from, we should implement the 2WG (and, in the future, 4WG) pipeline in FlashInfer to make sure our implementation does not lag behind FlashMLA.
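
For readers unfamiliar with the two designs, below is a minimal structural sketch of the 2WG shape. This is not the kernel in this PR: it assumes the common Hopper layout in which "3WG" means one dedicated TMA-producer warpgroup plus two math warpgroups, while "2WG" means both warpgroups do math and the loads are software-pipelined through a double buffer. All identifiers (`TILE_KV`, `HEAD_DIM`, `mla_2wg_sketch`, ...) are hypothetical, and plain loads stand in for TMA/`cp.async`.

```cuda
#include <cuda_runtime.h>

constexpr int TILE_KV = 16;    // KV rows staged per pipeline step (assumed;
                               // kept small so static smem stays under 48 KB)
constexpr int HEAD_DIM = 128;  // per-warpgroup slice of the head dim (assumed)
constexpr int WG_SIZE = 128;   // one Hopper warpgroup = 4 warps = 128 threads

// Launch with one block of 2 * WG_SIZE = 256 threads per tile of queries.
__global__ void mla_2wg_sketch(const float* __restrict__ kv,  // [num_tiles * TILE_KV, 2 * HEAD_DIM]
                               float* __restrict__ out,       // [2 * WG_SIZE]
                               int num_tiles) {
  // Double-buffered staging area: smem[stage][row][col].
  __shared__ float smem[2][TILE_KV][2 * HEAD_DIM];
  const int wg = threadIdx.x / WG_SIZE;    // warpgroup id: 0 or 1
  const int lane = threadIdx.x % WG_SIZE;  // thread id within the warpgroup

  // Prologue: all 256 threads cooperatively fill stage 0 -- unlike the 3WG
  // design, there is no dedicated producer warpgroup.
  for (int i = threadIdx.x; i < TILE_KV * 2 * HEAD_DIM; i += blockDim.x)
    smem[0][i / (2 * HEAD_DIM)][i % (2 * HEAD_DIM)] = kv[i];
  __syncthreads();

  float acc = 0.f;  // stand-in for the attention accumulator registers
  for (int t = 0; t < num_tiles; ++t) {
    const int cur = t & 1, nxt = cur ^ 1;
    // Issue the copy of tile t+1 into the spare stage while tile t is live.
    if (t + 1 < num_tiles) {
      const float* src = kv + (size_t)(t + 1) * TILE_KV * (2 * HEAD_DIM);
      for (int i = threadIdx.x; i < TILE_KV * 2 * HEAD_DIM; i += blockDim.x)
        smem[nxt][i / (2 * HEAD_DIM)][i % (2 * HEAD_DIM)] = src[i];
    }
    // Both warpgroups compute; each owns one half of the head dimension.
    // A row-sum stands in for the real QK^T / softmax / PV work.
    for (int r = 0; r < TILE_KV; ++r)
      acc += smem[cur][r][wg * HEAD_DIM + lane];
    __syncthreads();  // stage t fully read, stage t+1 fully written: swap
  }
  out[threadIdx.x] = acc;
}
```

With TMA or `cp.async` driving the fills, the copy of the `nxt` stage genuinely overlaps the math on `cur`; the plain loads above only illustrate the buffer choreography.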

Performance

Before this PR:

| batch_size | seq_len | num_heads | Memory bandwidth (GB/s) | FLOPs (TFLOPs) |
|---:|---:|---:|---:|---:|
| 64 | 1024 | 64 | 1547.23 | 167.29 |
| 64 | 1024 | 128 | 1483.82 | 290.23 |
| 128 | 1024 | 64 | 2238.72 | 242.06 |
| 128 | 1024 | 128 | 1612.66 | 315.43 |
| 768 | 1024 | 64 | 2821.32 | 305.05 |
| 768 | 1024 | 128 | 1767.63 | 345.74 |
| 64 | 2048 | 64 | 1960.50 | 223.79 |
| 64 | 2048 | 128 | 1533.88 | 331.70 |
| 128 | 2048 | 64 | 2546.83 | 290.72 |
| 128 | 2048 | 128 | 1629.73 | 352.43 |
| 768 | 2048 | 64 | 2820.22 | 321.93 |
| 768 | 2048 | 128 | 1657.89 | 358.52 |
| 64 | 8192 | 64 | 2682.98 | 319.63 |
| 64 | 8192 | 128 | 1600.79 | 375.94 |
| 128 | 8192 | 64 | 2803.48 | 333.98 |
| 128 | 8192 | 128 | 1584.79 | 372.18 |
| 768 | 8192 | 64 | 2768.36 | 329.80 |
| 768 | 8192 | 128 | 1565.82 | 367.73 |

After this PR:

| batch_size | seq_len | num_heads | Memory bandwidth (GB/s) | FLOPs (TFLOPs) |
|---:|---:|---:|---:|---:|
| 64 | 1024 | 64 | 1520.70 | 164.42 |
| 64 | 1024 | 128 | 1807.33 | 353.51 |
| 128 | 1024 | 64 | 2327.25 | 251.63 |
| 128 | 1024 | 128 | 2024.00 | 395.88 |
| 768 | 1024 | 64 | 2897.75 | 313.32 |
| 768 | 1024 | 128 | 2256.69 | 441.40 |
| 64 | 2048 | 64 | 1963.77 | 224.17 |
| 64 | 2048 | 128 | 2011.81 | 435.05 |
| 128 | 2048 | 64 | 2638.88 | 301.23 |
| 128 | 2048 | 128 | 2168.25 | 468.88 |
| 768 | 2048 | 64 | 3008.55 | 343.43 |
| 768 | 2048 | 128 | 2175.46 | 470.44 |
| 64 | 8192 | 64 | 2724.09 | 324.52 |
| 64 | 8192 | 128 | 2153.42 | 505.72 |
| 128 | 8192 | 64 | 3015.30 | 359.22 |
| 128 | 8192 | 128 | 2120.13 | 497.91 |
| 768 | 8192 | 64 | 3100.86 | 369.41 |
| 768 | 8192 | 128 | 2096.94 | 492.46 |

In short, the num_heads=128 configurations improve by roughly 22% to 35% in both bandwidth and FLOPs, while the num_heads=64 configurations range from roughly unchanged to about 12% faster.

@yzh119 changed the title from "[WIP] Use 2WG pipeline design for MLA implementation on Hopper" to "perf: Use 2WG pipeline design for MLA implementation on Hopper" on Mar 26, 2025
@yzh119 marked this pull request as ready for review on March 26, 2025