
Commit 87e5d01

CuTe and CUTLASS
1 parent 14a85ec commit 87e5d01

File tree

1 file changed: +1 -1 lines changed

_posts/2024-12-12-flashinfer-v02-release.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ After four months of development, we are thrilled to announce the release of **F
 
 FlashInfer's standout feature is its highly flexible block-sparse FlashAttention implementation, supporting **any block size configuration**. Our PageAttention operators are implemented as **block-sparse attention kernels**, where `page_size` specifies the block's column count. At its finest granularity, FlashInfer supports **vector-sparsity**[^1] (`page_size=1`), allowing for precise memory management (used in [sglang](https://github.com/sgl-project/sglang)) and efficient KV-Cache token pruning.
 
-By leveraging [CuTE](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md)'s `CustomStride` and `ComposedLayout` abstractions, we have extended vector-sparsity to FlashAttention-3. Inspired by [Cutlass's gather/scatter convolution](https://github.com/NVIDIA/cutlass/tree/e1cd8c7866dd6de02b66a89879795e7d7301aacc/examples/59_ampere_gather_scatter_conv), this was achieved through a simple modification to the producer's memory loading module.
+By leveraging [CuTe](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md)'s `CustomStride` and `ComposedLayout` abstractions, we have extended vector-sparsity to FlashAttention-3. Inspired by [CUTLASS's gather/scatter convolution](https://github.com/NVIDIA/cutlass/tree/e1cd8c7866dd6de02b66a89879795e7d7301aacc/examples/59_ampere_gather_scatter_conv), this was achieved through a simple modification to the producer's memory loading module.
 
 ### Performance Benchmark
 We compared two attention implementations: PageAttention with `page_size=1` [^2] (use vector-sparse attention implementation) and variable-length dense attention [^3], benchmarking them under identical problem sizes across both FA-2 (v0.1.*) and FA-3 (v0.2) backends. Benchmarks used `head_dim=128`, `causal=True`, varying batch sizes `(B)` and sequence lengths `(L)` with Gaussian-initialized input Q/K/V tensors.
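For context on the vector-sparse layout the patched paragraph describes, below is a minimal NumPy sketch of the `page_size=1` gather idea: a per-token page table maps each logical KV position to a scattered physical row, and composing that lookup with an ordinary row-major layout is the role CuTe's `ComposedLayout` with custom strides plays in the FA-3 producer's loads. This sketch is illustrative only; `kv_cache`, `page_table`, and `gather_kv` are hypothetical names and this is not FlashInfer's or CuTe's actual implementation.

```python
import numpy as np

# Hypothetical paged KV cache with page_size=1 (vector sparsity): each physical
# row stores exactly one token's K (or V) vector of size head_dim.
num_pages, head_dim = 16, 128
rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((num_pages, head_dim), dtype=np.float32)

# Hypothetical per-request page table: logical KV position -> physical row index.
# Because page_size=1, tokens can live anywhere in the cache, in any order.
page_table = np.array([3, 7, 2, 11, 5])


def gather_kv(kv_cache: np.ndarray, page_table: np.ndarray) -> np.ndarray:
    """Materialize the dense logical KV view from scattered physical rows.

    A fused kernel never materializes this copy; conceptually, composing the
    page-table lookup with the usual affine layout turns each producer load
    into a gather instead of a contiguous block read.
    """
    return kv_cache[page_table]  # gather along the token (row) dimension


logical_kv = gather_kv(kv_cache, page_table)
assert logical_kv.shape == (len(page_table), head_dim)
```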
