
Commit 87e5d01

CuTe and CUTLASS
1 parent 14a85ec commit 87e5d01

File tree

1 file changed: +1 -1 lines changed

_posts/2024-12-12-flashinfer-v02-release.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ After four months of development, we are thrilled to announce the release of **F
 
 FlashInfer's standout feature is its highly flexible block-sparse FlashAttention implementation, supporting **any block size configuration**. Our PageAttention operators are implemented as **block-sparse attention kernels**, where `page_size` specifies the block's column count. At its finest granularity, FlashInfer supports **vector-sparsity**[^1] (`page_size=1`), allowing for precise memory management (used in [sglang](https://github.com/sgl-project/sglang)) and efficient KV-Cache token pruning.
 
-By leveraging [CuTE](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md)'s `CustomStride` and `ComposedLayout` abstractions, we have extended vector-sparsity to FlashAttention-3. Inspired by [Cutlass's gather/scatter convolution](https://github.com/NVIDIA/cutlass/tree/e1cd8c7866dd6de02b66a89879795e7d7301aacc/examples/59_ampere_gather_scatter_conv), this was achieved through a simple modification to the producer's memory loading module.
+By leveraging [CuTe](https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/00_quickstart.md)'s `CustomStride` and `ComposedLayout` abstractions, we have extended vector-sparsity to FlashAttention-3. Inspired by [CUTLASS's gather/scatter convolution](https://github.com/NVIDIA/cutlass/tree/e1cd8c7866dd6de02b66a89879795e7d7301aacc/examples/59_ampere_gather_scatter_conv), this was achieved through a simple modification to the producer's memory loading module.
 
 ### Performance Benchmark
 We compared two attention implementations: PageAttention with `page_size=1` [^2] (use vector-sparse attention implementation) and variable-length dense attention [^3], benchmarking them under identical problem sizes across both FA-2 (v0.1.*) and FA-3 (v0.2) backends. Benchmarks used `head_dim=128`, `causal=True`, varying batch sizes `(B)` and sequence lengths `(L)` with Gaussian-initialized input Q/K/V tensors.
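For context on the vector-sparse layout the patched paragraph describes, below is a minimal NumPy sketch of the `page_size=1` gather idea: a per-token page table maps each logical KV position to a scattered physical row, and composing that lookup with an ordinary row-major layout is the role CuTe's `ComposedLayout` with custom strides plays in the FA-3 producer's loads. This sketch is illustrative only; `kv_cache`, `page_table`, and `gather_kv` are hypothetical names and this is not FlashInfer's or CuTe's actual implementation.

```python
import numpy as np

# Hypothetical paged KV cache with page_size=1 (vector sparsity): each physical
# row stores exactly one token's K (or V) vector of size head_dim.
num_pages, head_dim = 16, 128
rng = np.random.default_rng(0)
kv_cache = rng.standard_normal((num_pages, head_dim), dtype=np.float32)

# Hypothetical per-request page table: logical KV position -> physical row index.
# Because page_size=1, tokens can live anywhere in the cache, in any order.
page_table = np.array([3, 7, 2, 11, 5])


def gather_kv(kv_cache: np.ndarray, page_table: np.ndarray) -> np.ndarray:
    """Materialize the dense logical KV view from scattered physical rows.

    A fused kernel never materializes this copy; conceptually, composing the
    page-table lookup with the usual affine layout turns each producer load
    into a gather instead of a contiguous block read.
    """
    return kv_cache[page_table]  # gather along the token (row) dimension


logical_kv = gather_kv(kv_cache, page_table)
assert logical_kv.shape == (len(page_table), head_dim)
```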
