[Refactor] Uniform PoDAttention API with Horizontal Fusion SMs Schedule #967

Open · wants to merge 12 commits into main
Conversation

happierpig (Collaborator)

Description

This PR is a follow-up to #858: it integrates the PoDAttention (arXiv link) API in a user-transparent manner. Users can now invoke PoDAttention through the same API as BatchPrefillWithPagedKVCache, without explicitly specifying whether each request is prefill or decode (example code). A usage sketch follows.
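Below is a minimal sketch of the intended user-facing flow, based on FlashInfer's public `BatchPrefillWithPagedKVCacheWrapper` Python API. The shapes and argument order follow the documented API, but exact parameter names may vary across versions; the PoD classification and fusion happen entirely inside the scheduler.

```python
import torch
import flashinfer

# One batch mixing a long "prefill" request (qo_len = 512) with two
# single-token "decode" requests (qo_len = 1); all three go through the
# same wrapper, with no prefill/decode flag from the user.
num_qo_heads, num_kv_heads, head_dim, page_size = 32, 8, 128, 16
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

qo_indptr = torch.tensor([0, 512, 513, 514], dtype=torch.int32, device="cuda")
# Paged KV cache: 32 pages for request 0, 8 pages each for requests 1 and 2.
paged_kv_indptr = torch.tensor([0, 32, 40, 48], dtype=torch.int32, device="cuda")
paged_kv_indices = torch.arange(48, dtype=torch.int32, device="cuda")
paged_kv_last_page_len = torch.tensor([16, 16, 16], dtype=torch.int32, device="cuda")
kv_cache = torch.randn(48, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
q = torch.randn(514, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")

wrapper.plan(qo_indptr, paged_kv_indptr, paged_kv_indices,
             paged_kv_last_page_len, num_qo_heads, num_kv_heads,
             head_dim, page_size, causal=True)
out = wrapper.run(q, kv_cache)  # prefill/decode split handled transparently
```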

Key Changes

  1. Support for Non-Contiguous Q/O and KV Tensor Layout
    Previously, tensor offsets were computed from indptr, which assumes a contiguous layout. PoDAttention must handle mixed prefill/decode subsets of requests within a batch, which necessitates a non-contiguous layout.

    • Added q_lenptr and kv_lenptr to accommodate this functionality (code link); see the layout sketch after this list.
  2. Horizontal Fusion-Style Implementation
    For better efficiency, the prefill and decode subsets of requests are scheduled with awareness of each other, enabling optimal selection of kernel hyperparameters and persistent kernel execution.

    • The current resource-partitioning strategy depends solely on the total KV-cache load size (scheduler code); a toy version follows this list.
    • Note: this strategy is customizable for specific workloads.
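On point 1, here is an illustrative sketch of why indptr alone no longer suffices. It assumes q_lenptr stores per-request lengths (matching the PR's name), and introduces a hypothetical q_start array for the per-request start offsets; neither the helper nor the exact layout is taken from the PR's code.

```python
import torch

# Contiguous layout: the row offset of request i in Q/O is just
# qo_indptr[i], i.e. a prefix sum over the request lengths.
qo_indptr = torch.tensor([0, 17, 18, 19, 35])  # 4 requests of lengths 17, 1, 1, 16

# Non-contiguous layout: prefill and decode subsets may occupy separate
# regions, so a per-request start offset is kept alongside a per-request
# length instead of relying on one prefix sum. q_start is hypothetical.
q_start = torch.tensor([0, 64, 65, 17])   # starts need not be monotonic
q_lenptr = torch.tensor([17, 1, 1, 16])   # per-request query lengths

def q_rows(q: torch.Tensor, i: int) -> torch.Tensor:
    """Query rows of request i under the non-contiguous layout."""
    return q[q_start[i] : q_start[i] + q_lenptr[i]]
```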
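On point 2, a toy version of a KV-load-based partition, assuming SMs are split between the two sub-batches in proportion to their total KV-cache load; the actual heuristic lives in the linked scheduler code and may differ.

```python
def partition_sms(num_sms: int, prefill_kv_bytes: int, decode_kv_bytes: int):
    """Split SMs proportionally to total KV-cache load (an assumption
    mirroring the note above), keeping at least one SM on each side."""
    total = prefill_kv_bytes + decode_kv_bytes
    prefill_sms = min(num_sms - 1,
                      max(1, round(num_sms * prefill_kv_bytes / total)))
    return prefill_sms, num_sms - prefill_sms

# e.g. with 132 SMs and a 3:1 KV-load ratio, this yields a 99/33 split:
print(partition_sms(132, 3 * 2**30, 1 * 2**30))  # (99, 33)
```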

Limitations and Future Work

  • CUDA Graph is not yet supported, and only the FA2 backend is supported at this stage.
  • The workload classifier (qo_len > threshold) is preliminary and needs improvement (classifier implementation); see the sketch below.
  • Performance tuning is ongoing, and correctness has so far been validated only on a limited set of unit tests (unit tests).

cc @AKKamath @yzh119
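For reference, a toy version of the preliminary classifier: a request counts as prefill when its query length exceeds a fixed threshold. The default threshold here is an illustrative guess, not the PR's actual constant.

```python
def classify_requests(qo_lens: list[int], threshold: int = 1):
    """Partition request indices into (prefill, decode) using the
    preliminary qo_len > threshold rule."""
    prefill = [i for i, n in enumerate(qo_lens) if n > threshold]
    decode = [i for i, n in enumerate(qo_lens) if n <= threshold]
    return prefill, decode

# e.g. classify_requests([512, 1, 1, 16]) -> ([0, 3], [1, 2])
```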

happierpig requested a review from yzh119 on March 21, 2025.
yzh119 (Collaborator) commented on Mar 21, 2025:

Some of the unit tests failed, for example test_block_sparse_attention[False-256-16-16-128-64-16-4]:

RuntimeError: Error in function 'PrefillSplitQOKVIndptr' at /workspace/flashinfer/data/include/flashinfer/attention/scheduler.cuh:515: kv_len_ptr_h[0]: 0 should be positive
