[Refactor] Uniform PoDAttention API with Horizontal Fusion SMs Schedule #967
Description
This PR is a follow-up to #858 and integrates the PoDAttention (arXiv link) API in a user-transparent manner. Users can now invoke PoDAttention through the same API as BatchPrefillWithPagedKVCache, without explicitly specifying whether requests are prefill or decode (example code).
Key Changes
Support for Non-Continuous Q/O and KV Tensor Layout
Previously, tensor offsets were computed using indptr, assuming a continuous layout. PoDAttention requires supporting mixed prefill/decode subsets of requests within a batch, which necessitates a non-continuous layout. This PR introduces q_lenptr and kv_lenptr to accommodate this (code link).
Horizontal Fusion-Style Implementation
For improved efficiency, the prefill and decode subsets of requests are scheduled with awareness of each other, enabling better selection of kernel hyperparameters for each subset and persistent kernel execution.
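As a rough illustration of the horizontal-fusion idea (not the actual CUDA scheduler in this PR), the sketch below flattens the work of both subsets into tiles and hands them out greedily to a fixed pool of persistent workers, so prefill and decode tiles share the same wave of CTAs instead of being launched as separate kernels. All names and the cost model are assumptions made for the example.

```python
import heapq

def build_work_items(prefill_lens, decode_kv_lens, tile_size=64):
    # One shared list of (kind, request id, tile start, rough cost) covering
    # both subsets, so a single persistent launch can process everything.
    items = []
    for rid, qo_len in enumerate(prefill_lens):
        for start in range(0, qo_len, tile_size):
            items.append(("prefill", rid, start, min(tile_size, qo_len - start)))
    for rid, kv_len in enumerate(decode_kv_lens):
        items.append(("decode", rid, 0, max(1, kv_len // tile_size)))
    return items

def persistent_schedule(items, num_workers=4):
    # Greedy least-loaded assignment, approximating persistent CTAs that pull
    # the next tile (prefill or decode alike) as soon as they finish one.
    heap = [(0, w) for w in range(num_workers)]
    plan = {w: [] for w in range(num_workers)}
    for kind, rid, start, cost in sorted(items, key=lambda it: -it[3]):
        load, w = heapq.heappop(heap)
        plan[w].append((kind, rid, start))
        heapq.heappush(heap, (load + cost, w))
    return plan

if __name__ == "__main__":
    items = build_work_items(prefill_lens=[128, 96], decode_kv_lens=[512, 1024, 256])
    for worker, tiles in persistent_schedule(items).items():
        print(f"SM {worker}: {tiles}")
```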
Limitations and Future Work
The current heuristic for classifying requests as prefill or decode (qo_len > threshold) is preliminary and requires improvement (classifier implementation).
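For context, the heuristic amounts to a single length threshold; a hypothetical version of that split (function name and threshold value are illustrative only) is:

```python
def split_batch(qo_lens, threshold=1):
    # Hypothetical threshold classifier: requests with more than `threshold`
    # query tokens are treated as prefill, the rest as decode.
    prefill_ids = [i for i, n in enumerate(qo_lens) if n > threshold]
    decode_ids = [i for i, n in enumerate(qo_lens) if n <= threshold]
    return prefill_ids, decode_ids

print(split_batch([128, 1, 1, 64, 1]))  # -> ([0, 3], [1, 2, 4])
```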
cc @AKKamath @yzh119