fangfangssj

RFC for integrating SageAttn v2++ into FastDeploy

paddle-bot bot commented Sep 18, 2025

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@luotao1 luotao1 self-assigned this Sep 19, 2025
Collaborator

luotao1 commented Sep 19, 2025

@chang-wenbin

- qk_int8_sv_f8_accum_f16_fuse_v_scale_attn_inst_buf # FP16 accumulation variant
- qk_int8_sv_f8_accum_f32_fuse_v_scale_fuse_v_mean_attn # fused V mean
#### sm90
- qk_int8_sv_f8_accum_f32_fuse_v_scale_attn

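Suggest integrating the SM90 kernels first to get the pipeline running end to end; validation and integration of the other architectures can then proceed in parallel.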
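One way to stage this is a small registry keyed by compute capability that initially contains only the SM90 kernel named in the RFC. The sketch below is a hypothetical illustration: `SAGE_KERNELS` and `select_sage_kernel` are made-up names, not FastDeploy APIs, and it assumes the caller obtains the `(major, minor)` capability of the device elsewhere.

```python
from typing import Dict, Optional, Tuple

# Hypothetical registry. Only the SM90 kernel listed in the RFC is wired up;
# entries for other architectures would be added as they are validated.
SAGE_KERNELS: Dict[Tuple[int, int], str] = {
    (9, 0): "qk_int8_sv_f8_accum_f32_fuse_v_scale_attn",
}


def select_sage_kernel(major: int, minor: int) -> Optional[str]:
    """Return the SageAttn kernel name for a compute capability.

    Returns None when no kernel is registered, so the caller can fall back
    to its existing attention backend.
    """
    return SAGE_KERNELS.get((major, minor))
```

With this shape, an unsupported device such as sm86 simply gets `None` until its kernels are registered, which keeps the first integration step limited to SM90.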

Triton kernel implementation for the sm86 architecture
#### sm86
- attn_qk_int8_block_varlen
- attn_qk_int8_per_block_causal_varlen

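To see a performance gain in LLM serving, varlen (variable-length) support is required. For the CUDA kernels, the kernel modifications in the PaddleNLP PR can be used as a reference for validation.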
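As a rough illustration of why varlen matters for serving: requests in a batch usually differ in length, so a dense `[batch, max_len]` layout spends compute on padding, while a packed layout indexes each sequence through cumulative offsets. The sketch below assumes the `cu_seqlens` convention that is common in varlen attention kernels; whether the kernels in this RFC use exactly this layout is an assumption, and the helper names are hypothetical.

```python
from itertools import accumulate
from typing import List


def build_cu_seqlens(seq_lens: List[int]) -> List[int]:
    """Cumulative offsets into a packed token buffer.

    Sequence i occupies tokens [cu[i], cu[i + 1]) of the packed buffer.
    """
    return [0] + list(accumulate(seq_lens))


def padding_waste(seq_lens: List[int]) -> float:
    """Fraction of a dense [batch, max_len] layout that would be padding."""
    real_tokens = sum(seq_lens)
    padded_tokens = len(seq_lens) * max(seq_lens)
    return 1.0 - real_tokens / padded_tokens


if __name__ == "__main__":
    lens = [3, 5, 2]  # three requests of different lengths
    print(build_cu_seqlens(lens))
    print(padding_waste(lens))
```

The more the request lengths in a batch diverge, the larger the padded fraction becomes, which is the work a varlen kernel avoids.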
