Commit 2c16f57

author
Guanbao Yu
committed
move out some contents
1 parent 0dba6b3 commit 2c16f57

1 file changed: +0 -11 lines changed

Qwen/Qwen3-235B-A22B-ROCm.md

Lines changed: 0 additions & 11 deletions
@@ -89,17 +89,6 @@ lm_eval \
 --batch_size 100
 ```
 
-## More to update
-
-### Optimizations on the way
-1. https://github.com/vllm-project/vllm/pull/28500 enables **q_norm + k_norm + rope fusion** on ROCm platforms, which was initially implemented for cuda in https://github.com/vllm-project/vllm/pull/27165.
-2. https://github.com/vllm-project/vllm/pull/25693 added new fusion passes to enable **rms_norm + fp8_block_quant** and **silu + fp8_block_quant**, which depends on the triton fused kernel in https://github.com/ROCm/aiter/tree/dev/perf_fused_rms_fp8_group_quant. Need to check if this triton kernel merged into AITER main.
-3. **Sequence parallel** code ready in https://github.com/ROCm/vllm/pull/790. But seems poor performance due to the pynccl comm op.
-4. **All-reduce + rms_norm** fusion WIP in https://github.com/ROCm/vllm/pull/803.
-5. FP8 block GEMM is not efficient enough, i.e., up to 1.2p ~ 1.4p flops even after tuning.
-
-### Other parallelism
-1. Try other parallel strategies for best performance across different scenarios.
 
 
 
