[FEATURE] Upstream VIT FA RDNA3 ROCM #27776

JartX · 2025-10-30T00:23:40Z

I'm trying to follow the latest logic introduced with @Lucaskabela's wrapper while also fixing the execution path. Before the wrapper landed, I already had this working with minimal logic in rocm.py and layer.py. Right now I can get it to work with FLASH_ATTN, without torch.SDPA. I'd like to ask for your help:

@DarkLight1337 @tjtanaa @lgeiger @Lucaskabela

I am worried about the following part in qwen2_5_vl.py

context_layer = vit_flash_attn_wrapper(
    q,
    k,
    v,
    cu_seqlens,
    max_seqlen,
    batch_size,
    self.attn_backend == _Backend.ROCM_AITER_FA,
    self.use_upstream_fa,
)

I would say that there is a lack of coherence in the parameters:

def vit_flash_attn_wrapper(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,
    max_seqlen: torch.Tensor,
    batch_size: int,
    is_rocm_aiter: bool,
    use_upstream_fa: bool,
) -> torch.Tensor:
    return torch.ops.vllm.flash_attn_maxseqlen_wrapper(
        q, k, v, cu_seqlens, max_seqlen, batch_size, is_rocm_aiter, use_upstream_fa
    )

That is, the is_rocm_aiter: bool parameter is filled from the comparison:
self.attn_backend == _Backend.ROCM_AITER_FA,

I have moved that selection into flash_attn_maxseqlen_wrapper, in a way I am sure is poorly done; @tjtanaa will surely hit me for it xD

Please take a look, and if I need to give everyone access to my repository, I will :)

Thank you so much!

Signed-off-by: JartX <[email protected]>

gemini-code-assist

Code Review

This pull request introduces support for Vision Transformer Flash Attention on RDNA3 with ROCM. The changes include updating the Dockerfile to conditionally build Flash Attention, and modifying the attention layers to correctly select the ViT attention backend for ROCm platforms. The refactoring to centralize backend selection logic is a good improvement. However, I've found a critical issue where a function is used as a boolean value instead of being called, which will lead to incorrect behavior at runtime. Please see the specific comment for details.

vllm/attention/ops/vit_attn_wrappers.py

JartX · 2025-10-30T00:28:23Z

/gemini summary

gemini-code-assist · 2025-10-30T00:28:49Z

Summary of Changes

This pull request integrates comprehensive support for upstream Flash Attention on AMD RDNA3 (gfx1100) GPUs within the ROCm platform. It refactors the attention backend selection mechanism to be more robust and dynamic, moving away from a passed boolean parameter to explicit checks of the current platform, environment variables, and GPU architecture. This change centralizes the logic for choosing between different Flash Attention implementations, including the AITER version for gfx9 and the newly integrated upstream Flash Attention for RDNA3, so that the backend is chosen according to the hardware and configuration.

Highlights

- RDNA3 Flash Attention Support: Introduced conditional compilation and runtime detection for upstream Flash Attention on AMD RDNA3 (gfx1100) GPUs, enabling enhanced performance for these architectures within the ROCm ecosystem.
- Refactored Attention Backend Selection: Streamlined the logic for selecting the appropriate Flash Attention backend by removing the is_rocm_aiter boolean parameter from wrapper functions. The decision-making is now centralized and based on explicit checks of the current platform, environment variables, and GPU architecture, addressing the author's concern about parameter coherence.
- Dynamic Upstream FA Detection: The qwen2_5_vl.py model now dynamically checks for the availability of the flash_attn library at runtime to determine whether to use the upstream Flash Attention implementation.
- Dockerfile Enhancements: The ROCm Dockerfile now includes git as a dependency and conditionally compiles the ROCm/flash-attention library if specific environment variables (FLASH_ATTENTION_TRITON_AMD_ENABLE and GPU_ARCHS) are set, facilitating easier setup for RDNA3 Flash Attention.

Changelog

docker/Dockerfile.rocm
- Added git to the apt-get install command.
- Implemented conditional build logic for ROCm/flash-attention based on FLASH_ATTENTION_TRITON_AMD_ENABLE and GPU_ARCHS environment variables.
vllm/attention/layer.py
- Imported on_gfx1x for RDNA3 architecture detection.
- Modified maybe_get_vit_flash_attn_backend to include a new condition for enabling _Backend.FLASH_ATTN on gfx1100 when FLASH_ATTENTION_TRITON_AMD_ENABLE is 'TRUE' and flash_attn is available.
- Removed the on_gfx9() check from the general _Backend.FLASH_ATTN condition, broadening its applicability.
vllm/attention/ops/vit_attn_wrappers.py
- Imported envs, current_platform, on_gfx9, and on_gfx1x.
- Removed the is_rocm_aiter parameter from flash_attn_maxseqlen_wrapper, flash_attn_maxseqlen_wrapper_fake, and vit_flash_attn_wrapper.
- Updated the logic within flash_attn_maxseqlen_wrapper to determine AITER usage based on current_platform.is_rocm(), on_gfx9(), and envs.VLLM_ROCM_USE_AITER/envs.VLLM_ROCM_USE_AITER_MHA.
vllm/model_executor/models/qwen2_5_vl.py
- Ensured max_seqlen is a torch.Tensor before passing it to the wrapper.
- Dynamically set self.use_upstream_fa based on flash_attn library availability.
- Updated the call to vit_flash_attn_wrapper to remove the is_rocm_aiter argument.
vllm/platforms/rocm.py
- Removed an entry for AMD_Instinct_MI308X from ROCM_ARCH_TO_NAME.
- Added import os to get_vit_attn_backend.
- Implemented on_gfx1x function for RDNA3 detection.
- Updated get_vit_attn_backend to prioritize upstream Flash Attention for gfx1100 based on environment variables.
- Replaced os.environ["VLLM_USE_TRITON_AWQ"] = "1" with envs.VLLM_USE_TRITON_AWQ = True for consistency.
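
The architecture checks mentioned in this changelog (on_gfx9, on_gfx1x) boil down to matching the GPU's reported architecture string. The following is only a sketch of that idea, not the code in vllm/platforms/rocm.py: the helper names are hypothetical, the arch string is passed in explicitly so the example runs without a GPU (on a ROCm build of PyTorch it would typically come from the device's gcnArchName property), and matching only the "gfx11" prefix for RDNA3 is an assumption; the real helper may cover more architectures.

```python
def arch_has_prefix(gcn_arch_name: str, prefixes: tuple[str, ...]) -> bool:
    """Return True if the base arch name starts with any of the prefixes."""
    # Arch strings can carry feature suffixes such as
    # "gfx90a:sramecc+:xnack-"; keep only the part before the first colon.
    base = gcn_arch_name.split(":", 1)[0]
    return base.startswith(prefixes)


def is_rdna3_sketch(gcn_arch_name: str) -> bool:
    """Hypothetical RDNA3 check: gfx1100, gfx1101, gfx1102 share 'gfx11'."""
    return arch_has_prefix(gcn_arch_name, ("gfx11",))


if __name__ == "__main__":
    for name in ("gfx1100", "gfx90a:sramecc+:xnack-", "gfx942"):
        print(name, is_rdna3_sketch(name))
```

Taking the string as an argument keeps the check a pure function, which makes it easy to unit test on machines without the target hardware.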

Activity

The author, JartX, initiated the pull request and requested a summary of the changes.
A bot, gemini-code-assist[bot], provided a critical review comment on vllm/attention/layer.py, highlighting a potential bug in the original code where on_gfx9 was used as a boolean instead of being called as a function on_gfx9(). This issue is implicitly addressed by the PR's refactoring of attention backend selection logic.

…d-args Signed-off-by: JartX <[email protected]>

JartX · 2025-10-30T08:52:49Z

@zhewenl I can run inference with your latest changes on flash_attn:

vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:11 [api_server.py:1869] vLLM API server version 0.1.dev10861+g624438436
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:11 [utils.py:253] non-default args: {'model_tag': '/models/Qwen3-VL-32B-Instruct-AWQ-4bit', 'port': 80, 'chat_template': '/chat-template-tools.jinja', 'model': '/models/Qwen3-VL-32B-Instruct-AWQ-4bit', 'dtype': 'float16', 'max_model_len': 65536, 'served_model_name': ['INCQ'], 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.98, 'limit_mm_per_prompt': {'image': 6, 'video': 0}, 'mm_encoder_tp_mode': 'data', 'max_num_seqs': 32, 'enable_log_requests': True}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [model.py:668] Resolved architecture: Qwen3VLForConditionalGeneration
vllm1-1  | (APIServer pid=1) WARNING 10-30 08:42:16 [model.py:1999] Casting torch.bfloat16 to torch.float16.
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [model.py:1773] Using max model len 65536
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=2048.
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:42:21 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10861+g624438436) with config: model='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=QWEN3, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 64, 'local_cache_dir': None}
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | INFO 10-30 08:42:26 [pynccl.py:111] vLLM is using nccl==2.26.6
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [gpu_model_runner.py:2861] Starting to load model cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit...
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
Loading safetensors checkpoint shards: 100% 5/5 [00:02<00:00,  1.82it/s]
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:39 [default_loader.py:314] Loading weights took 2.76 seconds
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:40 [gpu_model_runner.py:2926] Model loading took 6.6191 GiB and 4.165281 seconds
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:26 [backends.py:618] Using cache directory: /root/.cache/vllm/torch_compile_cache/c4e7ce712f/rank_0_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:26 [backends.py:634] Dynamo bytecode transform time: 7.05 s
vllm1-1  | (Worker_TP0 pid=39) [rank0]:W1030 08:43:28.235000 39 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP2 pid=41) [rank2]:W1030 08:43:28.634000 41 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP3 pid=42) [rank3]:W1030 08:43:28.642000 42 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP1 pid=40) [rank1]:W1030 08:43:28.644000 40 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP0 pid=39) [rank0]:W1030 08:43:30.481000 39 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP3 pid=42) [rank3]:W1030 08:43:30.848000 42 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP1 pid=40) [rank1]:W1030 08:43:30.853000 40 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP2 pid=41) [rank2]:W1030 08:43:30.870000 41 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:31 [backends.py:248] Cache the graph for dynamic shape for later use
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:40 [shm_broadcast.py:495] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation).
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:42 [backends.py:279] Compiling a graph for dynamic shape takes 15.09 s
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:49 [monitor.py:34] torch.compile takes 22.14 s in total
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:50 [gpu_worker.py:319] Available KV cache memory: 11.99 GiB
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1229] GPU KV cache size: 190,816 tokens
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100% 11/11 [00:02<00:00,  4.71it/s]
Capturing CUDA graphs (decode, FULL): 100% 7/7 [00:01<00:00,  4.15it/s]
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:55 [gpu_model_runner.py:3854] Graph capturing finished in 5 secs, took 0.40 GiB
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:55 [core.py:237] init engine (profile, create kv cache, warmup model) took 74.70 seconds
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:00 [api_server.py:1647] Supported tasks: ['generate']
vllm1-1  | (APIServer pid=1) WARNING 10-30 08:44:03 [model.py:1603] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:03 [serving_responses.py:167] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:03 [serving_chat.py:130] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [serving_completion.py:68] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [api_server.py:1938] Starting vLLM API server 0 on http://0.0.0.0:80
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:38] Available routes are:
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /docs, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /health, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /load, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /ping, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /ping, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /tokenize, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /detokenize, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/models, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /version, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/completions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/embeddings, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /pooling, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /classify, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /score, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/score, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v2/rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /invocations, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /metrics, Methods: GET
vllm1-1  | (APIServer pid=1) INFO:     Started server process [1]
vllm1-1  | (APIServer pid=1) INFO:     Waiting for application startup.
vllm1-1  | (APIServer pid=1) INFO:     Application startup complete.

Can you help me with these parts of the wrapper and flash_attn?

context_layer = vit_flash_attn_wrapper(
    q,
    k,
    v,
    cu_seqlens,
    max_seqlen,
    batch_size,
    self.attn_backend == _Backend.ROCM_AITER_FA,
    self.use_upstream_fa,
)

def vit_flash_attn_wrapper(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,
    max_seqlen: torch.Tensor,
    batch_size: int,
    is_rocm_aiter: bool,
    use_upstream_fa: bool,
) -> torch.Tensor:
    return torch.ops.vllm.flash_attn_maxseqlen_wrapper(
        q, k, v, cu_seqlens, max_seqlen, batch_size, is_rocm_aiter, use_upstream_fa
    )

JartX · 2025-10-30T11:47:24Z

Hi @yewentao256, would you be so kind as to take a look at this as well? Perhaps instead of using environment variables you know of another way to add it implicitly.

tjtanaa · 2025-10-30T15:42:11Z

@JartX can you benchmark which one is faster, the Triton flash attention API or torch.SDPA? Let's try to avoid introducing more and more code paths by offering the best one as the default. I also saw that you introduced a new environment variable; we would like to cut down on those.

tjtanaa · 2025-10-30T15:45:39Z

Moreover, since there are efforts to fix the AMD CI, the GPU used on AMD CI is able to test all the code paths: torch.sdpa, CK flash attention varlen, AITER flash attention varlen, and even this new Triton flash attention varlen (if it is worth introducing). I will be fixing all of them together.

Lucaskabela

I think the change to the custom op signature is fine, but there are some changes in the models file we shouldn't need.

Lucaskabela · 2025-10-30T15:42:57Z

vllm/attention/ops/vit_attn_wrappers.py

-    is_rocm_aiter: bool,
    use_upstream_fa: bool,
 ) -> torch.Tensor:
-    if is_rocm_aiter:


So I am not sure what the ask is on this PR, but from my end we should feel free to change these shims however we need.

These custom ops are purely to preserve traceability of the ViT component, and the signatures are designed like this because we went from selecting attention on attrs of the model (self) to needing an independent function without self parameter.

I do want to voice a design consideration on this backend selection logic as a whole though: to me, it would seem better if we could just pass attn_fn lambdas directly, as opposed to passing some backend enum and then doing the function selection later. I wonder what is preventing us from doing this in the code today? (traceability, etc.)

cc @ywang96 who may have more context on this

it would seem better if we could just pass attn_fn lambdas directly, as opposed to passing some backend enum and then doing the function selection later. I wonder what is preventing us from doing this in the code today?

From the user's perspective it's cleaner to just pass in an enum (e.g., --mm-encoder-attn-backend TORCH_SDPA), and it's better for us to control this than to accept an entire free-form attn implementation. But I agree that enum -> attn_fn can be done at init time of XXXVisionTransformer, and we pass the resolved attn_fn as an input downstream to XXXVisionAttention. Does that align with what's on your mind?

@Lucaskabela @ywang96

I have an RFC, #27821, that proposes the same idea: the enum -> attn_fn resolution happens in XXXVisionTransformer.

However, since many of the VL models share the same logic as qwen2_5_vl.py, the RFC abstracts this further: the overriding logic should be handled by the platform, as only the platform knows which backends it can support.

So maybe_get_vit_flash_attn_backend will be solely responsible for the enum -> attn_fn mapping and will no longer include overriding logic (it will be renamed to match its new role).
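
The split discussed above (platform decides which backend is usable, a plain mapping turns the enum into a function, and the transformer resolves it once at init) could look roughly like the sketch below. Everything here is hypothetical and simplified for illustration: Backend stands in for vLLM's _Backend, PlatformSketch for the platform object, and the attention functions operate on strings instead of q/k/v tensors so the example runs without torch.

```python
from enum import Enum, auto
from typing import Callable

# Real attention functions take q, k, v tensors; strings keep this runnable.
AttnFn = Callable[[str], str]


class Backend(Enum):
    # Hypothetical stand-in for vLLM's _Backend enum.
    FLASH_ATTN = auto()
    ROCM_AITER_FA = auto()
    TORCH_SDPA = auto()


# Pure enum -> function mapping, with no override logic in it.
ATTN_FNS: dict[Backend, AttnFn] = {
    Backend.FLASH_ATTN: lambda x: f"flash_attn({x})",
    Backend.ROCM_AITER_FA: lambda x: f"aiter_fa({x})",
    Backend.TORCH_SDPA: lambda x: f"torch_sdpa({x})",
}


class PlatformSketch:
    """Only the platform knows which backends it can actually run."""

    def __init__(self, supported: set[Backend], fallback: Backend) -> None:
        self.supported = supported
        self.fallback = fallback

    def resolve(self, requested: Backend) -> Backend:
        # Override the request when the platform cannot honour it.
        return requested if requested in self.supported else self.fallback


class VisionAttentionSketch:
    def __init__(self, attn_fn: AttnFn) -> None:
        self.attn_fn = attn_fn  # already resolved; no enum checks in forward

    def forward(self, x: str) -> str:
        return self.attn_fn(x)


class VisionTransformerSketch:
    def __init__(self, platform: PlatformSketch, requested: Backend) -> None:
        backend = platform.resolve(requested)  # platform-level override
        attn_fn = ATTN_FNS[backend]            # enum -> attn_fn, once, at init
        self.blocks = [VisionAttentionSketch(attn_fn) for _ in range(2)]


if __name__ == "__main__":
    platform = PlatformSketch({Backend.TORCH_SDPA}, Backend.TORCH_SDPA)
    model = VisionTransformerSketch(platform, Backend.FLASH_ATTN)
    print(model.blocks[0].forward("qkv"))
```

With this shape the user still selects an enum value, while the attention layers only ever see a callable, so no backend comparison needs to travel through the custom-op signature.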

Lucaskabela · 2025-10-30T15:46:43Z

vllm/model_executor/models/qwen2_5_vl.py

            q, k = torch.chunk(qk_rotated, 2, dim=0)

        if self.is_flash_attn_backend:
+            from importlib.util import find_spec


This entire logic seems unnecessary here because:

max_seqlen is already a tensor (type hints FTW)

use_upstream_fa is set in Qwen2_5_VisionTransformer/upstream in init. We should just modify the logic there if needed as opposed to here
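
As a rough illustration of the second point, the availability check can be done once at construction time rather than inside the attention forward pass. This is a minimal sketch, not the code in qwen2_5_vl.py: the class and the module_available helper are hypothetical, and only importlib.util.find_spec is a real standard-library call.

```python
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """Check whether a top-level module is importable, without importing it."""
    return find_spec(name) is not None


class VisionTransformerInitSketch:
    """Hypothetical stand-in for the model's vision transformer."""

    def __init__(self, flash_attn_module: str = "flash_attn") -> None:
        # Resolved once here, so forward() needs no import checks.
        self.use_upstream_fa = module_available(flash_attn_module)


if __name__ == "__main__":
    print(VisionTransformerInitSketch().use_upstream_fa)
```

The module name is a parameter only so the sketch can be exercised on machines where flash_attn is not installed.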

JartX · 2025-10-30T21:17:11Z

@tjtanaa
The benchmarks

FA CACHE 32B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 44.0127 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 592.918 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1227.88 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5604 |
+-----------------------------------+-----------+
| Average latency (s) | 29.6106 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.1861 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.128 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.1227 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 231.57 |
+-----------------------------------+-----------+
2025-10-30 16:14:29 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.6951 | 0.0776 | 0.1067 | 20.924 | 216 | 140 | 6.7049 | 14.6395 |
| 25% | 0.6977 | 0.1144 | 0.1242 | 25.4125 | 230 | 176 | 6.9486 | 15.0216 |
| 50% | 1.4801 | 0.1371 | 0.1308 | 30.9778 | 243 | 228 | 7.373 | 15.9937 |
| 66% | 1.5328 | 0.14 | 0.1356 | 32.7173 | 261 | 249 | 7.6107 | 16.7284 |
| 75% | 1.534 | 0.1403 | 0.1371 | 33.7172 | 269 | 264 | 7.8192 | 17.2755 |
| 80% | 1.5349 | 0.1404 | 0.1383 | 34.6548 | 272 | 282 | 8.1374 | 17.6022 |
| 90% | 1.5361 | 0.1409 | 0.1403 | 37.0137 | 279 | 336 | 9.0777 | 19.6744 |
| 95% | 1.5372 | 0.1414 | 0.145 | 38.2948 | 299 | 379 | 9.8969 | 20.5134 |
| 98% | 1.5375 | 0.1417 | 0.1472 | 39.0541 | 332 | 415 | 10.6263 | 21.9542 |
| 99% | 1.5379 | 0.142 | 0.1492 | 44.0115 | 334 | 676 | 15.3596 | 28.4289 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:14:29 - evalscope - INFO: Save the summary to: outputs/20251030_161318/QWEN3VL

TORCH.SDPA CACHE 32B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 44.856 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 589.434 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1215.04 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5227 |
+-----------------------------------+-----------+
| Average latency (s) | 30.0095 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.6538 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.1268 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.1214 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 233.65 |
+-----------------------------------+-----------+
2025-10-30 16:25:19 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.7192 | 0.0667 | 0.1041 | 20.2229 | 216 | 134 | 6.6262 | 14.3746 |
| 25% | 1.7707 | 0.1134 | 0.1224 | 25.5349 | 230 | 176 | 6.8792 | 14.9965 |
| 50% | 1.7744 | 0.1363 | 0.1296 | 31.4246 | 243 | 230 | 7.3191 | 15.8199 |
| 66% | 1.777 | 0.14 | 0.1334 | 33.2203 | 261 | 251 | 7.5552 | 16.408 |
| 75% | 1.7784 | 0.1402 | 0.1366 | 34.5455 | 269 | 272 | 7.8737 | 16.8927 |
| 80% | 1.7794 | 0.1403 | 0.1375 | 35.2264 | 272 | 284 | 8.0621 | 17.6604 |
| 90% | 1.7806 | 0.1406 | 0.1401 | 37.9803 | 279 | 350 | 9.2153 | 19.3472 |
| 95% | 1.781 | 0.1412 | 0.1403 | 38.7801 | 299 | 377 | 9.7131 | 20.796 |
| 98% | 1.7826 | 0.1416 | 0.1497 | 39.6375 | 332 | 415 | 10.4699 | 21.9482 |
| 99% | 1.7833 | 0.1418 | 0.1504 | 44.8548 | 334 | 691 | 15.4053 | 27.9196 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:25:19 - evalscope - INFO: Save the summary to: outputs/20251030_162405/QWEN3VL
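
To compare these runs without reading each table by eye, the two-column summary tables can be parsed into a dictionary. This is only a minimal sketch: `parse_summary` is a hypothetical helper, not part of evalscope or vLLM, and it assumes the table layout stays exactly as pasted here (`| Key | Value |` rows separated by border lines). The sample rows are copied from the TORCH.SDPA CACHE 32B summary above.

```python
def parse_summary(text: str) -> dict[str, float]:
    """Parse a two-column 'Key | Value' summary table into a dict.

    Hypothetical helper: assumes the ASCII layout shown in this thread.
    """
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines, log lines, and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a summary row (for example, a percentile row)
        key, value = cells
        try:
            metrics[key] = float(value)
        except ValueError:
            pass  # the "Key | Value" header row has no numeric value
    return metrics


sample = """
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 44.856 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.6538 |
+-----------------------------------+-----------+
"""

metrics = parse_summary(sample)
print(metrics["Average time to first token (s)"])
```

Parsing two runs this way and dividing the matching entries gives a per-metric ratio between configurations.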

TORCH.SDPA WCACHE 32B
Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 72.023 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 346.015 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 709.578 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.466 |
+-----------------------------------+----------+
| Average latency (s) | 58.5657 |
+-----------------------------------+----------+
| Average time to first token (s) | 16.5453 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.1878 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.178 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 236.02 |
+-----------------------------------+----------+
2025-10-30 16:35:13 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 5.5968 | 0.0783 | 0.1332 | 50.1825 | 216 | 148 | 2.9492 | 7.395 |
| 25% | 10.4208 | 0.1194 | 0.1483 | 54.1575 | 230 | 182 | 3.3454 | 7.6963 |
| 50% | 17.7737 | 0.1394 | 0.1837 | 59.8414 | 243 | 233 | 3.8836 | 8.1659 |
| 66% | 22.5718 | 0.1404 | 0.1993 | 61.4438 | 261 | 250 | 4.0619 | 8.4115 |
| 75% | 24.8496 | 0.1406 | 0.2096 | 62.7496 | 269 | 271 | 4.3188 | 8.5015 |
| 80% | 24.898 | 0.1407 | 0.2275 | 63.2192 | 272 | 279 | 4.4166 | 8.6882 |
| 90% | 29.6564 | 0.1412 | 0.2433 | 64.8741 | 279 | 319 | 4.9172 | 9.0637 |
| 95% | 29.6575 | 0.1418 | 0.2941 | 66.929 | 299 | 387 | 5.7716 | 9.5356 |
| 98% | 30.4465 | 2.3613 | 0.326 | 68.209 | 332 | 447 | 6.5534 | 10.8864 |
| 99% | 30.4946 | 2.4065 | 0.358 | 72.0218 | 334 | 649 | 9.0112 | 12.3574 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:35:13 - evalscope - INFO: Save the summary to: outputs/20251030_163324/QWEN3VL

FA WCACHE 32B
Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 82.0359 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 309.161 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 634.552 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.3121 |
+-----------------------------------+----------+
| Average latency (s) | 67.1365 |
+-----------------------------------+----------+
| Average time to first token (s) | 22.5948 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.2024 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.189 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 235.62 |
+-----------------------------------+----------+
2025-10-30 16:41:55 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 8.7995 | 0.0782 | 0.1382 | 56.7753 | 216 | 134 | 2.3767 | 6.3967 |
| 25% | 11.2158 | 0.1143 | 0.1584 | 63.1105 | 230 | 184 | 2.88 | 6.7068 |
| 50% | 25.4009 | 0.1355 | 0.1908 | 68.9092 | 243 | 236 | 3.4319 | 7.1465 |
| 66% | 30.0787 | 0.1396 | 0.2172 | 70.5585 | 261 | 253 | 3.5993 | 7.3268 |
| 75% | 32.4449 | 0.14 | 0.2343 | 71.6447 | 269 | 275 | 3.8245 | 7.4334 |
| 80% | 32.446 | 0.1402 | 0.2408 | 72.2703 | 272 | 282 | 3.9076 | 7.5732 |
| 90% | 37.1238 | 0.1406 | 0.268 | 74.8065 | 279 | 347 | 4.6359 | 7.9018 |
| 95% | 37.1247 | 0.1412 | 0.3766 | 75.8954 | 299 | 385 | 5.0728 | 8.5029 |
| 98% | 39.4619 | 2.3378 | 0.4174 | 76.2096 | 332 | 398 | 5.2224 | 8.9433 |
| 99% | 39.4619 | 2.3726 | 0.4193 | 82.0341 | 334 | 705 | 8.594 | 11.5318 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

TORCH.SDPA WCACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 54.6005 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 646.204 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1100.72 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.8328 |
+-----------------------------------+-----------+
| Average latency (s) | 43.0512 |
+-----------------------------------+-----------+
| Average time to first token (s) | 10.3647 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.096 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0927 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 352.58 |
+-----------------------------------+-----------+
2025-10-30 18:23:48 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 4.9009 | 0.0617 | 0.0722 | 31.1491 | 216 | 201 | 6.4528 | 12.8676 |
| 25% | 6.1604 | 0.0654 | 0.083 | 40.5178 | 230 | 298 | 7.4399 | 13.3768 |
| 50% | 11.1589 | 0.0686 | 0.0941 | 44.7023 | 243 | 356 | 7.9449 | 13.9569 |
| 66% | 13.5212 | 0.0695 | 0.1 | 47.1132 | 261 | 397 | 8.4099 | 14.2436 |
| 75% | 14.7944 | 0.07 | 0.1049 | 48.9736 | 269 | 425 | 8.6723 | 14.5184 |
| 80% | 14.7957 | 0.0705 | 0.1083 | 49.2853 | 272 | 432 | 8.7653 | 14.7427 |
| 90% | 17.2209 | 0.0716 | 0.1184 | 51.7792 | 279 | 489 | 9.4536 | 15.3315 |
| 95% | 17.2224 | 0.075 | 0.1247 | 54.2727 | 299 | 579 | 10.6684 | 15.4736 |
| 98% | 17.7139 | 0.374 | 0.1822 | 54.465 | 332 | 593 | 10.8877 | 15.7445 |
| 99% | 17.717 | 1.2324 | 0.2145 | 54.5586 | 334 | 595 | 10.9057 | 16.3571 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:23:48 - evalscope - INFO: Save the summary to: outputs/20251030_182112/QWEN3VL

TORCH.SDPA CACHE MOE 30B

+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 53.1952 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1065.81 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1800.92 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.9643 |
+-----------------------------------+-----------+
| Average latency (s) | 23.2187 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.9222 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0648 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.062 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 359.55 |
+-----------------------------------+-----------+
2025-10-30 18:30:25 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.4521 | 0.0547 | 0.0627 | 13.4872 | 216 | 185 | 13.7167 | 23.2075 |
| 25% | 0.975 | 0.0608 | 0.0644 | 18.8408 | 230 | 269 | 14.2927 | 23.9291 |
| 50% | 1.069 | 0.0652 | 0.0651 | 24.4533 | 243 | 361 | 14.7429 | 25.3894 |
| 66% | 1.0782 | 0.0675 | 0.0666 | 26.6542 | 261 | 396 | 14.8773 | 26.5442 |
| 75% | 1.0813 | 0.0682 | 0.0669 | 28.1046 | 269 | 422 | 15.0153 | 27.3046 |
| 80% | 1.0828 | 0.0685 | 0.0673 | 28.5103 | 272 | 429 | 15.1061 | 28.5251 |
| 90% | 1.0857 | 0.0693 | 0.0675 | 31.1938 | 279 | 489 | 15.6762 | 36.0011 |
| 95% | 1.087 | 0.0698 | 0.0685 | 32.5589 | 299 | 536 | 16.4625 | 42.8358 |
| 98% | 1.0899 | 0.0703 | 0.0694 | 33.7322 | 332 | 605 | 17.9354 | 48.2691 |
| 99% | 1.0942 | 0.0714 | 0.0731 | 53.1798 | 334 | 2048 | 38.5109 | 58.3012 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:30:25 - evalscope - INFO: Save the summary to: outputs/20251030_182906/QWEN3VL

TORCH.SDPA WCACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 63.534 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 844.695 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1423.39 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.3335 |
+-----------------------------------+-----------+
| Average latency (s) | 31.4613 |
+-----------------------------------+-----------+
| Average time to first token (s) | 4.5218 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0795 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0744 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 361.98 |
+-----------------------------------+-----------+
2025-10-30 18:38:31 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.8812 | 0.0589 | 0.07 | 20.6832 | 216 | 187 | 8.9648 | 17.7821 |
| 25% | 3.0235 | 0.0641 | 0.0723 | 26.9901 | 230 | 270 | 10.0778 | 18.0466 |
| 50% | 4.7063 | 0.07 | 0.0773 | 32.8515 | 243 | 366 | 11.0915 | 18.8519 |
| 66% | 5.852 | 0.0709 | 0.0809 | 34.7591 | 261 | 396 | 11.4274 | 19.4505 |
| 75% | 6.4132 | 0.0716 | 0.0837 | 36.7817 | 269 | 426 | 11.6507 | 20.0717 |
| 80% | 6.416 | 0.0721 | 0.0865 | 37.6976 | 272 | 441 | 11.7754 | 20.6697 |
| 90% | 7.7442 | 0.0779 | 0.092 | 40.0254 | 279 | 492 | 12.3524 | 21.7433 |
| 95% | 7.8041 | 0.0809 | 0.0958 | 41.032 | 299 | 531 | 12.9411 | 23.9661 |
| 98% | 7.9866 | 0.0872 | 0.1244 | 42.8509 | 332 | 640 | 14.9355 | 31.5263 |
| 99% | 7.9867 | 0.5745 | 0.1444 | 63.4683 | 334 | 2048 | 32.2681 | 36.144 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:38:31 - evalscope - INFO: Save the summary to: outputs/20251030_183545/QWEN3VL

TORCH.SDPA CACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 57.4744 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1036.16 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1731.83 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.8052 |
+-----------------------------------+-----------+
| Average latency (s) | 25.2854 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.7732 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.069 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0664 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 369.37 |
+-----------------------------------+-----------+
2025-10-30 18:40:30 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.5994 | 0.0585 | 0.0663 | 16.5848 | 216 | 219 | 13.283 | 21.7653 |
| 25% | 0.6018 | 0.0647 | 0.0687 | 20.3705 | 230 | 276 | 13.549 | 22.258 |
| 50% | 0.6065 | 0.0692 | 0.0698 | 26.0267 | 243 | 363 | 13.9795 | 23.8249 |
| 66% | 0.9156 | 0.0706 | 0.0703 | 28.4996 | 261 | 400 | 14.0588 | 24.9101 |
| 75% | 0.9809 | 0.0711 | 0.071 | 30.1839 | 269 | 428 | 14.1787 | 26.0163 |
| 80% | 0.9822 | 0.0713 | 0.0712 | 31.518 | 272 | 452 | 14.341 | 26.4757 |
| 90% | 0.9836 | 0.0728 | 0.072 | 33.5213 | 279 | 497 | 14.8264 | 31.3297 |
| 95% | 0.9848 | 0.0826 | 0.073 | 34.5789 | 299 | 527 | 15.2405 | 36.2241 |
| 98% | 0.9884 | 0.0844 | 0.0733 | 35.6343 | 332 | 581 | 16.3045 | 48.0304 |
| 99% | 0.9885 | 0.0849 | 0.0741 | 57.3962 | 334 | 2048 | 35.6818 | 73.2013 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

TORCH.SDPA WCACHE EP MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 73.2821 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 691.217 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1169.45 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.9285 |
+-----------------------------------+-----------+
| Average latency (s) | 40.1204 |
+-----------------------------------+-----------+
| Average time to first token (s) | 8.6339 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0944 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0878 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 358.43 |
+-----------------------------------+-----------+
2025-10-30 18:51:11 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 3.1656 | 0.0595 | 0.0753 | 30.4569 | 216 | 202 | 6.6323 | 13.8441 |
| 25% | 4.3582 | 0.0674 | 0.0809 | 36.6867 | 230 | 282 | 7.7017 | 14.2868 |
| 50% | 9.2658 | 0.0711 | 0.0903 | 40.3674 | 243 | 346 | 8.6001 | 14.7629 |
| 66% | 11.6628 | 0.0723 | 0.0978 | 43.7976 | 261 | 391 | 8.9156 | 15.2021 |
| 75% | 12.8479 | 0.081 | 0.1043 | 44.5376 | 269 | 405 | 9.1366 | 15.3761 |
| 80% | 12.8502 | 0.0822 | 0.1072 | 45.5359 | 272 | 420 | 9.2346 | 15.7422 |
| 90% | 15.2021 | 0.0851 | 0.1163 | 47.0097 | 279 | 458 | 9.7697 | 16.0486 |
| 95% | 15.2032 | 0.0862 | 0.1253 | 50.3847 | 299 | 533 | 10.5868 | 16.5803 |
| 98% | 15.6115 | 0.1168 | 0.1811 | 51.8543 | 332 | 616 | 11.8795 | 16.8974 |
| 99% | 15.6119 | 1.1942 | 0.1987 | 73.2774 | 334 | 2048 | 27.9486 | 30.9782 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:51:11 - evalscope - INFO: Save the summary to: outputs/20251030_184812/QWEN3VL

TORCH.SDPA CACHE EP MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 59.5439 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 955.403 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1612.54 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.6498 |
+-----------------------------------+-----------+
| Average latency (s) | 26.292 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.7713 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0734 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0708 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 360.55 |
+-----------------------------------+-----------+
2025-10-30 18:55:11 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.4228 | 0.0596 | 0.0698 | 13.0905 | 216 | 165 | 12.5655 | 20.4495 |
| 25% | 0.4249 | 0.0687 | 0.073 | 22.5016 | 230 | 284 | 12.7066 | 20.8741 |
| 50% | 0.9214 | 0.0713 | 0.0744 | 29.0387 | 243 | 379 | 13.0516 | 22.22 |
| 66% | 0.9245 | 0.0779 | 0.0751 | 30.9603 | 261 | 409 | 13.1903 | 23.5255 |
| 75% | 0.9255 | 0.0808 | 0.0758 | 31.7657 | 269 | 422 | 13.279 | 24.5137 |
| 80% | 0.9268 | 0.0814 | 0.0762 | 32.5434 | 272 | 436 | 13.3975 | 25.0712 |
| 90% | 0.9286 | 0.0835 | 0.0765 | 35.5313 | 279 | 498 | 14.0158 | 34.2758 |
| 95% | 0.9295 | 0.0852 | 0.0779 | 37.0333 | 299 | 544 | 14.6895 | 44.5117 |
| 98% | 0.9313 | 0.086 | 0.0782 | 37.6751 | 332 | 578 | 15.3417 | 57.3209 |
| 99% | 0.9313 | 0.0866 | 0.0783 | 59.5296 | 334 | 2048 | 34.403 | 157.1854 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:55:11 - evalscope - INFO: Save the summary to: outputs/20251030_185346/QWEN3VL

FA WCACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 76.1157 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 665.335 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1103.29 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.766 |
+-----------------------------------+-----------+
| Average latency (s) | 47.4141 |
+-----------------------------------+-----------+
| Average time to first token (s) | 14.2678 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0954 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.088 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 376.74 |
+-----------------------------------+-----------+
2025-10-30 19:03:02 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.4604 | 0.0565 | 0.0691 | 35.5685 | 216 | 176 | 4.9482 | 11.8001 |
| 25% | 7.6708 | 0.0626 | 0.0805 | 44.3216 | 230 | 303 | 6.8364 | 12.2502 |
| 50% | 15.7807 | 0.0665 | 0.0913 | 48.8625 | 243 | 379 | 7.7565 | 12.7922 |
| 66% | 18.2435 | 0.0678 | 0.0998 | 50.8784 | 261 | 415 | 8.166 | 13.1782 |
| 75% | 19.4037 | 0.0683 | 0.1059 | 52.2424 | 269 | 438 | 8.3551 | 13.4079 |
| 80% | 19.4049 | 0.0686 | 0.1087 | 53.3581 | 272 | 458 | 8.5774 | 13.5085 |
| 90% | 23.4638 | 0.0693 | 0.1282 | 54.5664 | 279 | 491 | 8.9963 | 13.9303 |
| 95% | 23.4652 | 0.0704 | 0.1408 | 55.5348 | 299 | 526 | 9.4715 | 14.6384 |
| 98% | 24.1829 | 0.0836 | 0.2045 | 56.6207 | 332 | 604 | 10.6675 | 15.542 |
| 99% | 24.1831 | 1.2243 | 0.2184 | 76.05 | 334 | 2048 | 26.9297 | 30.1644 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 19:03:02 - evalscope - INFO: Save the summary to: outputs/20251030_190005/QWEN3VL

FA CACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 54.3489 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1078.18 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1787.42 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.86 |
+-----------------------------------+-----------+
| Average latency (s) | 24.9804 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.8962 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0663 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0639 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 376.99 |
+-----------------------------------+-----------+
2025-10-30 19:13:19 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.8588 | 0.0531 | 0.0629 | 17.39 | 216 | 243 | 13.9782 | 22.4605 |
| 25% | 0.9095 | 0.0634 | 0.0665 | 21.9884 | 230 | 308 | 14.0554 | 23.0041 |
| 50% | 0.9142 | 0.0676 | 0.0673 | 25.8729 | 243 | 371 | 14.3393 | 24.1448 |
| 66% | 0.9171 | 0.0685 | 0.0678 | 27.3414 | 261 | 395 | 14.447 | 24.917 |
| 75% | 0.9191 | 0.0689 | 0.0682 | 28.7832 | 269 | 421 | 14.6266 | 25.6518 |
| 80% | 0.9201 | 0.0692 | 0.0683 | 30.2443 | 272 | 449 | 14.8458 | 26.3901 |
| 90% | 0.9231 | 0.0736 | 0.0685 | 32.9384 | 279 | 510 | 15.5011 | 34.2137 |
| 95% | 0.9254 | 0.0778 | 0.0686 | 33.5483 | 299 | 527 | 15.7204 | 44.9994 |
| 98% | 0.9266 | 0.0803 | 0.0688 | 34.9626 | 332 | 609 | 17.4186 | 48.8927 |
| 99% | 0.9269 | 0.0812 | 0.0732 | 54.2814 | 334 | 2048 | 37.7293 | 49.748 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 19:13:19 - evalscope - INFO: Save the summary to: outputs/20251030_191159/QWEN3VL

FA WCACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 77.855 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 660.657 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1094.2 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.7482 |
+-----------------------------------+-----------+
| Average latency (s) | 44.7356 |
+-----------------------------------+-----------+
| Average time to first token (s) | 13.6044 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0896 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0824 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 377.9 |
+-----------------------------------+-----------+
2025-10-30 20:19:57 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.9057 | 0.0595 | 0.0692 | 35.1505 | 216 | 213 | 6.1236 | 12.7879 |
| 25% | 9.0109 | 0.0651 | 0.0764 | 40.3813 | 230 | 296 | 7.2493 | 13.1699 |
| 50% | 15.3092 | 0.0704 | 0.0816 | 45.1786 | 243 | 367 | 8.1067 | 13.4851 |
| 66% | 16.4454 | 0.0713 | 0.0899 | 48.0438 | 261 | 406 | 8.479 | 13.9981 |
| 75% | 17.5254 | 0.0717 | 0.0973 | 49.9678 | 269 | 443 | 8.8593 | 14.224 |
| 80% | 17.5274 | 0.0719 | 0.1007 | 50.987 | 272 | 457 | 8.9904 | 14.2892 |
| 90% | 18.1543 | 0.0725 | 0.1182 | 52.931 | 279 | 505 | 9.5407 | 14.651 |
| 95% | 20.2161 | 0.0733 | 0.1315 | 55.7345 | 299 | 585 | 10.4962 | 15.5217 |
| 98% | 20.4034 | 0.0793 | 0.1825 | 57.2001 | 332 | 662 | 11.5734 | 16.0438 |
| 99% | 20.4035 | 0.568 | 0.2522 | 77.8514 | 334 | 2048 | 26.3065 | 29.1581 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 20:19:57 - evalscope - INFO: Save the summary to: outputs/20251030_201700/QWEN3VL

FA CACHE EP MMDATA MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 39.4272 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 963.856 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1617.63 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.6363 |
+-----------------------------------+-----------+
| Average latency (s) | 26.3642 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.1381 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.07 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.069 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 365.61 |
+-----------------------------------+-----------+
2025-10-30 20:26:22 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.0893 | 0.0605 | 0.0654 | 16.6581 | 216 | 213 | 12.8062 | 21.1437 |
| 25% | 1.1412 | 0.0648 | 0.0696 | 22.5231 | 230 | 297 | 13.1865 | 21.7527 |
| 50% | 1.1453 | 0.0702 | 0.0704 | 26.6384 | 243 | 362 | 13.5894 | 22.8363 |
| 66% | 1.1475 | 0.0711 | 0.0711 | 30.03 | 261 | 413 | 13.7529 | 24.3199 |
| 75% | 1.1494 | 0.0717 | 0.0717 | 31.3657 | 269 | 436 | 13.9005 | 24.6126 |
| 80% | 1.15 | 0.0722 | 0.0722 | 32.9363 | 272 | 464 | 14.0878 | 24.9927 |
| 90% | 1.2555 | 0.0819 | 0.0733 | 35.7575 | 279 | 530 | 14.8221 | 30.2387 |
| 95% | 1.2562 | 0.0842 | 0.0736 | 37.223 | 299 | 569 | 15.2862 | 34.0216 |
| 98% | 1.2584 | 0.0848 | 0.0737 | 37.931 | 332 | 597 | 15.7391 | 45.9805 |
| 99% | 1.3102 | 0.0862 | 0.0742 | 39.4253 | 334 | 696 | 17.6536 | 68.6258 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 20:26:22 - evalscope - INFO: Save the summary to: outputs/20251030_202518/QWEN3VL

FA WCACHE EP MOE 30B

Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 84.1858 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 569.827 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 958.643 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.5679 |
+-----------------------------------+----------+
| Average latency (s) | 50.9302 |
+-----------------------------------+----------+
| Average time to first token (s) | 15.1424 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.11 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.0985 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 363.44 |
+-----------------------------------+----------+
2025-10-30 21:11:50 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.4485 | 0.0593 | 0.0763 | 40.0743 | 216 | 195 | 4.866 | 10.7384 |
| 25% | 7.812 | 0.0654 | 0.0857 | 45.6035 | 230 | 266 | 5.7993 | 11.1949 |
| 50% | 16.0646 | 0.0704 | 0.103 | 52.5351 | 243 | 367 | 6.9667 | 11.6967 |
| 66% | 20.1423 | 0.0718 | 0.1129 | 55.2501 | 261 | 404 | 7.3451 | 12.0482 |
| 75% | 21.4187 | 0.0805 | 0.1192 | 57.4467 | 269 | 442 | 7.6941 | 12.1421 |
| 80% | 21.4696 | 0.0816 | 0.1272 | 58.3506 | 272 | 457 | 7.8713 | 12.271 |
| 90% | 25.499 | 0.0852 | 0.1502 | 59.9887 | 279 | 500 | 8.3033 | 12.6402 |
| 95% | 25.5026 | 0.086 | 0.1682 | 60.9027 | 299 | 525 | 8.6203 | 13.253 |
| 98% | 26.2705 | 0.0974 | 0.2562 | 63.7788 | 332 | 684 | 10.7246 | 14.8639 |
| 99% | 26.3214 | 1.2897 | 0.2805 | 84.1223 | 334 | 2048 | 24.3455 | 27.2698 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 21:11:50 - evalscope - INFO: Save the summary to: outputs/20251030_210843/QWEN3VL

FA CACHE EP MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 60.5891 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 934.445 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1556.28 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5075 |
+-----------------------------------+-----------+
| Average latency (s) | 26.8537 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.9026 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0724 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0696 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 372.66 |
+-----------------------------------+-----------+
2025-10-30 21:16:44 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.3744 | 0.0584 | 0.0688 | 15.3015 | 216 | 194 | 12.614 | 20.5148 |
| 25% | 0.9796 | 0.0663 | 0.0715 | 21.6323 | 230 | 273 | 12.7091 | 21.0648 |
| 50% | 1.0326 | 0.0707 | 0.0734 | 28.5254 | 243 | 375 | 13.1462 | 22.4579 |
| 66% | 1.0355 | 0.0731 | 0.0744 | 31.0512 | 261 | 414 | 13.3328 | 23.4039 |
| 75% | 1.0366 | 0.0799 | 0.075 | 33.0748 | 269 | 448 | 13.5321 | 23.8997 |
| 80% | 1.0376 | 0.0811 | 0.0755 | 33.6966 | 272 | 459 | 13.6176 | 24.7821 |
| 90% | 1.0398 | 0.0851 | 0.0759 | 36.6518 | 279 | 522 | 14.2421 | 32.0522 |
| 95% | 1.0405 | 0.0859 | 0.0775 | 38.213 | 299 | 567 | 14.8379 | 43.8427 |
| 98% | 1.0412 | 0.0864 | 0.0781 | 39.8789 | 332 | 653 | 16.3746 | 47.2565 |
| 99% | 1.044 | 0.0866 | 0.0781 | 60.5256 | 334 | 2048 | 33.8369 | 63.1981 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 21:16:44 - evalscope - INFO: Save the summary to: outputs/20251030_211519/QWEN3VL

JartX · 2025-10-30T21:21:13Z

@tjtanaa

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA CACHE MOE 30B	1078.18	24.9804	0.8962	1787.42	2.86
TORCH.SDPA CACHE MOE 30B	1065.81	23.2187	0.9222	1800.92	2.9643
TORCH.SDPA CACHE EP MMDATA MOE 30B	1036.16	25.2854	0.7732	1731.83	2.8052
FA CACHE EP MMDATA MOE 30B	963.856	26.3642	1.1381	1617.63	2.6363
TORCH.SDPA CACHE EP MOE 30B	955.403	26.292	0.7713	1612.54	2.6498
FA CACHE EP MOE 30B	934.445	26.8537	0.9026	1556.28	2.5075
TORCH.SDPA WCACHE EP MMDATA MOE 30B	844.695	31.4613	4.5218	1423.39	2.3335
TORCH.SDPA WCACHE EP MOE 30B	691.217	40.1204	8.6339	1169.45	1.9285
FA WCACHE MOE 30B	665.335	47.4141	14.2678	1103.29	1.766
FA WCACHE EP MMDATA MOE 30B	660.657	44.7356	13.6044	1094.2	1.7482
TORCH.SDPA WCACHE MOE 30B	646.204	43.0512	10.3647	1100.72	1.8328
FA CACHE 32B	592.918	29.6106	1.1861	1227.88	2.5604
TORCH.SDPA CACHE 32B	589.434	30.0095	1.6538	1215.04	2.5227
FA WCACHE EP MOE 30B	569.827	50.9302	15.1424	958.643	1.5679
TORCH.SDPA WCACHE 32B	346.015	58.5657	16.5453	709.578	1.466
FA WCACHE 32B	309.161	67.1365	22.5948	634.552	1.3121

tjtanaa · 2025-10-30T21:34:57Z

@JartX From your data, I think it shows that, in general, TORCH_SDPA is still the better option.

tjtanaa · 2025-10-30T21:36:26Z

CACHE MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA CACHE MOE 30B	1078.18	24.9804	0.8962	1787.42	2.86
TORCH.SDPA CACHE MOE 30B	1065.81	23.2187	0.9222	1800.92	2.9643

TORCH.SDPA (better total throughput and request throughput)

CACHE EP MMDATA MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA CACHE EP MMDATA MOE 30B	963.856	26.3642	1.1381	1617.63	2.6363
TORCH.SDPA CACHE EP MMDATA MOE 30B	1036.16	25.2854	0.7732	1731.83	2.8052

TORCH.SDPA (better across all metrics)

CACHE EP MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA CACHE EP MOE 30B	934.445	26.8537	0.9026	1556.28	2.5075
TORCH.SDPA CACHE EP MOE 30B	955.403	26.292	0.7713	1612.54	2.6498

TORCH.SDPA (better across all metrics)

WCACHE EP MMDATA MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA WCACHE EP MMDATA MOE 30B	660.657	44.7356	13.6044	1094.2	1.7482
TORCH.SDPA WCACHE EP MMDATA MOE 30B	844.695	31.4613	4.5218	1423.39	2.3335

TORCH.SDPA (significantly better across all metrics)

WCACHE EP MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA WCACHE EP MOE 30B	569.827	50.9302	15.1424	958.643	1.5679
TORCH.SDPA WCACHE EP MOE 30B	691.217	40.1204	8.6339	1169.45	1.9285

TORCH.SDPA (significantly better across all metrics)

WCACHE MOE 30B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA WCACHE MOE 30B	665.335	47.4141	14.2678	1103.29	1.766
TORCH.SDPA WCACHE MOE 30B	646.204	43.0512	10.3647	1100.72	1.8328

Mixed - FA has better output throughput, TORCH.SDPA has better latency and TTFT

CACHE 32B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA CACHE 32B	592.918	29.6106	1.1861	1227.88	2.5604
TORCH.SDPA CACHE 32B	589.434	30.0095	1.6538	1215.04	2.5227

FA (slightly better output throughput and TTFT)

WCACHE 32B

Configuration	Output Throughput (tok/s)	Avg Latency (s)	Avg TTFT (s)	Total Throughput (tok/s)	Request Throughput (req/s)
FA WCACHE 32B	309.161	67.1365	22.5948	634.552	1.3121
TORCH.SDPA WCACHE 32B	346.015	58.5657	16.5453	709.578	1.466

TORCH.SDPA (better across all metrics)
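The per-configuration verdicts above can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from the summary table in this thread, while the names `RESULTS`, `relative_change`, and `compare` are made up for illustration and are not part of vLLM or evalscope.

```python
# (output throughput tok/s, average latency s, average TTFT s) per backend,
# copied from the summary table posted in this thread.
RESULTS = {
    "CACHE MOE 30B": {"FA": (1078.18, 24.9804, 0.8962), "TORCH.SDPA": (1065.81, 23.2187, 0.9222)},
    "CACHE EP MMDATA MOE 30B": {"FA": (963.856, 26.3642, 1.1381), "TORCH.SDPA": (1036.16, 25.2854, 0.7732)},
    "CACHE EP MOE 30B": {"FA": (934.445, 26.8537, 0.9026), "TORCH.SDPA": (955.403, 26.292, 0.7713)},
    "WCACHE EP MMDATA MOE 30B": {"FA": (660.657, 44.7356, 13.6044), "TORCH.SDPA": (844.695, 31.4613, 4.5218)},
    "WCACHE EP MOE 30B": {"FA": (569.827, 50.9302, 15.1424), "TORCH.SDPA": (691.217, 40.1204, 8.6339)},
    "WCACHE MOE 30B": {"FA": (665.335, 47.4141, 14.2678), "TORCH.SDPA": (646.204, 43.0512, 10.3647)},
    "CACHE 32B": {"FA": (592.918, 29.6106, 1.1861), "TORCH.SDPA": (589.434, 30.0095, 1.6538)},
    "WCACHE 32B": {"FA": (309.161, 67.1365, 22.5948), "TORCH.SDPA": (346.015, 58.5657, 16.5453)},
}


def relative_change(fa_value, sdpa_value):
    """Percent change of TORCH.SDPA relative to FA."""
    return (sdpa_value - fa_value) / fa_value * 100.0


def compare(results):
    """Return (config, throughput %, latency %, TTFT %) for each configuration."""
    rows = []
    for config, backends in results.items():
        fa, sdpa = backends["FA"], backends["TORCH.SDPA"]
        rows.append((config,) + tuple(relative_change(f, s) for f, s in zip(fa, sdpa)))
    return rows


for config, thr, lat, ttft in compare(RESULTS):
    # Positive throughput change favors TORCH.SDPA; negative latency/TTFT change favors it too.
    print(f"{config:26s} throughput {thr:+6.1f}%  latency {lat:+6.1f}%  TTFT {ttft:+6.1f}%")
```

Each row is a single benchmark run, so small differences (a percent or two) may be within run-to-run noise.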

JartX · 2025-10-30T21:44:13Z

@tjtanaa It surprised me too, but I don't understand why xD. Could you try to explain it to me? Please? Even if it's just because of the time I've spent with the benchmarks hahaha

JartX · 2025-10-31T13:15:56Z

@tjtanaa

Accuracy Test:

FA

Tasks	Version	Filter	Metric		Value		Stderr
chartqa	0	none	anywhere_accuracy	↑	0.795	±	0.0286
		none	exact_match	↑	0.570	±	0.0351
		none	relaxed_accuracy	↑	0.795	±	0.0286

TORCH.SDPA

Tasks	Version	Filter	Metric		Value		Stderr
chartqa	0	none	anywhere_accuracy	↑	0.775	±	0.0296
		none	exact_match	↑	0.565	±	0.0351
		none	relaxed_accuracy	↑	0.770	±	0.0298

tjtanaa · 2025-11-01T18:41:21Z

@JartX I don't think the accuracy change is large enough to say there is a huge degradation; we may need to evaluate on more datasets.

Accuracy Changes (FA → TORCH.SDPA)

Metric	FA Value	TORCH.SDPA Value	Change	Percentage Change
anywhere_accuracy	0.795	0.775	-0.020	-2.52%
exact_match	0.570	0.565	-0.005	-0.88%
relaxed_accuracy	0.795	0.770	-0.025	-3.14%
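As a rough check on the table above, the differences can be compared against the reported standard errors. The sketch below uses the chartqa values and Stderr columns posted earlier in this thread. It assumes the two runs are independent so that the standard errors combine in quadrature, which is a simplification because both runs score the same samples. The function and variable names are illustrative only.

```python
import math

# (value, stderr) pairs copied from the chartqa results posted in this thread.
FA = {
    "anywhere_accuracy": (0.795, 0.0286),
    "exact_match": (0.570, 0.0351),
    "relaxed_accuracy": (0.795, 0.0286),
}
SDPA = {
    "anywhere_accuracy": (0.775, 0.0296),
    "exact_match": (0.565, 0.0351),
    "relaxed_accuracy": (0.770, 0.0298),
}


def compare_metric(fa, sdpa):
    """Return (absolute change, percent change, change in units of combined stderr)."""
    (fa_val, fa_err), (sdpa_val, sdpa_err) = fa, sdpa
    diff = sdpa_val - fa_val
    pct = diff / fa_val * 100.0
    # Assumption: independent runs, so standard errors add in quadrature.
    combined_err = math.hypot(fa_err, sdpa_err)
    return diff, pct, diff / combined_err


for name in FA:
    diff, pct, z = compare_metric(FA[name], SDPA[name])
    print(f"{name:18s} change {diff:+.3f} ({pct:+.2f}%)  change/stderr {z:+.2f}")
```

Under that assumption every change is smaller than one combined standard error, which is in line with the view that this single dataset does not show a clear degradation.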

Regarding the speed comparison between Triton FA and torch.sdpa, in my opinion:

The Triton FA kernels were developed for Instinct GPUs, so their performance benefits do not necessarily transfer to Radeon GPUs.
The Triton FA kernels are older implementations of FA, whereas the torch.sdpa implementation is tied to PyTorch optimizations. AMD has been bringing optimizations to PyTorch operators, which could be why torch.sdpa is catching up with Triton FA.

tjtanaa · 2025-11-02T02:30:55Z

@JartX I would like to suggest an alternative. Since AITER is also installed on Radeon, can you explore using AITER's Triton flash_attn_varlen_func implementation instead (https://github.com/ROCm/aiter/blob/294b779c6cc9790dbfd1815ddcf0026c62008456/aiter/ops/triton/mha.py#L955)?

See if it is faster?

JartX · 2025-11-02T22:23:19Z

@tjtanaa thanks for the idea — I tried it back in the day, but inference isn’t possible due to lack of hardware support. For example:

(worker_TP2 pid=5145) /usr/local/lib/python3.12/dist-packages/aiter/jit/build/ck/include/ck_tile/core/arch/amd_buffer_addressing_builtins_hip.hpp:1168:22: error: invalid operand for instruction
(worker_TP2 pid=5145)  1168 |         asm volatile("buffer_load_dword %1, %2, 0 offen offset:%3 lds"
(worker_TP2 pid=5145)       |                      ^
(worker_TP2 pid=5145) <inline asm>:1:25: note: instantiated into assembly here
(worker_TP2 pid=5145)     1 |         buffer_load_dword v42, s[24:27], 0 offen offset:0 lds

I've also seen your PR #27919. With it, this PR (#27776) could even be dropped, or reduced to just the Dockerfile change along with a wiki update referencing ROCm and RDNA3.

And sorry for my ignorance, but with your PR, is it possible to force upstream Flash Attention?

If so, it would be really easy to offer both attention backends in case FA improves later.

Thank you so much for your time.

tjtanaa · 2025-11-02T22:43:26Z

@JartX I am referring to the Triton implementation from the AITER repo. Does invoking the Triton implementation trigger the asm error?

JartX · 2025-11-03T08:29:16Z

@tjtanaa Hi, it also fails to start up; it seems to only be supported for X Arch:
https://github.com/ROCm/aiter/blob/294b779c6cc9790dbfd1815ddcf0026c62008456/aiter/ops/triton/utils/_triton/arch_info.py

JartX · 2025-11-03T08:29:55Z

I'm going to add an entry to spoof the GPU:

"gfx1100": "MI300X"
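To illustrate what such a spoof amounts to, here is a minimal, self-contained sketch. It does not reproduce AITER's actual code: `SUPPORTED_ARCHS` and `device_name` are hypothetical stand-ins for an architecture lookup table and its accessor, and the real structure in arch_info.py may differ.

```python
# Hypothetical stand-in for an architecture-to-device lookup table.
# gfx942 is the MI300X architecture; gfx1100 (RDNA3) is deliberately absent.
SUPPORTED_ARCHS = {"gfx942": "MI300X"}


def device_name(arch, table):
    """Look up the device name for a GPU architecture, failing on unknown ones."""
    try:
        return table[arch]
    except KeyError:
        raise RuntimeError(f"unsupported architecture: {arch}") from None


# The spoof: map the RDNA3 architecture onto an entry the library already knows.
spoofed = {**SUPPORTED_ARCHS, "gfx1100": "MI300X"}

print(device_name("gfx1100", spoofed))  # lookup now succeeds
```

A spoof like this only gets past the lookup. Any tuning selected for the spoofed device was chosen for different hardware, so it is suitable for experiments rather than as a supported configuration.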

JartX · 2025-11-03T08:50:03Z

@tjtanaa
Okay, I was able to run it, but the results barely differ from upstream fa; TORCH.SDPA is still better, so I'm closing the PR until maybe this changes :)

JartX added 3 commits October 29, 2025 23:37

fa_upstream_detection for rdna3 rocm

22de4f2

Signed-off-by: JartX <[email protected]>

working_on_fa_rdna3

25028eb

Signed-off-by: JartX <[email protected]>

remove is_rocm_aiter

58f0be7

Signed-off-by: JartX <[email protected]>

JartX requested review from LucasWilkinson, gshtras and sighingnow as code owners October 30, 2025 00:23

JartX marked this pull request as draft October 30, 2025 00:24

mergify bot added the ci/build, qwen, and rocm labels Oct 30, 2025

gemini-code-assist bot reviewed Oct 30, 2025

vllm/attention/ops/vit_attn_wrappers.py (outdated)

missing () on on_gfx9()

90d3b7c

Signed-off-by: JartX <[email protected]>

default FLASH_ATTENTION_TRITON_AMD_ENABLE GPU_ARCHS if passed on build-args

cf36822

Signed-off-by: JartX <[email protected]>

JartX mentioned this pull request Oct 30, 2025

[BugFix][VL] Fix FA selection on Qwen2.5-VL #27790

Merged

Merge branch 'main' into feature/upstream_fa_rdna3_rocm

6244384

JartX force-pushed the feature/upstream_fa_rdna3_rocm branch from 1b770fd to 6244384 Compare October 30, 2025 12:48

JartX added 2 commits October 30, 2025 13:53

readd code and 7900XTX device id

2af8555

Signed-off-by: JartX <[email protected]>

readd code and 7900XTX device id

c03438b

Signed-off-by: JartX <[email protected]>

Lucaskabela suggested changes Oct 30, 2025

Merge branch 'main' into feature/upstream_fa_rdna3_rocm

cd442f6

Merge branch 'main' into feature/upstream_fa_rdna3_rocm

11a441e

JartX closed this Nov 3, 2025
