Conversation

@JartX
Contributor

@JartX JartX commented Oct 30, 2025

I'm trying to follow the latest logic introduced with @Lucaskabela's wrapper while also fixing the execution path. Before the wrapper landed, I already had a working version with minimal logic in rocm.py and layer.py. Right now I can get it to work with FLASH_ATTN, without falling back to torch.SDPA. I'd appreciate your help:

@DarkLight1337 @tjtanaa @lgeiger @Lucaskabela

I am worried about the following part in qwen2_5_vl.py:

context_layer = vit_flash_attn_wrapper(
    q,
    k,
    v,
    cu_seqlens,
    max_seqlen,
    batch_size,
    self.attn_backend == _Backend.ROCM_AITER_FA,
    self.use_upstream_fa,
)

I think the parameters are inconsistent:

def vit_flash_attn_wrapper(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,
    max_seqlen: torch.Tensor,
    batch_size: int,
    is_rocm_aiter: bool,
    use_upstream_fa: bool,
) -> torch.Tensor:
    return torch.ops.vllm.flash_attn_maxseqlen_wrapper(
        q, k, v, cu_seqlens, max_seqlen, batch_size, is_rocm_aiter, use_upstream_fa
    )

is_rocm_aiter: bool == ROCM_AITER_FA
self.attn_backend == _Backend.ROCM_AITER_FA,

I moved the backend selection into flash_attn_maxseqlen_wrapper, and I'm sure I did it poorly; @tjtanaa will surely hit me for this xD

Please take a look. If you need access to my repository, I'll give it to everyone :)

Thank you so much!

@JartX JartX marked this pull request as draft October 30, 2025 00:24
@mergify mergify bot added ci/build qwen Related to Qwen models rocm Related to AMD ROCm labels Oct 30, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for Vision Transformer Flash Attention on RDNA3 with ROCM. The changes include updating the Dockerfile to conditionally build Flash Attention, and modifying the attention layers to correctly select the ViT attention backend for ROCm platforms. The refactoring to centralize backend selection logic is a good improvement. However, I've found a critical issue where a function is used as a boolean value instead of being called, which will lead to incorrect behavior at runtime. Please see the specific comment for details.

@JartX
Contributor Author

JartX commented Oct 30, 2025

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request integrates support for upstream Flash Attention on AMD RDNA3 (gfx1100) GPUs within the ROCm platform. It refactors the attention backend selection mechanism, moving away from a passed boolean parameter to explicit checks of the current platform, environment variables, and GPU architecture. This change centralizes the logic for choosing between different Flash Attention implementations, including the AITER version for gfx9 and the newly integrated upstream Flash Attention for RDNA3, so that the backend is chosen according to the hardware and configuration.

Highlights

  • RDNA3 Flash Attention Support: Introduced conditional compilation and runtime detection for upstream Flash Attention on AMD RDNA3 (gfx1100) GPUs, enabling enhanced performance for these architectures within the ROCm ecosystem.
  • Refactored Attention Backend Selection: Streamlined the logic for selecting the appropriate Flash Attention backend by removing the is_rocm_aiter boolean parameter from wrapper functions. The decision-making is now centralized and based on explicit checks of the current platform, environment variables, and GPU architecture, addressing the author's concern about parameter coherence.
  • Dynamic Upstream FA Detection: The qwen2_5_vl.py model now dynamically checks for the availability of the flash_attn library at runtime to determine whether to use the upstream Flash Attention implementation.
  • Dockerfile Enhancements: The ROCm Dockerfile now includes git as a dependency and conditionally compiles the ROCm/flash-attention library if specific environment variables (FLASH_ATTENTION_TRITON_AMD_ENABLE and GPU_ARCHS) are set, facilitating easier setup for RDNA3 Flash Attention.

Changelog

  • docker/Dockerfile.rocm
    • Added git to the apt-get install command.
    • Implemented conditional build logic for ROCm/flash-attention based on FLASH_ATTENTION_TRITON_AMD_ENABLE and GPU_ARCHS environment variables.
  • vllm/attention/layer.py
    • Imported on_gfx1x for RDNA3 architecture detection.
    • Modified maybe_get_vit_flash_attn_backend to include a new condition for enabling _Backend.FLASH_ATTN on gfx1100 when FLASH_ATTENTION_TRITON_AMD_ENABLE is 'TRUE' and flash_attn is available.
    • Removed the on_gfx9() check from the general _Backend.FLASH_ATTN condition, broadening its applicability.
  • vllm/attention/ops/vit_attn_wrappers.py
    • Imported envs, current_platform, on_gfx9, and on_gfx1x.
    • Removed the is_rocm_aiter parameter from flash_attn_maxseqlen_wrapper, flash_attn_maxseqlen_wrapper_fake, and vit_flash_attn_wrapper.
    • Updated the logic within flash_attn_maxseqlen_wrapper to determine AITER usage based on current_platform.is_rocm(), on_gfx9(), and envs.VLLM_ROCM_USE_AITER/envs.VLLM_ROCM_USE_AITER_MHA.
  • vllm/model_executor/models/qwen2_5_vl.py
    • Ensured max_seqlen is a torch.Tensor before passing it to the wrapper.
    • Dynamically set self.use_upstream_fa based on flash_attn library availability.
    • Updated the call to vit_flash_attn_wrapper to remove the is_rocm_aiter argument.
  • vllm/platforms/rocm.py
    • Removed an entry for AMD_Instinct_MI308X from ROCM_ARCH_TO_NAME.
    • Added import os to get_vit_attn_backend.
    • Implemented on_gfx1x function for RDNA3 detection.
    • Updated get_vit_attn_backend to prioritize upstream Flash Attention for gfx1100 based on environment variables.
    • Replaced os.environ["VLLM_USE_TRITON_AWQ"] = "1" with envs.VLLM_USE_TRITON_AWQ = True for consistency.
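
As a rough illustration of the wrapper-side check described in the changelog, the sketch below derives the AITER decision from platform facts and environment flags instead of a caller-supplied boolean. The helper names, the "0" defaults, and the choice to AND all conditions together are assumptions made for illustration; the actual wrapper may combine them differently.

```python
import os


def _env_flag(name: str, default: str = "0") -> bool:
    """Interpret an environment variable as a boolean flag.

    The "0" default is an assumption for this sketch, not vLLM's default.
    """
    return os.environ.get(name, default).strip().lower() in {"1", "true"}


def should_use_aiter(is_rocm: bool, is_gfx9: bool) -> bool:
    """Hypothetical helper: decide on the AITER path inside the wrapper.

    Assumption: every condition must hold. The platform and architecture
    facts are passed in so the function can be exercised without a GPU.
    """
    return (
        is_rocm
        and is_gfx9
        and _env_flag("VLLM_ROCM_USE_AITER")
        and _env_flag("VLLM_ROCM_USE_AITER_MHA")
    )
```

With a helper like this, call sites no longer need to pass an is_rocm_aiter argument.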

Activity

  • The author, JartX, initiated the pull request and requested a summary of the changes.
  • A bot, gemini-code-assist[bot], provided a critical review comment on vllm/attention/layer.py, highlighting a potential bug in the original code where on_gfx9 was used as a boolean instead of being called as a function on_gfx9(). This issue is implicitly addressed by the PR's refactoring of attention backend selection logic.
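
For readers unfamiliar with this bug class, here is a minimal sketch of what "used as a boolean instead of being called" means. The on_gfx9 below is a stand-in stub, not vLLM's implementation, and the returned backend names are only labels.

```python
def on_gfx9() -> bool:
    """Hypothetical stand-in for the real architecture check."""
    return False  # pretend we are NOT running on a gfx9 GPU


def pick_backend_buggy() -> str:
    # Bug: a function object is always truthy, so this branch is taken
    # no matter what on_gfx9() would return.
    if on_gfx9:
        return "ROCM_AITER_FA"
    return "TORCH_SDPA"


def pick_backend_fixed() -> str:
    # Fix: call the function so the condition reflects its return value.
    if on_gfx9():
        return "ROCM_AITER_FA"
    return "TORCH_SDPA"
```

The buggy variant selects the gfx9-only path on every GPU, which is why the review flagged it as critical.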

@JartX
Contributor Author

JartX commented Oct 30, 2025

@zhewenl I can run inference with your latest flash_attn changes:

vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:11 [api_server.py:1869] vLLM API server version 0.1.dev10861+g624438436
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:11 [utils.py:253] non-default args: {'model_tag': '/models/Qwen3-VL-32B-Instruct-AWQ-4bit', 'port': 80, 'chat_template': '/chat-template-tools.jinja', 'model': '/models/Qwen3-VL-32B-Instruct-AWQ-4bit', 'dtype': 'float16', 'max_model_len': 65536, 'served_model_name': ['INCQ'], 'tensor_parallel_size': 4, 'gpu_memory_utilization': 0.98, 'limit_mm_per_prompt': {'image': 6, 'video': 0}, 'mm_encoder_tp_mode': 'data', 'max_num_seqs': 32, 'enable_log_requests': True}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [model.py:668] Resolved architecture: Qwen3VLForConditionalGeneration
vllm1-1  | (APIServer pid=1) WARNING 10-30 08:42:16 [model.py:1999] Casting torch.bfloat16 to torch.float16.
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [model.py:1773] Using max model len 65536
vllm1-1  | (APIServer pid=1) INFO 10-30 08:42:16 [scheduler.py:211] Chunked prefill is enabled with max_num_batched_tokens=2048.
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:42:21 [core.py:93] Initializing a V1 LLM engine (v0.1.dev10861+g624438436) with config: model='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=QWEN3, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': 3, 'debug_dump_path': None, 'cache_dir': '', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention', 'vllm::sparse_attn_indexer'], 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'use_cudagraph': True, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'full_cuda_graph': True, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 64, 'local_cache_dir': None}
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | INFO 10-30 08:42:26 [pynccl.py:111] vLLM is using nccl==2.26.6
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
vllm1-1  | [Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | [Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
vllm1-1  | INFO 10-30 08:42:31 [parallel_state.py:1325] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [gpu_model_runner.py:2861] Starting to load model cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit...
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [rocm.py:225] Using ViT FlashAttention (upstream) on V1 engine (gfx1x / RDNA3).
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:35 [rocm.py:318] Using Rocm Attention backend on V1 engine.
Loading safetensors checkpoint shards: 100% 5/5 [00:02<00:00,  1.82it/s]
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:39 [default_loader.py:314] Loading weights took 2.76 seconds
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:40 [gpu_model_runner.py:2926] Model loading took 6.6191 GiB and 4.165281 seconds
vllm1-1  | (Worker_TP2 pid=41) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP3 pid=42) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP1 pid=40) INFO 10-30 08:42:41 [gpu_model_runner.py:3698] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:26 [backends.py:618] Using cache directory: /root/.cache/vllm/torch_compile_cache/c4e7ce712f/rank_0_0/backbone for vLLM's torch.compile
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:26 [backends.py:634] Dynamo bytecode transform time: 7.05 s
vllm1-1  | (Worker_TP0 pid=39) [rank0]:W1030 08:43:28.235000 39 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP2 pid=41) [rank2]:W1030 08:43:28.634000 41 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP3 pid=42) [rank3]:W1030 08:43:28.642000 42 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP1 pid=40) [rank1]:W1030 08:43:28.644000 40 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP0 pid=39) [rank0]:W1030 08:43:30.481000 39 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP3 pid=42) [rank3]:W1030 08:43:30.848000 42 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP1 pid=40) [rank1]:W1030 08:43:30.853000 40 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP2 pid=41) [rank2]:W1030 08:43:30.870000 41 torch/_inductor/codegen/triton_combo_kernel.py:97] [0/0] ComboKernels: 1 long reduction nodes are separated
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:31 [backends.py:248] Cache the graph for dynamic shape for later use
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:40 [shm_broadcast.py:495] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation).
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:42 [backends.py:279] Compiling a graph for dynamic shape takes 15.09 s
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:49 [monitor.py:34] torch.compile takes 22.14 s in total
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:50 [gpu_worker.py:319] Available KV cache memory: 11.99 GiB
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1229] GPU KV cache size: 190,816 tokens
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:50 [kv_cache_utils.py:1234] Maximum concurrency for 65,536 tokens per request: 2.91x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100% 11/11 [00:02<00:00,  4.71it/s]
Capturing CUDA graphs (decode, FULL): 100% 7/7 [00:01<00:00,  4.15it/s]
vllm1-1  | (Worker_TP0 pid=39) INFO 10-30 08:43:55 [gpu_model_runner.py:3854] Graph capturing finished in 5 secs, took 0.40 GiB
vllm1-1  | (EngineCore_DP0 pid=29) INFO 10-30 08:43:55 [core.py:237] init engine (profile, create kv cache, warmup model) took 74.70 seconds
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:00 [api_server.py:1647] Supported tasks: ['generate']
vllm1-1  | (APIServer pid=1) WARNING 10-30 08:44:03 [model.py:1603] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:03 [serving_responses.py:167] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:03 [serving_chat.py:130] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [serving_completion.py:68] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [api_server.py:1938] Starting vLLM API server 0 on http://0.0.0.0:80
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:38] Available routes are:
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /docs, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /health, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /load, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /ping, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /ping, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /tokenize, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /detokenize, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/models, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /version, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/completions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/embeddings, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /pooling, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /classify, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /score, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/score, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v1/rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /v2/rerank, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /invocations, Methods: POST
vllm1-1  | (APIServer pid=1) INFO 10-30 08:44:04 [launcher.py:46] Route: /metrics, Methods: GET
vllm1-1  | (APIServer pid=1) INFO:     Started server process [1]
vllm1-1  | (APIServer pid=1) INFO:     Waiting for application startup.
vllm1-1  | (APIServer pid=1) INFO:     Application startup complete.

Can you help me with these parts of the wrapper and flash_attn?

context_layer = vit_flash_attn_wrapper(
    q,
    k,
    v,
    cu_seqlens,
    max_seqlen,
    batch_size,
    self.attn_backend == _Backend.ROCM_AITER_FA,
    self.use_upstream_fa,
)

def vit_flash_attn_wrapper(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    cu_seqlens: torch.Tensor,
    max_seqlen: torch.Tensor,
    batch_size: int,
    is_rocm_aiter: bool,
    use_upstream_fa: bool,
) -> torch.Tensor:
    return torch.ops.vllm.flash_attn_maxseqlen_wrapper(
        q, k, v, cu_seqlens, max_seqlen, batch_size, is_rocm_aiter, use_upstream_fa
    )

@JartX
Contributor Author

JartX commented Oct 30, 2025

Hi @yewentao256, would you be so kind as to take a look at this as well? Perhaps instead of using environment variables you know of another way to add it implicitly.

@JartX JartX force-pushed the feature/upstream_fa_rdna3_rocm branch from 1b770fd to 6244384 Compare October 30, 2025 12:48
@tjtanaa
Collaborator

tjtanaa commented Oct 30, 2025

@JartX can you benchmark which one is faster: the Triton flash attention API or torch.SDPA? Let's avoid introducing more and more code paths by offering the best one as the default. I also saw that you introduced new environment variables; we would like to cut down on those.

@tjtanaa
Collaborator

tjtanaa commented Oct 30, 2025

Moreover, there is ongoing work to fix the AMD CI. The GPU used on AMD CI can test all the code paths: torch.sdpa, CK flash attention varlen, AITER flash attention varlen, and even this new Triton flash attention varlen (if it is worth introducing). I will be fixing all of them together.

Contributor

@Lucaskabela Lucaskabela left a comment

I think the change to the custom op signature is fine, but there are some changes in the models file that we shouldn't need.

    is_rocm_aiter: bool,
    use_upstream_fa: bool,
) -> torch.Tensor:
    if is_rocm_aiter:

Contributor

So I am not sure what the ask is on this PR, but from my end we should feel free to change these shims however we need.

These custom ops exist purely to preserve traceability of the ViT component. The signatures are designed like this because we went from selecting attention based on attributes of the model (self) to needing an independent function without a self parameter.

I do want to voice a design consideration on this backend selection logic as a whole though: to me, it would seem better if we could just pass attn_fn lambdas directly, as opposed to passing a backend enum and doing the function selection later. I wonder what is preventing us from doing this in the code today? (traceability, etc.)

cc @ywang96 who may have more context on this

Member

@ywang96 ywang96 Oct 30, 2025

it would seem better if we could just pass attn_fn lambdas directly, as opposed to passing a backend enum and doing the function selection later. I wonder what is preventing us from doing this in the code today?

From the user perspective it's cleaner to just pass in an enum (e.g., --mm-encoder-attn-backend TORCH_SDPA), and it's better for us to control this than to accept an entire free-form attention implementation. But I agree that enum -> attn_fn can be resolved at init time of XXXVisionTransformer, and we pass the resolved attn_fn as an input downstream to XXXVisionAttention. Does that align with what's on your mind?
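
To make the idea concrete, here is a minimal sketch of resolving the enum to a callable once at init time and handing it downstream. All class and function names are hypothetical placeholders rather than vLLM's actual API, and the attention functions return a label instead of running a kernel.

```python
from enum import Enum, auto
from typing import Any, Callable, Dict, List


class Backend(Enum):
    """Hypothetical stand-in for the backend enum the user selects."""
    FLASH_ATTN = auto()
    TORCH_SDPA = auto()


AttnFn = Callable[[Any, Any, Any], str]


def flash_attn_fn(q: Any, k: Any, v: Any) -> str:
    return "flash_attn"  # placeholder for a real kernel call


def sdpa_fn(q: Any, k: Any, v: Any) -> str:
    return "torch_sdpa"  # placeholder for a real kernel call


_ATTN_FNS: Dict[Backend, AttnFn] = {
    Backend.FLASH_ATTN: flash_attn_fn,
    Backend.TORCH_SDPA: sdpa_fn,
}


def resolve_attn_fn(backend: Backend) -> AttnFn:
    """Map the user-facing enum to a concrete attention function."""
    try:
        return _ATTN_FNS[backend]
    except KeyError:
        raise ValueError(f"Unsupported backend: {backend}") from None


class VisionAttention:
    """Receives an already-resolved function; it never sees the enum."""

    def __init__(self, attn_fn: AttnFn) -> None:
        self.attn_fn = attn_fn

    def forward(self, q: Any, k: Any, v: Any) -> str:
        return self.attn_fn(q, k, v)


class VisionTransformer:
    """Resolves enum -> attn_fn once and shares it across blocks."""

    def __init__(self, backend: Backend, depth: int = 2) -> None:
        attn_fn = resolve_attn_fn(backend)
        self.blocks: List[VisionAttention] = [
            VisionAttention(attn_fn) for _ in range(depth)
        ]
```

In this shape the user still passes only an enum, while the attention layers have no backend-specific branching.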

Collaborator

@Lucaskabela @ywang96

I have an RFC, #27821, that proposes the same idea: the enum -> attn_fn resolution happens in XXXVisionTransformer.

However, since many of the VL models share the same logic as qwen2_5_vl.py, the RFC abstracts this further: the overriding logic should be handled by the platform, because only the platform knows which backends it can support.

So maybe_get_vit_flash_attn_backend will be solely responsible for the enum -> attn_fn mapping and will no longer include overriding logic. (It will be renamed to match its new role.)
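
A minimal sketch of that split, under the assumption that each platform exposes a method that validates or overrides the requested backend. The class names and the supported sets below are invented for illustration and do not describe what any real platform supports.

```python
from enum import Enum, auto
from typing import FrozenSet, Optional


class Backend(Enum):
    """Hypothetical stand-in for the backend enum."""
    FLASH_ATTN = auto()
    ROCM_AITER_FA = auto()
    TORCH_SDPA = auto()


class Platform:
    """Hypothetical platform interface: it alone knows what it supports."""

    supported: FrozenSet[Backend] = frozenset({Backend.TORCH_SDPA})
    default: Backend = Backend.TORCH_SDPA

    def get_vit_attn_backend(
        self, requested: Optional[Backend] = None
    ) -> Backend:
        # Honor the request when the platform supports it; otherwise
        # override it with the platform's own default.
        if requested is not None and requested in self.supported:
            return requested
        return self.default


class ExampleRocmPlatform(Platform):
    """Illustrative subclass; the sets here are made up for the sketch."""

    supported = frozenset(
        {Backend.FLASH_ATTN, Backend.ROCM_AITER_FA, Backend.TORCH_SDPA}
    )
    default = Backend.FLASH_ATTN
```

With the override living on the platform, the model-side helper only has to map the returned enum to a function.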

q, k = torch.chunk(qk_rotated, 2, dim=0)

if self.is_flash_attn_backend:
    from importlib.util import find_spec

Contributor

This entire logic seems unnecessary here because:

  1. max_seqlen is already a tensor (type hints FTW)
  2. use_upstream_fa is set in Qwen2_5_VisionTransformer/upstream in init. If needed, we should modify the logic there rather than here.
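
As a sketch of the second point, the availability probe can run once at construction so the forward pass only reads a cached flag. The class below is a hypothetical stand-in, not the real Qwen2_5_VisionTransformer, and it does not reproduce how upstream actually sets use_upstream_fa.

```python
from importlib.util import find_spec


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be found, without importing it."""
    return find_spec(name) is not None


class VisionTransformerSketch:
    """Hypothetical stand-in for the model's vision transformer."""

    def __init__(self) -> None:
        # Probe once at construction time instead of on every forward call.
        self.use_upstream_fa = is_package_available("flash_attn")

    def forward(self) -> str:
        # The hot path only reads the cached flag.
        return "upstream_fa" if self.use_upstream_fa else "fallback"
```

Keeping the probe in init also keeps the import-time lookup out of the traced forward path.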

@JartX
Contributor Author

JartX commented Oct 30, 2025

@tjtanaa
The benchmarks

FA CACHE 32B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 44.0127 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 592.918 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1227.88 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5604 |
+-----------------------------------+-----------+
| Average latency (s) | 29.6106 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.1861 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.128 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.1227 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 231.57 |
+-----------------------------------+-----------+
2025-10-30 16:14:29 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.6951 | 0.0776 | 0.1067 | 20.924 | 216 | 140 | 6.7049 | 14.6395 |
| 25% | 0.6977 | 0.1144 | 0.1242 | 25.4125 | 230 | 176 | 6.9486 | 15.0216 |
| 50% | 1.4801 | 0.1371 | 0.1308 | 30.9778 | 243 | 228 | 7.373 | 15.9937 |
| 66% | 1.5328 | 0.14 | 0.1356 | 32.7173 | 261 | 249 | 7.6107 | 16.7284 |
| 75% | 1.534 | 0.1403 | 0.1371 | 33.7172 | 269 | 264 | 7.8192 | 17.2755 |
| 80% | 1.5349 | 0.1404 | 0.1383 | 34.6548 | 272 | 282 | 8.1374 | 17.6022 |
| 90% | 1.5361 | 0.1409 | 0.1403 | 37.0137 | 279 | 336 | 9.0777 | 19.6744 |
| 95% | 1.5372 | 0.1414 | 0.145 | 38.2948 | 299 | 379 | 9.8969 | 20.5134 |
| 98% | 1.5375 | 0.1417 | 0.1472 | 39.0541 | 332 | 415 | 10.6263 | 21.9542 |
| 99% | 1.5379 | 0.142 | 0.1492 | 44.0115 | 334 | 676 | 15.3596 | 28.4289 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:14:29 - evalscope - INFO: Save the summary to: outputs/20251030_161318/QWEN3VL

TORCH.SDPA CACHE 32B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 44.856 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 589.434 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1215.04 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5227 |
+-----------------------------------+-----------+
| Average latency (s) | 30.0095 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.6538 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.1268 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.1214 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 233.65 |
+-----------------------------------+-----------+
2025-10-30 16:25:19 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.7192 | 0.0667 | 0.1041 | 20.2229 | 216 | 134 | 6.6262 | 14.3746 |
| 25% | 1.7707 | 0.1134 | 0.1224 | 25.5349 | 230 | 176 | 6.8792 | 14.9965 |
| 50% | 1.7744 | 0.1363 | 0.1296 | 31.4246 | 243 | 230 | 7.3191 | 15.8199 |
| 66% | 1.777 | 0.14 | 0.1334 | 33.2203 | 261 | 251 | 7.5552 | 16.408 |
| 75% | 1.7784 | 0.1402 | 0.1366 | 34.5455 | 269 | 272 | 7.8737 | 16.8927 |
| 80% | 1.7794 | 0.1403 | 0.1375 | 35.2264 | 272 | 284 | 8.0621 | 17.6604 |
| 90% | 1.7806 | 0.1406 | 0.1401 | 37.9803 | 279 | 350 | 9.2153 | 19.3472 |
| 95% | 1.781 | 0.1412 | 0.1403 | 38.7801 | 299 | 377 | 9.7131 | 20.796 |
| 98% | 1.7826 | 0.1416 | 0.1497 | 39.6375 | 332 | 415 | 10.4699 | 21.9482 |
| 99% | 1.7833 | 0.1418 | 0.1504 | 44.8548 | 334 | 691 | 15.4053 | 27.9196 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:25:19 - evalscope - INFO: Save the summary to: outputs/20251030_162405/QWEN3VL

TORCH.SDPA WCACHE 32B
Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 72.023 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 346.015 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 709.578 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.466 |
+-----------------------------------+----------+
| Average latency (s) | 58.5657 |
+-----------------------------------+----------+
| Average time to first token (s) | 16.5453 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.1878 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.178 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 236.02 |
+-----------------------------------+----------+
2025-10-30 16:35:13 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 5.5968 | 0.0783 | 0.1332 | 50.1825 | 216 | 148 | 2.9492 | 7.395 |
| 25% | 10.4208 | 0.1194 | 0.1483 | 54.1575 | 230 | 182 | 3.3454 | 7.6963 |
| 50% | 17.7737 | 0.1394 | 0.1837 | 59.8414 | 243 | 233 | 3.8836 | 8.1659 |
| 66% | 22.5718 | 0.1404 | 0.1993 | 61.4438 | 261 | 250 | 4.0619 | 8.4115 |
| 75% | 24.8496 | 0.1406 | 0.2096 | 62.7496 | 269 | 271 | 4.3188 | 8.5015 |
| 80% | 24.898 | 0.1407 | 0.2275 | 63.2192 | 272 | 279 | 4.4166 | 8.6882 |
| 90% | 29.6564 | 0.1412 | 0.2433 | 64.8741 | 279 | 319 | 4.9172 | 9.0637 |
| 95% | 29.6575 | 0.1418 | 0.2941 | 66.929 | 299 | 387 | 5.7716 | 9.5356 |
| 98% | 30.4465 | 2.3613 | 0.326 | 68.209 | 332 | 447 | 6.5534 | 10.8864 |
| 99% | 30.4946 | 2.4065 | 0.358 | 72.0218 | 334 | 649 | 9.0112 | 12.3574 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 16:35:13 - evalscope - INFO: Save the summary to: outputs/20251030_163324/QWEN3VL

FA WCACHE 32B
Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 82.0359 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 309.161 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 634.552 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.3121 |
+-----------------------------------+----------+
| Average latency (s) | 67.1365 |
+-----------------------------------+----------+
| Average time to first token (s) | 22.5948 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.2024 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.189 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 235.62 |
+-----------------------------------+----------+
2025-10-30 16:41:55 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 8.7995 | 0.0782 | 0.1382 | 56.7753 | 216 | 134 | 2.3767 | 6.3967 |
| 25% | 11.2158 | 0.1143 | 0.1584 | 63.1105 | 230 | 184 | 2.88 | 6.7068 |
| 50% | 25.4009 | 0.1355 | 0.1908 | 68.9092 | 243 | 236 | 3.4319 | 7.1465 |
| 66% | 30.0787 | 0.1396 | 0.2172 | 70.5585 | 261 | 253 | 3.5993 | 7.3268 |
| 75% | 32.4449 | 0.14 | 0.2343 | 71.6447 | 269 | 275 | 3.8245 | 7.4334 |
| 80% | 32.446 | 0.1402 | 0.2408 | 72.2703 | 272 | 282 | 3.9076 | 7.5732 |
| 90% | 37.1238 | 0.1406 | 0.268 | 74.8065 | 279 | 347 | 4.6359 | 7.9018 |
| 95% | 37.1247 | 0.1412 | 0.3766 | 75.8954 | 299 | 385 | 5.0728 | 8.5029 |
| 98% | 39.4619 | 2.3378 | 0.4174 | 76.2096 | 332 | 398 | 5.2224 | 8.9433 |
| 99% | 39.4619 | 2.3726 | 0.4193 | 82.0341 | 334 | 705 | 8.594 | 11.5318 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
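
To compare runs without reading the tables side by side, the summary blocks can be parsed into dictionaries and diffed. This is only a minimal sketch: `parse_summary` and `relative_change` are hypothetical helper names, not part of evalscope, and the regex assumes the two-column `| Key | Value |` layout pasted above. The sample values are copied from the TORCH.SDPA WCACHE 32B and FA WCACHE 32B summaries.

```python
import re

# Matches two-column summary rows such as
# "| Output token throughput (tok/s) | 346.015 |".
# The "| Key | Value |" header and the multi-column percentile rows do not match.
ROW = re.compile(r"^\|\s*([^|]+?)\s*\|\s*([-+]?\d+(?:\.\d+)?)\s*\|$")


def parse_summary(text: str) -> dict[str, float]:
    """Collect the numeric rows of one pasted summary table into a dict."""
    metrics: dict[str, float] = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def relative_change(baseline: dict[str, float],
                    candidate: dict[str, float],
                    key: str) -> float:
    """Fractional change of candidate versus baseline for one metric."""
    return (candidate[key] - baseline[key]) / baseline[key]


# Values copied from the two 32B WCACHE summaries above.
SDPA_WCACHE_32B = """
| Key | Value |
| Output token throughput (tok/s) | 346.015 |
| Average time to first token (s) | 16.5453 |
"""
FA_WCACHE_32B = """
| Key | Value |
| Output token throughput (tok/s) | 309.161 |
| Average time to first token (s) | 22.5948 |
"""

if __name__ == "__main__":
    sdpa = parse_summary(SDPA_WCACHE_32B)
    fa = parse_summary(FA_WCACHE_32B)
    for key in sdpa:
        print(f"{key}: {relative_change(sdpa, fa, key):+.1%}")
```

The same two helpers work for any pair of the summaries in this comment, as long as the full `Benchmarking summary` block is pasted in as text.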

TORCH.SDPA WCACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 54.6005 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 646.204 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1100.72 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.8328 |
+-----------------------------------+-----------+
| Average latency (s) | 43.0512 |
+-----------------------------------+-----------+
| Average time to first token (s) | 10.3647 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.096 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0927 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 352.58 |
+-----------------------------------+-----------+
2025-10-30 18:23:48 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 4.9009 | 0.0617 | 0.0722 | 31.1491 | 216 | 201 | 6.4528 | 12.8676 |
| 25% | 6.1604 | 0.0654 | 0.083 | 40.5178 | 230 | 298 | 7.4399 | 13.3768 |
| 50% | 11.1589 | 0.0686 | 0.0941 | 44.7023 | 243 | 356 | 7.9449 | 13.9569 |
| 66% | 13.5212 | 0.0695 | 0.1 | 47.1132 | 261 | 397 | 8.4099 | 14.2436 |
| 75% | 14.7944 | 0.07 | 0.1049 | 48.9736 | 269 | 425 | 8.6723 | 14.5184 |
| 80% | 14.7957 | 0.0705 | 0.1083 | 49.2853 | 272 | 432 | 8.7653 | 14.7427 |
| 90% | 17.2209 | 0.0716 | 0.1184 | 51.7792 | 279 | 489 | 9.4536 | 15.3315 |
| 95% | 17.2224 | 0.075 | 0.1247 | 54.2727 | 299 | 579 | 10.6684 | 15.4736 |
| 98% | 17.7139 | 0.374 | 0.1822 | 54.465 | 332 | 593 | 10.8877 | 15.7445 |
| 99% | 17.717 | 1.2324 | 0.2145 | 54.5586 | 334 | 595 | 10.9057 | 16.3571 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:23:48 - evalscope - INFO: Save the summary to: outputs/20251030_182112/QWEN3VL

TORCH.SDPA CACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 53.1952 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1065.81 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1800.92 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.9643 |
+-----------------------------------+-----------+
| Average latency (s) | 23.2187 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.9222 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0648 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.062 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 359.55 |
+-----------------------------------+-----------+
2025-10-30 18:30:25 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.4521 | 0.0547 | 0.0627 | 13.4872 | 216 | 185 | 13.7167 | 23.2075 |
| 25% | 0.975 | 0.0608 | 0.0644 | 18.8408 | 230 | 269 | 14.2927 | 23.9291 |
| 50% | 1.069 | 0.0652 | 0.0651 | 24.4533 | 243 | 361 | 14.7429 | 25.3894 |
| 66% | 1.0782 | 0.0675 | 0.0666 | 26.6542 | 261 | 396 | 14.8773 | 26.5442 |
| 75% | 1.0813 | 0.0682 | 0.0669 | 28.1046 | 269 | 422 | 15.0153 | 27.3046 |
| 80% | 1.0828 | 0.0685 | 0.0673 | 28.5103 | 272 | 429 | 15.1061 | 28.5251 |
| 90% | 1.0857 | 0.0693 | 0.0675 | 31.1938 | 279 | 489 | 15.6762 | 36.0011 |
| 95% | 1.087 | 0.0698 | 0.0685 | 32.5589 | 299 | 536 | 16.4625 | 42.8358 |
| 98% | 1.0899 | 0.0703 | 0.0694 | 33.7322 | 332 | 605 | 17.9354 | 48.2691 |
| 99% | 1.0942 | 0.0714 | 0.0731 | 53.1798 | 334 | 2048 | 38.5109 | 58.3012 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:30:25 - evalscope - INFO: Save the summary to: outputs/20251030_182906/QWEN3VL

TORCH.SDPA WCACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 63.534 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 844.695 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1423.39 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.3335 |
+-----------------------------------+-----------+
| Average latency (s) | 31.4613 |
+-----------------------------------+-----------+
| Average time to first token (s) | 4.5218 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0795 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0744 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 361.98 |
+-----------------------------------+-----------+
2025-10-30 18:38:31 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.8812 | 0.0589 | 0.07 | 20.6832 | 216 | 187 | 8.9648 | 17.7821 |
| 25% | 3.0235 | 0.0641 | 0.0723 | 26.9901 | 230 | 270 | 10.0778 | 18.0466 |
| 50% | 4.7063 | 0.07 | 0.0773 | 32.8515 | 243 | 366 | 11.0915 | 18.8519 |
| 66% | 5.852 | 0.0709 | 0.0809 | 34.7591 | 261 | 396 | 11.4274 | 19.4505 |
| 75% | 6.4132 | 0.0716 | 0.0837 | 36.7817 | 269 | 426 | 11.6507 | 20.0717 |
| 80% | 6.416 | 0.0721 | 0.0865 | 37.6976 | 272 | 441 | 11.7754 | 20.6697 |
| 90% | 7.7442 | 0.0779 | 0.092 | 40.0254 | 279 | 492 | 12.3524 | 21.7433 |
| 95% | 7.8041 | 0.0809 | 0.0958 | 41.032 | 299 | 531 | 12.9411 | 23.9661 |
| 98% | 7.9866 | 0.0872 | 0.1244 | 42.8509 | 332 | 640 | 14.9355 | 31.5263 |
| 99% | 7.9867 | 0.5745 | 0.1444 | 63.4683 | 334 | 2048 | 32.2681 | 36.144 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:38:31 - evalscope - INFO: Save the summary to: outputs/20251030_183545/QWEN3VL

TORCH.SDPA CACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 57.4744 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1036.16 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1731.83 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.8052 |
+-----------------------------------+-----------+
| Average latency (s) | 25.2854 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.7732 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.069 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0664 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 369.37 |
+-----------------------------------+-----------+
2025-10-30 18:40:30 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.5994 | 0.0585 | 0.0663 | 16.5848 | 216 | 219 | 13.283 | 21.7653 |
| 25% | 0.6018 | 0.0647 | 0.0687 | 20.3705 | 230 | 276 | 13.549 | 22.258 |
| 50% | 0.6065 | 0.0692 | 0.0698 | 26.0267 | 243 | 363 | 13.9795 | 23.8249 |
| 66% | 0.9156 | 0.0706 | 0.0703 | 28.4996 | 261 | 400 | 14.0588 | 24.9101 |
| 75% | 0.9809 | 0.0711 | 0.071 | 30.1839 | 269 | 428 | 14.1787 | 26.0163 |
| 80% | 0.9822 | 0.0713 | 0.0712 | 31.518 | 272 | 452 | 14.341 | 26.4757 |
| 90% | 0.9836 | 0.0728 | 0.072 | 33.5213 | 279 | 497 | 14.8264 | 31.3297 |
| 95% | 0.9848 | 0.0826 | 0.073 | 34.5789 | 299 | 527 | 15.2405 | 36.2241 |
| 98% | 0.9884 | 0.0844 | 0.0733 | 35.6343 | 332 | 581 | 16.3045 | 48.0304 |
| 99% | 0.9885 | 0.0849 | 0.0741 | 57.3962 | 334 | 2048 | 35.6818 | 73.2013 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+

TORCH.SDPA WCACHE EP MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 73.2821 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 691.217 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1169.45 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.9285 |
+-----------------------------------+-----------+
| Average latency (s) | 40.1204 |
+-----------------------------------+-----------+
| Average time to first token (s) | 8.6339 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0944 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0878 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 358.43 |
+-----------------------------------+-----------+
2025-10-30 18:51:11 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 3.1656 | 0.0595 | 0.0753 | 30.4569 | 216 | 202 | 6.6323 | 13.8441 |
| 25% | 4.3582 | 0.0674 | 0.0809 | 36.6867 | 230 | 282 | 7.7017 | 14.2868 |
| 50% | 9.2658 | 0.0711 | 0.0903 | 40.3674 | 243 | 346 | 8.6001 | 14.7629 |
| 66% | 11.6628 | 0.0723 | 0.0978 | 43.7976 | 261 | 391 | 8.9156 | 15.2021 |
| 75% | 12.8479 | 0.081 | 0.1043 | 44.5376 | 269 | 405 | 9.1366 | 15.3761 |
| 80% | 12.8502 | 0.0822 | 0.1072 | 45.5359 | 272 | 420 | 9.2346 | 15.7422 |
| 90% | 15.2021 | 0.0851 | 0.1163 | 47.0097 | 279 | 458 | 9.7697 | 16.0486 |
| 95% | 15.2032 | 0.0862 | 0.1253 | 50.3847 | 299 | 533 | 10.5868 | 16.5803 |
| 98% | 15.6115 | 0.1168 | 0.1811 | 51.8543 | 332 | 616 | 11.8795 | 16.8974 |
| 99% | 15.6119 | 1.1942 | 0.1987 | 73.2774 | 334 | 2048 | 27.9486 | 30.9782 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:51:11 - evalscope - INFO: Save the summary to: outputs/20251030_184812/QWEN3VL

TORCH.SDPA CACHE EP MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 59.5439 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 955.403 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1612.54 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.6498 |
+-----------------------------------+-----------+
| Average latency (s) | 26.292 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.7713 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0734 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0708 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 360.55 |
+-----------------------------------+-----------+
2025-10-30 18:55:11 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.4228 | 0.0596 | 0.0698 | 13.0905 | 216 | 165 | 12.5655 | 20.4495 |
| 25% | 0.4249 | 0.0687 | 0.073 | 22.5016 | 230 | 284 | 12.7066 | 20.8741 |
| 50% | 0.9214 | 0.0713 | 0.0744 | 29.0387 | 243 | 379 | 13.0516 | 22.22 |
| 66% | 0.9245 | 0.0779 | 0.0751 | 30.9603 | 261 | 409 | 13.1903 | 23.5255 |
| 75% | 0.9255 | 0.0808 | 0.0758 | 31.7657 | 269 | 422 | 13.279 | 24.5137 |
| 80% | 0.9268 | 0.0814 | 0.0762 | 32.5434 | 272 | 436 | 13.3975 | 25.0712 |
| 90% | 0.9286 | 0.0835 | 0.0765 | 35.5313 | 279 | 498 | 14.0158 | 34.2758 |
| 95% | 0.9295 | 0.0852 | 0.0779 | 37.0333 | 299 | 544 | 14.6895 | 44.5117 |
| 98% | 0.9313 | 0.086 | 0.0782 | 37.6751 | 332 | 578 | 15.3417 | 57.3209 |
| 99% | 0.9313 | 0.0866 | 0.0783 | 59.5296 | 334 | 2048 | 34.403 | 157.1854 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 18:55:11 - evalscope - INFO: Save the summary to: outputs/20251030_185346/QWEN3VL

FA WCACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 76.1157 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 665.335 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1103.29 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.766 |
+-----------------------------------+-----------+
| Average latency (s) | 47.4141 |
+-----------------------------------+-----------+
| Average time to first token (s) | 14.2678 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0954 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.088 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 376.74 |
+-----------------------------------+-----------+
2025-10-30 19:03:02 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.4604 | 0.0565 | 0.0691 | 35.5685 | 216 | 176 | 4.9482 | 11.8001 |
| 25% | 7.6708 | 0.0626 | 0.0805 | 44.3216 | 230 | 303 | 6.8364 | 12.2502 |
| 50% | 15.7807 | 0.0665 | 0.0913 | 48.8625 | 243 | 379 | 7.7565 | 12.7922 |
| 66% | 18.2435 | 0.0678 | 0.0998 | 50.8784 | 261 | 415 | 8.166 | 13.1782 |
| 75% | 19.4037 | 0.0683 | 0.1059 | 52.2424 | 269 | 438 | 8.3551 | 13.4079 |
| 80% | 19.4049 | 0.0686 | 0.1087 | 53.3581 | 272 | 458 | 8.5774 | 13.5085 |
| 90% | 23.4638 | 0.0693 | 0.1282 | 54.5664 | 279 | 491 | 8.9963 | 13.9303 |
| 95% | 23.4652 | 0.0704 | 0.1408 | 55.5348 | 299 | 526 | 9.4715 | 14.6384 |
| 98% | 24.1829 | 0.0836 | 0.2045 | 56.6207 | 332 | 604 | 10.6675 | 15.542 |
| 99% | 24.1831 | 1.2243 | 0.2184 | 76.05 | 334 | 2048 | 26.9297 | 30.1644 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 19:03:02 - evalscope - INFO: Save the summary to: outputs/20251030_190005/QWEN3VL

FA CACHE MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 54.3489 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 1078.18 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1787.42 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.86 |
+-----------------------------------+-----------+
| Average latency (s) | 24.9804 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.8962 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0663 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0639 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 376.99 |
+-----------------------------------+-----------+
2025-10-30 19:13:19 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.8588 | 0.0531 | 0.0629 | 17.39 | 216 | 243 | 13.9782 | 22.4605 |
| 25% | 0.9095 | 0.0634 | 0.0665 | 21.9884 | 230 | 308 | 14.0554 | 23.0041 |
| 50% | 0.9142 | 0.0676 | 0.0673 | 25.8729 | 243 | 371 | 14.3393 | 24.1448 |
| 66% | 0.9171 | 0.0685 | 0.0678 | 27.3414 | 261 | 395 | 14.447 | 24.917 |
| 75% | 0.9191 | 0.0689 | 0.0682 | 28.7832 | 269 | 421 | 14.6266 | 25.6518 |
| 80% | 0.9201 | 0.0692 | 0.0683 | 30.2443 | 272 | 449 | 14.8458 | 26.3901 |
| 90% | 0.9231 | 0.0736 | 0.0685 | 32.9384 | 279 | 510 | 15.5011 | 34.2137 |
| 95% | 0.9254 | 0.0778 | 0.0686 | 33.5483 | 299 | 527 | 15.7204 | 44.9994 |
| 98% | 0.9266 | 0.0803 | 0.0688 | 34.9626 | 332 | 609 | 17.4186 | 48.8927 |
| 99% | 0.9269 | 0.0812 | 0.0732 | 54.2814 | 334 | 2048 | 37.7293 | 49.748 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 19:13:19 - evalscope - INFO: Save the summary to: outputs/20251030_191159/QWEN3VL

FA WCACHE EP MMDATA MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 77.855 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 660.657 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1094.2 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 1.7482 |
+-----------------------------------+-----------+
| Average latency (s) | 44.7356 |
+-----------------------------------+-----------+
| Average time to first token (s) | 13.6044 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0896 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0824 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 377.9 |
+-----------------------------------+-----------+
2025-10-30 20:19:57 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.9057 | 0.0595 | 0.0692 | 35.1505 | 216 | 213 | 6.1236 | 12.7879 |
| 25% | 9.0109 | 0.0651 | 0.0764 | 40.3813 | 230 | 296 | 7.2493 | 13.1699 |
| 50% | 15.3092 | 0.0704 | 0.0816 | 45.1786 | 243 | 367 | 8.1067 | 13.4851 |
| 66% | 16.4454 | 0.0713 | 0.0899 | 48.0438 | 261 | 406 | 8.479 | 13.9981 |
| 75% | 17.5254 | 0.0717 | 0.0973 | 49.9678 | 269 | 443 | 8.8593 | 14.224 |
| 80% | 17.5274 | 0.0719 | 0.1007 | 50.987 | 272 | 457 | 8.9904 | 14.2892 |
| 90% | 18.1543 | 0.0725 | 0.1182 | 52.931 | 279 | 505 | 9.5407 | 14.651 |
| 95% | 20.2161 | 0.0733 | 0.1315 | 55.7345 | 299 | 585 | 10.4962 | 15.5217 |
| 98% | 20.4034 | 0.0793 | 0.1825 | 57.2001 | 332 | 662 | 11.5734 | 16.0438 |
| 99% | 20.4035 | 0.568 | 0.2522 | 77.8514 | 334 | 2048 | 26.3065 | 29.1581 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 20:19:57 - evalscope - INFO: Save the summary to: outputs/20251030_201700/QWEN3VL

FA CACHE EP MMDATA MOE 30B

Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 39.4272 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 963.856 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1617.63 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.6363 |
+-----------------------------------+-----------+
| Average latency (s) | 26.3642 |
+-----------------------------------+-----------+
| Average time to first token (s) | 1.1381 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.07 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.069 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 365.61 |
+-----------------------------------+-----------+
2025-10-30 20:26:22 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 1.0893 | 0.0605 | 0.0654 | 16.6581 | 216 | 213 | 12.8062 | 21.1437 |
| 25% | 1.1412 | 0.0648 | 0.0696 | 22.5231 | 230 | 297 | 13.1865 | 21.7527 |
| 50% | 1.1453 | 0.0702 | 0.0704 | 26.6384 | 243 | 362 | 13.5894 | 22.8363 |
| 66% | 1.1475 | 0.0711 | 0.0711 | 30.03 | 261 | 413 | 13.7529 | 24.3199 |
| 75% | 1.1494 | 0.0717 | 0.0717 | 31.3657 | 269 | 436 | 13.9005 | 24.6126 |
| 80% | 1.15 | 0.0722 | 0.0722 | 32.9363 | 272 | 464 | 14.0878 | 24.9927 |
| 90% | 1.2555 | 0.0819 | 0.0733 | 35.7575 | 279 | 530 | 14.8221 | 30.2387 |
| 95% | 1.2562 | 0.0842 | 0.0736 | 37.223 | 299 | 569 | 15.2862 | 34.0216 |
| 98% | 1.2584 | 0.0848 | 0.0737 | 37.931 | 332 | 597 | 15.7391 | 45.9805 |
| 99% | 1.3102 | 0.0862 | 0.0742 | 39.4253 | 334 | 696 | 17.6536 | 68.6258 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 20:26:22 - evalscope - INFO: Save the summary to: outputs/20251030_202518/QWEN3VL

FA WCACHE EP MOE 30B

Benchmarking summary:
+-----------------------------------+----------+
| Key | Value |
+===================================+==========+
| Time taken for tests (s) | 84.1858 |
+-----------------------------------+----------+
| Number of concurrency | 100 |
+-----------------------------------+----------+
| Total requests | 100 |
+-----------------------------------+----------+
| Succeed requests | 100 |
+-----------------------------------+----------+
| Failed requests | 0 |
+-----------------------------------+----------+
| Output token throughput (tok/s) | 569.827 |
+-----------------------------------+----------+
| Total token throughput (tok/s) | 958.643 |
+-----------------------------------+----------+
| Request throughput (req/s) | 1.5679 |
+-----------------------------------+----------+
| Average latency (s) | 50.9302 |
+-----------------------------------+----------+
| Average time to first token (s) | 15.1424 |
+-----------------------------------+----------+
| Average time per output token (s) | 0.11 |
+-----------------------------------+----------+
| Average inter-token latency (s) | 0.0985 |
+-----------------------------------+----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+----------+
| Average output tokens per request | 363.44 |
+-----------------------------------+----------+
2025-10-30 21:11:50 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 6.4485 | 0.0593 | 0.0763 | 40.0743 | 216 | 195 | 4.866 | 10.7384 |
| 25% | 7.812 | 0.0654 | 0.0857 | 45.6035 | 230 | 266 | 5.7993 | 11.1949 |
| 50% | 16.0646 | 0.0704 | 0.103 | 52.5351 | 243 | 367 | 6.9667 | 11.6967 |
| 66% | 20.1423 | 0.0718 | 0.1129 | 55.2501 | 261 | 404 | 7.3451 | 12.0482 |
| 75% | 21.4187 | 0.0805 | 0.1192 | 57.4467 | 269 | 442 | 7.6941 | 12.1421 |
| 80% | 21.4696 | 0.0816 | 0.1272 | 58.3506 | 272 | 457 | 7.8713 | 12.271 |
| 90% | 25.499 | 0.0852 | 0.1502 | 59.9887 | 279 | 500 | 8.3033 | 12.6402 |
| 95% | 25.5026 | 0.086 | 0.1682 | 60.9027 | 299 | 525 | 8.6203 | 13.253 |
| 98% | 26.2705 | 0.0974 | 0.2562 | 63.7788 | 332 | 684 | 10.7246 | 14.8639 |
| 99% | 26.3214 | 1.2897 | 0.2805 | 84.1223 | 334 | 2048 | 24.3455 | 27.2698 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 21:11:50 - evalscope - INFO: Save the summary to: outputs/20251030_210843/QWEN3VL

FA CACHE EP MOE 30B
Benchmarking summary:
+-----------------------------------+-----------+
| Key | Value |
+===================================+===========+
| Time taken for tests (s) | 60.5891 |
+-----------------------------------+-----------+
| Number of concurrency | 100 |
+-----------------------------------+-----------+
| Total requests | 100 |
+-----------------------------------+-----------+
| Succeed requests | 100 |
+-----------------------------------+-----------+
| Failed requests | 0 |
+-----------------------------------+-----------+
| Output token throughput (tok/s) | 934.445 |
+-----------------------------------+-----------+
| Total token throughput (tok/s) | 1556.28 |
+-----------------------------------+-----------+
| Request throughput (req/s) | 2.5075 |
+-----------------------------------+-----------+
| Average latency (s) | 26.8537 |
+-----------------------------------+-----------+
| Average time to first token (s) | 0.9026 |
+-----------------------------------+-----------+
| Average time per output token (s) | 0.0724 |
+-----------------------------------+-----------+
| Average inter-token latency (s) | 0.0696 |
+-----------------------------------+-----------+
| Average input tokens per request | 247.99 |
+-----------------------------------+-----------+
| Average output tokens per request | 372.66 |
+-----------------------------------+-----------+
2025-10-30 21:16:44 - evalscope - INFO:
Percentile results:
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| Percentiles | TTFT (s) | ITL (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Output (tok/s) | Total (tok/s) |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
| 10% | 0.3744 | 0.0584 | 0.0688 | 15.3015 | 216 | 194 | 12.614 | 20.5148 |
| 25% | 0.9796 | 0.0663 | 0.0715 | 21.6323 | 230 | 273 | 12.7091 | 21.0648 |
| 50% | 1.0326 | 0.0707 | 0.0734 | 28.5254 | 243 | 375 | 13.1462 | 22.4579 |
| 66% | 1.0355 | 0.0731 | 0.0744 | 31.0512 | 261 | 414 | 13.3328 | 23.4039 |
| 75% | 1.0366 | 0.0799 | 0.075 | 33.0748 | 269 | 448 | 13.5321 | 23.8997 |
| 80% | 1.0376 | 0.0811 | 0.0755 | 33.6966 | 272 | 459 | 13.6176 | 24.7821 |
| 90% | 1.0398 | 0.0851 | 0.0759 | 36.6518 | 279 | 522 | 14.2421 | 32.0522 |
| 95% | 1.0405 | 0.0859 | 0.0775 | 38.213 | 299 | 567 | 14.8379 | 43.8427 |
| 98% | 1.0412 | 0.0864 | 0.0781 | 39.8789 | 332 | 653 | 16.3746 | 47.2565 |
| 99% | 1.044 | 0.0866 | 0.0781 | 60.5256 | 334 | 2048 | 33.8369 | 63.1981 |
+-------------+----------+---------+----------+-------------+--------------+---------------+----------------+---------------+
2025-10-30 21:16:44 - evalscope - INFO: Save the summary to: outputs/20251030_211519/QWEN3VL

@JartX
Contributor Author

JartX commented Oct 30, 2025

@tjtanaa

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA CACHE MOE 30B | 1078.18 | 24.9804 | 0.8962 | 1787.42 | 2.86 |
| TORCH.SDPA CACHE MOE 30B | 1065.81 | 23.2187 | 0.9222 | 1800.92 | 2.9643 |
| TORCH.SDPA CACHE EP MMDATA MOE 30B | 1036.16 | 25.2854 | 0.7732 | 1731.83 | 2.8052 |
| FA CACHE EP MMDATA MOE 30B | 963.856 | 26.3642 | 1.1381 | 1617.63 | 2.6363 |
| TORCH.SDPA CACHE EP MOE 30B | 955.403 | 26.292 | 0.7713 | 1612.54 | 2.6498 |
| FA CACHE EP MOE 30B | 934.445 | 26.8537 | 0.9026 | 1556.28 | 2.5075 |
| TORCH.SDPA WCACHE EP MMDATA MOE 30B | 844.695 | 31.4613 | 4.5218 | 1423.39 | 2.3335 |
| TORCH.SDPA WCACHE EP MOE 30B | 691.217 | 40.1204 | 8.6339 | 1169.45 | 1.9285 |
| FA WCACHE MOE 30B | 665.335 | 47.4141 | 14.2678 | 1103.29 | 1.766 |
| FA WCACHE EP MMDATA MOE 30B | 660.657 | 44.7356 | 13.6044 | 1094.2 | 1.7482 |
| TORCH.SDPA WCACHE MOE 30B | 646.204 | 43.0512 | 10.3647 | 1100.72 | 1.8328 |
| FA CACHE 32B | 592.918 | 29.6106 | 1.1861 | 1227.88 | 2.5604 |
| TORCH.SDPA CACHE 32B | 589.434 | 30.0095 | 1.6538 | 1215.04 | 2.5227 |
| FA WCACHE EP MOE 30B | 569.827 | 50.9302 | 15.1424 | 958.643 | 1.5679 |
| TORCH.SDPA WCACHE 32B | 346.015 | 58.5657 | 16.5453 | 709.578 | 1.466 |
| FA WCACHE 32B | 309.161 | 67.1365 | 22.5948 | 634.552 | 1.3121 |

@tjtanaa
Collaborator

tjtanaa commented Oct 30, 2025

@JartX From your data, I think TORCH_SDPA is generally still the better option.

@tjtanaa
Collaborator

tjtanaa commented Oct 30, 2025

CACHE MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA CACHE MOE 30B | 1078.18 | 24.9804 | 0.8962 | 1787.42 | 2.86 |
| TORCH.SDPA CACHE MOE 30B | 1065.81 | 23.2187 | 0.9222 | 1800.92 | 2.9643 |

TORCH.SDPA (better total throughput and request throughput)

CACHE EP MMDATA MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA CACHE EP MMDATA MOE 30B | 963.856 | 26.3642 | 1.1381 | 1617.63 | 2.6363 |
| TORCH.SDPA CACHE EP MMDATA MOE 30B | 1036.16 | 25.2854 | 0.7732 | 1731.83 | 2.8052 |

TORCH.SDPA (better across all metrics)

CACHE EP MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA CACHE EP MOE 30B | 934.445 | 26.8537 | 0.9026 | 1556.28 | 2.5075 |
| TORCH.SDPA CACHE EP MOE 30B | 955.403 | 26.292 | 0.7713 | 1612.54 | 2.6498 |

TORCH.SDPA (better across all metrics)

WCACHE EP MMDATA MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA WCACHE EP MMDATA MOE 30B | 660.657 | 44.7356 | 13.6044 | 1094.2 | 1.7482 |
| TORCH.SDPA WCACHE EP MMDATA MOE 30B | 844.695 | 31.4613 | 4.5218 | 1423.39 | 2.3335 |

TORCH.SDPA (significantly better across all metrics)

WCACHE EP MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA WCACHE EP MOE 30B | 569.827 | 50.9302 | 15.1424 | 958.643 | 1.5679 |
| TORCH.SDPA WCACHE EP MOE 30B | 691.217 | 40.1204 | 8.6339 | 1169.45 | 1.9285 |

TORCH.SDPA (significantly better across all metrics)

WCACHE MOE 30B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA WCACHE MOE 30B | 665.335 | 47.4141 | 14.2678 | 1103.29 | 1.766 |
| TORCH.SDPA WCACHE MOE 30B | 646.204 | 43.0512 | 10.3647 | 1100.72 | 1.8328 |

Mixed - FA has better output throughput, TORCH.SDPA has better latency and TTFT

CACHE 32B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA CACHE 32B | 592.918 | 29.6106 | 1.1861 | 1227.88 | 2.5604 |
| TORCH.SDPA CACHE 32B | 589.434 | 30.0095 | 1.6538 | 1215.04 | 2.5227 |

FA (slightly better output throughput and TTFT)

WCACHE 32B

| Configuration | Output Throughput (tok/s) | Avg Latency (s) | Avg TTFT (s) | Total Throughput (tok/s) | Request Throughput (req/s) |
| --- | --- | --- | --- | --- | --- |
| FA WCACHE 32B | 309.161 | 67.1365 | 22.5948 | 634.552 | 1.3121 |
| TORCH.SDPA WCACHE 32B | 346.015 | 58.5657 | 16.5453 | 709.578 | 1.466 |

TORCH.SDPA (better across all metrics)
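
The verdicts in this comment come down to a per-metric comparison in which throughput figures are better when higher, while latency and TTFT are better when lower. A minimal sketch of that comparison, using two row pairs copied from the tables in this thread; the helper names `row` and `winners` and the metric keys are illustrative assumptions, not part of evalscope or vLLM:

```python
# Illustrative helpers only; the names and metric keys are assumptions,
# not evalscope or vLLM APIs.

# True means "higher is better", False means "lower is better".
HIGHER_IS_BETTER = {
    "output_tps": True,
    "latency_s": False,
    "ttft_s": False,
    "total_tps": True,
    "req_per_s": True,
}


def row(output_tps, latency_s, ttft_s, total_tps, req_per_s):
    """Pack one benchmark row, in the column order used in the tables."""
    return {
        "output_tps": output_tps,
        "latency_s": latency_s,
        "ttft_s": ttft_s,
        "total_tps": total_tps,
        "req_per_s": req_per_s,
    }


def winners(fa, sdpa):
    """Return {metric: "FA" | "TORCH.SDPA" | "tie"} for two result rows."""
    result = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        a, b = fa[metric], sdpa[metric]
        if a == b:
            result[metric] = "tie"
        elif (a > b) == higher:
            result[metric] = "FA"
        else:
            result[metric] = "TORCH.SDPA"
    return result


# CACHE EP MMDATA MOE 30B (values copied from the table).
ep_mmdata = winners(
    row(963.856, 26.3642, 1.1381, 1617.63, 2.6363),
    row(1036.16, 25.2854, 0.7732, 1731.83, 2.8052),
)

# WCACHE MOE 30B (values copied from the table).
wcache_moe = winners(
    row(665.335, 47.4141, 14.2678, 1103.29, 1.766),
    row(646.204, 43.0512, 10.3647, 1100.72, 1.8328),
)

print(ep_mmdata)
print(wcache_moe)
```

A single run per configuration says nothing about run-to-run variance, so small gaps (such as the 32B cached case) may not be meaningful.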

@JartX
Contributor Author

JartX commented Oct 30, 2025

@tjtanaa It surprised me too, but I don't understand why xD. Could you try to explain it to me? Please? Even if it's just because of the time I've spent with the benchmarks hahaha

@JartX
Contributor Author

JartX commented Oct 31, 2025

@tjtanaa

Accuracy Test:

FA

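| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| chartqa | 0 | none | 0 | anywhere_accuracy | 0.795 | ± 0.0286 |
| | | none | 0 | exact_match | 0.570 | ± 0.0351 |
| | | none | 0 | relaxed_accuracy | 0.795 | ± 0.0286 |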

TORCH.SDPA

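| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| chartqa | 0 | none | 0 | anywhere_accuracy | 0.775 | ± 0.0296 |
| | | none | 0 | exact_match | 0.565 | ± 0.0351 |
| | | none | 0 | relaxed_accuracy | 0.770 | ± 0.0298 |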

@tjtanaa
Collaborator

tjtanaa commented Nov 1, 2025

@JartX I think the accuracy change is not large enough to indicate a significant degradation; we may need to evaluate on more datasets.

Accuracy Changes (FA → TORCH.SDPA)

| Metric | FA Value | TORCH.SDPA Value | Change | Percentage Change |
| --- | --- | --- | --- | --- |
| anywhere_accuracy | 0.795 | 0.775 | -0.020 | -2.52% |
| exact_match | 0.570 | 0.565 | -0.005 | -0.88% |
| relaxed_accuracy | 0.795 | 0.770 | -0.025 | -3.14% |
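
The percentage column is the relative change `(sdpa - fa) / fa * 100`. The sketch below recomputes it from the chartqa values reported in this thread and, as a rough sanity check, compares each absolute difference with the combined standard error `sqrt(se_fa**2 + se_sdpa**2)`. Treating the two runs as independent samples is an assumption made here for illustration (both runs use the same dataset), so this is not a formal significance test.

```python
import math

# (value, stderr) pairs copied from the chartqa results in this thread.
fa = {
    "anywhere_accuracy": (0.795, 0.0286),
    "exact_match": (0.570, 0.0351),
    "relaxed_accuracy": (0.795, 0.0286),
}
sdpa = {
    "anywhere_accuracy": (0.775, 0.0296),
    "exact_match": (0.565, 0.0351),
    "relaxed_accuracy": (0.770, 0.0298),
}


def relative_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def within_combined_stderr(a, b):
    """True if |a - b| is below the combined stderr of the two estimates.

    Assumes the two estimates are independent (an illustrative simplification).
    """
    (value_a, se_a), (value_b, se_b) = a, b
    return abs(value_a - value_b) < math.sqrt(se_a**2 + se_b**2)


for metric in fa:
    pct = relative_change(fa[metric][0], sdpa[metric][0])
    inside = within_combined_stderr(fa[metric], sdpa[metric])
    print(f"{metric}: {pct:.2f}% (within combined stderr: {inside})")
```

Under that assumption all three differences fall inside one combined standard error, which is consistent with the view that more datasets are needed before drawing a conclusion.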

Regarding the speed comparison between Triton FA and torch.sdpa:

In my opinion,

  1. Triton FA was developed for Instinct GPUs, so the kernel's performance benefits do not necessarily transfer to Radeon GPUs.
  2. The Triton FA kernels are older FA implementations, whereas torch.sdpa is tied to PyTorch's own optimizations. AMD has been optimizing PyTorch operators, which could be why torch.sdpa is catching up with Triton FA.

@tjtanaa
Collaborator

tjtanaa commented Nov 2, 2025

@JartX I would like to suggest an alternative. Since AITER is also installed on Radeon, can you explore using AITER's Triton flash_attn_varlen_func implementation instead (https://github.com/ROCm/aiter/blob/294b779c6cc9790dbfd1815ddcf0026c62008456/aiter/ops/triton/mha.py#L955)?

See if it is faster?

@JartX
Contributor Author

JartX commented Nov 2, 2025

@tjtanaa thanks for the idea — I tried it back in the day, but inference isn’t possible due to lack of hardware support. For example:

(worker_TP2 pid=5145) /usr/local/lib/python3.12/dist-packages/aiter/jit/build/ck/include/ck_tile/core/arch/amd_buffer_addressing_builtins_hip.hpp:1168:22: error: invalid operand for instruction
(worker_TP2 pid=5145)  1168 |         asm volatile("buffer_load_dword %1, %2, 0 offen offset:%3 lds"
(worker_TP2 pid=5145)       |                      ^
(worker_TP2 pid=5145) <inline asm>:1:25: note: instantiated into assembly here
(worker_TP2 pid=5145)     1 |         buffer_load_dword v42, s[24:27], 0 offen offset:0 lds

I've also seen your PR #27919. With it, this one (PR #27776) could even be dropped, or reduced to the Dockerfile change along with a wiki update referencing ROCm and RDNA3.

And sorry for my ignorance, but with your PR, is it possible to force upstream Flash Attention?

If so, it would be really easy to offer both attention backends in case FA improves.

Thank you so much for your time.

@tjtanaa
Collaborator

tjtanaa commented Nov 2, 2025

@JartX I am referring to the Triton implementation from the AITER repo. Does invoking the Triton implementation also trigger the asm error?

@JartX
Contributor Author

JartX commented Nov 3, 2025

@tjtanaa Hi, it also fails to start up; it seems to only be supported for X Arch:
https://github.com/ROCm/aiter/blob/294b779c6cc9790dbfd1815ddcf0026c62008456/aiter/ops/triton/utils/_triton/arch_info.py

@JartX
Contributor Author

JartX commented Nov 3, 2025

I'm going to add an entry to spoof the GPU:

"gfx1100": "MI300X"

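A rough sketch of what such a spoof amounts to, assuming the architecture check is a plain dictionary lookup from the gfx target to a device name. The table `ARCH_TO_DEVICE` and the function `device_for` below are hypothetical stand-ins written for illustration; they are not AITER's actual code, and only the `"gfx1100": "MI300X"` entry comes from this thread.

```python
# Hypothetical stand-in for an architecture table; not AITER's actual code.
ARCH_TO_DEVICE = {
    "gfx942": "MI300X",  # Instinct MI300X target
}


def device_for(arch, table):
    """Look up the device name for a gfx target, rejecting unknown targets."""
    try:
        return table[arch]
    except KeyError:
        raise ValueError(f"unsupported architecture: {arch}") from None


# The spoof: map the RDNA3 target onto an already supported device name.
spoofed = {**ARCH_TO_DEVICE, "gfx1100": "MI300X"}

print(device_for("gfx1100", spoofed))
```

A mapping like this only gets past the support check. Any kernel configuration selected for the spoofed device was tuned for different hardware, so it may not be a good fit for RDNA3.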
@JartX
Contributor Author

JartX commented Nov 3, 2025

@tjtanaa
Okay, I was able to run it, but the results barely differ from upstream fa; TORCH.SDPA is still better, so I'm closing the PR until maybe this changes :)

@JartX JartX closed this Nov 3, 2025