
feat: Add TensorRT Edge-LLM AttentionPlugin backend support #4108

Open
chohk88 wants to merge 1 commit into main from attn-plugin-workflow

Conversation


@chohk88 chohk88 commented Mar 3, 2026

Add plugin backend as an alternative to the default SDPA lowering for LLM inference, providing ~1.7x-3.3x speedup over SDPA and ~8x-11x over PyTorch eager execution.

Supported Models: Llama 3.x (3.1/3.2), Qwen 2.5, Qwen 3

Changes:

  • examples/dynamo/attention_plugin_example.py: Standalone plugin demo with correctness validation against PyTorch SDPA
  • examples/dynamo/end_to_end_llm_generation_example.py: End-to-end LLM generation example with plugin integration and benchmarks
  • tools/llm/plugin_utils.py: Model-agnostic plugin utilities including op registration (tensorrt_edge_llm::xqa_attn), TensorRT converter, PluginAttention module, LLMPluginWrapper, compilation and generation
  • tools/llm/run_llm.py: Add --backend plugin/sdpa selection with plugin workflow integration
  • tools/llm/README.md: Plugin backend documentation with build guide, usage examples, and performance summary

Plugin library built from TensorRT-Edge-LLM 0.4.0: https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)

Type of change

Please delete options that are not relevant and/or add your own.

  • New feature (non-breaking change which adds functionality)

Checklist:

  • My code follows the style guidelines of this project (You can use the linters)
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas and hacks
  • I have made corresponding changes to the documentation
  • I have added tests to verify my fix or my feature
  • New and existing unit tests pass locally with my changes
  • I have added the relevant labels to my PR so that relevant reviewers are notified

@chohk88 chohk88 requested review from narendasan and zewenli98 March 3, 2026 13:54
@chohk88 chohk88 self-assigned this Mar 3, 2026
@meta-cla meta-cla bot added the cla signed label Mar 3, 2026
- **Source build (slow)**: `pip install flash-attn --no-build-isolation -v` (fallback if pre-built wheels fail)

Make sure to add `MAX_JOBS=8`, otherwise you might take people's systems down.
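For example (a sketch of the suggested invocation; `8` is just a reasonable cap, tune it per machine):

```shell
# Cap parallel compile jobs so a flash-attn source build does not
# saturate every core and exhaust memory on shared machines.
MAX_JOBS=8 pip install flash-attn --no-build-isolation -v
```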


@narendasan narendasan left a comment


Overall, I think it's close. @zewenli98 should take a pass, but we can merge nearly as is. Next, I want to think about how we might create lowering passes that insert the placeholder ops programmatically. Evan is about to disable decomposition by default for SDPA, so we can basically dynamically insert a pass that keys on those ops.
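A lowering pass that keys on SDPA nodes could look roughly like this — a minimal FX sketch, not this PR's implementation; `xqa_attn_placeholder` is a hypothetical stand-in for the registered plugin op:

```python
import torch
import torch.fx as fx
import torch.nn.functional as F


# Hypothetical stand-in for the tensorrt_edge_llm::xqa_attn plugin op.
def xqa_attn_placeholder(q, k, v):
    # Reference semantics: plain SDPA, so the rewrite preserves numerics.
    return F.scaled_dot_product_attention(q, k, v)


def insert_plugin_ops(gm: fx.GraphModule) -> fx.GraphModule:
    """Key on SDPA call_function nodes and swap in the placeholder op."""
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is F.scaled_dot_product_attention:
            with gm.graph.inserting_after(node):
                repl = gm.graph.call_function(
                    xqa_attn_placeholder, node.args, node.kwargs
                )
            node.replace_all_uses_with(repl)
            gm.graph.erase_node(node)
    gm.recompile()
    return gm


def attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v)


gm = insert_plugin_ops(fx.symbolic_trace(attn))
q = k = v = torch.randn(2, 4, 8, 16)
out = gm(q, k, v)
```

With decomposition disabled, the un-decomposed SDPA nodes survive into the graph, so a pass like this can retarget them before the converter runs.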

trt_timings.append(elapsed_ms / 1000.0)
else:
# SDPA backend (default)
if args.cache == "static_v1":

We have a few threads here: backend, cache, and, with @zewenli98's PR, core Attention. Can we merge these settings so it's easy to understand when you get TRT-Edge-LLM, when you get native IAttention, and when you get the static KV cache?


@chohk88 I implemented the converters for some attention variants in #4104. Can you take a look at how to integrate?

# -----------------------------------------------------------------------------


@dynamo_tensorrt_converter(

Let's put all the converters for our edgellm ops in their own file.


@narendasan narendasan left a comment


Do we have lowering passes to insert the tensorrt edge llm ops in place of pytorch ops?

Comment on lines +697 to +698
enabled_precisions={torch.float32},
use_explicit_typing=True,

When `use_explicit_typing` is true, `enabled_precisions` should be removed.
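In other words, something along these lines — a sketch assuming surrounding variables `exported_program` and `inputs`, not the exact call from this PR:

```python
# With strong typing enabled, layer precisions follow the graph's dtypes,
# so enabled_precisions is dropped rather than passed alongside it.
trt_module = torch_tensorrt.dynamo.compile(
    exported_program,
    inputs=inputs,
    use_explicit_typing=True,
    # enabled_precisions={torch.float32},  # removed: redundant with explicit typing
)
```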

device=device,
disable_tf32=True,
min_block_size=1,
debug=debug,

`debug` is deprecated. Please use `with torch_tensorrt.dynamo.Debugger(...)` instead.

@@ -7,7 +7,9 @@ This directory provides utilities and scripts for compiling, optimizing, and ben
- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **Quantization:** Supports FP8 and NVFP4 quantization formats for reduced memory usage and improved inference speed.

should we keep quant?



3 participants