feat: Add TensorRT Edge-LLM AttentionPlugin backend support#4108
Conversation
Add a plugin backend as an alternative to the default SDPA lowering for LLM inference, providing ~1.5x-1.8x speedup over SDPA and ~8x-11x over PyTorch eager execution.

Supported Models: Llama 3.x (3.1/3.2), Qwen 2.5, Qwen 3

Changes:
- `examples/dynamo/attention_plugin_example.py`: Standalone plugin demo with correctness validation against PyTorch SDPA
- `examples/dynamo/end_to_end_llm_generation_example.py`: End-to-end LLM generation example with plugin integration and benchmarks
- `tools/llm/plugin_utils.py`: Model-agnostic plugin utilities including op registration (`tensorrt_edge_llm::xqa_attn`), TensorRT converter, `PluginAttention` module, `LLMPluginWrapper`, compilation and generation
- `tools/llm/run_llm.py`: Add `--backend plugin/sdpa` selection with plugin workflow integration
- `tools/llm/README.md`: Plugin backend documentation with build guide, usage examples, and performance summary

Plugin library built from TensorRT-Edge-LLM 0.4.0: https://github.com/chohk88/TensorRT-Edge-LLM/tree/feature/torch-tensorrt-python-runtime
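Given the `--backend plugin/sdpa` flag described above, a run might look like the following sketch. The model name and flag spellings are assumptions for illustration; check `python tools/llm/run_llm.py --help` for the actual interface.

```shell
# Hypothetical invocations; --model value is an assumption.
# Run generation through the TensorRT-Edge-LLM attention plugin:
python tools/llm/run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --backend plugin

# Compare against the default SDPA lowering:
python tools/llm/run_llm.py --model meta-llama/Llama-3.2-1B-Instruct --backend sdpa
```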
- **Source build (slow)**: `pip install flash-attn --no-build-isolation -v` (fallback if pre-built wheels fail)
Make sure to add `MAX_JOBS=8`, otherwise you might take people's systems down.
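The suggestion above would change the fallback command to something like the following. The value 8 is a suggested cap, not a requirement; tune it to the machine's RAM and core count.

```shell
# Cap parallel compile jobs so the flash-attn source build does not
# exhaust system memory (each nvcc job can use several GB of RAM).
MAX_JOBS=8 pip install flash-attn --no-build-isolation -v
```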
narendasan left a comment
Overall, I think it's close. @zewenli98 should take a pass, but we can merge nearly as-is. Next, though, I want to think about how we might create lowering passes that insert the placeholder ops programmatically. Evan is about to disable decomposition by default for SDPA, so we can basically dynamically insert a pass that keys on those ops.
    trt_timings.append(elapsed_ms / 1000.0)
else:
    # SDPA backend (default)
    if args.cache == "static_v1":
We have a few threads here: `backend`, `cache`, and, with @zewenli98's PR, core Attention. Can we merge these settings so it's easy to understand when you will get TRT-Edge-LLM, when you get native IAttention, and when you get Static KV Cache?
# -----------------------------------------------------------------------------

@dynamo_tensorrt_converter(
Let's put all the converters for our EdgeLLM ops in their own file.
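The suggested organization could look like the sketch below: a dedicated module (a hypothetical `tools/llm/edge_llm_converters.py`) that holds every EdgeLLM converter. The registry decorator here is a simplified stand-in for `@dynamo_tensorrt_converter`, just to show the grouping pattern.

```python
# Hypothetical tools/llm/edge_llm_converters.py: one place for all
# EdgeLLM op converters. EDGE_LLM_CONVERTERS and edge_llm_converter
# are illustrative stand-ins, not the real torch_tensorrt registry.
EDGE_LLM_CONVERTERS = {}


def edge_llm_converter(op_name):
    """Register a converter function under its target op name."""

    def decorator(fn):
        EDGE_LLM_CONVERTERS[op_name] = fn
        return fn

    return decorator


@edge_llm_converter("tensorrt_edge_llm::xqa_attn")
def convert_xqa_attn(ctx, target, args, kwargs, name):
    # A real implementation would emit the TensorRT plugin layer here;
    # this stub only demonstrates where that code would live.
    return f"plugin layer for {name}"
```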
narendasan left a comment
Do we have lowering passes that insert the TensorRT Edge-LLM ops in place of the PyTorch ops?
    enabled_precisions={torch.float32},
    use_explicit_typing=True,
When `use_explicit_typing` is true, `enabled_precisions` should be removed.
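The rule from this comment can be captured as a small helper. This is a sketch of the constraint, not torch_tensorrt API: `build_compile_kwargs` is a hypothetical name, and the assumed behavior is that explicit typing takes precisions from the graph's dtypes, so `enabled_precisions` must be dropped.

```python
# Sketch (assumed semantics per the review comment): with explicit
# typing enabled, precisions come from the graph itself, so the
# enabled_precisions setting is omitted rather than passed alongside it.
def build_compile_kwargs(use_explicit_typing, enabled_precisions=None):
    kwargs = {"use_explicit_typing": use_explicit_typing}
    if use_explicit_typing:
        # Drop enabled_precisions entirely; passing both is inconsistent.
        return kwargs
    if enabled_precisions is not None:
        kwargs["enabled_precisions"] = enabled_precisions
    return kwargs
```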
    device=device,
    disable_tf32=True,
    min_block_size=1,
    debug=debug,
`debug` is deprecated. Please use `with torch_tensorrt.dynamo.Debugger(...)` instead.
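The suggested replacement might look like the following sketch. The `Debugger` argument shown (`log_level`) is an assumption; consult the `torch_tensorrt.dynamo.Debugger` documentation for the exact signature before relying on it.

```python
# Sketch: replace the deprecated debug=True kwarg with the Debugger
# context manager. log_level="debug" is an assumed argument name.
import torch_tensorrt

with torch_tensorrt.dynamo.Debugger(log_level="debug"):
    trt_module = torch_tensorrt.dynamo.compile(
        exported_program,
        inputs=inputs,
        # ... other settings, without the deprecated debug kwarg ...
    )
```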
@@ -7,7 +7,9 @@ This directory provides utilities and scripts for compiling, optimizing, and ben
- **Model Support:** Works with popular LLMs such as Llama-3, Qwen2.5, etc.
- **VLM Support:** Supports Visual Language Models like Qwen2.5-VL and Eagle2.
- **Precision Modes:** Supports FP16, BF16, and FP32.
- **Quantization:** Supports FP8 and NVFP4 quantization formats for reduced memory usage and improved inference speed.