[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend #15655
base: main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from eac3b95 to 49157b1.
I've got some performance numbers using the MLPerf Llama2-70B inference benchmark. I retokenised the dataset for Llama3.1.
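For context, here is a minimal sketch of how Multi-LoRA throughput can be measured offline with vLLM's `LLM` API. This is not the actual MLPerf harness used for the numbers above; the model name, adapter path, and prompt set are placeholders.

```python
# Illustrative only: offline throughput measurement with a single LoRA adapter.
# Model/adapter paths and prompts are placeholders, not the benchmark setup used above.
import time
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=128)
prompts = ["Summarise the following text: ..."] * 256  # placeholder prompts

start = time.perf_counter()
outputs = llm.generate(
    prompts,
    params,
    lora_request=LoRARequest("adapter0", 1, "/path/to/lora-adapter"),
)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} output tokens/s")
```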
Benchmarking LoRA against baseline (no LoRA) throughput

We use NVIDIA's GenAI-Perf tool to force fixed-length inputs and outputs and produce heatmap plots. On TPU-v6e and H100 instances, we vary the inputs from 128 to 8k; on L4 instances, we vary the inputs from 128 to 2k. We calculate the LoRA slowdown as ((LoRA throughput / baseline throughput) - 1) * 100%.

Llama3.1-8B
- 1x TPU-v6e: the LoRA slowdown varies from -8.4% to -23.9%.
- 1x GPU-L4: the LoRA slowdown varies from -17.3% to -32.8%.
- 1x GPU-H100: the LoRA slowdown varies from -10.0% to -51.8%.

Llama3.1-70B
- 8x TPU-v6e: the LoRA slowdown varies from -20.7% to -46.3%.
- 8x GPU-L4: the LoRA slowdown varies from -13.8% (second best: -25.1%) to -49.7%.
- 4x GPU-H100: unable to launch VMs due to persistent unavailability across multiple zones and regions.
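For clarity, a small Python sketch of the slowdown calculation used above (the throughput values here are hypothetical, for illustration only):

```python
def lora_slowdown_pct(lora_throughput: float, baseline_throughput: float) -> float:
    """LoRA slowdown as ((LoRA throughput / baseline throughput) - 1) * 100%.

    Negative values mean LoRA throughput is lower than the baseline.
    """
    return (lora_throughput / baseline_throughput - 1) * 100

# Hypothetical throughputs (output tokens/s), for illustration only:
print(f"{lora_slowdown_pct(lora_throughput=850.0, baseline_throughput=1000.0):.1f}%")  # -15.0%
```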
Summary
This PR optimises the Multi-LoRA implementation from #14238 and should be merged after it.

This includes several kernel optimisations:

And a few general ones:
- `expand` op (a82f3fe)

Things left / RFC
- `LogitsProcessorWithLoRA` introduces a long (~1.5 second) stall when it's enabled, but not much activity seems to happen on the CPU or TPU during this time. I've disabled this for now.
- `LogitsProcessorWithLoRA` is always created even if there's no LoRA adapter that needs it; is there a reason for this?
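For readers less familiar with the Multi-LoRA path, here is a naive PyTorch reference of the per-token shrink/expand computation that kernels such as the `expand` op accelerate. This is an illustrative sketch, not the PR's kernel code; the tensor names and shapes are assumptions.

```python
# Illustrative only: a naive reference for the multi-LoRA shrink/expand computation.
# Optimised TPU/GPU kernels batch and fuse this work instead of looping per token.
import torch

def multi_lora_shrink_expand(
    x: torch.Tensor,               # [num_tokens, hidden_dim] base-layer input
    lora_a: torch.Tensor,          # [num_loras, hidden_dim, rank] "shrink" weights
    lora_b: torch.Tensor,          # [num_loras, rank, out_dim] "expand" weights
    token_lora_ids: torch.Tensor,  # [num_tokens] index of the LoRA adapter per token
) -> torch.Tensor:
    out = x.new_zeros(x.shape[0], lora_b.shape[-1])
    for i in range(x.shape[0]):
        lora_id = token_lora_ids[i]
        # "Shrink" the activation down to the LoRA rank, then "expand" back up.
        out[i] = (x[i] @ lora_a[lora_id]) @ lora_b[lora_id]
    return out
```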