
[Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend #15655

Status: Open. Wants to merge 186 commits into main from the tpu_bgmv_optimisation branch.
Conversation

@Akshat-Tripathi (Contributor) commented Mar 27, 2025

Summary

This PR optimises the Multi-LoRA implementation from #14238 and should be merged after it.

This includes several kernel optimisations:

  • Block size tuning 2bb8868 d7338f8
  • Faster mask creation 2aacb34
  • Allowing for some blocks to be skipped 6ee0b57
  • Adding LoRA Laning eb804a0
  • Splitting the Pallas kernel into shrink/expand variants de6746a (see the sketch after this list)
  • Removing masking when only 1 LoRA adapter is used aad109b
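
To make the shrink/expand split and the single-adapter fast path concrete, here is a minimal plain-JAX sketch of the semantics the bgmv kernels compute. The shapes, function names, and reference implementation below are illustrative assumptions, not the Pallas code in this PR.

```python
# Minimal plain-JAX sketch (NOT the actual Pallas kernels in this PR) of the
# semantics behind the shrink/expand split and the single-adapter fast path.
# Shapes and names are illustrative assumptions.
import jax.numpy as jnp

def bgmv_shrink(x, lora_a, token_lora_ids):
    # x: [num_tokens, hidden], lora_a: [num_loras, rank, hidden]
    a = lora_a[token_lora_ids]                 # gather each token's LoRA-A
    return jnp.einsum("th,trh->tr", x, a)      # -> [num_tokens, rank]

def bgmv_expand(v, lora_b, token_lora_ids, y):
    # v: [num_tokens, rank], lora_b: [num_loras, out_dim, rank]
    b = lora_b[token_lora_ids]                 # gather each token's LoRA-B
    return y + jnp.einsum("tr,tor->to", v, b)  # accumulate into the base output

def bgmv_expand_single_lora(v, lora_b, y):
    # With exactly one active adapter there is nothing to gather or mask:
    # every token uses lora_b[0], so this collapses to a single dense matmul.
    return y + v @ lora_b[0].T
```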

And a few general ones:

  • Pre-transposing the LoRA adapters used in the expand op a82f3fe (see the sketch after this list)
  • Reducing recompilations 5638e7d
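
As a hedged illustration of the pre-transposition, the LoRA-B stack can be transposed once at adapter load time so the expand op consumes a [num_loras, rank, out_dim] layout directly instead of transposing in the hot path. This reuses the hypothetical shapes from the sketch above and is not the PR's implementation.

```python
# Hypothetical sketch of the pre-transposition, reusing the shapes above.
import jax.numpy as jnp

def pretranspose_lora_b(lora_b):
    # Run once when the adapter is loaded, not in the forward pass.
    # lora_b: [num_loras, out_dim, rank] -> [num_loras, rank, out_dim]
    return jnp.transpose(lora_b, (0, 2, 1))

def bgmv_expand_pretransposed(v, lora_b_t, token_lora_ids, y):
    # v: [num_tokens, rank], lora_b_t: [num_loras, rank, out_dim]
    b_t = lora_b_t[token_lora_ids]             # [num_tokens, rank, out_dim]
    return y + jnp.einsum("tr,tro->to", v, b_t)
```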

Things left/RFC

  • There are still a few recompilations at the start of a run that I need to track down
  • LogitsProcessorWithLoRA introduces a long (~1.5 second) stall when it's enabled, but not much activity seems to happen on the CPU or TPU during this time. I've disabled this for now.
  • It seems LogitsProcessorWithLoRA is always created even when no LoRA adapter needs it; is there a reason for this?
  • I have microbenchmarks for the kernels, but I'm not sure where they should live.

@mgoin self-requested a review April 3, 2025 09:26

mergify bot commented Apr 3, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @Akshat-Tripathi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 3, 2025
@Akshat-Tripathi force-pushed the tpu_bgmv_optimisation branch from eac3b95 to 49157b1 on April 4, 2025 13:02
@Akshat-Tripathi (Contributor, Author) commented:

I've got some performance numbers using the MLPerf Llama2-70B inference benchmark. I retokenised the dataset for Llama3.1.

| Model | Parameters | Without LoRA (tok/s) | With LoRA (tok/s) |
|---|---|---|---|
| Llama3.1 | 8B | 1621.62 | 1426.01 |
| Llama3.1 | 70B | 432.964 | 326.709 |

@psyhtest commented Apr 7, 2025

Benchmarking LoRA against baseline (no LoRA) throughput

We use NVIDIA's GenAI-Perf tool to force fixed-length inputs and outputs and produce "heatmap" plots like those below. On TPU-v6e and H100 instances, we vary the input lengths from 128 to 8k tokens; on L4 instances, from 128 to 2k tokens.

We calculate the LoRA slowdown as ((LoRA throughput / baseline throughput) - 1) * 100%.
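
As a worked example of this metric, using the MLPerf throughput table from the earlier comment (not the GenAI-Perf runs below): for Llama3.1 8B, (1426.01 / 1621.62 - 1) * 100% ≈ -12.1%.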

Llama3.1-8B

1x TPU-v6e

The LoRA slowdown varies from -8.4% to -23.9%.

[Heatmap: Llama3.1-8B, 1x TPU-v6e]

1x GPU-L4

The LoRA slowdown varies from -17.3% to -32.8%.

[Heatmap: Llama3.1-8B, 1x GPU-L4]

1x GPU-H100

The LoRA slowdown varies from -10.0% to -51.8%.

[Heatmap: Llama3.1-8B, 1x GPU-H100]

Llama3.1-70B

8x TPU-v6e

The LoRA slowdown varies from -20.7% to -46.3%.

[Heatmap: Llama3.1-70B, 8x TPU-v6e]

8x GPU-L4

The LoRA slowdown varies from -13.8% (second best: -25.1%) to -49.7%.

[Heatmap: Llama3.1-70B, 8x GPU-L4]

4x GPU-H100

We were unable to launch VMs due to persistent unavailability across multiple zones and regions.

@mergify mergify bot removed the needs-rebase label Apr 9, 2025
Labels: ci/build, tpu (Related to Google TPUs), v1