38 commits
5b8d30e
Added BD-LoRA support
Conzel Oct 22, 2025
4e57924
Changed BD-LoRA weights to be 2d instead of 3d
Conzel Oct 22, 2025
872b91d
Fixed formatting and style
Conzel Nov 3, 2025
08c4a98
Add docstrings and updated comments
Conzel Nov 3, 2025
1969d95
Enable mixed_compatible
Conzel Nov 3, 2025
fd768d7
Fixed failing tests due to previous refactors
Conzel Nov 3, 2025
43782b5
Fixed shaping issue with BD-LoRA
Conzel Nov 3, 2025
069302d
Add usage example for bdlora
Conzel Nov 3, 2025
3a96bc3
Created experiment results
Conzel Nov 4, 2025
9404334
Merge branch 'huggingface:main' into bdlora
Conzel Nov 4, 2025
797c7af
Updated notebook with clarification on sharding and performance
Conzel Nov 4, 2025
d5f7b28
Moved BD-LoRA implementation to a LoRA variant
Conzel Nov 10, 2025
f8e4e89
Removed BD-LoRA from peft types
Conzel Nov 18, 2025
d6f1f1d
Renamed attributes in BD-LoRA Config
Conzel Nov 18, 2025
d9a5edd
Add post-init check to ensure no overlap in BD-modules
Conzel Nov 18, 2025
83fd06d
Improved error message for BlockDiagonalLinear if bias is supplemented
Conzel Nov 18, 2025
9a0bf5d
Fixed adapter merging in BD-LoRA and added tests
Conzel Nov 18, 2025
7657cb7
Moved tests to custom model matrix
Conzel Nov 18, 2025
e49ff77
Fixed issues discovered through new tests
Conzel Nov 18, 2025
55460f7
Updated notebook to new BD-LoRA variant implementation
Conzel Nov 18, 2025
c0b599d
Add check that features are divisible by blocks
Conzel Nov 18, 2025
639578a
Fixed wrong default argument
Conzel Nov 18, 2025
6f9a496
Add more information in the BdLoraConfig Docstring
Conzel Nov 18, 2025
fac0291
Minor refactors
Conzel Nov 20, 2025
4e15182
Added match_strict option to BD-LoRA
Conzel Nov 20, 2025
cf459d9
Removed kaiming_uniform argument for a, not needed
Conzel Nov 20, 2025
fec4643
Shortened docstrings
Conzel Nov 20, 2025
9cc8f03
Changed BD-LoRA experiments
Conzel Nov 20, 2025
362c5c0
Add example with vLLM integration
Conzel Nov 20, 2025
a6e7819
Added tests for BD-LoRA initialization
Conzel Nov 21, 2025
2fce6c2
Improved BD-LoRA finetuning example
Conzel Nov 24, 2025
b7f1a9e
Cleaned up asserts and spelling in BD-LoRA implementation
Conzel Nov 24, 2025
9b8e965
Cleaned up tests and removed bias assertion
Conzel Nov 24, 2025
86913b6
Update examples/bdlora_finetuning/chat.py
Conzel Nov 25, 2025
dc0c8e9
Merge branch 'main' of https://github.com/huggingface/peft into bdlora
Conzel Nov 25, 2025
512aef9
Removed all BD-LoRA experiments besides the one for rank 14
Conzel Nov 25, 2025
9dd8166
Moved target name to end of argument list
Conzel Nov 25, 2025
84d6b4c
Fixed shape error with float16 forward pass on compatible devices
Conzel Dec 3, 2025
15 changes: 15 additions & 0 deletions examples/bdlora_finetuning/README.md
@@ -0,0 +1,15 @@
# BD-LoRA Finetuning

Block-Diagonal LoRA (BD-LoRA) is a LoRA variant in which some LoRA factors are constrained to be block-diagonal.
This enables faster serving by eliminating communication overhead when running inference across multiple GPUs, while matching the finetuning performance of vanilla LoRA.
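
A minimal sketch of the block-diagonal constraint in plain PyTorch (illustrative dimensions only, not this PR's API; here the up-projection B is taken as the block-diagonal factor for illustration). Because B has no cross-block entries, each tensor-parallel rank can apply its own block locally, avoiding a cross-GPU gather on the LoRA path:

```python
import torch

# Illustrative dimensions only: d features, LoRA rank r, two diagonal blocks.
d, r, blocks = 16, 4, 2
x = torch.randn(1, d)

A = torch.randn(r, d)  # down-projection, dense as in vanilla LoRA
# Block-diagonal up-projection: `blocks` independent blocks of shape (d/blocks, r/blocks).
B_blocks = torch.randn(blocks, d // blocks, r // blocks)

# Materialize B as one (d, r) matrix for clarity; a sharded implementation would
# keep the blocks separate so each tensor-parallel rank only touches its own block.
B = torch.block_diag(*B_blocks)

delta = x @ A.T @ B.T  # the LoRA update path: x -> A -> block-diagonal B
print(delta.shape)  # torch.Size([1, 16])
```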

For an overview of how to use BD-LoRA, see the Python notebook at `peft/examples/bdlora_finetuning/bdlora_peft_demo.ipynb`.

To benefit from the inference speed-ups, you need an inference engine that supports BD-LoRA. At the moment, an experimental PR at https://github.com/vllm-project/vllm/pull/28136 adds BD-LoRA support to vLLM. If you find this work useful, consider leaving a comment there.

To install it, clone the fork at https://github.com/Conzel/vllm/tree/bdlora-bk, then install vLLM following the usual instructions: https://docs.vllm.ai/en/stable/getting_started/installation/. We assume a hardware setup with at least 2 available GPUs.

This example folder contains three files:
- `bdlora_peft_demo.ipynb` Showcases how to instantiate a BD-LoRA model, train it, and save/reload the weights.
- `vllm_server.bash` Spins up a BD-LoRA-compatible vLLM server. To use it, you need to run the notebook once to create adapters in the correct format.
- `chat.py` Queries the vLLM server after it has finished booting up. Usage example: `python3 chat.py --target lora1`; a sketch of such a query is shown below.
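
For reference, a hedged sketch of what such a query could look like via vLLM's OpenAI-compatible API, assuming the server listens on the default port 8000 and the adapter was registered under the name `lora1` (the actual `chat.py` in this folder may differ):

```python
# Sketch only: server address, port, and adapter name "lora1" are assumptions
# based on vLLM's default OpenAI-compatible serving setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="lora1",  # select the BD-LoRA adapter by its served name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```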
Binary file added examples/bdlora_finetuning/bdlora-sharding.png