[FEAT] Integrate BD-LoRA into PEFT #2895

base: main

Conversation
Thanks for your work on integrating block-diagonal LoRA into PEFT. I haven't checked the details yet, but wanted to give a first, high-level feedback.

1. Parallelism

The main advantage of BD-LoRA compared to LoRA would be the better parallelism story. Did you verify that it works with the proposed implementation? I think having a small example would be really helpful. AFAICT, the included notebook doesn't really cover that.

2. Implementation approach

You implemented BD-LoRA as a subclass of normal LoRA. The idea is probably to leverage the existing code as much as possible. However, it can also be problematic, as BD-LoRA may or may not always be a substitute for LoRA, so some existing checks may not carry over.

I wonder if you have considered implementing BD-LoRA as a LoRA variant instead. E.g. DoRA is implemented as a LoRA variant. The advantage would be that we have a defined interface for BD-LoRA. From a user's point of view, it could look like this:

```python
from peft import LoraConfig, BdLoraConfig

bd_lora_config = BdLoraConfig(
    lora_a_is_blockdiagonal=["o_proj", "down_proj"],
    lora_b_is_blockdiagonal=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj"],
    nblocks=2,
)
lora_config = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj", "k_proj", "up_proj", "gate_proj", "o_proj", "down_proj"],
    use_bdlora=bd_lora_config,
)
```

Maybe this idea won't quite work because LoRA variants are not the right abstraction, but I wanted to discuss this first before proceeding.
Hey @BenjaminBossan, thank you very much for your thoughtful comments.
If all adapters can be kept in GPU memory:
If no adapter can be kept in GPU memory:

I agree that adding some information on how to integrate BD-LoRA into vLLM would be a good idea. I would wait until the PR in vLLM is accepted (or at least ready to merge), and then add further information to the notebook.
@BenjaminBossan Is there something else that we need to move forward / that I can work on?
BenjaminBossan left a comment
Thanks a lot for refactoring the PR to implement BD-LoRA as a LoRA variant. I think this is indeed the much nicer approach, as it integrates better and is more future proof.
This would involve adding the target_name parameter in the create_and_replace method in lora/model.py to record the current key, as BD-LoRA needs this parameter to determine whether a module is row- or column-sharded.
We can do that but I hope that we can find a way to avoid it. I started a discussion about the choice of LoRA A vs B, depending on the outcome there, we may be able to streamline the implementation.
I agree that adding some information on how to integrate BD-LoRA vLLM would be a good idea. I would wait until the PR in vLLM is accepted (or at least ready to merge), and then add further information to the notebook.
Sounds good. Apart from vLLM, could we also expect potential speedups from transformers TP with this technique (at least for those models that support it)? If yes, it would be great if we could include an example.
Is there something else that we need to move forward / that I can work on?
Sorry for the delay, we had our hands full with the last PEFT release and updates required for the upcoming transformers v5.
Thanks for providing the results. Note that for the final PR, we need to remove this file, as we will run the experiment on our own hardware to keep results comparable.
Regarding the settings, do you have an idea why the end result, the test accuracy, is relatively low at 36.8%? For comparison vanilla LoRA has 48.2% test accuracy. And that is with 9M trainable parameters, compared to 49M trainable parameters used here. From the paper results, I would have expected a very close match in performance.
Thanks for providing the results. Note that for the final PR, we need to remove this file, as we will run the experiment on our own hardware to keep results comparable.
Okay, that makes a lot of sense.
Regarding the settings, do you have an idea why the end result, the test accuracy, is relatively low at 36.8%? For comparison vanilla LoRA has 48.2% test accuracy. And that is with 9M trainable parameters, compared to 49M trainable parameters used here. From the paper results, I would have expected a very close match in performance.
At the moment, BD-LoRA uses many more parameters, as I set the LoRA adapters on every linear module. I just noticed that in the LoRA result we only apply the adapters to the v and q projections. Applying them to all layers might already cause overfitting/require different learning rates, which explains BD-LoRA's comparatively worse performance.
Only adding LoRA adapters to v_proj and q_proj might not be the best test scenario for BD-LoRA, as then we only test performance for modules where the LoRA B adapter is block-diagonal. Do you have any suggestions?
We have a vanilla LoRA experiment that targets the 'MLP' part of the model. You could do the same, which would allow a mix of A and B being BD. Let's try to match the rank to achieve a similar number of trainable parameters (9-10M). The test accuracy for this experiment is 0.526.
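For illustration, a rough sketch of how such a config could look, assuming the MLP experiment targets the usual Llama-style gate/up/down projections and that `BdLoraConfig`/`use_bdlora` end up exposed as in the example earlier in this thread; the rank is only a knob to land near the ~9-10M trainable parameters of the LoRA baseline:

```python
from peft import LoraConfig, BdLoraConfig

# down_proj is row-sharded under tensor parallelism -> LoRA A block-diagonal;
# gate_proj/up_proj are column-sharded -> LoRA B block-diagonal.
bd_config = BdLoraConfig(
    target_modules_bd_a=["down_proj"],
    target_modules_bd_b=["gate_proj", "up_proj"],
    nblocks=2,
)
lora_config = LoraConfig(
    r=14,  # pick r so that the trainable parameter count matches the vanilla LoRA baseline
    target_modules=["gate_proj", "up_proj", "down_proj"],
    use_bdlora=bd_config,
)
```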
src/peft/tuners/lora/config.py (Outdated)

```python
        nblocks: Number of blocks in block-diagonal matrices
    """

    lora_a_is_blockdiagonal: Optional[list[str]] = field(
```
The name lora_a_is_blockdiagonal would suggest the argument to be a bool. How about something like target_modules_blockdiagonal_a or even target_modules_bd_a (same for b)?
As to the meaning of these arguments: How would I, as a user, make the correct choice for which of A or B should be block-diagonal? I think it would be great if this could be automated (at least, as an option), for instance based on the base weight. Do you think that would be possible? If not, then at the very least the doc needs to be improved so that the users know what they should set here.
Sounds good, I have updated the attribute.
Regarding the meaning: Yes, it would be great to automate it. However, I am not sure if this can be done in a robust manner.
As a general guideline: To use BD-LoRA, we want to set the QKV and up projections as B-block-diagonal, and the down/out projections as A-block-diagonal. The exact sharding scheme, however, might depend on the implementation in the downstream inference engine. As an example, vLLM re-implements each supported model architecture and explicitly encodes which layers are ColumnParallel (needs LoRA B block-diagonal) and which ones are RowParallel (needs LoRA A block-diagonal). Here is Llama as an example: https://github.com/vllm-project/vllm/blob/0af3d4f0df360decc2115f43f5e4bc732342e7e4/vllm/model_executor/models/llama.py
If you have some ideas on how to determine this automatically, I would be very glad.
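To make the guideline concrete, here is a small hypothetical helper (not part of this PR) that splits a target-module list according to vLLM's Llama layout; the two sets would need to be adapted per model and inference engine:

```python
# ColumnParallel layers need LoRA B block-diagonal, RowParallel layers need LoRA A block-diagonal.
# These sets mirror vLLM's Llama implementation linked above; for other models they are an assumption.
COLUMN_PARALLEL = {"q_proj", "k_proj", "v_proj", "gate_proj", "up_proj"}  # -> B block-diagonal
ROW_PARALLEL = {"o_proj", "down_proj"}                                    # -> A block-diagonal


def split_bd_targets(target_modules: list[str]) -> tuple[list[str], list[str]]:
    """Return (target_modules_bd_a, target_modules_bd_b) for the given target modules."""
    bd_a = [m for m in target_modules if m in ROW_PARALLEL]
    bd_b = [m for m in target_modules if m in COLUMN_PARALLEL]
    return bd_a, bd_b
```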
src/peft/tuners/lora/variants.py (Outdated)

```python
        self.weight = nn.Parameter(torch.empty(out_features, in_features // nblocks))
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if bias:
            raise NotImplementedError
```
Let's raise with a proper error message.
Done!
src/peft/tuners/lora/variants.py (Outdated)

```python
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_dims = x.shape[:-1]
        if not len(x.shape) == 2:
```
A bit clearer IMO:
```diff
-        if not len(x.shape) == 2:
+        if x.dim() != 2:
```
Done!
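For readers following along, here is a minimal, self-contained sketch of what a block-diagonal linear factor could look like; the shapes follow the diff hunks quoted in this thread, but this is an illustration under those assumptions, not the exact implementation in the PR:

```python
import math

import torch
import torch.nn as nn


class BlockDiagonalLinear(nn.Module):
    """Sketch: a linear map whose (out_features x in_features) weight is block-diagonal with nblocks blocks."""

    def __init__(self, in_features: int, out_features: int, nblocks: int, bias: bool = False):
        super().__init__()
        if in_features % nblocks or out_features % nblocks:
            raise ValueError(
                f"in_features={in_features} or out_features={out_features} not divisible by nblocks={nblocks}"
            )
        if bias:
            raise ValueError("Bias is not implemented for BlockDiagonalLinear layer.")
        self.in_features, self.out_features, self.nblocks = in_features, out_features, nblocks
        # All blocks stored in one tensor; block i maps in_features // nblocks -> out_features // nblocks.
        self.weight = nn.Parameter(torch.empty(out_features, in_features // nblocks))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first_dims = x.shape[:-1]
        if x.dim() != 2:
            x = x.reshape(-1, x.shape[-1])
        xb = x.reshape(x.shape[0], self.nblocks, -1)           # (N, nblocks, in // nblocks)
        wb = self.weight.view(self.nblocks, -1, xb.shape[-1])  # (nblocks, out // nblocks, in // nblocks)
        out = torch.einsum("nbi,boi->nbo", xb, wb)             # independent per-block matmuls
        return out.reshape(*first_dims, self.out_features)
```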
src/peft/tuners/lora/variants.py (Outdated)

```python
    @staticmethod
    def merge_safe(module: Linear, active_adapter: str, orig_weight: torch.Tensor) -> torch.Tensor:
        return orig_weight + module.get_delta_weight(active_adapter)
```
Hmm, is it expected that the use of block diagonal LoRA does not affect how merging should be performed?
This was an oversight, thank you. I have added the appropriate method for getting the delta weight in the case of BD-LoRA adapters.
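As a rough illustration of why merging needs special handling: the compactly stored block-diagonal factor has to be expanded to its dense form before the usual B @ A product is formed. The helper below is hypothetical and assumes the (out_features, in_features // nblocks) storage layout from the hunks above; which of A or B is block-diagonal depends on the module, as discussed earlier:

```python
import torch


def expand_block_diagonal(weight: torch.Tensor, nblocks: int) -> torch.Tensor:
    """Expand a compact (out_features, in_features // nblocks) tensor into a dense
    (out_features, in_features) block-diagonal matrix."""
    blocks = weight.view(nblocks, weight.shape[0] // nblocks, weight.shape[1])
    return torch.block_diag(*blocks)


# Example: delta_W = B_dense @ A, where B is the block-diagonal factor.
r, hidden, nblocks = 8, 16, 2
lora_A = torch.randn(r, hidden)                     # dense LoRA A: (r, in_features)
lora_B_compact = torch.randn(hidden, r // nblocks)  # compact block-diagonal LoRA B: (out_features, r // nblocks)
delta_W = expand_block_diagonal(lora_B_compact, nblocks) @ lora_A
assert delta_W.shape == (hidden, hidden)
```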
src/peft/utils/peft_types.py (Outdated)

```
    - PREFIX_TUNING
    - LORA
    - ADALORA
    - BDLORA
```
Can be removed, since we're going with LoRA variants.
Done!
src/peft/utils/peft_types.py (Outdated)

```python
    PREFIX_TUNING = "PREFIX_TUNING"
    LORA = "LORA"
    ADALORA = "ADALORA"
    BDLORA = "BDLORA"
```
Can be removed, since we're going with LoRA variants.
Done!
tests/test_bdlora.py (Outdated)

```python
    return X


class TestBdLora:
```
Thanks for thinking about adding tests. From skimming those, I think most if not all can be covered by extending existing tests. To do so, check e.g. how the test matrix is expanded for DoRA:
peft/tests/test_custom_models.py, lines 91 to 134 in 9d3e0e5:

```python
    ("Vanilla MLP 7 LoRA with DoRA", "MLP", LoraConfig, {"target_modules": ["lin0"], "use_dora": True}),
    ("Vanilla MLP 8 LoRA with DoRA", "MLP", LoraConfig, {"target_modules": ["lin0", "lin1"], "use_dora": True}),
    (
        "Vanilla MLP 9 LoRA with DoRA",
        "MLP",
        LoraConfig,
        {"target_modules": "lin1", "use_dora": True, "lora_alpha": 32},
    ),
    ("Embedding + transformers Conv1D 1 LoRA", "EmbConv1D", LoraConfig, {"target_modules": ["conv1d"]}),
    ("Embedding + transformers Conv1D 2 LoRA", "EmbConv1D", LoraConfig, {"target_modules": ["emb"]}),
    ("Embedding + transformers Conv1D 3 LoRA", "EmbConv1D", LoraConfig, {"target_modules": ["emb", "conv1d"]}),
    (
        "Embedding + transformers Conv1D 1 DoRA",
        "EmbConv1D",
        LoraConfig,
        {"target_modules": ["conv1d"], "use_dora": True},
    ),
    ("Embedding + transformers Conv1D 2 DoRA", "EmbConv1D", LoraConfig, {"target_modules": ["emb"], "use_dora": True}),
    (
        "Embedding + transformers Conv1D 3 DoRA",
        "EmbConv1D",
        LoraConfig,
        {"target_modules": ["emb", "conv1d"], "use_dora": True},
    ),
    (
        "Embedding + transformers Conv1D 1 LoRA trainable_tokens",
        "EmbConv1D",
        LoraConfig,
        {"target_modules": ["conv1d"], "trainable_token_indices": {"emb": [0, 10]}},
    ),
    ("Conv1d LoRA", "Conv1d", LoraConfig, {"target_modules": ["conv1d"]}),
    ("Conv1d LoRA with DoRA", "Conv1d", LoraConfig, {"target_modules": ["conv1d"], "use_dora": True}),
    ("Conv2d 1 LoRA", "Conv2d", LoraConfig, {"target_modules": ["conv2d"]}),
    ("Conv2d 2 LoRA", "Conv2d", LoraConfig, {"target_modules": ["conv2d", "lin0"]}),
    ("Conv2d 1 LoRA with DoRA", "Conv2d", LoraConfig, {"target_modules": ["conv2d"], "use_dora": True}),
    ("Conv2d 2 LoRA with DoRA", "Conv2d", LoraConfig, {"target_modules": ["conv2d", "lin0"], "use_dora": True}),
    ("Conv2d Groups LoRA", "Conv2dGroups", LoraConfig, {"target_modules": ["conv2d"]}),
    ("Conv2d Groups2 LoRA", "Conv2dGroups2", LoraConfig, {"target_modules": ["conv2d"]}),
    ("Conv2d Groups LoRA with DoRA", "Conv2dGroups", LoraConfig, {"target_modules": ["conv2d"], "use_dora": True}),
    ("Conv2d Groups2 LoRA with DoRA", "Conv2dGroups2", LoraConfig, {"target_modules": ["conv2d"], "use_dora": True}),
    ("Conv3d 1 LoRA", "Conv3d", LoraConfig, {"target_modules": ["conv3d"]}),
    ("Conv3d 2 LoRA", "Conv3d", LoraConfig, {"target_modules": ["conv3d", "lin0"]}),
    ("Conv3d 1 LoRA with DoRA", "Conv3d", LoraConfig, {"target_modules": ["conv3d"], "use_dora": True}),
    ("Conv3d 2 LoRA with DoRA", "Conv3d", LoraConfig, {"target_modules": ["conv3d", "lin0"], "use_dora": True}),
```
If you add similar settings for BD-LoRA there (of course only for linear layers), it should automatically cover lots of use cases of BD-LoRA. Then we can check the test coverage for gaps and add more BD-LoRA specific tests if needed.
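For instance, entries along these lines could be added to the matrix (a sketch mirroring the DoRA pattern above; the entries the PR actually ended up with are shown further below):

```python
(
    "Vanilla MLP BD-LoRA",
    "MLP",
    LoraConfig,
    {
        "target_modules": ["lin0", "lin1"],
        "use_bdlora": BdLoraConfig(target_modules_bd_a=["lin0"], target_modules_bd_b=["lin1"], nblocks=2),
    },
),
```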
That is true, thank you! I have moved the tests to the test_custom_models.py file. The only test that could not be converted is the block-diagonal structure test, but that one is maybe a bit too implementation-dependent anyway, so I have removed it.
Related to type casting & initialization.
@BenjaminBossan Thank you very much for your thorough review! We can do that but I hope that we can find a way to avoid it. I started a discussion about the choice of LoRA A vs B, depending on the outcome there, we may be able to streamline the implementation.
Transformers TP would also need a PR that adds BD-LoRA compatibility, similar to vLLM. I am not too familiar with Transformers TP, but if Multi-LoRA serving with Megatron-style sharding is implemented already, it should not be much work.
No worries, thanks for your time. I hope to have addressed all your concerns. By moving the tests to the LoRA variant test matrix, I have found some more minor issues (related to dtype and multi-adapter settings) which I have also fixed.
BenjaminBossan left a comment
Thanks for the updates, the PR is getting close to the finish line. I did another review, please check.
Regarding testing: There are a few checks being performed at initialization time, e.g. for BD target module overlap or for % nblocks. Could you please add tests for those to test_initialization.py? I think the existing tests there should give you a good idea how to implement the tests, if not LMK.
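For example, a minimal sketch of such an initialization test, assuming the overlap check raises a ValueError when the config is created (the exact error message and location may differ):

```python
import pytest

from peft import LoraConfig, BdLoraConfig


def test_bdlora_overlapping_bd_a_and_bd_b_raises():
    # The same module must not be requested as both A- and B-block-diagonal.
    with pytest.raises(ValueError, match="overlap"):
        LoraConfig(
            target_modules=["q_proj", "o_proj"],
            use_bdlora=BdLoraConfig(
                target_modules_bd_a=["q_proj"],
                target_modules_bd_b=["q_proj"],
                nblocks=2,
            ),
        )
```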
Transformers TP would also need a PR that adds BD-LoRA compatibility, similar to vLLM. I am not too familiar with Transformers TP, but if Multi-LoRA serving with Megatron-style sharding is implemented already, it should not be much work.
Okay, let's leave that aside then. It would be nice to have but is not required for this PR.
As a general guideline: To use BD-LoRA, we want to set QKV and up projections as B-block-diagonal, and down,out projections as A-block-diagonal. The exact sharding scheme however might depend on the implementation in the downstream inference engine. As an example, vLLM re-implements each supported model architecture and explicitly encodes which layers are ColumnParallel (needs LoRA B block-diagonal) and which ones are RowParallel (needs LoRA A block-diagonal): Here is Llama as an example: https://github.com/vllm-project/vllm/blob/0af3d4f0df360decc2115f43f5e4bc732342e7e4/vllm/model_executor/models/llama.py
If you have some ideas on how to determine this automatically, I would be very glad.
I see, thanks for explaining. In this case, I think it's hard to impossible to automate this step. Could you put this additional information into the docstring of BdLoraConfig too, e.g. by extending it here?
One thing we could do is to determine some defaults for popular models with support in vLLM and, if the user leaves target_modules_bd_a and target_modules_bd_b as None, check if the used model matches one of the defaults. This is, however, a bit of extra work, especially as these defaults can quickly get out of date. Up to you if you think it's worth the effort.
Is passing the target_name to the LoRA adapters okay or is there a more clean approach we can do?
I thought a bit about that and I think it's okay as is. Another idea that I had would be to provide two variants, one for BD-A and one for BD-B, and resolve which one to use based on the target name, but I think overall that would be equally complex, so we can leave the logic as is.
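For illustration, that alternative could be sketched as a small resolver keyed on the target name (hypothetical helper; the PR instead keeps a single variant and passes target_name to it):

```python
from typing import Optional


def resolve_bd_side(target_name: str, bd_a: Optional[list[str]], bd_b: Optional[list[str]]) -> str:
    """Return which LoRA factor should be block-diagonal for this module, based on its name."""
    if any(target_name == m or target_name.endswith(f".{m}") for m in (bd_a or [])):
        return "A"
    if any(target_name == m or target_name.endswith(f".{m}") for m in (bd_b or [])):
        return "B"
    return "none"  # vanilla LoRA, or an error when match_strict=True
```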
One more thing: could you try to limit lines to 120 characters? There are some very long strings, for instance in config.py. Calling make style should help with some of those.
src/peft/tuners/lora/config.py (Outdated)

```python
        # check for overlap
        if self.target_modules_bd_a is None or self.target_modules_bd_b is None:
            return  # one is empty so no overlap possible
        for module in self.target_modules_bd_a:
```
How about:
```python
overlap = set(self.target_modules_bd_a or []) & set(self.target_modules_bd_b or [])
if overlap:
    raise ValueError(f"Found overlapping ...")
```
src/peft/tuners/lora/variants.py (Outdated)

```python
            torch.nn.init.zeros_(self.weight)
        else:
            torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if bias:
```
Let's move this check up, directly below the check for % nblocks. Also, let's raise a ValueError instead.
src/peft/tuners/lora/variants.py (Outdated)

```python
    @staticmethod
    def init(module: Linear, adapter_name: str, **kwargs) -> None:
        use_bdlora = kwargs.get("use_bdlora")
        if not use_bdlora:
```
Can this ever happen? I don't think so, since the LoRA variant is only added if use_bdlora is given.
Yeah, that is true. I have converted it to an assert as a sanity check (also to make the type checker aware that this is not None).
Removing some unused checks and clarifying others
This raises an error if neither target_modules_bd_a or ..._b matches.
Updated experiments to include new results
Hey @BenjaminBossan, thanks again for your time.
Great, I added a match_strict argument that is set to True by default and raises an error if a targeted module matches neither the A nor the B block-diagonal list. I think this is a good solution.
This is great. I have added a similar experiment for BD-LoRA at rank 14 (to match the number of parameters) and get 0.51857 test accuracy on my device. I have also repeated the rank 44 experiment with the correct target modules (before, I had erroneously targeted all linear layers), this time achieving 0.479 accuracy.

Additionally, I have added an example for vLLM integration. I had to add two more script files, one for starting the server and one for querying it. At the moment, I have not added a real demo of the speed-up (just a demo of how to start the vLLM server with a BD-LoRA adapter and how to send requests to it). I think a speed-up demo would involve larger models and a more sophisticated platform for sending the requests (I used a modified LLMPerf fork for my experiments in the initial comments). If you think this is a priority, however, I can try to add this information.
I added information into the docstring in a previous commit, just above the line you linked. Is that enough or should I add more?
I think this could be a good idea further down the road. The vLLM maintainers have requested to keep the pull request in a draft state for now and want to include it in the codebase if community demand is high enough, so maybe keeping it mainly documented for Llama is okay at the moment. I think I am only missing the remaining initialization tests now; I will add them today or tomorrow morning. Thanks for your support! EDIT: Done!
BenjaminBossan left a comment
Thanks for the updates, there is not much left. I did another review, please check my comments.
This is great. I have added a similar experiment for BD-LoRA at rank 14 (to match the number of parameters) and have 0.51857 test accuracy on my device. Additionally, I have repeated the rank 44 experiment with the correct target modules (before, I had erroneously targeted all linear layers), this time achieving 0.479 accuracy.
Let's only keep the one with rank 14, as it matches the param count and is giving the best results among those.
I added information into the docstring in a previous commit, just above the line you linked. Is that enough or should I add more?
Looks good, thanks.
The vLLM maintainers have requested to keep the pull request in a draft state at the moment, and want to include it into the codebase if the community demand is high enough, so maybe keeping it mainly documented for LLama is okay at the moment.
I see, too bad, but I'm sure they have good reasons. So essentially, to benefit from the speedup, users would have to checkout your branch? Let's document this, otherwise it's too hard to figure out.
There is also a merge conflict now, could you please merge with/rebase on the latest main and resolve it? When all is finished, don't forget to run make style.
src/peft/tuners/lora/config.py (Outdated)

```python
        default=True,
        metadata={
            "help": "If set to true, requires each target_module to have either a block-diagonal LoRA-A or LoRA-B, "
            "and raises an error otherwise. You can set this to False this to mix LoRA and BD-LoRA training, "
```
```diff
-            "and raises an error otherwise. You can set this to False this to mix LoRA and BD-LoRA training, "
+            "and raises an error otherwise. You can set this to False to mix LoRA and BD-LoRA training, "
```
src/peft/tuners/lora/variants.py (Outdated)

```python
    def init(module: Linear, adapter_name: str, **kwargs) -> None:
        use_bdlora = kwargs.get("use_bdlora")
        target_name = kwargs.get("target_name", "")
        assert use_bdlora is not None
```
IMO, this can be fully removed. If you really want to keep it, please raise a proper error (in this case I'd say RuntimeError("Something went wrong with BD-LoRA initialization, please open an issue at https://github.com/huggingface/peft/issues")).
src/peft/tuners/lora/variants.py (Outdated)

```python
        # Handle case where use_bdlora is a dict (from saved config) instead of BdLoraConfig object
        if isinstance(use_bdlora, dict):
            from peft.tuners.lora.config import BdLoraConfig
```
Can the import be moved to the root? Also, it is a bit strange to go from BdLoraConfig to dict back to BdLoraConfig. Can we not keep it as a BdLoraConfig?
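If the round trip through a plain dict has to stay (e.g. because that is how the sub-config is serialized in the saved adapter config), one option is a module-level import plus a small re-hydration helper; this is a sketch based on the hunk above, assuming BdLoraConfig is a dataclass whose fields match the dict keys:

```python
# Assumed import path; this PR adds BdLoraConfig to src/peft/tuners/lora/config.py.
from peft.tuners.lora.config import BdLoraConfig


def ensure_bdlora_config(use_bdlora) -> BdLoraConfig:
    """Re-hydrate the sub-config if it was round-tripped through a plain dict."""
    if isinstance(use_bdlora, dict):
        return BdLoraConfig(**use_bdlora)
    return use_bdlora
```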
tests/test_custom_models.py (Outdated)

```python
            "modules_to_save": ["lin1"],
            "use_bdlora": BdLoraConfig(target_modules_bd_a=["lin0"], nblocks=2),
        },
    ),
```
Can we get an example with match_strict=False and one vanilla LoRA layer? On the other hand, no need for the modules_to_save example, as I don't think there is any interaction there with BD-LoRA.
This should be covered by the two tests above, right?
```python
(
    "BD-LoRA A only",
    "MLP",
    LoraConfig,
    {
        "target_modules": ["lin0", "lin1"],
        "use_bdlora": BdLoraConfig(target_modules_bd_a=["lin0"], nblocks=2, match_strict=False),
    },
),
(
    "BD-LoRA B only",
    "MLP",
    LoraConfig,
    {
        "target_modules": ["lin0", "lin1"],
        "use_bdlora": BdLoraConfig(target_modules_bd_b=["lin1"], nblocks=2, match_strict=False),
    },
),
```
I removed the modules_to_save test :)
tests/test_custom_models.py (Outdated)

```python
##########
# BD-LoRA #
##########
```
```diff
-##########
-# BD-LoRA #
-##########
+###########
+# BD-LoRA #
+###########
```
;-)
tests/test_initialization.py (Outdated)

```python
        with pytest.raises(ValueError, match="DoRA does not support megatron_core"):
            LoraConfig(target_modules=["linear"], use_dora=True, megatron_config=megatron_config)

    def test_bdlora_lora_a_block_diagonal(self, data):
```
I think this test and the next one can be scrapped, this should be covered pretty much by the test_custom_models.py tests. Also, they are not really about initialization.
tests/test_initialization.py (Outdated)

```python
        with pytest.raises(ValueError, match="matches neither A nor B block-diagonal patterns"):
            get_peft_model(model, config)

    def test_bdlora_non_strict_matching_works(self, data):
```
Can be removed, see my suggestion above to add a non-strict BD-LoRA example to test_custom_models.py.
src/peft/tuners/lora/variants.py (Outdated)

```python
            raise ValueError(
                f"self.in_features={self.in_features} or self.out_features={self.out_features} not divisble by {self.nblocks}"
            )
        if bias:
            raise ValueError("Bias is not implemented for BlockDiagonalLinear layer.")
```
Let's add init tests for these two errors as well.
I have added a test for the divisibility error, but I actually cannot get the bias error to raise. For example, I had a test like this:
```python
def test_bdlora_bias_raises(self):
    model = self.get_model()
    bdlora_config = {
        "target_modules_bd_a": [],
        "target_modules_bd_b": ["linear"],
        "nblocks": 2,
        "match_strict": True,
    }
    config = LoraConfig(target_modules=["linear"], use_bdlora=bdlora_config, bias='lora_only')
    with pytest.raises(ValueError, match="Bias is not implemented for BdLora."):
        get_peft_model(model, config)
```
And a check at the BdLoraLinearVariant like this:
```python
elif has_lora_b_blockdiagonal:
    r = module.lora_B[adapter_name].in_features
    print(module.lora_B[adapter_name].bias)
    if module.lora_B[adapter_name].bias:
        raise ValueError("Bias is not implemented for BdLora.")
    base_layer = module.get_base_layer()
```
I thought that this should raise the ValueError for the bias. At the moment, I have just removed bias from the BlockDiagonalLinear (it is probably not the right place to catch the error anyways).
Hey Benjamin, thanks again for your thorough review. I hope I have caught everything this time. Please let me know especially if the example could use more work. Additionally, I ran into some unexpected behaviour with the bias; I would be glad if you could check it out.
BenjaminBossan left a comment
Thanks for the updates, especially the description of the example.
This should be covered by the two tests above, right?
Yes!
I actually cannot get the bias error to raise.
I think there is a misunderstanding here. LoraConfig(..., bias='lora_only') means that for this LoRA adapter, the bias of the targeted modules should be made trainable. I don't think that this has a bearing on BD-LoRA, does it?
There is a small merge conflict in the test file, could you please take care?
As mentioned earlier, let's keep only the rank 14 experiment.
src/peft/tuners/lora/model.py (Outdated)

```python
                use_rslora=lora_config.use_rslora,
                use_dora=lora_config.use_dora,
                lora_bias=lora_config.lora_bias,
                target_name=current_key,
```
Let's put the new arguments to the end.
Co-authored-by: Benjamin Bossan <[email protected]>
Ah perfect, then this does not interfere with BD-LoRA. I guess it is fine to just initialize BlockDiagonalLinear with bias=False and not perform any checks.
Done!
[Performance and speed comparison plots]
Block-Diagonal LoRA (BD-LoRA) (NeurIPS 2025) is a LoRA variant in which some LoRA factors are constrained to be block-diagonal. This allows faster serving (e.g. 10-40% with Llama 3.1-8B) by eliminating communication overheads when running inference on multiple GPUs. Despite the block-diagonal constraint, BD-LoRA is as performant as vanilla LoRA at similar parameter counts. For performance and speed examples, see the plots above.
BD-LoRA is designed to be used with tensor parallelism, i.e. sharding the weights of a model among multiple GPUs. I am currently also working on a pull request to vLLM which will enable vLLM to use the adapters from this BD-LoRA implementation for GPU speed-ups. I will release the pull request to vLLM in the next 1-2 days and post a link to it.
I have included in this PR:
- src/peft/tuners/bdlora, following other LoRA variants
- method_comparison
- examples/bdlora_finetuning
- tests/test_bdlora.py
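To illustrate the core idea, here is a tiny numerical sketch (plain PyTorch, not the PEFT API) of a LoRA update in which the B factor is constrained to be block-diagonal; the shapes are made up for the example:

```python
import torch

in_features, out_features, r, nblocks = 16, 16, 8, 2

lora_A = torch.randn(r, in_features)                                                   # dense, as in vanilla LoRA
B_blocks = [torch.randn(out_features // nblocks, r // nblocks) for _ in range(nblocks)]
lora_B = torch.block_diag(*B_blocks)                                                   # (out_features, r), block-diagonal

delta_W = lora_B @ lora_A                                                              # same update rule as vanilla LoRA
print(delta_W.shape)                                                                   # torch.Size([16, 16])

# Under tensor parallelism, each GPU can hold one block of lora_B together with its shard of the
# base weight, so the LoRA path adds no extra cross-GPU communication for this factor.
```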