[Prototype] Block interface: initialization, lr scale, peft #354
✨ Description
- `LinearWeightConfig` for standalone linear-like weights, ex. embeddings, lm output (see the sketch below).
- Linear configs support variable defaults, so that each parent layer may define its own, ex. MLP layers 1 and 2 don't have the same default initialization, and the MoE router doesn't have a bias by default.
- `apply_peft`, with sensible defaults set for each of them.
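For illustration, here is a minimal sketch of the kind of per-weight config described above. The field names and the `with_defaults` helper are assumptions made for the example, not the actual API in this PR.

```python
import dataclasses
import typing


# Hypothetical per-weight config; fields and names are illustrative only.
@dataclasses.dataclass
class LinearWeightConfig:
    # None means "use the default chosen by the parent layer".
    init_method_std: float | None = None
    bias: bool | None = None
    lr_scale: float | None = None
    apply_peft: bool | None = None

    def with_defaults(self, **defaults: typing.Any) -> "LinearWeightConfig":
        """Fill unset fields from the parent layer's defaults."""
        updates = {
            field.name: defaults[field.name]
            for field in dataclasses.fields(self)
            if getattr(self, field.name) is None and field.name in defaults
        }
        return dataclasses.replace(self, **updates)


# Each parent layer supplies its own defaults, ex. the MoE router has no bias.
router_weight = LinearWeightConfig().with_defaults(bias=False, apply_peft=False)
```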
Notes:
- The per-weight lr scale combines with `per_layer_lr_scale`. The effect is multiplicative (`combine_lr_scales`), as sketched below.
- `add_linear_biases` and `init_method_std` as shortcuts to setting all linears separately: it would be really convenient but could be harder to manage.
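A sketch of the multiplicative combination; the name `combine_lr_scales` comes from the description above, but its signature here is an assumption:

```python
# Illustrative sketch: combine optional lr scales multiplicatively.
# None means "no scaling requested at that level".
def combine_lr_scales(*lr_scales: float | None) -> float | None:
    combined: float | None = None
    for scale in lr_scales:
        if scale is not None:
            combined = scale if combined is None else combined * scale
    return combined


# A per-weight scale of 2.0 on top of a per-layer scale of 0.5 gives 1.0.
assert combine_lr_scales(2.0, 0.5, None) == 1.0
```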
TODO:
- Concatenated weights: `key_value`, MLP `gate_and_up`, MoE concatenated expert weights. We've so far had ad-hoc solutions for separating key and value for peft, and for separating the lr scale by expert, but I'd like something more generic.
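One possible shape for a more generic solution, purely as a hypothetical sketch (none of these names exist in the codebase): describe a concatenated weight as named slices, so options like peft or lr scale can be set per slice.

```python
import dataclasses


# Hypothetical per-slice config for a concatenated weight
# (ex. key_value, gate_and_up, stacked MoE expert weights).
@dataclasses.dataclass
class WeightSlice:
    name: str
    size: int
    lr_scale: float | None = None   # ex. per-expert lr scale
    apply_peft: bool | None = None  # ex. peft on value but not key


@dataclasses.dataclass
class ConcatenatedWeightConfig:
    slices: list[WeightSlice]

    def ranges(self) -> list[tuple[str, int, int]]:
        """Return (name, begin, end) so per-slice options can be applied to the weight."""
        out, begin = [], 0
        for s in self.slices:
            out.append((s.name, begin, begin + s.size))
            begin += s.size
        return out


key_value = ConcatenatedWeightConfig(
    slices=[
        WeightSlice("key", size=1024, apply_peft=False),
        WeightSlice("value", size=1024, apply_peft=True),
    ]
)
```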