Block interface: misc config improvements, modular tflops computation #361
Conversation
Quick question: how are we going to handle the fact that different blocks in the modular stack require different preprocessors?
LGTM!
Understood, makes sense. Could we have different kwargs namespaces for different blocks/mixers?
That's a simple option. I'd like to explore alternatives though; maybe we could avoid the kwargs mess altogether.
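For illustration only (the names below are hypothetical, not the existing Fast-LLM interface), per-block kwargs namespaces could look roughly like this: each preprocessor writes into a namespace keyed by the block it serves, and each block merges the shared namespace with its own, so mixers with different preprocessing needs don't collide.

```python
from typing import Any

# Hypothetical layout: one kwargs namespace per block, plus a shared one.
kwargs: dict[str, dict[str, Any]] = {
    "shared": {"sequence_length": 4096},
    "layers.0.mixer": {"rotary_frequencies": ...},  # attention-specific
    "layers.1.mixer": {"conv_state": ...},          # SSM-specific
}

def get_block_kwargs(name: str, kwargs: dict[str, dict[str, Any]]) -> dict[str, Any]:
    # Each block sees the shared namespace plus its own, and nothing else.
    return {**kwargs["shared"], **kwargs.get(name, {})}

print(get_block_kwargs("layers.1.mixer", kwargs))
# {'sequence_length': 4096, 'conv_state': Ellipsis}
```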
✨ Description
Address various obstacles to block-modular models.
- Make the tflops computation modular, relying on `Schedule` to sum contributions for all layers, micro-batches, etc. Report tflops as `NaN` when unknown (ex. SSM) instead of an incorrect value (see the tflops sketch after this list).
- Remove the `block_config`, `block_index` and `name` initialization arguments from `BlockLayer`. These were only needed for relatively minor details, which I managed to work around.
  - Replace `debug_transformer` and `debug_transformer_memory` by the single `model_debug_level` in `Run` (for `block_config` argument removal).
  - … (for `block_index` argument removal; not sure why it was there in the first place, probably stability or legacy).
  - Remove `init_method_std` in block config. Instead set all default initializations directly to `hidden_size ** -0.5`. Also remove the scaling by the number of layers in attn dense and mlp layer 2 (for `block_config` argument removal; see the initialization sketch after this list).
  - Replace `add_linear_biases` in block config by separate ones in the mixer and mlp configs (for `block_config` argument removal).
  - Remove `max_window_layers`; it will be handled through block-modular configs (for `block_index` argument removal).
  - Attention `head_size`, mlp `intermediate_size` and ssm `d_inner`, `dt_rank`, `d_xb` no longer have a hidden_size-dependent default. Values now need to be set explicitly.
  - Use module names (as given by `base_model.named_modules`) as a replacement for the removed `name` argument (see the module-name sketch after this list).
- Rename various config fields to reflect interface changes and improve consistency with our guidelines.
  - Remove `moe`, `num` and `expert` prefixes and suffixes (redundant).
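To make the tflops change concrete, here is a minimal sketch of the NaN-propagation idea, assuming a hypothetical per-layer `get_compute_usage` hook (the real accounting lives in `Schedule` and is more detailed): each layer reports its own contribution, an unknown contribution is `NaN`, and summing makes the aggregate `NaN` too, so an unknown cost is never silently replaced by a wrong number.

```python
import math

class AttentionLayer:
    def get_compute_usage(self, tokens: int) -> float:
        # Known cost: rough flops estimate for this layer (constants simplified).
        return 12 * tokens * 4096 ** 2

class SSMLayer:
    def get_compute_usage(self, tokens: int) -> float:
        # Cost formula not implemented yet: report NaN instead of a wrong value.
        return math.nan

def total_tflops(layers, tokens_per_micro_batch: int, num_micro_batches: int) -> float:
    # Sum contributions over all layers and micro-batches. Any NaN contribution
    # makes the total NaN, which flags the whole estimate as unknown.
    total_flops = sum(
        layer.get_compute_usage(tokens_per_micro_batch) * num_micro_batches
        for layer in layers
    )
    return total_flops / 1e12

print(total_tflops([AttentionLayer(), AttentionLayer()], 4096, 8))  # a number
print(total_tflops([AttentionLayer(), SSMLayer()], 4096, 8))        # nan
```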
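A small illustration of the initialization change (values illustrative, config field names omitted): every weight now defaults to a standard deviation of `hidden_size ** -0.5`, with no extra scaling tied to the number of layers for the attention dense projection or MLP layer 2, so the default no longer depends on how deep the (possibly heterogeneous) block stack is.

```python
hidden_size = 4096
num_layers = 32  # no longer affects any default initialization

# Single default std used for all weights:
default_init_std = hidden_size ** -0.5  # ~0.0156

# The attn dense and mlp layer-2 weights previously had an additional scaling
# by the number of layers; with that removed, they also use default_init_std
# unless overridden explicitly in the config.
```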
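And a hedged sketch of how module names can replace the removed `name` argument: `base_model.named_modules()` already assigns every submodule a dotted path, so anything that needs a per-layer label (e.g. debug output) can look it up instead of receiving it at construction time. The `debug_log` helper below is hypothetical.

```python
import torch.nn as nn

base_model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

# Map each module object to its dotted name in the module tree ("" is the root).
module_names = {module: name for name, module in base_model.named_modules()}

def debug_log(module: nn.Module, message: str) -> None:
    # Hypothetical helper: prefix debug output with the module's tree name,
    # instead of a `name` argument passed to each BlockLayer at init time.
    print(f"{module_names.get(module, '<unknown>')}: {message}")

debug_log(base_model[0], "input norm = 1.0")  # prints "0: input norm = 1.0"
```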
TODO:
- Test the new compute estimation method.
- Review SSM config names.
- (from #360) LM config could use polishing.
- (from #359) Add back fine-grained bias enabling config (qwen2 and dream disabled).
- (from #359) Rework SSM conversion (disabled).
- (from #358) Allow separate configuration for concatenated layers (ex. key_value, ssm in_proj).