Conversation

@jlamypoirier (Collaborator) commented Sep 9, 2025

✨ Description

Address various obstacles to block-modular models.

  • Move compute estimation to individual layers to respect modularity. Use the Schedule to sum contributions over all layers, micro-batches, etc. Report tflops as NaN when unknown (ex. SSM) instead of an incorrect value (see the sketch after this list).
  • Remove block_config, block_index and name initialization arguments from BlockLayer. These were only needed for relatively minor details, which I managed to work around.
  • Replace debug_transformer and debug_transformer_memory with the single model_debug_level in Run (for block_config argument removal).
  • Remove block-index scaling and unscaling in backup attention (for block_index argument removal. Not sure why it was there in the first place, probably stability or legacy).
  • Remove init_method_std in block config. Instead set all default initializations directly to hidden_size ** -0.5. Also remove the scaling by the number of layers in attn dense and mlp layer 2 (for block_config argument removal).
  • Replace add_linear_biases in block config with separate ones in the mixer and mlp configs (for block_config argument removal).
  • Remove max_window_layers, will be handled through block-modular configs (for block_index argument removal).
  • attn head_size, mlp intermediate_size and ssm d_inner, dt_rank, d_xb no longer have a hidden_size-dependent default. Values now need to be set explicitly.
  • Add separate configs for the embedding layer dropout and the final normalization layer, so they don't depend on an ambiguous block config.
  • Use torch-based names for modules (base_model.named_modules) as a replacement for the removed name argument.
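
Roughly, the per-layer accounting works like the sketch below. This is illustrative only; class and method names are placeholders, not the actual Fast-LLM interface.

```python
import math

class AttentionLayer:
    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size

    def get_compute_usage(self, batch_size: int, sequence_length: int) -> float:
        # A layer that knows its own cost reports it (placeholder formula).
        return 12 * batch_size * sequence_length * self.hidden_size ** 2

class SSMLayer:
    def get_compute_usage(self, batch_size: int, sequence_length: int) -> float:
        # Cost model not implemented yet: report NaN instead of a wrong value.
        return math.nan

def total_compute(layers, micro_batches: int, batch_size: int, sequence_length: int) -> float:
    # The schedule sums contributions over all layers and micro-batches;
    # any NaN contribution makes the reported total NaN, flagging "unknown".
    return micro_batches * sum(
        layer.get_compute_usage(batch_size, sequence_length) for layer in layers
    )
```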

Rename various config fields to reflect interface changes and improve consistency with our guidelines (see the mapping sketch after the list).

  • Block config:
    • hidden_dropout -> dropout (no more ambiguity)
  • MLP config:
    • ffn_hidden_size -> intermediate_size (consistency, matches HF name)
    • activation_type -> activation (unnecessary suffix)
    • mlp_recompute_level -> recompute_level (redundant prefix)
  • MoE config:
    • Remove all moe, num and expert prefixes and suffixes (redundant)
  • Attn config:
    • num_attention_heads -> heads (unnecessary prefixes)
    • kv_channels -> head_size (consistency, closer to HF name)
    • attention_softmax_scale_power -> softmax_scale_power (redundant prefix)
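
For reference, the explicitly listed renames amount to a mapping like this. The helper below is a hypothetical migration aid, not something shipped with the PR; the MoE renames are omitted since they are only described in general terms.

```python
# Hypothetical old -> new field name mapping, useful when migrating configs by hand.
RENAMED_FIELDS = {
    # Block config
    "hidden_dropout": "dropout",
    # MLP config
    "ffn_hidden_size": "intermediate_size",
    "activation_type": "activation",
    "mlp_recompute_level": "recompute_level",
    # Attn config
    "num_attention_heads": "heads",
    "kv_channels": "head_size",
    "attention_softmax_scale_power": "softmax_scale_power",
}

def migrate(config: dict) -> dict:
    """Return a copy of a flat config dict with the renamed keys applied."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in config.items()}
```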

TODO:

  • Test the new compute estimation method.
  • Review SSM config names.
  • (from #360) LM config could use polishing.
  • (from #359) Add back fine-grained bias-enabling config (qwen2 and dream disabled).
  • (from #359) Rework SSM conversion (disabled).
  • (from #358) Allow separate configuration for concatenated layers (ex. key_value, ssm in_proj).

@tscholak (Collaborator)

quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?

@tscholak (Collaborator) left a comment

LGTM!

@jlamypoirier marked this pull request as ready for review September 17, 2025 21:42
@jlamypoirier (Collaborator, Author)

> quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?

get_preprocessor is now modular, so the blocks/mixers/etc. themselves are responsible for adding the preprocessors they need. There are still issues with this, ex. if we end up with multiple rotary preprocessors writing different values to the same kwarg, but that's for future work.
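
For illustration, the ownership looks roughly like this (a minimal sketch; class and method names are placeholders, not the actual Fast-LLM interface):

```python
class RotaryPreprocessor:
    def __call__(self, kwargs: dict) -> None:
        kwargs["rotary_frequencies"] = ...  # placeholder for the real computation

class AttentionMixer:
    def get_preprocessors(self) -> list:
        # The mixer declares what it needs; the model no longer hard-codes it.
        return [RotaryPreprocessor()]

class MambaMixer:
    def get_preprocessors(self) -> list:
        return []  # no rotary or attention-mask preprocessing required

def collect_preprocessors(layers) -> list:
    # The model just aggregates whatever each block/mixer asks for.
    preprocessors = []
    for layer in layers:
        preprocessors.extend(layer.get_preprocessors())
    return preprocessors
```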

@tscholak (Collaborator)

> quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?
>
> get_preprocessor is now modular, so the blocks/mixers/etc. themselves are responsible for adding the preprocessors they need. There are still issues with this, ex. if we end up with multiple rotary preprocessors writing different values to the same kwarg, but that's for future work.

Understood, makes sense. Could we have different kwargs namespaces for different blocks/mixers?

Base automatically changed from block_interface_fine_grained to main September 18, 2025 21:13
@jlamypoirier (Collaborator, Author) commented Sep 18, 2025

> Understood, makes sense. Could we have different kwargs namespaces for different blocks/mixers?

That's a simple option. I'd like to explore alternatives though, maybe we could avoid the kwargs mess altogether.

@jlamypoirier merged commit 8011b1f into main Sep 18, 2025
2 checks passed
@jlamypoirier deleted the block_interface_tflops branch September 18, 2025 21:17