Conversation

@jlamypoirier (Collaborator) commented Sep 9, 2025

✨ Description

Address various obstacles to block-modular models.

  • Move compute estimation to individual layers to respect modularity. Use the Schedule to sum contributions over all layers, micro-batches, etc. Report tflops as NaN when unknown (ex. SSM) instead of an incorrect value (see the sketch after this list).
  • Remove block_config, block_index and name initialization arguments from BlockLayer. These were only needed for relatively minor details, which I managed to work around.
  • Replace debug_transformer and debug_transformer_memory with the single model_debug_level in Run (for block_config argument removal).
  • Remove block-index scaling and unscaling in backup attention (for block_index argument removal. Not sure why it was there in the first place, probably stability or legacy).
  • Remove init_method_std in block config. Instead set all default initializations directly to hidden_size ** -0.5. Also remove the scaling by the number of layers in attn dense and mlp layer 2 (for block_config argument removal).
  • Replace add_linear_biases in block config with separate ones in the mixer and mlp configs (for block_config argument removal).
  • Remove max_window_layers, will be handled through block-modular configs (for block_index argument removal).
  • attn head_size, mlp intermediate_size and ssm d_inner, dt_rank, d_xb no longer have a hidden_size-dependent default. Values now need to be set explicitly.
  • Add separate configs for the embedding layer dropout and the final normalization layer, so they don't depend on an ambiguous block config.
  • Use torch-based names for modules (base_model.named_modules) as a replacement for the removed name argument.
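
Roughly, the per-layer accounting works like the sketch below. This is illustrative only; class and method names are placeholders, not the actual Fast-LLM interface.

```python
import math

class AttentionLayer:
    def __init__(self, hidden_size: int):
        self.hidden_size = hidden_size

    def get_compute_usage(self, batch_size: int, sequence_length: int) -> float:
        # A layer that knows its own cost reports it (placeholder formula).
        return 12 * batch_size * sequence_length * self.hidden_size ** 2

class SSMLayer:
    def get_compute_usage(self, batch_size: int, sequence_length: int) -> float:
        # Cost model not implemented yet: report NaN instead of a wrong value.
        return math.nan

def total_compute(layers, micro_batches: int, batch_size: int, sequence_length: int) -> float:
    # The schedule sums contributions over all layers and micro-batches;
    # any NaN contribution makes the reported total NaN, flagging "unknown".
    return micro_batches * sum(
        layer.get_compute_usage(batch_size, sequence_length) for layer in layers
    )
```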

Rename various config fields to reflect interface changes and improve consistency with our guidelines (see the mapping sketch after the list).

  • Block config:
    • hidden_dropout -> dropout (no more ambiguity)
  • MLP config:
    • ffn_hidden_size -> intermediate_size (consistency, matches HF name)
    • activation_type -> activation (unnecessary suffix)
    • mlp_recompute_level -> recompute_level (redundant prefix)
  • MoE config:
    • Remove all moe, num and expert prefixes and suffixes (redundant)
  • Attn config:
    • num_attention_heads -> heads (unnecessary prefixes)
    • kv_channels -> head_size (consistency, closer to HF name)
    • attention_softmax_scale_power -> softmax_scale_power (redundant prefix)
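
For reference, the explicitly listed renames amount to a mapping like this. The helper below is a hypothetical migration aid, not something shipped with the PR; the MoE renames are omitted since they are only described in general terms.

```python
# Hypothetical old -> new field name mapping, useful when migrating configs by hand.
RENAMED_FIELDS = {
    # Block config
    "hidden_dropout": "dropout",
    # MLP config
    "ffn_hidden_size": "intermediate_size",
    "activation_type": "activation",
    "mlp_recompute_level": "recompute_level",
    # Attn config
    "num_attention_heads": "heads",
    "kv_channels": "head_size",
    "attention_softmax_scale_power": "softmax_scale_power",
}

def migrate(config: dict) -> dict:
    """Return a copy of a flat config dict with the renamed keys applied."""
    return {RENAMED_FIELDS.get(key, key): value for key, value in config.items()}
```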

TODO:

  • Test the new compute estimation method.
  • Review SSM config names.
  • (from #360) LM config could use polishing.
  • (from #359) Add back fine-grained bias-enabling config (qwen2 and dream disabled).
  • (from #359) Rework SSM conversion (disabled).
  • (from #358) Allow separate configuration for concatenated layers (ex. key_value, ssm in_proj).

@tscholak (Collaborator)

quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?

@tscholak (Collaborator) left a comment

LGTM!

@jlamypoirier marked this pull request as ready for review September 17, 2025 21:42
@jlamypoirier (Collaborator, Author)

> quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?

get_preprocessor is now modular, so the blocks/mixers/etc. themselves are responsible for adding the preprocessors they need. There are still issues with this, ex. if we end up with multiple rotary preprocessors writing different values to the same kwarg, but that's for future work.
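
For illustration, the ownership looks roughly like this (a minimal sketch; class and method names are placeholders, not the actual Fast-LLM interface):

```python
class RotaryPreprocessor:
    def __call__(self, kwargs: dict) -> None:
        kwargs["rotary_frequencies"] = ...  # placeholder for the real computation

class AttentionMixer:
    def get_preprocessors(self) -> list:
        # The mixer declares what it needs; the model no longer hard-codes it.
        return [RotaryPreprocessor()]

class MambaMixer:
    def get_preprocessors(self) -> list:
        return []  # no rotary or attention-mask preprocessing required

def collect_preprocessors(layers) -> list:
    # The model just aggregates whatever each block/mixer asks for.
    preprocessors = []
    for layer in layers:
        preprocessors.extend(layer.get_preprocessors())
    return preprocessors
```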

@tscholak (Collaborator)

> quick question: how are we going to handle that different blocks in the modular stack require different preprocessors?
>
> get_preprocessor is now modular, so the blocks/mixers/etc. themselves are responsible for adding the preprocessors they need. There are still issues with this, ex. if we end up with multiple rotary preprocessors writing different values to the same kwarg, but that's for future work.

Understood, makes sense. Could we have different kwargs namespaces for different blocks/mixers?

Base automatically changed from block_interface_fine_grained to main September 18, 2025 21:13
@jlamypoirier (Collaborator, Author) commented Sep 18, 2025

> Understood, makes sense. Could we have different kwargs namespaces for different blocks/mixers?

That's a simple option. I'd like to explore alternatives though, maybe we could avoid the kwargs mess altogether.

@jlamypoirier merged commit 8011b1f into main Sep 18, 2025
2 checks passed
@jlamypoirier deleted the block_interface_tflops branch September 18, 2025 21:17