
[float8] add float8 rowwise MoE prototype #1245


Open · wants to merge 8 commits into main

Conversation

@danielvegamyhre (Contributor) commented May 30, 2025

Summary

  • Adds a --float8.moe_fqns_prototype="..." option to the float8 training API.
  • The option accepts a comma-separated list of FQNs to which MoE float8 training conversion is applied.
  • quantize_ with the MoETrainingConfig will recursively swap nn.Parameter data tensors to a tensor subclass that overrides grouped_mm with the dynamic quant + scaled grouped mm prototype (see the sketch after this list). Context: see the implementation of GroupedExperts here.
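To make the conversion flow concrete, here is a minimal sketch of how the FQN option could drive the conversion, assuming substring matching of module FQNs. quantize_ and MoETrainingConfig are the APIs named above; the import path for MoETrainingConfig and the names apply_moe_float8_training / module_filter_fn are assumptions for illustration, not the PR's actual implementation.

```python
# Sketch only: assumes substring-based FQN matching and an assumed import
# path for the torchao MoE training prototype config.
import torch.nn as nn
from torchao.quantization import quantize_
from torchao.prototype.moe_training.conversion_utils import MoETrainingConfig  # assumed path


def apply_moe_float8_training(model: nn.Module, moe_fqns: list[str]) -> None:
    """Convert MoE modules whose FQN contains one of the configured fragments."""

    def module_filter_fn(module: nn.Module, fqn: str) -> bool:
        # e.g. moe_fqns = ["experts"] matches "layers.0.moe.experts"
        return any(target in fqn for target in moe_fqns)

    # quantize_ recursively swaps the matching modules' nn.Parameter data
    # tensors to the tensor subclass that overrides grouped_mm with
    # dynamic quant + the scaled grouped mm prototype.
    quantize_(model, MoETrainingConfig(), filter_fn=module_filter_fn)
```

Presumably the torchtitan converter would parse the comma-separated moe_fqns_prototype string into a list before handing it to a helper like this.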

Testing

  • Tested manually with the torchao convert_moe_to_float8_training prototype (PR) and confirmed single-GPU training works as expected.

Limitations

  • Only supports single-GPU training so far.
  • Only applies the grouped_mm override to routed experts (see the condition here). For shared experts, I'll need to update the torchao prototype to support a 3D A tensor (see torchtitan here).

@facebook-github-bot added the CLA Signed label May 30, 2025
@danielvegamyhre danielvegamyhre marked this pull request as draft May 30, 2025 03:46
@danielvegamyhre (Contributor, Author) commented May 30, 2025

cc @tianyu-l @vkuzo: this is not ready to land yet, but I wanted to discuss the API proposed here and make sure we are aligned. Happy to rework it; this is just my initial idea of how it should look.

@tianyu-l (Contributor) commented:

Thanks! The UI makes sense to me.

@@ -465,6 +465,12 @@ class Float8:
Not compatible with torch.compile.
"""

moe_fqns: list[str] | str = field(default_factory=list)
Contributor:
can we add "prototype" to the field name and add a link to the README in the docstring
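For illustration, the requested change might look something like the sketch below; the docstring wording and the README pointer are placeholders rather than the PR's exact text.

```python
# Sketch of the renamed field with a docstring, as requested above.
from dataclasses import dataclass, field


@dataclass
class Float8:
    # ... existing float8 fields ...

    moe_fqns_prototype: list[str] | str = field(default_factory=list)
    """
    Comma-separated list of fully qualified names of MoE modules to apply
    float8 training conversion to. Prototype feature with limitations; see
    the torchao MoE training prototype README for details.
    """
```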

@danielvegamyhre danielvegamyhre marked this pull request as ready for review June 10, 2025 14:52
@@ -465,6 +465,13 @@ class Float8:
Not compatible with torch.compile.
"""

moe_fqns_prototype: list[str] | str = field(default_factory=list)
Contributor:
no need to add "prototype" to config name?

Suggested change:
- moe_fqns_prototype: list[str] | str = field(default_factory=list)
+ moe_fqns: list[str] | str = field(default_factory=list)

@danielvegamyhre (Contributor, Author) replied Jun 10, 2025:

@vkuzo requested "prototype" be in the field name here. Unless I misunderstood the suggestion?

Alternatively we could omit "prototype" from the field name and just make sure the docstring/help text is very clear it is a prototype feature with limitations.

For context, I don't plan to land this until at least FSDP is supported (ideally TP as well).

Contributor:

I'm OK either way then. Also, since this is in an experiments folder, everything here could be considered experimental.

@@ -69,3 +69,4 @@ selective_ac_option = '2' # 'int' = ac every positive int layer or 'op', ac bas
enable_fsdp_float8_all_gather = false
precompute_float8_dynamic_scale_for_fsdp = false
filter_fqns = ["output", "router.gate"]
moe_fqns = []
Contributor:

let's put something in the list

Contributor (Author):

Added "experts" as the default value (this is what I've been testing with).

@tianyu-l (Contributor) left a comment:

thanks, had two more comments

@@ -69,3 +69,4 @@ selective_ac_option = '2' # 'int' = ac every positive int layer or 'op', ac bas
enable_fsdp_float8_all_gather = false
precompute_float8_dynamic_scale_for_fsdp = false
filter_fqns = ["output", "router.gate"]
moe_fqns = ["experts"]
Contributor:

do you want to capture the shared expert? If so, you may need to use "expert" instead of "experts":
https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/model/moe.py#L204
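For context, a quick sketch of why the fragment choice matters, assuming the converter matches FQNs by substring (as the existing filter_fqns option does); the module FQNs below are illustrative, not taken from the torchtitan model.

```python
# Illustrative FQNs for a routed-experts module and a shared-expert module.
routed_fqn = "layers.0.moe.experts.w1"
shared_fqn = "layers.0.moe.shared_expert.w1"


def matches(fqn: str, targets: list[str]) -> bool:
    # Assumed substring-based matching, mirroring filter_fqns behavior.
    return any(t in fqn for t in targets)


print(matches(routed_fqn, ["experts"]))  # True
print(matches(shared_fqn, ["experts"]))  # False: "experts" misses the shared expert
print(matches(shared_fqn, ["expert"]))   # True: "expert" captures both
```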

Contributor:

If this is well-tested, let's put it into the other toml configs as well.

Contributor (Author):

Not yet; this is intentional. The routed experts work with FSDP and TP, but the shared expert only works with FSDP right now. I'm still debugging an issue related to shared expert + TP.
