
enable tensor parallelism for MXLinear #2434


Merged
merged 21 commits into main on Jun 24, 2025

Conversation

@vkuzo (Contributor) commented Jun 24, 2025

Summary:

Enables TP for MXLinear. Specifically:

1. change the reshape logic from `x.reshape(-1, block_size)` to
   `x.reshape(*orig_shape[:-1], orig_shape[-1] // block_size, block_size)`
2. modify the rest of the code to adhere to (1)
3. cast the input tensor and `max_abs` to float32 before calculating the MX
   scale, to work around another bug in DTensor + view + int16 target type

(1) is necessary because the old reshape logic flattened dims, which did not
work if one of the flattened dims was sharded (see the sketch below).
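A minimal sketch of the difference, with a made-up activation shape and block size; this is illustrative only, not the MXLinear implementation:

```
import torch

block_size = 32
x = torch.randn(4, 8, 256)  # hypothetical (batch, seq, hidden) activation
orig_shape = x.shape

# Old logic: flattens the leading dims together with the blocked dim. Under TP,
# if one of those leading dims is sharded, the flattened view would have to mix
# elements across shards, which DTensor cannot represent.
x_old = x.reshape(-1, block_size)  # (256, 32)

# New logic: leading dims are kept intact; only the last dim is split into
# (num_blocks, block_size), so sharding on a leading dim is preserved.
x_new = x.reshape(*orig_shape[:-1], orig_shape[-1] // block_size, block_size)  # (4, 8, 8, 32)

# The per-block max-abs used for the MX scale is then a reduction over the last dim only.
max_abs = x_new.abs().amax(dim=-1)  # (4, 8, 8)
```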

Note that TP does not yet work with the custom dim1 triton kernel; a separate PR will fix that by adding a sharding strategy to the kernel.

I verified that performance for FSDP + mxfp8 + compile is not affected by this stack, with torchtitan Llama 3 8B on 8 B200 GPUs:

baseline (without this PR stack)

bf16 FSDP - tps 8.8k, peak_mem 35.0 GiB ([link](https://www.internalfb.com/phabricator/paste/view/P1850041288))
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --parallelism.tensor_parallel_degree=1 --training.compile

bf16 FSDP + tp - tps 8.2k, peak_mem 29.6 GiB ([link](https://www.internalfb.com/phabricator/paste/view/P1850041882))
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --parallelism.tensor_parallel_degree=2 --training.compile

mxfp8 FSDP - tps 10k, peak_mem 35.3 GiB ([link](https://www.internalfb.com/phabricator/paste/view/P1850040695))
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8" --parallelism.tensor_parallel_degree=1 --mx.use_fp8_dim1_cast_triton_kernel --training.compile

mxfp8 FSDP + TP - broken
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8" --parallelism.tensor_parallel_degree=2 --mx.use_fp8_dim1_cast_triton_kernel --training.compile

experiment (with this PR stack)

mxfp8 FSDP - tps 10k, peak_mem 35.3 GiB ([link](https://www.internalfb.com/phabricator/paste/view/P1850044437))
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8" --parallelism.tensor_parallel_degree=1 --mx.use_fp8_dim1_cast_triton_kernel --training.compile

mxfp8 FSDP + TP + dim1 triton kernel disabled - tps 7.9k, peak_mem 29.7 GiB ([link](https://www.internalfb.com/phabricator/paste/view/P1850045992))
with-proxy CONFIG_FILE="torchtitan/models/llama3/train_configs/llama3_8b.toml " ./run_train.sh --model.print_after_conversion --training.steps 50 --model.converters mx --mx.recipe_name "mxfp8" --parallelism.tensor_parallel_degree=2 --mx.no-use_fp8_dim1_cast_triton_kernel --training.compile

Test Plan:

pytest test/prototype/mx_formats
./test/prototype/mx_formats/test_dtensor.sh

Reviewers:

Subscribers:

Tasks:

Tags:

vkuzo added 8 commits June 20, 2025 07:10
vkuzo commented Jun 24, 2025

pytorch-bot bot commented Jun 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2434

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 11 Pending

As of commit 1001602 with merge base 7d6bb6a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the `CLA Signed` label Jun 24, 2025
vkuzo added a commit that referenced this pull request Jun 24, 2025
vkuzo added a commit that referenced this pull request Jun 24, 2025
vkuzo added 4 commits June 24, 2025 07:17
vkuzo added a commit that referenced this pull request Jun 24, 2025
vkuzo added 3 commits June 24, 2025 07:17
vkuzo added a commit that referenced this pull request Jun 24, 2025
vkuzo added 2 commits June 24, 2025 07:19
vkuzo added a commit that referenced this pull request Jun 24, 2025
@vkuzo added the `topic: improvement` label Jun 24, 2025
@@ -190,8 +190,8 @@ def test_linear_eager_emulated_vs_real_gemm(recipe_name, mkn):
 # TODO(future): enable compile support
 @pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
 def test_activation_checkpointing():
-    input_shape = (2, 4)
-    grad_shape = (2, 8)
+    input_shape = (16, 4)
This was broken before; it was caught by enforcing that the inner dim is divisible by the block size.

# torchtitan but not in a unit test, so not enough info to file a good
# issue in pytorch/pytorch. For now, work around. In the future we should
# debug and fix this properly.
data_hp = data_hp.to(torch.float32)
Performance testing showed that, with compile on, keeping this in float32 does not regress performance.
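For context, a hedged sketch of what (3) amounts to; the function name and the exp2/log2 rounding below are simplified stand-ins, not the library's exact e8m0 scale code:

```
import torch

def mx_scale_fp32(data_hp: torch.Tensor, block_size: int = 32) -> torch.Tensor:
    # Work around the DTensor + view + int16 issue by doing the scale math in float32.
    data_hp = data_hp.to(torch.float32)
    blocks = data_hp.reshape(
        *data_hp.shape[:-1], data_hp.shape[-1] // block_size, block_size
    )
    max_abs = blocks.abs().amax(dim=-1).to(torch.float32)
    # Simplified power-of-two (e8m0-style) scale derived from the per-block max.
    return torch.exp2(torch.floor(torch.log2(max_abs)))

scale = mx_scale_fp32(torch.randn(16, 64, dtype=torch.bfloat16))  # -> shape (16, 2)
```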


     tp_out = tp_model(x_fp32_tp_input)
-    tp_out.sum().backward()
+    tp_out.backward(go_fp32_tp)
To make sure the gradient flowing into the last linear is contiguous.
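A hedged illustration of why this matters (tensor names are made up): autograd materializes the gradient of `sum()` as an expanded tensor of ones, which is not contiguous, whereas an explicitly passed grad_output is.

```
import torch

out = torch.randn(8, 16)

# Equivalent to the grad_output autograd feeds backward from out.sum():
grad_from_sum = torch.ones((), dtype=out.dtype).expand(out.shape)
print(grad_from_sum.is_contiguous())  # False: stride-0 expansion

# Equivalent to calling out.backward(go) with an explicit, materialized tensor:
go = torch.randn(out.shape)
print(go.is_contiguous())  # True
```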

vkuzo added 2 commits June 24, 2025 07:31
vkuzo added a commit that referenced this pull request Jun 24, 2025
@vkuzo changed the base branch from gh/vkuzo/90/head to main June 24, 2025 19:18
@vkuzo merged commit 32599be into main Jun 24, 2025
47 of 53 checks passed
Labels
CLA Signed · topic: improvement
3 participants