
Using DTensor to handle local num_heads change while TP is applied #3465


Open
wants to merge 4 commits into base: main

Conversation

wwwjn
Contributor

@wwwjn wwwjn commented Jul 16, 2025

Fixes #ISSUE_NUMBER. This PR brings the TP tutorial up to date with recent DTensor changes.

Description

After the DTensor enhancement, we are now able to use DTensor to handle the change of num_heads while TP is applied, instead of manually adjusting the tensor shape.
Corresponding changes in pytorch/examples: pytorch/examples#1373
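For context, a minimal sketch of what the updated tensor parallel plan looks like; the module names follow the tutorial's Llama example and are assumptions here, not an exact copy of the file:

```python
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Keeping the q/k/v projection outputs as DTensors (use_local_output=False)
# lets later view/reshape ops keep using the *global* num_heads; DTensor
# works out the local shard shape on its own.
layer_tp_plan = {
    "attention.wq": ColwiseParallel(use_local_output=False),
    "attention.wk": ColwiseParallel(use_local_output=False),
    "attention.wv": ColwiseParallel(use_local_output=False),
    "attention.wo": RowwiseParallel(),
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(),
    "feed_forward.w3": ColwiseParallel(),
}

# Applied per TransformerBlock, for example:
# parallelize_module(transformer_block, tp_mesh, layer_tp_plan)
```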

Checklist

  • The issue that is being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • Only one issue is addressed in this pull request
  • Labels from the issue that this PR is fixing are added to this pull request
  • No unnecessary issues are included in this pull request.


pytorch-bot bot commented Jul 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3465

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 6dd3297 with merge base 755434d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the cla signed label Jul 16, 2025
@wwwjn wwwjn changed the title Using DTensor to handel local num_heads change while TP is applied Using DTensor to handle local num_heads change while TP is applied Jul 16, 2025
@wwwjn
Contributor Author

wwwjn commented Jul 16, 2025

cc @tianyu-l

@@ -141,7 +141,7 @@ q/k/v projection and row-wise sharding for the ``wo`` linear projection. So we c
This is almost the ``layer_tp_plan`` we need to apply Tensor Parallelism to the ``TransformerBlock``. However, one thing we should be aware is that when sharding the linear layer column-wise, the output of the linear layers would become sharded on the last tensor dimension, and the row-wise sharding linear layer directly accepts an input that shards on the last dimension.
If there are any more tensor operations (such as view operations) between the column-wise linear and the row-wise linear, we would need to adjust the relevant shape related ops to sharded shape.

- For the Llama model, in the attention layer there are couple of view operations that are shape related. In particular, column-wise parallel for ``wq``/ ``wk``/ ``wv`` linear layers, the activation tensor is sharded on the ``num_heads`` dimension, so we would need to adjust the ``num_heads`` to local ``num_heads``.
+ For the Llama model, in the attention layer, there are several view operations related to shape. Specifically, for column-wise parallelism in the ``wq``/``wk``/``wv`` linear layers, the activation tensor is sharded on the ``num_heads`` dimension. To manage the difference between global and local ``num_heads``, we should set ``use_local_output=False`` to ensure the output is a DTensor. Unlike a regular tensor, a DTensor is aware of the parallelism plans and will automatically handle changes in the ``num_heads`` dimension.
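As a concrete illustration of the new wording, here is a hedged sketch of the relevant part of the attention forward; the names `self.wq`, `self.n_heads`, `self.n_kv_heads`, and `self.head_dim` follow the tutorial's Llama attention module and are assumptions, not an exact copy of the file:

```python
def forward(self, x):
    bs, seqlen, _ = x.shape
    # With use_local_output=False in the TP plan, wq/wk/wv return DTensors
    # sharded on the last (head) dimension instead of plain local tensors.
    xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
    # The views below keep using the *global* head counts; DTensor derives
    # the correct local shard shape, so no manual n_heads // tp_size is needed.
    xq = xq.view(bs, seqlen, self.n_heads, self.head_dim)
    xk = xk.view(bs, seqlen, self.n_kv_heads, self.head_dim)
    xv = xv.view(bs, seqlen, self.n_kv_heads, self.head_dim)
    # ... rest of the attention computation is unchanged
```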
Contributor

I think we should be able to use DTensor i.e. set use_local_output=False everywhere.
Maybe it's OK to keep a mixed usage of use_local_output so people are aware of this flexibility, but we should mention it here.
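To illustrate the flexibility the reviewer mentions, a small sketch contrasting the two options (tp_size here stands for the TP mesh size and is an assumption):

```python
from torch.distributed.tensor.parallel import ColwiseParallel

# 1) Default (use_local_output=True): wq returns a plain local tensor, so the
#    reshape must use the locally sharded head count.
plan_local = {"attention.wq": ColwiseParallel()}
# xq = xq.view(bs, seqlen, n_heads // tp_size, head_dim)

# 2) use_local_output=False: wq returns a DTensor, so the reshape can keep the
#    global head count and DTensor handles the sharded dimension.
plan_dtensor = {"attention.wq": ColwiseParallel(use_local_output=False)}
# xq = xq.view(bs, seqlen, n_heads, head_dim)
```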
