I might be missing something, but when I look at the llama4 model and parallelism, it seems to me that the attention computation is replicated across TP ranks. This seems very strange to me: attention is an expensive operation, and it has a natural parallelization axis, the head dimension. Is there a good reason for doing things this way?

Replies: 1 comment
We indeed shard attention modules along the head dimension; see torchtitan/torchtitan/models/llama4/infra/parallelize.py, lines 245 to 248 (commit 7929410).
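For context, here is a minimal, self-contained sketch of what head-dimension sharding with PyTorch's tensor-parallel API looks like. The module and field names (`Attention`, `wq`, `wk`, `wv`, `wo`) and the 2-way CPU mesh are illustrative assumptions for this sketch, not the exact plan in torchtitan's parallelize.py:

```python
# Sketch of head-dim sharding via PyTorch tensor parallelism (assumed names).
# Launch with e.g.: torchrun --nproc_per_node=2 tp_attention_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Attention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        # After ColwiseParallel, wq/wk/wv emit only this rank's slice of the
        # head dimension, so -1 resolves to n_heads / tp_degree local heads.
        q = self.wq(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        # RowwiseParallel on wo all-reduces the partial outputs across ranks.
        return self.wo(out)


if __name__ == "__main__":
    torch.manual_seed(0)  # keep inputs identical (replicated) across ranks
    tp_mesh = init_device_mesh("cpu", (2,))  # 2-way TP mesh, gloo backend
    attn = Attention()
    # Shard q/k/v projections over output columns (i.e. over heads) and the
    # output projection over input rows; the attention math then runs on a
    # local subset of heads per rank instead of being replicated.
    parallelize_module(
        attn,
        tp_mesh,
        {
            "wq": ColwiseParallel(),
            "wk": ColwiseParallel(),
            "wv": ColwiseParallel(),
            "wo": RowwiseParallel(),
        },
    )
    x = torch.randn(2, 16, 256)
    print(attn(x).shape)  # torch.Size([2, 16, 256]) on every rank
```

With this kind of plan, the q/k/v projections are column-sharded and the output projection is row-sharded, so each TP rank holds a disjoint subset of heads and the scaled-dot-product attention itself is computed only over those local heads rather than being duplicated on every rank.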