I might be missing something, but when I look at the llama4 model and parallelism, it seems to me that the attention computation is replicated across TP ranks. This seems very strange to me: attention is an expensive operation, and it has a natural parallelization axis, the head dimension. Is there a good reason for doing things this way?

Replies: 1 comment
We indeed shard attention modules along the head dimension; see torchtitan/torchtitan/models/llama4/infra/parallelize.py, lines 245 to 248 (commit 7929410).
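For context, here is a minimal, self-contained sketch of what head-dimension sharding with PyTorch's tensor-parallel API looks like. The module and field names (`Attention`, `wq`, `wk`, `wv`, `wo`) and the 2-way CPU mesh are illustrative assumptions for this sketch, not the exact plan in torchtitan's parallelize.py:

```python
# Sketch of head-dim sharding via PyTorch tensor parallelism (assumed names).
# Launch with e.g.: torchrun --nproc_per_node=2 tp_attention_sketch.py
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class Attention(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, dim, bias=False)
        self.wk = nn.Linear(dim, dim, bias=False)
        self.wv = nn.Linear(dim, dim, bias=False)
        self.wo = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seqlen, _ = x.shape
        # After ColwiseParallel, wq/wk/wv emit only this rank's slice of the
        # head dimension, so -1 resolves to n_heads / tp_degree local heads.
        q = self.wq(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(bsz, seqlen, -1, self.head_dim).transpose(1, 2)
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(bsz, seqlen, -1)
        # RowwiseParallel on wo all-reduces the partial outputs across ranks.
        return self.wo(out)


if __name__ == "__main__":
    torch.manual_seed(0)  # keep inputs identical (replicated) across ranks
    tp_mesh = init_device_mesh("cpu", (2,))  # 2-way TP mesh, gloo backend
    attn = Attention()
    # Shard q/k/v projections over output columns (i.e. over heads) and the
    # output projection over input rows; the attention math then runs on a
    # local subset of heads per rank instead of being replicated.
    parallelize_module(
        attn,
        tp_mesh,
        {
            "wq": ColwiseParallel(),
            "wk": ColwiseParallel(),
            "wv": ColwiseParallel(),
            "wo": RowwiseParallel(),
        },
    )
    x = torch.randn(2, 16, 256)
    print(attn(x).shape)  # torch.Size([2, 16, 256]) on every rank
```

With this kind of plan, the q/k/v projections are column-sharded and the output projection is row-sharded, so each TP rank holds a disjoint subset of heads and the scaled-dot-product attention itself is computed only over those local heads rather than being duplicated on every rank.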