Description
Hi,
We have recently been testing the context parallelism (CP) strategy in a 2D configuration: FSDP + CP.
As we understand it, CP shards the inputs along the sequence dimension, but the attention kernel still needs to compute attention over the whole sequence, which means each GPU has to gather the sharded K/V from the other CP ranks using some collective communication kernels.
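To make our expectation concrete, here is a minimal sketch of the kind of K/V gather we thought we would see. This is purely illustrative and not taken from any actual implementation; the function name `sharded_attention`, the `cp_group` argument, and the tensor shapes are our own assumptions.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sharded_attention(q_local, k_local, v_local, cp_group):
    # q_local/k_local/v_local: [batch, heads, seq_len / cp_size, head_dim],
    # i.e. each CP rank holds only its own slice of the sequence.
    cp_size = dist.get_world_size(cp_group)

    # Gather the K/V shards from every CP rank -- this is the collective
    # communication we expected to find in the profiler trace.
    k_shards = [torch.empty_like(k_local) for _ in range(cp_size)]
    v_shards = [torch.empty_like(v_local) for _ in range(cp_size)]
    dist.all_gather(k_shards, k_local, group=cp_group)
    dist.all_gather(v_shards, v_local, group=cp_group)
    k_full = torch.cat(k_shards, dim=2)  # concatenate along the sequence dim
    v_full = torch.cat(v_shards, dim=2)

    # Each rank computes attention for its local Q chunk over the full K/V.
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
```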
However, we didn't see any such kernels in the trace; we only found the all-gather for parameters in the pre-forward phase (which we assume comes from FSDP).
Is there anything we misunderstood? Any comments that would help our understanding would be appreciated.
Thanks.