Description
Hi,
We have recently been testing the context parallelism (CP) strategy in a 2D configuration: FSDP + CP.
As we understand it, CP shards the inputs along the sequence dimension, but the attention kernel still needs to compute attention over the whole sequence, which means each GPU has to gather the sharded K/V from the other CP ranks using some collective communication kernels.
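To make our expectation concrete, here is a minimal sketch of the kind of K/V gather we thought we would see. This is purely illustrative and not taken from any actual implementation; the function name `sharded_attention`, the `cp_group` argument, and the tensor shapes are our own assumptions.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sharded_attention(q_local, k_local, v_local, cp_group):
    # q_local/k_local/v_local: [batch, heads, seq_len / cp_size, head_dim],
    # i.e. each CP rank holds only its own slice of the sequence.
    cp_size = dist.get_world_size(cp_group)

    # Gather the K/V shards from every CP rank -- this is the collective
    # communication we expected to find in the profiler trace.
    k_shards = [torch.empty_like(k_local) for _ in range(cp_size)]
    v_shards = [torch.empty_like(v_local) for _ in range(cp_size)]
    dist.all_gather(k_shards, k_local, group=cp_group)
    dist.all_gather(v_shards, v_local, group=cp_group)
    k_full = torch.cat(k_shards, dim=2)  # concatenate along the sequence dim
    v_full = torch.cat(v_shards, dim=2)

    # Each rank computes attention for its local Q chunk over the full K/V.
    return F.scaled_dot_product_attention(q_local, k_full, v_full)
```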
However, we didn't see any such kernels in the trace; we only found the all-gather for parameters in the pre-forward phase (which we assume comes from FSDP).
Is there anything we misunderstood? Any comments that would help our understanding would be appreciated.
Thanks.