[QEff Finetune]: Enable PP+DDP #394


Draft: quic-mamta wants to merge 5 commits into main from pp_ddp
Conversation

quic-mamta (Contributor) commented on May 8, 2025

Added support for pipeline parallelism (PP) and distributed data parallel (DDP) in finetuning.

Command for PP only: QAIC_VISIBLE_DEVICES=0,1,2,3 python -m QEfficient.cloud.finetune --device qaic --enable_pp --dist_backend qccl (the number of pipeline stages equals the number of visible devices)

Command for DDP only: QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 -m QEfficient.cloud.finetune --device qaic --enable_ddp --dist_backend qccl

Command for PP+DDP (4 qaic devices, i.e. 1 Ultra, with 2 pipeline stages):
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 2 -m QEfficient.cloud.finetune --device qaic --enable_ddp --enable_pp --num_pp_stages 2 --dist_backend qccl
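For illustration, a minimal sketch (not part of this PR) of how the visible devices end up split in the PP+DDP example above, assuming each DDP rank owns num_pp_stages consecutive devices and the device index for a stage is rank * num_pp_stages + stage, as in the device_map code reviewed below:

    # Hypothetical illustration: 4 visible QAIC devices, 2 PP stages per rank,
    # torchrun --nproc-per-node 2 => 2 DDP ranks.
    num_visible_devices = 4
    num_pp_stages = 2
    num_ddp_ranks = num_visible_devices // num_pp_stages

    for rank in range(num_ddp_ranks):
        stage_devices = [rank * num_pp_stages + stage for stage in range(num_pp_stages)]
        print(f"DDP rank {rank}: pipeline stages on devices {stage_devices}")

    # Expected output:
    # DDP rank 0: pipeline stages on devices [0, 1]
    # DDP rank 1: pipeline stages on devices [2, 3]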

Signed-off-by: Mamta Singh <[email protected]>
@quic-mamta quic-mamta marked this pull request as draft May 8, 2025 07:55
@quic-mamta quic-mamta self-assigned this May 8, 2025
@quic-mamta quic-mamta changed the title Enable PP+DDP [QEff Finetune]: Enable PP+DDP May 8, 2025
@quic-mamta quic-mamta requested review from vbaddi and quic-swatia May 8, 2025 07:58
@quic-mamta quic-mamta force-pushed the pp_ddp branch 2 times, most recently from e8b1da7 to df36ae1 on May 8, 2025 08:34
@quic-mamta quic-mamta force-pushed the pp_ddp branch 8 times, most recently from 3ca1229 to 53ff3c4 on May 11, 2025 19:37
quic-meetkuma (Contributor) left a comment


Good work, Mamta! Please address the comments. Let us discuss offline if anything is confusing.

- This device map structure is verified for llama models only.
"""
device_map = {
"model.embed_tokens": rank * num_pp_stages,

Please add some explanation of why these particular layers are mapped to a particular device. (L64 to L67)
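For concreteness, a worked example (illustration only, using assumed values num_pp_stages = 2 and two DDP ranks) that just evaluates the device expressions used for "model.embed_tokens" above and "model.rotary_emb" in the next hunk:

    # Illustration only (not from the PR): where the two quoted entries land for each DDP rank.
    num_pp_stages = 2
    for rank in range(2):
        embed_device = rank * num_pp_stages                          # first device owned by this rank
        rotary_device = rank * num_pp_stages + (num_pp_stages - 1)   # last device owned by this rank
        print(f"rank {rank}: model.embed_tokens -> device {embed_device}, model.rotary_emb -> device {rotary_device}")

    # rank 0: model.embed_tokens -> device 0, model.rotary_emb -> device 1
    # rank 1: model.embed_tokens -> device 2, model.rotary_emb -> device 3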

"model.rotary_emb": rank * num_pp_stages + (num_pp_stages - 1),
}
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
for j in range(num_pp_stages):

Please add some thorough documentation for this double for loop. It is difficult to understand without working through a case; better to add an example and explain with it.

n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
for j in range(num_pp_stages):
    for i in range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)):
        device_map[f"model.layers.{i}"] = rank * num_pp_stages + j

Won't this place 2 more layers than n_layer_per_stage on the first device?
