[QEff Finetune]: Enable PP+DDP #394
base: main
Conversation
Signed-off-by: Mamta Singh <[email protected]>
Force-pushed: e8b1da7 → df36ae1
Force-pushed: 3ca1229 → 53ff3c4
Good work, Mamta! Please address the comments. Let us discuss offline if anything is confusing.
QEfficient/cloud/finetune.py (Outdated)
```python
- This device map structure is verified for llama models only.
"""
device_map = {
    "model.embed_tokens": rank * num_pp_stages,
```
Please add some explanation of why these particular layers are mapped to a particular device.
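For context, here is a minimal sketch (mine, not from the PR) of the placement the comment is asking to document, assuming a llama-style module layout; `model.norm` and `lm_head` are my guesses at the other non-layer modules and are not confirmed by the visible diff:

```python
# Sketch only: device placement for rank = 0, num_pp_stages = 2 (devices 0 and 1).
rank, num_pp_stages = 0, 2
first_device = rank * num_pp_stages                       # 0: first device of this rank's stage group
last_device = rank * num_pp_stages + (num_pp_stages - 1)  # 1: last device of this rank's stage group

device_map = {
    # Input embeddings go on the first stage, where the forward pass begins.
    "model.embed_tokens": first_device,
    # Rotary embeddings sit on the last stage, per the visible diff.
    "model.rotary_emb": last_device,
    # Assumed, not shown in the diff: the final norm and LM head belong with
    # the last decoder layers, so logits are produced where the pipeline ends.
    "model.norm": last_device,
    "lm_head": last_device,
}
```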
QEfficient/cloud/finetune.py, L64 to L67 (Outdated)
"model.rotary_emb": rank * num_pp_stages + (num_pp_stages - 1), | ||
} | ||
n_layer_per_stage = math.ceil(num_layers / num_pp_stages) | ||
for j in range(num_pp_stages): |
Please add strong documentation for this double for loop. It is difficult to understand without working through a case; it would be better to add an example and explain it with that.
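To make the request concrete, a worked case (my sketch, not the PR's code verbatim) showing what the double loop produces:

```python
import math

# Worked example: num_layers = 6, num_pp_stages = 2, rank = 1 on a 4-device box.
num_layers, num_pp_stages, rank = 6, 2, 1
device_map = {}

n_layer_per_stage = math.ceil(num_layers / num_pp_stages)  # ceil(6 / 2) = 3
for j in range(num_pp_stages):  # j indexes the pipeline stage within this rank
    # Stage j owns layers [3*j, min(3*(j+1), 6)); min() clips the final stage.
    for i in range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)):
        device_map[f"model.layers.{i}"] = rank * num_pp_stages + j

print(device_map)
# Layers 0-2 -> device 2, layers 3-5 -> device 3
# (rank 1's stage group occupies devices 2 and 3).
```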
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
QEfficient/cloud/finetune.py (Outdated)
```python
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
for j in range(num_pp_stages):
    for i in range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)):
        device_map[f"model.layers.{i}"] = rank * num_pp_stages + j
```
Won't this place 2 more layers than n_layer_per_stage on the first device?
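A quick arithmetic check (my own sketch, not from the PR) suggests the min() clamp caps every stage at n_layer_per_stage, and it is the last stage, not the first, that receives fewer layers when the division is uneven:

```python
import math

# Stage sizes for num_layers = 30 split across num_pp_stages = 4.
num_layers, num_pp_stages = 30, 4
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)  # ceil(30 / 4) = 8

sizes = [
    len(range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)))
    for j in range(num_pp_stages)
]
print(sizes)  # [8, 8, 8, 6] -- early stages hold exactly 8; the last holds 6
```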
Added support for pipeline parallelism (PP) and distributed data parallel (DDP).
Command for PP only (the number of pipeline stages equals the number of visible devices):
QAIC_VISIBLE_DEVICES=0,1,2,3 python -m QEfficient.cloud.finetune --device qaic --enable_pp --dist_backend qccl
Command for DDP only:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 -m QEfficient.cloud.finetune --device qaic --enable_ddp --dist_backend qccl
Command for PP+DDP, e.g. 4 QAIC devices (1 Ultra) with 2 pipeline stages (see the grouping sketch below):
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 2 -m QEfficient.cloud.finetune --device qaic --enable_ddp --enable_pp --num_pp_stages 2 --dist_backend qccl
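For illustration, a tiny sketch (my assumption based on the device_map formula above, not project code) of how the PP+DDP example groups devices: each torchrun process (DDP rank) pipelines its own model replica across a contiguous block of num_pp_stages devices.

```python
# 4 QAIC devices, torchrun --nproc-per-node 2, --num_pp_stages 2.
num_pp_stages = 2
for rank in range(2):  # the two DDP processes launched by torchrun
    first = rank * num_pp_stages
    devices = list(range(first, first + num_pp_stages))
    print(f"DDP rank {rank} pipelines its replica across devices {devices}")

# DDP rank 0 pipelines its replica across devices [0, 1]
# DDP rank 1 pipelines its replica across devices [2, 3]
```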