[QEff Finetune]: Enable PP+DDP #394
base: main
Conversation
Signed-off-by: Mamta Singh <[email protected]>
Force-pushed: e8b1da7 → df36ae1
Force-pushed: 3ca1229 → 53ff3c4
Good work, Mamta! Please address the comments. Let us discuss offline if anything is confusing.
QEfficient/cloud/finetune.py (Outdated)
```python
- This device map structure is verified for llama models only.
"""
device_map = {
    "model.embed_tokens": rank * num_pp_stages,
```
Please add some explanation of why these particular layers are mapped to a particular device.
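For context, here is a minimal sketch (mine, not from the PR) of the placement the comment is asking to document, assuming a llama-style module layout; `model.norm` and `lm_head` are my guesses at the other non-layer modules and are not confirmed by the visible diff:

```python
# Sketch only: device placement for rank = 0, num_pp_stages = 2 (devices 0 and 1).
rank, num_pp_stages = 0, 2
first_device = rank * num_pp_stages                       # 0: first device of this rank's stage group
last_device = rank * num_pp_stages + (num_pp_stages - 1)  # 1: last device of this rank's stage group

device_map = {
    # Input embeddings go on the first stage, where the forward pass begins.
    "model.embed_tokens": first_device,
    # Rotary embeddings sit on the last stage, per the visible diff.
    "model.rotary_emb": last_device,
    # Assumed, not shown in the diff: the final norm and LM head belong with
    # the last decoder layers, so logits are produced where the pipeline ends.
    "model.norm": last_device,
    "lm_head": last_device,
}
```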
QEfficient/cloud/finetune.py, L64 to L67 (Outdated)
"model.rotary_emb": rank * num_pp_stages + (num_pp_stages - 1), | ||
} | ||
n_layer_per_stage = math.ceil(num_layers / num_pp_stages) | ||
for j in range(num_pp_stages): |
Please add strong documentation for this double for loop. It is difficult to understand without working through a case; it would be better to add an example and explain it with that.
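To make the request concrete, a worked case (my sketch, not the PR's code verbatim) showing what the double loop produces:

```python
import math

# Worked example: num_layers = 6, num_pp_stages = 2, rank = 1 on a 4-device box.
num_layers, num_pp_stages, rank = 6, 2, 1
device_map = {}

n_layer_per_stage = math.ceil(num_layers / num_pp_stages)  # ceil(6 / 2) = 3
for j in range(num_pp_stages):  # j indexes the pipeline stage within this rank
    # Stage j owns layers [3*j, min(3*(j+1), 6)); min() clips the final stage.
    for i in range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)):
        device_map[f"model.layers.{i}"] = rank * num_pp_stages + j

print(device_map)
# Layers 0-2 -> device 2, layers 3-5 -> device 3
# (rank 1's stage group occupies devices 2 and 3).
```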
Signed-off-by: Mamta Singh <[email protected]>
Signed-off-by: Mamta Singh <[email protected]>
QEfficient/cloud/finetune.py (Outdated)
```python
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)
for j in range(num_pp_stages):
    for i in range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)):
        device_map[f"model.layers.{i}"] = rank * num_pp_stages + j
```
Won't this place 2 more layers than n_layer_per_stage on the first device?
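A quick arithmetic check (my own sketch, not from the PR) suggests the min() clamp caps every stage at n_layer_per_stage, and it is the last stage, not the first, that receives fewer layers when the division is uneven:

```python
import math

# Stage sizes for num_layers = 30 split across num_pp_stages = 4.
num_layers, num_pp_stages = 30, 4
n_layer_per_stage = math.ceil(num_layers / num_pp_stages)  # ceil(30 / 4) = 8

sizes = [
    len(range(n_layer_per_stage * j, min(n_layer_per_stage * (j + 1), num_layers)))
    for j in range(num_pp_stages)
]
print(sizes)  # [8, 8, 8, 6] -- early stages hold exactly 8; the last holds 6
```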
Added support for pipeline parallelism (PP) and distributed data parallel (DDP).
Command for PP only (the number of pipeline stages equals the number of visible devices):
QAIC_VISIBLE_DEVICES=0,1,2,3 python -m QEfficient.cloud.finetune --device qaic --enable_pp --dist_backend qccl
Command for DDP only:
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 -m QEfficient.cloud.finetune --device qaic --enable_ddp --dist_backend qccl
Command for PP+DDP, e.g. 4 QAIC devices (1 Ultra) with 2 pipeline stages (see the grouping sketch below):
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc-per-node 2 -m QEfficient.cloud.finetune --device qaic --enable_ddp --enable_pp --num_pp_stages 2 --dist_backend qccl
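For illustration, a tiny sketch (my assumption based on the device_map formula above, not project code) of how the PP+DDP example groups devices: each torchrun process (DDP rank) pipelines its own model replica across a contiguous block of num_pp_stages devices.

```python
# 4 QAIC devices, torchrun --nproc-per-node 2, --num_pp_stages 2.
num_pp_stages = 2
for rank in range(2):  # the two DDP processes launched by torchrun
    first = rank * num_pp_stages
    devices = list(range(first, first + num_pp_stages))
    print(f"DDP rank {rank} pipelines its replica across devices {devices}")

# DDP rank 0 pipelines its replica across devices [0, 1]
# DDP rank 1 pipelines its replica across devices [2, 3]
```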