Support TP + FSDPv2 / HSDP or just FSDPv2 / HSDP #3395
base: main
Conversation
Signed-off-by: Mehant Kammakomati <[email protected]>
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Signed-off-by: Mehant Kammakomati <[email protected]>
cc @S1ro1
@kmehant let me know if this works for your PR as expected, but it shouldn't:

```python
model = ...
optimizer = ...(model.parameters(), ...)
model, optimizer = accelerator.prepare(model, optimizer)
```

This should result in higher memory usage, as the optimizer holds references to the original model parameters.
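To make the concern concrete, here is a minimal, hypothetical sketch; the `Linear` model and the identity check are illustrative assumptions, not code from this PR:

```python
# Hypothetical sketch of the concern above: building the optimizer before
# prepare() leaves its param groups pointing at the original, unsharded
# parameters, so FSDP sharding cannot free them.
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Record the identities of the parameters the optimizer currently references.
params_before = {id(p) for p in model.parameters()}

# ... accelerator.prepare(model, optimizer) would wrap/shard the model here ...
# If wrapping swaps in new (sharded) parameter tensors, these identities change
# and the optimizer keeps the old full tensors alive -> higher memory usage.
params_after = {id(p) for p in model.parameters()}
print("optimizer still points at original params:", params_before == params_after)
```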
What does this PR do?
prepare_nd_device_mesh
Utility function that extends device mesh creation to any combination of parallelisms. It currently supports any combination of TP and FSDP/HSDP; a sketch of the idea follows.
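For context, here is a hedged sketch of what such a utility might do, built on PyTorch's `init_device_mesh`. The `build_mesh` name, its parameters, and the dimension names are illustrative assumptions, not the PR's actual `prepare_nd_device_mesh` signature:

```python
# Minimal sketch of an n-d device mesh combining TP with FSDP or HSDP.
# Must run under torchrun (init_device_mesh needs a distributed environment).
from torch.distributed.device_mesh import init_device_mesh

def build_mesh(world_size: int, tp_size: int = 1, hsdp_replicate: int = 1):
    """Build a device mesh for (optional) TP plus FSDP or HSDP.

    world_size must be divisible by tp_size * hsdp_replicate; the remaining
    factor becomes the FSDP sharding dimension.
    """
    shard_size = world_size // (tp_size * hsdp_replicate)
    dims, names = [], []
    if hsdp_replicate > 1:      # HSDP: replicate across this outer dimension
        dims.append(hsdp_replicate)
        names.append("ddp")
    dims.append(shard_size)     # FSDP: shard across this dimension
    names.append("fsdp")
    if tp_size > 1:             # TP: innermost dimension
        dims.append(tp_size)
        names.append("tp")
    return init_device_mesh("cuda", tuple(dims), mesh_dim_names=tuple(names))

# e.g. 8 GPUs with TP=2 and FSDP over the remaining 4 ranks:
# mesh = build_mesh(world_size=8, tp_size=2)
# fsdp_mesh, tp_mesh = mesh["fsdp"], mesh["tp"]
```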
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@SunMarc @muellerzr
@kwen2501 from PyTorch