sync layer norms #272

Merged: stas00 merged 14 commits into main from layer-norm-auto-sync on Jul 4, 2022
stas00 (Contributor) commented Mar 25, 2022

Extracting just the fix from #271 for now, since the test suite currently runs against DeepSpeed master, which is missing the new bf16 code.

Hopefully we won't merge this, as I'm concerned about imposing unnecessary code on other users. If we do merge, we will have to parametrize the if 1 branch.

edit: decided to merge this but make it configurable, so that we can run other PR branches without the headache of merging multiple branches all the time, which is too error-prone.

cc: @thomasw21

thomasw21 mentioned this pull request Mar 25, 2022

thomasw21 (Member) commented Mar 25, 2022

> Hopefully we won't merge this, as I'm concerned about imposing unnecessary code on other users. If we do merge, we will have to parametrize the if 1 branch.

We should be able to store a force_tp_synchronization: bool in the class, defaulting to False so that this is not a breaking change. I think some layers should have that option.
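A minimal sketch of what such an opt-in flag could look like. Only the name force_tp_synchronization comes from the comment above; LayerNormParams, maybe_sync_layer_norm and average_across_ranks are hypothetical names, and the averaging is simulated in plain Python over a list of per-rank copies. In the real code the averaging would presumably be an in-place torch.distributed.all_reduce with an AVG op over the tensor-parallel group, which is an assumption here and not taken from the diff.

```python
from typing import List


def average_across_ranks(per_rank: List[List[float]]) -> None:
    """Simulate an in-place all-reduce with an AVG op (hypothetical helper).

    Each inner list is one rank's copy of a parameter. Afterwards every
    rank holds the element-wise mean of all copies.
    """
    world_size = len(per_rank)
    mean = [sum(vals) / world_size for vals in zip(*per_rank)]
    for copy in per_rank:
        copy[:] = mean  # overwrite in place, as a real all-reduce would


class LayerNormParams:
    """Hypothetical stand-in for one rank's layer norm parameters."""

    def __init__(self, size: int, force_tp_synchronization: bool = False):
        self.weight = [1.0] * size
        self.bias = [0.0] * size
        # Defaults to False so existing users see no behavior change.
        self.force_tp_synchronization = force_tp_synchronization


def maybe_sync_layer_norm(replicas: List[LayerNormParams]) -> bool:
    """Average weight and bias across ranks only if every replica opted in."""
    if not all(r.force_tp_synchronization for r in replicas):
        return False
    average_across_ranks([r.weight for r in replicas])
    average_across_ranks([r.bias for r in replicas])
    return True


# Two tensor-parallel ranks whose layer norm weights have drifted apart.
ranks = [LayerNormParams(2, force_tp_synchronization=True) for _ in range(2)]
ranks[0].weight = [1.0, 2.0]
ranks[1].weight = [3.0, 4.0]
maybe_sync_layer_norm(ranks)  # both ranks now hold the same averaged weights
```

With the flag left at its default, maybe_sync_layer_norm returns False and leaves every copy untouched, which is what keeps the change non-breaking for users who do not need the synchronization.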

stas00 requested a review from a team May 4, 2022 16:22
stas00 merged commit e1c479e into main Jul 4, 2022
stas00 deleted the layer-norm-auto-sync branch July 4, 2022 23:51
younesbelkada pushed a commit to younesbelkada/Megatron-DeepSpeed that referenced this pull request Sep 28, 2022
* sync layer norms

* all_reduce is an in_place operation

* Make dataloader use another random generator (bigscience-workshop#276)

* do all_reduce op.AVG directly

* add eval dataloader deadlock workaround

* revert generator sync

* make auto-sync configurable; basic test; cleanup

* test with updated AMI image

* fix unrelated test

Co-authored-by: thomasw21
adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 18, 2023
2 participants