Add support for DeepSpeed sequence parallelism (Ulysses) #35301
Conversation
Commits:
- Raise exception when sdpa
- Make torch a requirement for deepspeed sp
- Provide more clarity on the need for sequence parallelism.

Force-pushed from a98972f to f9f3548.
Hello all, just to provide a progress update: I am currently tuning and evaluating a few models to validate that this is all working correctly. One thing I have noticed is that the loss with sequence parallelism seems to be slightly different, but in the same ballpark. I am still trying to figure out exactly why, or whether it matters. There seems to be a single token which is being clipped, and I have yet to determine what that token is and how to preserve it. Also, though this works with TRL's SFT without issue, the memory usage for DPO seems to be way too high. I am trying to investigate the cause of this.
Hello all, I was busy with something else and just got back to work on this. I have tuned and evaluated several Llama 3.1 8B models on several different tasks (classification/summarization) and am currently comparing the results of all of the runs. So far I am definitely noticing some oddities between DeepSpeed ZeRO-3 and DeepSpeed Ulysses. Currently, from what I can tell, we aren't doing […]. Either way, I will post charts and updates within a day or two.
Hi, […]

@Falko1: Thank you very much for this link. Yes, I am completely stuck trying to get other integrations to work. Maybe I will pivot to this instead.
Hey! BTW I think you need a small rebase, we modified the attention format quite a lot!
Hope it won't be too inconvenient for you! 🤗
@ArthurZucker: Yes, I have noticed :). I think at this point, since this PR isn't actually working, I will go ahead and close it. I will likely redo all of the work in the updated Transformers, perhaps without any dependency on DeepSpeed (like ProLong). So far I have been able to reproduce the same per-vocab logits and loss with sequence parallelism turned on as with it turned off (with slight differences), using both Transformers' loss mechanics and DeepSpeed's provided sequence parallel loss (modified to work with ignore indices), but the model seems to drift after back-propagation... If I run into the same problem I guess I will create an issue and see if I can get help from the community.
We are looking into RingAttention as well, could be good!
Yeah, with the attention refactor we think things should be easier now! |
What does this PR do?
Adds support for sequence parallelism to Transformers.
❗ The sister PR to `accelerate` (huggingface/accelerate#3299) must be merged first ❗

The PR mostly pulls in and updates changes made in @zeyugao's very thorough PR here: huggingface/accelerate#2877

And pulls in @samedjacob's PR here: #32305

This is to support changes made by @samejacobs in DeepSpeed, which make sequence parallel integration with HF cleaner, here: deepspeedai/DeepSpeed#5774

And to respond to comments on his PR to `transformers` here: #32305.

Note that my testing has revealed that all of @zeyugao's changes are indeed required for this integration to work.
I've added two decorators:

- `deepspeed_ulysses_attention`: Instead of modifying the global `_flash_attention_forward`, we wrap it, as suggested by @ArthurZucker (a rough sketch of this wrapping approach follows after this list).
- `support_deepspeed_ulysses`: An extension of the concept above that adds sequence parallelism support to modules by injecting required variables like `sp_group_size`.

I am not married to any of these changes, and am completely open to suggestions.
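For illustration only, here is a minimal sketch of what such a wrapping decorator could look like; it is not the PR's actual code. It assumes a `[batch, seq, heads, head_dim]` layout, an already-initialized `torch.distributed` environment, a dedicated `sp_group` process group, and that both the sequence length and the head count divide evenly by the group size:

```python
# Illustrative sketch only (not the PR's implementation). Ulysses-style
# sequence parallelism exchanges the sequence and head shardings around the
# attention call with an all-to-all.
from functools import wraps

import torch
import torch.distributed as dist


def _all_to_all_4d(x: torch.Tensor, scatter_dim: int, gather_dim: int, group) -> torch.Tensor:
    """Split `x` along `scatter_dim` across the group and concatenate the
    received chunks along `gather_dim` (assumes even divisibility)."""
    world_size = dist.get_world_size(group)
    inputs = [t.contiguous() for t in torch.tensor_split(x, world_size, dim=scatter_dim)]
    outputs = [torch.empty_like(t) for t in inputs]
    dist.all_to_all(outputs, inputs, group=group)
    return torch.cat(outputs, dim=gather_dim)


def deepspeed_ulysses_attention(sp_group):
    """Wrap an attention forward so it sees the full sequence but only a
    slice of the heads, then restore the sequence sharding afterwards."""

    def decorator(attn_forward):
        @wraps(attn_forward)
        def wrapper(query, key, value, *args, **kwargs):
            # Incoming tensors: [batch, seq/P, heads, head_dim]
            # After the exchange: [batch, seq, heads/P, head_dim]
            query = _all_to_all_4d(query, scatter_dim=2, gather_dim=1, group=sp_group)
            key = _all_to_all_4d(key, scatter_dim=2, gather_dim=1, group=sp_group)
            value = _all_to_all_4d(value, scatter_dim=2, gather_dim=1, group=sp_group)

            attn_output = attn_forward(query, key, value, *args, **kwargs)

            # Back to [batch, seq/P, heads, head_dim] for the rest of the model.
            return _all_to_all_4d(attn_output, scatter_dim=1, gather_dim=2, group=sp_group)

        return wrapper

    return decorator
```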
The only major addition made in my PR is to find the right place to shard the inputs for sequence parallelism to work.
I added a method to `Trainer` called `_finalize_inputs` which is meant to run right before the inputs are passed to the model (a rough sketch of this sharding follows below). I toyed around with sharding the sequences in various parts of the data loading pipeline (see commits if you are curious), but ultimately determined that it had to be done in trainers, right before the inputs are passed to the model; otherwise libraries with custom trainers (i.e. `trl`) would have to refactor the way their trainers prepare data.

As things currently stand, the sister PR to `accelerate` (huggingface/accelerate#3299) must be merged first, so that the new flags added to `HFDeepSpeedConfig` for sequence parallelism are available, but if desired we can likely make these PRs orthogonal by caching the same flags in `transformers` for now.

So far everything appears to work (forward/backward pass tested on 4xA10s and 8xA100s). I will be doing more testing soon (i.e. testing the loss is the same, tuning a model & evaluating, etc.).
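As a purely illustrative aside, sharding the already-collated inputs along the sequence dimension, as described for `_finalize_inputs` above, could look roughly like the following. The helper name and the set of sharded keys are assumptions, not the PR's actual code, and the sequence length is assumed to divide evenly by the sequence-parallel world size:

```python
# Illustrative sketch only; the helper name and the sharded keys are
# assumptions, not the PR's implementation.
import torch


def shard_inputs_for_sequence_parallel(inputs: dict, sp_rank: int, sp_group_size: int) -> dict:
    """Keep only this rank's slice of the sequence dimension for the usual
    per-token tensors; everything else is passed through untouched."""
    sequence_keys = {"input_ids", "labels", "attention_mask", "position_ids"}
    sharded = {}
    for name, value in inputs.items():
        if name in sequence_keys and isinstance(value, torch.Tensor) and value.dim() >= 2:
            chunk = value.size(1) // sp_group_size  # assumes even divisibility
            sharded[name] = value[:, sp_rank * chunk : (sp_rank + 1) * chunk]
        else:
            sharded[name] = value
    return sharded
```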
Note that we should also update `loss_utils.py`'s `fixed_cross_entropy` method to use `vocab_sequence_parallel_cross_entropy` when sequence parallelism is enabled, but this method currently does not allow you to propagate the args `ignore_index` and `reduction`. I will create a PR for this in DeepSpeed shortly.

I have also integrated DeepSpeed's Ulysses loss (`vocab_sequence_parallel_cross_entropy`) into `loss_utils.py` and my tests are currently showing good loss curves (a rough sketch of this dispatch follows below).

I am also beginning to think we should use the `DistributedAttention` that is in `DeepSpeed` here instead of decorating the global flash attention function, similar to the way @zeyugao had it, as it seems to have a lot of functionality: https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/sequence/layer.py#L300

I will likely test this soon as well.
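Purely as an illustration of the loss dispatch described above (not the PR's code), the sequence-parallel branch in `fixed_cross_entropy` could look roughly like this. The `sp_group` argument, the DeepSpeed import path, and the exact call signature are assumptions, and, as noted above, the upstream loss does not yet accept `ignore_index`/`reduction`:

```python
# Illustrative sketch only. The sp_group plumbing, the DeepSpeed import path
# and the exact call signature are assumptions, not the PR's implementation.
import torch.nn as nn


def fixed_cross_entropy(source, target, num_items_in_batch=None, ignore_index=-100,
                        sp_group=None, **kwargs):
    reduction = "sum" if num_items_in_batch is not None else "mean"

    if sp_group is not None:
        # Sequence-parallel path: each rank only holds a slice of the sequence,
        # so the loss has to be computed with a collective-aware cross entropy.
        from deepspeed.sequence.cross_entropy import (  # import path assumed
            vocab_sequence_parallel_cross_entropy,
        )
        per_token_loss = vocab_sequence_parallel_cross_entropy(source.float(), target, sp_group)
        loss = per_token_loss.sum() if reduction == "sum" else per_token_loss.mean()
    else:
        loss = nn.functional.cross_entropy(
            source, target, ignore_index=ignore_index, reduction=reduction
        )

    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss
```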
Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.