Add support for DeepSpeed sequence parallelism (Ulysses) #35301

Closed
wants to merge 66 commits

Conversation


ronald-d-rogers commented Dec 16, 2024

What does this PR do?

Adds support for sequence parallelism to Transformers.

❗ The sister PR to accelerate (huggingface/accelerate#3299) must be merged first ❗

The PR mostly pulls in and updates changes made in @zeyugao's very thorough PR here:
huggingface/accelerate#2877

And pulls in @samejacobs's PR here:
#32305

to support the changes @samejacobs made in DeepSpeed that make the sequence parallelism integration with HF cleaner:
deepspeedai/DeepSpeed#5774

and to respond to the comments on his PR to Transformers here: #32305.

Note that my testing has revealed that all of @zeyugao's changes are indeed required for this integration to work.

I've added two decorators:

  • deepspeed_ulysses_attention: instead of modifying the global _flash_attention_forward, we wrap it, as suggested by @ArthurZucker (a rough sketch of the wrapping pattern follows this list).
  • support_deepspeed_ulysses: an extension of the same idea that adds sequence-parallelism support to modules by injecting required variables like sp_group_size.
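
For illustration only, here is a minimal sketch of the shape of these two decorators. The module-level group variables are placeholders and the Ulysses all-to-all exchange itself is elided, so this is not the PR's actual code:

```python
import functools

# Placeholders for however the integration exposes the sequence-parallel group;
# None / 1 means sequence parallelism is disabled.
_SP_GROUP = None
_SP_GROUP_SIZE = 1


def deepspeed_ulysses_attention(attn_forward):
    """Wrap an attention forward rather than patching the global function (sketch)."""
    @functools.wraps(attn_forward)
    def wrapper(*args, **kwargs):
        if _SP_GROUP is None:
            # Sequence parallelism disabled: fall back to the original path.
            return attn_forward(*args, **kwargs)
        # SP enabled: an Ulysses implementation would all-to-all so each rank sees
        # the full sequence for a subset of heads, run attn_forward on that, then
        # all-to-all back to the sequence-sharded layout. The communication itself
        # is elided from this sketch.
        return attn_forward(*args, **kwargs)
    return wrapper


def support_deepspeed_ulysses(module_cls):
    """Class decorator that injects SP-related attributes (e.g. sp_group_size) into a module (sketch)."""
    module_cls.sp_group = _SP_GROUP
    module_cls.sp_group_size = _SP_GROUP_SIZE
    return module_cls
```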

I am not married to any of these changes, and am completely open to suggestions.

The only major addition made in my PR is to find the right place to shard the inputs for sequence parallelism to work.

I added a method to Trainer called _finalize_inputs, which is meant to run right before the inputs are passed to the model. I toyed around with sharding the sequences in various parts of the data-loading pipeline (see the commits if you are curious), but ultimately determined that it had to be done in the trainer, right before the inputs are passed to the model; otherwise, libraries with custom trainers (e.g. TRL) would have to refactor the way they prepare data. A rough sketch of the idea is below.
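
This is not the PR's exact code, just an illustration of sharding along the sequence dimension right before the forward pass; the sp_group attribute is an assumption for the sketch:

```python
import torch
import torch.distributed as dist


def _finalize_inputs(self, inputs):
    """Shard tensor inputs along the sequence dimension just before the forward pass (sketch)."""
    sp_group = getattr(self, "sp_group", None)  # assumed attribute holding the SP process group
    if sp_group is None:
        return inputs
    rank = dist.get_rank(group=sp_group)
    world_size = dist.get_world_size(group=sp_group)
    sharded = {}
    for name, value in inputs.items():
        if torch.is_tensor(value) and value.dim() >= 2:
            # Keep only this rank's chunk of the sequence dimension (dim=1).
            sharded[name] = value.chunk(world_size, dim=1)[rank]
        else:
            sharded[name] = value
    return sharded
```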

As things currently stand, the sister PR to accelerate (huggingface/accelerate#3299) must be merged first, so that the new flags added to HFDeepSpeedConfig for sequence parallelism are available, but if desired we can likely make these PRs orthogonal by caching the same flags in transformers for now.

So far everything appears to work (forward/backward passes tested on 4xA10s and 8xA100s). I will be doing more testing soon (e.g. verifying the loss is the same, tuning a model and evaluating it, etc.).

Note that we should also update loss_utils.py's fixed_cross_entropy method to use vocab_sequence_parallel_cross_entropy when sequence parallelism is enabled, but vocab_sequence_parallel_cross_entropy currently does not allow you to propagate the ignore_index and reduction args. I will create a PR for this in DeepSpeed shortly.

I have also integrated DeepSpeed's Ulysses loss (vocab_sequence_parallel_cross_entropy) into loss_utils.py and my tests are currently showing good loss curves.
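
For context, here is a hedged sketch of the kind of dispatch this implies in loss_utils.py. The non-SP branch mirrors the existing fixed_cross_entropy; the import path and signature of vocab_sequence_parallel_cross_entropy, and passing the group via kwargs, are assumptions for illustration:

```python
import torch.nn.functional as F


def fixed_cross_entropy(source, target, num_items_in_batch=None, ignore_index=-100, **kwargs):
    """Sketch of dispatching to a sequence-parallel loss when Ulysses is active."""
    sp_group = kwargs.get("sp_group")  # assumed way of receiving the SP process group
    if sp_group is not None:
        # Assumed import path/signature for DeepSpeed's Ulysses loss; today it does
        # not expose ignore_index/reduction, hence the proposed DeepSpeed follow-up PR.
        from deepspeed.sequence.cross_entropy import vocab_sequence_parallel_cross_entropy
        per_token = vocab_sequence_parallel_cross_entropy(source, target, sp_group)
        return per_token.sum() / num_items_in_batch if num_items_in_batch else per_token.mean()
    # Default (non-SP) path: standard cross-entropy with token-count normalization.
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = F.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss
```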

I am also beginning to think we should use the DistributedAttention that is in DeepSpeed here instead of decorating the global flash attention function, similar to the way @zeyugao had it, as it seems to have a lot of functionality:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/sequence/layer.py#L300

I will likely test this soon as well.
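
If it helps, a rough sketch of how DistributedAttention could wrap a local attention callable, based on my reading of the linked file (the constructor shape shown here should be double-checked against it):

```python
import torch
import torch.nn.functional as F
from deepspeed.sequence.layer import DistributedAttention


def local_attention(query, key, value, *args, **kwargs):
    # Stand-in for the model's existing per-rank attention implementation.
    return F.scaled_dot_product_attention(query, key, value)


def build_distributed_attention(sp_group):
    # Assumed constructor shape: a local attention callable plus the
    # sequence-parallel process group; DistributedAttention then handles the
    # all-to-all exchanges around the local call.
    return DistributedAttention(local_attention, sp_group)
```

If I'm reading the forward signature right, the returned module would then be called as dist_attn(query, key, value, ...) in place of the local attention.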

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

zeyugao and others added 30 commits June 20, 2024 23:50
Raise exception when sdpa
Make torch a requirement for deepspeed sp
Provide more clarity on the need for sequence parallelism.
@ronald-d-rogers
Author

ronald-d-rogers commented Jan 7, 2025

Hello all, just to provide a progress update, I am currently tuning and evaluating a few models to validate that this is all working correctly.

One thing I have noticed is that the loss with sequence parallelism seems to be slightly different, but in the same ballpark. I am still trying to figure out exactly why, or whether it matters. There seems to be a single token being clipped, and I have yet to determine which token it is and how to preserve it.

Also, though this works with TRL's SFT without issue, the memory usage for DPO seems to be way too high. I am trying to investigate what the cause of this is.

@ronald-d-rogers
Author

ronald-d-rogers commented Jan 14, 2025

Hello all, I was busy with something else and just got back to work on this.

I have tuned and evaluated several Llama 3.1 8B models on several different tasks (classification/summarization) and am currently comparing the results of all of the runs.

So far I am definitely noticing some oddities between DeepSpeed ZeRO-3 and DeepSpeed Ulysses.

The vocab_sequence_parallel_cross_entropy loss as it is currently implemented/integrated doesn't appear to work at all. Hugging Face's default loss aggregation across GPUs seems to work for classification (i.e. evaluation improves over the base model), but I am currently running into issues with summarization. I am tuning a few more runs to see if the issue goes away (with more epochs), but if not I think I will have to start digging into how the loss is implemented.

Currently, from what I can tell, we aren't applying softmax across the whole sequence when calculating the cross-entropy loss, which I think could be causing issues, but I am not sure.
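
One related pitfall worth noting: with the sequence sharded, averaging per-rank mean losses is not the same as the global mean when ranks hold different numbers of non-ignored tokens. A shard-safe reduction (the group handle is an assumed placeholder) would sum the loss and the valid-token count separately before dividing:

```python
import torch
import torch.distributed as dist


def global_token_mean(per_token_loss, labels, ignore_index=-100, sp_group=None):
    """Mean loss over all valid tokens across sequence-parallel ranks (sketch)."""
    mask = labels != ignore_index
    loss_sum = (per_token_loss * mask).sum()
    token_count = mask.sum().to(per_token_loss.dtype)
    if sp_group is not None:
        # Reduce numerator and denominator separately, then divide once.
        dist.all_reduce(loss_sum, group=sp_group)
        dist.all_reduce(token_count, group=sp_group)
    return loss_sum / token_count.clamp(min=1)
```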

Either way, I will post charts and updates within a day or two.

@Falko1

Falko1 commented Jan 31, 2025

Hi,
thanks for the great work! I tried to get your half-finished PR to work but failed. Instead, I got this (https://github.com/princeton-nlp/ProLong) Sequence Parallel implementation to work, which integrates Ulysses-style SP into the HF trainer. Maybe you can get some inspiration there to finish the PR.
I'd be glad to try it once it is merged!

@ronald-d-rogers
Author

@Falko1: Thank you very much for this link. Yes, I am completely stuck trying to get other integrations to work. Maybe I will pivot to this instead.

Collaborator

ArthurZucker left a comment


Hey! BTW I think you need a small rebase, we modified the attention format quite a lot!

@ArthurZucker
Collaborator

Hope it won't be too inconvenient for you! 🤗

@ronald-d-rogers
Author

ronald-d-rogers commented Feb 3, 2025

@ArthurZucker: Yes I have noticed :). I think at this point, since this PR isn't actually working, I will go ahead and close this. I will likely redo all of the work in the updated Transformers, perhaps without any dependency on DeepSpeed (like ProLong).

So far I have been able to reproduce the same per-vocab logits and loss with sequence parallelism turned on as with it turned off (with slight differences), using both Transformers' loss mechanics and DeepSpeed's provided sequence-parallel loss (modified to work with ignore indices), but the model seems to drift after back-propagation... If I run into the same problem I guess I will create an issue and see if I can get help from the community.

@ArthurZucker
Collaborator

We are looking into RingAttention as well, which could be good!

@ronald-d-rogers
Author

@ArthurZucker: Picotron?

@ArthurZucker
Collaborator

Yeah, with the attention refactor we think things should be easier now!
