Add support for DeepSpeed sequence parallelism (Ulysses) #35301

Closed
wants to merge 66 commits

Conversation


ronald-d-rogers commented Dec 16, 2024

What does this PR do?

Adds support for sequence parallelism to Transformers.

❗ The sister PR to accelerate (huggingface/accelerate#3299) must be merged first ❗

The PR mostly pulls in and updates changes made in @zeyugao's very thorough PR here:
huggingface/accelerate#2877

And pulls in @samejacobs's PR here:
#32305

to support the changes @samejacobs made in DeepSpeed that make the sequence parallelism integration with HF cleaner:
deepspeedai/DeepSpeed#5774

and to respond to the comments on his PR to Transformers here: #32305.

Note that my testing has revealed that all of @zeyugao's changes are indeed required for this integration to work.

I've added two decorators:

  • deepspeed_ulysses_attention: instead of modifying the global _flash_attention_forward, we wrap it, as suggested by @ArthurZucker (a rough sketch of the wrapping pattern follows this list).
  • support_deepspeed_ulysses: an extension of the same idea that adds sequence-parallelism support to modules by injecting required variables like sp_group_size.
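
For illustration only, here is a minimal sketch of the shape of these two decorators. The module-level group variables are placeholders and the Ulysses all-to-all exchange itself is elided, so this is not the PR's actual code:

```python
import functools

# Placeholders for however the integration exposes the sequence-parallel group;
# None / 1 means sequence parallelism is disabled.
_SP_GROUP = None
_SP_GROUP_SIZE = 1


def deepspeed_ulysses_attention(attn_forward):
    """Wrap an attention forward rather than patching the global function (sketch)."""
    @functools.wraps(attn_forward)
    def wrapper(*args, **kwargs):
        if _SP_GROUP is None:
            # Sequence parallelism disabled: fall back to the original path.
            return attn_forward(*args, **kwargs)
        # SP enabled: an Ulysses implementation would all-to-all so each rank sees
        # the full sequence for a subset of heads, run attn_forward on that, then
        # all-to-all back to the sequence-sharded layout. The communication itself
        # is elided from this sketch.
        return attn_forward(*args, **kwargs)
    return wrapper


def support_deepspeed_ulysses(module_cls):
    """Class decorator that injects SP-related attributes (e.g. sp_group_size) into a module (sketch)."""
    module_cls.sp_group = _SP_GROUP
    module_cls.sp_group_size = _SP_GROUP_SIZE
    return module_cls
```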

I am not married to any of these changes, and am completely open to suggestions.

The only major addition made in my PR is to find the right place to shard the inputs for sequence parallelism to work.

I added a method to Trainer called _finalize_inputs, which is meant to run right before the inputs are passed to the model. I toyed around with sharding the sequences in various parts of the data-loading pipeline (see the commits if you are curious), but ultimately determined that it had to be done in the trainer, right before the inputs are passed to the model; otherwise, libraries with custom trainers (e.g. TRL) would have to refactor the way they prepare data. A rough sketch of the idea is below.
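
This is not the PR's exact code, just an illustration of sharding along the sequence dimension right before the forward pass; the sp_group attribute is an assumption for the sketch:

```python
import torch
import torch.distributed as dist


def _finalize_inputs(self, inputs):
    """Shard tensor inputs along the sequence dimension just before the forward pass (sketch)."""
    sp_group = getattr(self, "sp_group", None)  # assumed attribute holding the SP process group
    if sp_group is None:
        return inputs
    rank = dist.get_rank(group=sp_group)
    world_size = dist.get_world_size(group=sp_group)
    sharded = {}
    for name, value in inputs.items():
        if torch.is_tensor(value) and value.dim() >= 2:
            # Keep only this rank's chunk of the sequence dimension (dim=1).
            sharded[name] = value.chunk(world_size, dim=1)[rank]
        else:
            sharded[name] = value
    return sharded
```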

As things currently stand, the sister PR to accelerate (huggingface/accelerate#3299) must be merged first, so that the new flags added to HFDeepSpeedConfig for sequence parallelism are available, but if desired we can likely make these PRs orthogonal by caching the same flags in transformers for now.

So far everything appears to work (forward/backward passes tested on 4xA10s and 8xA100s). I will be doing more testing soon (e.g. verifying the loss is the same, tuning a model and evaluating it, etc.).

Note that we should also update loss_utils.py's fixed_cross_entropy method to use vocab_sequence_parallel_cross_entropy when sequence parallelism is enabled, but vocab_sequence_parallel_cross_entropy currently does not allow you to propagate the ignore_index and reduction args. I will create a PR for this in DeepSpeed shortly.

I have also integrated DeepSpeed's Ulysses loss (vocab_sequence_parallel_cross_entropy) into loss_utils.py and my tests are currently showing good loss curves.
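
For context, here is a hedged sketch of the kind of dispatch this implies in loss_utils.py. The non-SP branch mirrors the existing fixed_cross_entropy; the import path and signature of vocab_sequence_parallel_cross_entropy, and passing the group via kwargs, are assumptions for illustration:

```python
import torch.nn.functional as F


def fixed_cross_entropy(source, target, num_items_in_batch=None, ignore_index=-100, **kwargs):
    """Sketch of dispatching to a sequence-parallel loss when Ulysses is active."""
    sp_group = kwargs.get("sp_group")  # assumed way of receiving the SP process group
    if sp_group is not None:
        # Assumed import path/signature for DeepSpeed's Ulysses loss; today it does
        # not expose ignore_index/reduction, hence the proposed DeepSpeed follow-up PR.
        from deepspeed.sequence.cross_entropy import vocab_sequence_parallel_cross_entropy
        per_token = vocab_sequence_parallel_cross_entropy(source, target, sp_group)
        return per_token.sum() / num_items_in_batch if num_items_in_batch else per_token.mean()
    # Default (non-SP) path: standard cross-entropy with token-count normalization.
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = F.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss
```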

I am also beginning to think we should use the DistributedAttention that is in DeepSpeed here instead of decorating the global flash attention function, similar to the way @zeyugao had it, as it seems to have a lot of functionality:
https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/sequence/layer.py#L300

I will likely test this soon as well.
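
If it helps, a rough sketch of how DistributedAttention could wrap a local attention callable, based on my reading of the linked file (the constructor shape shown here should be double-checked against it):

```python
import torch
import torch.nn.functional as F
from deepspeed.sequence.layer import DistributedAttention


def local_attention(query, key, value, *args, **kwargs):
    # Stand-in for the model's existing per-rank attention implementation.
    return F.scaled_dot_product_attention(query, key, value)


def build_distributed_attention(sp_group):
    # Assumed constructor shape: a local attention callable plus the
    # sequence-parallel process group; DistributedAttention then handles the
    # all-to-all exchanges around the local call.
    return DistributedAttention(local_attention, sp_group)
```

If I'm reading the forward signature right, the returned module would then be called as dist_attn(query, key, value, ...) in place of the local attention.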

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

zeyugao and others added 30 commits June 20, 2024 23:50
Raise exception when sdpa
Make torch a requirement for deepspeed sp
Provide more clarity on the need for sequence parallelism.
@ronald-d-rogers
Author

ronald-d-rogers commented Jan 7, 2025

Hello all, just to provide a progress update, I am currently tuning and evaluating a few models to validate that this is all working correctly.

One thing I have noticed is that the loss with sequence parallelism seems to be slightly different, but in the same ballpark. I am still trying to figure out exactly why, or whether it matters. There seems to be a single token being clipped, and I have yet to determine which token it is and how to preserve it.

Also, though this works with TRL's SFT without issue, the memory usage for DPO seems to be way too high. I am trying to investigate what the cause of this is.

@ronald-d-rogers
Author

ronald-d-rogers commented Jan 14, 2025

Hello all, I was busy with something else and just got back to work on this.

I have tuned and evaluated several Llama 3.1 8B models on several different tasks (classification/summarization) and am currently comparing the results of all of the runs.

So far I am definitely noticing some oddities between DeepSpeed ZeRO-3 and DeepSpeed Ulysses.

The vocab_sequence_parallel_cross_entropy loss as it is currently implemented/integrated doesn't appear to work at all. Hugging Face's default loss aggregation across GPUs seems to work for classification (i.e. evaluation improves over the base model), but I am currently running into issues with summarization. I am tuning a few more runs to see if the issue goes away (with more epochs), but if not I think I will have to start digging into how the loss is implemented.

Currently, from what I can tell, we aren't applying softmax across the whole sequence when calculating the cross-entropy loss, which I think could be causing issues, but I am not sure.
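
One related pitfall worth noting: with the sequence sharded, averaging per-rank mean losses is not the same as the global mean when ranks hold different numbers of non-ignored tokens. A shard-safe reduction (the group handle is an assumed placeholder) would sum the loss and the valid-token count separately before dividing:

```python
import torch
import torch.distributed as dist


def global_token_mean(per_token_loss, labels, ignore_index=-100, sp_group=None):
    """Mean loss over all valid tokens across sequence-parallel ranks (sketch)."""
    mask = labels != ignore_index
    loss_sum = (per_token_loss * mask).sum()
    token_count = mask.sum().to(per_token_loss.dtype)
    if sp_group is not None:
        # Reduce numerator and denominator separately, then divide once.
        dist.all_reduce(loss_sum, group=sp_group)
        dist.all_reduce(token_count, group=sp_group)
    return loss_sum / token_count.clamp(min=1)
```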

Either way, I will post charts and updates within a day or two.

@Falko1

Falko1 commented Jan 31, 2025

Hi,
thanks for the great work! I tried to get your half-finished PR to work but failed. Instead, I got this (https://github.com/princeton-nlp/ProLong) Sequence Parallel implementation to work, which integrates Ulysses-style SP into the HF trainer. Maybe you can get some inspiration there to finish the PR.
I'd be glad to try it once it is merged!

@ronald-d-rogers
Author

@Falko1: Thank you very much for this link. Yes, I am completely stuck trying to get other integrations to work. Maybe I will pivot to this instead.

Collaborator

ArthurZucker left a comment


Hey! BTW I think you need a small rebase, we modified the attention format quite a lot!

@ArthurZucker
Collaborator

Hope it won't be too inconvenient for you! 🤗

@ronald-d-rogers
Author

ronald-d-rogers commented Feb 3, 2025

@ArthurZucker: Yes I have noticed :). I think at this point, since this PR isn't actually working, I will go ahead and close this. I will likely redo all of the work in the updated Transformers, perhaps without any dependency on DeepSpeed (like ProLong).

So far I have been able to reproduce the same per-vocab logits and loss with sequence parallelism turned on as with it turned off (with slight differences), using both Transformers' loss mechanics and DeepSpeed's provided sequence-parallel loss (modified to work with ignore indices), but the model seems to drift after back-propagation... If I run into the same problem I guess I will create an issue and see if I can get help from the community.

@ArthurZucker
Collaborator

We are looking into RingAttention as well, which could be good!

@ronald-d-rogers
Author

@ArthurZucker: Picotron?

@ArthurZucker
Collaborator

Yeah, with the attention refactor we think things should be easier now!
