Support context parallel training with ring-flash-attention #33467
Comments
Hey hey! I think, similar to the recent integration of Liger kernels, this would make sense for sure to have as a soft dependency, and monkey patch the …
Hi, that's very nice :) Has this feature been integrated into the Hugging Face library now?
Hmm... After a brief look through, it seems hard to make a clean PR to the transformers repo for ring flash attn... There are a few main obstacles.
I'm afraid we have to postpone this feature until some prerequisites are ready.
Just found out that the DeepSpeed team is adding Ulysses. We can wait until they land the feature, as ring attention and DeepSpeed Ulysses will share many prerequisites~
DeepSpeed added Ulysses. Any plans to integrate this (highly needed for reasoning models)? @ArthurZucker @zhuzilin
@casper-hansen That's great! Could you point to the PR that integrates Ulysses into Hugging Face transformers/accelerate?
@zhuzilin It's not integrated into Hugging Face; it's only in DeepSpeed.
PR #35301 was unfortunately closed! Anyone can work on this!
Feature request
Hi, I'm the author of zhuzilin/ring-flash-attention.
I wonder if you are interested in integrating context parallelism via zhuzilin/ring-flash-attention, so that users can train LLMs on long data more efficiently.
Motivation
With the release of OpenAI o1, it will probably become common to train models on very long CoT data. It would be nice if most models in the transformers library could be trained efficiently on long contexts with some form of context parallelism, i.e. with the context length scaling linearly with the number of GPUs.
The three existing context parallel methods are DeepSpeed Ulysses, ring attention, and the one proposed in the Llama 3 tech report. DeepSpeed Ulysses is limited by the number of KV heads (the maximum context length is `num_head_kv * seq_length_per_gpu`), which makes it a little unfriendly to GQA models; for example, a GQA model with 8 KV heads cannot scale its context beyond 8x the per-GPU sequence length. So it would be great if the transformers library could support one or both of the other two context parallel methods.

Both ring attention and the Llama 3 strategy are supported on top of flash attention in zhuzilin/ring-flash-attention, whose correctness has been verified by jzhang38/EasyContext. The library has essentially the same API as flash attention and hides the required communication from its users, which makes it an easy substitution at any existing flash attention API call site, as sketched below.
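For illustration, here is a minimal usage sketch of that substitution. It assumes `ring_flash_attn_func` mirrors the `flash_attn_func` signature with an extra `group` argument, and that each rank already holds its shard of the sequence; treat the exact names and signatures as illustrative rather than authoritative.

```python
# Minimal sketch: every rank holds a contiguous shard of the full sequence and
# calls the ring variant in place of flash_attn_func.
# Launch with e.g. `torchrun --nproc_per_node=8 demo.py`.
import torch
import torch.distributed as dist
from ring_flash_attn import ring_flash_attn_func  # assumed: flash_attn_func signature + `group`

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())  # single-node assumed for simplicity
world_size = dist.get_world_size()

batch, full_seqlen, num_heads, head_dim = 1, 32768, 8, 128
local_seqlen = full_seqlen // world_size  # context length scales with the number of GPUs

# Local shard of q/k/v: (batch, local_seqlen, num_heads, head_dim), same layout as flash-attn.
q = torch.randn(batch, local_seqlen, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Without context parallelism this call site would be:
#   out = flash_attn.flash_attn_func(q, k, v, causal=True)
# With ring attention the call site stays the same; kv blocks are exchanged
# between ranks inside the wrapper.
out = ring_flash_attn_func(q, k, v, causal=True, group=dist.group.WORLD)
print(out.shape)  # (batch, local_seqlen, num_heads, head_dim)
```

(Note that the zigzag and llama3-style variants expect the sequence to be sharded across ranks in a particular order to balance the causal-attention workload, so the data pipeline has to shard inputs accordingly.)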
Therefore, I believe it would be straightforward to support context parallelism with zhuzilin/ring-flash-attention. For example, we could add a separate branch in `modeling_flash_attention_utils._flash_attention_forward`, roughly along the lines of the sketch below.
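This is a rough, hypothetical sketch of what such a branch could look like, not the actual transformers code: the simplified signature, the `_CONTEXT_PARALLEL_GROUP` hook, and the omission of attention-mask / variable-length handling are all assumptions for illustration.

```python
# Hypothetical branch inside transformers' flash attention helper. The real
# _flash_attention_forward signature is richer (attention masks, varlen paths,
# sliding window, ...), all of which is omitted here.
from flash_attn import flash_attn_func
from ring_flash_attn import zigzag_ring_flash_attn_func  # a llama3-style variant also exists

# Assumed hook, not an existing transformers API: training code would set this
# to a torch.distributed process group during distributed setup.
_CONTEXT_PARALLEL_GROUP = None


def _flash_attention_forward(query_states, key_states, value_states,
                             query_length, is_causal=True, dropout=0.0,
                             softmax_scale=None):
    if _CONTEXT_PARALLEL_GROUP is None:
        # Existing path (simplified): plain flash attention over the local sequence.
        return flash_attn_func(query_states, key_states, value_states,
                               dropout_p=dropout, softmax_scale=softmax_scale,
                               causal=is_causal)
    # New branch: every rank only holds its shard of the sequence; the ring
    # wrapper exchanges kv blocks across the process group internally.
    return zigzag_ring_flash_attn_func(query_states, key_states, value_states,
                                       dropout_p=dropout, softmax_scale=softmax_scale,
                                       causal=is_causal,
                                       group=_CONTEXT_PARALLEL_GROUP)
```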
Your contribution
I'd love to help if you are interested :)