
Support xccl distributed backend #3034

Open

wants to merge 1 commit into main
Conversation

dvrogozh
Contributor

Starting from `torch>=2.7`, the XCCL distributed backend is available for XPU devices (requires torch built with `USE_XCCL=1`).

This commit has been verified on Intel Data Center GPU Max with Bloom:

```
text-generation-launcher --sharded true --num-shard 2 \
  --model-id bigscience/bloom-560m
```

This commit does not impact IPEX, which currently continues to use its custom distributed backend.
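
For reference, a minimal sketch of what each shard ends up doing with the new backend (illustrative only, not the actual TGI code; it assumes a torch>=2.7 build with `USE_XCCL=1` and the usual `RANK`/`WORLD_SIZE`/`MASTER_ADDR`/`MASTER_PORT` environment provided by the launcher):

```
# Illustrative sketch: bring up a process group on the "xccl" backend,
# pin one XPU per shard, and run a sanity all_reduce across shards.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

torch.xpu.set_device(rank)  # one XPU device per shard
dist.init_process_group("xccl", rank=rank, world_size=world_size)

t = torch.ones(1, device="xpu")
dist.all_reduce(t)          # expect world_size after the reduction
assert t.item() == world_size

dist.destroy_process_group()
```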

CC: @Narsil

@Narsil
Collaborator

Narsil commented Feb 19, 2025

What's the benefit over the IPEX backend? If this allows suboptimal deployments compared to the IPEX image, I think we'd rather not merge this at all (and instead error out with instructions on how to get the better image).

Not having flash attention is kind of a no-go nowadays (we still maintain the old paths, but only because they existed at some point; we're not adding any new ones).

@dvrogozh
Contributor Author

To start with, IPEX is an external plugin for pytorch which brings in a few things that can essentially be grouped into 2 categories:

  1. Accelerator support within the scope of the pytorch API: eager mode operators, profiling, distributed backend, etc.
  2. Additional features outside of the pytorch API, notably important 3rd party kernels such as attention kernels.

At the moment we are in the process of bringing Intel GPU support right into stock pytorch through the dedicated device backend called xpu. The relationship with IPEX is the following: IPEX uses the features which become available natively in the pytorch XPU backend. That is, for each new release IPEX gets rebased on top of some version of pytorch, and features which are now available through the XPU backend are dropped from the IPEX codebase. In a way this can be viewed as XPU support getting upstreamed to pytorch (in the scope of the "group 1" features exposed by IPEX, as noted above).
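
As a rough illustration of what native XPU support in stock pytorch looks like in practice (just a sketch, no IPEX import involved):

```
# Illustrative sketch: "group 1" functionality available directly in stock
# pytorch through the xpu device backend.
import torch

if torch.xpu.is_available():
    device = torch.device("xpu", torch.xpu.current_device())
    x = torch.randn(4, 4, device=device)
    y = (x @ x).relu()   # eager-mode operators run on the Intel GPU
    print(torch.xpu.get_device_name(device.index), y.shape)
```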

The XPU distributed backend falls into the 1st group: it is one of the features being upstreamed to pytorch. The plan is that the IPEX "ccl" backend will be dropped going forward and IPEX will rely on the "xccl" backend exposed directly by pytorch. That process will take time. The XCCL distributed backend will first be available in PT 2.7 and will require manual pytorch compilation with USE_XCCL=1; later it will be added to the nightly builds and will then replace IPEX's "ccl" in one of the IPEX releases. We hope to make this change during this year, with the rough estimate of around PT 2.9 for the s/ccl/xccl/ switch.

The change I propose in this PR is made with the above background in mind. It introduces "xccl" distributed support into TGI, which can be tried out if someone builds TGI against stock pytorch (without IPEX). As you correctly notice, such a build has limited value due to the lack of flash attention support. That is basically the reason why I don't propose exposing such a configuration at a higher level in TGI (via docker and documentation covering such an environment). At the same time, such a build is interesting for development, as it helps to identify issues earlier and lays the foundation for the future switch of the IPEX environment, which will ultimately reuse the code path I introduce now for stock pytorch.
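
To make the intent concrete, here is a hedged sketch of the kind of backend selection this enables (not the actual diff; the check for "xccl" is guarded because, as noted above, its availability depends on how torch was built):

```
# Illustrative sketch of distributed backend selection, not the code in this PR.
import torch
import torch.distributed as dist

def pick_distributed_backend() -> str:
    if torch.cuda.is_available():
        return "nccl"
    if torch.xpu.is_available():
        # Helper may be absent on older torch builds, hence the getattr guard.
        is_xccl_available = getattr(dist, "is_xccl_available", None)
        if is_xccl_available is not None and is_xccl_available():
            return "xccl"   # stock pytorch XPU path proposed here
        return "ccl"        # current IPEX/oneCCL bindings path
    return "gloo"           # CPU fallback
```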

Alternatively, we can postpone adding "xccl" distributed support until IPEX is ready to use it. Having "xccl" support now, however, even if it requires stock pytorch, will help me and other developers prepare things in advance.

I hope the above helps to clarify the story and make a decision.
