-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Distributed][refactor] Add base class for device-specific communicator #11324
base: main
Are you sure you want to change the base?
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
c4f0481
to
eeb5aae
Compare
eeb5aae
to
6e6501a
Compare
1a977d3
to
6de2b98
Compare
This pull request has merge conflicts that must be resolved before it can be |
class CommunicatorABC(ABC): | ||
""" | ||
CommunicatorBase ABC | ||
""" | ||
|
||
@abstractmethod | ||
def all_reduce(self, input_: torch.Tensor) -> torch.Tensor: | ||
raise NotImplementedError | ||
|
||
def gather(self, | ||
input_: torch.Tensor, | ||
dst: int = 0, | ||
dim: int = -1) -> Optional[torch.Tensor]: | ||
|
||
raise NotImplementedError | ||
|
||
def all_gather(self, input_: torch.Tensor, dim: int = -1) -> torch.Tensor: | ||
raise NotImplementedError | ||
|
||
|
||
class CommunicatorBase(CommunicatorABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are these two classes instead of one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The abstract class is just to provide a unified interface and indicate the object type, like following
vllm/vllm/distributed/parallel_state.py
Line 229 in 6de2b98
self.communicator: CommunicatorABC = CommunicatorBase( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkaichao WDYT of we keep both CommunicatorABC
and CommunicatorBase
here? Plz let me know if you have any suggestion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: now we using CommunicatorProtocol
instead of CommunicatorABC
for type indication
b5d2063
to
f03eedb
Compare
242fb40
to
b085f82
Compare
cadfb32
to
cc6d46a
Compare
cc6d46a
to
464594a
Compare
This pull request has merge conflicts that must be resolved before it can be |
464594a
to
1e986a0
Compare
1e986a0
to
79a5eb0
Compare
Signed-off-by: Mengqing Cao <[email protected]>
79a5eb0
to
6851cd0
Compare
CI failed due to network issues. This pr is ready for review now, thanks in advance! @youkaichao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@youkaichao Would you mind taking another look?
Or if you are worried about the code changes being too big and want us to split the PR, for example:
- a separate PR for CommunicatorBase and interface change
- Adapts cuda/rocm, hpu, tpu, xpu separately and split to 3 followup PRs
Please let us know, we'd like to do so.
sorry I'm super busy recently. will review this week. |
part of #11162
This PR provide a base class
CommunicatorBase
for device-specific communicators (HpuCommunicator
,TpuCommunicator
andXpuCommunicator
), avoiding the cumbersome dispatch in each communicator operator ofGroupCoordinator
, e.g.,https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L342-L353
In this pr, the communication-related classes are organized as the following fig. This allows new backends to implement their own communicators and dynamic dispatch them in the platform.
![image](https://private-user-images.githubusercontent.com/52243582/397282587-03513068-a9d2-4f6a-9283-75887d29c326.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzQzMTQsIm5iZiI6MTczOTE3NDAxNCwicGF0aCI6Ii81MjI0MzU4Mi8zOTcyODI1ODctMDM1MTMwNjgtYTlkMi00ZjZhLTkyODMtNzU4ODdkMjljMzI2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDA3NTMzNFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWE3ZGFjZGQ5YTYzNjQ5ZDkyNDQ2ZWEyYWFlZTg2NWZmZTMxYzljM2VmYTNiYTMxMmI0NmNmZjRmNmQ3YTNlOGUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.4liO4OECcQU-2aQHyJATe2IKW0NFnUfhdgcdnHQqLDE)