[UB] Adding support for multinode nvlink #815

Open
wants to merge 26 commits into
base: main

Conversation

shamisp
Contributor

@shamisp shamisp commented Apr 26, 2024

This PR adds support for multi-node NVLink architectures. It also makes the CE deadlock checker configurable at runtime.
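
For readers unfamiliar with the pattern, here is a minimal sketch of what a runtime-configurable, off-by-default checker can look like. It is not the code from this PR: the environment variable name `EXAMPLE_CE_DEADLOCK_CHECK`, both function names, and the step labels are hypothetical, and it assumes "CE" refers to the GPU copy engine. It only illustrates the trade-off named in the commit messages, where the extra increment kernel is launched only when the check is explicitly enabled.

```python
import os

# Hypothetical variable name, chosen for illustration only; the PR may use
# a different name or a different mechanism entirely.
ENV_VAR = "EXAMPLE_CE_DEADLOCK_CHECK"


def ce_deadlock_check_enabled(environ=None):
    """Return True only when the flag is explicitly switched on."""
    env = os.environ if environ is None else environ
    # Defaults to "0", so the check stays off unless the user opts in.
    value = env.get(ENV_VAR, "0").strip().lower()
    return value in ("1", "true", "on", "yes")


def planned_steps(environ=None):
    """List the (hypothetical) steps a transfer would schedule."""
    steps = ["copy"]
    if ce_deadlock_check_enabled(environ):
        # The extra kernel, and its overhead, is only paid for when
        # deadlock checking has been requested.
        steps.append("increment_kernel")
    return steps
```

Reading the flag at call time rather than at build time means the same binary can run with or without the checker, which is the point of moving the option to runtime.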

@shamisp shamisp changed the title [UB] Adding support for multinode nvlink [WIP] [UB] Adding support for multinode nvlink Jun 4, 2024
@shamisp
Contributor Author

shamisp commented Jun 4, 2024

@denera I think this branch is ready for upstream, but I know you have outstanding changes as well.

@denera
Collaborator

denera commented Jun 4, 2024

@shamisp Let's review and merge this branch. I will rebase my PR on top later.

@timmoon10 timmoon10 requested a review from denera June 12, 2024 21:19
Collaborator

@timmoon10 timmoon10 left a comment

I haven't looked closely at the implementation and my changes are mostly stylistic. Also, DCO is complaining about an unsigned commit.

@timmoon10 timmoon10 self-requested a review June 13, 2024 00:26
@timmoon10
Collaborator

/te-ci pytorch

shamisp added 22 commits June 26, 2024 21:28
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Since the CE deadlock detector launches an additional increment kernel, it
may introduce overhead in the communication flow. Adding an
option to disable it.

Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Making CE deadlock detection a runtime option, which is disabled by
default

Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
shamisp added 2 commits June 26, 2024 21:28
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
shamisp added 2 commits June 28, 2024 12:02
Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Signed-off-by: Pavel Shamis (Pasha) <[email protected]>

3 participants