[UB] Adding support for multinode nvlink #815
base: main
@denera I think this branch is ready for upstream, but I know you have outstanding changes as well.
@shamisp Let's review and merge this branch. I will rebase my PR on top later.
I haven't looked closely at the implementation and my changes are mostly stylistic. Also, DCO is complaining about an unsigned commit.
/te-ci pytorch
Signed-off-by: Pavel Shamis (Pasha) <[email protected]> Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Since the CE deadlock detector launches an additional increment kernel, it may introduce overhead into the communication flow. Adding an option to disable it. Signed-off-by: Pavel Shamis (Pasha) <[email protected]> Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
Making CE deadlock detection a runtime option, which is disabled by default. Signed-off-by: Pavel Shamis (Pasha) <[email protected]> Signed-off-by: Pasha (Pavel) Shamis <[email protected]>
This adds support for the multi-node NVLink architecture. In addition, it makes the CE deadlock checker configurable at runtime.
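For readers unfamiliar with this kind of toggle, here is a minimal sketch of how a disabled-by-default runtime option is commonly wired up through an environment variable. It is not taken from this PR: the variable name `EXAMPLE_CE_DEADLOCK_CHECK`, the `Communicator` struct, and both helper functions are hypothetical, and the actual mechanism used in the branch may differ.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical variable name, chosen for illustration only.
constexpr const char* kCeDeadlockCheckEnv = "EXAMPLE_CE_DEADLOCK_CHECK";

// Interprets a raw environment value. Unset, empty, or "0" means disabled,
// so the check stays off unless the user explicitly asks for it.
bool parse_ce_deadlock_check(const char* value) {
  if (value == nullptr || value[0] == '\0') return false;
  return std::strcmp(value, "0") != 0;
}

// Hypothetical stand-in for a communicator object that carries the flag.
struct Communicator {
  bool ce_deadlock_check = false;
};

// Reads the toggle once at initialization, so the hot communication path
// only tests a cached boolean instead of calling getenv repeatedly.
void init_communicator(Communicator& comm) {
  comm.ce_deadlock_check = parse_ce_deadlock_check(std::getenv(kCeDeadlockCheckEnv));
}
```

Under these assumptions, the extra increment kernel would be launched only when `comm.ce_deadlock_check` is true, so the default configuration pays no additional cost.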