Added watchdog timeout when stop is issued for stuck actors. #336

Closed
wants to merge 1 commit into from

@vidhyav (Contributor) commented Jun 25, 2025

Summary: Allocators have a timeout to kill unresponsive procs when stop() is issued.

Differential Revision: D77303504

@facebook-github-bot added the CLA Signed and fb-exported labels Jun 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D77303504

vidhyav added a commit to vidhyav/monarch that referenced this pull request Jun 25, 2025
…-labs#336)

Summary:

Allocators have a timeout to kill unresponsive procs when stop() is issued.

Differential Revision: D77303504
@vidhyav vidhyav force-pushed the export-D77303504 branch from d4f7b17 to 96bbded Compare June 25, 2025 15:25
@vidhyav vidhyav force-pushed the export-D77303504 branch from 96bbded to baf8c7a Compare June 25, 2025 15:25
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jun 25, 2025
…-labs#336)

Summary:
Pull Request resolved: pytorch-labs#336

Allocators have a timeout to kill unresponsive procs when stop() is issued.

Differential Revision: D77303504
@vidhyav vidhyav force-pushed the export-D77303504 branch from baf8c7a to c848e18 Compare June 25, 2025 15:29
@vidhyav vidhyav force-pushed the export-D77303504 branch 2 times, most recently from 2dd8505 to 9bc4b2c Compare June 25, 2025 21:44
@vidhyav vidhyav force-pushed the export-D77303504 branch from 9bc4b2c to e9be98d Compare June 25, 2025 22:45
@vidhyav vidhyav force-pushed the export-D77303504 branch from e9be98d to 79511d0 Compare June 26, 2025 14:35
@vidhyav vidhyav force-pushed the export-D77303504 branch from 79511d0 to c99a6f5 Compare June 26, 2025 14:35
@vidhyav vidhyav force-pushed the export-D77303504 branch from c99a6f5 to c41bd9c Compare June 26, 2025 14:38
@vidhyav vidhyav force-pushed the export-D77303504 branch from c41bd9c to e9dd94f Compare June 26, 2025 14:47
@vidhyav vidhyav force-pushed the export-D77303504 branch from e9dd94f to 7809d28 Compare June 26, 2025 23:41
@vidhyav vidhyav force-pushed the export-D77303504 branch from 29f2e92 to b96f22e Compare June 27, 2025 00:18
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jun 27, 2025
…-labs#336)

Summary:
Pull Request resolved: pytorch-labs#336

Allocators have a timeout to kill unresponsive procs when stop() is issued.

Reviewed By: suo

Differential Revision: D77303504
@vidhyav vidhyav force-pushed the export-D77303504 branch from b96f22e to b28f15d Compare June 27, 2025 00:26
@vidhyav vidhyav force-pushed the export-D77303504 branch from b28f15d to 0e93560 Compare June 27, 2025 15:35
@vidhyav vidhyav force-pushed the export-D77303504 branch from 0e93560 to 2ed2110 Compare June 27, 2025 15:36
@vidhyav vidhyav force-pushed the export-D77303504 branch from 2ed2110 to 4a0646e Compare June 27, 2025 15:39
@vidhyav vidhyav force-pushed the export-D77303504 branch from 4a0646e to fe0a267 Compare June 27, 2025 15:45
@vidhyav vidhyav force-pushed the export-D77303504 branch from fe0a267 to 875e191 Compare June 27, 2025 22:04
vidhyav added a commit to vidhyav/monarch that referenced this pull request Jun 27, 2025
…-labs#336)

Summary:
Pull Request resolved: pytorch-labs#336

Allocators have a timeout to kill unresponsive procs when stop() is issued.

Here is why we needed this in the first place:

While running an xlformers test, we found that the process got stuck when an exit was issued.

The root cause was the CUDA unregistry routine getting stuck, which blocked the exit() call and all other calls.

To simulate this, we added an exit handler that loops forever.

Reviewed By: suo, technicianted

Differential Revision: D77303504
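
For readers unfamiliar with the pattern, here is a minimal Python sketch of a stop-with-watchdog flow. It is not the allocator code from this PR. The names `STUCK_CHILD`, `stop_with_watchdog`, and `demo` are hypothetical, and the child's SIGTERM handler that loops forever is only a stand-in for the looping exit handler described in the summary. It assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stuck child: its SIGTERM handler never returns, standing in
# for a proc whose exit path is blocked.
STUCK_CHILD = """
import signal, time

def stuck_handler(signum, frame):
    while True:              # never returns, so the process never exits
        time.sleep(0.1)

signal.signal(signal.SIGTERM, stuck_handler)
print("ready", flush=True)   # tell the parent the handler is installed
while True:
    time.sleep(0.1)
"""


def stop_with_watchdog(proc, timeout_s):
    """Ask proc to stop; force-kill it if it is still alive after timeout_s.

    Returns "graceful" if the process exited on its own, "killed" otherwise.
    """
    proc.send_signal(signal.SIGTERM)      # polite stop request
    try:
        proc.wait(timeout=timeout_s)
        return "graceful"
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL cannot be caught or ignored
        proc.wait()                       # reap the child
        return "killed"


def demo():
    proc = subprocess.Popen(
        [sys.executable, "-c", STUCK_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()                # block until the child prints "ready"
    outcome = stop_with_watchdog(proc, timeout_s=1.0)
    proc.stdout.close()
    return outcome, proc.returncode
```

Calling `demo()` starts the stuck child, sends the stop request, and falls back to a forced kill once the one-second watchdog expires. The timeout value is arbitrary here; a real allocator would presumably make it configurable.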
@vidhyav vidhyav force-pushed the export-D77303504 branch from 875e191 to f82e669 Compare June 27, 2025 22:45
@vidhyav vidhyav force-pushed the export-D77303504 branch from f82e669 to 73af9fe Compare June 27, 2025 22:56
@vidhyav vidhyav force-pushed the export-D77303504 branch from 73af9fe to 32e8f6b Compare June 30, 2025 06:57
@facebook-github-bot
Contributor

This pull request has been merged in 96cc0a7.

Labels
CLA Signed, fb-exported, Merged