Commit 09aeb5d

dulinriley authored and facebook-github-bot committed
Enhance exceptions in process allocation and setup (#339)
Summary:
Pull Request resolved: #339

When working on pretraining on CPU machines, I ran into a couple of issues that were made easier by these changes:
* When task group allocation fails, the error message was too terse, and didn't include all the context. Use `{:#}` to select the more-context print of anyhow::Error
* Add one layer of context to narrow down the path better along mast process allocation
* Add an exception that will catch when the `__init__` of an actor fails and doesn't instantiate the instance correctly

Reviewed By: shayne-fletcher

Differential Revision: D77322952

fbshipit-source-id: 9372cc423a158a6f4538356746500918349692b1
1 parent 1d43a98 commit 09aeb5d

2 files changed: +5 -1 lines changed

hyperactor_mesh/src/alloc/remoteprocess.rs

Lines changed: 1 addition & 1 deletion
@@ -794,7 +794,7 @@ impl Alloc for RemoteProcessAlloc {
 if let Err(e) = self.ensure_started().await {
     break Some(ProcState::Failed {
         world_id: self.world_id.clone(),
-        description: format!("failed to ensure started: {}", e),
+        description: format!("failed to ensure started: {:#}", e),
     });
 }
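For readers coming from the Python side of this repo: with `anyhow::Error`, `{}` prints only the outermost context, while `{:#}` prints the whole cause chain on one line, separated by `: `. The sketch below is a rough Python analogue of that difference, not code from this commit. The helper `format_chain`, the function `ensure_started`, and all error messages are hypothetical and chosen only for illustration.

```python
def format_chain(exc: BaseException) -> str:
    """Join an exception and its explicit causes as 'outer: inner: root'."""
    parts = []
    current = exc
    while current is not None:
        parts.append(str(current))
        # Follow only causes set explicitly with `raise ... from ...`.
        current = current.__cause__
    return ": ".join(parts)


def ensure_started() -> None:
    # Hypothetical failure wrapped in two layers of context.
    try:
        try:
            raise OSError("connection refused")
        except OSError as root:
            raise RuntimeError("failed to allocate task group") from root
    except RuntimeError as inner:
        raise RuntimeError("failed to start remote process allocator") from inner


try:
    ensure_started()
except RuntimeError as e:
    # Comparable to `{}`: only the outermost message.
    terse = f"failed to ensure started: {e}"
    # Comparable to `{:#}`: the outermost message plus every cause.
    detailed = f"failed to ensure started: {format_chain(e)}"
```

In this sketch `terse` stops at the outermost message, whereas `detailed` carries the root cause as well, which is the extra context the commit message asks for.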

python/monarch/actor_mesh.py

Lines changed: 4 additions & 0 deletions
@@ -520,6 +520,10 @@ async def handle_cast(
             self.instance = Class(*args, **kwargs)
             return None
 
+        if self.instance is None:
+            raise AssertionError(
+                "__init__ failed earlier and no Actor object is available"
+            )
         the_method = getattr(self.instance, message.method)._method
 
         if inspect.iscoroutinefunction(the_method):
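The effect of the new guard can be reproduced in isolation. The following is a minimal sketch, assuming a simplified dispatcher: `ActorSlot`, `Broken`, and `demo` are hypothetical names, and the real `handle_cast` takes different arguments and unwraps `._method`, which is omitted here. Without the guard, the second call would fail with a less helpful `AttributeError` on `None`.

```python
import asyncio
import inspect


class ActorSlot:
    """Hypothetical stand-in for the object that owns `self.instance`."""

    def __init__(self, cls):
        self.cls = cls
        self.instance = None

    async def handle(self, method, *args, **kwargs):
        if method == "__init__":
            # If the constructor raises, self.instance stays None.
            self.instance = self.cls(*args, **kwargs)
            return None

        # Guard mirroring the one added in this commit.
        if self.instance is None:
            raise AssertionError(
                "__init__ failed earlier and no Actor object is available"
            )
        the_method = getattr(self.instance, method)
        if inspect.iscoroutinefunction(the_method):
            return await the_method(*args, **kwargs)
        return the_method(*args, **kwargs)


class Broken:
    def __init__(self):
        raise ValueError("bad config")

    def ping(self):
        return "pong"


async def demo():
    slot = ActorSlot(Broken)
    try:
        await slot.handle("__init__")
    except ValueError:
        pass  # the constructor failure surfaces once, here
    try:
        await slot.handle("ping")
    except AssertionError as e:
        return str(e)


message = asyncio.run(demo())
```

Here `message` holds the guard's text, so a later call on a half-constructed actor points back at the failed `__init__` rather than at a missing attribute.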
