fix(plugins/ray): honor RECOVERABLE error.pb so Ray tasks retry#7566
Open
1fanwang wants to merge 1 commit into
Open
fix(plugins/ray): honor RECOVERABLE error.pb so Ray tasks retry#75661fanwang wants to merge 1 commit into
1fanwang wants to merge 1 commit into
Conversation
When a RayJob fails, GetTaskPhase reports a terminal failure regardless of the task's error.pb. The k8s plugin manager only reads error.pb (and honors its RECOVERABLE kind) when a plugin reports PhaseSuccess, so a failed RayJob never gets that treatment: a user raising FlyteRecoverableException is reported as a terminal USER error and the task's retries never fire, unlike container/pod tasks. Read the error file in the failed branch and map a RECOVERABLE error to a retryable failure, falling back to terminal when the file is absent or unreadable. Signed-off-by: 1fanwang <1fannnw@gmail.com>
Contributor
There was a problem hiding this comment.
Pull request overview
This PR updates the Ray Kubernetes plugin to honor error.pb recoverability when a RayJob ends in JobDeploymentStatusFailed, allowing Flyte task retries to trigger for RECOVERABLE errors (consistent with container/pod task behavior).
Changes:
- Update Ray plugin
GetTaskPhaseto readerror.pbon Ray job failure and map recoverable errors toPhaseRetryableFailure. - Add a table-driven unit test covering recoverable, non-recoverable, and absent
error.pbcases. - Extend test helpers to wire an in-memory datastore/output-writer so
GetTaskPhasecan readerror.pb.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
flyteplugins/go/tasks/plugins/k8s/ray/ray.go |
Reads error.pb on JobDeploymentStatusFailed and marks failures retryable when recoverable. |
flyteplugins/go/tasks/plugins/k8s/ray/ray_test.go |
Adds test coverage and test plumbing for error.pb presence/contents. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+2077
to
+2081
| store, _ := storage.NewDataStore(&storage.Config{Type: storage.TypeMemory}, promutils.NewTestScope()) | ||
| errPath := storage.DataReference("/error.pb") | ||
| if errorDoc != nil { | ||
| _ = store.WriteProtobuf(context.Background(), errPath, storage.Options{}, errorDoc) | ||
| } |
| func rayPluginContext(pluginState k8s.PluginState) *k8smocks.PluginContext { | ||
| func rayPluginContextWithErrorDoc(pluginState k8s.PluginState, errorDoc *core.ErrorDocument) *k8smocks.PluginContext { | ||
| pluginCtx := newPluginContext(pluginState) | ||
| startTime := time.Date(2024, 0, 0, 0, 0, 0, 0, time.UTC) |
pingsutw
reviewed
Jun 22, 2026
pingsutw
left a comment
Member
There was a problem hiding this comment.
Thanks, It looks reasonable for me
Comment on lines
+834
to
+835
| // Honor a RECOVERABLE error.pb (written by pyflyte-execute when user code raises | ||
| // FlyteRecoverableException) so the task's retries fire. A failed RayJob surfaces here as a |
Member
There was a problem hiding this comment.
nit: since there is no pyflyte-execute in v2. we called it a0 in v2
Suggested change
| // Honor a RECOVERABLE error.pb (written by pyflyte-execute when user code raises | |
| // FlyteRecoverableException) so the task's retries fire. A failed RayJob surfaces here as a | |
| // Honor a RECOVERABLE error.pb (written by sdk) so the task's retries fire. A failed RayJob surfaces here as a |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are the changes needed?
When user code raises
FlyteRecoverableException,pyflyte-executewrites aRECOVERABLEerror toerror.pb, and container/pod tasks retry on it. A Ray task doesn't: a failedRayJobsurfaces as a terminal phase, and the Ray plugin'sGetTaskPhasereports it without ever readingerror.pb. So the same recoverable failure — e.g. a transientObjectReconstructionFailedLineageEvictedError— terminates a Ray task as aUSERerror and never fires itsretries, inconsistent with every other task type. (The k8s plugin manager only consultserror.pbon thePhaseSuccesspath, which a failed RayJob never reaches.)What changes were proposed in this pull request?
In the
JobDeploymentStatusFailedbranch of the Ray plugin'sGetTaskPhase, readerror.pband map aRECOVERABLEerror to a retryable failure. When the file is absent or unreadable the phase stays terminal, preserving today's behavior.How was this patch tested?
TestGetTaskPhase_RecoverableErrorFilecovers all three cases —RECOVERABLE→ retryable,NON_RECOVERABLE→ permanent, absent → permanent — and reverting the fix flips the first back to permanent. Also end-to-end: wrote theRECOVERABLEerror.pbthatpyflyte-executeproduces to a real blob store and droveGetTaskPhasethrough the real file reader (before: permanent, after: retryable), with the triggeringRayJobstatus confirmed on a live KubeRay cluster — a failing job settles atjobDeploymentStatus: Failed, exactly the status this branch reads.Labels
fixed
Check all the applicable boxes