fix(plugins/ray): honor RECOVERABLE error.pb so Ray tasks retry #7566

Open
1fanwang wants to merge 1 commit into flyteorg:main from 1fanwang:fix/ray-honor-recoverable-error

Conversation

@1fanwang 1fanwang commented Jun 20, 2026

Contributor

Why are the changes needed?

When user code raises FlyteRecoverableException, pyflyte-execute writes a RECOVERABLE error to error.pb, and container/pod tasks retry on it. A Ray task doesn't: a failed RayJob surfaces as a terminal phase, and the Ray plugin's GetTaskPhase reports it without ever reading error.pb. So the same recoverable failure — e.g. a transient ObjectReconstructionFailedLineageEvictedError — terminates a Ray task as a USER error and never fires its retries, inconsistent with every other task type. (The k8s plugin manager only consults error.pb on the PhaseSuccess path, which a failed RayJob never reaches.)
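The gap described above can be modeled in a few lines. This is a toy sketch, not the plugin manager's actual code: `Phase`, `ErrorFile`, and `resolvePhase` are hypothetical names introduced only to illustrate that error.pb is consulted on the success path and nowhere else.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the phase a plugin reports.
type Phase string

const (
	PhaseSuccess          Phase = "Success"
	PhasePermanentFailure Phase = "PermanentFailure"
	PhaseRetryableFailure Phase = "RetryableFailure"
)

// ErrorFile is a hypothetical, reduced view of error.pb.
// A nil *ErrorFile means the file is absent.
type ErrorFile struct {
	Recoverable bool
}

// resolvePhase models the behavior described in this PR: error.pb is only
// consulted when the plugin reports success.
func resolvePhase(pluginPhase Phase, errFile *ErrorFile) Phase {
	if pluginPhase != PhaseSuccess {
		// Failure phases pass through untouched, so error.pb is never read.
		return pluginPhase
	}
	if errFile == nil {
		return PhaseSuccess
	}
	if errFile.Recoverable {
		return PhaseRetryableFailure
	}
	return PhasePermanentFailure
}

func main() {
	recoverable := &ErrorFile{Recoverable: true}
	// Plugin reported success: the RECOVERABLE error.pb decides the outcome.
	fmt.Println(resolvePhase(PhaseSuccess, recoverable))
	// Plugin already reported a terminal failure (the failed RayJob case):
	// the same error.pb is ignored.
	fmt.Println(resolvePhase(PhasePermanentFailure, recoverable))
}
```

In this model the only way for a failed RayJob to become retryable is for the plugin itself to read error.pb before reporting its phase, which is what this PR does.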

What changes were proposed in this pull request?

In the JobDeploymentStatusFailed branch of the Ray plugin's GetTaskPhase, read error.pb and map a RECOVERABLE error to a retryable failure. When the file is absent or unreadable the phase stays terminal, preserving today's behavior.
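A minimal, self-contained sketch of that mapping follows. It is not the code in ray.go: `ErrorDoc`, `ErrorReader`, `errNotFound`, and `phaseForFailedJob` are hypothetical stand-ins for the real error document and file reader, used only to show the decision logic and its fallback.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorKind is a hypothetical stand-in for the kind recorded in error.pb.
type ErrorKind int

const (
	NonRecoverable ErrorKind = iota
	Recoverable
)

// Phase is a hypothetical stand-in for the phase the plugin reports.
type Phase string

const (
	PhasePermanentFailure Phase = "PermanentFailure"
	PhaseRetryableFailure Phase = "RetryableFailure"
)

// ErrorDoc is a hypothetical, trimmed-down view of error.pb.
type ErrorDoc struct {
	Kind    ErrorKind
	Message string
}

// ErrorReader abstracts "read error.pb"; it returns an error when the file
// is absent or unreadable.
type ErrorReader func() (*ErrorDoc, error)

var errNotFound = errors.New("error.pb not found")

// phaseForFailedJob decides the phase for a job that has already failed.
// Only a successfully read RECOVERABLE document yields a retryable failure;
// every other outcome keeps the terminal phase.
func phaseForFailedJob(read ErrorReader) Phase {
	doc, err := read()
	if err != nil || doc == nil {
		return PhasePermanentFailure
	}
	if doc.Kind == Recoverable {
		return PhaseRetryableFailure
	}
	return PhasePermanentFailure
}

func main() {
	cases := []struct {
		name string
		read ErrorReader
	}{
		{"recoverable", func() (*ErrorDoc, error) { return &ErrorDoc{Kind: Recoverable}, nil }},
		{"non-recoverable", func() (*ErrorDoc, error) { return &ErrorDoc{Kind: NonRecoverable}, nil }},
		{"absent", func() (*ErrorDoc, error) { return nil, errNotFound }},
	}
	for _, c := range cases {
		fmt.Printf("%-16s -> %s\n", c.name, phaseForFailedJob(c.read))
	}
}
```

Defaulting to the terminal phase on any read problem is the conservative choice: a missing or corrupt error.pb can never cause retries that would not have happened before.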

How was this patch tested?

TestGetTaskPhase_RecoverableErrorFile covers all three cases — RECOVERABLE → retryable, NON_RECOVERABLE → permanent, absent → permanent — and reverting the fix flips the first back to permanent. Also end-to-end: wrote the RECOVERABLE error.pb that pyflyte-execute produces to a real blob store and drove GetTaskPhase through the real file reader (before: permanent, after: retryable), with the triggering RayJob status confirmed on a live KubeRay cluster — a failing job settles at jobDeploymentStatus: Failed, exactly the status this branch reads.
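The three cases have roughly the following table-driven shape. This sketch is not the test in ray_test.go: it swaps the real datastore for a hypothetical map-backed `memStore`, and `retryable`, `cases`, and `runCases` are illustrative names.

```go
package main

import (
	"errors"
	"fmt"
)

// kind is a hypothetical stand-in for the error kind stored in error.pb.
type kind string

const (
	kindRecoverable    kind = "RECOVERABLE"
	kindNonRecoverable kind = "NON_RECOVERABLE"
)

var errNotFound = errors.New("error.pb not found")

// memStore is a hypothetical in-memory blob store keyed by path.
type memStore map[string]kind

// readErrorKind returns errNotFound when nothing was written at path.
func (s memStore) readErrorKind(path string) (kind, error) {
	k, ok := s[path]
	if !ok {
		return "", errNotFound
	}
	return k, nil
}

// retryable reports whether a failed job should be retried, based on the
// error file at path. A read error keeps the failure permanent.
func retryable(s memStore, path string) bool {
	k, err := s.readErrorKind(path)
	return err == nil && k == kindRecoverable
}

type testCase struct {
	name  string
	store memStore
	want  bool
}

// cases lists the three scenarios named in the PR description.
func cases() []testCase {
	return []testCase{
		{"recoverable", memStore{"/error.pb": kindRecoverable}, true},
		{"non-recoverable", memStore{"/error.pb": kindNonRecoverable}, false},
		{"absent", memStore{}, false},
	}
}

// runCases returns the names of the cases whose result differs from want.
func runCases() []string {
	var failed []string
	for _, c := range cases() {
		if retryable(c.store, "/error.pb") != c.want {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases()))
}
```

Making `retryable` return `true` unconditionally, or `false` unconditionally, causes at least one case to fail, which is the property the description relies on when it says reverting the fix flips the first case.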

Labels

fixed

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

When a RayJob fails, GetTaskPhase reports a terminal failure regardless of the
task's error.pb. The k8s plugin manager only reads error.pb (and honors its
RECOVERABLE kind) when a plugin reports PhaseSuccess, so a failed RayJob never
gets that treatment: a user raising FlyteRecoverableException is reported as a
terminal USER error and the task's retries never fire, unlike container/pod
tasks.

Read the error file in the failed branch and map a RECOVERABLE error to a
retryable failure, falling back to terminal when the file is absent or
unreadable.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Copilot AI review requested due to automatic review settings June 20, 2026 22:58

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the Ray Kubernetes plugin to honor error.pb recoverability when a RayJob ends in JobDeploymentStatusFailed, allowing Flyte task retries to trigger for RECOVERABLE errors (consistent with container/pod task behavior).

Changes:

  • Update Ray plugin GetTaskPhase to read error.pb on Ray job failure and map recoverable errors to PhaseRetryableFailure.
  • Add a table-driven unit test covering recoverable, non-recoverable, and absent error.pb cases.
  • Extend test helpers to wire an in-memory datastore/output-writer so GetTaskPhase can read error.pb.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
flyteplugins/go/tasks/plugins/k8s/ray/ray.go | Reads error.pb on JobDeploymentStatusFailed and marks failures retryable when recoverable.
flyteplugins/go/tasks/plugins/k8s/ray/ray_test.go | Adds test coverage and test plumbing for error.pb presence/contents.

Comment on lines +2077 to +2081
store, _ := storage.NewDataStore(&storage.Config{Type: storage.TypeMemory}, promutils.NewTestScope())
errPath := storage.DataReference("/error.pb")
if errorDoc != nil {
	_ = store.WriteProtobuf(context.Background(), errPath, storage.Options{}, errorDoc)
}
func rayPluginContext(pluginState k8s.PluginState) *k8smocks.PluginContext {
func rayPluginContextWithErrorDoc(pluginState k8s.PluginState, errorDoc *core.ErrorDocument) *k8smocks.PluginContext {
pluginCtx := newPluginContext(pluginState)
startTime := time.Date(2024, 0, 0, 0, 0, 0, 0, time.UTC)

@pingsutw pingsutw left a comment

Member

Thanks, it looks reasonable to me.

Comment on lines +834 to +835
// Honor a RECOVERABLE error.pb (written by pyflyte-execute when user code raises
// FlyteRecoverableException) so the task's retries fire. A failed RayJob surfaces here as a

Member

nit: there is no pyflyte-execute in v2; we call it a0 in v2.

Suggested change
- // Honor a RECOVERABLE error.pb (written by pyflyte-execute when user code raises
- // FlyteRecoverableException) so the task's retries fire. A failed RayJob surfaces here as a
+ // Honor a RECOVERABLE error.pb (written by sdk) so the task's retries fire. A failed RayJob surfaces here as a

@pingsutw pingsutw added this to the V2 GA milestone Jun 22, 2026