Skip to content

Don't silently mark TI FAILED when state-update payload exceeds column type#66890

Open
1fanwang wants to merge 2 commits into
apache:mainfrom
1fanwang:fix/exec-api-ti-routes-let-dataerror-bubble
Open

Don't silently mark TI FAILED when state-update payload exceeds column type#66890
1fanwang wants to merge 2 commits into
apache:mainfrom
1fanwang:fix/exec-api-ti-routes-let-dataerror-bubble

Conversation

@1fanwang
Copy link
Copy Markdown
Contributor

@1fanwang 1fanwang commented May 13, 2026

Same class of bug as #66888 but on the execution-API side. Walking through our LinkedIn DI cluster's (1406, ...) traces, we found two task-instance routes that catch broader-than-DataError exceptions and swallow the DB rejection before any global handler can see it. One returns an opaque 500; the other silently reroutes into a 'mark TI FAILED + return 204' branch — which is worse, because the worker thinks state update succeeded while the server has already marked the task FAILED.

Two execution-API routes — PATCH /task-instances/{id}/run and PATCH /task-instances/{id}/state — wrap their DB-touching code in catches broader than DataError. When a worker submits a payload with an oversized field (note, rendered_map_index, external_executor_id, extra), the DB rejects with DataError, the local catch fires first, and the user-visible result is wrong in two distinct ways:

Two except SQLAlchemyError: catches (in ti_run at line 305 and in ti_update_state at line 465) re-raise as HTTPException(500, "Database error occurred") — opaque 500 with no hint at the column or remediation. One broader except Exception: (in ti_update_state at line 425) reroutes into a "mark this TI FAILED + return 204" branch. That last one is worse than an opaque 500: the worker thinks the state update succeeded, but the server has silently marked the task FAILED — state corruption with no signal the request was invalid.

All three swallow DataError before the global handler from #66888 can see it, so registering the handler alone does not fix these routes.

This PR adds except DataError: raise before each of the three broad catches so the user-payload rejection bubbles to the app-level handler and the caller gets the actionable 413/422 instead. The SQLAlchemyError → 500 and Exception → mark-FAILED fallbacks for unrelated unexpected errors stay intact.

End-to-end against PATCH /task-instances/{id}/state driven through TestClient(cached_app(apps="execution")) with the underlying state-update raising the same DataError MySQL would raise on an oversized note. Three states:

--- BEFORE (upstream/main) — no handler, no DataError catches ---
  HTTP 204
  body: (empty)
  Worker thinks state update succeeded; server silently marked TI FAILED.

--- MIDDLE (only #66888 applied — handler registered, no DataError catches yet) ---
  HTTP 204
  body: (empty)
  Same silent-FAILED behaviour. Registering the handler doesn't help — the
  local except Exception: at line 425 still eats the DataError first.

--- AFTER (this PR + #66888) ---
  HTTP 413
  body: {"detail":{"reason":"Payload exceeded database column limit",
                   "orig_error":"(1406, \"Data too long for column 'note' at row 1\")",
                   "message":"Database rejected the payload. Reduce the field size, or
                              your operator may widen the column type (e.g. MEDIUMTEXT
                              / LONGTEXT on MySQL)."}}
  DataError reaches the app-level handler. TI stays in its original state.

The MIDDLE state is load-bearing — it shows #66888 alone is not enough for this endpoint, which is the whole reason this PR exists.

Three catches in one file, same routes, same fix shape (except DataError: raise before the existing catch), 12 lines of additions total. Read naturally together as "let DataError bubble through the task-instance update routes." Independent of #66888 in merge ordering — each PR is useful on its own, but the user-visible 413 only lands when both are applied.

Closes (partially) #66889.

…d fields

Two execution-API routes (ti_run, ti_update_state) catch the broad
SQLAlchemyError parent class and re-raise as a generic 500. When the
DB rejects a payload (note, rendered_map_index, etc.) with DataError,
the local catch fires first and the worker sees an unactionable 500
even on deployments where a global DataError->4xx handler is wired up.

Let DataError bubble through unchanged so any handler registered on
the execution API can translate it to 413/422. The SQLAlchemyError
->500 fallback for other DB error classes is kept intact.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
…n type

The ti_update_state route's broad except Exception: at line 425 wraps the
\`_create_ti_state_update_query_and_update_state\` call and reroutes any
exception into a 'mark TI FAILED + return 204' branch. When the underlying
exception is DataError (oversized note / rendered_map_index / etc.), the
worker sees a 204 response, believes the state-update succeeded, but the
server has silently marked the task FAILED — corrupting state without any
signal the request was invalid.

Add 'except DataError: raise' before the broad catch so the user-payload
rejection bubbles to the app-level handler and the caller gets a 413/422
with the column name + remediation hint, while the TI stays in its
original state. The broad Exception fallback for unrelated unexpected
errors is kept intact.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
@1fanwang 1fanwang changed the title Stop opaque 500s on execution-API task-instance updates with oversized fields Don't silently mark TI FAILED when state-update payload exceeds column type May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant