Don't silently mark TI FAILED when state-update payload exceeds column type #66890
Open
1fanwang wants to merge 2 commits into
…d fields

Two execution-API routes (ti_run, ti_update_state) catch the broad SQLAlchemyError parent class and re-raise as a generic 500. When the DB rejects a payload (note, rendered_map_index, etc.) with DataError, the local catch fires first and the worker sees an unactionable 500 even on deployments where a global DataError -> 4xx handler is wired up.

Let DataError bubble through unchanged so any handler registered on the execution API can translate it to 413/422. The SQLAlchemyError -> 500 fallback for other DB error classes is kept intact.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
This was referenced May 13, 2026
…n type

The ti_update_state route's broad except Exception: at line 425 wraps the `_create_ti_state_update_query_and_update_state` call and reroutes any exception into a 'mark TI FAILED + return 204' branch. When the underlying exception is DataError (oversized note / rendered_map_index / etc.), the worker sees a 204 response and believes the state update succeeded, but the server has silently marked the task FAILED, corrupting state without any signal that the request was invalid.

Add 'except DataError: raise' before the broad catch so the user-payload rejection bubbles to the app-level handler and the caller gets a 413/422 with the column name + remediation hint, while the TI stays in its original state. The broad Exception fallback for unrelated unexpected errors is kept intact.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
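The control flow this commit describes can be sketched without Airflow or SQLAlchemy. Everything below is a stand-in rather than the actual route code: the two exception classes only mirror the fact that SQLAlchemy's DataError is a subclass of SQLAlchemyError, and NOTE_MAX, write_state, update_state_before, update_state_after, and the dict-based task instance are hypothetical names chosen for illustration.

```python
class SQLAlchemyError(Exception):
    """Stand-in for sqlalchemy.exc.SQLAlchemyError."""


class DataError(SQLAlchemyError):
    """Stand-in for sqlalchemy.exc.DataError (also a subclass in the real library)."""


NOTE_MAX = 1000  # hypothetical column limit, for illustration only


def write_state(ti, payload):
    # Hypothetical DB write: reject an oversized note the way the DB would.
    if len(payload.get("note", "")) > NOTE_MAX:
        raise DataError("Data too long for column 'note'")
    ti["state"] = payload["state"]


def update_state_before(ti, payload):
    # Shape of the old behaviour: any exception, including DataError,
    # lands in the "mark FAILED + return 204" branch.
    try:
        write_state(ti, payload)
    except Exception:
        ti["state"] = "failed"
    return 204


def update_state_after(ti, payload):
    # Shape of the fix: DataError is re-raised before the broad catch,
    # so the task instance keeps its original state.
    try:
        write_state(ti, payload)
    except DataError:
        raise
    except Exception:
        ti["state"] = "failed"
    return 204
```

With an oversized note, update_state_before returns 204 and leaves the stand-in task instance in "failed", whereas update_state_after raises DataError and leaves the state untouched. Translating that exception into a 413/422 is left to whatever app-level handler is registered.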
Same class of bug as #66888 but on the execution-API side. Walking through our LinkedIn DI cluster's (1406, ...) traces, we found two task-instance routes that catch broader-than-DataError exceptions and swallow the DB rejection before any global handler can see it. One returns an opaque 500; the other silently reroutes into a 'mark TI FAILED + return 204' branch, which is worse, because the worker thinks the state update succeeded while the server has already marked the task FAILED.

Two execution-API routes, PATCH /task-instances/{id}/run and PATCH /task-instances/{id}/state, wrap their DB-touching code in catches broader than DataError. When a worker submits a payload with an oversized field (note, rendered_map_index, external_executor_id, extra), the DB rejects with DataError, the local catch fires first, and the user-visible result is wrong in two distinct ways:

Two except SQLAlchemyError: catches (in ti_run at line 305 and in ti_update_state at line 465) re-raise as HTTPException(500, "Database error occurred"): an opaque 500 with no hint at the column or remediation.

One broader except Exception: (in ti_update_state at line 425) reroutes into a "mark this TI FAILED + return 204" branch. That last one is worse than an opaque 500: the worker thinks the state update succeeded, but the server has silently marked the task FAILED, which is state corruption with no signal that the request was invalid.

All three swallow DataError before the global handler from #66888 can see it, so registering the handler alone does not fix these routes.

This PR adds except DataError: raise before each of the three broad catches so the user-payload rejection bubbles to the app-level handler and the caller gets the actionable 413/422 instead. The SQLAlchemyError → 500 and Exception → mark-FAILED fallbacks for unrelated unexpected errors stay intact.

End-to-end against PATCH /task-instances/{id}/state driven through TestClient(cached_app(apps="execution")) with the underlying state-update raising the same DataError MySQL would raise on an oversized note. Three states:

The MIDDLE state is load-bearing: it shows #66888 alone is not enough for this endpoint, which is the whole reason this PR exists.

Three catches in one file, same routes, same fix shape (except DataError: raise before the existing catch), 12 lines of additions total. Read naturally together as "let DataError bubble through the task-instance update routes." Independent of #66888 in merge ordering: each PR is useful on its own, but the user-visible 413 only lands when both are applied.

Closes (partially) #66889.
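To illustrate why a global handler alone is not enough for the except SQLAlchemyError: routes, here is a minimal, framework-free sketch. It makes several assumptions: the exception classes are stand-ins for the SQLAlchemy ones, HTTPError stands in for the framework's HTTPException, dispatch is a hypothetical substitute for an app-level DataError handler, and 422 is picked arbitrarily from the 413/422 pair mentioned above.

```python
class SQLAlchemyError(Exception):
    """Stand-in for sqlalchemy.exc.SQLAlchemyError."""


class DataError(SQLAlchemyError):
    """Stand-in for sqlalchemy.exc.DataError."""


class HTTPError(Exception):
    """Stand-in for the framework's HTTPException."""

    def __init__(self, status_code, detail):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def route_before(db_call):
    # Broad catch: DataError is a SQLAlchemyError, so it becomes a 500 here
    # and never reaches the app-level handler.
    try:
        return db_call()
    except SQLAlchemyError:
        raise HTTPError(500, "Database error occurred")


def route_after(db_call):
    # Same route with the pass-through added before the broad catch.
    try:
        return db_call()
    except DataError:
        raise
    except SQLAlchemyError:
        raise HTTPError(500, "Database error occurred")


def dispatch(route, db_call):
    # Hypothetical app-level layer: maps DataError to 422 (an assumption),
    # and passes HTTPError status codes through unchanged.
    try:
        return 200, route(db_call)
    except DataError as exc:
        return 422, str(exc)
    except HTTPError as exc:
        return exc.status_code, exc.detail


def oversized():
    raise DataError("Data too long for column 'note'")


def other_db_failure():
    raise SQLAlchemyError("connection lost")
```

Under these assumptions, dispatch(route_before, oversized) still yields a 500 even though the handler exists, while dispatch(route_after, oversized) yields the 4xx. dispatch(route_after, other_db_failure) keeps the 500 fallback.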