
Return actionable 4xx when the database rejects an API payload#66888

Open
1fanwang wants to merge 3 commits into apache:main from 1fanwang:fix/db-data-error-actionable

Conversation


@1fanwang 1fanwang commented May 13, 2026

On the LinkedIn DI side we run into this regularly: some users trigger Spark and Hadoop jobs by inlining very large arguments into the Dag run conf — entire dataset configs, serialised job parameters, sometimes whole payloads that should have been XCom or external storage. Today the request returns an opaque 500 and they retry with the same args, getting the same 500. This change makes the API defensive against that anti-pattern: the underlying DB rejection surfaces as a clear 413 with the column name + remediation hint, so the user immediately knows what's wrong and how to fix it (shrink the payload, or have an operator widen the column type on MySQL).

Triggering a Dag run with an oversized conf (and a whole class of similarly shaped writes across the API) currently returns a generic 500 Internal Server Error. On MySQL the SQL error surfaces deep in SQLAlchemy as (1406, "Data too long for column 'conf' at row 1"), so the caller gets no signal that payload size was the cause. Every write endpoint that touches a length-capped column has the same shape today: Connection.extra, Variable.val, XCom.value, TaskInstance.note, the HITL fields, and so on.

This adds a single FastAPI exception handler for sqlalchemy.exc.DataError and registers it on both the public REST API and the task-execution API. Data too long / too large / too big errors map to 413 Content Too Large; out-of-range / numeric-overflow errors map to 422 Unprocessable Entity. The response body carries the original DB error plus an actionable hint to either reduce the payload or widen the column type on MySQL. Postgres deployments never hit it (JSONB has no length cap); MySQL deployments get a clear 4xx with a remediation hint instead of a generic 500.
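For illustration, the handler's shape is roughly the following. This is a minimal sketch based on the behaviour described above, not the PR's actual code; the marker strings, response field names, and handler name are assumptions.

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
from sqlalchemy.exc import DataError

# Sketch only: the substring markers and field names are assumptions that
# mirror the behaviour described above, not the PR's exact implementation.
_TOO_LONG_MARKERS = ("data too long", "too large", "too big", "value too long")

def data_error_handler(request: Request, exc: DataError) -> JSONResponse:
    orig = str(exc.orig) if exc.orig is not None else str(exc)
    # Length-cap violations map to 413; other DataErrors (out-of-range values,
    # numeric overflow) map to 422.
    status_code = 413 if any(m in orig.lower() for m in _TOO_LONG_MARKERS) else 422
    reason = (
        "Payload exceeded database column limit"
        if status_code == 413
        else "Payload value rejected by the database"
    )
    return JSONResponse(
        status_code=status_code,
        content={
            "reason": reason,
            "orig": orig,
            "hint": "Reduce the payload size, or widen the column type on MySQL.",
        },
    )

app = FastAPI()
app.add_exception_handler(DataError, data_error_handler)  # in the PR this happens on both apps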

This replaces #66787, which proposed a config-knob + per-route validator + new exception class for the same problem. Closing that in favour of this minimal, generalised version. #66890 separately fixes two execution-API routes whose local catches shadow this handler.

Reproducer (after docker run --rm -d --name mysql-66888 -e MYSQL_ROOT_PASSWORD=test -e MYSQL_DATABASE=airflow_test -p 3309:3306 mysql:8.0):

import json
from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://root:test@127.0.0.1:3309/airflow_test")
with engine.begin() as c:
    # TEXT holds at most 65,535 bytes; under MySQL 8.0's default strict SQL mode
    # the oversized insert is rejected rather than silently truncated.
    c.execute(text("CREATE TABLE dag_run (id INT PRIMARY KEY AUTO_INCREMENT, conf TEXT) ENGINE=InnoDB"))
    c.execute(text("INSERT INTO dag_run (conf) VALUES (:c)"), {"c": json.dumps({"k": "x" * 70000})})
# sqlalchemy.exc.DataError: (pymysql.err.DataError) (1406, "Data too long for column 'conf' at row 1")

To verify, I drove that same DataError through five real Airflow routes via TestClient(create_app()), with Session.flush/Session.commit monkey-patched to raise it. On main every endpoint returns 500 Internal Server Error; with this PR every endpoint returns 413 Content Too Large with the structured detail body:

--- POST /dags/{id}/dagRuns — DagRun.conf (JSON / TEXT)
    HTTP 413
    reason: Payload exceeded database column limit
    orig:   (1406, "Data too long for column 'conf' at row 1")
--- POST /connections — Connection.extra (JSON)
    HTTP 413
    reason: Payload exceeded database column limit
    orig:   (1406, "Data too long for column 'extra' at row 1")
--- POST /variables — Variable.val (TEXT)
    HTTP 413
    reason: Payload exceeded database column limit
    orig:   (1406, "Data too long for column 'val' at row 1")
--- POST /pools — Pool.description (TEXT)
    HTTP 413
    reason: Payload exceeded database column limit
    orig:   (1406, "Data too long for column 'description' at row 1")
--- POST /assets/events — AssetEvent.extra (JSON)
    HTTP 413
    reason: Payload exceeded database column limit
    orig:   (1406, "Data too long for column 'extra' at row 1")

PASS — all 5 endpoints returned 413 via the global handler
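A minimal sketch of how one of those probes can be driven, assuming pytest's monkeypatch fixture; the import path, route path, payload, and auth handling (omitted here) are illustrative rather than the exact test code:

import pymysql
from fastapi.testclient import TestClient
from sqlalchemy.exc import DataError
from sqlalchemy.orm import Session

from airflow.api_fastapi.app import create_app  # import path is an assumption

def test_oversized_conf_returns_413(monkeypatch):
    # Wrap the real MySQL 1406 error in SQLAlchemy's DataError, the same shape
    # the reproducer above produces.
    orig = pymysql.err.DataError(1406, "Data too long for column 'conf' at row 1")
    db_error = DataError("INSERT INTO dag_run ...", None, orig)

    def raise_data_error(self, *args, **kwargs):
        raise db_error

    # Any route whose write path reaches Session.flush now hits the DataError.
    monkeypatch.setattr(Session, "flush", raise_data_error)

    client = TestClient(create_app())
    resp = client.post(
        "/api/v2/dags/example_dag/dagRuns",  # illustrative path and payload
        json={"conf": {"k": "x" * 70000}},
    )
    assert resp.status_code == 413  # 500 on main, 413 with the handler in place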

Two execution-API routes (PATCH /task-instances/{id}/run, PATCH /task-instances/{id}/state) deliberately fall outside this PR — they catch SQLAlchemyError (parent class of DataError) and re-raise as 500, so the global handler never sees the exception. #66890 fixes that gap. The two PRs are independent and safe to merge in either order; each one is useful on its own.
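The shadowing pattern in those routes looks roughly like this (paraphrased; the function name and detail string are placeholders, not the exact route code):

from fastapi import HTTPException, status
from sqlalchemy.exc import SQLAlchemyError
from sqlalchemy.orm import Session

def update_ti_state(session: Session) -> None:
    # Paraphrased shape of the local catch in the two execution-API routes.
    # DataError is a subclass of SQLAlchemyError, so an oversized payload is
    # swallowed here and re-raised as a generic 500 before the global
    # DataError handler ever runs.
    try:
        session.flush()
    except SQLAlchemyError:
        raise HTTPException(
            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
            detail="Error updating the task instance state",
        )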

airflow-core/tests/unit/api_fastapi/common/test_exceptions.py::TestDataErrorHandler adds five parametrised dialect-error shape tests (MySQL 1406, Postgres value too long for type, SQLite string or blob too big, MySQL 1264, Postgres numeric field overflow) plus an end-to-end FastAPI dispatch test. Existing handler tests still pass. IntegrityError translation (FK / NOT NULL violations beyond the unique-constraint case already handled) is intentionally out of scope — natural follow-up if maintainers like this shape.
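A rough sketch of that parametrisation (the error strings are the ones listed above; status_for is a hypothetical stand-in for the handler's classification step, not an Airflow helper):

import pytest

def status_for(db_message: str) -> int:
    # Hypothetical stand-in for the handler's classification logic.
    markers = ("data too long", "too large", "too big", "value too long")
    return 413 if any(m in db_message.lower() for m in markers) else 422

@pytest.mark.parametrize(
    "db_message, expected",
    [
        ('(1406, "Data too long for column \'conf\' at row 1")', 413),      # MySQL
        ("value too long for type character varying(250)", 413),            # Postgres
        ("string or blob too big", 413),                                     # SQLite
        ('(1264, "Out of range value for column \'val\' at row 1")', 422),  # MySQL
        ("numeric field overflow", 422),                                     # Postgres
    ],
)
def test_dialect_error_shapes(db_message, expected):
    assert status_for(db_message) == expected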

Closes #66779.
Closes #66889.

@boring-cyborg boring-cyborg Bot added area:API Airflow's REST/HTTP API area:task-sdk labels May 13, 2026
@1fanwang 1fanwang force-pushed the fix/db-data-error-actionable branch from 1f743e7 to e0d8b96 Compare May 13, 2026 22:33
@1fanwang 1fanwang changed the title Translate sqlalchemy DataError to 413/422 instead of generic 500 Return actionable 4xx when the database rejects an API payload May 13, 2026
1fanwang added 3 commits May 13, 2026 23:01
Triggering a DAG run with an oversized 'conf' payload (and other
DB-rejected writes across the API surface) currently produces a generic
500. On MySQL the SQL error surfaces deep in SQLAlchemy as
(1406, "Data too long for column 'conf' at row 1"), so the caller has
no signal that payload size was the cause. Every write endpoint that
touches a length-capped column has the same shape today
(Connection.extra, Variable.val, XCom.value, TaskInstance.note, HITL
fields, etc).

Add a single FastAPI exception handler for sqlalchemy.exc.DataError on
both the public REST API and the execution API. 'Data too long' /
'too large' / 'too big' errors map to 413 Content Too Large; other
DataErrors (out-of-range, numeric overflow) map to 422. The response
body carries the original DB error and an actionable hint pointing at
either reducing the payload or widening the column type on MySQL.

Every existing and future write endpoint inherits the translation
automatically. Postgres deployments never hit it (JSONB has no length
cap); MySQL deployments get a clear 4xx + remediation hint instead
of a generic 500.

Closes apache#66779

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Signed-off-by: 1fanwang <1fannnw@gmail.com>
Pass DataError directly to add_exception_handler instead of via the
BaseErrorHandler.exception_cls attribute (typed as instance T, not
type[T]) so the call type-checks against Starlette's expected
type[Exception]. The variance issue between Callable[Request, DataError]
and Callable[Request, Exception] is silenced with a type-ignore matching
the existing pattern used in the core_api ERROR_HANDLERS loop.

In the new TestDataErrorHandler tests, extract HTTPException.detail
into a typed dict before subscripting so mypy stops inferring it as str.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
