Skip to content

Connection teardown is unbounded on network fault: the query-cancel path (connect_utils._cancel) ignores all timeouts, hanging close(timeout=...) indefinitely #1326

@weidaolee

Description

@weidaolee

Environment

  • asyncpg version: 0.31.0
  • PostgreSQL version: 16 (also reproduces against a trivial in-process fake server — no real PostgreSQL needed)
  • Do you use a PostgreSQL SaaS? Can you reproduce locally?: Original incident was self-hosted PostgreSQL over plaintext (no SSL). Yes — fully reproducible locally, both with a local PostgreSQL + network blackhole and with a self-contained fake server (below).
  • Python version: 3.11.11 (CPython)
  • Platform: Linux x86_64 (6.8.0)
  • Do you use pgbouncer?: No
  • Did you install asyncpg with pip?: Yes (binary wheel)
  • If built locally, Cython version: N/A
  • Reproduced under both asyncio and uvloop?: Yes, both. The affected code path is event-loop agnostic.

Summary

When a command times out (via command_timeout) or is cancelled, asyncpg sends a PostgreSQL CancelRequest by opening a brand-new connection in connect_utils._cancel. Neither that new connection (loop.create_connection, connect_utils.py:1279) nor the subsequent await pr.on_disconnect (connect_utils.py:1288) is bounded by any timeout. If the server is unreachable or wedged at that moment (network partition / blackhole), connection teardown hangs indefinitely — including Connection.close(timeout=N), which ignores its documented timeout.

Precise framing — what works vs. what doesn't

This is not a "command_timeout doesn't fire" bug. command_timeout works correctly: the query is aborted at the deadline.

The defect is in the cancel / teardown path:

  1. command_timeout fires → the query coroutine correctly raises TimeoutError on time.
  2. asyncpg schedules a background cancel task (_cancel_current_command, connection.py:1681) that opens a new connection to deliver the CancelRequest.
  3. That cancel connection is subject to no timeout at all.
  4. When you then close the connection, close() blocks waiting for that in-flight cancel — and no timeout you can set (command_timeout, connect timeout, or close(timeout=)) bounds it.

Documented-behavior contradiction

Connection.close() docstring (connection.py:1502-1510):

Close the connection gracefully.
:param float timeout: Optional timeout value in seconds.

close(timeout=2) is documented to bound the graceful close. It does not — see repro: close(timeout=2) still hangs > 15s.

Self-contained reproduction (no PostgreSQL, no root, single file)

"""asyncpg query-cancel path has no timeout -> connection teardown hangs.
Zero deps beyond asyncpg: a tiny in-process fake server speaks just enough of the
PG v3 wire protocol to let asyncpg connect, stays silent on the query so
command_timeout fires, and holds the resulting CANCEL connection open forever.
conn.close(timeout=2) then hangs despite the explicit timeout.

  pip install asyncpg     # 0.31.0
  python repro.py
"""
import asyncio, struct, time
import asyncpg

CODE_SSLREQUEST = 80877103
CODE_CANCELREQUEST = 80877102


def _param(k, v):
    p = k + b"\x00" + v + b"\x00"
    return struct.pack("!I", 4 + len(p)) + p


async def _read(reader):
    (length,) = struct.unpack("!I", await reader.readexactly(4))
    body = await reader.readexactly(length - 4)
    return struct.unpack("!I", body[:4])[0]


async def _hold(writer):
    try:
        while True:
            await asyncio.sleep(3600)
    finally:
        writer.close()


async def handler(reader, writer):
    code = await _read(reader)
    if code == CODE_SSLREQUEST:                 # asyncpg ssl='disable' -> won't hit this
        writer.write(b"N"); await writer.drain()
        code = await _read(reader)
    if code == CODE_CANCELREQUEST:
        # The brand-new connection asyncpg opened just to send CancelRequest.
        # Accept it and never respond/close -> `await pr.on_disconnect` never resolves.
        print("  [server] CANCEL connection received -> holding open forever", flush=True)
        await _hold(writer)
        return
    # StartupMessage on the DATA connection: minimal handshake.
    writer.write(b"R" + struct.pack("!II", 8, 0))                      # AuthenticationOk
    for k, v in [(b"server_version", b"16.0"), (b"client_encoding", b"UTF8"),
                 (b"server_encoding", b"UTF8"), (b"integer_datetimes", b"on")]:
        writer.write(b"S" + _param(k, v))                             # ParameterStatus
    writer.write(b"K" + struct.pack("!III", 12, 4242, 9999))          # BackendKeyData
    writer.write(b"Z" + struct.pack("!I", 5) + b"I")                  # ReadyForQuery
    await writer.drain()
    # Stay silent on the query (command_timeout fires), then close cleanly after 3s.
    # Closing the DATA socket proves the later hang is the CANCEL conn, not this one.
    await asyncio.sleep(3)
    writer.close()


async def main():
    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    conn = await asyncio.wait_for(
        asyncpg.connect(host=host, port=port, user="x", password="x", database="x",
                        timeout=5, command_timeout=1, ssl="disable"),
        timeout=10)
    print("connected (command_timeout=1)", flush=True)

    t0 = time.monotonic()
    try:
        await conn.fetchval("SELECT 1")
    except asyncio.TimeoutError:
        print(f"fetchval timed out at {time.monotonic()-t0:.1f}s (command_timeout works)", flush=True)

    t1 = time.monotonic()
    try:
        await asyncio.wait_for(conn.close(timeout=2), timeout=15)   # documented timeout
        print(f"conn.close(timeout=2) returned after {time.monotonic()-t1:.1f}s", flush=True)
    except asyncio.TimeoutError:
        print("BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2", flush=True)
    finally:
        server.close()


if __name__ == "__main__":
    asyncio.run(main())

Actual output (CPython 3.11.11 and uvloop, asyncpg 0.31.0):

connected (command_timeout=1)
fetchval timed out at 1.0s (command_timeout works)
  [server] CANCEL connection received -> holding open forever
BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2

Note the DATA socket is closed cleanly after 3s, so the teardown hang is attributable solely to the held-open cancel connection.

Faithful reproduction (real PostgreSQL + true network blackhole)

The fake server above demonstrates the await pr.on_disconnect (line 1288) variant. A real network blackhole hits the loop.create_connection (line 1279) variant — the cancel connection's SYN is silently dropped:

docker run -d --name pg -e POSTGRES_HOST_AUTH_METHOD=trust postgres:16
IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pg)
# establish an asyncpg/SQLAlchemy connection to $IP, then blackhole it:
sudo iptables -I OUTPUT -d $IP -j DROP        # silent drop, no RST
# issue a query/checkout -> command_timeout=5 fires, cancel connection's SYN is
# dropped -> tcp_syn_retries (~127s) -> total hang ~140s
sudo iptables -D OUTPUT -d $IP -j DROP        # cleanup

Through SQLAlchemy 2.0.44 (create_async_engine(..., pool_pre_ping=True, connect_args={"command_timeout":5,"timeout":5})), session.connection() hangs 139.6s (the pre_ping BEGIN times out at 5s, then the cancel connection's SYN blackholes for ~127s, while SQLAlchemy's _terminate_graceful_closeclose(timeout=2) is stuck behind it). This matches a production incident where a transient network partition caused ~140s stalls per DB write despite a 5s command_timeout.

Root cause

connect_utils.py:

async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    if params.ssl and params.sslmode != SSLMode.allow:
        tr, pr = await _create_ssl_connection(CancelProto, *addr, loop=loop, ...)  # no timeout
    else:
        tr, pr = await loop.create_connection(CancelProto, *addr)                  # line 1279, no timeout
        _set_nodelay(_get_socket(tr))
    msg = struct.pack('!llll', 16, 80877102, backend_pid, backend_secret)
    try:
        tr.write(msg)
        await pr.on_disconnect            # line 1288, no timeout
    finally:
        tr.close()

Both the connect and the on_disconnect wait are unbounded. The connection's command_timeout / connect timeout are not propagated here, and Connection.close(timeout=...) cannot interrupt it.

Relationship to existing issues (not a duplicate)


I'm happy to open a PR if this is accepted as a bug.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions