Connection teardown is unbounded on network fault: the query-cancel path (connect_utils._cancel) ignores all timeouts, hanging close(timeout=...) indefinitely

## Environment

- **asyncpg version**: 0.31.0
- **PostgreSQL version**: 16 (also reproduces against a trivial in-process fake server — no real PostgreSQL needed)
- **Do you use a PostgreSQL SaaS? Can you reproduce locally?**: Original incident was self-hosted PostgreSQL over **plaintext** (no SSL). **Yes** — fully reproducible locally, both with a local PostgreSQL + network blackhole and with a self-contained fake server (below).
- **Python version**: 3.11.11 (CPython)
- **Platform**: Linux x86_64 (6.8.0)
- **Do you use pgbouncer?**: No
- **Did you install asyncpg with pip?**: Yes (binary wheel)
- **If built locally, Cython version**: N/A
- **Reproduced under both asyncio and uvloop?**: **Yes, both.** The affected code path is event-loop agnostic.

## Summary

When a command times out (via `command_timeout`) or is cancelled, asyncpg sends a PostgreSQL `CancelRequest` by **opening a brand-new connection** in `connect_utils._cancel`. Neither that new connection (`loop.create_connection`, connect_utils.py:1279) nor the subsequent `await pr.on_disconnect` (connect_utils.py:1288) is bounded by **any** timeout. If the server is unreachable or wedged at that moment (network partition / blackhole), **connection teardown hangs indefinitely — including `Connection.close(timeout=N)`, which ignores its documented `timeout`.**

## Precise framing — what works vs. what doesn't

This is **not** a "command_timeout doesn't fire" bug. `command_timeout` works correctly: the query is aborted at the deadline.

The defect is in the **cancel / teardown path**:
1. `command_timeout` fires → the query coroutine correctly raises `TimeoutError` on time.
2. asyncpg schedules a **background** cancel task (`_cancel_current_command`, connection.py:1681) that opens a new connection to deliver the `CancelRequest`.
3. That cancel connection is subject to **no timeout at all**.
4. When you then **close** the connection, `close()` blocks waiting for that in-flight cancel — and no timeout you can set (`command_timeout`, connect `timeout`, or `close(timeout=)`) bounds it.

## Documented-behavior contradiction

`Connection.close()` docstring (connection.py:1502-1510):

```
Close the connection gracefully.
:param float timeout: Optional timeout value in seconds.
```

`close(timeout=2)` is documented to bound the graceful close. **It does not** — see repro: `close(timeout=2)` still hangs > 15s.

## Self-contained reproduction (no PostgreSQL, no root, single file)

```python
"""asyncpg query-cancel path has no timeout -> connection teardown hangs.
Zero deps beyond asyncpg: a tiny in-process fake server speaks just enough of the
PG v3 wire protocol to let asyncpg connect, stays silent on the query so
command_timeout fires, and holds the resulting CANCEL connection open forever.
conn.close(timeout=2) then hangs despite the explicit timeout.

  pip install asyncpg     # 0.31.0
  python repro.py
"""
import asyncio, struct, time
import asyncpg

CODE_SSLREQUEST = 80877103
CODE_CANCELREQUEST = 80877102


def _param(k, v):
    p = k + b"\x00" + v + b"\x00"
    return struct.pack("!I", 4 + len(p)) + p


async def _read(reader):
    (length,) = struct.unpack("!I", await reader.readexactly(4))
    body = await reader.readexactly(length - 4)
    return struct.unpack("!I", body[:4])[0]


async def _hold(writer):
    try:
        while True:
            await asyncio.sleep(3600)
    finally:
        writer.close()


async def handler(reader, writer):
    code = await _read(reader)
    if code == CODE_SSLREQUEST:                 # asyncpg ssl='disable' -> won't hit this
        writer.write(b"N"); await writer.drain()
        code = await _read(reader)
    if code == CODE_CANCELREQUEST:
        # The brand-new connection asyncpg opened just to send CancelRequest.
        # Accept it and never respond/close -> `await pr.on_disconnect` never resolves.
        print("  [server] CANCEL connection received -> holding open forever", flush=True)
        await _hold(writer)
        return
    # StartupMessage on the DATA connection: minimal handshake.
    writer.write(b"R" + struct.pack("!II", 8, 0))                      # AuthenticationOk
    for k, v in [(b"server_version", b"16.0"), (b"client_encoding", b"UTF8"),
                 (b"server_encoding", b"UTF8"), (b"integer_datetimes", b"on")]:
        writer.write(b"S" + _param(k, v))                             # ParameterStatus
    writer.write(b"K" + struct.pack("!III", 12, 4242, 9999))          # BackendKeyData
    writer.write(b"Z" + struct.pack("!I", 5) + b"I")                  # ReadyForQuery
    await writer.drain()
    # Stay silent on the query (command_timeout fires), then close cleanly after 3s.
    # Closing the DATA socket proves the later hang is the CANCEL conn, not this one.
    await asyncio.sleep(3)
    writer.close()


async def main():
    server = await asyncio.start_server(handler, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    conn = await asyncio.wait_for(
        asyncpg.connect(host=host, port=port, user="x", password="x", database="x",
                        timeout=5, command_timeout=1, ssl="disable"),
        timeout=10)
    print("connected (command_timeout=1)", flush=True)

    t0 = time.monotonic()
    try:
        await conn.fetchval("SELECT 1")
    except asyncio.TimeoutError:
        print(f"fetchval timed out at {time.monotonic()-t0:.1f}s (command_timeout works)", flush=True)

    t1 = time.monotonic()
    try:
        await asyncio.wait_for(conn.close(timeout=2), timeout=15)   # documented timeout
        print(f"conn.close(timeout=2) returned after {time.monotonic()-t1:.1f}s", flush=True)
    except asyncio.TimeoutError:
        print("BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2", flush=True)
    finally:
        server.close()


if __name__ == "__main__":
    asyncio.run(main())
```

**Actual output (CPython 3.11.11 and uvloop, asyncpg 0.31.0):**

```
connected (command_timeout=1)
fetchval timed out at 1.0s (command_timeout works)
  [server] CANCEL connection received -> holding open forever
BUG: conn.close(timeout=2) HUNG > 15s despite explicit timeout=2
```

Note the DATA socket is closed cleanly after 3s, so the teardown hang is attributable **solely to the held-open cancel connection**.

## Faithful reproduction (real PostgreSQL + true network blackhole)

The fake server above demonstrates the **`await pr.on_disconnect`** (line 1288) variant. A real network blackhole hits the **`loop.create_connection`** (line 1279) variant — the cancel connection's SYN is silently dropped:

```bash
docker run -d --name pg -e POSTGRES_HOST_AUTH_METHOD=trust postgres:16
IP=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pg)
# establish an asyncpg/SQLAlchemy connection to $IP, then blackhole it:
sudo iptables -I OUTPUT -d $IP -j DROP        # silent drop, no RST
# issue a query/checkout -> command_timeout=5 fires, cancel connection's SYN is
# dropped -> tcp_syn_retries (~127s) -> total hang ~140s
sudo iptables -D OUTPUT -d $IP -j DROP        # cleanup
```

Through SQLAlchemy 2.0.44 (`create_async_engine(..., pool_pre_ping=True, connect_args={"command_timeout":5,"timeout":5})`), `session.connection()` hangs **139.6s** (the pre_ping `BEGIN` times out at 5s, then the cancel connection's SYN blackholes for ~127s, while SQLAlchemy's `_terminate_graceful_close` → `close(timeout=2)` is stuck behind it). This matches a production incident where a transient network partition caused ~140s stalls per DB write despite a 5s `command_timeout`.

## Root cause

`connect_utils.py`:

```python
async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    if params.ssl and params.sslmode != SSLMode.allow:
        tr, pr = await _create_ssl_connection(CancelProto, *addr, loop=loop, ...)  # no timeout
    else:
        tr, pr = await loop.create_connection(CancelProto, *addr)                  # line 1279, no timeout
        _set_nodelay(_get_socket(tr))
    msg = struct.pack('!llll', 16, 80877102, backend_pid, backend_secret)
    try:
        tr.write(msg)
        await pr.on_disconnect            # line 1288, no timeout
    finally:
        tr.close()
```

Both the connect and the `on_disconnect` wait are unbounded. The connection's `command_timeout` / connect `timeout` are not propagated here, and `Connection.close(timeout=...)` cannot interrupt it.

## Relationship to existing issues (not a duplicate)

- **#1310** — same function, **complementary symptom**. #1310 is about the cancel connection **failing fast** (RST) and raising a wrongly-typed built-in `ConnectionError` that escapes. This report is about the cancel connection **not failing at all** — hanging because it has no timeout. #1310's suggested `except OSError` cannot help a hang (there is nothing to catch). A complete fix for `_cancel` should address **both** (wrong exception type **and** missing timeout).
- **PR #1312** addresses the exception type only.
- Related: **#1211** (connection_lost / leak on cancelling), **#442** (`command_timeout` set, yet `__rollback` hangs — likely the same root cause via the transaction rollback path), **#549** (hang on connectivity loss).

---

I'm happy to open a PR if this is accepted as a bug.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Connection teardown is unbounded on network fault: the query-cancel path (connect_utils._cancel) ignores all timeouts, hanging close(timeout=...) indefinitely #1326

Environment

Summary

Precise framing — what works vs. what doesn't

Documented-behavior contradiction

Self-contained reproduction (no PostgreSQL, no root, single file)

Faithful reproduction (real PostgreSQL + true network blackhole)

Root cause

Relationship to existing issues (not a duplicate)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Connection teardown is unbounded on network fault: the query-cancel path (connect_utils._cancel) ignores all timeouts, hanging close(timeout=...) indefinitely #1326

Description

Environment

Summary

Precise framing — what works vs. what doesn't

Documented-behavior contradiction

Self-contained reproduction (no PostgreSQL, no root, single file)

Faithful reproduction (real PostgreSQL + true network blackhole)

Root cause

Relationship to existing issues (not a duplicate)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions