ConnectionError from _cancel() during CancelledError not caught, crashes callers #1310

@yemreck

Environment

  • asyncpg version: 0.31.0 (also reproduced on 0.30.x)
  • PostgreSQL version: 16
  • Python version: 3.11.14
  • Platform: Linux (Kubernetes)
  • pgbouncer: No
  • SQLAlchemy: 2.0.23

Summary

When an asyncpg operation is cancelled via asyncio.CancelledError while a query is in flight, the cancellation mechanism in connect_utils._cancel can raise a built-in ConnectionError that escapes to the caller. This is problematic because:

  1. Callers (e.g. SQLAlchemy) expect asyncpg-specific exception types and don't handle built-in ConnectionError
  2. The cancel operation is inherently best-effort: if the cancel connection fails, the error should be suppressed or wrapped, not propagated

This is related to #1211 but occurs on non-direct_tls connections via the cancel request code path.

Reproduction flow

  1. An asyncpg connection is executing a query (e.g. inside SQLAlchemy's session.execute())
  2. The asyncio task is cancelled (task.cancel())
  3. CancelledError propagates into protocol.query() / bind_execute
  4. asyncpg's cancellation handler tries to send a PostgreSQL cancel request by opening a new SSL connection via connect_utils._cancel_create_ssl_connection
  5. The new connection fails (server already closed the original, or network issue)
  6. TLSUpgradeProto.connection_lost() sets built-in ConnectionError('unexpected connection_lost() call') on the on_data future, which is raised where _create_ssl_connection awaits it
  7. This escapes through connect_utils._cancel (which has no error handling around _create_ssl_connection)
  8. Caller receives ConnectionError instead of CancelledError
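
The flow above can be imitated without a database. The following is a minimal sketch using only asyncio; fake_query and fake_cancel_request are hypothetical stand-ins for the asyncpg query and cancel paths, not asyncpg code. It only illustrates how an exception raised inside a CancelledError handler replaces the CancelledError the caller was expecting.

```python
import asyncio


async def fake_cancel_request():
    # Hypothetical stand-in for connect_utils._cancel: the attempt to
    # open the cancel connection fails.
    raise ConnectionError("unexpected connection_lost() call")


async def fake_query():
    # Hypothetical stand-in for a query that is waiting on the server.
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        # The cancellation handler tries a best-effort cancel request.
        # Its failure propagates instead of the CancelledError.
        await fake_cancel_request()
        raise


async def run_demo():
    task = asyncio.create_task(fake_query())
    await asyncio.sleep(0)  # let the task start and block in sleep()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "CancelledError"
    except ConnectionError as exc:
        return f"ConnectionError: {exc}"


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

In this sketch the caller's except asyncio.CancelledError branch is never reached; the ConnectionError branch is.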

Traceback

asyncio.exceptions.CancelledError  (original exception)

During handling of the above exception, another exception occurred:

  File "asyncpg/transaction.py", line 206, in __rollback
    await self._connection.execute(query)
  File "asyncpg/connection.py", line 350, in execute
    result = await self._protocol.query(query, timeout)
  File "asyncpg/connection.py", line 1584, in _cancel
    await connect_utils._cancel(
  File "asyncpg/connect_utils.py", line 1040, in _cancel
    tr, pr = await _create_ssl_connection(
  File "asyncpg/connect_utils.py", line 752, in _create_ssl_connection
    do_ssl_upgrade = await pr.on_data
                     ^^^^^^^^^^^^^^^^
ConnectionError: unexpected connection_lost() call

Root cause

Two issues in connect_utils.py:

1. _cancel() has no error handling around _create_ssl_connection

async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    if params.ssl and params.sslmode != SSLMode.allow:
        tr, pr = await _create_ssl_connection(...)  # ← no try/except!
    ...

The cancel request is best-effort (we're telling PostgreSQL to cancel a query on a connection that may already be dead). If opening the cancel connection fails, the error should be suppressed or wrapped in asyncpg.InterfaceError, not propagated as a raw ConnectionError.

2. TLSUpgradeProto.connection_lost() raises built-in ConnectionError

def connection_lost(self, exc):
    if not self.on_data.done():
        if exc is None:
            exc = ConnectionError('unexpected connection_lost() call')
        self.on_data.set_exception(exc)

This sets a built-in Python ConnectionError, not an asyncpg exception type, on the on_data future; the exception is then raised where _create_ssl_connection awaits pr.on_data. Callers like SQLAlchemy check for asyncpg.InterfaceError or asyncpg.PostgresError to detect disconnects. A built-in ConnectionError bypasses all those checks, which means:

  • SQLAlchemy's is_disconnect() doesn't recognize it
  • SQLAlchemy's pool pre-ping handler (_do_ping_w_event) only catches self.loaded_dbapi.Error, so ConnectionError escapes
  • The pool's retry logic (which would create a fresh connection) never triggers
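
Why the checks listed above miss the exception comes down to class hierarchy. The sketch below uses a hypothetical Error class as a stand-in for a driver's DB-API base exception; it is not SQLAlchemy's or asyncpg's actual code. It shows that the built-in ConnectionError derives from OSError and is therefore invisible to a handler written only against the driver's own hierarchy.

```python
class Error(Exception):
    """Hypothetical stand-in for a driver's DB-API base Error class."""


def matches_driver_handler(exc):
    # Mirrors an `except dbapi.Error:` clause: only exceptions from the
    # driver's own hierarchy match.
    return isinstance(exc, Error)


leaked = ConnectionError("unexpected connection_lost() call")

# The built-in exception is an OSError, not a driver error.
assert isinstance(leaked, OSError)
assert not matches_driver_handler(leaked)

# An exception from the driver's hierarchy would have been caught.
assert matches_driver_handler(Error("connection is closed"))
```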

Suggested fix

Option A (minimal): Catch OSError (parent of ConnectionError) in connect_utils._cancel() and suppress it, since cancel is best-effort:

async def _cancel(*, loop, addr, params, backend_pid, backend_secret):
    ...
    try:
        if params.ssl and params.sslmode != SSLMode.allow:
            tr, pr = await _create_ssl_connection(...)
        ...
    except OSError:
        # Cancel is best-effort. If we can't reach the server, the
        # connection is dead anyway.
        return
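
The effect of Option A on the caller can be sketched outside asyncpg. In the example below, failing_opener, best_effort_cancel and query_with_cancel are hypothetical names introduced for illustration; the real _cancel() signature and body are as quoted in this issue. The point is only that suppressing OSError around the cancel attempt lets the original CancelledError reach the caller.

```python
import asyncio


async def failing_opener():
    # Hypothetical stand-in for _create_ssl_connection failing.
    raise ConnectionError("unexpected connection_lost() call")


async def best_effort_cancel(opener):
    # Returns True if the cancel connection was opened, False if the
    # attempt failed with an OSError (which includes ConnectionError).
    try:
        await opener()
    except OSError:
        return False
    return True


async def query_with_cancel(opener):
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        await best_effort_cancel(opener)
        raise  # the caller still sees the original CancelledError


async def run_demo():
    task = asyncio.create_task(query_with_cancel(failing_opener))
    await asyncio.sleep(0)  # let the task start and block in sleep()
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        return "CancelledError"
    except OSError as exc:
        return f"leaked: {exc!r}"


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

Returning a boolean rather than None is a choice made for this sketch so the outcome can be logged or tested; the snippet in the issue simply returns.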

Option B (comprehensive): Also change TLSUpgradeProto.connection_lost() to raise asyncpg.InterfaceError instead of built-in ConnectionError, so callers can handle it consistently:

def connection_lost(self, exc):
    if not self.on_data.done():
        if exc is None:
            exc = InterfaceError('unexpected connection_lost() call')
        self.on_data.set_exception(exc)
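
A self-contained sketch of the Option B behaviour follows. UpgradeProto and the local InterfaceError class are hypothetical stand-ins for TLSUpgradeProto and asyncpg.InterfaceError, written only to show what a caller awaiting on_data would observe.

```python
import asyncio


class InterfaceError(Exception):
    """Hypothetical stand-in for asyncpg.InterfaceError."""


class UpgradeProto(asyncio.Protocol):
    """Hypothetical stand-in for TLSUpgradeProto."""

    def __init__(self, loop):
        self.on_data = loop.create_future()

    def connection_lost(self, exc):
        if not self.on_data.done():
            if exc is None:
                # Library-specific type instead of built-in ConnectionError.
                exc = InterfaceError("unexpected connection_lost() call")
            self.on_data.set_exception(exc)


async def run_demo(transport_exc=None):
    proto = UpgradeProto(asyncio.get_running_loop())
    proto.connection_lost(transport_exc)
    try:
        await proto.on_data
    except Exception as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
    print(asyncio.run(run_demo(ConnectionResetError("reset by peer"))))
```

As the second call suggests, this change alone only covers the exc is None case: an exception passed in by the transport is still forwarded unchanged, which is one reason to combine it with Option A.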

Impact

This causes process crashes in production services. When a task is cancelled during a DB query, the ConnectionError escapes all exception handlers (which expect either CancelledError or asyncpg-specific exceptions) and terminates the process.

This is 100% correlated with CancelledError in our logs: every ConnectionError: unexpected connection_lost() we've seen is triggered by task cancellation.

Additional context

We use Google CloudSQL with SSL connections. The PostgreSQL server is accessed over SSL (non-direct_tls), which means the cancel code path goes through _create_ssl_connection to establish a new SSL connection for sending the cancel request.
