
Route trailing readReadyForQuery error through handleError on extended-protocol cleanup #1321

Open
m1ralx wants to merge 3 commits into lib:master from m1ralx:fix/discard-readyforquery-error-leak

Conversation


m1ralx commented May 9, 2026

Fixes #1320

When a transparent pooler (e.g. pgbouncer's disconnect_server(false, ...) -> send_pooler_error) emits a synthetic ErrorResponse mid-extended-protocol and then closes the socket before sending a trailing ReadyForQuery, the trailing readReadyForQuery() at conn.go's five extended-protocol cleanup sites returns io.EOF, and that error is silently dropped by `_ =`. Before this fix, that meant:

  1. cn.err was never set (the EOF never reached handleError), and IsValid() returned true
  2. The inProgress atomic flag remained stuck at true (since ReadyForQuery never arrived to clear it in recvMessage)
  3. database/sql kept handing out the broken connection from the pool
  4. The CompareAndSwap guard rejected every subsequent query with pq: there is already a query being processed on this connection
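The four-step failure above can be modeled in isolation. This is a simplified sketch with stand-in types; conn, readReadyForQuery, and IsValid here are toy versions written for illustration, not lib/pq's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync/atomic"
)

// conn models the relevant slice of driver state: inProgress is set
// before a query and is supposed to be cleared only when ReadyForQuery
// is read back from the server; bad stands in for the sticky cn.err.
type conn struct {
	inProgress atomic.Bool
	bad        atomic.Bool
}

// readReadyForQuery simulates the pooler closing the socket before
// sending ReadyForQuery: the read returns io.EOF.
func (c *conn) readReadyForQuery() error { return io.EOF }

func (c *conn) query() error {
	if !c.inProgress.CompareAndSwap(false, true) {
		return errors.New("pq: there is already a query being processed on this connection")
	}
	// ... extended-protocol messages exchanged here ...
	_ = c.readReadyForQuery() // bug: EOF silently dropped, flag never cleared
	return nil
}

// IsValid mirrors driver.Validator: the pool evicts the conn only
// when this returns false, which never happens since bad is never set.
func (c *conn) IsValid() bool { return !c.bad.Load() }

func main() {
	c := &conn{}
	fmt.Println(c.query())   // first query appears to succeed
	fmt.Println(c.IsValid()) // true: pool keeps the poisoned conn
	fmt.Println(c.query())   // every later query hits the in-progress guard
}
```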

This is the same end state as #1298, but reached through a different bug path that the merge of #1299 (commit 6d77ced41719616090c9e7eec2c313a18640bc3f) does not close. That change classifies io.ErrUnexpectedEOF in handleError, but on this path the EOF is dropped before handleError ever sees it. The defensive errQueryInProgress -> driver.ErrBadConn wrap from that PR was rejected in review and is not in upstream, so (*DB).retry does not silently retry either. The path is fully open on current master.

This is especially impactful in production behind pgbouncer ≥1.15: the byte sequence is severity ERROR (not FATAL) followed by a TCP close with no ReadyForQuery. The non-FATAL severity also means the parsed *pq.Error is not classified as driver.ErrBadConn by handleError, so the only remaining signal — the EOF on the trailing ReadyForQuery read — was the one that needed to propagate.

One change:

  1. conn.go: route the result of cn.readReadyForQuery() through cn.handleError at the five extended-protocol cleanup sites (readParseResponse, readStatementDescribeResponse, readPortalDescribeResponse, readBindResponse, postExecuteWorkaround). handleError(nil) is a no-op so the happy path is unaffected; for io.EOF, io.ErrUnexpectedEOF, and *net.OpError the existing cn.err.set(driver.ErrBadConn) side effect runs even though the return value is still discarded by _ =. *sql.DB.putConn -> dc.IsValid() then returns false and the conn is closed instead of being returned to freeConn.
```diff
 case proto.ErrorResponse:
     err := parseError(r, "")
-    _ = cn.readReadyForQuery()
+    _ = cn.handleError(cn.readReadyForQuery())
     return err
```

Includes integration test using pqtest.Fake that emits a non-fatal ErrorResponse (SQLSTATE 08P01) and closes the connection without ReadyForQuery, then asserts that a subsequent db.Exec does not short-circuit to errQueryInProgress — i.e. the poisoned conn was actually evicted from the pool rather than merely flagged. The test reproduces the exact production failure mode reported in #1320.

Collaborator

arp242 commented May 12, 2026

I'm looking at this, and I wonder if it wouldn't be a better fix to reset inProgress in a more reliable way? Because this is already the second such bug to occur with it (previous: #1299). When I added this (#1272) I thought this should be fairly safe to do because it adapted a long-existing mechanism, but it seems I was wrong about that – I guess that was always a bit buggy but no one noticed because it's used fairly infrequently. So I wonder how many more subtle bugs exist here?

Collaborator

arp242 commented May 12, 2026

Or maybe just backing out #1272 altogether, or re-thinking how to do that from scratch. It didn't really fix a bug per se, but rather just clarified the error when using pq in the wrong way. So in that sense it's not super-critical.

I had to read your description carefully to follow what's going on here and it was not obvious from just the diff. That's a good indication that this entire logic is not quite right, and even if we do get it 100% right now I can see missing something here in a future change.

Author

m1ralx commented May 12, 2026

Yes, I think the error-handling mechanism here is fairly fragile in its current form.

I considered wrapping cn.err.set(driver.ErrBadConn) in some helper that would also do inProgress.Store(false), and calling it inside readReadyForQuery when handling the cn.recv1 error – but I held off, since that starts to look more like a refactor that requires more context than I have.
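For illustration only, the helper idea might look like the sketch below. Every name here (markBroken, the err field's representation) is hypothetical and does not match lib/pq's actual internals; the point is just that poisoning the conn and releasing the guard become inseparable:

```go
package main

import (
	"database/sql/driver"
	"fmt"
	"sync/atomic"
)

// conn is a stand-in for the driver's connection type.
type conn struct {
	inProgress atomic.Bool
	err        atomic.Value // holds the sticky error, like cn.err
}

// markBroken couples the two side effects that currently live apart:
// poisoning the conn AND releasing the in-progress guard, so no
// cleanup site can do one without the other.
func (c *conn) markBroken() {
	c.err.Store(driver.ErrBadConn)
	c.inProgress.Store(false)
}

// IsValid mirrors driver.Validator: false once the conn is poisoned.
func (c *conn) IsValid() bool { return c.err.Load() == nil }

func main() {
	c := &conn{}
	c.inProgress.Store(true) // a query is underway when the socket dies
	c.markBroken()
	fmt.Println(c.IsValid(), c.inProgress.Load()) // false false
}
```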



Development

Successfully merging this pull request may close these issues.

Discarded readReadyForQuery() EOF causes connection pool poisoning via stuck inProgress flag
