
Route trailing readReadyForQuery error through handleError on extended-protocol cleanup #1321

Open
m1ralx wants to merge 3 commits into lib:master from m1ralx:fix/discard-readyforquery-error-leak

Conversation


m1ralx commented May 9, 2026

Fixes #1320

When a transparent pooler (e.g. pgbouncer's disconnect_server(false, ...) -> send_pooler_error) emits a synthetic ErrorResponse mid-extended-protocol and then closes the socket before sending a trailing ReadyForQuery, the trailing readReadyForQuery() at conn.go's five extended-protocol cleanup sites returns io.EOF, and that error is silently dropped by `_ =`. Before this fix, that meant:

  1. cn.err was never set (the EOF never reached handleError), and IsValid() returned true
  2. The inProgress atomic flag remained stuck at true (since ReadyForQuery never arrived to clear it in recvMessage)
  3. database/sql kept handing out the broken connection from the pool
  4. The CompareAndSwap guard rejected every subsequent query with pq: there is already a query being processed on this connection
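The four-step failure above can be modeled in isolation. This is a simplified sketch with stand-in types; conn, readReadyForQuery, and IsValid here are toy versions written for illustration, not lib/pq's actual implementation:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync/atomic"
)

// conn models the relevant slice of driver state: inProgress is set
// before a query and is supposed to be cleared only when ReadyForQuery
// is read back from the server; bad stands in for the sticky cn.err.
type conn struct {
	inProgress atomic.Bool
	bad        atomic.Bool
}

// readReadyForQuery simulates the pooler closing the socket before
// sending ReadyForQuery: the read returns io.EOF.
func (c *conn) readReadyForQuery() error { return io.EOF }

func (c *conn) query() error {
	if !c.inProgress.CompareAndSwap(false, true) {
		return errors.New("pq: there is already a query being processed on this connection")
	}
	// ... extended-protocol messages exchanged here ...
	_ = c.readReadyForQuery() // bug: EOF silently dropped, flag never cleared
	return nil
}

// IsValid mirrors driver.Validator: the pool evicts the conn only
// when this returns false, which never happens since bad is never set.
func (c *conn) IsValid() bool { return !c.bad.Load() }

func main() {
	c := &conn{}
	fmt.Println(c.query())   // first query appears to succeed
	fmt.Println(c.IsValid()) // true: pool keeps the poisoned conn
	fmt.Println(c.query())   // every later query hits the in-progress guard
}
```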

This is the same end state as #1298, but reached through a different bug path that the merge of #1299 (commit 6d77ced41719616090c9e7eec2c313a18640bc3f) does not close. That change classifies io.ErrUnexpectedEOF in handleError, but on this path the EOF is dropped before handleError ever sees it. The defensive errQueryInProgress -> driver.ErrBadConn wrap from that PR was rejected in review and is not in upstream, so (*DB).retry does not silently retry either. The path is fully open on current master.

This is especially impactful in production behind pgbouncer ≥1.15: the byte sequence is severity ERROR (not FATAL) followed by a TCP close with no ReadyForQuery. The non-FATAL severity also means the parsed *pq.Error is not classified as driver.ErrBadConn by handleError, so the only remaining signal — the EOF on the trailing ReadyForQuery read — was the one that needed to propagate.

One change:

  1. conn.go: route the result of cn.readReadyForQuery() through cn.handleError at the five extended-protocol cleanup sites (readParseResponse, readStatementDescribeResponse, readPortalDescribeResponse, readBindResponse, postExecuteWorkaround). handleError(nil) is a no-op so the happy path is unaffected; for io.EOF, io.ErrUnexpectedEOF, and *net.OpError the existing cn.err.set(driver.ErrBadConn) side effect runs even though the return value is still discarded by _ =. *sql.DB.putConn -> dc.IsValid() then returns false and the conn is closed instead of being returned to freeConn.
```diff
 case proto.ErrorResponse:
     err := parseError(r, "")
-    _ = cn.readReadyForQuery()
+    _ = cn.handleError(cn.readReadyForQuery())
     return err
```

Includes integration test using pqtest.Fake that emits a non-fatal ErrorResponse (SQLSTATE 08P01) and closes the connection without ReadyForQuery, then asserts that a subsequent db.Exec does not short-circuit to errQueryInProgress — i.e. the poisoned conn was actually evicted from the pool rather than merely flagged. The test reproduces the exact production failure mode reported in #1320.

Collaborator

arp242 commented May 12, 2026

I'm looking at this, and I wonder if it wouldn't be a better fix to reset inProgress in a more reliable way? Because this is already the second such bug to occur with it (previous: #1299). When I added this (#1272) I thought this should be fairly safe to do because it adapted a long-existing mechanism, but it seems I was wrong about that – I guess that was always a bit buggy but no one noticed because it's used fairly infrequently. So I wonder how many more subtle bugs exist here?

Collaborator

arp242 commented May 12, 2026

Or maybe just backing out #1272 altogether, or re-thinking how to do that from scratch. It didn't really fix a bug per se, but rather just clarified the error when using pq in the wrong way. So in that sense it's not super-critical.

I had to read your description carefully to follow what's going on here and it was not obvious from just the diff. That's a good indication that this entire logic is not quite right, and even if we do get it 100% right now I can see missing something here in a future change.

Author

m1ralx commented May 12, 2026

Yes, I think the error-handling mechanism here is fairly fragile in its current form.

I considered wrapping cn.err.set(driver.ErrBadConn) in some helper that would also do inProgress.Store(false), and calling it inside readReadyForQuery when handling the cn.recv1 error – but I held off, since that starts to look more like a refactor that requires more context than I have.
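For illustration only, the helper idea might look like the sketch below. Every name here (markBroken, the err field's representation) is hypothetical and does not match lib/pq's actual internals; the point is just that poisoning the conn and releasing the guard become inseparable:

```go
package main

import (
	"database/sql/driver"
	"fmt"
	"sync/atomic"
)

// conn is a stand-in for the driver's connection type.
type conn struct {
	inProgress atomic.Bool
	err        atomic.Value // holds the sticky error, like cn.err
}

// markBroken couples the two side effects that currently live apart:
// poisoning the conn AND releasing the in-progress guard, so no
// cleanup site can do one without the other.
func (c *conn) markBroken() {
	c.err.Store(driver.ErrBadConn)
	c.inProgress.Store(false)
}

// IsValid mirrors driver.Validator: false once the conn is poisoned.
func (c *conn) IsValid() bool { return c.err.Load() == nil }

func main() {
	c := &conn{}
	c.inProgress.Store(true) // a query is underway when the socket dies
	c.markBroken()
	fmt.Println(c.IsValid(), c.inProgress.Load()) // false false
}
```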



Development

Successfully merging this pull request may close these issues.

Discarded readReadyForQuery() EOF causes connection pool poisoning via stuck inProgress flag
