Route trailing readReadyForQuery error through handleError on extended-protocol cleanup #1321
m1ralx wants to merge 3 commits into
Conversation
I'm looking at this, and I wonder if it wouldn't be a better fix to reset `inProgress` in a more reliable way? Because this is already the second such bug to occur with it (previous: #1299). When I added this (#1272) I thought this should be fairly safe to do because it adapted a long-existing mechanism, but it seems I was wrong about that – I guess that was always a bit buggy but no one noticed because it's used fairly infrequently. So I wonder how many more subtle bugs exist here?
Or maybe just back out #1272 altogether, or rethink how to do that from scratch. It didn't really fix a bug per se, but rather just clarified the error when using pq in the wrong way. So in that sense it's not super-critical. I had to read your description carefully to follow what's going on here, and it was not obvious from just the diff. That's a good indication that this entire logic is not quite right, and even if we do get it 100% right now I can see missing something here in a future change.
Yes, I think the error-handling mechanism here is fairly fragile in its current form. I considered wrapping …
Fixes #1320
When a transparent pooler (e.g. pgbouncer's `disconnect_server(false, ...)` -> `send_pooler_error`) emits a synthetic `ErrorResponse` mid-extended-protocol and closes the socket before sending a trailing `ReadyForQuery`, the trailing `readReadyForQuery()` inside `conn.go`'s five extended-protocol cleanup sites returns `io.EOF` — and that error is silently dropped by `_ =`. Prior to this fix, this meant:

- `cn.err` was never set (the EOF never reached `handleError`), so `IsValid()` returned true
- the `inProgress` atomic flag remained stuck at true (since `ReadyForQuery` never arrived to clear it in `recvMessage`)
- `database/sql` kept handing out the broken connection from the pool
- every subsequent query on that connection failed with `pq: there is already a query being processed on this connection` (sketched just below)
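From the application side, the end-state looked roughly like this (a sketch; the first error text is whatever the pooler put in its synthetic `ErrorResponse`, the second is lib/pq's `errQueryInProgress` message):

```go
package example

import (
	"database/sql"
	"fmt"
)

// Sketch of the pre-fix end-state: db is backed by a single pooled
// lib/pq conn that a pooler has just poisoned as described above.
func demonstrateStuckConn(db *sql.DB) {
	// Parameterized, so lib/pq takes the extended-protocol path.
	_, err := db.Exec("SELECT $1::int", 1) // pooler: ErrorResponse + close mid-protocol
	fmt.Println(err)                       // e.g. pq: server conn crashed?

	_, err = db.Exec("SELECT $1::int", 1) // the same conn comes back from the pool
	fmt.Println(err)                      // pq: there is already a query being processed on this connection
}
```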
This is the same end-state as #1298, but reached through a different bug-path that the merge of #1299 (commit 6d77ced41719616090c9e7eec2c313a18640bc3f) does not close. The merged change classifies `io.ErrUnexpectedEOF` in `handleError`, but for this path the EOF is dropped before `handleError` ever sees it. The defensive `errQueryInProgress -> driver.ErrBadConn` wrap from that PR was rejected in review and is not in upstream, so `(*DB).retry` does not silently retry either. The path is fully open in current `master`.

This is especially impactful in production behind pgbouncer ≥ 1.15: the byte sequence is a severity ERROR (not FATAL) `ErrorResponse` followed by a TCP close with no `ReadyForQuery`.
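For reference, that frame can be reconstructed from the standard v3 wire format (a sketch; the SQLSTATE mirrors the behaviour described above, the message text is illustrative):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// PostgreSQL v3 ErrorResponse: type byte 'E', self-inclusive int32
	// length, then null-terminated (tag, value) fields ending in a zero byte.
	body := []byte("SERROR\x00" + // severity ERROR (deliberately not FATAL)
		"C08P01\x00" + // SQLSTATE 08P01 (protocol_violation)
		"Mserver conn crashed?\x00" + // illustrative message text
		"\x00") // field-list terminator
	frame := append([]byte{'E', 0, 0, 0, 0}, body...)
	binary.BigEndian.PutUint32(frame[1:5], uint32(len(body)+4))
	fmt.Printf("% x\n", frame)
	// The pooler writes exactly this and then closes the TCP socket;
	// no ReadyForQuery ('Z') ever follows.
}
```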
The non-FATAL severity also means the parsed `*pq.Error` is not classified as `driver.ErrBadConn` by `handleError`, so the only remaining signal — the EOF on the trailing `ReadyForQuery` read — was the one that needed to propagate.

One change:
- `conn.go`: route the result of `cn.readReadyForQuery()` through `cn.handleError` at the five extended-protocol cleanup sites (`readParseResponse`, `readStatementDescribeResponse`, `readPortalDescribeResponse`, `readBindResponse`, `postExecuteWorkaround`). `handleError(nil)` is a no-op, so the happy path is unaffected; for `io.EOF`, `io.ErrUnexpectedEOF`, and `*net.OpError` the existing `cn.err.set(driver.ErrBadConn)` side effect runs even though the return value is still discarded by `_ =`. `*sql.DB.putConn -> dc.IsValid()` then returns false and the conn is closed instead of being returned to `freeConn`.

```diff
 case proto.ErrorResponse:
 	err := parseError(r, "")
-	_ = cn.readReadyForQuery()
+	_ = cn.handleError(cn.readReadyForQuery())
 	return err
```
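For context, the classification behaviour this relies on looks roughly like the following. This is a standalone analogue reconstructed from the description above, not a copy of `conn.go`; `stickyErr` stands in for lib/pq's internal `cn.err`:

```go
package pq

import (
	"database/sql/driver"
	"errors"
	"io"
	"net"
	"sync"
)

// stickyErr stands in for lib/pq's internal cn.err: once set, the
// driver reports the conn as invalid and database/sql closes it.
type stickyErr struct {
	mu  sync.Mutex
	err error
}

func (s *stickyErr) set(err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.err == nil {
		s.err = err // first error wins
	}
}

type conn struct{ err stickyErr }

// handleError(nil) is a no-op, so routing the happy path through it is
// free; transport-level errors latch driver.ErrBadConn as a side effect
// before the (possibly discarded) error is returned.
func (cn *conn) handleError(err error) error {
	if err == nil {
		return nil
	}
	var op *net.OpError
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) || errors.As(err, &op) {
		cn.err.set(driver.ErrBadConn) // what makes dc.IsValid() return false
	}
	return err
}
```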
Includes an integration test using `pqtest.Fake` that emits a non-fatal `ErrorResponse` (SQLSTATE `08P01`) and closes the connection without `ReadyForQuery`, then asserts that a subsequent `db.Exec` does not short-circuit to `errQueryInProgress` — i.e. the poisoned conn was actually evicted from the pool rather than merely flagged. The test reproduces the exact production failure mode reported in #1320.
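Since `pqtest.Fake`'s API isn't shown in this PR, here is a self-contained sketch of that test using a plain `net.Listener` as the fake backend instead (test name, DSN details, and message text are illustrative; it assumes lib/pq completes startup on AuthenticationOk plus ReadyForQuery):

```go
package pq_test

import (
	"database/sql"
	"encoding/binary"
	"fmt"
	"io"
	"net"
	"strings"
	"testing"

	_ "github.com/lib/pq"
)

// frame builds a backend message: type byte + self-inclusive int32 length + body.
func frame(t byte, body []byte) []byte {
	out := append([]byte{t, 0, 0, 0, 0}, body...)
	binary.BigEndian.PutUint32(out[1:5], uint32(len(body)+4))
	return out
}

func TestPoolerErrorWithoutReadyForQueryEvictsConn(t *testing.T) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		t.Fatal(err)
	}

	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		ln.Close() // serve exactly one conn; later dials fail fast
		defer c.Close()

		// Swallow the StartupMessage (int32 length includes itself, no type
		// byte), then finish startup: AuthenticationOk + ReadyForQuery(idle).
		var n uint32
		_ = binary.Read(c, binary.BigEndian, &n)
		_, _ = io.CopyN(io.Discard, c, int64(n)-4)
		_, _ = c.Write(frame('R', []byte{0, 0, 0, 0}))
		_, _ = c.Write(frame('Z', []byte{'I'}))

		// Wait for the first extended-protocol message (Parse), then emit a
		// non-fatal ErrorResponse and close with no trailing ReadyForQuery,
		// mimicking pgbouncer's send_pooler_error path.
		_, _ = c.Read(make([]byte, 1))
		_, _ = c.Write(frame('E', []byte("SERROR\x00C08P01\x00Mserver conn crashed?\x00\x00")))
	}()

	dsn := fmt.Sprintf("host=127.0.0.1 port=%d user=u dbname=d sslmode=disable",
		ln.Addr().(*net.TCPAddr).Port)
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		t.Fatal(err)
	}
	defer db.Close()
	db.SetMaxOpenConns(1) // force reuse of the single (possibly poisoned) conn

	// Parameterized, so lib/pq takes the extended-protocol path.
	if _, err := db.Exec("SELECT $1::int", 1); err == nil {
		t.Fatal("expected the synthetic pooler error")
	}

	// Before the fix, the poisoned conn stays in the pool and this call
	// short-circuits to errQueryInProgress; after it, the conn is evicted
	// and we get an ordinary (re)connection error instead.
	_, err = db.Exec("SELECT $1::int", 1)
	if err != nil && strings.Contains(err.Error(), "already a query being processed") {
		t.Fatalf("poisoned conn was handed out again: %v", err)
	}
}
```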