Commit 80348ca

fix(db): retry the migration connection on transient slot exhaustion
The migration opens its session on the first query (the advisory-lock acquire). When the deploy database briefly exhausts every non-superuser connection slot at peak, that connect fails with 53300 ("remaining connection slots are reserved for roles with the SUPERUSER attribute") and the deploy's migrate step errors out, even when the spike clears within seconds.

Add a bounded connectWithRetry() that runs before the lock is acquired. It retries 53300, the 08xxx connection_exception class, and the driver's transport errors with backoff (10 attempts, ~90s ceiling). Non-transient errors (auth, bad config) still fail fast.

The migration is a single short-lived session, so waiting out a transient spike is far safer than failing the deploy.
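The "~90s ceiling" can be checked with a little arithmetic. The sketch below is not the repository's backoffWithJitter(), whose formula the diff does not show. It assumes plain exponential doubling from baseMs, capped at maxMs, with jitter left out so the total is deterministic. cappedDelayMs is a hypothetical name introduced only for this illustration.

```typescript
// Hypothetical stand-in for the repo's backoff helper (an assumption, not the
// real backoffWithJitter): doubling from baseMs, capped at maxMs, no jitter.
interface Backoff {
  baseMs: number
  maxMs: number
}

function cappedDelayMs(attempt: number, { baseMs, maxMs }: Backoff): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1))
}

// Values taken from the commit.
const CONNECT_MAX_ATTEMPTS = 10
const CONNECT_RETRY_BACKOFF: Backoff = { baseMs: 1_000, maxMs: 15_000 }

// The final failed attempt rethrows instead of sleeping, so 10 attempts
// produce at most 9 sleeps.
const delays = Array.from({ length: CONNECT_MAX_ATTEMPTS - 1 }, (_, i) =>
  cappedDelayMs(i + 1, CONNECT_RETRY_BACKOFF)
)
const totalMs = delays.reduce((sum, ms) => sum + ms, 0)

console.log(delays.join(', '))
console.log(`worst-case wait under these assumptions: ${totalMs / 1000}s`)
```

Under these assumptions the delays are 1s, 2s, 4s, 8s, then five sleeps at the 15s cap, which sums to 90s. Real runs would differ by whatever jitter the actual helper applies, plus the time each connect attempt itself takes to fail.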
1 parent 358a957 commit 80348ca

1 file changed

Lines changed: 57 additions & 0 deletions

packages/db/scripts/migrate.ts

@@ -71,10 +71,42 @@ const DDL_LOCK_TIMEOUT = '5s'
 const MAX_MIGRATE_ATTEMPTS = 8
 const MIGRATE_RETRY_BACKOFF = { baseMs: 2_000, maxMs: 30_000 } as const

+const CONNECT_MAX_ATTEMPTS = 10
+const CONNECT_RETRY_BACKOFF = { baseMs: 1_000, maxMs: 15_000 } as const
+
+/**
+ * Error codes that mean the database was momentarily unreachable rather than
+ * the migration being wrong: chiefly `53300` (too_many_connections — every
+ * non-superuser slot was taken, surfaced as "remaining connection slots are
+ * reserved for roles with the SUPERUSER attribute"), the `08xxx`
+ * connection_exception class, and the postgres-js driver's own transport
+ * codes. These are retried while opening the session; anything else is fatal.
+ */
+const TRANSIENT_CONNECT_CODES = new Set([
+  '53300',
+  '53400',
+  'CONNECT_TIMEOUT',
+  'CONNECTION_CLOSED',
+  'CONNECTION_DESTROYED',
+  'CONNECTION_ENDED',
+  'ECONNREFUSED',
+  'ECONNRESET',
+  'ETIMEDOUT',
+  'EHOSTUNREACH',
+  'ENOTFOUND',
+])
+
+function isTransientConnectError(error: unknown): boolean {
+  const code = getPostgresErrorCode(error)
+  if (!code) return false
+  return TRANSIENT_CONNECT_CODES.has(code) || code.startsWith('08')
+}
+
 /** Backend pid of the lock-holding session; a change means the lock was lost. */
 let lockSessionPid = 0

 try {
+  await connectWithRetry()
   await acquireMigrationLock()
   try {
     await runMigrationsWithRetry()
@@ -91,6 +123,31 @@ try {
   await client.end()
 }

+/**
+ * Open the migration session before taking the advisory lock, retrying
+ * transient connection failures with bounded backoff. The deploy database can
+ * briefly exhaust every non-superuser connection slot at peak (`53300`); the
+ * migration is a single short-lived session, so waiting out a spike that frees
+ * within seconds is far safer than failing the whole deploy. Non-transient
+ * errors (auth, unknown host config, etc.) still fail fast.
+ */
+async function connectWithRetry(): Promise<void> {
+  for (let attempt = 1; ; attempt++) {
+    try {
+      await client`SELECT 1`
+      return
+    } catch (error) {
+      if (!isTransientConnectError(error) || attempt >= CONNECT_MAX_ATTEMPTS) throw error
+      const delayMs = backoffWithJitter(attempt, null, CONNECT_RETRY_BACKOFF)
+      console.warn(
+        `WARN: database unavailable (${getPostgresErrorCode(error)}); ` +
+          `attempt ${attempt}/${CONNECT_MAX_ATTEMPTS}, retrying in ${Math.round(delayMs)}ms.`
+      )
+      await sleep(delayMs)
+    }
+  }
+}
+
 /**
  * Acquire the cross-process migration lock, failing loudly after the deadline
  * instead of blocking forever behind a wedged runner.
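
To see how the transient/fatal split in this diff plays out, here is a minimal standalone sketch of the classifier. getPostgresErrorCode() is called in the diff but not shown, so errorCode below is a hypothetical stand-in that simply reads a string `code` property off the thrown value. The sample error objects are illustrative and were not captured from a real run.

```typescript
// Hypothetical stand-in for getPostgresErrorCode(): assumes the code lives on
// a string `code` property of the thrown value.
function errorCode(error: unknown): string | undefined {
  if (typeof error === 'object' && error !== null && 'code' in error) {
    const code = (error as { code?: unknown }).code
    return typeof code === 'string' ? code : undefined
  }
  return undefined
}

// Same set as in the commit.
const TRANSIENT_CONNECT_CODES = new Set([
  '53300',
  '53400',
  'CONNECT_TIMEOUT',
  'CONNECTION_CLOSED',
  'CONNECTION_DESTROYED',
  'CONNECTION_ENDED',
  'ECONNREFUSED',
  'ECONNRESET',
  'ETIMEDOUT',
  'EHOSTUNREACH',
  'ENOTFOUND',
])

function isTransientConnectError(error: unknown): boolean {
  const code = errorCode(error)
  if (!code) return false
  return TRANSIENT_CONNECT_CODES.has(code) || code.startsWith('08')
}

// Illustrative inputs only.
const slotExhausted = { code: '53300' } // too_many_connections
const connectionFailure = { code: '08006' } // matched by the '08' prefix
const socketReset = { code: 'ECONNRESET' } // Node.js transport error
const badPassword = { code: '28P01' } // invalid_password
const noCode = new Error('no code attached')

for (const sample of [slotExhausted, connectionFailure, socketReset, badPassword, noCode]) {
  console.log(errorCode(sample) ?? '(none)', isTransientConnectError(sample))
}
```

Under this sketch the first three samples are retried, while the auth failure and the code-less error are rethrown on the first attempt, matching the "fail fast" behaviour the commit message describes.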
