Commit 2fa3dd6

fix(db): retry the migration connection on transient slot exhaustion (#5226)
* ci(migrations): skip db:migrate on merges that change no migration files

  Every push to main/staging ran db:migrate against the production/staging
  database even when the merge changed no schema, so a no-op migration would
  dial the DB and fail whenever it was at its connection limit (53300, slots
  reserved for SUPERUSER), red-X'ing UI-only merges.

  Add a detect-migrations job (dorny/paths-filter on
  packages/db/migrations/**) and pass the result into the reusable migrations
  workflow, which now skips the apply step when no migration files changed.
  The migrate job still runs so downstream build/deploy jobs that need it are
  never skipped, and the flag defaults to 'true' so manual dispatch and any
  unknown value always apply migrations; the gate only ever skips a
  provably-empty change.

* fix(db): retry the migration connection on transient slot exhaustion

  The migration opens its session on the first query (the advisory-lock
  acquire). When the deploy database briefly exhausts every non-superuser
  connection slot at peak, that connect fails with 53300 ("remaining
  connection slots are reserved for roles with the SUPERUSER attribute") and
  the whole deploy's migrate step errors out, even when the spike clears
  within seconds.

  Add a bounded connectWithRetry() before acquiring the lock that retries
  53300, the 08xxx connection_exception class, and the driver's transport
  errors with backoff (10 attempts, ~90s ceiling). Non-transient errors (auth,
  bad config) still fail fast. The migration is a single short-lived session,
  so waiting out a transient spike is far safer than failing the deploy.

* ci: drop the migration paths-filter gate (out of scope)

  Revert the detect-migrations gate carried over from the closed CI PR; we are
  fixing the connection failure at its source (migrate.ts connection retry)
  rather than gating db:migrate, which the reviewers correctly noted could
  leave a previously-merged migration unapplied after a failed deploy.
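
The "10 attempts, ~90s ceiling" figure can be checked with a short calculation. This is only a sketch: the body of `backoffWithJitter` is not shown in this commit, so the helper below (`nominalDelayMs`, a hypothetical name) assumes the delay doubles from `baseMs`, is capped at `maxMs`, and ignores jitter.

```typescript
// Hypothetical stand-in for the repo's backoff helper, WITHOUT jitter.
// Assumption: the delay doubles from baseMs on each attempt, capped at maxMs.
interface Backoff {
  readonly baseMs: number
  readonly maxMs: number
}

function nominalDelayMs(attempt: number, { baseMs, maxMs }: Backoff): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1))
}

/** Total sleep time if every attempt fails; no sleep follows the final attempt. */
function worstCaseWaitMs(maxAttempts: number, backoff: Backoff): number {
  let total = 0
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    total += nominalDelayMs(attempt, backoff)
  }
  return total
}

// Values taken from the diff in this commit.
const CONNECT_MAX_ATTEMPTS = 10
const CONNECT_RETRY_BACKOFF = { baseMs: 1_000, maxMs: 15_000 } as const

// Delays: 1s, 2s, 4s, 8s, then 15s five times.
console.log(worstCaseWaitMs(CONNECT_MAX_ATTEMPTS, CONNECT_RETRY_BACKOFF)) // 90000
```

Under those assumptions the nine sleeps between ten attempts add up to 90 seconds. Jitter and the time spent on each failed connect would shift the real figure somewhat.
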
1 parent 02d254e commit 2fa3dd6

1 file changed

Lines changed: 57 additions & 0 deletions

packages/db/scripts/migrate.ts

@@ -71,10 +71,42 @@ const DDL_LOCK_TIMEOUT = '5s'
 const MAX_MIGRATE_ATTEMPTS = 8
 const MIGRATE_RETRY_BACKOFF = { baseMs: 2_000, maxMs: 30_000 } as const
 
+const CONNECT_MAX_ATTEMPTS = 10
+const CONNECT_RETRY_BACKOFF = { baseMs: 1_000, maxMs: 15_000 } as const
+
+/**
+ * Error codes that mean the database was momentarily unreachable rather than
+ * the migration being wrong: chiefly `53300` (too_many_connections — every
+ * non-superuser slot was taken, surfaced as "remaining connection slots are
+ * reserved for roles with the SUPERUSER attribute"), the `08xxx`
+ * connection_exception class, and the postgres-js driver's own transport
+ * codes. These are retried while opening the session; anything else is fatal.
+ */
+const TRANSIENT_CONNECT_CODES = new Set([
+  '53300',
+  '53400',
+  'CONNECT_TIMEOUT',
+  'CONNECTION_CLOSED',
+  'CONNECTION_DESTROYED',
+  'CONNECTION_ENDED',
+  'ECONNREFUSED',
+  'ECONNRESET',
+  'ETIMEDOUT',
+  'EHOSTUNREACH',
+  'ENOTFOUND',
+])
+
+function isTransientConnectError(error: unknown): boolean {
+  const code = getPostgresErrorCode(error)
+  if (!code) return false
+  return TRANSIENT_CONNECT_CODES.has(code) || code.startsWith('08')
+}
+
 /** Backend pid of the lock-holding session; a change means the lock was lost. */
 let lockSessionPid = 0
 
 try {
+  await connectWithRetry()
   await acquireMigrationLock()
   try {
     await runMigrationsWithRetry()
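
The classifier in this hunk relies on `getPostgresErrorCode`, whose body is outside the diff. The sketch below shows one way such a check could behave. `errorCode` is a hypothetical stand-in that assumes the error object carries a string `code` property, and the code set is abbreviated for illustration.

```typescript
// Hypothetical stand-in for getPostgresErrorCode(); its real body is not in this diff.
// Assumption: SQLSTATE and transport codes are exposed as a string `code` property.
function errorCode(error: unknown): string | undefined {
  if (typeof error !== 'object' || error === null) return undefined
  const code = (error as { code?: unknown }).code
  return typeof code === 'string' ? code : undefined
}

// Abbreviated subset of the codes listed in the diff.
const TRANSIENT_CODES = new Set(['53300', '53400', 'CONNECT_TIMEOUT', 'ECONNREFUSED'])

function isTransient(error: unknown): boolean {
  const code = errorCode(error)
  if (!code) return false
  // Exact matches, plus the whole SQLSTATE class 08 (connection_exception).
  return TRANSIENT_CODES.has(code) || code.startsWith('08')
}

console.log(isTransient({ code: '53300' })) // true: too_many_connections
console.log(isTransient({ code: '08006' })) // true: class 08, connection_failure
console.log(isTransient({ code: '28P01' })) // false: invalid_password fails fast
console.log(isTransient(new Error('boom'))) // false: no code at all
```

Matching class 08 by prefix avoids enumerating every connection_exception code, while errors without a recognizable code are treated as fatal.
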
@@ -91,6 +123,31 @@ try {
   await client.end()
 }
 
+/**
+ * Open the migration session before taking the advisory lock, retrying
+ * transient connection failures with bounded backoff. The deploy database can
+ * briefly exhaust every non-superuser connection slot at peak (`53300`); the
+ * migration is a single short-lived session, so waiting out a spike that frees
+ * within seconds is far safer than failing the whole deploy. Non-transient
+ * errors (auth, unknown host config, etc.) still fail fast.
+ */
+async function connectWithRetry(): Promise<void> {
+  for (let attempt = 1; ; attempt++) {
+    try {
+      await client`SELECT 1`
+      return
+    } catch (error) {
+      if (!isTransientConnectError(error) || attempt >= CONNECT_MAX_ATTEMPTS) throw error
+      const delayMs = backoffWithJitter(attempt, null, CONNECT_RETRY_BACKOFF)
+      console.warn(
+        `WARN: database unavailable (${getPostgresErrorCode(error)}); ` +
+          `attempt ${attempt}/${CONNECT_MAX_ATTEMPTS}, retrying in ${Math.round(delayMs)}ms.`
+      )
+      await sleep(delayMs)
+    }
+  }
+}
+
 /**
  * Acquire the cross-process migration lock, failing loudly after the deadline
  * instead of blocking forever behind a wedged runner.
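
The control flow of `connectWithRetry()` can be exercised without a database by injecting the operation, the classifier, and the sleep function. The sketch below is an illustration only, not the repository's code: `retryTransient`, `RetryOptions`, and `demo` are hypothetical names, and the jitter-free delay function is an assumption.

```typescript
// Hypothetical, dependency-injected variant of the retry loop in the diff.
type RetryOptions = {
  maxAttempts: number
  isTransient: (error: unknown) => boolean
  delayMs: (attempt: number) => number
  sleep: (ms: number) => Promise<void>
}

async function retryTransient<T>(op: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await op()
    } catch (error) {
      // Fail fast on non-transient errors, or once the attempt budget is spent.
      if (!opts.isTransient(error) || attempt >= opts.maxAttempts) throw error
      await opts.sleep(opts.delayMs(attempt))
    }
  }
}

// Simulates a database that rejects the first two connects with 53300.
async function demo(): Promise<{ calls: number; slept: number[] }> {
  let calls = 0
  const slept: number[] = []
  await retryTransient(
    async () => {
      calls++
      if (calls < 3) throw Object.assign(new Error('slots exhausted'), { code: '53300' })
    },
    {
      maxAttempts: 10,
      isTransient: (e) => (e as { code?: unknown } | null)?.code === '53300',
      delayMs: (attempt) => Math.min(15_000, 1_000 * 2 ** (attempt - 1)), // no jitter
      sleep: async (ms) => {
        slept.push(ms) // record the delay instead of actually waiting
      },
    },
  )
  return { calls, slept }
}

demo().then(({ calls, slept }) => console.log(calls, slept)) // 3 [ 1000, 2000 ]
```

Recording delays rather than sleeping keeps the example instantaneous; a real caller would pass a timer-based `sleep`.
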
