Commit 04132cb

Fix race condition deleting active runs when the write in active runs repository is locked

1 parent 8bd997b commit 04132cb

File tree

6 files changed: +700 −2 lines changed


apps/gateway/src/routes/api/v3/projects/versions/documents/run/run.handler.ts

Lines changed: 2 additions & 0 deletions
```diff
@@ -67,6 +67,8 @@ export const runHandler: AppRouteHandler<RunRoute> = async (c) => {
   const shouldRunInBackground =
     background !== undefined ? background : backgroundRunsFeatureEnabled
 
+  console.log('SHOULD_RUN_IN_BACKGROUND', shouldRunInBackground)
+
   if (shouldRunInBackground) {
     return await handleBackgroundRun({
       c,
```

package.json

Lines changed: 4 additions & 1 deletion
```diff
@@ -14,7 +14,10 @@
     "prettier:check": "prettier --check \"**/*.{ts,tsx,md}\" --ignore-path .prettierrcignore",
     "test": "turbo test",
     "catchup": "pnpm i --ignore-scripts && pnpm build --filter=\"./packages/**/*\" && pnpm --filter \"@latitude-data/web\" workers:build && pnpm --filter \"@latitude-data/core\" db:migrate",
-    "console": "clear && ./bin/console.ts"
+    "console": "clear && ./bin/console.ts",
+    "stress-test": "pnpm --filter @latitude-data/core stress-test",
+    "check-stuck-runs": "pnpm --filter @latitude-data/core check-stuck-runs",
+    "watch-stuck-runs": "pnpm --filter @latitude-data/core watch-stuck-runs"
   },
   "devDependencies": {
     "@babel/parser": "^7.28.4",
```

packages/core/package.json

Lines changed: 4 additions & 1 deletion
```diff
@@ -103,7 +103,10 @@
     "lint": "eslint src",
     "tc": "tsc --noEmit",
     "test": "pnpm run db:migrate:test && vitest run --pool=forks",
-    "test:watch": "vitest"
+    "test:watch": "vitest",
+    "stress-test": "tsx scripts/stress-test-runs.ts",
+    "check-stuck-runs": "tsx scripts/check-stuck-runs.ts",
+    "watch-stuck-runs": "tsx scripts/check-stuck-runs.ts --watch"
   },
   "dependencies": {
     "@ai-sdk/amazon-bedrock": "3.0.21",
```

packages/core/scripts/README.md

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
# Stress Test Scripts - Race Condition Reproduction

These scripts help reproduce and diagnose the race condition where runs get stuck in Redis cache after BullMQ jobs complete.

## Setup

Make sure your services are running:

```bash
# Terminal 1: Start Redis
docker-compose up redis

# Terminal 2: Start services
pnpm dev

# Terminal 3: Start workers
cd apps/workers && pnpm dev
```

## Scripts

### 1. `stress-test-runs.ts` - Trigger race condition

Makes parallel API requests to stress test the run queue and provoke lock contention.

**Usage (from repo root):**

```bash
# Start with 2 requests
pnpm stress-test 2

# Scale up
pnpm stress-test 10
pnpm stress-test 40
pnpm stress-test 100
```

**Or run directly:**

```bash
pnpm --filter @latitude-data/core stress-test 40
tsx packages/core/scripts/stress-test-runs.ts 40
```

**Configuration:**

Edit `packages/core/scripts/stress-test-runs.ts` to change:

- `API_URL` - Your API endpoint
- `API_KEY` - Your API key
- `DOCUMENT_PATH` - Document to run
- `PARAMETERS` - Parameters for the run
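The configuration values listed above are plain constants at the top of the script. A minimal sketch of what it might look like, assuming the gateway's v3 run endpoint accepts `path`, `parameters`, and `background` in the request body; the URL, port, and values below are placeholders, not the script's actual contents:

```typescript
// Hypothetical sketch of stress-test-runs.ts; URL, key, and payload values are placeholders.
const API_URL =
  'http://localhost:8787/api/v3/projects/50/versions/live/documents/run' // assumed route shape
const API_KEY = 'your-api-key'
const DOCUMENT_PATH = 'my-document'
const PARAMETERS = { topic: 'stress test' }

async function triggerRun(i: number) {
  // One run request; `background: true` forces the background-run path in run.handler.ts
  const res = await fetch(API_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      path: DOCUMENT_PATH,
      parameters: PARAMETERS,
      background: true,
    }),
  })
  console.log(`request #${i}: ${res.status}`)
}

async function main() {
  const count = Number(process.argv[2] ?? 2)
  // Fire all requests at once so several runs contend for the same workspace/project lock
  await Promise.allSettled(Array.from({ length: count }, (_, i) => triggerRun(i)))
}

main()
```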
### 2. `check-stuck-runs.ts` - Verify race condition

Checks Redis for active runs that don't have corresponding BullMQ jobs.

**Usage (from repo root):**

```bash
# Run once
pnpm check-stuck-runs

# Continuous monitoring (updates every 2 seconds)
pnpm watch-stuck-runs

# Or run directly
pnpm --filter @latitude-data/core check-stuck-runs
pnpm --filter @latitude-data/core watch-stuck-runs
tsx packages/core/scripts/check-stuck-runs.ts --watch
```

**What it checks:**

- Reads `runs:active:*` keys from Redis
- For each run UUID, checks if BullMQ job exists
- Reports stuck runs (in cache but no job)
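A rough sketch of that check, assuming the cache keys follow the `latitude:runs:active:{workspaceId}:{projectId}` pattern shown in the `redis-cli` examples below, that each value is a JSON map keyed by run UUID, and that BullMQ jobs are enqueued with the run UUID as the job id; all of these are assumptions about the script's internals:

```typescript
// Hypothetical check, not the actual check-stuck-runs.ts code.
import Redis from 'ioredis'
import { Queue } from 'bullmq'

const redis = new Redis() // localhost:6379 by default
const runsQueue = new Queue('runsQueue', { connection: { host: 'localhost', port: 6379 } })

async function findStuckRuns(): Promise<string[]> {
  const stuck: string[] = []
  // Every active-runs cache key (one per workspace/project pair)
  const keys = await redis.keys('latitude:runs:active:*')
  for (const key of keys) {
    const raw = await redis.get(key)
    if (!raw) continue
    // Assumed value shape: { [runUuid]: { queuedAt, startedAt, ... } }
    const activeRuns = JSON.parse(raw) as Record<string, { queuedAt?: string }>
    for (const uuid of Object.keys(activeRuns)) {
      const job = await runsQueue.getJob(uuid) // undefined when the job no longer exists
      if (!job) {
        stuck.push(uuid)
        console.log(`⚠️ STUCK RUN FOUND: ${uuid} (cache key ${key})`)
      }
    }
  }
  return stuck
}

findStuckRuns().then((stuck) => process.exit(stuck.length > 0 ? 1 : 0))
```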
**Watch mode features:**

- Live status line showing current state
- Alerts when NEW stuck runs are detected
- Tracks total stuck runs found
- Press Ctrl+C to stop
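Watch mode is essentially the same check wrapped in a timer. A sketch of the loop, where `findStuckRuns` stands in for the check sketched above (again an assumption about how the script is structured):

```typescript
// Hypothetical watch loop; findStuckRuns stands in for the check sketched above.
declare function findStuckRuns(): Promise<string[]>

const seen = new Set<string>()
let checkCount = 0

setInterval(async () => {
  checkCount++
  const stuck = await findStuckRuns()
  // Only alert on runs we have not reported yet
  for (const uuid of stuck) {
    if (seen.has(uuid)) continue
    seen.add(uuid)
    console.log(`🚨 NEW STUCK RUN DETECTED! UUID: ${uuid}`)
  }
  console.log(`Check #${checkCount} | Stuck: ${stuck.length} | Total found: ${seen.size}`)
}, 2_000) // matches the 2-second interval mentioned above
```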
## Reproduction Steps

### Method 1: Watch Mode (Recommended)

1. **Start watching for stuck runs** (in Terminal 1):
   ```bash
   pnpm watch-stuck-runs
   ```
   You'll see:
   ```
   🔍 Starting continuous monitoring for stuck runs...
      Check interval: 2000ms
      Press Ctrl+C to stop

   ✅ [12:34:56] Check #1 | Cache keys: 0 | Total runs: 0 | Stuck: 0 | Total found: 0
   ```

2. **Run stress test** (in Terminal 2):
   ```bash
   pnpm stress-test 40
   ```

3. **Watch Terminal 1** - You'll see alerts when stuck runs appear:
   ```
   ⚠️ [12:35:02] Check #8 | Cache keys: 1 | Total runs: 12 | Stuck: 3 | Total found: 3

   🚨 NEW STUCK RUN DETECTED!
      UUID: abc-123-def-456
      Queued: 2024-01-01T12:35:00.000Z (2s ago)
      Started: Not started
   ```

4. **Press Ctrl+C** to stop watching

### Method 2: Manual Check

1. **Run stress test:**
   ```bash
   pnpm stress-test 40
   ```

2. **Wait 5-10 seconds** for jobs to complete

3. **Check for stuck runs:**
   ```bash
   pnpm check-stuck-runs
   ```

4. **Expected result if race condition occurs:**
   ```
   ⚠️ STUCK RUN FOUND:
      UUID: abc-123-def
      Queued: 2024-01-01T12:00:00.000Z (30s ago)
      Job exists in BullMQ: NO ❌

   🎯 RACE CONDITION CONFIRMED!
   ```

### Also check the UI:

- Go to http://localhost:3000/projects/50
- Look for runs stuck "in progress" (spinning icon)
- Try clicking them → should see "Active run job not found" error

## Alternative Manual Checks

### Check Redis directly:

```bash
# List all active run cache keys
redis-cli --scan --pattern "latitude:runs:active:*"

# Get contents of a key
redis-cli GET "latitude:runs:active:1:50"
```

### Check BullMQ admin:

- Open http://localhost:3000/admin/queues/runsQueue
- Look for completed/failed jobs
- Compare with stuck runs in cache

### Check logs:

```bash
# Worker logs
cd apps/workers && pnpm logs

# Look for:
# - "Failed to acquire lock" errors
# - endRun() failures
```
## What Causes the Race Condition?

**Lock Contention:**

- Multiple runs for same workspace/project share one Redis lock
- Lock key: `lock:runs:active:{workspaceId}:{projectId}`
- Lock timeout: 5 seconds (default in `withCacheLock()`)

**Failure Scenario:**

1. Job A updates run → holds lock
2. Job B completes → tries `endRun()` → waits for lock
3. After 5s → `withCacheLock()` throws timeout
4. Job B marked complete, removed from BullMQ
5. **Run B stuck in Redis cache**
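A minimal sketch of that failure mode, assuming `withCacheLock()` follows the common Redis `SET NX` lock-with-timeout pattern; this illustrates the mechanism and is not the actual implementation:

```typescript
// Illustration only; not the real withCacheLock() implementation.
import Redis from 'ioredis'

const redis = new Redis()

async function withCacheLockSketch<T>(
  key: string,
  fn: () => Promise<T>,
  timeoutMs = 5_000,
): Promise<T> {
  const deadline = Date.now() + timeoutMs
  // Spin on SET NX until we own the lock or the timeout expires
  while (!(await redis.set(`lock:${key}`, '1', 'PX', timeoutMs, 'NX'))) {
    if (Date.now() > deadline) {
      // This is the path Job B hits while Job A still holds the lock
      throw new Error(`Failed to acquire lock for ${key}`)
    }
    await new Promise((resolve) => setTimeout(resolve, 100))
  }
  try {
    return await fn()
  } finally {
    await redis.del(`lock:${key}`)
  }
}

// If endRun() throws here and the error is swallowed by the job's completion handling,
// BullMQ still removes the job, but the run's entry in the active-runs cache is never
// deleted. That orphaned entry is the "stuck run" this commit addresses.
```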
## Expected Metrics

With 40 parallel requests, you should see:

- ✅ **No race condition:** All runs complete, no stuck entries
- ⚠️ **Race condition occurs:** 1-5 runs stuck in cache (5-15%)
- 🔥 **Severe contention:** 10+ runs stuck (>25%)

The higher the parallelism, the higher the chance of lock contention.

## Clean Up

Remove stuck runs manually:

```bash
# Delete specific cache key
redis-cli DEL "latitude:runs:active:1:50"

# Or flush all (⚠️ nuclear option)
redis-cli FLUSHDB
```
