| 1 | +# Stress Test Scripts - Race Condition Reproduction |
| 2 | + |
| 3 | +These scripts help reproduce and diagnose a race condition in which runs remain in the Redis cache after their BullMQ jobs have completed. |
| 4 | + |
| 5 | +## Setup |
| 6 | + |
| 7 | +Make sure your services are running: |
| 8 | +```bash |
| 9 | +# Terminal 1: Start Redis |
| 10 | +docker-compose up redis |
| 11 | + |
| 12 | +# Terminal 2: Start services |
| 13 | +pnpm dev |
| 14 | + |
| 15 | +# Terminal 3: Start workers |
| 16 | +cd apps/workers && pnpm dev |
| 17 | +``` |
| 18 | + |
| 19 | +## Scripts |
| 20 | + |
| 21 | +### 1. `stress-test-runs.ts` - Trigger race condition |
| 22 | + |
| 23 | +Makes parallel API requests to stress test the run queue and provoke lock contention. |
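
As a rough illustration of the "parallel requests" idea, the sketch below starts a batch of requests at the same time and counts how many succeed. It is not the contents of `stress-test-runs.ts`; `runParallel`, `BatchResult`, and the `sendRequest` callback are hypothetical names, and the callback stands in for whatever HTTP call the real script makes.

```typescript
// Outcome counts for one batch of parallel requests.
interface BatchResult {
  succeeded: number
  failed: number
}

// Start `count` requests at the same time and wait for all of them to finish.
// `sendRequest` is a hypothetical callback; in a real script it would wrap the
// HTTP call that starts a run.
async function runParallel(
  count: number,
  sendRequest: (index: number) => Promise<unknown>,
): Promise<BatchResult> {
  // Map every request to true (resolved) or false (rejected) so that one
  // failure does not abort the rest of the batch.
  const outcomes = await Promise.all(
    Array.from({ length: count }, (_, i) =>
      sendRequest(i).then(
        () => true,
        () => false,
      ),
    ),
  )
  const succeeded = outcomes.filter((ok) => ok).length
  return { succeeded, failed: count - succeeded }
}
```

Starting all requests before awaiting any of them is what creates the contention: the runs reach the queue, and the shared lock, at nearly the same moment.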
| 24 | + |
| 25 | +**Usage (from repo root):** |
| 26 | +```bash |
| 27 | +# Start with 2 requests |
| 28 | +pnpm stress-test 2 |
| 29 | + |
| 30 | +# Scale up |
| 31 | +pnpm stress-test 10 |
| 32 | +pnpm stress-test 40 |
| 33 | +pnpm stress-test 100 |
| 34 | +``` |
| 35 | + |
| 36 | +**Or run directly:** |
| 37 | +```bash |
| 38 | +pnpm --filter @latitude-data/core stress-test 40 |
| 39 | +tsx packages/core/scripts/stress-test-runs.ts 40 |
| 40 | +``` |
| 41 | + |
| 42 | +**Configuration:** |
| 43 | + |
| 44 | +Edit `packages/core/scripts/stress-test-runs.ts` to change: |
| 45 | +- `API_URL` - Your API endpoint |
| 46 | +- `API_KEY` - Your API key |
| 47 | +- `DOCUMENT_PATH` - Document to run |
| 48 | +- `PARAMETERS` - Parameters for the run |
| 49 | + |
| 50 | +### 2. `check-stuck-runs.ts` - Verify race condition |
| 51 | + |
| 52 | +Checks Redis for active runs that don't have corresponding BullMQ jobs. |
| 53 | + |
| 54 | +**Usage (from repo root):** |
| 55 | +```bash |
| 56 | +# Run once |
| 57 | +pnpm check-stuck-runs |
| 58 | + |
| 59 | +# Continuous monitoring (updates every 2 seconds) |
| 60 | +pnpm watch-stuck-runs |
| 61 | + |
| 62 | +# Or run directly |
| 63 | +pnpm --filter @latitude-data/core check-stuck-runs |
| 64 | +pnpm --filter @latitude-data/core watch-stuck-runs |
| 65 | +tsx packages/core/scripts/check-stuck-runs.ts --watch |
| 66 | +``` |
| 67 | + |
| 68 | +**What it checks:** |
| 69 | +- Reads `runs:active:*` keys from Redis |
| 70 | +- For each run UUID, checks whether a matching BullMQ job exists |
| 71 | +- Reports stuck runs (in cache but no job) |
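
The comparison described in the list above can be sketched as a plain set difference. This is a minimal model, not the script's actual code: the `CachedRun` shape, the `findStuckRuns` name, and the assumption that a job can be looked up by the run UUID are all illustrative.

```typescript
// Minimal shape of a cached run. Field names are assumptions, not the real schema.
interface CachedRun {
  uuid: string
  queuedAt: string // ISO timestamp
}

// Return the cached runs that have no matching BullMQ job.
// `existingJobIds` stands in for whatever job lookup the real script performs;
// this sketch assumes jobs can be identified by the run UUID.
function findStuckRuns(
  cachedRuns: CachedRun[],
  existingJobIds: Set<string>,
): CachedRun[] {
  return cachedRuns.filter((run) => !existingJobIds.has(run.uuid))
}

// Example: run "b" is in the cache but has no job, so it is reported as stuck.
const stuck = findStuckRuns(
  [
    { uuid: 'a', queuedAt: '2024-01-01T12:00:00.000Z' },
    { uuid: 'b', queuedAt: '2024-01-01T12:00:01.000Z' },
  ],
  new Set(['a']),
)
```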
| 72 | + |
| 73 | +**Watch mode features:** |
| 74 | +- Live status line showing current state |
| 75 | +- Alerts when NEW stuck runs are detected |
| 76 | +- Tracks total stuck runs found |
| 77 | +- Press Ctrl+C to stop |
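
One way to alert only on new stuck runs while keeping a running total is to remember every UUID already reported. The sketch below assumes that approach; `StuckRunTracker` is a hypothetical name and the real watch mode may track state differently.

```typescript
// Remembers which stuck run UUIDs have already been reported.
class StuckRunTracker {
  private seen = new Set<string>()

  // Record the stuck UUIDs from one check and return only the ones that
  // have not been reported by an earlier check.
  record(stuckUuids: string[]): string[] {
    const fresh: string[] = []
    for (const uuid of stuckUuids) {
      if (!this.seen.has(uuid)) {
        this.seen.add(uuid)
        fresh.push(uuid)
      }
    }
    return fresh
  }

  // Number of distinct stuck runs seen since monitoring started.
  totalFound(): number {
    return this.seen.size
  }
}
```

Each check would pass its current list of stuck UUIDs to `record()`, print an alert for whatever comes back, and show `totalFound()` in the status line.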
| 78 | + |
| 79 | +## Reproduction Steps |
| 80 | + |
| 81 | +### Method 1: Watch Mode (Recommended) |
| 82 | + |
| 83 | +1. **Start watching for stuck runs** (in Terminal 1): |
| 84 | + ```bash |
| 85 | + pnpm watch-stuck-runs |
| 86 | + ``` |
| 87 | + You'll see: |
| 88 | + ``` |
| 89 | + 🔍 Starting continuous monitoring for stuck runs... |
| 90 | + Check interval: 2000ms |
| 91 | + Press Ctrl+C to stop |
| 92 | + |
| 93 | + ✅ [12:34:56] Check #1 | Cache keys: 0 | Total runs: 0 | Stuck: 0 | Total found: 0 |
| 94 | + ``` |
| 95 | + |
| 96 | +2. **Run stress test** (in Terminal 2): |
| 97 | + ```bash |
| 98 | + pnpm stress-test 40 |
| 99 | + ``` |
| 100 | + |
| 101 | +3. **Watch Terminal 1** - You'll see alerts when stuck runs appear: |
| 102 | + ``` |
| 103 | + ⚠️ [12:35:02] Check #8 | Cache keys: 1 | Total runs: 12 | Stuck: 3 | Total found: 3 |
| 104 | +
|
| 105 | + 🚨 NEW STUCK RUN DETECTED! |
| 106 | + UUID: abc-123-def-456 |
| 107 | + Queued: 2024-01-01T12:35:00.000Z (2s ago) |
| 108 | + Started: Not started |
| 109 | + ``` |
| 110 | + |
| 111 | +4. **Press Ctrl+C** to stop watching |
| 112 | + |
| 113 | +### Method 2: Manual Check |
| 114 | + |
| 115 | +1. **Run stress test:** |
| 116 | + ```bash |
| 117 | + pnpm stress-test 40 |
| 118 | + ``` |
| 119 | + |
| 120 | +2. **Wait 5-10 seconds** for jobs to complete |
| 121 | + |
| 122 | +3. **Check for stuck runs:** |
| 123 | + ```bash |
| 124 | + pnpm check-stuck-runs |
| 125 | + ``` |
| 126 | + |
| 127 | +4. **Expected result if race condition occurs:** |
| 128 | + ``` |
| 129 | + ⚠️ STUCK RUN FOUND: |
| 130 | + UUID: abc-123-def |
| 131 | + Queued: 2024-01-01T12:00:00.000Z (30s ago) |
| 132 | + Job exists in BullMQ: NO ❌ |
| 133 | + |
| 134 | + 🎯 RACE CONDITION CONFIRMED! |
| 135 | + ``` |
| 136 | + |
| 137 | +### Also check the UI: |
| 138 | +- Go to http://localhost:3000/projects/50 |
| 139 | +- Look for runs stuck "in progress" (spinning icon) |
| 140 | +- Try clicking them → should see "Active run job not found" error |
| 141 | + |
| 142 | +## Alternative Manual Checks |
| 143 | + |
| 144 | +### Check Redis directly: |
| 145 | +```bash |
| 146 | +# List all active run cache keys |
| 147 | +redis-cli --scan --pattern "latitude:runs:active:*" |
| 148 | + |
| 149 | +# Get contents of a key |
| 150 | +redis-cli GET "latitude:runs:active:1:50" |
| 151 | +``` |
| 152 | + |
| 153 | +### Check BullMQ admin: |
| 154 | +- Open http://localhost:3000/admin/queues/runsQueue |
| 155 | +- Look for completed/failed jobs |
| 156 | +- Compare with stuck runs in cache |
| 157 | + |
| 158 | +### Check logs: |
| 159 | +```bash |
| 160 | +# Worker logs |
| 161 | +cd apps/workers && pnpm logs |
| 162 | + |
| 163 | +# Look for: |
| 164 | +# - "Failed to acquire lock" errors |
| 165 | +# - endRun() failures |
| 166 | +``` |
| 167 | + |
| 168 | +## What Causes the Race Condition? |
| 169 | + |
| 170 | +**Lock Contention:** |
| 171 | +- Multiple runs for the same workspace/project share one Redis lock |
| 172 | +- Lock key: `lock:runs:active:{workspaceId}:{projectId}` |
| 173 | +- Lock timeout: 5 seconds (default in `withCacheLock()`) |
| 174 | + |
| 175 | +**Failure Scenario:** |
| 176 | +1. Job A updates run → holds lock |
| 177 | +2. Job B completes → tries `endRun()` → waits for lock |
| 178 | +3. After 5s → `withCacheLock()` throws timeout |
| 179 | +4. Job B marked complete, removed from BullMQ |
| 180 | +5. **Run B stuck in Redis cache** ❌ |
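
The failure scenario above can be modeled with virtual time instead of a real Redis lock. This is a simplified sketch: `LockRequest` and `findTimedOutRuns` are hypothetical names, and serving waiters in arrival order is an assumption of the model (a real Redis lock usually retries by polling, so the order is not guaranteed).

```typescript
// Hypothetical model: a job asks for the shared lock at `requestAt` and,
// once it has the lock, holds it for `holdMs`. Times are virtual milliseconds.
interface LockRequest {
  runUuid: string
  requestAt: number
  holdMs: number
}

// Return the UUIDs whose wait for the lock exceeded `timeoutMs`. In the
// scenario above these are the runs whose endRun() would throw, leaving the
// run in the cache after its job is gone.
function findTimedOutRuns(requests: LockRequest[], timeoutMs = 5000): string[] {
  const timedOut: string[] = []
  let lockFreeAt = 0
  // Serve requests in arrival order (an assumption of this model).
  const ordered = [...requests].sort((a, b) => a.requestAt - b.requestAt)
  for (const req of ordered) {
    const acquiredAt = Math.max(req.requestAt, lockFreeAt)
    if (acquiredAt - req.requestAt > timeoutMs) {
      timedOut.push(req.runUuid) // gave up waiting, so the lock is never taken
      continue
    }
    lockFreeAt = acquiredAt + req.holdMs
  }
  return timedOut
}

// Job A holds the lock for 6s; Job B arrives 0.5s later and would wait 5.5s,
// which is longer than the 5s timeout.
const timedOutRuns = findTimedOutRuns([
  { runUuid: 'A', requestAt: 0, holdMs: 6000 },
  { runUuid: 'B', requestAt: 500, holdMs: 10 },
])
```

In this model a run only gets stuck when the lock is held, or queued for, longer than the timeout, which is why higher parallelism on a single workspace/project lock makes the problem more likely.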
| 181 | + |
| 182 | +## Expected Metrics |
| 183 | + |
| 184 | +With 40 parallel requests, you should see: |
| 185 | +- ✅ **No race condition:** All runs complete, no stuck entries |
| 186 | +- ⚠️ **Race condition occurs:** 1-5 runs stuck in cache (roughly 3-13% of 40) |
| 187 | +- 🔥 **Severe contention:** 10+ runs stuck (25% or more) |
| 188 | + |
| 189 | +The higher the parallelism, the higher the chance of lock contention. |
| 190 | + |
| 191 | +## Clean Up |
| 192 | + |
| 193 | +Remove stuck runs manually: |
| 194 | +```bash |
| 195 | +# Delete specific cache key |
| 196 | +redis-cli DEL "latitude:runs:active:1:50" |
| 197 | + |
| 198 | +# Or flush all (⚠️ nuclear option: deletes every key in the current Redis database, not just run caches) |
| 199 | +redis-cli FLUSHDB |
| 200 | +``` |
| 201 | + |