Move Watcher Scan/Reconcile To Worker Thread And Fix Startup/Telemetry Races #64
Conversation
Keep heartbeat emission responsive during long filesystem scans by moving full-scan and reconcile work off the main watcher event loop.

- Add a dedicated scan worker for full-scan/reconcile tasks
- Stream worker telemetry back to the main process for heartbeat counters
- Run initial and periodic reconciliation through the worker with an overlap guard
- Keep the chokidar incremental ingestion path in the main process unchanged

Validation:

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`
Address two reliability issues in the async scan worker integration.

- Prevent the config-reload heartbeat path from starting a watcher before bootstrap watcher init
- Guard bootstrap watcher startup so it does not create a second instance
- Reconcile streamed worker telemetry with final worker totals to keep unstreamed error counts
- Emit reconcile catch errors as telemetry deltas in the scan worker

Validation:

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`
- Watcher heartbeat remained active after a docker restart
Review: Move Watcher Scan/Reconcile To Worker Thread
The overall approach is solid — offloading heavy scan/reconcile work to a worker thread is the right fix for heartbeat responsiveness, and the two-phase startup guard for the duplicate watcher issue is clean. A few things to address before merge.
Should fix
1. No tests for scan-worker.js
This is the most significant gap. `scan-worker.js` is 520 lines of new business logic with zero test coverage. The telemetry streaming reconciliation in `runScanTask` is particularly tricky — the `streamedTotals` subtraction logic:

```js
errorsCount: Math.max(0, (finalTelemetry.errorsCount || 0) - streamedTotals.errorsCount),
```

...has edge cases that are easy to get wrong (e.g. the worker exits with 0 but never emits a result, or `streamedTelemetry` is true but `finalTelemetry` is null). The existing test pattern (`src/*.test.js`) won't pick up `src/watcher/scan-worker.test.js` without updating the glob too.
Worth testing at minimum:

- `mergeScanTelemetry` accumulation
- Streamed vs. final telemetry reconciliation in `runScanTask`
- `shouldExclude` pattern matching
- The `activeWatcher` bootstrap guard (config change during bootstrap)
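To make the trickiest of those concrete, here is a sketch of the streamed-vs-final reconciliation and its edge cases. The helper name and telemetry shape are assumptions for illustration, not the PR's exact code:

```javascript
// Merge only the portion of the worker's final totals that was never
// streamed, clamping at zero so a missing or smaller final result can't
// drive the heartbeat counters negative.
function reconcileFinalTelemetry(finalTelemetry, streamedTotals) {
  // Worker exited without ever emitting a final result: nothing to add.
  if (!finalTelemetry) return { errorsCount: 0, ingestedCount: 0 };
  return {
    errorsCount: Math.max(
      0,
      (finalTelemetry.errorsCount || 0) - streamedTotals.errorsCount
    ),
    ingestedCount: Math.max(
      0,
      (finalTelemetry.ingestedCount || 0) - streamedTotals.ingestedCount
    ),
  };
}

// Normal case: 5 total errors, 2 already streamed -> 3 left to merge.
console.log(
  reconcileFinalTelemetry(
    { errorsCount: 5, ingestedCount: 10 },
    { errorsCount: 2, ingestedCount: 4 }
  )
);
// Edge case: streamed more than the final total reports -> clamped to 0.
console.log(
  reconcileFinalTelemetry(
    { errorsCount: 2, ingestedCount: 0 },
    { errorsCount: 3, ingestedCount: 0 }
  )
);
```

Each of these cases is a one-line assertion in a test file, which is why this logic is worth covering first.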
2. nextReconcileTimestamp not updated when periodic reconcile is skipped
When periodicReconcile bails early due to scanTaskInFlight:
```js
if (scanTaskInFlight) {
  console.log(`${logPrefix} Skipping periodic reconciliation (scan already in progress)`);
  return; // nextReconcileTimestamp stays in the past
}
```

`nextReconcileTimestamp` is only updated in the `finally` block, so the dashboard will display a stale timestamp until the next reconcile actually completes. Suggest updating it at the top of `periodicReconcile` unconditionally:

```js
nextReconcileTimestamp = new Date(Date.now() + getReconcileIntervalMs()).toISOString();
if (scanTaskInFlight) { ... return; }
```

Minor / informational
3. Asset parse errors silently swallowed without error telemetry
In both `runFullScan` and `runReconcile`, when `parseAsset` throws, the error is silently dropped:

```js
try { assets.push(parseAsset(f.path, project)); } catch {}
```

No `recordIngest({ errors: 1 })` call, so the error counter stays clean even when assets fail to parse. That makes diagnosing broken `.uasset` files harder. Compare with the source file path, which does log a warning.
4. shouldExclude regex lacks anchoring (pre-existing, but now duplicated)
```js
const regex = new RegExp(pattern.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*'));
```

This was already in `watcher-client.js` and is now copied into the worker. The regex is unanchored — a pattern like `Intermediate` would match `NotIntermediate/Foo`. Also, non-`*` special regex characters in patterns (`.`, `+`, `(`, etc.) aren't escaped. Not new to this PR, but duplicating the bug is a good moment to fix it in both places with something like:

```js
const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
const regex = new RegExp('^' + escaped.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*'));
```

What's working well
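A quick check of how the anchored/escaped variant behaves. The helper name is hypothetical; it just wraps the two suggested lines:

```javascript
// Compile one exclude pattern with regex-literal escaping and a leading
// anchor, as suggested above.
function compileExcludePattern(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp(
    '^' + escaped.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*')
  );
}

// Anchoring: `Intermediate` no longer matches `NotIntermediate/Foo`.
console.log(compileExcludePattern('Intermediate').test('NotIntermediate/Foo'));
// Escaping: a literal dot in the pattern no longer matches any character.
console.log(compileExcludePattern('a.b').test('axb'));
// Globs still work: `Saved/**` matches everything under Saved/.
console.log(compileExcludePattern('Saved/**').test('Saved/Logs/run.log'));
```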
- The `scanTaskInFlight` guard in `periodicReconcile` correctly prevents overlapping reconciles.
- The `activeWatcher = null` initialization before heartbeat start cleanly prevents the config-reload path from creating a watcher during bootstrap.
- Per-project error handling inside `runReconcileProjects` matches the pre-PR per-project try/catch behavior — no regression in resilience.
- The streamed-vs-final telemetry reconciliation pattern is a good design for keeping heartbeat counters current.
Move watcher scan/reconcile edge-case handling into testable helpers and add watcher worker coverage.

- Add watcher tests for telemetry reconciliation, exclude matching, and the bootstrap reload guard
- Update periodic reconcile scheduling so nextRunAt stays fresh when a reconcile is skipped
- Surface asset parse errors in telemetry/logs instead of swallowing them
- Precompile exclude patterns per config/worker to avoid per-path regex recompilation
- Include src/watcher test files in npm test
Review (updated — commit ab17e13)
The follow-up commit addresses all four items from the previous review well. Here's the updated status:
Previous feedback — all resolved ✅
| Item | Status |
|---|---|
| No tests for `scan-worker.js` | ✅ Fixed — `scan-worker.test.js` added with telemetry, exclude, and bootstrap-guard coverage; `package.json` glob updated |
| `nextReconcileTimestamp` stale when periodic reconcile is skipped | ✅ Fixed — timestamp now set unconditionally at the top of `periodicReconcile` |
| Asset parse errors silently swallowed | ✅ Fixed — `recordIngest({ errors: 1 })` + `warn()` in both worker and watcher-client |
| Unanchored / unescaped exclude regex | ✅ Fixed — `exclude-patterns.js` with proper `escapeRegexLiteral` and `(^ \|` anchoring |
Remaining issue — should fix
Initial scan worker calls are not caught; a worker crash kills the watcher process
`periodicReconcile` correctly catches worker failures:
```js
try {
  mergeScanTelemetry(await runScanTask({ kind: 'reconcile' }));
  lastReconcileTimestamp = new Date().toISOString();
} catch (err) {
  console.error(`${logPrefix} Periodic reconcile failed: ${err.message}`);
  totalErrors++;
} finally {
  scanTaskInFlight = false;
}
```
But the initial full-scan and bootstrap reconcile have only a `finally`:
```js
// emptyLanguages path
try {
  mergeScanTelemetry(await runScanTask({ kind: 'full-scan', languages: emptyLanguages }));
} finally {
  scanTaskInFlight = false; // exception still propagates
}

// populatedLanguages path
try {
  ...
  mergeScanTelemetry(await runScanTask({ kind: 'reconcile', projectNames }));
} finally {
  scanTaskInFlight = false; // exception still propagates
}

if (!activeWatcher) { // ← never reached if either task above threw
  activeWatcher = startWatcher();
}
```

If either worker exits non-zero (e.g. a fatal OOM, a missing serviceUrl check, or a hard network failure after all retries), the exception propagates through `main()` to `main().catch()`, which calls `process.exit(1)`. The watcher process dies and heartbeats stop — a silent failure from the dashboard's perspective until the container is restarted.
The pre-PR reconcile loop had per-project try-catch, so any project-level failure still let the watcher start. The new code has per-project try-catch inside the worker (good), but the worker-level failure mode is new and unhandled in main().
Suggested fix:
```js
if (emptyLanguages.length > 0) {
  scanTaskInFlight = true;
  try {
    mergeScanTelemetry(await runScanTask({ kind: 'full-scan', languages: emptyLanguages }));
  } catch (err) {
    console.error(`${logPrefix} Full scan failed: ${err.message}`);
    totalErrors++;
  } finally {
    scanTaskInFlight = false;
  }
}

if (populatedLanguages.length > 0) {
  scanTaskInFlight = true;
  try {
    ...
    mergeScanTelemetry(await runScanTask({ kind: 'reconcile', projectNames }));
    lastReconcileTimestamp = new Date().toISOString();
  } catch (err) {
    console.error(`${logPrefix} Bootstrap reconcile failed: ${err.message}`);
    totalErrors++;
  } finally {
    scanTaskInFlight = false;
  }
}
```

Minor / informational
Source file parse errors in readAndParseBatch are not counted in telemetry
```js
} catch (err) {
  warn(`Error parsing ${f.path}: ${err.message}`);
  return null; // ← no recordIngest({ errors: 1 })
}
```
Asset parse errors in the same file correctly call `recordIngest({ errors: 1 })`. Source file parse errors are logged but don't increment `errorsCount`, so the telemetry will under-report parse failures in `.as`/`.h`/`.cpp` files. Not a correctness bug, but it makes diagnosing parser regressions harder.
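A sketch of the parity fix, with the worker's helpers stubbed out so it runs standalone (`parseSourceFile`, `warn`, and `recordIngest` stand in for the real ones):

```javascript
// Count source parse failures the same way asset parse failures are
// counted, instead of only logging them.
function readOne(file, { parseSourceFile, warn, recordIngest }) {
  try {
    return parseSourceFile(file.path);
  } catch (err) {
    warn(`Error parsing ${file.path}: ${err.message}`);
    recordIngest({ errors: 1 }); // previously missing: errorsCount stayed flat
    return null;
  }
}

// A throwing parser should now increment the error counter.
let errors = 0;
const parsed = readOne({ path: 'Broken.cpp' }, {
  parseSourceFile: () => { throw new Error('bad token'); },
  warn: () => {},
  recordIngest: (delta) => { errors += delta.errors; },
});
console.log(parsed, errors);
```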
The architecture of this PR is sound — worker isolation, streamed telemetry reconciliation, and the bootstrap guard are all implemented cleanly. Just the missing catch before merge.
Catch initial full-scan/bootstrap reconcile worker failures so watcher startup continues and heartbeats remain alive. Also count source parse failures in worker telemetry for parity with asset parse errors.
Review: Move Watcher Scan/Reconcile To Worker Thread
This is a solid, well-motivated PR. The architecture is clean — moving CPU/IO-heavy work off the main event loop while keeping heartbeats responsive is exactly the right fix, and the telemetry reconciliation logic handles the streamed-vs-final accounting correctly.
CI is green. The core logic in exclude-patterns.js and scan-telemetry.js is well-tested. A few things worth noting for follow-up:
Nit: shouldExclude dual-mode in scan-worker.js
scan-worker.js:179-184 exports shouldExclude with a runtime type-sniff to accept either string arrays (test path) or compiled regex arrays (production path):
```js
const matchers = Array.isArray(excludeMatchers) && typeof excludeMatchers[0] === 'string'
  ? compileExcludePatterns(excludeMatchers)
  : excludeMatchers;
```

This recompiles patterns on every test call, and the check is fragile (it fails silently on an empty array, though that case is harmless). Since the test actually only needs the `shouldExcludePath` + `compileExcludePatterns` helpers from `exclude-patterns.js`, a cleaner approach would be to import and test those directly rather than going through `scan-worker.js`. Not a blocker, but worth cleaning up.
Minor: waitForService() has no escape hatch
scan-worker.js:240-251 loops forever waiting for the service. If the service is permanently gone while a scan is in flight, the worker hangs indefinitely and scanTaskInFlight stays true, blocking all future periodic reconciles. This is likely intentional (we want to wait out a Docker restart), but a max-wait or configurable timeout would make the behavior more predictable and easier to reason about in operational runbooks.
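One possible bounded variant. This is a sketch only; the probe function, interval, and ceiling are assumptions, not the worker's actual code:

```javascript
// Poll the service until it responds or a deadline passes; return false
// instead of looping forever so the caller can fail the scan task and
// release scanTaskInFlight.
async function waitForService(
  isServiceUp,
  { intervalMs = 2000, maxWaitMs = 10 * 60 * 1000 } = {}
) {
  const deadline = Date.now() + maxWaitMs;
  for (;;) {
    if (await isServiceUp()) return true;
    // Give up once the next retry would pass the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A generous default still rides out a Docker restart, but the failure mode becomes a logged scan error rather than a permanently wedged `scanTaskInFlight`.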
Minor: Worker is not cancellable on watcher restart
When a config reload triggers activeWatcher.close() + startWatcher(), any in-flight worker keeps running against the old config and continues posting telemetry. This is harmless for correctness (the service handles concurrent ingests), but the redundant work and mixed-config telemetry could be confusing. A worker.terminate() call (with a guard on done) would be a cleaner model.
Minor: nextReconcileTimestamp set twice per periodicReconcile run
Set at the top of the function (useful for the skip path) and again in the finally block (correct next-run time after completion). The double-set is harmless but slightly confusing to read.
Overall, the PR achieves its goals cleanly. The startup race fix, telemetry accuracy improvements, and exclude-pattern precompilation are all good improvements on top of the core worker thread work. Approving.
PR Title
Move Watcher Scan/Reconcile To Worker Thread And Fix Startup/Telemetry Races
PR Description
What changed

- Moves heavy scan work (`full-scan` and `reconcile`) into a dedicated worker thread.

Why

Long synchronous scan/reconcile work could block watcher heartbeat timing and produce unstable watcher status.

Additionally, two reliability bugs were found after the worker integration:

- The config-reload heartbeat path could start a watcher before bootstrap watcher init, creating a duplicate instance.
- Streamed worker telemetry could drift from final worker totals, dropping unstreamed error counts.

This PR addresses both correctness and operational stability.

Key files

- `src/watcher/watcher-client.js`
- `src/watcher/scan-worker.js` (new)

Validation

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`