Move Watcher Scan/Reconcile To Worker Thread And Fix Startup/Telemetry Races #64
Conversation
Keep heartbeat emission responsive during long filesystem scans by moving full-scan and reconcile work off the main watcher event loop.

- Add a dedicated scan worker for full-scan/reconcile tasks
- Stream worker telemetry back to the main process for heartbeat counters
- Run initial and periodic reconciliation through the worker with an overlap guard
- Keep the chokidar incremental ingestion path in the main process unchanged

Validation:

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`
Address two reliability issues in the async scan worker integration.

- Prevent the config-reload heartbeat path from starting a watcher before bootstrap watcher init
- Guard bootstrap watcher startup so it does not create a second instance
- Reconcile streamed worker telemetry with final worker totals to keep unstreamed error counts
- Emit reconcile catch errors as telemetry deltas in the scan worker

Validation:

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`
- Watcher heartbeat remained active after a docker restart
Review: Move Watcher Scan/Reconcile To Worker Thread
The overall approach is solid — offloading heavy scan/reconcile work to a worker thread is the right fix for heartbeat responsiveness, and the two-phase startup guard for the duplicate watcher issue is clean. A few things to address before merge.
Should fix
1. No tests for scan-worker.js
This is the most significant gap. `scan-worker.js` is 520 lines of new business logic with zero test coverage. The telemetry streaming reconciliation in `runScanTask` is particularly tricky — the `streamedTotals` subtraction logic:

```js
errorsCount: Math.max(0, (finalTelemetry.errorsCount || 0) - streamedTotals.errorsCount),
```

...has edge cases that are easy to get wrong (e.g. the worker exits with 0 but never emits a result, or `streamedTelemetry` is true but `finalTelemetry` is null). The existing test pattern (`src/*.test.js`) won't pick up `src/watcher/scan-worker.test.js` without updating the glob too.
Worth testing at minimum:

- `mergeScanTelemetry` accumulation
- Streamed vs. final telemetry reconciliation in `runScanTask`
- `shouldExclude` pattern matching
- The `activeWatcher` bootstrap guard (config change during bootstrap)
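To make the trickiest of those concrete, here is a sketch of the streamed-vs-final reconciliation and its edge cases. The helper name and telemetry shape are assumptions for illustration, not the PR's exact code:

```javascript
// Merge only the portion of the worker's final totals that was never
// streamed, clamping at zero so a missing or smaller final result can't
// drive the heartbeat counters negative.
function reconcileFinalTelemetry(finalTelemetry, streamedTotals) {
  // Worker exited without ever emitting a final result: nothing to add.
  if (!finalTelemetry) return { errorsCount: 0, ingestedCount: 0 };
  return {
    errorsCount: Math.max(
      0,
      (finalTelemetry.errorsCount || 0) - streamedTotals.errorsCount
    ),
    ingestedCount: Math.max(
      0,
      (finalTelemetry.ingestedCount || 0) - streamedTotals.ingestedCount
    ),
  };
}

// Normal case: 5 total errors, 2 already streamed -> 3 left to merge.
console.log(
  reconcileFinalTelemetry(
    { errorsCount: 5, ingestedCount: 10 },
    { errorsCount: 2, ingestedCount: 4 }
  )
);
// Edge case: streamed more than the final total reports -> clamped to 0.
console.log(
  reconcileFinalTelemetry(
    { errorsCount: 2, ingestedCount: 0 },
    { errorsCount: 3, ingestedCount: 0 }
  )
);
```

Each of these cases is a one-line assertion in a test file, which is why this logic is worth covering first.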
2. nextReconcileTimestamp not updated when periodic reconcile is skipped
When periodicReconcile bails early due to scanTaskInFlight:
```js
if (scanTaskInFlight) {
  console.log(`${logPrefix} Skipping periodic reconciliation (scan already in progress)`);
  return; // nextReconcileTimestamp stays in the past
}
```

`nextReconcileTimestamp` is only updated in the `finally` block, so the dashboard will display a stale timestamp until the next reconcile actually completes. Suggest updating it at the top of `periodicReconcile` unconditionally:

```js
nextReconcileTimestamp = new Date(Date.now() + getReconcileIntervalMs()).toISOString();
if (scanTaskInFlight) { ... return; }
```

Minor / informational
3. Asset parse errors silently swallowed without error telemetry
In both `runFullScan` and `runReconcile`, when `parseAsset` throws, the error is silently dropped:

```js
try { assets.push(parseAsset(f.path, project)); } catch {}
```

No `recordIngest({ errors: 1 })` call, so the error counter stays clean even when assets fail to parse. That makes diagnosing broken `.uasset` files harder. Compare with the source file path, which does log a warning.
4. shouldExclude regex lacks anchoring (pre-existing, but now duplicated)
```js
const regex = new RegExp(pattern.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*'));
```

This was already in `watcher-client.js` and is now copied into the worker. The regex is unanchored — a pattern like `Intermediate` would match `NotIntermediate/Foo`. Also, non-`*` special regex characters in patterns (`.`, `+`, `(`, etc.) aren't escaped. Not new to this PR, but duplicating the bug is a good moment to fix it in both places with something like:

```js
const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
const regex = new RegExp('^' + escaped.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*'));
```

What's working well
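A quick check of how the anchored/escaped variant behaves. The helper name is hypothetical; it just wraps the two suggested lines:

```javascript
// Compile one exclude pattern with regex-literal escaping and a leading
// anchor, as suggested above.
function compileExcludePattern(pattern) {
  const escaped = pattern.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  return new RegExp(
    '^' + escaped.replace(/\*\*/g, '.*').replace(/\*/g, '[^/]*')
  );
}

// Anchoring: `Intermediate` no longer matches `NotIntermediate/Foo`.
console.log(compileExcludePattern('Intermediate').test('NotIntermediate/Foo'));
// Escaping: a literal dot in the pattern no longer matches any character.
console.log(compileExcludePattern('a.b').test('axb'));
// Globs still work: `Saved/**` matches everything under Saved/.
console.log(compileExcludePattern('Saved/**').test('Saved/Logs/run.log'));
```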
- The `scanTaskInFlight` guard in `periodicReconcile` correctly prevents overlapping reconciles.
- The `activeWatcher = null` initialization before heartbeat start cleanly prevents the config-reload path from creating a watcher during bootstrap.
- Per-project error handling inside `runReconcileProjects` matches the pre-PR per-project try/catch behavior — no regression in resilience.
- The streamed-vs-final telemetry reconciliation pattern is a good design for keeping heartbeat counters current.
Move watcher scan/reconcile edge-case handling into testable helpers and add watcher worker coverage.

- Add watcher tests for telemetry reconciliation, exclude matching, and the bootstrap reload guard
- Update periodic reconcile scheduling so nextRunAt stays fresh when a reconcile is skipped
- Surface asset parse errors in telemetry/logs instead of swallowing them
- Precompile exclude patterns per config/worker to avoid per-path regex recompilation
- Include src/watcher test files in npm test
Review (updated — commit ab17e13)
The follow-up commit addresses all four items from the previous review well. Here's the updated status:
Previous feedback — all resolved ✅
| Item | Status |
|---|---|
| No tests for `scan-worker.js` | ✅ Fixed — `scan-worker.test.js` added with telemetry, exclude, and bootstrap-guard coverage; `package.json` glob updated |
| `nextReconcileTimestamp` stale when periodic reconcile is skipped | ✅ Fixed — timestamp now set unconditionally at the top of `periodicReconcile` |
| Asset parse errors silently swallowed | ✅ Fixed — `recordIngest({ errors: 1 })` + `warn()` in both worker and watcher-client |
| Unanchored / unescaped exclude regex | ✅ Fixed — `exclude-patterns.js` with proper `escapeRegexLiteral` and `(^ \|` anchoring |
Remaining issue — should fix
Initial scan worker calls are not caught; a worker crash kills the watcher process
`periodicReconcile` correctly catches worker failures:
```js
try {
  mergeScanTelemetry(await runScanTask({ kind: 'reconcile' }));
  lastReconcileTimestamp = new Date().toISOString();
} catch (err) {
  console.error(`${logPrefix} Periodic reconcile failed: ${err.message}`);
  totalErrors++;
} finally {
  scanTaskInFlight = false;
}
```
But the initial full-scan and bootstrap reconcile have only a `finally`:
```js
// emptyLanguages path
try {
  mergeScanTelemetry(await runScanTask({ kind: 'full-scan', languages: emptyLanguages }));
} finally {
  scanTaskInFlight = false; // exception still propagates
}

// populatedLanguages path
try {
  ...
  mergeScanTelemetry(await runScanTask({ kind: 'reconcile', projectNames }));
} finally {
  scanTaskInFlight = false; // exception still propagates
}

if (!activeWatcher) { // ← never reached if either task above threw
  activeWatcher = startWatcher();
}
```

If either worker exits non-zero (e.g. a fatal OOM, a missing serviceUrl check, or a hard network failure after all retries), the exception propagates through `main()` to `main().catch()`, which calls `process.exit(1)`. The watcher process dies and heartbeats stop — a silent failure from the dashboard's perspective until the container is restarted.
The pre-PR reconcile loop had per-project try-catch, so any project-level failure still let the watcher start. The new code has per-project try-catch inside the worker (good), but the worker-level failure mode is new and unhandled in main().
Suggested fix:
```js
if (emptyLanguages.length > 0) {
  scanTaskInFlight = true;
  try {
    mergeScanTelemetry(await runScanTask({ kind: 'full-scan', languages: emptyLanguages }));
  } catch (err) {
    console.error(`${logPrefix} Full scan failed: ${err.message}`);
    totalErrors++;
  } finally {
    scanTaskInFlight = false;
  }
}

if (populatedLanguages.length > 0) {
  scanTaskInFlight = true;
  try {
    ...
    mergeScanTelemetry(await runScanTask({ kind: 'reconcile', projectNames }));
    lastReconcileTimestamp = new Date().toISOString();
  } catch (err) {
    console.error(`${logPrefix} Bootstrap reconcile failed: ${err.message}`);
    totalErrors++;
  } finally {
    scanTaskInFlight = false;
  }
}
```

Minor / informational
Source file parse errors in readAndParseBatch are not counted in telemetry
```js
} catch (err) {
  warn(`Error parsing ${f.path}: ${err.message}`);
  return null; // ← no recordIngest({ errors: 1 })
}
```
Asset parse errors in the same file correctly call `recordIngest({ errors: 1 })`. Source file parse errors are logged but don't increment `errorsCount`, so the telemetry will under-report parse failures in `.as`/`.h`/`.cpp` files. Not a correctness bug, but it makes diagnosing parser regressions harder.
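A sketch of the parity fix, with the worker's helpers stubbed out so it runs standalone (`parseSourceFile`, `warn`, and `recordIngest` stand in for the real ones):

```javascript
// Count source parse failures the same way asset parse failures are
// counted, instead of only logging them.
function readOne(file, { parseSourceFile, warn, recordIngest }) {
  try {
    return parseSourceFile(file.path);
  } catch (err) {
    warn(`Error parsing ${file.path}: ${err.message}`);
    recordIngest({ errors: 1 }); // previously missing: errorsCount stayed flat
    return null;
  }
}

// A throwing parser should now increment the error counter.
let errors = 0;
const parsed = readOne({ path: 'Broken.cpp' }, {
  parseSourceFile: () => { throw new Error('bad token'); },
  warn: () => {},
  recordIngest: (delta) => { errors += delta.errors; },
});
console.log(parsed, errors);
```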
The architecture of this PR is sound — worker isolation, streamed telemetry reconciliation, and the bootstrap guard are all implemented cleanly. Just the missing catch before merge.
Catch initial full-scan/bootstrap reconcile worker failures so watcher startup continues and heartbeats remain alive. Also count source parse failures in worker telemetry for parity with asset parse errors.
Review: Move Watcher Scan/Reconcile To Worker Thread
This is a solid, well-motivated PR. The architecture is clean — moving CPU/IO-heavy work off the main event loop while keeping heartbeats responsive is exactly the right fix, and the telemetry reconciliation logic handles the streamed-vs-final accounting correctly.
CI is green. The core logic in exclude-patterns.js and scan-telemetry.js is well-tested. A few things worth noting for follow-up:
Nit: shouldExclude dual-mode in scan-worker.js
scan-worker.js:179-184 exports shouldExclude with a runtime type-sniff to accept either string arrays (test path) or compiled regex arrays (production path):
```js
const matchers = Array.isArray(excludeMatchers) && typeof excludeMatchers[0] === 'string'
  ? compileExcludePatterns(excludeMatchers)
  : excludeMatchers;
```

This recompiles patterns on every test call, and the check is fragile (it fails silently on an empty array, though that case is harmless). Since the test actually only needs the `shouldExcludePath` + `compileExcludePatterns` helpers from `exclude-patterns.js`, a cleaner approach would be to import and test those directly rather than going through `scan-worker.js`. Not a blocker, but worth cleaning up.
Minor: waitForService() has no escape hatch
scan-worker.js:240-251 loops forever waiting for the service. If the service is permanently gone while a scan is in flight, the worker hangs indefinitely and scanTaskInFlight stays true, blocking all future periodic reconciles. This is likely intentional (we want to wait out a Docker restart), but a max-wait or configurable timeout would make the behavior more predictable and easier to reason about in operational runbooks.
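One possible bounded variant. This is a sketch only; the probe function, interval, and ceiling are assumptions, not the worker's actual code:

```javascript
// Poll the service until it responds or a deadline passes; return false
// instead of looping forever so the caller can fail the scan task and
// release scanTaskInFlight.
async function waitForService(
  isServiceUp,
  { intervalMs = 2000, maxWaitMs = 10 * 60 * 1000 } = {}
) {
  const deadline = Date.now() + maxWaitMs;
  for (;;) {
    if (await isServiceUp()) return true;
    // Give up once the next retry would pass the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

A generous default still rides out a Docker restart, but the failure mode becomes a logged scan error rather than a permanently wedged `scanTaskInFlight`.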
Minor: Worker is not cancellable on watcher restart
When a config reload triggers activeWatcher.close() + startWatcher(), any in-flight worker keeps running against the old config and continues posting telemetry. This is harmless for correctness (the service handles concurrent ingests), but the redundant work and mixed-config telemetry could be confusing. A worker.terminate() call (with a guard on done) would be a cleaner model.
Minor: nextReconcileTimestamp set twice per periodicReconcile run
Set at the top of the function (useful for the skip path) and again in the finally block (correct next-run time after completion). The double-set is harmless but slightly confusing to read.
Overall, the PR achieves its goals cleanly. The startup race fix, telemetry accuracy improvements, and exclude-pattern precompilation are all good improvements on top of the core worker thread work. Approving.
PR Title
Move Watcher Scan/Reconcile To Worker Thread And Fix Startup/Telemetry Races
PR Description
What changed

- Moves heavy scan work (`full-scan` and `reconcile`) into a dedicated worker thread.

Why

Long synchronous scan/reconcile work could block watcher heartbeat timing and produce unstable watcher status.

Additionally, two reliability bugs were found after the worker integration:

- The config-reload heartbeat path could start a watcher before bootstrap watcher init, creating a duplicate instance.
- Streamed worker telemetry could drift from final worker totals, dropping unstreamed error counts.

This PR addresses both correctness and operational stability.

Key files

- `src/watcher/watcher-client.js`
- `src/watcher/scan-worker.js` (new)

Validation

- `node --check src/watcher/watcher-client.js`
- `node --check src/watcher/scan-worker.js`