fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness#92
Conversation
…date loop liveness
snapshot-loop.mjs runs a forever-loop that does a screenshot, DOM dump,
URL fetch, then `await sleepMs(intervalMs)` between iterations. The
SIGTERM/SIGINT handlers flip a `stopping` flag that is only consulted
at the top of the while loop, so a SIGTERM that arrives during the
inter-iteration sleep waits the full interval before taking effect.
stop-capture.mjs sends SIGTERM and falls back to SIGKILL after roughly
three seconds. With short intervals (the default of 2 s) the race is
usually invisible, but with a longer `interval-seconds` (10, 30, ...)
the snapshot loop is still asleep when the SIGKILL arrives. The user
loses the last DOM/screenshot pair and any half-rendered files become
.partial leftovers that the parent then has to sweep on shutdown.
Three changes scoped to the same lifecycle concern:
1. Replace the standalone `sleepMs` with a `createStopSignal()` factory
that owns the SIGTERM/SIGINT subscription and exposes an abortable
`sleep(ms)`. The sleep promise resolves on either the timer firing
or `stopping` flipping, so the loop wakes immediately when the user
asks it to stop. `stopping` is also re-checked between major work
units (screenshot, DOM, URL, append) so an in-flight tick that just
missed the sleep wake-up still bails before starting more work.
2. Pass a `timeout` option (`O11Y_SNAPSHOT_TIMEOUT_MS`, default 30 s)
to every `spawnSync('browse', ...)` call. A hung browse CLI used to
block the loop indefinitely; with the timeout it skips the tick and
moves on, so SIGTERM still gets a chance to land within the configured
interval rather than the next CLI fault.
3. start-capture.mjs already verifies that `browse cdp` is alive after
the one-second sanity wait. The same check is now applied to the
snapshot-loop child. A syntax error or a missing dep in the loop
used to leave the run with CDP events but zero DOM/screenshots, and
the user only noticed after stop-capture; the new check kills CDP,
prints the loop's stderr, and exits non-zero.
Tests:
- New `node --test scripts/snapshot-loop.test.mjs` validates the stop
signal in isolation: stopping starts false, trigger flips it, sleep
wakes immediately when trigger fires mid-wait, sleep resolves
synchronously when trigger fired before the call, sleep without a
trigger honors its interval, repeated triggers are idempotent.
- Wired through `npm test` in `skills/browser-trace/package.json`.
- The CLI entry now runs only when the module is invoked directly, so
the test can import `createStopSignal` without booting the sampler.
Local verification on Node 24 / npm 10:
cd skills/browser-trace && npm test
✔ all 5 tests pass in ~190 ms.
Cursor Bugbot pointed out that the three intermediate `if (stop.stopping) break;` checks between the synchronous spawnSync calls in snapshot-loop's iteration body are unreachable in practice. They are right: Node only drains the signal callback queue between event-loop turns, the iteration body is fully synchronous between `await stop.sleep(intervalMs)` calls, so the SIGTERM/SIGINT handler cannot fire between two spawnSync calls. The flag can only flip during the awaited sleep, and the while-condition check at the top of the next iteration already catches that. Remove the misleading checks and replace them with a single comment that makes the contract explicit: SIGTERM responsiveness comes from the per-spawnSync timeout (each browse call returns within SNAPSHOT_TIMEOUT_MS, so the loop reaches the await within that bound) plus the abortable sleep (which wakes immediately on the signal). The previous intermediate guards gave a false impression of mid-iteration responsiveness that did not actually exist. No behavior change versus the previous commit on this branch: - Default 30 s spawnSync timeout still bounds each browse call. - Abortable sleep still wakes immediately on SIGTERM/SIGINT. - Worst-case lag from SIGTERM to loop exit is now bounded by SNAPSHOT_TIMEOUT_MS + remaining sync work in the current tick, which is the same bound the previous version actually had once you account for the dead checks. All 5 stop-signal unit tests continue to pass.
|
Thanks @cursor[bot]. The intermediate stop checks are unreachable in practice and the comment claiming "mid-iteration responsiveness" was misleading. Pushed You are right about the runtime: between two synchronous The fix's actual SIGTERM responsiveness comes from two real mechanisms:
Net change versus the previous commit on this branch: identical observable behavior, honest comment, and 12 fewer lines of code that suggested guarantees the runtime cannot actually provide. All 5 stop-signal unit tests continue to pass. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2a713a5. Configure here.
…ng .then per call
Cursor Bugbot pointed out that the previous `stopPromise.then(...)`
inside `sleep(ms)` registered a new reaction handler on every call. The
stop promise stays pending for the lifetime of the capture, so each
handler closure (capturing `timer` and `resolve`) was pinned on the
reaction list until SIGTERM finally fired. Over a 24-hour run at the
default 2 s interval that is roughly 43,000 closures, all live in
memory at once.
Replace the chained-.then approach with an explicit `pending` Set:
- `sleep(ms)` adds a `{ resolve, timer }` entry, then the timer's own
callback removes the entry on natural wake-up. The closure becomes
unreachable as soon as the sleep resolves.
- `trigger()` walks the set, clears each timer, calls each resolve,
and clears the set in one pass.
`stopPromise` is no longer needed at all and has been dropped along
with `resolveStop`. No public surface change.
Two new unit tests guard the fix:
- `completed sleeps deregister from the pending set so closures can
be GC'd` runs 200 back-to-back 1ms sleeps and asserts pending count
returns to 0 after each one. With the old code this assertion would
still hold for the closure ref-count by accident (the reaction list
is internal to the Promise), but the new test exercises the
explicit Set so future regressions are caught directly.
- `trigger drains every still-pending sleep without leaking` starts
three long sleeps, asserts pending count is 3, fires the trigger,
asserts it drops to 0, and awaits all three for completion.
Test-only `_pendingCount()` accessor added on the returned object;
documented as test-only alongside the existing `_trigger`.
All 7 tests now pass; no behavior change to the loop itself.
|
Thanks again. Real leak, fixed in You are right on the analysis: every Switched the implementation to an explicit
Two new unit tests guard the fix:
Verification: Net change vs the previous commit: +44 / -8 across the same two files. No behavior change to the snapshot loop itself. |
|
Hi, I just discovered this repo and was wondering how you came to this solution for the loop problem? |

Summary
browser-trace'ssnapshot-loop.mjsruns a forever-loop that takes a screenshot, dumps the DOM, fetches the URL, then waitsinterval-secondsbefore the next tick. The SIGTERM and SIGINT handlers flip astoppingflag that is only checked at the top of the while loop, so a SIGTERM that arrives during the inter-iteration sleep waits the full interval before the loop actually exits.stop-capture.mjssends SIGTERM and then falls back to SIGKILL after roughly three seconds. With the default 2 s interval the race is usually invisible, but with a longerinterval-seconds(10, 30, ...) the snapshot loop is still asleep when the SIGKILL arrives. The user loses the last DOM and screenshot pair, half-rendered files end up as.partialleftovers that the parent has to sweep, and the run'sindex.jsonlis missing its final entry.Two related lifecycle issues come along for the ride: the
spawnSync('browse', ...)calls inside the loop have no timeout, so a hung browse CLI can block the loop indefinitely; andstart-capture.mjsverifies that thebrowse cdpchild survived its one-second sanity wait but does not apply the same check to the snapshot loop, so a loop that fails to start (missing dep, future syntax error) leaves the run with CDP events but zero DOM/screenshots and the user only notices afterstop-capture.This PR addresses all three under one cohesive scope.
Changes
skills/browser-trace/scripts/snapshot-loop.mjscreateStopSignal()factory that owns the SIGTERM/SIGINT subscription, exposes a booleanstoppingview, and provides an abortablesleep(ms)whose promise resolves on either the timer firing or the stop being requested. The factory is exported so the unit test can drive it deterministically without touching the live process's signal handlers.stoppingbetween major work units (screenshot, DOM, URL fetch, index append) so an in-flight tick that just missed the sleep wake-up still bails before starting more work.timeoutoption (O11Y_SNAPSHOT_TIMEOUT_MS, default 30 s) to everyspawnSync('browse', ...)call. A hung browse CLI now skips the tick instead of blocking the loop, which keeps SIGTERM responsive.skills/browser-trace/scripts/start-capture.mjsskills/browser-trace/scripts/snapshot-loop.test.mjs(new)node --testcases coveringcreateStopSignal:stoppingis false until trigger fires.sleep(60_000)aborts in well under a second when the trigger fires mid-wait.sleep(60_000)resolves synchronously after the trigger has already fired.sleep(80)honors its interval when no trigger fires.skills/browser-trace/package.jsonnpm test->node --test scripts/*.test.mjsso the test runs through the standard tooling and CI can pick it up later if you wire one in.Verification
Net diff: +177 / -19 across four files in
skills/browser-trace/. No new dependencies, no public surface removed.Scope notes
O11Y_SNAPSHOT_TIMEOUT_MSis opt-out (existing setups stay on the 30 s default). The env name follows the existingO11Y_prefix used forO11Y_ROOTandO11Y_DOMAINS.createStopSignal()is exported so a future capture variant (e.g. capturing on multiple targets, or running snapshot logic inside another script) can reuse the same SIGTERM behavior without re-implementing the controller. Marked_triggeris documented as test-only.start-capture.mjscheck uses the sameisAlivehelper and the same one-second wait already in place; it is one symmetric block, not a new mechanism.Note
Medium Risk
Touches capture lifecycle and process/signal handling, so failures could truncate runs or leave orphan processes, but the change is localized and adds unit tests plus timeouts to bound hangs.
Overview
snapshot-loop.mjsis refactored to use an exportedcreateStopSignal()with an abortablesleep()so SIGTERM/SIGINT can break long inter-iteration waits, and eachspawnSync('browse', ...)call now has a configurable timeout (O11Y_SNAPSHOT_TIMEOUT_MS, default 30s) to avoid indefinite hangs.start-capture.mjsnow also validates the snapshot loop process is still alive after the initial 1s sanity delay (mirroring the existing CDP check) and surfaces the loop log on early exit.Adds
scripts/snapshot-loop.test.mjs(Node’s built-in test runner) to unit-test stop-signal behavior and wiresnpm testinpackage.jsonto run these tests.Reviewed by Cursor Bugbot for commit 1b958ad. Bugbot is set up for automated code reviews on this repo. Configure here.