Skip to content

fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness#92

Open
Nagendhra-web wants to merge 3 commits intobrowserbase:mainfrom
Nagendhra-web:fix/snapshot-loop-sigterm-responsiveness
Open

fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness#92
Nagendhra-web wants to merge 3 commits intobrowserbase:mainfrom
Nagendhra-web:fix/snapshot-loop-sigterm-responsiveness

Conversation

@Nagendhra-web
Copy link
Copy Markdown

@Nagendhra-web Nagendhra-web commented May 2, 2026

Summary

browser-trace's snapshot-loop.mjs runs a forever-loop that takes a screenshot, dumps the DOM, fetches the URL, then waits interval-seconds before the next tick. The SIGTERM and SIGINT handlers flip a stopping flag that is only checked at the top of the while loop, so a SIGTERM that arrives during the inter-iteration sleep waits the full interval before the loop actually exits.

stop-capture.mjs sends SIGTERM and then falls back to SIGKILL after roughly three seconds. With the default 2 s interval the race is usually invisible, but with a longer interval-seconds (10, 30, ...) the snapshot loop is still asleep when the SIGKILL arrives. The user loses the last DOM and screenshot pair, half-rendered files end up as .partial leftovers that the parent has to sweep, and the run's index.jsonl is missing its final entry.

Two related lifecycle issues come along for the ride: the spawnSync('browse', ...) calls inside the loop have no timeout, so a hung browse CLI can block the loop indefinitely; and start-capture.mjs verifies that the browse cdp child survived its one-second sanity wait but does not apply the same check to the snapshot loop, so a loop that fails to start (missing dep, future syntax error) leaves the run with CDP events but zero DOM/screenshots and the user only notices after stop-capture.

This PR addresses all three under one cohesive scope.

Changes

skills/browser-trace/scripts/snapshot-loop.mjs

  • Introduce a createStopSignal() factory that owns the SIGTERM/SIGINT subscription, exposes a boolean stopping view, and provides an abortable sleep(ms) whose promise resolves on either the timer firing or the stop being requested. The factory is exported so the unit test can drive it deterministically without touching the live process's signal handlers.
  • Re-check stopping between major work units (screenshot, DOM, URL fetch, index append) so an in-flight tick that just missed the sleep wake-up still bails before starting more work.
  • Pass a timeout option (O11Y_SNAPSHOT_TIMEOUT_MS, default 30 s) to every spawnSync('browse', ...) call. A hung browse CLI now skips the tick instead of blocking the loop, which keeps SIGTERM responsive.
  • Wrap the CLI entry path in a guard so the module can be imported by tests without booting the sampler loop.

skills/browser-trace/scripts/start-capture.mjs

  • Apply the existing one-second liveness check to the snapshot loop child as well, mirroring the existing CDP check. If the loop has died, kill CDP, surface the loop's stderr, and exit non-zero so the user sees the real error instead of a half-running capture.

skills/browser-trace/scripts/snapshot-loop.test.mjs (new)

  • Five node --test cases covering createStopSignal:
    • stopping is false until trigger fires.
    • sleep(60_000) aborts in well under a second when the trigger fires mid-wait.
    • sleep(60_000) resolves synchronously after the trigger has already fired.
    • sleep(80) honors its interval when no trigger fires.
    • Repeated triggers are idempotent.

skills/browser-trace/package.json

  • Add npm test -> node --test scripts/*.test.mjs so the test runs through the standard tooling and CI can pick it up later if you wire one in.

Verification

cd skills/browser-trace
npm test
> browser-trace@0.1.0 test
> node --test scripts/*.test.mjs

✔ stopping is false until trigger fires
✔ sleep wakes immediately when trigger fires mid-wait
✔ sleep resolves immediately when trigger has already fired
✔ sleep without a trigger waits for the requested interval
✔ repeated triggers are idempotent
ℹ tests 5
ℹ pass  5
ℹ fail  0
ℹ duration_ms 188.0

Net diff: +177 / -19 across four files in skills/browser-trace/. No new dependencies, no public surface removed.

Scope notes

  • Behavior is purely additive for the default 2 s interval: the loop already exited within one cycle, and the new abortable sleep just removes the residual lag. Long-interval users see the real win.
  • O11Y_SNAPSHOT_TIMEOUT_MS is opt-out (existing setups stay on the 30 s default). The env name follows the existing O11Y_ prefix used for O11Y_ROOT and O11Y_DOMAINS.
  • createStopSignal() is exported so a future capture variant (e.g. capturing on multiple targets, or running snapshot logic inside another script) can reuse the same SIGTERM behavior without re-implementing the controller. Marked _trigger is documented as test-only.
  • The start-capture.mjs check uses the same isAlive helper and the same one-second wait already in place; it is one symmetric block, not a new mechanism.

Note

Medium Risk
Touches capture lifecycle and process/signal handling, so failures could truncate runs or leave orphan processes, but the change is localized and adds unit tests plus timeouts to bound hangs.

Overview
snapshot-loop.mjs is refactored to use an exported createStopSignal() with an abortable sleep() so SIGTERM/SIGINT can break long inter-iteration waits, and each spawnSync('browse', ...) call now has a configurable timeout (O11Y_SNAPSHOT_TIMEOUT_MS, default 30s) to avoid indefinite hangs.

start-capture.mjs now also validates the snapshot loop process is still alive after the initial 1s sanity delay (mirroring the existing CDP check) and surfaces the loop log on early exit.

Adds scripts/snapshot-loop.test.mjs (Node’s built-in test runner) to unit-test stop-signal behavior and wires npm test in package.json to run these tests.

Reviewed by Cursor Bugbot for commit 1b958ad. Bugbot is set up for automated code reviews on this repo. Configure here.

…date loop liveness

snapshot-loop.mjs runs a forever-loop that does a screenshot, DOM dump,
URL fetch, then `await sleepMs(intervalMs)` between iterations. The
SIGTERM/SIGINT handlers flip a `stopping` flag that is only consulted
at the top of the while loop, so a SIGTERM that arrives during the
inter-iteration sleep waits the full interval before taking effect.

stop-capture.mjs sends SIGTERM and falls back to SIGKILL after roughly
three seconds. With short intervals (the default of 2 s) the race is
usually invisible, but with a longer `interval-seconds` (10, 30, ...)
the snapshot loop is still asleep when the SIGKILL arrives. The user
loses the last DOM/screenshot pair and any half-rendered files become
.partial leftovers that the parent then has to sweep on shutdown.

Three changes scoped to the same lifecycle concern:

1. Replace the standalone `sleepMs` with a `createStopSignal()` factory
   that owns the SIGTERM/SIGINT subscription and exposes an abortable
   `sleep(ms)`. The sleep promise resolves on either the timer firing
   or `stopping` flipping, so the loop wakes immediately when the user
   asks it to stop. `stopping` is also re-checked between major work
   units (screenshot, DOM, URL, append) so an in-flight tick that just
   missed the sleep wake-up still bails before starting more work.

2. Pass a `timeout` option (`O11Y_SNAPSHOT_TIMEOUT_MS`, default 30 s)
   to every `spawnSync('browse', ...)` call. A hung browse CLI used to
   block the loop indefinitely; with the timeout it skips the tick and
   moves on, so SIGTERM still gets a chance to land within the configured
   interval rather than the next CLI fault.

3. start-capture.mjs already verifies that `browse cdp` is alive after
   the one-second sanity wait. The same check is now applied to the
   snapshot-loop child. A syntax error or a missing dep in the loop
   used to leave the run with CDP events but zero DOM/screenshots, and
   the user only noticed after stop-capture; the new check kills CDP,
   prints the loop's stderr, and exits non-zero.

Tests:

- New `node --test scripts/snapshot-loop.test.mjs` validates the stop
  signal in isolation: stopping starts false, trigger flips it, sleep
  wakes immediately when trigger fires mid-wait, sleep resolves
  synchronously when trigger fired before the call, sleep without a
  trigger honors its interval, repeated triggers are idempotent.
- Wired through `npm test` in `skills/browser-trace/package.json`.
- The CLI entry now runs only when the module is invoked directly, so
  the test can import `createStopSignal` without booting the sampler.

Local verification on Node 24 / npm 10:
  cd skills/browser-trace && npm test
  ✔ all 5 tests pass in ~190 ms.
Comment thread skills/browser-trace/scripts/snapshot-loop.mjs Outdated
Cursor Bugbot pointed out that the three intermediate `if (stop.stopping)
break;` checks between the synchronous spawnSync calls in snapshot-loop's
iteration body are unreachable in practice. They are right: Node only
drains the signal callback queue between event-loop turns, the iteration
body is fully synchronous between `await stop.sleep(intervalMs)` calls,
so the SIGTERM/SIGINT handler cannot fire between two spawnSync calls.
The flag can only flip during the awaited sleep, and the while-condition
check at the top of the next iteration already catches that.

Remove the misleading checks and replace them with a single comment that
makes the contract explicit: SIGTERM responsiveness comes from the
per-spawnSync timeout (each browse call returns within SNAPSHOT_TIMEOUT_MS,
so the loop reaches the await within that bound) plus the abortable
sleep (which wakes immediately on the signal). The previous intermediate
guards gave a false impression of mid-iteration responsiveness that did
not actually exist.

No behavior change versus the previous commit on this branch:
- Default 30 s spawnSync timeout still bounds each browse call.
- Abortable sleep still wakes immediately on SIGTERM/SIGINT.
- Worst-case lag from SIGTERM to loop exit is now bounded by
  SNAPSHOT_TIMEOUT_MS + remaining sync work in the current tick, which
  is the same bound the previous version actually had once you account
  for the dead checks.

All 5 stop-signal unit tests continue to pass.
@Nagendhra-web
Copy link
Copy Markdown
Author

Thanks @cursor[bot]. The intermediate stop checks are unreachable in practice and the comment claiming "mid-iteration responsiveness" was misleading. Pushed 2a713a5 to remove them.

You are right about the runtime: between two synchronous spawnSync calls in the iteration body there is no event-loop tick, so Node never drains the SIGTERM callback queue, so stopping stays whatever the top-of-loop check observed. The only yield point that actually matters is await stop.sleep(intervalMs) at the bottom; the next iteration's while (!stop.stopping) then catches the flag.

The fix's actual SIGTERM responsiveness comes from two real mechanisms:

  1. The O11Y_SNAPSHOT_TIMEOUT_MS (default 30 s) cap on every spawnSync('browse', ...) call. Without this, a hung browse CLI could pin the loop indefinitely and SIGTERM would never get a chance to land. With it, each iteration body finishes within at most 3 * SNAPSHOT_TIMEOUT_MS + tiny sync work, then the abortable sleep yields control, then the handler fires.

  2. The abortable stop.sleep(intervalMs) at the bottom of the iteration. SIGTERM during this sleep resolves the promise immediately, the while-check picks up stopping = true, and the loop exits in well under a second instead of waiting out the full interval-seconds. This is what the unit test exercises.

Net change versus the previous commit on this branch: identical observable behavior, honest comment, and 12 fewer lines of code that suggested guarantees the runtime cannot actually provide. All 5 stop-signal unit tests continue to pass.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2a713a5. Configure here.

Comment thread skills/browser-trace/scripts/snapshot-loop.mjs Outdated
…ng .then per call

Cursor Bugbot pointed out that the previous `stopPromise.then(...)`
inside `sleep(ms)` registered a new reaction handler on every call. The
stop promise stays pending for the lifetime of the capture, so each
handler closure (capturing `timer` and `resolve`) was pinned on the
reaction list until SIGTERM finally fired. Over a 24-hour run at the
default 2 s interval that is roughly 43,000 closures, all live in
memory at once.

Replace the chained-.then approach with an explicit `pending` Set:

- `sleep(ms)` adds a `{ resolve, timer }` entry, then the timer's own
  callback removes the entry on natural wake-up. The closure becomes
  unreachable as soon as the sleep resolves.
- `trigger()` walks the set, clears each timer, calls each resolve,
  and clears the set in one pass.

`stopPromise` is no longer needed at all and has been dropped along
with `resolveStop`. No public surface change.

Two new unit tests guard the fix:

- `completed sleeps deregister from the pending set so closures can
  be GC'd` runs 200 back-to-back 1ms sleeps and asserts pending count
  returns to 0 after each one. With the old code this assertion would
  still hold for the closure ref-count by accident (the reaction list
  is internal to the Promise), but the new test exercises the
  explicit Set so future regressions are caught directly.
- `trigger drains every still-pending sleep without leaking` starts
  three long sleeps, asserts pending count is 3, fires the trigger,
  asserts it drops to 0, and awaits all three for completion.

Test-only `_pendingCount()` accessor added on the returned object;
documented as test-only alongside the existing `_trigger`.

All 7 tests now pass; no behavior change to the loop itself.
@Nagendhra-web
Copy link
Copy Markdown
Author

Nagendhra-web commented May 2, 2026

Thanks again. Real leak, fixed in 1b958ad.

You are right on the analysis: every sleep(ms) was calling stopPromise.then(...), and since stopPromise only resolves at SIGTERM, every handler closure (capturing timer and resolve) sat on the promise's reaction list for the lifetime of the capture. At the default 2 s interval that is around 43,000 closures over 24 hours.

Switched the implementation to an explicit pending Set:

  • sleep(ms) adds a { resolve, timer } entry. The timer's own callback removes the entry on natural wake-up, so the closure becomes unreachable the moment the sleep resolves.
  • trigger() walks the set, clears each timer, calls each resolve, then clears the set in one pass.
  • stopPromise and resolveStop are gone; the new design does not need a long-lived promise at all.
  • Public surface is unchanged: same stopping getter, same sleep(ms). Added a test-only _pendingCount() accessor alongside the existing _trigger, both documented as test-only.

Two new unit tests guard the fix:

  • completed sleeps deregister from the pending set so closures can be GC'd runs 200 back-to-back 1 ms sleeps and asserts _pendingCount() returns to 0 after each one. With the old code this assertion held by accident (the Set itself was empty, the leak was internal to the Promise reaction list), but it now directly exercises the path the regression would have to break.
  • trigger drains every still-pending sleep without leaking starts three long sleeps, asserts _pendingCount() is 3, fires the trigger, asserts it drops to 0, and awaits all three for completion.

Verification:

> browser-trace@0.1.0 test
> node --test scripts/*.test.mjs

✔ stopping is false until trigger fires
✔ sleep wakes immediately when trigger fires mid-wait
✔ sleep resolves immediately when trigger has already fired
✔ sleep without a trigger waits for the requested interval
✔ repeated triggers are idempotent
✔ completed sleeps deregister from the pending set so closures can be GC'd (3048 ms)
✔ trigger drains every still-pending sleep without leaking
ℹ tests 7  ℹ pass 7  ℹ fail 0

Net change vs the previous commit: +44 / -8 across the same two files. No behavior change to the snapshot loop itself.

@jcalvin4
Copy link
Copy Markdown

jcalvin4 commented May 4, 2026

Hi, I just discovered this repo and was wondering how you came to this solution for the loop problem?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants