fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness by Nagendhra-web · Pull Request #92 · browserbase/skills

Nagendhra-web · 2026-05-02T20:17:44Z

Summary

browser-trace's snapshot-loop.mjs runs a forever-loop that takes a screenshot, dumps the DOM, fetches the URL, then waits interval-seconds before the next tick. The SIGTERM and SIGINT handlers flip a stopping flag that is only checked at the top of the while loop, so a SIGTERM that arrives during the inter-iteration sleep waits the full interval before the loop actually exits.

stop-capture.mjs sends SIGTERM and then falls back to SIGKILL after roughly three seconds. With the default 2 s interval the race is usually invisible, but with a longer interval-seconds (10, 30, ...) the snapshot loop is still asleep when the SIGKILL arrives. The user loses the last DOM and screenshot pair, half-rendered files end up as .partial leftovers that the parent has to sweep, and the run's index.jsonl is missing its final entry.

Two related lifecycle issues come along for the ride: the spawnSync('browse', ...) calls inside the loop have no timeout, so a hung browse CLI can block the loop indefinitely; and start-capture.mjs verifies that the browse cdp child survived its one-second sanity wait but does not apply the same check to the snapshot loop, so a loop that fails to start (missing dep, future syntax error) leaves the run with CDP events but zero DOM/screenshots and the user only notices after stop-capture.

This PR addresses all three under one cohesive scope.

Changes

skills/browser-trace/scripts/snapshot-loop.mjs

Introduce a createStopSignal() factory that owns the SIGTERM/SIGINT subscription, exposes a boolean stopping view, and provides an abortable sleep(ms) whose promise resolves on either the timer firing or the stop being requested. The factory is exported so the unit test can drive it deterministically without touching the live process's signal handlers.
Re-check stopping between major work units (screenshot, DOM, URL fetch, index append) so an in-flight tick that just missed the sleep wake-up still bails before starting more work.
Pass a timeout option (O11Y_SNAPSHOT_TIMEOUT_MS, default 30 s) to every spawnSync('browse', ...) call. A hung browse CLI now skips the tick instead of blocking the loop, which keeps SIGTERM responsive.
Wrap the CLI entry path in a guard so the module can be imported by tests without booting the sampler loop.

skills/browser-trace/scripts/start-capture.mjs

Apply the existing one-second liveness check to the snapshot loop child as well, mirroring the existing CDP check. If the loop has died, kill CDP, surface the loop's stderr, and exit non-zero so the user sees the real error instead of a half-running capture.

skills/browser-trace/scripts/snapshot-loop.test.mjs (new)

Five node --test cases covering createStopSignal:
- stopping is false until trigger fires.
- sleep(60_000) aborts in well under a second when the trigger fires mid-wait.
- sleep(60_000) resolves synchronously after the trigger has already fired.
- sleep(80) honors its interval when no trigger fires.
- Repeated triggers are idempotent.

skills/browser-trace/package.json

Add npm test -> node --test scripts/*.test.mjs so the test runs through the standard tooling and CI can pick it up later if you wire one in.

Verification

cd skills/browser-trace
npm test

> browser-trace@0.1.0 test
> node --test scripts/*.test.mjs

✔ stopping is false until trigger fires
✔ sleep wakes immediately when trigger fires mid-wait
✔ sleep resolves immediately when trigger has already fired
✔ sleep without a trigger waits for the requested interval
✔ repeated triggers are idempotent
ℹ tests 5
ℹ pass  5
ℹ fail  0
ℹ duration_ms 188.0

Net diff: +177 / -19 across four files in skills/browser-trace/. No new dependencies, no public surface removed.

Scope notes

Behavior is purely additive for the default 2 s interval: the loop already exited within one cycle, and the new abortable sleep just removes the residual lag. Long-interval users see the real win.
O11Y_SNAPSHOT_TIMEOUT_MS is opt-out (existing setups stay on the 30 s default). The env name follows the existing O11Y_ prefix used for O11Y_ROOT and O11Y_DOMAINS.
createStopSignal() is exported so a future capture variant (e.g. capturing on multiple targets, or running snapshot logic inside another script) can reuse the same SIGTERM behavior without re-implementing the controller. Marked _trigger is documented as test-only.
The start-capture.mjs check uses the same isAlive helper and the same one-second wait already in place; it is one symmetric block, not a new mechanism.

Note

Medium Risk
Touches capture lifecycle and process/signal handling, so failures could truncate runs or leave orphan processes, but the change is localized and adds unit tests plus timeouts to bound hangs.

Overview
snapshot-loop.mjs is refactored to use an exported createStopSignal() with an abortable sleep() so SIGTERM/SIGINT can break long inter-iteration waits, and each spawnSync('browse', ...) call now has a configurable timeout (O11Y_SNAPSHOT_TIMEOUT_MS, default 30s) to avoid indefinite hangs.

start-capture.mjs now also validates the snapshot loop process is still alive after the initial 1s sanity delay (mirroring the existing CDP check) and surfaces the loop log on early exit.

Adds scripts/snapshot-loop.test.mjs (Node’s built-in test runner) to unit-test stop-signal behavior and wires npm test in package.json to run these tests.

^{Reviewed by Cursor Bugbot for commit 1b958ad. Bugbot is set up for automated code reviews on this repo. Configure here.}

…date loop liveness snapshot-loop.mjs runs a forever-loop that does a screenshot, DOM dump, URL fetch, then `await sleepMs(intervalMs)` between iterations. The SIGTERM/SIGINT handlers flip a `stopping` flag that is only consulted at the top of the while loop, so a SIGTERM that arrives during the inter-iteration sleep waits the full interval before taking effect. stop-capture.mjs sends SIGTERM and falls back to SIGKILL after roughly three seconds. With short intervals (the default of 2 s) the race is usually invisible, but with a longer `interval-seconds` (10, 30, ...) the snapshot loop is still asleep when the SIGKILL arrives. The user loses the last DOM/screenshot pair and any half-rendered files become .partial leftovers that the parent then has to sweep on shutdown. Three changes scoped to the same lifecycle concern: 1. Replace the standalone `sleepMs` with a `createStopSignal()` factory that owns the SIGTERM/SIGINT subscription and exposes an abortable `sleep(ms)`. The sleep promise resolves on either the timer firing or `stopping` flipping, so the loop wakes immediately when the user asks it to stop. `stopping` is also re-checked between major work units (screenshot, DOM, URL, append) so an in-flight tick that just missed the sleep wake-up still bails before starting more work. 2. Pass a `timeout` option (`O11Y_SNAPSHOT_TIMEOUT_MS`, default 30 s) to every `spawnSync('browse', ...)` call. A hung browse CLI used to block the loop indefinitely; with the timeout it skips the tick and moves on, so SIGTERM still gets a chance to land within the configured interval rather than the next CLI fault. 3. start-capture.mjs already verifies that `browse cdp` is alive after the one-second sanity wait. The same check is now applied to the snapshot-loop child. A syntax error or a missing dep in the loop used to leave the run with CDP events but zero DOM/screenshots, and the user only noticed after stop-capture; the new check kills CDP, prints the loop's stderr, and exits non-zero. Tests: - New `node --test scripts/snapshot-loop.test.mjs` validates the stop signal in isolation: stopping starts false, trigger flips it, sleep wakes immediately when trigger fires mid-wait, sleep resolves synchronously when trigger fired before the call, sleep without a trigger honors its interval, repeated triggers are idempotent. - Wired through `npm test` in `skills/browser-trace/package.json`. - The CLI entry now runs only when the module is invoked directly, so the test can import `createStopSignal` without booting the sampler. Local verification on Node 24 / npm 10: cd skills/browser-trace && npm test ✔ all 5 tests pass in ~190 ms.

Cursor Bugbot pointed out that the three intermediate `if (stop.stopping) break;` checks between the synchronous spawnSync calls in snapshot-loop's iteration body are unreachable in practice. They are right: Node only drains the signal callback queue between event-loop turns, the iteration body is fully synchronous between `await stop.sleep(intervalMs)` calls, so the SIGTERM/SIGINT handler cannot fire between two spawnSync calls. The flag can only flip during the awaited sleep, and the while-condition check at the top of the next iteration already catches that. Remove the misleading checks and replace them with a single comment that makes the contract explicit: SIGTERM responsiveness comes from the per-spawnSync timeout (each browse call returns within SNAPSHOT_TIMEOUT_MS, so the loop reaches the await within that bound) plus the abortable sleep (which wakes immediately on the signal). The previous intermediate guards gave a false impression of mid-iteration responsiveness that did not actually exist. No behavior change versus the previous commit on this branch: - Default 30 s spawnSync timeout still bounds each browse call. - Abortable sleep still wakes immediately on SIGTERM/SIGINT. - Worst-case lag from SIGTERM to loop exit is now bounded by SNAPSHOT_TIMEOUT_MS + remaining sync work in the current tick, which is the same bound the previous version actually had once you account for the dead checks. All 5 stop-signal unit tests continue to pass.

Nagendhra-web · 2026-05-02T21:01:30Z

Thanks @cursor[bot]. The intermediate stop checks are unreachable in practice and the comment claiming "mid-iteration responsiveness" was misleading. Pushed 2a713a5 to remove them.

You are right about the runtime: between two synchronous spawnSync calls in the iteration body there is no event-loop tick, so Node never drains the SIGTERM callback queue, so stopping stays whatever the top-of-loop check observed. The only yield point that actually matters is await stop.sleep(intervalMs) at the bottom; the next iteration's while (!stop.stopping) then catches the flag.

The fix's actual SIGTERM responsiveness comes from two real mechanisms:

The O11Y_SNAPSHOT_TIMEOUT_MS (default 30 s) cap on every spawnSync('browse', ...) call. Without this, a hung browse CLI could pin the loop indefinitely and SIGTERM would never get a chance to land. With it, each iteration body finishes within at most 3 * SNAPSHOT_TIMEOUT_MS + tiny sync work, then the abortable sleep yields control, then the handler fires.
The abortable stop.sleep(intervalMs) at the bottom of the iteration. SIGTERM during this sleep resolves the promise immediately, the while-check picks up stopping = true, and the loop exits in well under a second instead of waiting out the full interval-seconds. This is what the unit test exercises.

Net change versus the previous commit on this branch: identical observable behavior, honest comment, and 12 fewer lines of code that suggested guarantees the runtime cannot actually provide. All 5 stop-signal unit tests continue to pass.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 2a713a5. Configure here.}

…ng .then per call Cursor Bugbot pointed out that the previous `stopPromise.then(...)` inside `sleep(ms)` registered a new reaction handler on every call. The stop promise stays pending for the lifetime of the capture, so each handler closure (capturing `timer` and `resolve`) was pinned on the reaction list until SIGTERM finally fired. Over a 24-hour run at the default 2 s interval that is roughly 43,000 closures, all live in memory at once. Replace the chained-.then approach with an explicit `pending` Set: - `sleep(ms)` adds a `{ resolve, timer }` entry, then the timer's own callback removes the entry on natural wake-up. The closure becomes unreachable as soon as the sleep resolves. - `trigger()` walks the set, clears each timer, calls each resolve, and clears the set in one pass. `stopPromise` is no longer needed at all and has been dropped along with `resolveStop`. No public surface change. Two new unit tests guard the fix: - `completed sleeps deregister from the pending set so closures can be GC'd` runs 200 back-to-back 1ms sleeps and asserts pending count returns to 0 after each one. With the old code this assertion would still hold for the closure ref-count by accident (the reaction list is internal to the Promise), but the new test exercises the explicit Set so future regressions are caught directly. - `trigger drains every still-pending sleep without leaking` starts three long sleeps, asserts pending count is 3, fires the trigger, asserts it drops to 0, and awaits all three for completion. Test-only `_pendingCount()` accessor added on the returned object; documented as test-only alongside the existing `_trigger`. All 7 tests now pass; no behavior change to the loop itself.

Nagendhra-web · 2026-05-02T21:20:35Z

Thanks again. Real leak, fixed in 1b958ad.

You are right on the analysis: every sleep(ms) was calling stopPromise.then(...), and since stopPromise only resolves at SIGTERM, every handler closure (capturing timer and resolve) sat on the promise's reaction list for the lifetime of the capture. At the default 2 s interval that is around 43,000 closures over 24 hours.

Switched the implementation to an explicit pending Set:

sleep(ms) adds a { resolve, timer } entry. The timer's own callback removes the entry on natural wake-up, so the closure becomes unreachable the moment the sleep resolves.
trigger() walks the set, clears each timer, calls each resolve, then clears the set in one pass.
stopPromise and resolveStop are gone; the new design does not need a long-lived promise at all.
Public surface is unchanged: same stopping getter, same sleep(ms). Added a test-only _pendingCount() accessor alongside the existing _trigger, both documented as test-only.

Two new unit tests guard the fix:

completed sleeps deregister from the pending set so closures can be GC'd runs 200 back-to-back 1 ms sleeps and asserts _pendingCount() returns to 0 after each one. With the old code this assertion held by accident (the Set itself was empty, the leak was internal to the Promise reaction list), but it now directly exercises the path the regression would have to break.
trigger drains every still-pending sleep without leaking starts three long sleeps, asserts _pendingCount() is 3, fires the trigger, asserts it drops to 0, and awaits all three for completion.

Verification:

> browser-trace@0.1.0 test
> node --test scripts/*.test.mjs

✔ stopping is false until trigger fires
✔ sleep wakes immediately when trigger fires mid-wait
✔ sleep resolves immediately when trigger has already fired
✔ sleep without a trigger waits for the requested interval
✔ repeated triggers are idempotent
✔ completed sleeps deregister from the pending set so closures can be GC'd (3048 ms)
✔ trigger drains every still-pending sleep without leaking
ℹ tests 7  ℹ pass 7  ℹ fail 0

Net change vs the previous commit: +44 / -8 across the same two files. No behavior change to the snapshot loop itself.

jcalvin4 · 2026-05-04T23:39:49Z

Hi, I just discovered this repo and was wondering how you came to this solution for the loop problem?

cursor Bot reviewed May 2, 2026

View reviewed changes

Comment thread skills/browser-trace/scripts/snapshot-loop.mjs Outdated

cursor Bot reviewed May 2, 2026

View reviewed changes

Comment thread skills/browser-trace/scripts/snapshot-loop.mjs Outdated

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness#92

fix(browser-trace): make snapshot-loop responsive to SIGTERM and validate loop liveness#92
Nagendhra-web wants to merge 3 commits intobrowserbase:mainfrom
Nagendhra-web:fix/snapshot-loop-sigterm-responsiveness

Nagendhra-web commented May 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

Nagendhra-web commented May 2, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Nagendhra-web commented May 2, 2026 •

edited

Loading

Uh oh!

jcalvin4 commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Nagendhra-web commented May 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Verification

Scope notes

Uh oh!

Uh oh!

Nagendhra-web commented May 2, 2026

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Nagendhra-web commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jcalvin4 commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Nagendhra-web commented May 2, 2026 •

edited by cursor Bot

Loading

Nagendhra-web commented May 2, 2026 •

edited

Loading