feat(cli): add time-barrier wave staggering to e2e batch fan-out #14935

Draft
sarayev wants to merge 1 commit into dev from feat/e2e-barrier-wave-staggering

sarayev (Contributor) commented Jun 23, 2026

Description of changes

Experimental change that staggers the e2e CodeBuild batch fan-out over time using time-based barriers, to bound how many builds are dispatched simultaneously.

Problem

The e2e batch fans out a large number of shards that today all depend only on upb (Linux l_*) or build_windows/upb (Windows w_*). Dispatching this many builds as a single instantaneous burst is unreliable. This experiment spreads dispatch over time to reduce the size of that burst.

Design — time barriers, not completion chaining

Shards are grouped into waves of E2E_WAVE_SIZE shards, separately for each OS:

  • Wave 1 keeps its original dependency (upb, or build_windows/upb for Windows) — it dispatches immediately.
  • Each later wave k (k ≥ 2) depends on a synthetic barrier build (l_wave_barrier_k / w_wave_barrier_k) whose only command is to sleep (k-1) * E2E_WAVE_BARRIER_INTERVAL_SEC seconds.
  • Critically, every barrier depends ONLY on upb (Linux) or build_windows/upb (Windows), never on the prior barrier or the prior wave. All barriers start in parallel right after the package upload; the stagger comes purely from their differing sleep durations.

This deliberately avoids serial completion chaining (shard N depends on shard N-1), which would risk hitting the 240-minute batch timeout. Here dispatch is spread over time without waiting for any prior wave to finish.
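
A minimal sketch of the wiring described above, not the generator's actual code. The `BuildNode` type, the `wireWaves` function and the `l_shard_*` identifiers in the usage note are hypothetical; only the barrier naming and the `E2E_WAVE_SIZE` default come from this PR. It also assumes a wave-k shard lists only its barrier as a dependency and reaches upb transitively through it.

```typescript
// Hypothetical node shape; the real generator's types are not shown in this PR.
type BuildNode = { identifier: string; dependOn: string[] };

const E2E_WAVE_SIZE = 50; // default stated in this PR

// prefix: 'l' (Linux) or 'w' (Windows).
// baseDeps: ['upb'] for Linux, ['build_windows', 'upb'] for Windows.
function wireWaves(
  shardIds: string[],
  prefix: 'l' | 'w',
  baseDeps: string[],
  waveSize: number = E2E_WAVE_SIZE,
): BuildNode[] {
  const barriers: BuildNode[] = [];
  const shards: BuildNode[] = [];
  for (let i = 0; i < shardIds.length; i++) {
    const wave = Math.floor(i / waveSize) + 1;
    if (wave === 1) {
      // Wave 1 keeps the original dependency and dispatches immediately.
      shards.push({ identifier: shardIds[i], dependOn: baseDeps.slice() });
      continue;
    }
    const barrierId = `${prefix}_wave_barrier_${wave}`;
    // Waves are visited in order, so barrier k is created the first time wave k appears.
    if (barriers.length < wave - 1) {
      // A barrier depends only on the base builds, never on another barrier or wave.
      barriers.push({ identifier: barrierId, dependOn: baseDeps.slice() });
    }
    shards.push({ identifier: shardIds[i], dependOn: [barrierId] });
  }
  return barriers.concat(shards);
}
```

For example, `wireWaves(ids, 'l', ['upb'])` with 120 hypothetical shard ids yields two barriers (`l_wave_barrier_2`, `l_wave_barrier_3`), both depending only on `upb`, followed by the 120 rewired shards.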

Tunable knobs

Two constants near the top of the generator:

  • E2E_WAVE_SIZE (default 50) — shards per wave.
  • E2E_WAVE_BARRIER_INTERVAL_SEC (default 300) — seconds of additional sleep per wave (barrier_2 sleeps 300s, barrier_3 sleeps 600s, …).
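
To make the arithmetic behind the two knobs concrete, here is a small sketch. The helper names (`waveCount`, `barrierSleepSec`, `dispatchOffsets`) and the shard count of 120 in the usage note are illustrative assumptions, not values from the generator. The offsets are the barrier sleep durations only; actual dispatch would also include however long each barrier build takes to provision and start.

```typescript
const E2E_WAVE_SIZE = 50; // shards per wave (default from this PR)
const E2E_WAVE_BARRIER_INTERVAL_SEC = 300; // extra sleep per wave (default from this PR)

// Number of waves needed for a given shard count on one OS.
function waveCount(shardCount: number, waveSize: number = E2E_WAVE_SIZE): number {
  return Math.ceil(shardCount / waveSize);
}

// Sleep duration of the barrier gating wave k; wave 1 has no barrier, so 0.
function barrierSleepSec(wave: number, intervalSec: number = E2E_WAVE_BARRIER_INTERVAL_SEC): number {
  return (wave - 1) * intervalSec;
}

// Sleep-based dispatch offset, in seconds, for every wave of one OS.
function dispatchOffsets(shardCount: number): number[] {
  const offsets: number[] = [];
  for (let wave = 1; wave <= waveCount(shardCount); wave++) {
    offsets.push(barrierSleepSec(wave));
  }
  return offsets;
}
```

With a hypothetical 120 shards on one OS, `dispatchOffsets(120)` gives `[0, 300, 600]`: three waves, with barriers for waves 2 and 3 only.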

Notes

  • Linux barriers run on BUILD_GENERAL1_SMALL and sleep via bash sleep. Windows barriers run on the same WINDOWS_SERVER_2022_CONTAINER / $WINDOWS_IMAGE_2019 env as w_* shards and sleep via PowerShell Start-Sleep (bash sleep is not guaranteed on the Windows image), using an inline buildspec with shell: powershell.exe.
  • Barriers are excluded from the e2e report wait list (wait_for_ids.json) — the list is computed from the test shards before barriers are injected, so aggregate_e2e_reports does not poll the synthetic builds.
  • The test split / balancing logic and shard contents are unchanged — only the dependency wiring and the new barrier build groups were added.
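
The per-OS barrier definitions in the first note could look roughly like the sketch below. The `BarrierEntry` type and `barrierEntry` function are hypothetical, and the exact keys the generator emits are an assumption; the compute type, Windows environment values, and the bash `sleep` / PowerShell `Start-Sleep` split are taken from the notes above.

```typescript
// Hypothetical entry shape, loosely mirroring a batch build-graph item.
type BarrierEntry = {
  identifier: string;
  buildspec: string; // inline buildspec text
  env: { [key: string]: string };
  'depend-on': string[];
};

function barrierEntry(os: 'l' | 'w', wave: number, intervalSec: number): BarrierEntry {
  const sleepSec = (wave - 1) * intervalSec;
  const identifier = `${os}_wave_barrier_${wave}`;
  if (os === 'l') {
    return {
      identifier,
      // Linux: plain bash sleep.
      buildspec: ['version: 0.2', 'phases:', '  build:', '    commands:', `      - sleep ${sleepSec}`].join('\n'),
      env: { 'compute-type': 'BUILD_GENERAL1_SMALL' },
      'depend-on': ['upb'],
    };
  }
  return {
    identifier,
    // Windows: bash sleep is not guaranteed on the image, so use PowerShell.
    buildspec: [
      'version: 0.2',
      'env:',
      '  shell: powershell.exe',
      'phases:',
      '  build:',
      '    commands:',
      `      - Start-Sleep -Seconds ${sleepSec}`,
    ].join('\n'),
    env: { type: 'WINDOWS_SERVER_2022_CONTAINER', image: '$WINDOWS_IMAGE_2019' },
    'depend-on': ['build_windows', 'upb'],
  };
}
```

Because the wait list is computed from the test shards before entries like these are added, nothing here needs to be filtered out of wait_for_ids.json afterwards.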

Issue #, if available

N/A — infra experiment.

Description of how you validated changes

  • Regenerated codebuild_specs/e2e_workflow_generated.yml and validated it parses with js-yaml.
  • Verified 4 barriers present (l_wave_barrier_2/3, w_wave_barrier_2/3), each depending only on upb / build_windows+upb.
  • Verified wave-1 shards still depend on upb (Linux) / build_windows+upb (Windows), and wave-k shards reference their barrier.
  • Confirmed no l_/w_ shard depends on another shard and no barrier depends on another barrier (grep proof).
  • Confirmed wait_for_ids.json is unchanged and contains no barrier identifiers.
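
The grep-based checks above could also be expressed as a small script over the parsed build graph. This is a sketch under assumptions: `GraphNode`, `findViolations`, and the identifier patterns (shards start with `l_` or `w_`, barriers match `[lw]_wave_barrier_<n>`) are illustrative and not part of this PR's code.

```typescript
// Hypothetical parsed form of one build-graph entry.
type GraphNode = { identifier: string; dependOn: string[] };

function isBarrier(id: string): boolean {
  return /^[lw]_wave_barrier_\d+$/.test(id);
}

// Assumes every test shard identifier starts with l_ or w_.
function isShard(id: string): boolean {
  return /^[lw]_/.test(id) && !isBarrier(id);
}

function findViolations(graph: GraphNode[], waitForIds: string[]): string[] {
  const violations: string[] = [];
  for (const node of graph) {
    for (const dep of node.dependOn) {
      // Nothing may wait on a test shard: no completion chaining.
      if (isShard(dep)) {
        violations.push(`${node.identifier} depends on shard ${dep}`);
      }
      // Barriers must not chain on each other.
      if (isBarrier(node.identifier) && isBarrier(dep)) {
        violations.push(`${node.identifier} depends on barrier ${dep}`);
      }
    }
  }
  // Synthetic barriers must never be polled by the report aggregation.
  for (const id of waitForIds) {
    if (isBarrier(id)) {
      violations.push(`wait list contains barrier ${id}`);
    }
  }
  return violations;
}
```

An empty result corresponds to the properties listed above: no shard-to-shard edges, no barrier-to-barrier edges, and a barrier-free wait list.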

Checklist

  • PR description included
  • yarn test passes (CI/CD config change only — no unit-test surface)
  • Tests are changed or added (N/A — build-graph generation change)
  • Relevant documentation is changed or added (and PR referenced)
  • New AWS SDK calls or CloudFormation actions have been added to relevant test and service IAM policies (N/A)
  • Pull request labels are added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The e2e CodeBuild batch fans out a large number of shards that all depend
only on `upb` (Linux) or `build_windows`/`upb` (Windows). Dispatching that
many builds as a single instantaneous burst is unreliable.

This change spreads dispatch over time using time barriers rather than
completion chaining. Shards are grouped into waves of `E2E_WAVE_SIZE`
(per OS). Wave 1 keeps its original dependency on `upb`/`build_windows`.
Each later wave k is gated behind a synthetic barrier build whose only job
is to sleep `(k-1) * E2E_WAVE_BARRIER_INTERVAL_SEC` seconds. Crucially,
every barrier depends ONLY on `upb`/`build_windows` — never on the prior
barrier or the prior wave — so all barriers start in parallel right after
the package upload and the stagger comes purely from their differing
sleep durations. This cuts the per-scheduling-decision dispatch burst
without serializing wave completion, deliberately avoiding the serial
completion chain that timed out at the 240m batch limit previously.

Both knobs are tunable constants near the top of the generator. Barriers
are excluded from the e2e report wait list (`wait_for_ids.json`) since
they are synthetic and must not be polled by `aggregate_e2e_reports`.

Verified by regenerating `e2e_workflow_generated.yml`: 4 barriers
present (l/w waves 2 and 3), waves reference their barrier, wave 1 still
on `upb`/`build_windows`, no shard depends on another shard, no barrier
depends on another barrier, and the yaml parses cleanly.

---
Prompt: implement barrier-wave staggering, draft PR from dev, run e2e

sarayev force-pushed the feat/e2e-barrier-wave-staggering branch from 6ba93f5 to f1277b2 on June 23, 2026 12:59