feat(cli): add time-barrier wave staggering to e2e batch fan-out #14935
Draft
sarayev wants to merge 1 commit into
The e2e CodeBuild batch fans out a large number of shards that all depend only on `upb` (Linux) or `build_windows`/`upb` (Windows). Dispatching that many builds as a single instantaneous burst is unreliable, so this change spreads dispatch over time using TIME barriers rather than completion chaining.

Shards are grouped into waves of `E2E_WAVE_SIZE` (per OS). Wave 1 keeps its original dependency on `upb`/`build_windows`. Each later wave k is gated behind a synthetic barrier build whose only job is to sleep `(k-1) * E2E_WAVE_BARRIER_INTERVAL_SEC` seconds. Crucially, every barrier depends ONLY on `upb`/`build_windows`, never on the prior barrier or the prior wave, so all barriers start in parallel right after the package upload and the stagger comes purely from their differing sleep durations. This cuts the per-scheduling-decision dispatch burst without serializing wave completion, deliberately avoiding the serial completion chain that timed out at the 240m batch limit previously.

Both knobs are tunable constants near the top of the generator. Barriers are excluded from the e2e report wait list (`wait_for_ids.json`) since they are synthetic and must not be polled by `aggregate_e2e_reports`.

Verified by regenerating `e2e_workflow_generated.yml`: 4 barriers present (l/w waves 2 and 3), waves reference their barrier, wave 1 still on `upb`/`build_windows`, no shard depends on another shard, no barrier depends on another barrier, and the yaml parses cleanly.

---
Prompt: implement barrier-wave staggering, draft PR from dev, run e2e
6ba93f5 to f1277b2
Description of changes
Experimental change that staggers the e2e CodeBuild batch fan-out over time using time-based barriers, to bound how many builds are dispatched simultaneously.
Problem
The e2e batch fans out a large number of shards that today all depend only on `upb` (Linux `l_*`) or `build_windows`/`upb` (Windows `w_*`). Dispatching this many builds as a single instantaneous burst is unreliable. This experiment spreads dispatch over time to reduce the size of that burst.
Design: time barriers, not completion chaining
Shards are grouped into waves of `E2E_WAVE_SIZE` per OS:
- Wave 1 keeps its original dependency (`upb`, or `build_windows`/`upb` for Windows), so it dispatches immediately.
- Each later wave k depends on a synthetic barrier build (`l_wave_barrier_k`/`w_wave_barrier_k`) whose only command is to sleep `(k-1) * E2E_WAVE_BARRIER_INTERVAL_SEC` seconds.
- Every barrier depends only on `upb`/`build_windows`, never on the prior barrier or the prior wave. All barriers start in parallel right after the package upload; the stagger comes purely from their differing sleep durations.

This deliberately avoids a serial completion-chain approach (shard N depends on shard N-1), which would risk timing out at the 240m batch limit. Here dispatch is spread over time without waiting for any prior wave to finish.
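The dependency shape described above can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the `Job` type, the `staggerWaves` function, and the example shard identifiers are hypothetical, and the sketch assumes that wave-k shards depend only on their barrier (the description says only that they "reference" it).

```typescript
// Hypothetical job shape; the real generator's types are not shown in this PR.
type Job = { identifier: string; dependOn: string[]; sleepSeconds?: number };

const E2E_WAVE_SIZE = 50;
const E2E_WAVE_BARRIER_INTERVAL_SEC = 300;

// Groups shards into waves and gates every wave after the first behind a
// time barrier. `baseDeps` is ['upb'] for Linux or ['build_windows', 'upb']
// for Windows.
function staggerWaves(
  shardIds: string[],
  osPrefix: 'l' | 'w',
  baseDeps: string[],
  waveSize: number = E2E_WAVE_SIZE,
  intervalSec: number = E2E_WAVE_BARRIER_INTERVAL_SEC,
): Job[] {
  const jobs: Job[] = [];
  const waveCount = Math.ceil(shardIds.length / waveSize);
  for (let k = 1; k <= waveCount; k++) {
    const wave = shardIds.slice((k - 1) * waveSize, k * waveSize);
    let deps = baseDeps; // wave 1 keeps its original dependency
    if (k > 1) {
      const barrierId = `${osPrefix}_wave_barrier_${k}`;
      // The barrier depends only on the base jobs, never on an earlier
      // barrier or wave, so all barriers start in parallel.
      jobs.push({
        identifier: barrierId,
        dependOn: [...baseDeps],
        sleepSeconds: (k - 1) * intervalSec,
      });
      deps = [barrierId]; // assumption: shards depend on the barrier alone
    }
    for (const id of wave) {
      jobs.push({ identifier: id, dependOn: [...deps] });
    }
  }
  return jobs;
}
```

Because no barrier or shard ever appears in another barrier's `dependOn`, the longest dependency path stays at base job, barrier, shard regardless of how many waves there are, which is what keeps this from degenerating into a completion chain.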
Tunable knobs
Two constants near the top of the generator:
- `E2E_WAVE_SIZE` (default `50`): shards per wave.
- `E2E_WAVE_BARRIER_INTERVAL_SEC` (default `300`): seconds of additional sleep per wave (barrier_2 sleeps 300s, barrier_3 sleeps 600s, …).

Notes
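As a rough illustration of how the two knobs interact, the sketch below computes the dispatch offset of each wave for a given shard count. The function name `waveDispatchOffsets` and the shard count of 150 are hypothetical and chosen only for the example; the defaults mirror the two constants.

```typescript
// Returns the dispatch offset (in seconds after the base jobs finish) for
// each wave. Index 0 is wave 1, which has no barrier and therefore offset 0;
// wave k waits (k - 1) * intervalSec.
function waveDispatchOffsets(
  shardCount: number,
  waveSize: number = 50,
  intervalSec: number = 300,
): number[] {
  const waveCount = Math.ceil(shardCount / waveSize);
  return Array.from({ length: waveCount }, (_, i) => i * intervalSec);
}

// A hypothetical 150 shards with the defaults gives three waves,
// dispatched at offsets 0, 300 and 600 seconds.
console.log(waveDispatchOffsets(150));
```

The last wave starts `(waveCount - 1) * E2E_WAVE_BARRIER_INTERVAL_SEC` seconds after the package upload, so a small wave size combined with a long interval eats into the 240m batch limit even without completion chaining.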
- Linux barriers run on `BUILD_GENERAL1_SMALL` and sleep via bash `sleep`. Windows barriers run on the same `WINDOWS_SERVER_2022_CONTAINER` / `$WINDOWS_IMAGE_2019` env as `w_*` shards and sleep via PowerShell `Start-Sleep` (bash `sleep` is not guaranteed on the Windows image), using an inline buildspec with `shell: powershell.exe`.
- Barriers are excluded from the e2e report wait list (`wait_for_ids.json`): the list is computed from the test shards before barriers are injected, so `aggregate_e2e_reports` does not poll the synthetic builds.

Issue #, if available
N/A — infra experiment.
Description of how you validated changes
- Regenerated `codebuild_specs/e2e_workflow_generated.yml` and validated it parses with `js-yaml`.
- 4 barriers present (`l_wave_barrier_2/3`, `w_wave_barrier_2/3`), each depending only on `upb` / `build_windows`+`upb`.
- Wave 1 shards still depend on `upb` (Linux) / `build_windows`+`upb` (Windows), and wave-k shards reference their barrier.
- No `l_`/`w_` shard depends on another shard and no barrier depends on another barrier (grep proof).
- `wait_for_ids.json` is unchanged and contains no barrier identifiers.

Checklist
- `yarn test` passes (CI/CD config change only; no unit-test surface)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.