feat(cli): refined e2e wave staggering (per-platform wave sizes, all waves gated) #14936
Draft
sarayev wants to merge 1 commit into
…waves gated)

A more aggressive variant of the time-barrier wave staggering, run as a parallel experiment. Instead of letting wave 1 fan out immediately, EVERY wave, including wave 1, is now gated behind a synthetic barrier build that only sleeps, so nothing dispatches the instant `upb`/`build_windows` completes.

Wave sizes are now per-platform: Linux shards are grouped into waves of 30 (5 waves), Windows shards into waves of 20 (7 waves). Each wave k is gated by a barrier `l_wave_barrier_k` / `w_wave_barrier_k` whose only command sleeps `k * 300` seconds (5m, 10m, 15m, ...). Every shard depends ONLY on its barrier; no shard depends directly on `upb`/`build_windows` anymore and no shard depends on another shard. All barriers depend solely on `upb` (Linux) / `build_windows`+`upb` (Windows), never on another barrier, so they start in parallel and the stagger comes purely from the differing sleep durations, with no serial completion chaining.

Linux barriers run on BUILD_GENERAL1_SMALL; Windows barriers inherit the default Windows container (SMALL is invalid for Windows). Barriers are excluded from `wait_for_ids.json` since they are synthetic and must not be polled by aggregate_e2e_reports; the wait list is computed before the barriers are injected.

Verified by regenerating e2e_workflow_generated.yml: 12 barriers (5 Linux, 7 Windows), per-wave shard counts 30/30/30/30/16 (Linux) and 20x6/4 (Windows), wave-1 shards gated by their barrier (0 shards depend directly on upb), 0 shard-to-shard deps, 0 barrier-to-barrier deps, and the yaml parses cleanly.

---
Prompt: refined wave staggering v2 (30 linux/20 windows, all waves gated, 5/10/15 staggers)
2d84869 to d6f7c8c
Description of changes
Parallel experiment that is a more aggressive refinement of the time-barrier wave
staggering approach (sibling to `feat/e2e-barrier-wave-staggering`), run on its own
branch/cache so the two experiments don't share a prep cache.
Refined wave-staggering design
The e2e CodeBuild batch fans out a large number of shards that previously all depended
directly on `upb` (Linux) or `build_windows`/`upb` (Windows). Dispatching that many
builds as a single instantaneous burst is unreliable, so this change spreads dispatch
over time.
Dispatch is staggered using time barriers rather than completion chaining, with two
refinements over the first variant:
- Every wave is gated, including wave 1, so even the first wave waits one interval. No shard depends directly on `upb`/`build_windows` anymore.
- Wave sizes are per-platform: Linux shards are grouped into waves of 30 (5 waves), Windows shards into waves of 20 (7 waves).
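The per-platform grouping can be sketched as a plain chunking step. This is a minimal illustration, not the generator's actual code: the constant names, the `split_into_waves` helper, and the shard identifiers are hypothetical, and the shard totals (136 Linux, 124 Windows) are inferred from the per-wave counts reported in the commit message.

```python
# Hypothetical constant names; the PR only says wave sizes are named constants.
LINUX_WAVE_SIZE = 30
WINDOWS_WAVE_SIZE = 20


def split_into_waves(shards, wave_size):
    """Chunk a list of shard identifiers into consecutive waves of at most wave_size."""
    return [shards[i:i + wave_size] for i in range(0, len(shards), wave_size)]


# Placeholder shard identifiers; totals inferred from 30/30/30/30/16 and 20x6/4.
linux_shards = [f"l_shard_{i}" for i in range(136)]
windows_shards = [f"w_shard_{i}" for i in range(124)]

linux_waves = split_into_waves(linux_shards, LINUX_WAVE_SIZE)
windows_waves = split_into_waves(windows_shards, WINDOWS_WAVE_SIZE)

print([len(w) for w in linux_waves])    # [30, 30, 30, 30, 16]
print([len(w) for w in windows_waves])  # [20, 20, 20, 20, 20, 20, 4]
```

The last wave on each platform is simply whatever remains, which is why it is smaller than the configured wave size.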
Each wave `k` is gated by a synthetic barrier build (`l_wave_barrier_k` / `w_wave_barrier_k`)
whose only command sleeps `k * 300` seconds (5m, 10m, 15m, ...).
Every shard depends ONLY on its barrier. All barriers depend solely on `upb` (Linux) /
`build_windows`+`upb` (Windows), never on another barrier or a prior wave, so they
start in parallel and the stagger comes purely from the differing sleep durations, with
no serial completion chaining that would risk the 240m batch timeout.
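A rough sketch of the barrier and dependency wiring described above. The `identifier` and `depend-on` keys mirror CodeBuild batch build-graph entries; `sleep_seconds`, `WAVE_INTERVAL_SECONDS`, and both helper functions are hypothetical stand-ins, since the PR text does not show how the generator encodes the sleep command.

```python
WAVE_INTERVAL_SECONDS = 300  # hypothetical name for the tunable interval


def build_barriers(prefix, wave_count, upstream):
    """One barrier per wave; each depends only on the upstream build(s)."""
    return [
        {
            "identifier": f"{prefix}_wave_barrier_{k}",
            "depend-on": list(upstream),
            "sleep_seconds": k * WAVE_INTERVAL_SECONDS,  # stand-in for the sleep command
        }
        for k in range(1, wave_count + 1)
    ]


def gate_shards(waves, prefix):
    """Make every shard in wave k depend only on that wave's barrier."""
    return [
        {"identifier": shard, "depend-on": [f"{prefix}_wave_barrier_{k}"]}
        for k, wave in enumerate(waves, start=1)
        for shard in wave
    ]


linux_barriers = build_barriers("l", 5, ["upb"])
windows_barriers = build_barriers("w", 7, ["build_windows", "upb"])

print([b["sleep_seconds"] for b in linux_barriers])  # [300, 600, 900, 1200, 1500]
print(windows_barriers[-1]["sleep_seconds"])         # 2100
```

Because no barrier waits on another barrier, the longest added delay is the largest single sleep (7 * 300 s = 35 minutes for the last Windows wave) rather than a sum over preceding waves.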
Linux barriers run on `BUILD_GENERAL1_SMALL`; Windows barriers inherit the default
Windows container (`SMALL` is invalid for Windows). Barriers are excluded from
`wait_for_ids.json` since they are synthetic and must not be polled by
`aggregate_e2e_reports`; the wait list is computed before the barriers are injected.

Wave sizes and the interval are tunable named constants at the top of the generator.
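The ordering that keeps barriers out of the wait list, and the per-platform compute type, might look roughly like this. It is a sketch under assumptions: the `barrier` helper and the shard identifiers are hypothetical, and it assumes the wait list is simply the list of shard identifiers, which the PR text does not spell out.

```python
import json


def barrier(prefix, k, upstream, compute_type=None):
    """Build a barrier entry; omit env so the build inherits the batch default."""
    entry = {"identifier": f"{prefix}_wave_barrier_{k}", "depend-on": list(upstream)}
    if compute_type is not None:
        entry["env"] = {"compute-type": compute_type}
    return entry


# Placeholder shards, already pointed at their barriers.
shards = [
    {"identifier": "l_example_shard", "depend-on": ["l_wave_barrier_1"]},
    {"identifier": "w_example_shard", "depend-on": ["w_wave_barrier_1"]},
]

# 1. Compute the wait list first, while only real shards exist.
wait_for_ids = [build["identifier"] for build in shards]

# 2. Only then inject the synthetic barriers into the build graph.
barriers = [
    barrier("l", 1, ["upb"], compute_type="BUILD_GENERAL1_SMALL"),
    barrier("w", 1, ["build_windows", "upb"]),  # no override on Windows
]
build_graph = barriers + shards

print(json.dumps(wait_for_ids))  # ["l_example_shard", "w_example_shard"]
```

Computing the list before injection means no filtering by name is needed later, so a renamed barrier cannot leak into the polled set.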
Issue #, if available
N/A — infra experiment.
Description of how you validated changes
Regenerated `codebuild_specs/e2e_workflow_generated.yml` and verified via a YAML parse:
- No shard depends directly on `upb`/`build_windows`
- Barrier sleeps are `k * 300`s (wave 1 = 300s ... Linux wave 5 = 1500s)
- `wait_for_ids.json` contains no barrier identifiers

Checklist
`yarn test` passes

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.