feat(cli): refined e2e wave staggering (per-platform wave sizes, all waves gated)#14936

Draft
sarayev wants to merge 1 commit into dev from feat/e2e-wave-staggering-v2

sarayev (Contributor) commented Jun 23, 2026

Description of changes

This is a parallel experiment: a more aggressive refinement of the time-barrier
wave-staggering approach (sibling to feat/e2e-barrier-wave-staggering). It runs on its
own branch and cache so that the two experiments do not share a prep cache.

Refined wave-staggering design

The e2e CodeBuild batch fans out a large number of shards that previously all depended
directly on upb (Linux) or build_windows/upb (Windows). Dispatching that many
builds as a single instantaneous burst is unreliable.

This change spreads dispatch over time using time barriers rather than completion
chaining, with two refinements over the first variant:

  • Every wave is gated — including wave 1. No shard fans out immediately; even the
    first wave waits one interval. No shard depends directly on upb/build_windows
    anymore.
  • Per-platform wave sizes. Linux shards are grouped into waves of 30 (5 waves);
    Windows shards into waves of 20 (7 waves).
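
The per-platform grouping above can be sketched as a simple chunking step. This is a minimal illustration, not the generator's actual code: the constant names, the `split_into_waves` helper, and the shard identifiers are hypothetical, and the shard totals (136 Linux, 124 Windows) are inferred from the per-wave counts reported in the validation section.

```python
from typing import List

# Hypothetical names; the PR only states that wave sizes are tunable constants.
LINUX_WAVE_SIZE = 30
WINDOWS_WAVE_SIZE = 20


def split_into_waves(shards: List[str], wave_size: int) -> List[List[str]]:
    """Chunk shards, in order, into consecutive waves of at most wave_size."""
    if wave_size <= 0:
        raise ValueError("wave_size must be positive")
    return [shards[i:i + wave_size] for i in range(0, len(shards), wave_size)]


# Placeholder shard identifiers; totals inferred from the reported wave counts.
linux_shards = [f"l_shard_{n}" for n in range(136)]
windows_shards = [f"w_shard_{n}" for n in range(124)]

linux_waves = split_into_waves(linux_shards, LINUX_WAVE_SIZE)
windows_waves = split_into_waves(windows_shards, WINDOWS_WAVE_SIZE)

print([len(w) for w in linux_waves])    # [30, 30, 30, 30, 16]
print([len(w) for w in windows_waves])  # [20, 20, 20, 20, 20, 20, 4]
```

With these totals the last wave on each platform is a partial one, which is why the wave counts come out to 5 and 7 rather than dividing evenly.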

Each wave k is gated by a synthetic barrier build (l_wave_barrier_k /
w_wave_barrier_k) whose only command sleeps k * 300 seconds (5m, 10m, 15m, ...).
Every shard depends only on its barrier. All barriers depend solely on upb (Linux) or
build_windows+upb (Windows), never on another barrier or a prior wave, so they all
start in parallel once those prerequisites finish. The stagger comes purely from the
differing sleep durations; there is no serial completion chaining that would risk the
240m batch timeout.
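
A rough sketch of the barrier wiring described above, under stated assumptions: the `Build` dataclass, the helper names, and `WAVE_INTERVAL_SECONDS` are hypothetical stand-ins, and the real generator emits CodeBuild build-graph entries rather than Python objects. Only the barrier identifiers, the dependency shape, and the k * 300 sleep come from the description.

```python
from dataclasses import dataclass
from typing import List

WAVE_INTERVAL_SECONDS = 300  # hypothetical name for the tunable interval


@dataclass
class Build:
    identifier: str
    depend_on: List[str]
    sleep_seconds: int = 0  # non-zero only for synthetic barriers


def make_barriers(prefix: str, wave_count: int, roots: List[str]) -> List[Build]:
    """One barrier per wave; each depends only on the roots, never on another barrier."""
    return [
        Build(f"{prefix}_wave_barrier_{k}", list(roots), k * WAVE_INTERVAL_SECONDS)
        for k in range(1, wave_count + 1)
    ]


def gate_shards(waves: List[List[str]], barriers: List[Build]) -> List[Build]:
    """Make every shard in wave k depend only on barrier k."""
    gated = []
    for barrier, wave in zip(barriers, waves):
        gated.extend(Build(shard, [barrier.identifier]) for shard in wave)
    return gated


linux_barriers = make_barriers("l", 5, ["upb"])
windows_barriers = make_barriers("w", 7, ["build_windows", "upb"])

print([b.sleep_seconds for b in linux_barriers])            # [300, 600, 900, 1200, 1500]
print(max(b.sleep_seconds for b in windows_barriers) / 60)  # 35.0
```

Because no barrier waits on another, the longest added delay is the largest single sleep (2100 s, or 35 minutes, for Windows wave 7) rather than the sum of all sleeps, which is what keeps the stagger well inside the 240m batch timeout.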

Linux barriers run on BUILD_GENERAL1_SMALL; Windows barriers inherit the default
Windows container (SMALL is invalid for Windows). Barriers are excluded from
wait_for_ids.json since they are synthetic and must not be polled by
aggregate_e2e_reports; the wait list is computed before the barriers are injected.
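
The ordering constraint in the paragraph above (compute the wait list first, inject barriers second) might look like the following. This is an illustrative sketch only: `barrier_env` and `assemble` are hypothetical helpers, the shard identifiers are placeholders, and treating the wait list as exactly the shard identifiers is an assumption about what wait_for_ids.json holds.

```python
from typing import Dict, List, Tuple


def barrier_env(platform: str) -> Dict[str, str]:
    """Linux barriers pin the small compute type; Windows barriers set nothing
    here and so inherit the default Windows container."""
    if platform == "linux":
        return {"compute-type": "BUILD_GENERAL1_SMALL"}
    return {}


def assemble(shard_ids: List[str], barrier_ids: List[str]) -> Tuple[List[str], List[str]]:
    """Return (graph identifiers, wait list)."""
    # Snapshot the wait list BEFORE barriers are added, so synthetic
    # barrier identifiers can never leak into it.
    wait_for_ids = list(shard_ids)
    graph_ids = list(barrier_ids) + list(shard_ids)
    return graph_ids, wait_for_ids


graph_ids, wait_for_ids = assemble(
    ["l_shard_0", "w_shard_0"],                # placeholder shard identifiers
    ["l_wave_barrier_1", "w_wave_barrier_1"],
)
print(wait_for_ids)  # ['l_shard_0', 'w_shard_0']
```

Taking the snapshot before injection avoids any need to filter barrier identifiers out afterwards.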

Wave sizes and the interval are tunable named constants at the top of the generator.

Issue #, if available

N/A — infra experiment.

Description of how you validated changes

Regenerated codebuild_specs/e2e_workflow_generated.yml and verified via a YAML parse:

  • 12 barriers total (5 Linux, 7 Windows)
  • Per-wave shard counts: Linux 30/30/30/30/16, Windows 20×6 + 4
  • Wave-1 shards gated by their barrier; 0 shards depend directly on upb/build_windows
  • 0 shard-to-shard dependencies
  • 0 barrier-to-barrier dependencies
  • Barrier sleeps follow k * 300s (wave 1 = 300s ... Linux wave 5 = 1500s)
  • wait_for_ids.json contains no barrier identifiers
  • YAML parses cleanly
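
For reference, the dependency checks listed above could be expressed roughly as follows. This is a sketch, not the script actually used: it assumes the parsed YAML yields a list of build-graph entries with `identifier` and `depend-on` keys, that barriers are recognizable by name, and that every entry which is neither a root nor a barrier is a shard. The `check_graph` helper and the tiny sample graph are hypothetical.

```python
import re
from typing import Dict, List

BARRIER_RE = re.compile(r"^[lw]_wave_barrier_\d+$")
ROOTS = {"upb", "build_windows"}


def check_graph(build_graph: List[dict]) -> Dict[str, int]:
    """Count barriers and the dependency violations the PR expects to be zero."""
    barrier_ids = {
        b["identifier"] for b in build_graph if BARRIER_RE.match(b["identifier"])
    }
    counts = {
        "barriers": len(barrier_ids),
        "shards_on_roots": 0,
        "shard_to_shard": 0,
        "barrier_to_barrier": 0,
    }
    for build in build_graph:
        ident = build["identifier"]
        if ident in ROOTS:
            continue
        deps = build.get("depend-on", [])
        if ident in barrier_ids:
            counts["barrier_to_barrier"] += sum(d in barrier_ids for d in deps)
        else:
            # Assumption: anything that is not a root or a barrier is a shard.
            counts["shards_on_roots"] += int(any(d in ROOTS for d in deps))
            counts["shard_to_shard"] += sum(
                d not in barrier_ids and d not in ROOTS for d in deps
            )
    return counts


# Tiny hand-written graph with placeholder shard identifiers.
sample = [
    {"identifier": "upb"},
    {"identifier": "build_windows"},
    {"identifier": "l_wave_barrier_1", "depend-on": ["upb"]},
    {"identifier": "w_wave_barrier_1", "depend-on": ["build_windows", "upb"]},
    {"identifier": "l_shard_0", "depend-on": ["l_wave_barrier_1"]},
    {"identifier": "w_shard_0", "depend-on": ["w_wave_barrier_1"]},
]
print(check_graph(sample))
```

On the regenerated workflow the expected result would be 12 barriers and zero for each of the three violation counts.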

Checklist

  • PR description included
  • yarn test passes
  • Tests are changed or added
  • Relevant documentation is changed or added (and PR referenced)
  • New AWS SDK calls or CloudFormation actions have been added to relevant test and service IAM policies
  • Pull request labels are added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…waves gated)

A more aggressive variant of the time-barrier wave staggering, run as a
parallel experiment. Instead of letting wave 1 fan out immediately,
EVERY wave — including wave 1 — is now gated behind a synthetic barrier
build that only sleeps, so nothing dispatches the instant `upb`/
`build_windows` completes.

Wave sizes are now per-platform: Linux shards are grouped into waves of
30 (5 waves), Windows shards into waves of 20 (7 waves). Each wave k is
gated by a barrier `l_wave_barrier_k` / `w_wave_barrier_k` whose only
command sleeps `k * 300` seconds (5m, 10m, 15m, ...). Every shard depends
ONLY on its barrier; no shard depends directly on `upb`/`build_windows`
anymore and no shard depends on another shard. All barriers depend solely
on `upb` (Linux) / `build_windows`+`upb` (Windows) — never on another
barrier — so they start in parallel and the stagger comes purely from the
differing sleep durations, with no serial completion chaining.

Linux barriers run on BUILD_GENERAL1_SMALL; Windows barriers inherit the
default Windows container (SMALL is invalid for Windows). Barriers are
excluded from `wait_for_ids.json` since they are synthetic and must not be
polled by aggregate_e2e_reports; the wait list is computed before the
barriers are injected.

Verified by regenerating e2e_workflow_generated.yml: 12 barriers (5
Linux, 7 Windows), per-wave shard counts 30/30/30/30/16 (Linux) and
20x6 + 4 (Windows), wave-1 shards gated by their barrier (0 shards depend
directly on upb/build_windows), 0 shard-to-shard deps, 0 barrier-to-barrier
deps, and the YAML parses cleanly.

---
Prompt: refined wave staggering v2 (30 linux/20 windows, all waves gated, 5/10/15 staggers)
sarayev force-pushed the feat/e2e-wave-staggering-v2 branch from 2d84869 to d6f7c8c on June 23, 2026 12:59