fix(amplify-e2e-tests): stagger e2e batch fan-out into waves to prevent codebuild orchestrator fault #14920

Draft
adrianjoshua-strutt wants to merge 1 commit into dev from fix/e2e-batch-stagger-fanout

Conversation

@adrianjoshua-strutt

Member

Description of changes

The generated CodeBuild batch workflow (codebuild_specs/e2e_workflow_generated.yml) has been FAULTing at the batch-orchestrator level ("Internal Service Error") roughly 25 minutes into every run since ~May 22. The batch reaches the e2e fan-out, then FAULTs and stops all downstream builds. Because the error is at the batch-orchestrator level (not an individual build), it is largely invisible in the per-build UI.

Root cause — single-gate thundering herd

The workflow funnels every e2e shard through one dependency gate:

  • ~141 Linux shards each declare depend-on: [upb]
  • ~124 Windows shards each declare depend-on: [build_windows, upb]

When upb completes, all ~141 Linux shards leave INITIALIZED within ~13s — a burst of hundreds of StartBuild calls in under a minute. That simultaneous-start burst trips the CodeBuild batch orchestrator. The individual builds are healthy (they reach PROVISIONING in seconds) and per-build concurrency quota is not the limiter. Setting the project concurrentBuildLimit does not fix it — batch builds are rejected (AccountLimitExceededException) rather than queued.

Fix — stagger the fan-out into waves

scripts/split-e2e-tests-codebuild.ts now groups the generated shards into waves of at most E2E_WAVE_SIZE (50):

  • Wave 1 keeps the original gate ([upb] for Linux, [build_windows, upb] for Windows).
  • Wave N (N > 1) depends solely on the last shard of wave N-1.

A wave fans out only once the previous wave's anchor shard has completed, so the instantaneous StartBuild burst is capped at about one wave (≤50) instead of the full shard count. Transitive ordering is preserved: every shard still runs strictly after upb (and after build_windows for Windows). Artifact access is therefore unchanged, because shards download the prebuilt CLI from S3 at runtime, and depend-on in a build-graph only orders builds; it does not pass artifacts. Linux and Windows are staggered independently. The regenerated e2e_workflow_generated.yml is a committed artifact, so it is committed alongside the generator change.
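For reviewers who want the wave rule in one place, here is a minimal sketch of the grouping described above. It is not the code in scripts/split-e2e-tests-codebuild.ts: the `Shard` type, the `staggerIntoWaves` function, and the `l_shard_N` identifiers are hypothetical names chosen for illustration. Only E2E_WAVE_SIZE and the gate names come from this PR.

```typescript
// Hypothetical shard shape; the real generator's types may differ.
type Shard = { identifier: string; dependOn: string[] };

// Wave size taken from the PR description.
const E2E_WAVE_SIZE = 50;

// Assigns depend-on edges so that wave 1 keeps the original gate and
// every later wave depends solely on the last shard of the previous wave.
function staggerIntoWaves(identifiers: string[], baseGate: string[], waveSize: number = E2E_WAVE_SIZE): Shard[] {
  if (waveSize < 1) {
    throw new Error('waveSize must be at least 1');
  }
  return identifiers.map((identifier, index) => {
    const wave = Math.floor(index / waveSize); // 0-based wave number
    const dependOn =
      wave === 0
        ? [...baseGate] // wave 1: original gate
        : [identifiers[wave * waveSize - 1]]; // anchor: last shard of the previous wave
    return { identifier, dependOn };
  });
}

// Illustrative input only: 136 placeholder Linux shard identifiers.
const linuxIds = Array.from({ length: 136 }, (_, i) => `l_shard_${i + 1}`);
const linuxShards = staggerIntoWaves(linuxIds, ['upb']);
```

Under these assumptions, calling the function once per OS (with `['upb']` for Linux and `['build_windows', 'upb']` for Windows) keeps the two chains independent, which matches the behavior described in this PR.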

Tradeoff for reviewer discussion (why this is a draft)

Chaining each wave on the previous wave's anchor shard means that if an anchor shard fails, CodeBuild does not run the dependent wave — downstream waves for that OS are skipped. e2e shards are occasionally flaky, so a flaky failure on an anchor shard could skip a wave. fast-fail: false is set so the rest of the batch still runs, and the e2e monitor already retries failed builds — but reviewers should confirm this tradeoff is acceptable versus, e.g., inserting always-succeeding barrier jobs between waves. Wave size is a single tunable constant.

This change complements a separate AWS Support case opened for the batch-orchestrator behavior; it is the robust, workflow-side mitigation that does not depend on a service-side fix.

Issue #, if available

N/A

Description of how you validated changes

  • yarn split-e2e-tests-codebuild regenerates the workflow with exit 0 (ts-node type-checks the script, confirming it compiles).
  • The regenerated e2e_workflow_generated.yml now shows three waves per OS:
    • Linux (136 shards): 50 → [upb]; 50 → [l_predictions_migration_..._api_connection_migration] (wave-1 anchor); 36 → [l_env_3] (wave-2 anchor)
    • Windows (124 shards): 50 → [build_windows, upb]; 50 → [w_auth_3b_auth_3a] (wave-1 anchor); 24 → [w_schema_key_resolvers] (wave-2 anchor)
  • No e2e anchor gate has more than 50 direct dependents. upb retains both OSes' wave-1 shards, but Windows wave 1 is additionally gated by build_windows, so upb completing releases at most ~50 same-OS (Linux) shards.
  • wait_for_ids.json is unchanged — shard identifiers (and the cache that keys off them) are preserved; only depend-on edges changed.
  • Commit passed the husky / lint-staged / commitlint hooks (prettier + eslint clean).
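The "no more than 50 direct dependents" check above could also be automated. The following is a rough sketch under assumptions, not part of this PR: `GraphNode`, `countDirectDependents`, and `gatesOverLimit` are hypothetical names, and the input is assumed to be the list of build-graph entries already parsed from the generated workflow.

```typescript
// Hypothetical, simplified view of one build-graph entry.
type GraphNode = { identifier: string; dependOn: string[] };

// Counts how many nodes list each gate directly in their depend-on.
function countDirectDependents(nodes: GraphNode[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const node of nodes) {
    for (const gate of node.dependOn) {
      counts.set(gate, (counts.get(gate) ?? 0) + 1);
    }
  }
  return counts;
}

// Returns the gates whose direct fan-out exceeds the given limit.
function gatesOverLimit(nodes: GraphNode[], limit: number): string[] {
  const over: string[] = [];
  countDirectDependents(nodes).forEach((count, gate) => {
    if (count > limit) {
      over.push(gate);
    }
  });
  return over;
}
```

With a limit of 50, such a check would still report upb, because upb gates wave 1 of both OSes; as noted above, Windows wave 1 is additionally held back by build_windows, so a real check would likely need to account for that case separately.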

Checklist

  • PR description included
  • yarn test passes — n/a: no unit tests exist for the workflow generator; validated via the regenerated artifact
  • Tests are changed or added — n/a: build-script change, verified via generated output
  • Relevant documentation is changed or added — n/a: no docs/ file maps to this script
  • Pull request labels are added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…nt codebuild orchestrator fault

The generated CodeBuild batch workflow funneled every e2e shard
through a single dependency gate: ~141 Linux shards all declared
`depend-on: [upb]` and ~124 Windows shards all declared `depend-on:
[build_windows, upb]`. When the gate completed, every shard left the
INITIALIZED state within seconds, producing a burst of hundreds of
StartBuild calls. That thundering herd trips the CodeBuild batch
orchestrator, which FAULTs roughly 25 minutes in and stops all
downstream builds. Raising the project concurrentBuildLimit does not
help because batch builds are rejected rather than queued.

This change staggers the fan-out into waves. Shards are grouped into
waves of at most E2E_WAVE_SIZE (50). Wave 1 keeps the original gate;
each later wave depends solely on the last shard of the previous
wave, so a wave only fans out once the previous wave's anchor shard
has completed. This caps the instantaneous StartBuild burst at
roughly the wave size instead of the full shard count, while
transitive ordering still guarantees every shard runs after the
prebuilt binaries are uploaded. Linux and Windows are staggered
independently so the Windows wave 1 keeps its build_windows gate.

Testing: ran `yarn split-e2e-tests-codebuild` to regenerate
codebuild_specs/e2e_workflow_generated.yml and confirmed the shards
now form three waves per OS (50 / 50 / 36 Linux, 50 / 50 / 24
Windows); each wave after the first depends on the prior wave's
anchor shard rather than upb, and no e2e anchor gate has more than
50 direct dependents.
---
Prompt: Implement dependency-graph staggering in
split-e2e-tests-codebuild.ts to fix the CodeBuild batch
thundering-herd FAULT. Root cause: the generated workflow funnels
~141 Linux e2e shards (and ~124 Windows shards) through a single
`depend-on: upb` gate, so all shards start at once and trip the
batch orchestrator (~25 min FAULT). Group the shards into waves
(size ~50, tunable constant); wave 1 keeps the original gate, wave
N>1 depends on the previous wave's last shard. Apply independently to
Linux and Windows (preserving build_windows). Build, regenerate and
commit the workflow, verify the waves, and open a draft PR against
dev.