fix(amplify-e2e-tests): stagger e2e batch fan-out into waves to prevent codebuild orchestrator fault by adrianjoshua-strutt · Pull Request #14920 · aws-amplify/amplify-cli

adrianjoshua-strutt · 2026-06-17T13:31:12Z

Description of changes

The generated CodeBuild batch workflow (codebuild_specs/e2e_workflow_generated.yml) has been FAULTing at the batch-orchestrator level ("Internal Service Error") roughly 25 minutes into every run since ~May 22. The batch reaches the e2e fan-out, then FAULTs and stops all downstream builds. Because the error is at the batch-orchestrator level (not an individual build), it is largely invisible in the per-build UI.

Root cause — single-gate thundering herd

The workflow funnels every e2e shard through one dependency gate:

~141 Linux shards each declare depend-on: [upb]
~124 Windows shards each declare depend-on: [build_windows, upb]

When upb completes, all ~141 Linux shards leave INITIALIZED within ~13s, producing a burst of hundreds of StartBuild calls in under a minute. That simultaneous-start burst trips the CodeBuild batch orchestrator. The individual builds are healthy (they reach PROVISIONING in seconds), and per-build concurrency quota is not the limiter. Setting the project concurrentBuildLimit does not fix it, because batch builds are then rejected (AccountLimitExceededException) rather than queued.

Fix — stagger the fan-out into waves

scripts/split-e2e-tests-codebuild.ts now groups the generated shards into waves of at most E2E_WAVE_SIZE (50):

Wave 1 keeps the original gate ([upb] for Linux, [build_windows, upb] for Windows).
Wave N (N > 1) depends solely on the last shard of wave N-1.

A wave fans out only once the previous wave's anchor shard has completed, so the instantaneous StartBuild burst is capped at roughly one wave (≤50) instead of the full shard count. Transitive ordering is preserved: every shard still runs strictly after upb (and after build_windows for Windows). Artifact access is therefore unchanged, since shards download the prebuilt CLI from S3 at runtime, and depend-on in a build-graph is ordering-only and does not pass artifacts. Linux and Windows are staggered independently. The regenerated e2e_workflow_generated.yml, which is a committed artifact, is included alongside the generator change.
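The wave rule above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/split-e2e-tests-codebuild.ts: apart from E2E_WAVE_SIZE, every name here (the Shard shape, assignWaves, the l_shard_N / w_shard_N identifiers) is hypothetical.

```typescript
// Hypothetical sketch of the wave rule; only E2E_WAVE_SIZE is named in the PR.
const E2E_WAVE_SIZE = 50;

interface Shard {
  identifier: string;
  dependOn: string[]; // mirrors the `depend-on` list in the generated workflow
}

function assignWaves(
  identifiers: string[],
  firstWaveGate: string[],
  waveSize: number = E2E_WAVE_SIZE,
): Shard[] {
  if (waveSize < 1) {
    throw new RangeError('waveSize must be at least 1');
  }
  return identifiers.map((identifier, index) => {
    const wave = Math.floor(index / waveSize); // 0-based wave index
    // Wave 1 keeps the original gate; wave N > 1 depends solely on the
    // last shard of wave N-1 (the anchor).
    const dependOn =
      wave === 0 ? [...firstWaveGate] : [identifiers[wave * waveSize - 1]];
    return { identifier, dependOn };
  });
}

// Placeholder identifiers; the real shard names come from the generator.
const linuxIds = Array.from({ length: 136 }, (_, i) => `l_shard_${i + 1}`);
const windowsIds = Array.from({ length: 124 }, (_, i) => `w_shard_${i + 1}`);

// Each OS is staggered independently, with its own wave-1 gate.
const linuxShards = assignWaves(linuxIds, ['upb']);
const windowsShards = assignWaves(windowsIds, ['build_windows', 'upb']);
```

With these placeholder inputs, the first 50 shards per OS keep the original gate, shards 51 to 100 depend on shard 50, and the remainder depend on shard 100.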

Tradeoff for reviewer discussion (why this is a draft)

Chaining each wave on the previous wave's anchor shard means that if an anchor shard fails, CodeBuild does not run the dependent wave — downstream waves for that OS are skipped. e2e shards are occasionally flaky, so a flaky failure on an anchor shard could skip a wave. fast-fail: false is set so the rest of the batch still runs, and the e2e monitor already retries failed builds — but reviewers should confirm this tradeoff is acceptable versus, e.g., inserting always-succeeding barrier jobs between waves. Wave size is a single tunable constant.
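To make the tradeoff concrete, the number of shards skipped by a single anchor failure can be estimated with simple arithmetic. The helper below is a hypothetical sketch (skippedIfAnchorFails is not part of the generator). It assumes, as described above, that every later wave for that OS is skipped when an anchor fails.

```typescript
// Hypothetical helper: how many shards are skipped if the anchor (last shard)
// of a given 1-based wave fails, assuming all later waves depend on it
// directly or transitively and are therefore not run.
function skippedIfAnchorFails(
  totalShards: number,
  waveSize: number,
  failedWave: number,
): number {
  const waveCount = Math.ceil(totalShards / waveSize);
  if (failedWave < 1 || failedWave > waveCount) {
    throw new RangeError(`failedWave must be between 1 and ${waveCount}`);
  }
  // Shards in waves after the failed one; the final wave has no dependents.
  return Math.max(0, totalShards - failedWave * waveSize);
}

// Shard counts taken from the validation section of this PR.
const linuxWave1 = skippedIfAnchorFails(136, 50, 1); // 86
const linuxWave2 = skippedIfAnchorFails(136, 50, 2); // 36
const windowsWave1 = skippedIfAnchorFails(124, 50, 1); // 74
const windowsWave2 = skippedIfAnchorFails(124, 50, 2); // 24
```

Under these assumptions, a flaky failure on the wave-1 Linux anchor would skip 86 of 136 Linux shards, which is the cost reviewers are asked to weigh against barrier jobs.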

This change complements a separate AWS Support case opened for the batch-orchestrator behavior; it is the robust, workflow-side mitigation that does not depend on a service-side fix.

Issue #, if available

N/A

Description of how you validated changes

yarn split-e2e-tests-codebuild regenerates the workflow with exit 0 (ts-node type-checks the script, confirming it compiles).
The regenerated e2e_workflow_generated.yml now shows three waves per OS:
- Linux (136 shards): 50 → [upb]; 50 → [l_predictions_migration_..._api_connection_migration] (wave-1 anchor); 36 → [l_env_3] (wave-2 anchor)
- Windows (124 shards): 50 → [build_windows, upb]; 50 → [w_auth_3b_auth_3a] (wave-1 anchor); 24 → [w_schema_key_resolvers] (wave-2 anchor)
No e2e anchor gate has more than 50 direct dependents. upb still has both OSes' wave-1 shards as direct dependents, but Windows wave 1 is additionally gated by build_windows, so upb completing releases at most ~50 same-OS (Linux) shards.
wait_for_ids.json is unchanged — shard identifiers (and the cache that keys off them) are preserved; only depend-on edges changed.
Commit passed the husky / lint-staged / commitlint hooks (prettier + eslint clean).
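The fan-out check described above could be automated along these lines. This is a sketch under stated assumptions: the GraphNode shape, countDirectDependents, and syntheticWaves are hypothetical, and the graph is synthesized from the shard counts in this PR rather than parsed from e2e_workflow_generated.yml.

```typescript
// Hypothetical shape mirroring `identifier` / `depend-on` in the build-graph.
interface GraphNode {
  identifier: string;
  dependOn: string[];
}

// Count how many builds list each gate directly in their depend-on.
function countDirectDependents(graph: GraphNode[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const node of graph) {
    for (const gate of node.dependOn) {
      counts.set(gate, (counts.get(gate) ?? 0) + 1);
    }
  }
  return counts;
}

// Synthetic stand-in for one OS's shards, using placeholder identifiers.
function syntheticWaves(
  prefix: string,
  total: number,
  gate: string[],
  waveSize: number,
): GraphNode[] {
  const ids = Array.from({ length: total }, (_, i) => `${prefix}_${i + 1}`);
  return ids.map((identifier, i) => ({
    identifier,
    dependOn:
      i < waveSize ? [...gate] : [ids[Math.floor(i / waveSize) * waveSize - 1]],
  }));
}

const graph = [
  ...syntheticWaves('l', 136, ['upb'], 50),
  ...syntheticWaves('w', 124, ['build_windows', 'upb'], 50),
];
const counts = countDirectDependents(graph);

// Largest fan-out among e2e anchor gates (excluding upb and build_windows).
const maxAnchorFanOut = Math.max(
  ...[...counts.entries()]
    .filter(([gate]) => gate !== 'upb' && gate !== 'build_windows')
    .map(([, n]) => n),
);
```

On this synthetic graph the anchors have 50, 36, 50, and 24 direct dependents, while upb has 100 (both OSes' wave 1) and build_windows has 50.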

Checklist

PR description included
yarn test passes — n/a: no unit tests exist for the workflow generator; validated via the regenerated artifact
Tests are changed or added — n/a: build-script change, verified via generated output
Relevant documentation is changed or added — n/a: no docs/ file maps to this script
Pull request labels are added

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…nt codebuild orchestrator fault

The generated CodeBuild batch workflow funneled every e2e shard through a single dependency gate: ~141 Linux shards all declared `depend-on: [upb]` and ~124 Windows shards all declared `depend-on: [build_windows, upb]`. When the gate completed, every shard left the INITIALIZED state within seconds, producing a burst of hundreds of StartBuild calls. That thundering herd trips the CodeBuild batch orchestrator, which FAULTs roughly 25 minutes in and stops all downstream builds. Raising the project concurrentBuildLimit does not help because batch builds are rejected rather than queued.

This change staggers the fan-out into waves. Shards are grouped into waves of at most E2E_WAVE_SIZE (50). Wave 1 keeps the original gate; each later wave depends solely on the last shard of the previous wave, so a wave only fans out once the previous wave is already underway. This caps the instantaneous StartBuild burst at roughly the wave size instead of the full shard count, while transitive ordering still guarantees every shard runs after the prebuilt binaries are uploaded. Linux and Windows are staggered independently so the Windows wave 1 keeps its build_windows gate.

Testing: ran `yarn split-e2e-tests-codebuild` to regenerate codebuild_specs/e2e_workflow_generated.yml and confirmed the shards now form three waves per OS (50 / 50 / 36 Linux, 50 / 50 / 24 Windows); each wave after the first depends on the prior wave's anchor shard rather than upb, and no e2e anchor gate has more than 50 direct dependents.

---

Prompt: Implement dependency-graph staggering in split-e2e-tests-codebuild.ts to fix the CodeBuild batch thundering-herd FAULT. Root cause: the generated workflow funnels ~141 Linux e2e shards (and ~124 Windows shards) through a single `depend-on: upb` gate, so all shards start at once and trip the batch orchestrator (~25 min FAULT). Group the shards into waves (size ~50, tunable constant); wave 1 keeps the original gate, wave N>1 depends on the previous wave's last shard. Apply independently to Linux and Windows (preserving build_windows). Build, regenerate and commit the workflow, verify the waves, and open a draft PR against dev.

adrianjoshua-strutt wants to merge 1 commit into dev from fix/e2e-batch-stagger-fanout
