fix(amplify-e2e-tests): stagger e2e batch fan-out into waves to prevent codebuild orchestrator fault #14920
Draft
adrianjoshua-strutt wants to merge 1 commit into
…nt codebuild orchestrator fault

The generated CodeBuild batch workflow funneled every e2e shard through a single dependency gate: ~141 Linux shards all declared `depend-on: [upb]` and ~124 Windows shards all declared `depend-on: [build_windows, upb]`. When the gate completed, every shard left the INITIALIZED state within seconds, producing a burst of hundreds of StartBuild calls. That thundering herd trips the CodeBuild batch orchestrator, which FAULTs roughly 25 minutes in and stops all downstream builds. Raising the project concurrentBuildLimit does not help because batch builds are rejected rather than queued.

This change staggers the fan-out into waves. Shards are grouped into waves of at most E2E_WAVE_SIZE (50). Wave 1 keeps the original gate; each later wave depends solely on the last shard of the previous wave, so a wave only fans out once the previous wave is already underway. This caps the instantaneous StartBuild burst at roughly the wave size instead of the full shard count, while transitive ordering still guarantees every shard runs after the prebuilt binaries are uploaded. Linux and Windows are staggered independently so the Windows wave 1 keeps its build_windows gate.

Testing: ran `yarn split-e2e-tests-codebuild` to regenerate codebuild_specs/e2e_workflow_generated.yml and confirmed the shards now form three waves per OS (50 / 50 / 36 Linux, 50 / 50 / 24 Windows); each wave after the first depends on the prior wave's anchor shard rather than upb, and no e2e anchor gate has more than 50 direct dependents.

---

Prompt: Implement dependency-graph staggering in split-e2e-tests-codebuild.ts to fix the CodeBuild batch thundering-herd FAULT. Root cause: the generated workflow funnels ~141 Linux e2e shards (and ~124 Windows shards) through a single `depend-on: upb` gate, so all shards start at once and trip the batch orchestrator (~25 min FAULT). Group the shards into waves (size ~50, tunable constant); wave 1 keeps the original gate, wave N>1 depends on the previous wave's last shard. Apply independently to Linux and Windows (preserving build_windows). Build, regenerate and commit the workflow, verify the waves, and open a draft PR against dev.
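The wave grouping described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `split-e2e-tests-codebuild.ts`: the `Shard` shape, the `staggerIntoWaves` function name, and the `l_shard_N` / `w_shard_N` identifiers are all hypothetical. Only the wave size of 50, the two gates, and the shard counts (136 Linux, 124 Windows) are taken from the text above.

```typescript
// Hypothetical shard shape; the real generator's types may differ.
interface Shard {
  identifier: string;
  dependOn: string[];
}

// The PR names this constant E2E_WAVE_SIZE and sets it to 50.
const E2E_WAVE_SIZE = 50;

/**
 * Assigns depend-on edges wave by wave.
 * Wave 1 keeps the original gate; every later wave depends only on the
 * last shard (the "anchor") of the wave before it.
 */
function staggerIntoWaves(
  identifiers: string[],
  gate: string[],
  waveSize: number = E2E_WAVE_SIZE,
): Shard[] {
  if (waveSize < 1) {
    throw new Error('waveSize must be at least 1');
  }
  return identifiers.map((identifier, index) => {
    const wave = Math.floor(index / waveSize);
    // The previous wave ends at index wave * waveSize - 1.
    const dependOn = wave === 0 ? gate.slice() : [identifiers[wave * waveSize - 1]];
    return { identifier, dependOn };
  });
}

// Placeholder identifiers (assumption): real shard names look different.
const linuxIds: string[] = [];
for (let i = 1; i <= 136; i++) linuxIds.push(`l_shard_${i}`);
const windowsIds: string[] = [];
for (let i = 1; i <= 124; i++) windowsIds.push(`w_shard_${i}`);

// Each OS is staggered independently, so Windows wave 1 keeps build_windows.
const linuxShards = staggerIntoWaves(linuxIds, ['upb']);
const windowsShards = staggerIntoWaves(windowsIds, ['build_windows', 'upb']);
```

With these inputs the Linux shards fall into waves of 50, 50, and 36, and the Windows shards into 50, 50, and 24, matching the counts reported under Testing. Only the first wave of each OS references the original gate directly.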
Description of changes
The generated CodeBuild batch workflow (`codebuild_specs/e2e_workflow_generated.yml`) has been FAULTing at the batch-orchestrator level ("Internal Service Error") roughly 25 minutes into every run since ~May 22. The batch reaches the e2e fan-out, then FAULTs and stops all downstream builds. Because the error is at the batch-orchestrator level (not an individual build), it is largely invisible in the per-build UI.

Root cause: single-gate thundering herd
The workflow funnels every e2e shard through one dependency gate:

- Linux e2e shards (~141): `depend-on: [upb]`
- Windows e2e shards (~124): `depend-on: [build_windows, upb]`

When `upb` completes, all ~141 Linux shards leave `INITIALIZED` within ~13s, a burst of hundreds of `StartBuild` calls in under a minute. That simultaneous-start burst trips the CodeBuild batch orchestrator. The individual builds are healthy (they reach `PROVISIONING` in seconds) and per-build concurrency quota is not the limiter. Setting the project `concurrentBuildLimit` does not fix it: batch builds are rejected (`AccountLimitExceededException`) rather than queued.

Fix: stagger the fan-out into waves
`scripts/split-e2e-tests-codebuild.ts` now groups the generated shards into waves of at most `E2E_WAVE_SIZE` (50):

- Wave 1 keeps the original gate (`[upb]` for Linux, `[build_windows, upb]` for Windows).
- Each later wave depends solely on the last shard (the anchor) of the previous wave.

A wave only fans out once the previous wave's anchor shard has completed, so the instantaneous `StartBuild` burst is capped at ~one wave (≤50) instead of the full shard count. Transitive ordering is preserved: every shard still runs strictly after `upb` (and `build_windows` for Windows), so artifact access is unchanged (shards download the prebuilt CLI from S3 at runtime; `depend-on` in a build-graph is ordering-only and does not pass artifacts). Linux and Windows are staggered independently. The regenerated `e2e_workflow_generated.yml` is committed alongside the generator change (it is a committed artifact).

Tradeoff for reviewer discussion (why this is a draft)
Chaining each wave on the previous wave's anchor shard means that if an anchor shard fails, CodeBuild does not run the dependent wave, so downstream waves for that OS are skipped. e2e shards are occasionally flaky, so a flaky failure on an anchor shard could skip a wave. `fast-fail: false` is set so the rest of the batch still runs, and the e2e monitor already retries failed builds, but reviewers should confirm this tradeoff is acceptable versus, e.g., inserting always-succeeding barrier jobs between waves. Wave size is a single tunable constant.

This change complements a separate AWS Support case opened for the batch-orchestrator behavior; it is the robust, workflow-side mitigation that does not depend on a service-side fix.
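To make the tradeoff concrete, the sketch below computes which shards would be skipped if a given shard failed. It relies on the behavior stated above, that a failed build prevents its dependents from running. The `DependencyGraph` type, the `skippedIfFails` function, and the toy identifiers are hypothetical and use a wave size of 2 for brevity; none of this is part of the generator.

```typescript
// Hypothetical graph shape: shard identifier -> identifiers it depends on.
type DependencyGraph = Record<string, string[]>;

/**
 * Returns every shard that transitively depends on `failed`, assuming
 * (as the PR describes) that a failed build blocks its dependents.
 */
function skippedIfFails(graph: DependencyGraph, failed: string): string[] {
  const blocked: Record<string, boolean> = {};
  blocked[failed] = true;
  const skipped: string[] = [];
  let changed = true;
  // Repeat until no new shard becomes blocked (a simple fixed-point pass).
  while (changed) {
    changed = false;
    for (const id of Object.keys(graph)) {
      if (blocked[id]) continue;
      if (graph[id].some((dep) => blocked[dep] === true)) {
        blocked[id] = true;
        skipped.push(id);
        changed = true;
      }
    }
  }
  return skipped;
}

// Toy graph with a wave size of 2: a2 anchors wave 1, b2 anchors wave 2.
const toyGraph: DependencyGraph = {
  upb: [],
  a1: ['upb'],
  a2: ['upb'],
  b1: ['a2'],
  b2: ['a2'],
  c1: ['b2'],
};

const skippedAfterAnchorFailure = skippedIfFails(toyGraph, 'a2');
const skippedAfterNonAnchorFailure = skippedIfFails(toyGraph, 'a1');
```

In the toy graph, a failure of the anchor `a2` skips `b1`, `b2`, and `c1`, while a failure of the non-anchor `a1` skips nothing. That asymmetry is why an always-succeeding barrier job between waves is raised as an alternative.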
Issue #, if available
N/A
Description of how you validated changes
- `yarn split-e2e-tests-codebuild` regenerates the workflow with exit 0 (ts-node type-checks the script, confirming it compiles).
- `e2e_workflow_generated.yml` now shows three waves per OS:
  - Linux: 50 → `[upb]`; 50 → `[l_predictions_migration_..._api_connection_migration]` (wave-1 anchor); 36 → `[l_env_3]` (wave-2 anchor)
  - Windows: 50 → `[build_windows, upb]`; 50 → `[w_auth_3b_auth_3a]` (wave-1 anchor); 24 → `[w_schema_key_resolvers]` (wave-2 anchor)
- `upb` retains both OSes' wave-1 shards, but Windows wave 1 is additionally gated by `build_windows`, so `upb` completing releases at most ~50 same-OS (Linux) shards.
- `wait_for_ids.json` is unchanged: shard identifiers (and the cache that keys off them) are preserved; only `depend-on` edges changed.

Checklist
- `yarn test` passes (n/a: no unit tests exist for the workflow generator; validated via the regenerated artifact)
- … `docs/` file maps to this script

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.