ci: restructure deploy pipeline for ACR bootstrap #294

Merged
peteroden merged 20 commits into main from ci/283-deploy-pipeline-restructure
Mar 16, 2026

@peteroden
Owner

What

Restructures the deploy workflow from 2 jobs to 3 jobs to solve the
bootstrap problem: ACR must exist before we can push images.

Changes

infra → build-push → deploy
  1. infra: Terraform apply with placeholder images — creates ACR, VNet, KV, storage, container environment
  2. build-push: Builds Docker image, pushes to the ACR created in step 1
  3. deploy: Terraform apply with real image refs — updates container apps

Also removes stale acr_login_server, controller_image, job_image from dev.tfvars (now provided via -var flags).

Part of #283

Split the 2-job pipeline into 3 jobs:
1. infra: Terraform apply with placeholder images (creates ACR, VNet,
   Key Vault, storage, container app environment, etc.)
2. build-push: Build Docker image and push to the ACR created in step 1
3. deploy: Terraform apply again with real image references

This fixes the bootstrap problem where ACR didn't exist yet on first
deploy. The infra job outputs the ACR login server for build-push.

Also adds terraform_wrapper: false to the infra job so terraform output
commands return raw values without wrapper decoration.

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace subscription_id and jira_email with __PLACEHOLDERS__ that the
deploy pipeline substitutes from GitHub variables at runtime. This keeps
the tfvars committable without exposing personal data.

- Add 'Substitute tfvars placeholders' step to both infra and deploy jobs
- Change .gitignore from *.tfvars to *.tfvars.local (allow committing)
- Set JIRA_EMAIL as GitHub repo variable
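
The substitution step described above could look roughly like the sketch below. The placeholder names `__SUBSCRIPTION_ID__` and `__JIRA_EMAIL__`, the function name, and the output file are assumptions for illustration; the commit only states that `__PLACEHOLDERS__` are replaced from GitHub variables at runtime.

```shell
# Sketch: render a tfvars file by replacing __PLACEHOLDER__ tokens.
# Placeholder names and the function name are hypothetical.
substitute_tfvars() {
  local file=$1 subscription_id=$2 jira_email=$3
  # '|' is the sed delimiter; values containing '|', '&' or '\'
  # would need to be escaped before being passed in.
  sed -e "s|__SUBSCRIPTION_ID__|${subscription_id}|g" \
      -e "s|__JIRA_EMAIL__|${jira_email}|g" \
      "$file"
}
```

A pipeline step might call it as `substitute_tfvars dev.tfvars "$SUBSCRIPTION_ID" "$JIRA_EMAIL" > rendered.tfvars`, with both values supplied through the step's environment.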

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Developer Agent and others added 3 commits March 16, 2026 02:30
After enabling public access on the TF state storage account, poll
with az storage container list until the endpoint is reachable (5s
intervals, 60s timeout). Replaces the blind sleep 30, which was unreliable.
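
A poll-with-timeout loop of this kind can be sketched as a small helper. The function name `wait_until` and the way the probe command is passed in are assumptions, not the workflow's actual code.

```shell
# Sketch: run a probe command every <interval> seconds until it succeeds
# or <timeout> seconds of waiting have accumulated. Hypothetical helper.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local elapsed=0
  until "$@"; do
    # Give up once the accumulated wait reaches the timeout.
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

With the values from the commit message this would be invoked as `wait_until 60 5 az storage container list --account-name "$ACCOUNT" --auth-mode login --output none`, where `ACCOUNT` is a placeholder for the state storage account name.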

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The CI/CD service principal needs AcrPush on the container registry to
push Docker images during the build-push job. Uses the same
data.azurerm_client_config.current.object_id pattern as deployer_kv.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Azure policy enforces shared_access_key_enabled=false. The provider
needs storage_use_azuread=true to use Azure AD for storage data plane
operations (queue creation, blob container creation).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace single KV_BOOTSTRAP_SECRETS JSON blob with individual secrets
(GITLAB_TOKEN, GH_PAT_FOR_COPILOT, JIRA_API_TOKEN). Assembled into
TF_VAR_kv_bootstrap_secrets env var at apply time. Easier to rotate
and clearer in the GitHub UI.
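
One way to assemble the individual secrets into a single value is with `jq`, which also takes care of JSON escaping. This is only a sketch: the map keys (`gitlab-token` and so on) and the function name are hypothetical, since the real key names are not shown in this PR.

```shell
# Sketch: build one JSON object from three individually stored secrets.
# Map keys and the function name are assumptions for illustration.
build_kv_bootstrap_secrets() {
  jq -cn \
    --arg gitlab "$GITLAB_TOKEN" \
    --arg ghpat "$GH_PAT_FOR_COPILOT" \
    --arg jira "$JIRA_API_TOKEN" \
    '{"gitlab-token": $gitlab, "gh-pat-for-copilot": $ghpat, "jira-api-token": $jira}'
}
```

The apply step could then run `export TF_VAR_kv_bootstrap_secrets="$(build_kv_bootstrap_secrets)"` before calling Terraform.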

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Developer Agent and others added 2 commits March 16, 2026 03:08
- Add copilot_auth variable ('github_token' | 'byok') to control which
  KV secrets and env vars are injected into container apps
- github_token mode: injects GITHUB_TOKEN only (default)
- byok mode: injects COPILOT_API_KEY + COPILOT_PROVIDER_TYPE +
  COPILOT_PROVIDER_BASE_URL
- Fix deprecated storage_account_name → storage_account_id on queue
  and container resources
- Add copilot_provider_type and copilot_provider_base_url variables

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Storage account created with public_network_access_enabled=false
- Private DNS zones (blob, queue, table) linked to VNet
- Private endpoints for all three storage sub-resources
- Storage subnet (snet-storage) added to networking
- NSG rule: container-apps → storage subnet on 443
- KEDA scaler: cloud=Private, endpointSuffix for private queue
- No bootstrap needed — queue/container creation is ARM control-plane

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rride

var-file values take precedence over TF_VAR_ env vars in Terraform.
Having kv_bootstrap_secrets={} in dev.tfvars prevented the pipeline's
TF_VAR_kv_bootstrap_secrets from injecting the actual secrets.

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ACA validates container images at creation time. Using 'placeholder'
as image name caused 400 errors. Switch to the ACA quickstart image
from MCR which is publicly pullable without ACR auth.

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Collapsed 3 jobs into 1 sequential flow:
1. terraform apply with MCR quickstart (creates ACR + all infra)
2. docker build + push to ACR
3. terraform apply with real image (updates container apps)

Eliminates placeholder image issues and cross-job state passing.
Increased TF state propagation timeout to 120s.

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Developer Agent and others added 2 commits March 16, 2026 11:10
Pipeline now builds and pushes to GHCR first (no infra dependency),
then a single terraform apply creates ACR (Premium), imports the image
from GHCR via az acr import, and deploys container apps with the
ACR-hosted image.

- ACR upgraded to Premium (required for private endpoints)
- ACR private DNS zone + endpoint on storage subnet
- null_resource.acr_import: open ACR public → import → close
- Removed controller_image/job_image vars; replaced with image_tag
- Container apps derive image from ACR login_server + image_tag
- Added packages:write permission for GHCR push
- Closes #295 (ACR hardening)

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The az storage container list probe uses different auth than terraform
init's OIDC backend. Polling with az succeeds but terraform still gets
403. Replace the az-based probe with a retry loop around terraform init
itself — the definitive test.
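
Retrying the real command instead of a proxy probe can be sketched with a generic helper. The name `retry` and the attempt and delay values in the usage note are assumptions, not the values used in the workflow.

```shell
# Sketch: run a command up to <attempts> times, sleeping <delay> seconds
# between failed attempts. Hypothetical helper.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    # Stop after the last allowed attempt has failed.
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, `retry 12 10 terraform init -input=false` keeps retrying the same command whose 403 is the actual failure, so success here means the backend is reachable with the credentials Terraform itself uses.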

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Remove ACR public access toggle — az acr import is ARM control-plane
  and works regardless of network settings
- Add set -euo pipefail to all provisioner heredocs so errors fail fast
  instead of silently continuing (fixes ACR --public-network-access bug)
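
The effect of the strict-mode line can be seen in a small, self-contained comparison. The two function names are illustrative, and `false` stands in for any failing command inside a provisioner script.

```shell
# Without strict mode: the failing command is ignored and the script carries on.
lenient() {
  bash -c 'false; echo "continued after failure"'
}

# With strict mode: the first failing command aborts the script
# with a non-zero exit status, so nothing after it runs.
strict() {
  bash -c 'set -euo pipefail; false; echo "continued after failure"'
}
```

Calling `lenient` prints the message and exits 0, while `strict` prints nothing and exits non-zero, which is what lets Terraform mark the provisioner as failed.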

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The KEDA azure-queue scaler had no authentication configured, and the
storage account has shared_access_key_enabled=false, so KEDA could
never poll the queue and job executions never triggered.

- Remove cloud=Private and endpointSuffix from KEDA metadata
  (private DNS handles routing; clients use queue.core.windows.net)
- Add azapi_update_resource to patch the scale rule with the job
  managed identity (azurerm provider doesn't support this yet)
- Job identity already has Storage Queue Data Contributor role

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The Copilot SDK agent may run git commit during a coding session.
When this happens, git diff --cached returns empty because the files
are already committed. Fix by capturing HEAD SHA before the session
and falling back to git diff pre_sha..HEAD when staged diff is empty
but HEAD has moved.
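
The fallback logic can be sketched in shell as follows. The function name `collect_diff` is hypothetical and this is not the project's actual implementation; it only mirrors the described order of checks.

```shell
# Sketch: prefer the staged diff; if it is empty but HEAD has moved since
# the session started, diff the recorded SHA against HEAD instead.
collect_diff() {
  local pre_sha=$1 staged
  staged=$(git diff --cached)
  if [ -n "$staged" ]; then
    printf '%s\n' "$staged"
  elif [ "$(git rev-parse HEAD)" != "$pre_sha" ]; then
    # The agent committed during the session; recover its changes.
    git diff "$pre_sha..HEAD"
  fi
}
```

The caller would record `pre_sha=$(git rev-parse HEAD)` before the coding session and pass it in afterwards.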

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Belt-and-suspenders: the prompt tells the agent not to git add/commit
(primary fix), and _build_coding_result detects agent-committed changes
as a fallback (defense-in-depth).

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ommands

The task_id was mr-{project}-{mr_iid}, so the second /copilot command
on the same MR would return the cached result from the first command.
Include the note ID to make each command dispatch unique.

Works for both webhook (GitLab includes object_attributes.id) and
poller (passes note.id explicitly).
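
A minimal illustration of the new key shape, assuming the note ID is simply appended as a final segment (the exact format is not given in the commit message, and the function name is hypothetical):

```shell
# Sketch: build a per-command task ID so repeated commands on the same MR
# no longer collide. The position of the note ID is an assumption.
task_id() {
  local project=$1 mr_iid=$2 note_id=$3
  printf 'mr-%s-%s-%s\n' "$project" "$mr_iid" "$note_id"
}
```

Two commands on the same merge request, say notes 1001 and 1002, now map to different IDs, so the second is dispatched instead of being served from the first command's cached result.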

Part of #283

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move all workflow_dispatch inputs and vars to env blocks to prevent
  shell injection via crafted image_tag values (OWASP #3 finding)
- Change deployer ACR role from AcrPush to Container Registry Data
  Importer and Data Reader (least-privilege for az acr import)
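
The injection risk behind the first item can be reproduced locally. In this sketch the variable `untrusted` stands in for a crafted image_tag input, and both function names are illustrative.

```shell
untrusted='v1; echo INJECTED'

# Unsafe: the value is pasted into the script text, so the shell parses
# "; echo INJECTED" as a second command. This mirrors an expression such as
# ${{ inputs.image_tag }} being expanded directly inside a run: block.
unsafe() {
  bash -c "echo tag=$untrusted"
}

# Safer: the value travels as an environment variable and is only ever
# expanded as data inside a quoted string.
safe() {
  IMAGE_TAG=$untrusted bash -c 'echo "tag=$IMAGE_TAG"'
}
```

`unsafe` prints `tag=v1` followed by `INJECTED` on a second line, while `safe` prints the whole value as a single literal string.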

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@peteroden peteroden merged commit 8e9cd3a into main Mar 16, 2026
11 checks passed
@peteroden peteroden deleted the ci/283-deploy-pipeline-restructure branch March 16, 2026 14:07