chore(ci): add automatic rerun controller for flaky workflows #2984
imbajin merged 6 commits into apache:master
Pull request overview
Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.
Changes:
- Introduces `.github/workflows/rerun-ci.yml`, listening on `workflow_run: completed` for the four main CI workflows.
- Automatically triggers `gh run rerun <run_id> --failed` when the upstream run fails and is on its first attempt.
- Sets minimal permissions required to rerun Actions workflows (`actions: write`, `contents: read`).
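The rule in the list above (rerun only a failed run that is still on its first attempt) can be sketched as a small shell function. This is an illustrative sketch, not the contents of the workflow file: the names `should_rerun`, `rerun_failed_jobs`, and `DRY_RUN`, and the run ID `123456`, are hypothetical. Only the `gh run rerun <run_id> --failed` command comes from the PR itself.

```shell
#!/bin/sh
# Hypothetical helper: decide whether a completed run qualifies for an automatic rerun.
# $1 = conclusion of the upstream run (e.g. "failure", "success")
# $2 = run attempt number (1 for the original run)
should_rerun() {
  conclusion=$1
  attempt=$2
  # Rerun only failed runs that are still on their first attempt.
  [ "$conclusion" = "failure" ] && [ "$attempt" -lt 2 ]
}

# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
rerun_failed_jobs() {
  run_id=$1
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "gh run rerun $run_id --failed"
  else
    gh run rerun "$run_id" --failed
  fi
}

# Example: a failed first attempt of a made-up run ID qualifies for a rerun.
if should_rerun failure 1; then
  rerun_failed_jobs 123456
fi
```

In dry-run mode the example only prints the command it would run, so the decision logic can be checked locally without touching any real workflow run.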
This PR should target a simple first version of auto-rerun:

Easy-to-read behavior:

5 ⭐️ Required: support K automatic reruns instead of only 1

The current If

Reference shape:

    env:
      MAX_RERUNS: 2
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)

That also makes future tuning trivial.

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun

If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window. A small delay is likely enough:

Simple reference:

    env:
      RETRY_DELAY_SECONDS: 180
    steps:
      - name: Wait before rerun
        run: sleep "$RETRY_DELAY_SECONDS"
      - name: Rerun failed jobs
        run: gh run rerun ${{ github.event.workflow_run.id }} --failed

This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.

3 ⭐️ Should add: make the basic logging easier to read

The current Suggested fields:

2 ⭐️ Keep the trigger scope narrow

Since this workflow has

That avoids auto-rerunning manual/debug runs unexpectedly.

1 ⭐️ Later improvements, not required for v1

These can wait:

These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:
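The two main suggestions in this review (a budget of K reruns and a short delay before each one) can be combined in one shell sketch. It is a hedged illustration only: `within_rerun_budget`, `delayed_rerun`, `DRY_RUN`, and the run ID `123456` are hypothetical names, while `MAX_RERUNS=2` and `RETRY_DELAY_SECONDS=180` are the values quoted in the review.

```shell
#!/bin/sh
# Values taken from the review's reference snippets.
MAX_RERUNS=2
RETRY_DELAY_SECONDS=180

# Hypothetical helper: succeed when another automatic rerun is still allowed.
# $1 = conclusion of the upstream run, $2 = run attempt (1 = original run)
within_rerun_budget() {
  [ "$1" = "failure" ] && [ "$2" -le "$MAX_RERUNS" ]
}

# Hypothetical helper: wait, then rerun only the failed jobs.
# Prints the plan by default; set DRY_RUN=0 to really sleep and call the GitHub CLI.
delayed_rerun() {
  run_id=$1
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "sleep $RETRY_DELAY_SECONDS && gh run rerun $run_id --failed"
  else
    sleep "$RETRY_DELAY_SECONDS"
    gh run rerun "$run_id" --failed
  fi
}

# Example: attempt 2 of a made-up run failed; with MAX_RERUNS=2 one more rerun is allowed.
if within_rerun_budget failure 2; then
  delayed_rerun 123456
fi
```

With `run_attempt <= MAX_RERUNS` as the condition, attempts 1 and 2 each trigger a rerun and a failed attempt 3 does not, so the original run is retried at most twice.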
imbajin left a comment
How could we easily test it after the PR is merged?
Given that the merge is complete, I suggest creating an intentionally failing PR for testing purposes. However, I'm not sure if this follows the usual workflow.
Purpose of the PR
Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues that pass cleanly on retry, forcing manual "Re-run failed jobs"
clicks.
Main Changes
Adds `.github/workflows/rerun-ci.yml`, a small controller that watches the four main CI pipelines via `workflow_run` and automatically reruns failed jobs at most once per original failure:
- `conclusion == 'failure'` AND `run_attempt < 2` (no infinite loops).
- `gh run rerun <id> --failed` to re-run only failed jobs, not the whole workflow.
- `actions: write`, `contents: read`.
- Store & Hstore CI, and Cluster Test CI; no effect on CodeQL / stale / license-checker / auto-pr-review.
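To see why the `run_attempt < 2` guard rules out infinite loops, the worst case (a workflow that fails on every attempt) can be simulated in a few lines of shell. This is a toy model under stated assumptions, not the controller itself: the function `count_reruns` and its `limit` parameter are hypothetical, and the model assumes that each rerun increments `run_attempt` by one.

```shell
#!/bin/sh
# Toy model: count the automatic reruns triggered for a workflow that always fails.
# $1 = exclusive attempt limit (the PR's rule corresponds to a limit of 2)
count_reruns() {
  limit=$1
  attempt=1
  reruns=0
  # Each iteration models one failed "workflow_run: completed" event.
  while [ "$attempt" -lt "$limit" ]; do
    reruns=$((reruns + 1))     # the controller triggers a rerun
    attempt=$((attempt + 1))   # the rerun arrives as the next attempt
  done
  echo "$reruns"
}

# With the PR's rule (run_attempt < 2) a permanently failing run is retried once.
count_reruns 2
```

The loop ends as soon as the attempt number reaches the limit, which matches the "at most once per original failure" behaviour described above.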
Verifying these changes
Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.
Does this PR potentially affect the following parts?
Documentation Status
- Doc - TODO
- Doc - Done
- Doc - No Need