chore(ci): add automatic rerun controller for flaky workflows by contrueCT · Pull Request #2984 · apache/hugegraph

contrueCT · 2026-04-08T13:13:21Z

Purpose of the PR

Closes #2983: [Feature] automatically rerun failed CI jobs once to mitigate flaky workflows

Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues that pass cleanly on retry, forcing manual "Re-run failed jobs"
clicks.

Main Changes

Adds .github/workflows/rerun-ci.yml, a small controller that watches
the four main CI pipelines via workflow_run and automatically reruns
failed jobs at most once per original failure:

- Fires only when conclusion == 'failure' AND run_attempt < 2, so there are no infinite loops.
- Uses gh run rerun <id> --failed to re-run only failed jobs, not the whole workflow.
- Least-privilege permissions: actions: write, contents: read.
- Scoped to HugeGraph-Server CI, HugeGraph-Commons CI, HugeGraph-PD & Store & Hstore CI, and Cluster Test CI, with no effect on CodeQL / stale / license-checker / auto-pr-review.

Verifying these changes

Trivial rework / code cleanup without any test coverage. (No Need)
Already covered by existing tests, such as (please modify tests here).
Need tests and can be verified as follows:
- xxx

Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.

Does this PR potentially affect the following parts?

Dependencies (add/update license info & regenerate_known_dependencies.sh)
Modify configurations
The public API
Other affects (typed here)
Nope

Documentation Status

Doc - TODO
Doc - Done
Doc - No Need

Copilot

Pull request overview

Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.

Changes:

Introduces .github/workflows/rerun-ci.yml listening on workflow_run: completed for the four main CI workflows.
Automatically triggers gh run rerun <run_id> --failed when the upstream run fails and is on its first attempt.
Sets minimal permissions required to rerun Actions workflows (actions: write, contents: read).


.github/workflows/rerun-ci.yml

imbajin · 2026-04-08T16:07:54Z

This PR should target a simple first version of auto-rerun:

apply only to the 4 main CI workflows
rerun failed jobs when a workflow fails
keep retrying until success or the retry limit is reached
define K as the number of automatic reruns, with default K = 2
this means: 1 original run + up to 2 automatic reruns = up to 3 attempts total
keep only basic logging for now

Easy-to-read behavior:

1st failure -> wait a short delay -> rerun
2nd failure -> wait a short delay -> rerun
3rd failure -> stop and leave it for manual investigation

+----------------------+
| Main CI run finished |
+----------+-----------+
           |
           v
   conclusion == failure ?
           |
      +----+----+
      |         |
     no        yes
      |         |
      |   run_attempt <= K ?
      |         |
      |    +----+----+
      |    |         |
      |   no        yes
      |    |         |
      |    |   sleep(delay)
      |    |         |
      |    |   gh run rerun <id> --failed
      |    |
      v    v
     stop  wait for next completed event

5 ⭐️ Required: support K automatic reruns instead of only 1

The current run_attempt < 2 only gives one automatic retry. That is still short of the stated goal of healing flaky / network failures automatically.

If K means the number of automatic reruns, K = 2 is a good default and easy to understand:

K = 2
attempt 1 fails -> rerun -> attempt 2
attempt 2 fails -> rerun -> attempt 3
attempt 3 fails -> stop

Reference shape:

env:
  MAX_RERUNS: 2

if: >-
  github.event.workflow_run.conclusion == 'failure' &&
  fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)

That also makes future tuning trivial.
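The K = 2 rule above can be checked outside of Actions with a small shell sketch. The function name `decide_rerun` and its argument order are hypothetical and only serve this illustration; the comparison mirrors the `run_attempt <= MAX_RERUNS` condition from the reference shape.

```shell
# Decide whether a finished run should be rerun automatically.
# Arguments: conclusion, run_attempt, max_reruns (K).
# Prints "rerun" or "skip". The function name is hypothetical.
decide_rerun() (
  conclusion=$1
  attempt=$2
  max_reruns=$3
  if [ "$conclusion" = "failure" ] && [ "$attempt" -le "$max_reruns" ]; then
    echo rerun
  else
    echo skip
  fi
)

# Trace the K = 2 case, assuming every attempt fails.
for attempt in 1 2 3; do
  echo "attempt $attempt: $(decide_rerun failure "$attempt" 2)"
done
```

With K = 2 the trace prints `rerun` for attempts 1 and 2 and `skip` for attempt 3, which gives at most 3 attempts in total.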

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun

If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window.

A small delay is likely enough:

default 3 minutes, or
5 minutes if you want to be more conservative

Simple reference:

env:
  RETRY_DELAY_SECONDS: 180

steps:
  - name: Wait before rerun
    run: sleep "$RETRY_DELAY_SECONDS"

  - name: Rerun failed jobs
    run: gh run rerun ${{ github.event.workflow_run.id }} --failed

This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.
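If the wait and the rerun were combined into one script, it might look like the sketch below. This is an illustration under assumptions, not the code merged in this PR: the function name `rerun_after_delay` and the `DRY_RUN` switch are hypothetical, and the 180-second default simply follows the value suggested above.

```shell
# Wait, then rerun only the failed jobs of a workflow run.
# rerun_after_delay and DRY_RUN are hypothetical names for this sketch;
# RETRY_DELAY_SECONDS follows the reference snippet in the review comment.
rerun_after_delay() (
  run_id=$1
  delay=${RETRY_DELAY_SECONDS:-180}
  # Reject anything that is not a non-negative integer before sleeping.
  case $delay in
    ''|*[!0-9]*) echo "invalid delay: $delay" >&2; exit 2 ;;
  esac
  echo "waiting ${delay}s before rerunning run $run_id"
  sleep "$delay"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of calling the GitHub CLI.
    echo "gh run rerun $run_id --failed"
  else
    gh run rerun "$run_id" --failed
  fi
)
```

Calling `DRY_RUN=1 RETRY_DELAY_SECONDS=0 rerun_after_delay 12345` prints the command without contacting GitHub, which makes the script easy to try locally. Inside a workflow step, `gh` also needs a token in the `GH_TOKEN` environment variable.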

3 ⭐️ Should add: make the basic logging easier to read

The current echo lines are enough for debugging, but not very readable. Writing the decision into GITHUB_STEP_SUMMARY would make this much easier to inspect.

Suggested fields:

workflow name
event type
run id
current attempt
max reruns
delay seconds
action: rerun / skip
reason: below limit / exceeded limit / non-failure
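As a rough sketch of how the suggested fields could be written to the job summary, the function below appends a Markdown table to the file named by `GITHUB_STEP_SUMMARY`. The function name, the argument order, and the fallback to standard output for local runs are assumptions made for this example.

```shell
# Append a decision table to the job summary.
# write_decision_summary and its argument order are hypothetical.
# GITHUB_STEP_SUMMARY is the file path GitHub Actions provides to each step;
# outside of Actions this sketch falls back to standard output.
write_decision_summary() (
  workflow=$1; event=$2; run_id=$3; attempt=$4
  max_reruns=$5; delay=$6; action=$7; reason=$8
  summary_file=${GITHUB_STEP_SUMMARY:-/dev/stdout}
  {
    echo "### Auto-rerun decision"
    echo ""
    echo "| Field | Value |"
    echo "| --- | --- |"
    echo "| Workflow | $workflow |"
    echo "| Event | $event |"
    echo "| Run ID | $run_id |"
    echo "| Current attempt | $attempt |"
    echo "| Max reruns | $max_reruns |"
    echo "| Delay (seconds) | $delay |"
    echo "| Action | $action |"
    echo "| Reason | $reason |"
  } >> "$summary_file"
)
```

A call such as `write_decision_summary "HugeGraph-Server CI" pull_request 12345 1 2 180 rerun "below limit"` would then produce one table per controller run.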

2 ⭐️ Keep the trigger scope narrow

Since this workflow has actions: write, it is still worth keeping the trigger surface tight:

keep the current allowlist of the 4 main CI workflows
additionally gate source events to push / pull_request
optionally restrict branch scope if needed

That avoids auto-rerunning manual/debug runs unexpectedly.
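The two gates described above (workflow allowlist plus source event) can be sketched as plain shell predicates. The function names are hypothetical, and the workflow names are copied from the PR description; whether they match the exact `name:` values in the workflow files is an assumption.

```shell
# Return success only for the four main CI workflows.
# Names are taken from the PR description and assumed to match the workflow files.
is_main_ci_workflow() {
  case $1 in
    "HugeGraph-Server CI"|"HugeGraph-Commons CI"|"HugeGraph-PD & Store & Hstore CI"|"Cluster Test CI")
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Return success only for runs that were started by push or pull_request.
is_allowed_event() {
  case $1 in
    push|pull_request) return 0 ;;
    *) return 1 ;;
  esac
}

# A run is in scope only when both gates pass.
# Arguments: workflow name, source event.
in_rerun_scope() {
  is_main_ci_workflow "$1" && is_allowed_event "$2"
}
```

For example, `in_rerun_scope "Cluster Test CI" workflow_dispatch` fails the event gate, so a manually dispatched debug run would not be rerun automatically.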

1 ⭐️ Later improvements, not required for v1

These can wait:

different K values per workflow
only rerun specific failure categories
metrics for auto-rerun hit rate over time
posting a PR summary comment

These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:

default to at most 2 automatic reruns, add a 3-minute gap before each rerun, rerun failed jobs only, and write a basic decision summary.

.github/workflows/rerun-ci.yml

imbajin

How could we easily test it after the PR is merged?

contrueCT · 2026-04-11T15:41:25Z

> How could we easily test it after the PR is merged?

Once the PR is merged, I suggest creating an intentionally failing PR for testing purposes. However, I'm not sure whether this follows the usual workflow.
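One possible way to build such an intentionally failing change, offered here as an assumption and not something tried in this thread, is a temporary probe step that fails on early attempts and passes later. `GITHUB_RUN_ATTEMPT` is set by GitHub Actions and starts at 1; the function name `flaky_probe` and the `PASS_FROM` knob are hypothetical.

```shell
# Fail on early attempts and pass from attempt PASS_FROM onward.
# flaky_probe and PASS_FROM are hypothetical names for this sketch.
# GITHUB_RUN_ATTEMPT is provided by GitHub Actions (1 for the first attempt).
flaky_probe() (
  attempt=${GITHUB_RUN_ATTEMPT:-1}
  pass_from=${PASS_FROM:-2}
  if [ "$attempt" -lt "$pass_from" ]; then
    echo "attempt $attempt: failing on purpose" >&2
    exit 1
  fi
  echo "attempt $attempt: passing"
)
```

With the default `PASS_FROM=2`, the first attempt fails and the first automatic rerun passes. Setting `PASS_FROM` above the rerun limit would instead exercise the "stop and leave it for manual investigation" path.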

ci: add automatic rerun controller for flaky workflows

f320ec4

dosubot bot added the size:M (this PR changes 30-99 lines, ignoring generated files) and ci-cd (build or deploy) labels Apr 8, 2026

imbajin changed the title ~~ci: add automatic rerun controller for flaky workflows~~ chore(ci): add automatic rerun controller for flaky workflows Apr 8, 2026

imbajin requested a review from Copilot April 8, 2026 15:54

Copilot started reviewing on behalf of imbajin April 8, 2026 15:55

imbajin requested a review from VGalaxies April 8, 2026 15:56

Copilot AI reviewed Apr 8, 2026

.github/workflows/rerun-ci.yml (outdated)

.github/workflows/rerun-ci.yml (outdated)

imbajin reviewed Apr 8, 2026

.github/workflows/rerun-ci.yml

imbajin reviewed Apr 8, 2026

.github/workflows/rerun-ci.yml

ci: refine auto rerun controller policy

af40258

imbajin reviewed Apr 11, 2026

.github/workflows/rerun-ci.yml (outdated)

contrueCT added 4 commits April 11, 2026 17:14

ci: follow source workflow scope for reruns

cff1eae

ci: narrow rerun workflow write scope

7daedbc

ci: reduce retry delay for rerun jobs from 180 to 60 seconds

16c6d85

ci: increase retry delay for rerun jobs from 60 to 180 seconds

48d6dda

imbajin approved these changes Apr 11, 2026


dosubot bot added the lgtm (this PR has been approved by a maintainer) label Apr 11, 2026

imbajin merged commit 28c39b6 into apache:master Apr 11, 2026
13 checks passed

imbajin merged 6 commits into apache:master from contrueCT:pr/ci-rerun-workflow
