chore(ci): add automatic rerun controller for flaky workflows #2984

Merged
imbajin merged 6 commits into apache:master from contrueCT:pr/ci-rerun-workflow
Apr 11, 2026

Conversation

@contrueCT
Contributor

Purpose of the PR

Apache HugeGraph's four main CI pipelines occasionally fail on transient
issues that pass cleanly on retry, forcing manual "Re-run failed jobs"
clicks.

Main Changes

Adds .github/workflows/rerun-ci.yml, a small controller that watches
the four main CI pipelines via workflow_run and automatically reruns
failed jobs at most once per original failure:

  • Fires only when conclusion == 'failure' AND run_attempt < 2 — no
    infinite loops.
  • Uses gh run rerun <id> --failed to re-run only failed jobs, not the
    whole workflow.
  • Least-privilege permissions: actions: write, contents: read.
  • Scoped to HugeGraph-Server CI, HugeGraph-Commons CI, HugeGraph-PD &
    Store & Hstore CI, and Cluster Test CI — no effect on CodeQL / stale
    / license-checker / auto-pr-review.
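
As a rough illustration of the rerun step described above, the sketch below wraps the `gh run rerun <id> --failed` call in a small shell function. The function name `rerun_failed_jobs`, the `DRY_RUN` switch, and the placeholder run ID and repository are assumptions made for this example and are not taken from the PR. Inside a workflow, `gh` reads its token from `GH_TOKEN`, and because the controller does not check out the repository, the target repository has to be passed explicitly (here via `--repo`).

```shell
#!/bin/sh
# Sketch only: rerun_failed_jobs and DRY_RUN are hypothetical names.
# rerun_failed_jobs RUN_ID REPO
rerun_failed_jobs() {
  run_id="$1"
  repo="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of calling the GitHub API.
    echo "gh run rerun $run_id --failed --repo $repo"
  else
    # Requires GH_TOKEN in the environment with actions: write permission.
    gh run rerun "$run_id" --failed --repo "$repo"
  fi
}

# Dry run with placeholder values: prints the command that would be executed.
DRY_RUN=1 rerun_failed_jobs 123456789 example-org/example-repo
```

In a real workflow step the arguments would presumably come from `${{ github.event.workflow_run.id }}` and `${{ github.repository }}`.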

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • xxx

Validated on my fork: controller correctly fires on failed runs, reruns
only the failed jobs, and does not re-trigger itself on the second
attempt.

Does this PR potentially affect the following parts?

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

@dosubot (bot) added the labels size:M (This PR changes 30-99 lines, ignoring generated files.) and ci-cd (Build or deploy) on Apr 8, 2026
@imbajin changed the title from "ci: add automatic rerun controller for flaky workflows" to "chore(ci): add automatic rerun controller for flaky workflows" on Apr 8, 2026
@imbajin requested a review from Copilot on April 8, 2026 15:54
@imbajin requested a review from VGalaxies on April 8, 2026 15:56
Contributor

Copilot AI left a comment

Pull request overview

Adds a GitHub Actions controller workflow to automatically rerun flaky CI workflow runs once when they fail, reducing the need for manual “Re-run failed jobs” actions.

Changes:

  • Introduces .github/workflows/rerun-ci.yml listening on workflow_run: completed for the four main CI workflows.
  • Automatically triggers gh run rerun <run_id> --failed when the upstream run fails and is on its first attempt.
  • Sets minimal permissions required to rerun Actions workflows (actions: write, contents: read).

@imbajin
Member

imbajin commented Apr 8, 2026

This PR should target a simple first version of auto-rerun:

  • apply only to the 4 main CI workflows
  • rerun failed jobs when a workflow fails
  • keep retrying until success or the retry limit is reached
  • define K as the number of automatic reruns, with default K = 2
  • this means: 1 original run + up to 2 automatic reruns = up to 3 attempts total
  • keep only basic logging for now

Easy-to-read behavior:

1st failure -> wait a short delay -> rerun
2nd failure -> wait a short delay -> rerun
3rd failure -> stop and leave it for manual investigation

+----------------------+
| Main CI run finished |
+----------+-----------+
           |
           v
   conclusion == failure ?
           |
      +----+----+
      |         |
     no        yes
      |         |
      |   run_attempt <= K ?
      |         |
      |    +----+----+
      |    |         |
      |   no        yes
      |    |         |
      |    |   sleep(delay)
      |    |         |
      |    |   gh run rerun <id> --failed
      |    |
      v    v
     stop  wait for next completed event

5 ⭐️ Required: support K automatic reruns instead of only 1

The current run_attempt < 2 only gives one automatic retry. That is still short of the stated goal of healing flaky / network failures automatically.

If K means the number of automatic reruns, K = 2 is a good default and easy to understand:

  • K = 2
  • attempt 1 fails -> rerun -> attempt 2
  • attempt 2 fails -> rerun -> attempt 3
  • attempt 3 fails -> stop

Reference shape:

env:
  MAX_RERUNS: 2

if: >-
  github.event.workflow_run.conclusion == 'failure' &&
  fromJSON(github.event.workflow_run.run_attempt) <= fromJSON(env.MAX_RERUNS)

That also makes future tuning trivial.
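
The same gate can also be evaluated inside a shell step, which may be easier to test locally than an `if:` expression. The sketch below is a minimal illustration, assuming a hypothetical helper named `should_rerun` whose three arguments would be filled from `github.event.workflow_run.conclusion`, `github.event.workflow_run.run_attempt`, and `MAX_RERUNS`. As far as I know, the `env` context is readable in step-level `if:` conditions but not in a job-level `if:`, so a shell check like this is one way to sidestep that question.

```shell
#!/bin/sh
# Sketch only: should_rerun is a hypothetical helper, not code from the PR.
# should_rerun CONCLUSION RUN_ATTEMPT MAX_RERUNS -> prints "rerun" or "skip"
should_rerun() {
  conclusion="$1"
  run_attempt="$2"
  max_reruns="$3"
  # Rerun only failed runs whose attempt number is still within the limit.
  if [ "$conclusion" = "failure" ] && [ "$run_attempt" -le "$max_reruns" ]; then
    echo "rerun"
  else
    echo "skip"
  fi
}

MAX_RERUNS=2
for attempt in 1 2 3; do
  echo "attempt $attempt failed -> $(should_rerun failure "$attempt" "$MAX_RERUNS")"
done
# prints:
# attempt 1 failed -> rerun
# attempt 2 failed -> rerun
# attempt 3 failed -> skip
```

With `MAX_RERUNS=1` the check behaves like the original `run_attempt < 2` condition, so the single-retry behavior remains available as a configuration choice.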

4 ⭐️ Strongly recommended for v1: add a small delay before each rerun

If the failure comes from transient Maven Central / Docker Hub / DNS issues, rerunning immediately often hits the same outage window.

A small delay is likely enough:

  • default 3 minutes, or
  • 5 minutes if you want to be more conservative

Simple reference:

env:
  RETRY_DELAY_SECONDS: 180

steps:
  - name: Wait before rerun
    run: sleep "$RETRY_DELAY_SECONDS"

  - name: Rerun failed jobs
    run: gh run rerun ${{ github.event.workflow_run.id }} --failed

This is not the most runner-minute-efficient approach, but it is simple and predictable for an initial version.

3 ⭐️ Should add: make the basic logging easier to read

The current echo lines are enough for debugging, but not very readable. Writing the decision into GITHUB_STEP_SUMMARY would make this much easier to inspect.

Suggested fields:

  • workflow name
  • event type
  • run id
  • current attempt
  • max reruns
  • delay seconds
  • action: rerun / skip
  • reason: below limit / exceeded limit / non-failure
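
One possible way to produce such a summary is sketched below. The helper name `render_decision_summary` and the argument order are assumptions for illustration. The function only prints a Markdown table, and the caller is expected to append its output to the file that `GITHUB_STEP_SUMMARY` points to.

```shell
#!/bin/sh
# Sketch only: render_decision_summary is a hypothetical helper.
# Arguments, in order: workflow name, event type, run id, current attempt,
# max reruns, delay seconds, action, reason.
render_decision_summary() {
  echo "### Auto-rerun decision"
  echo ""
  echo "| Field | Value |"
  echo "| --- | --- |"
  echo "| Workflow | $1 |"
  echo "| Event | $2 |"
  echo "| Run ID | $3 |"
  echo "| Current attempt | $4 |"
  echo "| Max reruns | $5 |"
  echo "| Delay (seconds) | $6 |"
  echo "| Action | $7 |"
  echo "| Reason | $8 |"
}

# Placeholder values; in a workflow step the output would be redirected:
#   render_decision_summary ... >> "$GITHUB_STEP_SUMMARY"
render_decision_summary "HugeGraph-Server CI" pull_request 123456789 1 2 180 rerun "below limit"
```

Keeping the rendering separate from the redirect makes the table easy to inspect in a local terminal before wiring it into the workflow.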

2 ⭐️ Keep the trigger scope narrow

Since this workflow has actions: write, it is still worth keeping the trigger surface tight:

  • keep the current allowlist of the 4 main CI workflows
  • additionally gate source events to push / pull_request
  • optionally restrict branch scope if needed

That avoids auto-rerunning manual/debug runs unexpectedly.
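
The source-event gate could look roughly like the following sketch. The helper name `is_allowed_source_event` is hypothetical, and its argument is assumed to come from `github.event.workflow_run.event`, which carries the event that triggered the upstream run.

```shell
#!/bin/sh
# Sketch only: is_allowed_source_event is a hypothetical helper.
# Succeeds only for upstream runs triggered by push or pull_request.
is_allowed_source_event() {
  case "$1" in
    push|pull_request) return 0 ;;
    *) return 1 ;;
  esac
}

for event in push pull_request workflow_dispatch schedule; do
  if is_allowed_source_event "$event"; then
    echo "$event: eligible for auto-rerun"
  else
    echo "$event: skipped"
  fi
done
```

Manual (`workflow_dispatch`) and scheduled runs fall through to the default branch and are skipped, which matches the intent of not rerunning debug runs unexpectedly.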

1 ⭐️ Later improvements, not required for v1

These can wait:

  • different K values per workflow
  • only rerun specific failure categories
  • metrics for auto-rerun hit rate over time
  • posting a PR summary comment

These are useful, but they should not block the initial rollout.

If reduced to one sentence, the most practical v1 is:

default to at most 2 automatic reruns, add a 3-minute gap before each rerun, rerun failed jobs only, and write a basic decision summary.

Member

@imbajin imbajin left a comment

How could we easily test it after the PR is merged?

@dosubot (bot) added the lgtm label (This PR has been approved by a maintainer) on Apr 11, 2026
@imbajin merged commit 28c39b6 into apache:master on Apr 11, 2026
13 checks passed
@contrueCT
Contributor Author

How could we easily test it after PR merged?

Given that the merge is complete, I suggest creating an intentionally failing PR for testing purposes. However, I'm not sure whether this follows the usual workflow.
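
One way such a test PR might simulate a transient failure is a step that fails on the first attempt and passes on any rerun. The sketch below is only an illustration of that idea, not something from the PR. It relies on the default `GITHUB_RUN_ATTEMPT` environment variable that GitHub Actions sets for each attempt, and the function name `simulate_flaky_step` is hypothetical.

```shell
#!/bin/sh
# Sketch only: simulate_flaky_step is a hypothetical helper for a throwaway test PR.
# Fails on the first attempt of a run and succeeds on any rerun.
simulate_flaky_step() {
  attempt="${GITHUB_RUN_ATTEMPT:-1}"
  if [ "$attempt" -lt 2 ]; then
    echo "simulated transient failure on attempt $attempt"
    return 1
  fi
  echo "passed on attempt $attempt"
}

# Local demonstration with the attempt number set by hand.
GITHUB_RUN_ATTEMPT=1 simulate_flaky_step || echo "first attempt would fail the job"
GITHUB_RUN_ATTEMPT=2 simulate_flaky_step
```

If the controller works as intended, a run containing such a step should fail once, be rerun automatically, and then pass without manual intervention.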

Development

Successfully merging this pull request may close these issues.

[Feature] automatically rerun failed CI jobs once to mitigate flaky workflows

3 participants