
maintainer: fast retry temporarily ignored WAITING statuses (#4808) #5204

Merged
ti-chi-bot[bot] merged 4 commits into pingcap:release-8.5 from ti-chi-bot:cherry-pick-4808-to-release-8.5 on Jun 9, 2026


@ti-chi-bot

Member

This is an automated cherry-pick of #4808

What problem does this PR solve?

Issue Number: close #4810

When a non-DDL dispatcher reports a WAITING block status while the maintainer still sees that task as temporarily non-replicating, the status used to be silently ignored. The dispatcher then had to wait for the regular 5-second resend loop before retrying, which unnecessarily delayed barrier recovery and convergence.

What is changed and how it works?

  1. Add IgnoredBlockStatus to heartbeatpb.DispatcherStatus so the maintainer can explicitly tell a dispatcher that its current WAITING status is temporarily ignored and should be retried soon, and regenerate heartbeat.pb.go.
  2. On the dispatcher side, HandleDispatcherStatus now has an ignored-block branch. It schedules a fast retry only when the hint strictly matches the current in-flight WAITING block event.
  3. ResendTask keeps the original 5-second periodic resend as the fallback path and adds a one-shot fast retry starting at 50ms with exponential backoff up to the same interval; ACK still cancels both the slow and fast resend paths.
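
The fast-retry schedule in item 3 can be sketched as a pure delay function. This is a minimal illustration, not the code from this PR: the 50ms start and the 5-second cap come from the description above, while the doubling factor and the name `fastRetryDelay` are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

const (
	// Both values are taken from the PR description.
	fastRetryInitial   = 50 * time.Millisecond
	slowResendInterval = 5 * time.Second
)

// fastRetryDelay is a hypothetical helper that returns the delay before the
// given fast-retry attempt (attempt 0 is the first retry). The delay doubles
// on every attempt (an assumed backoff factor) and is capped at the regular
// resend interval.
func fastRetryDelay(attempt int) time.Duration {
	d := fastRetryInitial
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= slowResendInterval {
			return slowResendInterval
		}
	}
	return d
}

func main() {
	// Print the schedule until it reaches the cap.
	for attempt := 0; attempt < 9; attempt++ {
		fmt.Printf("attempt %d: %v\n", attempt, fastRetryDelay(attempt))
	}
}
```

Under these assumptions the fast path converges to the slow path after a handful of attempts, so a maintainer that keeps ignoring the status is not contacted more often than the original 5-second loop would contact it.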

Before:
img_v3_0210n_60180ab7-f08c-4b1a-8949-d3b32cf6249g

After:
img_v3_0210o_181a85fa-6f50-425f-80a7-b6545799a4ag

Check List

Tests

[x] Unit test

Release note

Fix barrier retry latency by triggering a fast retry when the maintainer temporarily ignores a dispatcher's `WAITING` block status instead of waiting for the regular resend interval.

Summary by CodeRabbit

  • Bug Fixes

    • Prevents spurious blocked events by temporarily treating certain block-status reports from non-replicating dispatchers as ignored and preserving span state.
  • New Features

    • Heartbeat messages now include an IgnoredBlockStatus hint and influenced-dispatcher info so receivers can fast-retry pending block events.
  • Tests

    • Added tests for ignored-block handling, fast-resend timing, retry cancellation on ACK, and status propagation.
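
The dispatcher-side rule described above (fast retry only when the hint strictly matches the current in-flight WAITING block event) might look roughly like the following sketch. The `CommitTs` and `IsSyncPoint` fields mirror the ones set on `heartbeatpb.IgnoredBlockStatus` in this PR; the types `ignoredHint` and `pendingBlock` and the function `shouldFastRetry` are hypothetical stand-ins, not the actual dispatcher code.

```go
package main

import "fmt"

// ignoredHint mirrors the two fields this PR sets on
// heartbeatpb.IgnoredBlockStatus.
type ignoredHint struct {
	CommitTs    uint64
	IsSyncPoint bool
}

// pendingBlock is a hypothetical stand-in for the dispatcher's in-flight
// block event.
type pendingBlock struct {
	CommitTs    uint64
	IsSyncPoint bool
	Waiting     bool // true while the block stage is WAITING
}

// shouldFastRetry reports whether the hint refers to exactly the block event
// that is currently pending in the WAITING stage. A stale or mismatched hint
// returns false, leaving only the regular resend loop in effect.
func shouldFastRetry(hint ignoredHint, pending *pendingBlock) bool {
	if pending == nil || !pending.Waiting {
		return false
	}
	return pending.CommitTs == hint.CommitTs &&
		pending.IsSyncPoint == hint.IsSyncPoint
}

func main() {
	pending := &pendingBlock{CommitTs: 100, IsSyncPoint: false, Waiting: true}
	fmt.Println(shouldFastRetry(ignoredHint{CommitTs: 100}, pending)) // true
	fmt.Println(shouldFastRetry(ignoredHint{CommitTs: 99}, pending))  // false
}
```

Matching on both fields keeps a late hint for an earlier barrier from rescheduling a retry for a different pending event.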

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot added the contribution, do-not-merge/hold, first-time-contributor, lgtm, ok-to-test, release-note, size/XXL, and type/cherry-pick-for-release-8.5 labels Jun 5, 2026
@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a fast retry mechanism for temporarily ignored WAITING statuses by adding an IgnoredBlockStatus to the heartbeat protocol and handling it in both the dispatcher and the maintainer barrier. However, there are multiple critical issues where unresolved git merge conflict markers have been left in the code across several files, including maintainer/barrier.go, event_dispatcher_test.go, and the generated heartbeat.pb.go file. These conflict markers must be resolved and cleaned up before the pull request can be merged.


Comment thread maintainer/barrier.go Outdated
Comment on lines +83 to +118
<<<<<<< HEAD
log.Info("Get block status from unexisted dispatcher, ignore it", zap.String("changefeed", request.ChangefeedID.GetName()), zap.String("dispatcher", dispatcherID.String()), zap.Uint64("commitTs", status.State.BlockTs))
continue
} else {
if !b.spanController.IsReplicating(task) {
log.Info("Get block status from unreplicating dispatcher, ignore it", zap.String("changefeed", request.ChangefeedID.GetName()), zap.String("dispatcher", dispatcherID.String()), zap.Uint64("commitTs", status.State.BlockTs))
=======
log.Info("Get block status from unexisted dispatcher, ignore it",
zap.String("changefeed", request.ChangefeedID.GetName()),
zap.String("dispatcher", dispatcherID.String()),
zap.Uint64("commitTs", status.State.BlockTs),
zap.Int64("mode", b.mode))
continue
} else {
if !b.spanController.IsReplicating(task) {
log.Info("Get block status from unreplicating dispatcher, ignore it",
zap.String("changefeed", request.ChangefeedID.GetName()),
zap.String("dispatcher", dispatcherID.String()),
zap.Uint64("commitTs", status.State.BlockTs),
zap.Int64("mode", b.mode))
// A newly added dispatcher may report its first WAITING barrier before the add
// operator moves it from scheduling to replicating. We still cannot admit that
// status into barrier, but silently dropping it would leave dispatcher waiting
// for the slow 5s resend timer. Return IgnoredBlockStatus so it keeps the live
// WAITING state locally and schedules a fast retry instead.
dispatcherStatus = append(dispatcherStatus, &heartbeatpb.DispatcherStatus{
InfluencedDispatchers: &heartbeatpb.InfluencedDispatchers{
InfluenceType: heartbeatpb.InfluenceType_Normal,
DispatcherIDs: []*heartbeatpb.DispatcherID{status.ID},
},
IgnoredBlockStatus: &heartbeatpb.IgnoredBlockStatus{
CommitTs: status.State.BlockTs,
IsSyncPoint: status.State.IsSyncPoint,
},
})
>>>>>>> fcc173191 (maintainer: fast retry temporarily ignored WAITING statuses (#4808))


critical

Unresolved git merge conflict markers found in maintainer/barrier.go. Please clean up the conflict markers and keep the correct implementation.

				log.Info("Get block status from unexisted dispatcher, ignore it",
					zap.String("changefeed", request.ChangefeedID.GetName()),
					zap.String("dispatcher", dispatcherID.String()),
					zap.Uint64("commitTs", status.State.BlockTs),
					zap.Int64("mode", b.mode))
				continue
			} else {
				if !b.spanController.IsReplicating(task) {
					log.Info("Get block status from unreplicating dispatcher, ignore it",
						zap.String("changefeed", request.ChangefeedID.GetName()),
						zap.String("dispatcher", dispatcherID.String()),
						zap.Uint64("commitTs", status.State.BlockTs),
						zap.Int64("mode", b.mode))
					// A newly added dispatcher may report its first WAITING barrier before the add
					// operator moves it from scheduling to replicating. We still cannot admit that
					// status into barrier, but silently dropping it would leave dispatcher waiting
					// for the slow 5s resend timer. Return IgnoredBlockStatus so it keeps the live
					// WAITING state locally and schedules a fast retry instead.
					dispatcherStatus = append(dispatcherStatus, &heartbeatpb.DispatcherStatus{
						InfluencedDispatchers: &heartbeatpb.InfluencedDispatchers{
							InfluenceType: heartbeatpb.InfluenceType_Normal,
							DispatcherIDs: []*heartbeatpb.DispatcherID{status.ID},
						},
						IgnoredBlockStatus: &heartbeatpb.IgnoredBlockStatus{
							CommitTs:    status.State.BlockTs,
							IsSyncPoint: status.State.IsSyncPoint,
						},
					})

Contributor

Keep the cherry-pick (cp) change and add the mode log from HEAD.

Comment on lines +404 to +405
<<<<<<< HEAD
=======


critical

Unresolved git merge conflict markers found in event_dispatcher_test.go. Please remove them.

Contributor

resolved

return pendingEvent == nil && blockStage == heartbeatpb.BlockStage_NONE
}, time.Second, 10*time.Millisecond)
require.Equal(t, int32(1), flushCalls.Load())
>>>>>>> fcc173191 (maintainer: fast retry temporarily ignored WAITING statuses (#4808))


critical

Unresolved git merge conflict markers found in event_dispatcher_test.go. Please remove them.

Contributor

resolved

Comment thread heartbeatpb/heartbeat.pb.go Outdated
return nil
}

<<<<<<< HEAD


critical

Unresolved git merge conflict markers found in the generated file heartbeat.pb.go. Please resolve the conflicts in the Go files and regenerate this file using the protobuf generation tool (e.g., make proto).

Contributor

resolved

@haiboumich

Contributor

/cherry-pick-invite -h

@zier-one zier-one left a comment

Contributor

LGTM

@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026


@zier-one: adding LGTM is restricted to approvers and reviewers in OWNERS files.


@ti-chi-bot

ti-chi-bot Bot commented Jun 9, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hongyunyan, zier-one


@ti-chi-bot added the approved and cherry-pick-approved labels and removed the do-not-merge/cherry-pick-not-approved label Jun 9, 2026
@lidezhu removed the do-not-merge/hold label Jun 9, 2026
@ti-chi-bot merged commit 39910e0 into pingcap:release-8.5 Jun 9, 2026
19 of 20 checks passed

Labels

approved, cherry-pick-approved, contribution, first-time-contributor, lgtm, ok-to-test, release-note, size/XXL, type/cherry-pick-for-release-8.5


5 participants