server, dispatcher: improve node liveness self fence by asddongmen · Pull Request #5106 · pingcap/ticdc

asddongmen · 2026-05-20T06:43:36Z

What problem does this PR solve?

Issue Number: close #5202

When a TiCDC capture loses its etcd session or lease, it can no longer prove that it still owns local dispatcher work. The previous shutdown path did not immediately fence local write paths, so a stale capture could still accept maintainer requests or continue writing downstream while failover was already in progress. This created a short but unsafe window for duplicate or out-of-date downstream writes before normal cleanup finished.

What is changed and how it works?

This PR adds a local fence path for session-done and lease-expired events. The server watches the etcd session, triggers local fencing before shutdown, and the dispatcher orchestrator stops accepting new maintainer requests. Dispatcher managers cancel local write paths immediately, close local dispatchers asynchronously, and continue cleanup without waiting for progress draining. The redo cleanup path is also guarded so partially initialized managers do not panic when fencing. New integration coverage simulates capture session loss and verifies the local fence behavior and downstream consistency.

Check List

Tests

Unit test
Integration test
Manual test: dev-machine MySQL integration cases and GitHub /test all

Questions

Will it cause performance regression or break compatibility?

No. The new path only runs when a local capture loses its session or is being fenced.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

TiCDC now locally fences stale captures after etcd session loss to avoid unsafe downstream writes.

Summary by CodeRabbit

New Features
- Added etcd session watchdog and server-side fencing to trigger local fencing on lease/session loss
- Added local fencing controls to orchestrator and managers to immediately fence write paths
Bug Fixes
- Prevented races between dispatcher creation/merge and shutdown; avoid registering partially-initialized dispatchers
- Short-circuited operations when write path is fenced to improve shutdown safety
Tests
- Added unit and integration tests for fencing, shutdown and session-watchdog behaviors

ti-chi-bot · 2026-05-20T06:43:39Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-05-20T06:43:44Z

Walkthrough

Adds coordinated local fencing: DispatcherManager write-path fencing and close refactor, DispatcherOrchestrator fenced mode and propagation, a server session watchdog that triggers local fence on etcd lease/session loss, unit/integration tests, and related scripts/failpoint updates.

Changes

Local fencing mechanism

Layer / File(s)	Summary
All related changes (fencing, watchdog, tests, integration, helpers) `downstreamadapter/...`, `server/...`, `tests/integration_tests/`, `pkg/`	Implements write-path fencing across DispatcherManager and DispatcherOrchestrator (fenced flag, IsWritePathClosedError, LocalFence methods), server etcd session watchdog to trigger local fencing, updates to event/redo dispatcher creation and merge to be fencing-aware, helper and test updates, new normalized error, and integration scripts + failpoint adjustments.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm

Suggested reviewers

hongyunyan
lidezhu

"I nibble logs and guard the line,
When leases drop and sessions fade,
I fence the writes, keep data fine,
Quiet managers in evening shade.
A carrot hop for safe cascade." 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'server, dispatcher: improve node liveness self fence' is specific and concise, clearly describing the main change of adding local fencing when a capture loses its etcd session.
Description check	✅ Passed	The description includes the issue number (close `#5202`), explains the problem and solution, provides a comprehensive checklist with unit/integration/manual tests, and includes a proper release note.
Linked Issues check	✅ Passed	The PR addresses all requirements from `#5202`: immediate local fencing on etcd session loss, rejection of new maintainer requests, cancellation of dispatcher write paths, and prevention of unsafe downstream writes through comprehensive changes across server, dispatcher orchestrator, and manager components.
Out of Scope Changes check	✅ Passed	All changes are in scope for the local fencing feature: session watchdog, dispatcher fencing logic, redo dispatcher protection, integration tests, and MySQL writer test failpoint expansion are all directly related to preventing stale captures from writing.


gemini-code-assist

Code Review

This pull request introduces a local fencing mechanism to ensure that downstream writes are stopped immediately when a capture loses its etcd session or lease. This is achieved by adding a session watchdog in the server and implementing a LocalFence method across the DispatcherOrchestrator and DispatcherManager to bypass graceful draining in failure scenarios. The review feedback identifies potential nil pointer dereferences in the DispatcherManager's shutdown logic, specifically regarding the redoSink when redo logging is enabled.

asddongmen · 2026-05-22T01:10:41Z

/test all

Signed-off-by: dongmen <414110582@qq.com>

asddongmen · 2026-06-04T12:54:11Z

/test all

asddongmen · 2026-06-04T13:54:55Z

/test all

asddongmen · 2026-06-04T14:00:09Z

/test all

asddongmen · 2026-06-04T15:39:53Z

/test pull-cdc-mysql-integration-heavy

asddongmen · 2026-06-05T02:25:57Z

/test all

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration_tests/capture_local_fence_on_session_done/run.sh`:
- Around line 27-28: The curl call that populates addr (curl -s
"http://${api_addr}/api/v2/captures" piped to jq) should set timeouts so it
cannot hang CI. Add both a connection timeout and an overall max time
(--connect-timeout and --max-time) while preserving -s, and apply the same
flags to the other two curl calls in this script so the retry loop cannot
block indefinitely.
- Line 168: The script forwards positional parameters using $* which can split
or reshape arguments; update the invocation of the run helper (the line
containing "run $*") to forward arguments safely by using the quoted array form
"$@" instead so each original argument is preserved exactly when passed to run.

In `@tests/integration_tests/capture_session_done_during_task/run.sh`:
- Around line 59-60: The two strict checks in run.sh ("check_logs_contains
$WORK_DIR \"local fence triggered\"" and "check_logs_contains $WORK_DIR \"etcd
lease expired\"") make the test flaky because the reason can be either "etcd
session done" or "etcd lease expired"; update the assertions so after verifying
"local fence triggered" with check_logs_contains you assert that the logs
contain either "etcd session done" OR "etcd lease expired" (e.g., replace the
second check with a single check that uses a regex/alternate match or two-branch
logic calling check_logs_contains for "etcd session done" OR "etcd lease
expired") so the test passes when either reason appears.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e024b9c5-b91a-4f1c-ae51-6f90b9eb32f8

📥 Commits

Reviewing files that changed from the base of the PR and between ff135e6 and 5e97fd8.

📒 Files selected for processing (12)

downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
downstreamadapter/dispatchermanager/dispatcher_manager_test.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
pkg/sink/mysql/mysql_writer_dml_exec.go
server/server.go
server/server_session_watchdog_test.go
tests/integration_tests/capture_local_fence_on_session_done/conf/diff_config.toml
tests/integration_tests/capture_local_fence_on_session_done/run.sh
tests/integration_tests/capture_session_done_during_task/run.sh
tests/integration_tests/run_light_it_in_ci.sh

ti-chi-bot · 2026-06-05T08:26:11Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wk989898


Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-06-05T08:26:12Z

[LGTM Timeline notifier]

Timeline:

2026-06-05 08:26:11 UTC: ☑️ agreed by wk989898.

asddongmen · 2026-06-05T12:59:16Z

/test all

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go (1)
458-462: ⚡ Quick win

Make the drop assertion deterministic.

Lines 458-462 only show that nothing reached processed within 50ms. A message that was merely delayed, or enqueued after fence but never drained, looks identical, so this can both mask regressions and flake on slow CI. Please assert the post-fence admission state directly instead of relying on a short timeout.

As per coding guidelines, **/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`
around lines 458 - 462, Replace the flaky 50ms timeout check on the processed
channel with a deterministic assertion of post-fence admission state: after
applying the local fence, directly assert the orchestrator/maintainer admission
flag or queue length (e.g. check an "isAdmitting" / "admissionOpen" boolean or
the maintainer queue length) to prove new maintainer messages are rejected, and
if only the processed channel is available, use a non-blocking select (case msg
:= <-processed: require.FailNow(...); default: ) and also attempt to enqueue a
new maintainer and assert that enqueue returns a failure/false or does not
increase the queue size; reference the processed channel and the local fence
operation in the test to locate where to make this change.
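One way to read this suggestion is sketched below. It is an illustration under stated assumptions, not the orchestrator's real API: `gate`, `Fence`, and `TryEnqueue` are hypothetical names. The idea is that admission returns an explicit result, so a test can assert on it directly instead of waiting on a timeout.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gate is a hypothetical admission point in front of a message queue.
type gate struct {
	fenced atomic.Bool
	queue  chan string
}

func newGate(size int) *gate {
	return &gate{queue: make(chan string, size)}
}

// Fence stops admission of new messages.
func (g *gate) Fence() {
	g.fenced.Store(true)
}

// TryEnqueue returns false when the gate is fenced or the queue is full.
// It never blocks, so a test can assert on the result immediately.
func (g *gate) TryEnqueue(msg string) bool {
	if g.fenced.Load() {
		return false
	}
	select {
	case g.queue <- msg:
		return true
	default:
		return false
	}
}

func main() {
	g := newGate(4)
	fmt.Println(g.TryEnqueue("before fence"))
	g.Fence()
	// After the fence, admission fails and the queue length is unchanged.
	fmt.Println(g.TryEnqueue("after fence"), len(g.queue))
}
```

With this shape, the post-fence assertion is the return value plus the unchanged queue length, which does not depend on scheduler timing.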

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager.go`:
- Around line 60-77: Create a dedicated predefined repo error for the fenced
write-path state (e.g., add errors.ErrDispatcherWritePathClosed in the central
errors package) and replace the current message-substring approach: have
newWritePathClosedError() generate/return that specific error (use the repo
error construction consistent with other errors, e.g., FastGenByArgs on
errors.ErrDispatcherWritePathClosed), and update IsWritePathClosedError(err) to
detect the condition by matching that error directly (either via errors.Is(err,
errors.ErrDispatcherWritePathClosed) or comparing RFCCode to
errors.ErrDispatcherWritePathClosed.RFCCode()) instead of checking for
ErrChangefeedInitTableTriggerDispatcherFailed plus a substring.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 54d91177-8133-4394-a936-6f0327fa9be2

📥 Commits

Reviewing files that changed from the base of the PR and between 5e97fd8 and ae413aa.

📒 Files selected for processing (8)

downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
downstreamadapter/dispatchermanager/dispatcher_manager_test.go
downstreamadapter/dispatchermanager/helper.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
tests/integration_tests/capture_local_fence_on_session_done/run.sh
tests/integration_tests/capture_session_done_during_task/run.sh

🚧 Files skipped from review as they are similar to previous changes (2)

tests/integration_tests/capture_local_fence_on_session_done/run.sh
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go

asddongmen · 2026-06-05T13:51:56Z

/test all

coderabbitai

🧹 Nitpick comments (1)

downstreamadapter/dispatchermanager/dispatcher_manager_redo.go (1)
338-340: Document fenced no-op in UpdateRedoMeta.

DispatcherManager.UpdateRedoMeta (downstreamadapter/dispatchermanager/dispatcher_manager_redo.go:338-340) returns immediately when writePathClosed is set with no error/log. The only observed call site (downstreamadapter/dispatchermanager/helper.go:751-752) ignores the outcome, so callers can’t tell whether the meta was updated. Add a brief comment above the early return explaining this is an intentional no-op during write-path shutdown/fencing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatchermanager/dispatcher_manager_redo.go` around lines
338 - 340, UpdateRedoMeta currently returns immediately when
e.writePathClosed.Load() is true with no indication to readers; add a brief
comment directly above the early return in function UpdateRedoMeta explaining
that this is an intentional no-op because the write path is closed/fenced during
shutdown so metadata updates are ignored (do not change behavior or add an
error/log, just document the intent).
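A sketch of what the documented early return might look like, assuming a manager with an atomic `writePathClosed` flag as the comment describes. The `manager` type and its `checkpointTs` field are hypothetical simplifications, not the real `DispatcherManager`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// manager is a hypothetical, reduced stand-in for DispatcherManager.
type manager struct {
	writePathClosed atomic.Bool
	checkpointTs    uint64
}

// UpdateRedoMeta records the latest redo checkpoint.
func (m *manager) UpdateRedoMeta(checkpointTs uint64) {
	// Intentional no-op: once the write path is closed or fenced during
	// shutdown, redo meta updates are dropped rather than reported.
	if m.writePathClosed.Load() {
		return
	}
	m.checkpointTs = checkpointTs
}

func main() {
	m := &manager{}
	m.UpdateRedoMeta(100)
	m.writePathClosed.Store(true)
	m.UpdateRedoMeta(200) // ignored after fencing
	fmt.Println(m.checkpointTs)
}
```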


ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3fbd800b-16e6-49af-9547-7a196144a232

📥 Commits

Reviewing files that changed from the base of the PR and between e7cda87 and a05f186.

📒 Files selected for processing (5)

downstreamadapter/dispatcher/event_dispatcher.go
downstreamadapter/dispatchermanager/dispatcher_manager.go
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
downstreamadapter/dispatchermanager/dispatcher_manager_test.go
downstreamadapter/dispatchermanager/helper.go

🚧 Files skipped from review as they are similar to previous changes (2)

downstreamadapter/dispatchermanager/helper.go
downstreamadapter/dispatchermanager/dispatcher_manager.go

asddongmen · 2026-06-08T06:43:02Z

/test all

asddongmen · 2026-06-08T07:52:23Z

/retest

asddongmen · 2026-06-08T11:04:43Z

/retest

ti-chi-bot Bot added the do-not-merge/needs-linked-issue, do-not-merge/work-in-progress, and release-note labels May 20, 2026

ti-chi-bot Bot added the size/XL label (changes 500-999 lines, ignoring generated files) May 20, 2026

gemini-code-assist Bot reviewed May 20, 2026

Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go Outdated

Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go Outdated

ti-chi-bot Bot added the size/XXL label (changes 1000+ lines, ignoring generated files) and removed the size/XL label Jun 4, 2026

asddongmen added 5 commits June 4, 2026 20:52

server, dispatcher: implemented the minimal local fence

e2f37c9

Signed-off-by: dongmen <414110582@qq.com>

server, dispatcher: implemented the minimal local fence 2

ca2187f

Signed-off-by: dongmen <414110582@qq.com>

tests,sink: add local fence session done integration test

431bdab

tests: use advertised capture address in local fence test

5906c66

tests: allow local fence test cdc port override

fa0781c

asddongmen force-pushed the 0520-improve-node-liveness-self-fence branch from 4a22383 to fa0781c Compare June 4, 2026 12:53

tests: update session done local fence assertions

fe802b8

tests: expect local fence lease expiry log

3d78ccc

downstreamadapter: guard redo sink during local fence

5e97fd8

ti-chi-bot Bot added do-not-merge/needs-triage-completed and removed do-not-merge/needs-linked-issue labels Jun 5, 2026

asddongmen marked this pull request as ready for review June 5, 2026 06:14

ti-chi-bot Bot removed the do-not-merge/work-in-progress label Jun 5, 2026

asddongmen self-assigned this Jun 5, 2026

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Jun 5, 2026

coderabbitai Bot reviewed Jun 5, 2026

Comment thread tests/integration_tests/capture_local_fence_on_session_done/run.sh Outdated

Comment thread tests/integration_tests/capture_local_fence_on_session_done/run.sh Outdated

Comment thread tests/integration_tests/capture_session_done_during_task/run.sh Outdated

wk989898 approved these changes Jun 5, 2026

ti-chi-bot Bot added the needs-1-more-lgtm label Jun 5, 2026

ti-chi-bot Bot added the approved label Jun 5, 2026

asddongmen added 2 commits June 5, 2026 20:57

downstreamadapter: fence table trigger writes on local fence

afb5fb3

tests: harden local fence integration scripts

ae413aa

coderabbitai Bot reviewed Jun 5, 2026

Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go

downstreamadapter: use dedicated write path fence error

e7cda87

downstreamadapter: avoid fenced write path blocking

a05f186

coderabbitai Bot reviewed Jun 8, 2026
