
server, dispatcher: improve node liveness self fence #5106

Open
asddongmen wants to merge 12 commits into pingcap:master from asddongmen:0520-improve-node-liveness-self-fence


@asddongmen asddongmen commented May 20, 2026

Collaborator

What problem does this PR solve?

Issue Number: close #5202

When a TiCDC capture loses its etcd session or lease, it can no longer prove that it still owns local dispatcher work. The previous shutdown path did not immediately fence local write paths, so a stale capture could still accept maintainer requests or continue writing downstream while failover was already in progress. This created a short but unsafe window for duplicate or out-of-date downstream writes before normal cleanup finished.

What is changed and how it works?

This PR adds a local fence path for session-done and lease-expired events. The server watches the etcd session, triggers local fencing before shutdown, and the dispatcher orchestrator stops accepting new maintainer requests. Dispatcher managers cancel local write paths immediately, close local dispatchers asynchronously, and continue cleanup without waiting for progress draining. The redo cleanup path is also guarded so partially initialized managers do not panic when fencing. New integration coverage simulates capture session loss and verifies the local fence behavior and downstream consistency.
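
The ordering described above (fence first, then shut down) can be sketched as follows. This is a minimal illustration, not the PR's code: `fencer`, `eventLog`, `watchSession`, and `simulateSessionLoss` are hypothetical names, and the session is modeled as a plain channel. In etcd's Go client a session's `Done()` channel plays this role, though how the PR wires it in is not shown here.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// fencer is a hypothetical stand-in for whatever the server calls to fence
// local work; the real interface in the PR may look different.
type fencer interface {
	LocalFence(reason string)
}

// eventLog records the order of fence and shutdown steps for illustration.
type eventLog struct{ events []string }

func (l *eventLog) LocalFence(reason string) {
	l.events = append(l.events, "fence: "+reason)
}

// watchSession waits for the session to end or for ctx to be cancelled.
// When the session ends it fences local work first and only then starts
// shutdown. It reports whether the fence was triggered.
func watchSession(ctx context.Context, sessionDone <-chan struct{}, f fencer, shutdown func()) bool {
	select {
	case <-sessionDone:
		f.LocalFence("etcd session done")
		shutdown()
		return true
	case <-ctx.Done():
		return false
	}
}

// simulateSessionLoss closes a fake session channel and returns the
// recorded steps in the order they happened.
func simulateSessionLoss() []string {
	log := &eventLog{}
	sessionDone := make(chan struct{})
	close(sessionDone) // the session is already gone

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	watchSession(ctx, sessionDone, log, func() {
		log.events = append(log.events, "shutdown")
	})
	return log.events
}

func main() {
	for _, e := range simulateSessionLoss() {
		fmt.Println(e)
	}
}
```

Calling the fence synchronously before `shutdown` is the point of the sketch: a stale capture stops accepting work before any slower cleanup begins.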

Check List

Tests

  • Unit test
  • Integration test
  • Manual test: dev-machine MySQL integration cases and GitHub /test all

Questions

Will it cause performance regression or break compatibility?

No. The new path only runs when a local capture loses its session or is being fenced.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

TiCDC now locally fences stale captures after etcd session loss to avoid unsafe downstream writes.

Summary by CodeRabbit

  • New Features

    • Added etcd session watchdog and server-side fencing to trigger local fencing on lease/session loss
    • Added local fencing controls to orchestrator and managers to immediately fence write paths
  • Bug Fixes

    • Prevented races between dispatcher creation/merge and shutdown; avoid registering partially-initialized dispatchers
    • Short-circuited operations when write path is fenced to improve shutdown safety
  • Tests

    • Added unit and integration tests for fencing, shutdown and session-watchdog behaviors

@ti-chi-bot

ti-chi-bot Bot commented May 20, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added the do-not-merge/needs-linked-issue, do-not-merge/work-in-progress (a PR that should not merge because it is a work in progress), and release-note (a PR that will be considered when it comes time to generate release notes) labels May 20, 2026
@coderabbitai

coderabbitai Bot commented May 20, 2026

Contributor


Walkthrough

Adds coordinated local fencing: DispatcherManager write-path fencing and close refactor, DispatcherOrchestrator fenced mode and propagation, a server session watchdog that triggers local fence on etcd lease/session loss, unit/integration tests, and related scripts/failpoint updates.

Changes

Local fencing mechanism

| Layer / File(s) | Summary |
|---|---|
| All related changes (fencing, watchdog, tests, integration, helpers): `downstreamadapter/...`, `server/...`, `tests/integration_tests/*`, `pkg/*` | Implements write-path fencing across DispatcherManager and DispatcherOrchestrator (fenced flag, IsWritePathClosedError, LocalFence methods), server etcd session watchdog to trigger local fencing, updates to event/redo dispatcher creation and merge to be fencing-aware, helper and test updates, new normalized error, and integration scripts + failpoint adjustments. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

lgtm

Suggested reviewers

  • hongyunyan
  • lidezhu

"I nibble logs and guard the line,
When leases drop and sessions fade,
I fence the writes, keep data fine,
Quiet managers in evening shade.
A carrot hop for safe cascade." 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'server, dispatcher: improve node liveness self fence' is specific and concise, clearly describing the main change of adding local fencing when a capture loses its etcd session. |
| Description check | ✅ Passed | The description includes the issue number (close #5202), explains the problem and solution, provides a comprehensive checklist with unit/integration/manual tests, and includes a proper release note. |
| Linked Issues check | ✅ Passed | The PR addresses all requirements from #5202: immediate local fencing on etcd session loss, rejection of new maintainer requests, cancellation of dispatcher write paths, and prevention of unsafe downstream writes through comprehensive changes across server, dispatcher orchestrator, and manager components. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope for the local fencing feature: session watchdog, dispatcher fencing logic, redo dispatcher protection, integration tests, and MySQL writer test failpoint expansion are all directly related to preventing stale captures from writing. |


@ti-chi-bot ti-chi-bot Bot added the size/XL label (a PR that changes 500-999 lines, ignoring generated files) May 20, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a local fencing mechanism to ensure that downstream writes are stopped immediately when a capture loses its etcd session or lease. This is achieved by adding a session watchdog in the server and implementing a LocalFence method across the DispatcherOrchestrator and DispatcherManager to bypass graceful draining in failure scenarios. The review feedback identifies potential nil pointer dereferences in the DispatcherManager's shutdown logic, specifically regarding the redoSink when redo logging is enabled.
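
A rough sketch of the shape this review describes: a fenced flag on the orchestrator, fence propagation to managers, and a nil guard around the redo sink. All type and method names here (`orchestrator`, `manager`, `redoSink`, `HandleRequest`, `errFenced`) are illustrative assumptions, not the identifiers used in the PR, and the sketch is single-goroutine apart from the atomic flags.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errFenced is a hypothetical error returned once the orchestrator is fenced.
var errFenced = errors.New("orchestrator is locally fenced")

// redoSink is a toy stand-in for a redo log sink.
type redoSink struct{ closed bool }

func (s *redoSink) Close() { s.closed = true }

// manager models a dispatcher manager whose redo sink may be nil, either
// because redo is disabled or because initialization never finished.
type manager struct {
	writePathClosed atomic.Bool
	redo            *redoSink
}

// LocalFence closes the write path at most once and skips a nil redo sink.
func (m *manager) LocalFence() {
	if !m.writePathClosed.CompareAndSwap(false, true) {
		return // already fenced
	}
	if m.redo != nil {
		m.redo.Close()
	}
}

// orchestrator models the component that admits maintainer requests.
type orchestrator struct {
	fenced   atomic.Bool
	managers []*manager
}

// HandleRequest rejects every request once the fence is set.
func (o *orchestrator) HandleRequest() error {
	if o.fenced.Load() {
		return errFenced
	}
	return nil
}

// LocalFence flips the flag first, then fences each manager without
// waiting for in-flight progress to drain.
func (o *orchestrator) LocalFence() {
	if !o.fenced.CompareAndSwap(false, true) {
		return
	}
	for _, m := range o.managers {
		m.LocalFence()
	}
}

func main() {
	o := &orchestrator{managers: []*manager{
		{redo: &redoSink{}},
		{}, // partially initialized: no redo sink
	}}
	fmt.Println("before fence:", o.HandleRequest())
	o.LocalFence()
	fmt.Println("after fence:", o.HandleRequest())
}
```

Using `CompareAndSwap` keeps the fence idempotent, so a session-done event and a lease-expired event arriving close together would not close the same sink twice.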

Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go Outdated
Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go Outdated
@asddongmen

Collaborator Author

/test all

@ti-chi-bot ti-chi-bot Bot added the size/XXL label (a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (a PR that changes 500-999 lines, ignoring generated files) Jun 4, 2026
@asddongmen asddongmen force-pushed the 0520-improve-node-liveness-self-fence branch from 4a22383 to fa0781c Compare June 4, 2026 12:53
@asddongmen

Collaborator Author

/test all

@asddongmen

Collaborator Author

/test all

@asddongmen

Collaborator Author

/test all

@asddongmen

Collaborator Author

/test pull-cdc-mysql-integration-heavy

@asddongmen

Collaborator Author

/test all

@asddongmen asddongmen marked this pull request as ready for review June 5, 2026 06:14
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress label (a PR that should not merge because it is a work in progress) Jun 5, 2026
@asddongmen asddongmen self-assigned this Jun 5, 2026

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration_tests/capture_local_fence_on_session_done/run.sh`:
- Around line 27-28: The curl calls that populate addr (the command using curl
-s "http://${api_addr}/api/v2/captures" piped to jq) should include timeouts to
prevent hanging CI; update that invocation (and the other curl invocations at
the similar spots referenced) to set both a connection timeout and an overall
max time (e.g., --connect-timeout and --max-time) and preserve -s; ensure the
same timeout flags are applied to the other two curl calls in this script (the
ones at the ranges you noted) so the retry loop cannot block indefinitely.
- Line 168: The script forwards positional parameters using $* which can split
or reshape arguments; update the invocation of the run helper (the line
containing "run $*") to forward arguments safely by using the quoted array form
"$@" instead so each original argument is preserved exactly when passed to run.

In `@tests/integration_tests/capture_session_done_during_task/run.sh`:
- Around line 59-60: The two strict checks in run.sh ("check_logs_contains
$WORK_DIR \"local fence triggered\"" and "check_logs_contains $WORK_DIR \"etcd
lease expired\"") make the test flaky because the reason can be either "etcd
session done" or "etcd lease expired"; update the assertions so after verifying
"local fence triggered" with check_logs_contains you assert that the logs
contain either "etcd session done" OR "etcd lease expired" (e.g., replace the
second check with a single check that uses a regex/alternate match or two-branch
logic calling check_logs_contains for "etcd session done" OR "etcd lease
expired") so the test passes when either reason appears.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e024b9c5-b91a-4f1c-ae51-6f90b9eb32f8

📥 Commits

Reviewing files that changed from the base of the PR and between ff135e6 and 5e97fd8.

📒 Files selected for processing (12)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • pkg/sink/mysql/mysql_writer_dml_exec.go
  • server/server.go
  • server/server_session_watchdog_test.go
  • tests/integration_tests/capture_local_fence_on_session_done/conf/diff_config.toml
  • tests/integration_tests/capture_local_fence_on_session_done/run.sh
  • tests/integration_tests/capture_session_done_during_task/run.sh
  • tests/integration_tests/run_light_it_in_ci.sh

Comment thread tests/integration_tests/capture_local_fence_on_session_done/run.sh Outdated
Comment thread tests/integration_tests/capture_local_fence_on_session_done/run.sh Outdated
Comment thread tests/integration_tests/capture_session_done_during_task/run.sh Outdated
@ti-chi-bot

ti-chi-bot Bot commented Jun 5, 2026


[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added the needs-1-more-lgtm label (a PR that needs 1 more LGTM) Jun 5, 2026
@ti-chi-bot

ti-chi-bot Bot commented Jun 5, 2026


[LGTM Timeline notifier]

Timeline:

  • 2026-06-05 08:26:11.880158741 +0000 UTC m=+516472.950476131: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot Bot added the approved label Jun 5, 2026
@asddongmen

Collaborator Author

/test all

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🧹 Nitpick comments (1)
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go (1)

458-462: ⚡ Quick win

Make the drop assertion deterministic.

Lines 458-462 only show that nothing reached processed within 50ms. A message that was merely delayed, or enqueued after fence but never drained, looks identical, so this can both mask regressions and flake on slow CI. Please assert the post-fence admission state directly instead of relying on a short timeout.

As per coding guidelines, **/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.
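
One way to read this suggestion is sketched below: give the enqueue path a boolean result, assert on it and on the queue length after the fence, and use a non-blocking `select` rather than a sleep. The `orchestrator` type, `Enqueue`, and `nothingPending` are hypothetical and only model the idea. The real test would presumably use the repository's own types and its assertion helpers.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// orchestrator is a toy model with an admission flag and a bounded queue.
type orchestrator struct {
	fenced atomic.Bool
	queue  chan string
}

func newOrchestrator(capacity int) *orchestrator {
	return &orchestrator{queue: make(chan string, capacity)}
}

// LocalFence stops admission of new messages.
func (o *orchestrator) LocalFence() { o.fenced.Store(true) }

// Enqueue reports whether the message was admitted, so a test can assert
// on the result directly instead of waiting for a timeout.
func (o *orchestrator) Enqueue(msg string) bool {
	if o.fenced.Load() {
		return false
	}
	select {
	case o.queue <- msg:
		return true
	default:
		return false // queue is full
	}
}

// nothingPending checks a channel without blocking. Note that it consumes
// a message if one is waiting, which is acceptable for a failing assertion.
func nothingPending(ch <-chan string) bool {
	select {
	case <-ch:
		return false
	default:
		return true
	}
}

func main() {
	o := newOrchestrator(4)
	o.LocalFence()

	admitted := o.Enqueue("bootstrap request")
	fmt.Println("admitted after fence:", admitted)
	fmt.Println("queue length:", len(o.queue))
	fmt.Println("nothing pending:", nothingPending(o.queue))
}
```

All three checks are decided synchronously, so the outcome does not depend on how quickly a slow CI machine schedules goroutines.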

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`
around lines 458 - 462, Replace the flaky 50ms timeout check on the processed
channel with a deterministic assertion of post-fence admission state: after
applying the local fence, directly assert the orchestrator/maintainer admission
flag or queue length (e.g. check an "isAdmitting" / "admissionOpen" boolean or
the maintainer queue length) to prove new maintainer messages are rejected, and
if only the processed channel is available, use a non-blocking select (case msg
:= <-processed: require.FailNow(...); default: ) and also attempt to enqueue a
new maintainer and assert that enqueue returns a failure/false or does not
increase the queue size; reference the processed channel and the local fence
operation in the test to locate where to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager.go`:
- Around line 60-77: Create a dedicated predefined repo error for the fenced
write-path state (e.g., add errors.ErrDispatcherWritePathClosed in the central
errors package) and replace the current message-substring approach: have
newWritePathClosedError() generate/return that specific error (use the repo
error construction consistent with other errors, e.g., FastGenByArgs on
errors.ErrDispatcherWritePathClosed), and update IsWritePathClosedError(err) to
detect the condition by matching that error directly (either via errors.Is(err,
errors.ErrDispatcherWritePathClosed) or comparing RFCCode to
errors.ErrDispatcherWritePathClosed.RFCCode()) instead of checking for
ErrChangefeedInitTableTriggerDispatcherFailed plus a substring.
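
The general pattern behind this comment, matching on a dedicated error value instead of a message substring, can be illustrated with Go's standard `errors` package. This is only a stand-in: the comment refers to the repository's own error package (`FastGenByArgs`, `RFCCode`), which is not reproduced here, and `lookalikeError` is a hypothetical helper added to show the failure mode of substring matching.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDispatcherWritePathClosed is a sentinel for the fenced write-path state.
// The name follows the review comment; here it is a plain stdlib error.
var ErrDispatcherWritePathClosed = errors.New("dispatcher write path closed")

// newWritePathClosedError wraps the sentinel with context using %w so that
// errors.Is can still find it.
func newWritePathClosedError(changefeed string) error {
	return fmt.Errorf("changefeed %s: %w", changefeed, ErrDispatcherWritePathClosed)
}

// IsWritePathClosedError matches on error identity, not on message text.
func IsWritePathClosedError(err error) bool {
	return errors.Is(err, ErrDispatcherWritePathClosed)
}

// lookalikeError returns an unrelated error whose message happens to contain
// the same words; a substring check would misclassify it.
func lookalikeError() error {
	return errors.New("init failed: dispatcher write path closed")
}

func main() {
	fmt.Println(IsWritePathClosedError(newWritePathClosedError("cf-1")))
	fmt.Println(IsWritePathClosedError(lookalikeError()))
}
```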


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 54d91177-8133-4394-a936-6f0327fa9be2

📥 Commits

Reviewing files that changed from the base of the PR and between 5e97fd8 and ae413aa.

📒 Files selected for processing (8)
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatchermanager/helper.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go
  • tests/integration_tests/capture_local_fence_on_session_done/run.sh
  • tests/integration_tests/capture_session_done_during_task/run.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integration_tests/capture_local_fence_on_session_done/run.sh
  • downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go

Comment thread downstreamadapter/dispatchermanager/dispatcher_manager.go
@asddongmen

Collaborator Author

/test all

@coderabbitai coderabbitai Bot left a comment

Contributor


🧹 Nitpick comments (1)
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go (1)

338-340: Document fenced no-op in UpdateRedoMeta.

DispatcherManager.UpdateRedoMeta (downstreamadapter/dispatchermanager/dispatcher_manager_redo.go:338-340) returns immediately when writePathClosed is set with no error/log. The only observed call site (downstreamadapter/dispatchermanager/helper.go:751-752) ignores the outcome, so callers can’t tell whether the meta was updated. Add a brief comment above the early return explaining this is an intentional no-op during write-path shutdown/fencing.
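
A minimal sketch of what the documented early return might look like. Only the `writePathClosed.Load()` check comes from the comment above. The struct fields and the `UpdateRedoMeta` signature are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dispatcherManager is a toy model; the real struct has many more fields.
type dispatcherManager struct {
	writePathClosed atomic.Bool
	checkpointTs    uint64
	resolvedTs      uint64
}

// UpdateRedoMeta stores the latest redo metadata (hypothetical signature).
func (e *dispatcherManager) UpdateRedoMeta(checkpointTs, resolvedTs uint64) {
	// Intentional no-op: once the write path is closed (shutdown or local
	// fence), redo metadata updates are dropped rather than reported as
	// errors, because this manager no longer owns the write path.
	if e.writePathClosed.Load() {
		return
	}
	e.checkpointTs = checkpointTs
	e.resolvedTs = resolvedTs
}

func main() {
	m := &dispatcherManager{}
	m.UpdateRedoMeta(100, 120)
	m.writePathClosed.Store(true)
	m.UpdateRedoMeta(200, 220) // ignored after fencing
	fmt.Println(m.checkpointTs, m.resolvedTs)
}
```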

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@downstreamadapter/dispatchermanager/dispatcher_manager_redo.go` around lines
338 - 340, UpdateRedoMeta currently returns immediately when
e.writePathClosed.Load() is true with no indication to readers; add a brief
comment directly above the early return in function UpdateRedoMeta explaining
that this is an intentional no-op because the write path is closed/fenced during
shutdown so metadata updates are ignored (do not change behavior or add an
error/log, just document the intent).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 3fbd800b-16e6-49af-9547-7a196144a232

📥 Commits

Reviewing files that changed from the base of the PR and between e7cda87 and a05f186.

📒 Files selected for processing (5)
  • downstreamadapter/dispatcher/event_dispatcher.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_redo.go
  • downstreamadapter/dispatchermanager/dispatcher_manager_test.go
  • downstreamadapter/dispatchermanager/helper.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • downstreamadapter/dispatchermanager/helper.go
  • downstreamadapter/dispatchermanager/dispatcher_manager.go

@asddongmen

Collaborator Author

/test all

@asddongmen

Collaborator Author

/retest

1 similar comment
@asddongmen

Collaborator Author

/retest


Labels

  • approved
  • needs-1-more-lgtm: Indicates a PR needs 1 more LGTM.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

TiCDC node may keep writing after losing its etcd session

2 participants