server, dispatcher: improve node liveness self fence#5106
Conversation
|
Skipping CI for Draft Pull Request. |
📝 WalkthroughWalkthroughAdds coordinated local fencing: DispatcherManager write-path fencing and close refactor, DispatcherOrchestrator fenced mode and propagation, a server session watchdog that triggers local fence on etcd lease/session loss, unit/integration tests, and related scripts/failpoint updates. ChangesLocal fencing mechanism
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a local fencing mechanism to ensure that downstream writes are stopped immediately when a capture loses its etcd session or lease. This is achieved by adding a session watchdog in the server and implementing a LocalFence method across the DispatcherOrchestrator and DispatcherManager to bypass graceful draining in failure scenarios. The review feedback identifies potential nil pointer dereferences in the DispatcherManager's shutdown logic, specifically regarding the redoSink when redo logging is enabled.
|
/test all |
Signed-off-by: dongmen <414110582@qq.com>
Signed-off-by: dongmen <414110582@qq.com>
4a22383 to
fa0781c
Compare
|
/test all |
|
/test all |
|
/test all |
|
/test pull-cdc-mysql-integration-heavy |
|
/test all |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/integration_tests/capture_local_fence_on_session_done/run.sh`:
- Around line 27-28: The curl calls that populate addr (the command using curl
-s "http://${api_addr}/api/v2/captures" piped to jq) should include timeouts to
prevent hanging CI; update that invocation (and the other curl invocations at
the similar spots referenced) to set both a connection timeout and an overall
max time (e.g., --connect-timeout and --max-time) and preserve -s; ensure the
same timeout flags are applied to the other two curl calls in this script (the
ones at the ranges you noted) so the retry loop cannot block indefinitely.
- Line 168: The script forwards positional parameters using $* which can split
or reshape arguments; update the invocation of the run helper (the line
containing "run $*") to forward arguments safely by using the quoted array form
"$@" instead so each original argument is preserved exactly when passed to run.
In `@tests/integration_tests/capture_session_done_during_task/run.sh`:
- Around line 59-60: The two strict checks in run.sh ("check_logs_contains
$WORK_DIR \"local fence triggered\"" and "check_logs_contains $WORK_DIR \"etcd
lease expired\"") make the test flaky because the reason can be either "etcd
session done" or "etcd lease expired"; update the assertions so after verifying
"local fence triggered" with check_logs_contains you assert that the logs
contain either "etcd session done" OR "etcd lease expired" (e.g., replace the
second check with a single check that uses a regex/alternate match or two-branch
logic calling check_logs_contains for "etcd session done" OR "etcd lease
expired") so the test passes when either reason appears.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e024b9c5-b91a-4f1c-ae51-6f90b9eb32f8
📒 Files selected for processing (12)
downstreamadapter/dispatchermanager/dispatcher_manager.godownstreamadapter/dispatchermanager/dispatcher_manager_redo.godownstreamadapter/dispatchermanager/dispatcher_manager_test.godownstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.godownstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.gopkg/sink/mysql/mysql_writer_dml_exec.goserver/server.goserver/server_session_watchdog_test.gotests/integration_tests/capture_local_fence_on_session_done/conf/diff_config.tomltests/integration_tests/capture_local_fence_on_session_done/run.shtests/integration_tests/capture_session_done_during_task/run.shtests/integration_tests/run_light_it_in_ci.sh
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: wk989898 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
[LGTM Timeline notifier]Timeline:
|
|
/test all |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go (1)
458-462: ⚡ Quick winMake the drop assertion deterministic.
Lines 458-462 only show that nothing reached
processedwithin 50ms. A message that was merely delayed, or enqueued after fence but never drained, looks identical, so this can both mask regressions and flake on slow CI. Please assert the post-fence admission state directly instead of relying on a short timeout.As per coding guidelines,
**/*_test.go: Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go` around lines 458 - 462, Replace the flaky 50ms timeout check on the processed channel with a deterministic assertion of post-fence admission state: after applying the local fence, directly assert the orchestrator/maintainer admission flag or queue length (e.g. check an "isAdmitting" / "admissionOpen" boolean or the maintainer queue length) to prove new maintainer messages are rejected, and if only the processed channel is available, use a non-blocking select (case msg := <-processed: require.FailNow(...); default: ) and also attempt to enqueue a new maintainer and assert that enqueue returns a failure/false or does not increase the queue size; reference the processed channel and the local fence operation in the test to locate where to make this change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager.go`:
- Around line 60-77: Create a dedicated predefined repo error for the fenced
write-path state (e.g., add errors.ErrDispatcherWritePathClosed in the central
errors package) and replace the current message-substring approach: have
newWritePathClosedError() generate/return that specific error (use the repo
error construction consistent with other errors, e.g., FastGenByArgs on
errors.ErrDispatcherWritePathClosed), and update IsWritePathClosedError(err) to
detect the condition by matching that error directly (either via errors.Is(err,
errors.ErrDispatcherWritePathClosed) or comparing RFCCode to
errors.ErrDispatcherWritePathClosed.RFCCode()) instead of checking for
ErrChangefeedInitTableTriggerDispatcherFailed plus a substring.
---
Nitpick comments:
In `@downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.go`:
- Around line 458-462: Replace the flaky 50ms timeout check on the processed
channel with a deterministic assertion of post-fence admission state: after
applying the local fence, directly assert the orchestrator/maintainer admission
flag or queue length (e.g. check an "isAdmitting" / "admissionOpen" boolean or
the maintainer queue length) to prove new maintainer messages are rejected, and
if only the processed channel is available, use a non-blocking select (case msg
:= <-processed: require.FailNow(...); default: ) and also attempt to enqueue a
new maintainer and assert that enqueue returns a failure/false or does not
increase the queue size; reference the processed channel and the local fence
operation in the test to locate where to make this change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 54d91177-8133-4394-a936-6f0327fa9be2
📒 Files selected for processing (8)
downstreamadapter/dispatchermanager/dispatcher_manager.godownstreamadapter/dispatchermanager/dispatcher_manager_redo.godownstreamadapter/dispatchermanager/dispatcher_manager_test.godownstreamadapter/dispatchermanager/helper.godownstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.godownstreamadapter/dispatcherorchestrator/dispatcher_orchestrator_test.gotests/integration_tests/capture_local_fence_on_session_done/run.shtests/integration_tests/capture_session_done_during_task/run.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/integration_tests/capture_local_fence_on_session_done/run.sh
- downstreamadapter/dispatcherorchestrator/dispatcher_orchestrator.go
|
/test all |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
downstreamadapter/dispatchermanager/dispatcher_manager_redo.go (1)
338-340: Document fenced no-op inUpdateRedoMeta.
DispatcherManager.UpdateRedoMeta(downstreamadapter/dispatchermanager/dispatcher_manager_redo.go:338-340) returns immediately whenwritePathClosedis set with no error/log. The only observed call site (downstreamadapter/dispatchermanager/helper.go:751-752) ignores the outcome, so callers can’t tell whether the meta was updated. Add a brief comment above the early return explaining this is an intentional no-op during write-path shutdown/fencing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@downstreamadapter/dispatchermanager/dispatcher_manager_redo.go` around lines 338 - 340, UpdateRedoMeta currently returns immediately when e.writePathClosed.Load() is true with no indication to readers; add a brief comment directly above the early return in function UpdateRedoMeta explaining that this is an intentional no-op because the write path is closed/fenced during shutdown so metadata updates are ignored (do not change behavior or add an error/log, just document the intent).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@downstreamadapter/dispatchermanager/dispatcher_manager_redo.go`:
- Around line 338-340: UpdateRedoMeta currently returns immediately when
e.writePathClosed.Load() is true with no indication to readers; add a brief
comment directly above the early return in function UpdateRedoMeta explaining
that this is an intentional no-op because the write path is closed/fenced during
shutdown so metadata updates are ignored (do not change behavior or add an
error/log, just document the intent).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3fbd800b-16e6-49af-9547-7a196144a232
📒 Files selected for processing (5)
downstreamadapter/dispatcher/event_dispatcher.godownstreamadapter/dispatchermanager/dispatcher_manager.godownstreamadapter/dispatchermanager/dispatcher_manager_redo.godownstreamadapter/dispatchermanager/dispatcher_manager_test.godownstreamadapter/dispatchermanager/helper.go
🚧 Files skipped from review as they are similar to previous changes (2)
- downstreamadapter/dispatchermanager/helper.go
- downstreamadapter/dispatchermanager/dispatcher_manager.go
|
/test all |
|
/retest |
1 similar comment
|
/retest |
What problem does this PR solve?
Issue Number: close #5202
When a TiCDC capture loses its etcd session or lease, it can no longer prove that it still owns local dispatcher work. The previous shutdown path did not immediately fence local write paths, so a stale capture could still accept maintainer requests or continue writing downstream while failover was already in progress. This created a short but unsafe window for duplicate or out-of-date downstream writes before normal cleanup finished.
What is changed and how it works?
This PR adds a local fence path for session-done and lease-expired events. The server watches the etcd session, triggers local fencing before shutdown, and the dispatcher orchestrator stops accepting new maintainer requests. Dispatcher managers cancel local write paths immediately, close local dispatchers asynchronously, and continue cleanup without waiting for progress draining. The redo cleanup path is also guarded so partially initialized managers do not panic when fencing. New integration coverage simulates capture session loss and verifies the local fence behavior and downstream consistency.
Check List
Tests
/test allQuestions
Will it cause performance regression or break compatibility?
No. The new path only runs when a local capture loses its session or is being fenced.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests