We found a behavior in the replication applier that may explain intermittent connection shutdown errors when the incoming segment flow is sparse.
What is confirmed from the code:
- The main loop is in src/remote/server/ReplServer.cpp, function process_thread().
- process_thread() calls process_archive().
- If process_archive() returns PROCESS_CONTINUE, the loop continues without shutdown.
- For any other return value, including PROCESS_SUSPEND, process_thread() calls target->shutdown().
- Target::shutdown() closes the replicator and detaches the database attachment.
- When process_archive() finds no new segments (queue.isEmpty()), it returns PROCESS_SUSPEND.
- On the next iteration, if new segments appear, Target::initReplica() performs a new attachDatabase() and recreates the replicator.
This means that with a sparse replication stream, the applier repeatedly goes through this cycle:
process segments -> no new segments -> PROCESS_SUSPEND -> shutdown/detach -> next segment arrives -> attach again
With a dense stream, where new segments arrive every few seconds, PROCESS_SUSPEND happens less often, so reconnects also happen less often.
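The cycle described above can be reproduced with a small model. This is only a sketch, not the Firebird source: `ModelTarget`, `processArchive()` and `runLoop()` are hypothetical stand-ins that mirror the names in this report, and each element of `polls` is assumed to be the number of segments found in the queue on one loop iteration.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins named after the identifiers in this report.
// This is a model of the described control flow, not the real code.
enum ProcessStatus { PROCESS_CONTINUE, PROCESS_SUSPEND };

struct ModelTarget {
    bool attached = false;
    int attachCount = 0;  // how many times the attachment was (re)created
    int detachCount = 0;  // how many times it was torn down

    // Models Target::initReplica(): attach only if not attached yet.
    void initReplica() {
        if (!attached) { attached = true; ++attachCount; }
    }

    // Models Target::shutdown(): close the replicator and detach.
    void shutdown() {
        if (attached) { attached = false; ++detachCount; }
    }
};

// Models process_archive(): an empty queue yields PROCESS_SUSPEND,
// otherwise the segments are applied over a (possibly new) attachment.
ProcessStatus processArchive(ModelTarget& target, std::size_t queuedSegments) {
    if (queuedSegments == 0)
        return PROCESS_SUSPEND;
    target.initReplica();
    return PROCESS_CONTINUE;
}

// Models the loop in process_thread(): anything other than
// PROCESS_CONTINUE leads to shutdown().
ModelTarget runLoop(const std::vector<std::size_t>& polls) {
    ModelTarget target;
    for (std::size_t queued : polls) {
        if (processArchive(target, queued) != PROCESS_CONTINUE)
            target.shutdown();
    }
    return target;
}
```

In this model, a sparse sequence such as `{1, 0, 1, 0, 1, 0}` produces three attach/detach pairs, while a dense sequence such as `{1, 1, 1, 1, 1, 0}` produces only one, which is the difference between the low-traffic and high-traffic cases.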
Observed behavior:
- On low traffic, where segment gaps are longer than the idle timeout, intermittent connection shutdown errors appear.
- On high traffic, where new segments arrive every 3-5 seconds, the issue does not reproduce.
Additional diagnostic concern:
- In Target::shutdown(), m_replicator->close(&localStatus) and m_attachment->detach(&localStatus) are called without checking localStatus.check().
- Because of that, failures during close/detach may not be logged clearly.
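A minimal sketch of what checking the status after each teardown step could look like. It does not use the real Firebird status classes: `StatusModel`, `ResourceModel` and `shutdownWithLogging()` are hypothetical types and functions invented for illustration, and the log is modelled as a plain list of strings.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a status object such as localStatus.
struct StatusModel {
    bool failed = false;
    std::string message;
    void reset() { failed = false; message.clear(); }
};

// Hypothetical stand-in for the replicator or the attachment.
struct ResourceModel {
    bool failOnRelease = false;  // set to simulate a failing close()/detach()

    void release(StatusModel& status) {
        status.reset();
        if (failOnRelease) {
            status.failed = true;
            status.message = "connection shutdown";
        }
    }
};

// Tears down both resources and records each failure separately,
// instead of silently discarding the status after each call.
std::vector<std::string> shutdownWithLogging(ResourceModel& replicator,
                                             ResourceModel& attachment) {
    std::vector<std::string> log;
    StatusModel status;

    replicator.release(status);
    if (status.failed)
        log.push_back("replicator close failed: " + status.message);

    // Detach is still attempted even if closing the replicator failed.
    attachment.release(status);
    if (status.failed)
        log.push_back("attachment detach failed: " + status.message);

    return log;
}
```

With this shape, a failure in the first step is logged at the point where it happens, so a later reconnect error can be traced back to it.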
Why this looks suspicious:
- The code explicitly tears down the attachment on idle.
- Sparse traffic therefore causes frequent disconnect/reconnect cycles.
- This matches the observed difference between low-traffic and high-traffic replication.
Suggested direction:
- Confirm whether the attachment really needs to be closed on every PROCESS_SUSPEND.
- Consider keeping the attachment alive across idle periods, or at least improve logging around the close() / detach() / reconnect paths so that the original failure is captured, not only the later connection shutdown error.
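One possible shape for keeping the attachment alive across idle periods is a grace period: detach only after several consecutive empty polls. The sketch below is a standalone model under that assumption. `simulateIdleGrace()` and `maxIdlePolls` are hypothetical names, and nothing like this setting is claimed to exist in the current code.

```cpp
#include <cstddef>
#include <vector>

struct IdlePolicyResult {
    int attachCount = 0;
    int detachCount = 0;
};

// Each element of `polls` is the number of queued segments seen on one
// loop iteration. The attachment is dropped only after more than
// `maxIdlePolls` consecutive empty polls (hypothetical policy).
IdlePolicyResult simulateIdleGrace(const std::vector<std::size_t>& polls,
                                   int maxIdlePolls) {
    IdlePolicyResult result;
    bool attached = false;
    int idlePolls = 0;

    for (std::size_t queued : polls) {
        if (queued > 0) {
            // Work arrived: attach if needed and reset the idle counter.
            if (!attached) { attached = true; ++result.attachCount; }
            idlePolls = 0;
        } else if (attached && ++idlePolls > maxIdlePolls) {
            // Idle for too long: tear the attachment down.
            attached = false;
            ++result.detachCount;
            idlePolls = 0;
        }
    }
    return result;
}
```

With `maxIdlePolls = 0` the model matches the behavior described in this report (detach on the first empty poll). With `maxIdlePolls = 2`, the sparse sequence `{1, 0, 1, 0, 1, 0}` keeps a single attachment for the whole run.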