Skip to content

fix(topology): prevent source pump deadlock during sink config reload#25330

Open
joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
Automattic:fix/topology-reload-deadlock
Open

fix(topology): prevent source pump deadlock during sink config reload#25330
joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
Automattic:fix/topology-reload-deadlock

Conversation

@joshcoughlan
Copy link
Copy Markdown
Contributor

Summary

Fixes a deadlock during config reload that stalls all sources indefinitely when a changed sink shares a resource (e.g. a bound port) with the new sink, or when the sink is in `reuse_buffers` on the SIGHUP path.

Root cause: During reload, `remove_inputs` sent `Pause` to the upstream fanout for sinks in `wait_for_sinks`. The source pump then blocked in `wait_for_replacements` waiting for a `Replace` message that only arrives after `connect_diff`, which runs after `shutdown_diff`, which itself was waiting for the old sink to finish. Circular dependency, every source stalls.

Fix: For sinks in `wait_for_sinks`, use a full `Remove` (not `Pause`) from the upstream fanout, with a corresponding `Add` (not `Replace`) when reconnecting in `connect_diff`. The source pump can keep sending events to other sinks while the old sink drains. Also bounds `wait_for_sinks` with a drain timeout so a misbehaving sink can't hold reload forever.

Buffer-type agnostic — fixes the deadlock for memory and disk buffered sinks alike, only affects the reload path.

Vector configuration

Reproduction harness included under `testing/github-24125/` — uses a `demo_logs` source fanning out to two `http` sinks listening on the same port range, then SIGHUPs Vector to trigger the reload path. Without the fix, all sources stall; with the fix, reload completes within the timeout.

How did you test this PR?

  • Two new regression tests in `src/topology/test/reload.rs`:
    • `topology_reload_conflicting_sink_does_not_stall` — covers the conflicting-resource path
    • `topology_reload_reuse_buffer_does_not_stall` — covers the SIGHUP / `reuse_buffers` path
  • Both pass alongside the pre-existing `topology_disk_buffer_config_change_*_does_not_stall` tests from fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers #24949 (verified the two fixes coexist correctly)
  • `make check-clippy` clean
  • Reproduction harness in `testing/github-24125/` confirms the bug pre-fix and the fix post-fix

Change Type

  • Bug fix

Is this a breaking change?

  • No

Does this PR include user facing changes?

  • Yes. Changelog fragment included (`changelog.d/24125_fix_topology_reload_deadlock.fix.md`).

References

joshcoughlan and others added 5 commits April 29, 2026 19:19
During config reload, when a sink in wait_for_sinks is being changed,
remove_inputs previously sent Pause to the upstream fanout. This caused
the source pump to block in wait_for_replacements (waiting for a Replace
message that only arrives after connect_diff, which runs after
shutdown_diff completes). Since shutdown_diff waits for the old sink to
finish, this created a circular dependency that stalled all sources.

The fix uses Remove instead of Pause for sinks in wait_for_sinks during
reload, and correspondingly uses Add instead of Replace when
reconnecting them in connect_diff. This allows the source pump to
continue sending events to other sinks while the old sink drains,
breaking the circular dependency.

This is buffer-type agnostic — it fixes the deadlock for memory and disk
buffered sinks alike, and only affects the reload path.

Refs: vectordotdev#24125

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-sink scenario mirroring the production failure: demo_logs ->
route -> two http sinks where one downstream returns 429 after a delay,
backing up events at the sink. After a soak period, an
encoding.except_fields edit + SIGHUP triggers the deadlock.

The orchestrator probes tap output and prometheus metrics at four
checkpoints (initial, post-degradation, post-baseline-reload,
post-config-change-reload) and prints a colorized PASS/FAIL summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add a minimal Dockerfile so the repro can build a Vector image from the
  current source tree on host arch (no cross-compile, no Rosetta) when no
  prebuilt image is supplied.
- Split the merged vector.yaml into base-config.yaml (bind-mounted from the
  repo) and a rendered sinks.yaml in $WORKDIR (also bind-mounted).
- Pass --config explicitly per file at docker run + validate time so the
  test works against any image, including timberio/vector:* images that
  ship a default config in /etc/vector.
- After rewriting sinks.yaml, gate the SIGHUP on "vector validate" succeeding
  inside the container (1s sleep + 5 attempts with exponential backoff).
  Absorbs Docker for Mac bind-mount propagation lag that was racing the
  signal and causing Vector to read a truncated file.
- Misc template/comment cleanups so the renderer doesn't substitute its own
  placeholder docs into the rendered YAML.

Verified: post-edit reload reproduces the bug on
timberio/vector:0.50.0-debian and completes successfully on the patched
local build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reload's shutdown_diff awaits each old sink task in wait_for_sinks
(buffer-reuse and resource-conflict cases) so that ownership of the
buffer / port can be reclaimed before the new sink starts. The await
had no upper bound: if the old sink's downstream was permanently
failing — e.g. an http sink wedged on retriable HTTP 429 — the task
never finished, shutdown_diff blocked forever, connect_diff never
ran, the new sinks never got wired up, and backpressure from the
zero-consumer fanout halted the source. Recovery required restarting
the process.

Wrap both wait_for_sinks awaits in a 60s tokio::time::timeout
(RELOAD_DRAIN_TIMEOUT). On expiry:

  - Emit a structured warn with component_id, timeout_secs, and
    (for changed sinks) reuse_buffers, so the event is observable
    and operators can see which sink stranded events.
  - Detach the old task; it continues retrying in the background
    and exits whenever its in-flight work finally fails.
  - For sinks that were buffer-reuse candidates, drop the buffer_tx
    entry so the builder's existing fallback constructs a fresh
    buffer instead of waiting forever to reclaim the old one.

The timeout strictly improves on prior behaviour: an unbounded
wedge currently is recovered by SIGTERM/SIGKILL, which loses every
in-flight event across every pipeline. The timeout scopes loss to
the one stuck sink and lets the rest of the topology resume.

Verified against testing/github-24125: the post-edit reload now
completes ~60s after the SIGHUP (warn fires, new sinks wire up,
source resumes) where it previously hung indefinitely.

Refs: vectordotdev#24125

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshcoughlan joshcoughlan requested a review from a team as a code owner April 30, 2026 00:53
@github-actions github-actions Bot added the domain: topology Anything related to Vector's topology code label Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

domain: topology Anything related to Vector's topology code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Vector fails to reload sink with failed events

1 participant