fix(topology): prevent source pump deadlock during sink config reload by joshcoughlan · Pull Request #25330 · vectordotdev/vector

joshcoughlan · 2026-04-30T00:53:46Z

Summary

Fixes a deadlock during config reload that stalls all sources indefinitely when a changed sink shares a resource (e.g. a bound port) with the new sink, or when the sink is in `reuse_buffers` on the SIGHUP path.

Root cause: During reload, `remove_inputs` sent `Pause` to the upstream fanout for sinks in `wait_for_sinks`. The source pump then blocked in `wait_for_replacements` waiting for a `Replace` message that only arrives after `connect_diff`, which runs after `shutdown_diff`, which itself was waiting for the old sink to finish. Circular dependency, every source stalls.

Fix: For sinks in `wait_for_sinks`, use a full `Remove` (not `Pause`) from the upstream fanout, with a corresponding `Add` (not `Replace`) when reconnecting in `connect_diff`. The source pump can keep sending events to other sinks while the old sink drains. Also bounds `wait_for_sinks` with a drain timeout so a misbehaving sink can't hold reload forever.

Buffer-type agnostic — fixes the deadlock for memory and disk buffered sinks alike, only affects the reload path.

Vector configuration

Reproduction harness included under `testing/github-24125/` — uses a `demo_logs` source fanning out to two `http` sinks listening on the same port range, then SIGHUPs Vector to trigger the reload path. Without the fix, all sources stall; with the fix, reload completes within the timeout.

How did you test this PR?

Two new regression tests in `src/topology/test/reload.rs`:
- `topology_reload_conflicting_sink_does_not_stall` — covers the conflicting-resource path
- `topology_reload_reuse_buffer_does_not_stall` — covers the SIGHUP / `reuse_buffers` path
Both pass alongside the pre-existing `topology_disk_buffer_config_change_*_does_not_stall` tests from fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers #24949 (verified the two fixes coexist correctly)
`make check-clippy` clean
Reproduction harness in `testing/github-24125/` confirms the bug pre-fix and the fix post-fix

Change Type

Bug fix

Is this a breaking change?

No

Does this PR include user facing changes?

Yes. Changelog fragment included (`changelog.d/24125_fix_topology_reload_deadlock.fix.md`).

References

Closes: Vector fails to reload sink with failed events #24125
Related: fix(topology): Fix for issue causing stalling on shutdown for sinks configured w/ disk buffers #24949 (disk-buffer reload stall — adjacent reload-path fix, this PR's resolution sits on top)

During config reload, when a sink in wait_for_sinks is being changed, remove_inputs previously sent Pause to the upstream fanout. This caused the source pump to block in wait_for_replacements (waiting for a Replace message that only arrives after connect_diff, which runs after shutdown_diff completes). Since shutdown_diff waits for the old sink to finish, this created a circular dependency that stalled all sources. The fix uses Remove instead of Pause for sinks in wait_for_sinks during reload, and correspondingly uses Add instead of Replace when reconnecting them in connect_diff. This allows the source pump to continue sending events to other sinks while the old sink drains, breaking the circular dependency. This is buffer-type agnostic — it fixes the deadlock for memory and disk buffered sinks alike, and only affects the reload path. Refs: vectordotdev#24125 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Multi-sink scenario mirroring the production failure: demo_logs -> route -> two http sinks where one downstream returns 429 after a delay, backing up events at the sink. After a soak period, an encoding.except_fields edit + SIGHUP triggers the deadlock. The orchestrator probes tap output and prometheus metrics at four checkpoints (initial, post-degradation, post-baseline-reload, post-config-change-reload) and prints a colorized PASS/FAIL summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Add a minimal Dockerfile so the repro can build a Vector image from the current source tree on host arch (no cross-compile, no Rosetta) when no prebuilt image is supplied. - Split the merged vector.yaml into base-config.yaml (bind-mounted from the repo) and a rendered sinks.yaml in $WORKDIR (also bind-mounted). - Pass --config explicitly per file at docker run + validate time so the test works against any image, including timberio/vector:* images that ship a default config in /etc/vector. - After rewriting sinks.yaml, gate the SIGHUP on "vector validate" succeeding inside the container (1s sleep + 5 attempts with exponential backoff). Absorbs Docker for Mac bind-mount propagation lag that was racing the signal and causing Vector to read a truncated file. - Misc template/comment cleanups so the renderer doesn't substitute its own placeholder docs into the rendered YAML. Verified: post-edit reload reproduces the bug on timberio/vector:0.50.0-debian and completes successfully on the patched local build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reload's shutdown_diff awaits each old sink task in wait_for_sinks (buffer-reuse and resource-conflict cases) so that ownership of the buffer / port can be reclaimed before the new sink starts. The await had no upper bound: if the old sink's downstream was permanently failing — e.g. an http sink wedged on retriable HTTP 429 — the task never finished, shutdown_diff blocked forever, connect_diff never ran, the new sinks never got wired up, and backpressure from the zero-consumer fanout halted the source. Recovery required restarting the process. Wrap both wait_for_sinks awaits in a 60s tokio::time::timeout (RELOAD_DRAIN_TIMEOUT). On expiry: - Emit a structured warn with component_id, timeout_secs, and (for changed sinks) reuse_buffers, so the event is observable and operators can see which sink stranded events. - Detach the old task; it continues retrying in the background and exits whenever its in-flight work finally fails. - For sinks that were buffer-reuse candidates, drop the buffer_tx entry so the builder's existing fallback constructs a fresh buffer instead of waiting forever to reclaim the old one. The timeout strictly improves on prior behaviour: an unbounded wedge currently is recovered by SIGTERM/SIGKILL, which loses every in-flight event across every pipeline. The timeout scopes loss to the one stuck sink and lets the rest of the topology resume. Verified against testing/github-24125: the post-edit reload now completes ~60s after the SIGHUP (warn fires, new sinks wire up, source resumes) where it previously hung indefinitely. Refs: vectordotdev#24125 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

joshcoughlan and others added 5 commits April 29, 2026 19:19

fix: rustfmt formatting for long connect_diff calls

41ee5ff

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

joshcoughlan requested a review from a team as a code owner April 30, 2026 00:53

github-actions Bot added the domain: topology Anything related to Vector's topology code label Apr 30, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(topology): prevent source pump deadlock during sink config reload#25330

fix(topology): prevent source pump deadlock during sink config reload#25330
joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
Automattic:fix/topology-reload-deadlock

joshcoughlan commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

joshcoughlan commented Apr 30, 2026

Summary

Vector configuration

How did you test this PR?

Change Type

Is this a breaking change?

Does this PR include user facing changes?

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant