[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once #135

nil0x42 · 2025-05-16T16:46:45Z

Hi, i'm nil0x42, the creator of duplicut
I wrote this patch for duplicut's copy of thpool.c, as i had a strange timeout in some unit tests that arised only once per ~1000 runs.

unblock all workers in thpool_destroy(), fix hang with medium-to-large pools

thpool_destroy() is supposed to wake every worker, wait until the pool
is empty, and return quickly.
In practice it can block for dozens of seconds as
soon as the pool size out-grows its one-second “fast-exit” window.

Root cause

bsem_post_all() sets v = 1 and broadcasts.
The first worker that wakes executes
bsem_wait() → while (v==0) … ; v = 0; …
→ the single ticket is consumed, all other awakened threads fall
straight back into pthread_cond_wait().
After that first second, the destroy code slow-polls:
one bsem_post_all() + sleep(1) per loop → one ticket per
second.

With 4000 threads the destructor needs ~4000 s; even with only 16
threads (real-world duplicut run) it can exceed a watchdog such as
timeout 5, hanging about 1 ‰ of the executions.

Fix (pure ANSI C / POSIX, no API change)

bsem_wait() now consumes exactly one ticket:

/* take one ticket */
bsem_p->v--;

bsem_post_all() grants “infinite” tickets so every waiter can
pass:

bsem_p->v = INT_MAX;
pthread_cond_broadcast(&bsem_p->cond);

The single-post path (bsem_post()) is unchanged: one post still wakes
one thread.
All accesses to v remain protected by the semaphore mutex, so no new
data races or lock-order inversions are introduced.

Reproducer added

tests/thpool_destroy_hang.sh
tests/src/thpool_destroy_hang.c

cd tests/
./destroy_hang_deterministic.sh # hangs with old code, passes (<0.2 s) with patch

4000-thread pool hangs 100 % of the time without the patch.
16-thread pool matches the sporadic timeout seen in duplicut.

Impact

Scenario	before	after
`thpool_destroy()` – 4000 threads	60 s+	170 ms
`duplicut` (16 threads)	1 ‰ hangs (timeout 5)	0

The change makes pool shutdown reliable regardless of the thread count
and keeps behaviour identical in every other respect.

`thpool_destroy()` is supposed to wake every worker, wait until the pool is empty, and return quickly. In practice it can block **indefinitely** (or for dozens of seconds) as soon as the pool size out-grows its one-second “fast-exit” window. * `bsem_post_all()` sets `v = 1` and broadcasts. * The **first** worker that wakes executes `bsem_wait() → while (v==0) … ; v = 0; …` → the single ticket is consumed, all other awakened threads fall straight back into `pthread_cond_wait()`. * After that first second, the destroy code slow-polls: one `bsem_post_all()` + `sleep(1)` per loop → **one ticket per second**. With 4000 threads the destructor needs \~4000 s; even with only **16 threads** (real-world `duplicut` run) it can exceed a watchdog such as `timeout 5`, hanging about 1 ‰ of the executions.

I also submitted a PR to upstream project, explaining the issue: Pithikos/C-Thread-Pool#135

npc1054657282 · 2025-05-18T11:23:38Z

Good, this solves #134
I try to solve this by remove the all bsem_* functions and use the cond.
This solve change the code little.

I also submitted a PR to upstream project, explaining the issue: Pithikos/C-Thread-Pool#135

nil0x42 added a commit to nil0x42/duplicut that referenced this pull request May 16, 2025

core: fix hang bug @thpool_destroy()

55b7204

I also submitted a PR to upstream project, explaining the issue: Pithikos/C-Thread-Pool#135

ShowOffJonah added a commit to ShowOffJonah/Cutdi that referenced this pull request Sep 19, 2025

core: fix hang bug @thpool_destroy()

0163a21

I also submitted a PR to upstream project, explaining the issue: Pithikos/C-Thread-Pool#135

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once #135

[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once #135

Uh oh!

nil0x42 commented May 16, 2025

Uh oh!

npc1054657282 commented May 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once #135

Are you sure you want to change the base?

[BUGFIX] Fix thpool_destroy() hang: unlock all workers at once #135

Uh oh!

Conversation

nil0x42 commented May 16, 2025

unblock all workers in thpool_destroy(), fix hang with medium-to-large pools

Root cause

Fix (pure ANSI C / POSIX, no API change)

Reproducer added

Impact

Uh oh!

npc1054657282 commented May 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants