Skip to content

Conversation

nil0x42
Copy link
Contributor

@nil0x42 nil0x42 commented May 16, 2025

Hi, i'm nil0x42, the creator of duplicut
I wrote this patch for duplicut's copy of thpool.c, as i had a strange timeout in some unit tests that arised only once per ~1000 runs.

unblock all workers in thpool_destroy(), fix hang with medium-to-large pools

thpool_destroy() is supposed to wake every worker, wait until the pool
is empty, and return quickly.
In practice it can block for dozens of seconds as
soon as the pool size out-grows its one-second “fast-exit” window.

Root cause

  • bsem_post_all() sets v = 1 and broadcasts.
  • The first worker that wakes executes
    bsem_wait() → while (v==0) … ; v = 0; …
    → the single ticket is consumed, all other awakened threads fall
    straight back into pthread_cond_wait().
  • After that first second, the destroy code slow-polls:
    one bsem_post_all() + sleep(1) per loop → one ticket per
    second
    .

With 4000 threads the destructor needs ~4000 s; even with only 16
threads
(real-world duplicut run) it can exceed a watchdog such as
timeout 5, hanging about 1 ‰ of the executions.

Fix (pure ANSI C / POSIX, no API change)

  1. bsem_wait() now consumes exactly one ticket:
/* take one ticket */
bsem_p->v--;
  1. bsem_post_all() grants “infinite” tickets so every waiter can
    pass:
bsem_p->v = INT_MAX;
pthread_cond_broadcast(&bsem_p->cond);

The single-post path (bsem_post()) is unchanged: one post still wakes
one thread.
All accesses to v remain protected by the semaphore mutex, so no new
data races or lock-order inversions are introduced.

Reproducer added

tests/thpool_destroy_hang.sh
tests/src/thpool_destroy_hang.c

cd tests/
./destroy_hang_deterministic.sh # hangs with old code, passes (<0.2 s) with patch
  • 4000-thread pool hangs 100 % of the time without the patch.
  • 16-thread pool matches the sporadic timeout seen in duplicut.

Impact

Scenario before after
thpool_destroy() – 4000 threads 60 s+ 170 ms
duplicut (16 threads) 1 ‰ hangs (timeout 5) 0

The change makes pool shutdown reliable regardless of the thread count
and keeps behaviour identical in every other respect.

`thpool_destroy()` is supposed to wake every worker, wait until the pool
is empty, and return quickly.
In practice it can block **indefinitely** (or for dozens of seconds) as
soon as the pool size out-grows its one-second “fast-exit” window.

* `bsem_post_all()` sets `v = 1` and broadcasts.
* The **first** worker that wakes executes
  `bsem_wait() → while (v==0) … ; v = 0; …`
  → the single ticket is consumed, all other awakened threads fall
  straight back into `pthread_cond_wait()`.
* After that first second, the destroy code slow-polls:
  one `bsem_post_all()` + `sleep(1)` per loop → **one ticket per
  second**.

With 4000 threads the destructor needs \~4000 s; even with only **16
threads** (real-world `duplicut` run) it can exceed a watchdog such as
`timeout 5`, hanging about 1 ‰ of the executions.
nil0x42 added a commit to nil0x42/duplicut that referenced this pull request May 16, 2025
I also submitted a PR to upstream project, explaining the issue:
    Pithikos/C-Thread-Pool#135
@npc1054657282
Copy link

Good, this solves #134
I try to solve this by remove the all bsem_* functions and use the cond.
This solve change the code little.

ShowOffJonah added a commit to ShowOffJonah/Cutdi that referenced this pull request Sep 19, 2025
I also submitted a PR to upstream project, explaining the issue:
    Pithikos/C-Thread-Pool#135
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants