GCC/release: coroutine engine abort with boost.context ≥ 1.88 #1247

@isaevii

Release build aborts with "active exception in flight" on GCC + boost.context ≥ 1.88 (manage_exception_state vs userver's __cxa_get_globals interposition)

Summary

On a toolchain with boost.context ≥ 1.88.0 and libstdc++ (GCC), userver-core-unittest
(RelWithDebInfo) aborts non-deterministically with:

Unable to start coroutine engine with an active exception in flight

Root cause: boost.context 1.88 introduced detail::manage_exception_state, which
saves/restores *__cxa_get_globals() around every fiber resume()/resume_with()
(and, since ~1.91, also in ~fiber()). userver interposes __cxa_get_globals to keep
per-coroutine C++ exception state. The two mechanisms now both manage the same
__cxa_eh_globals, and when the current task context changes inside boost's
save/restore window, the thread's uncaughtExceptions counter underflows to -1, which
later trips the std::uncaught_exceptions() != 0 guard in engine::RunStandalone.

A secondary problem makes the existing escape hatch (USERVER_FEATURE_UBOOST_CORO=ON)
not actually work: engine/coro/marked_allocator.hpp includes <boost/coroutine2/...>
directly instead of going through the <coroutines/coroutine.hpp> abstraction, so even
with USERVER_FEATURE_UBOOST_CORO=ON the system boost headers (with the real
manage_exception_state) are compiled in.

Environment

  • OS: Manjaro Linux (x86_64), kernel 6.18
  • Compiler: GCC 16.1.1, libstdc++ 6.0.35
  • boost: system 1.91.0 (reproduces on any boost.context ≥ 1.88.0)
  • userver: develop (release 3.0)
  • Build type: RelWithDebInfo (release-only; Debug is unaffected)

Symptom

The guard in core/src/engine/run_standalone.cpp:

if (std::uncaught_exceptions() != 0) {
    // We are probably inside a destructor, UINVARIANT would `std::terminate`.
    utils::AbortWithStacktrace("Unable to start coroutine engine with an active exception in flight");
}

fires on a "random" test — typically the first RunStandalone after a heavy suite. Example:

[ RUN      ] NWayLRU.Ctr
Unable to start coroutine engine with an active exception in flight. Stacktrace:
 0# userver::utils::AbortWithStacktrace(std::basic_string_view<...>)
 1# userver::engine::RunStandalone(unsigned long, ...)
 2# userver::utest::impl::DoRunTest(...)
 3# userver::NWayLRU_Ctr_Test::TestBody()

A process that runs a single test passes; the failure appears only after many tests have
run in the same process, which is why it looks flaky.

Root cause (evidence)

1. It is an underflow, not a leak

At the abort, on the main (non-coroutine) thread:

GetCurrentTaskContextUnchecked() == nullptr      # genuine bare-thread tls_globals path
*(int*)(__cxa_get_globals() + offsetof(uncaughtExceptions)) == -1   # 0xffffffff
caughtExceptions == 0

So a __cxa_begin_catch decremented uncaughtExceptions without a matching __cxa_throw
increment on the same __cxa_eh_globals instance.

2. boost.context 1.88 manage_exception_state

/usr/include/boost/context/fiber_fcontext.hpp (libstdc++ branch):

class manage_exception_state {
public:
    manage_exception_state()  { exception_state_ = *__cxa_get_globals(); }   // SAVE
    ~manage_exception_state() { *__cxa_get_globals() = exception_state_; }   // RESTORE
private:
    __cxa_eh_globals exception_state_;
};

used on every switch:

fiber resume() && {
    detail::manage_exception_state exstate;          // SAVE
    return { detail::jump_fcontext( ... ) };          // switch fiber
}                                                     // RESTORE

This assumes that __cxa_get_globals() returns the same storage for the whole SAVE/RESTORE
window, i.e. that the storage is stable per thread.

3. userver's interposition makes it non-stable

core/src/engine/task/cxxabi_eh_globals.cpp (USERVER_EHGLOBALS_INTERPOSE):

abi::__cxa_eh_globals* GetGlobals() throw() {
    constinit thread_local EhGlobals tls_globals;
    auto* globals = &tls_globals;
    auto* context = current_task::GetCurrentTaskContextUnchecked();
    if (context) globals = context->GetEhGlobals();   // <-- per-coroutine, changes with context
    return reinterpret_cast<abi::__cxa_eh_globals*>(globals);
}

When the current task context changes between boost's SAVE and RESTORE (e.g. a re-entrant
destructor during coroutine teardown — CoroFunc even notes "dtors may want to schedule"),
the __cxa_throw (+1) and the __cxa_begin_catch (-1) of boost's forced_unwind land in
different __cxa_eh_globals instances, leaving the thread counter at -1.

gdb watchpoint on the main thread's uncaughtExceptions shows exactly this — a
__cxa_begin_catch from boost::context::detail::fiber_entry (fiber_fcontext.hpp:147)
taking it 0 -> -1, after a forced_unwind whose throw landed elsewhere.

4. Version matrix (verified by rebuilding userver-core-unittest against each)

manage_exception_state was introduced in boost.context 1.88.0 (absent in
1.74/1.83/1.86/1.87, present in 1.88/1.89/1.91).

boost.context   manage_exception_state        Full userver-core-unittest (GCC, RelWithDebInfo)
≤ 1.87          none                          ✅ exit 0, 1889 passed, 0 abort/segv
1.88.0          resume()/resume_with() only   deterministic abort at MutexDeathTest.SelfDeadlock (3/3)
1.91            + also ~fiber()               non-deterministic abort (~NWayLRU.Ctr)

(≤1.87 was tested by feeding the build a fiber_fcontext.hpp with manage_exception_state
forced to the empty dummy struct — the only difference between 1.87 and 1.88.)

Minimal standalone reproduction

~70 lines, no userver, just boost.context ≥ 1.88 + a userver-style __cxa_get_globals
interposition. Prints a corrupted uncaughtExceptions when the "current context" changes
inside boost's manage_exception_state window:

#include <cxxabi.h>
#include <cstdio>
#include <cstring>
#include <utility>
#include <boost/context/fiber.hpp>
namespace ctx = boost::context;

struct EhGlobals { void* data[4] = {}; };
thread_local EhGlobals tls_globals;
thread_local EhGlobals* current_ctx = nullptr;
static EhGlobals* CurrentEh() { return current_ctx ? current_ctx : &tls_globals; }

extern "C" {
abi::__cxa_eh_globals* __cxa_get_globals() throw()      { return reinterpret_cast<abi::__cxa_eh_globals*>(CurrentEh()); }
abi::__cxa_eh_globals* __cxa_get_globals_fast() throw() { return reinterpret_cast<abi::__cxa_eh_globals*>(CurrentEh()); }
}
static int Uncaught(const EhGlobals& g) { int v; std::memcpy(&v, (const char*)&g + 8, 4); return v; }

struct UnwindContextShift { EhGlobals* to; ~UnwindContextShift() { current_ctx = to; } };

int main() {
    static EhGlobals coro_eh;
    {
        ctx::fiber f{[&](ctx::fiber&& m) {
            UnwindContextShift guard{&coro_eh};   // flips current ctx during forced_unwind
            current_ctx = nullptr;
            m = std::move(m).resume();
            return std::move(m);
        }};
        f = std::move(f).resume();                // run to first suspend
        current_ctx = nullptr;
        // fiber destroyed here -> manage_exception_state SAVE / forced_unwind / RESTORE
    }
    std::printf("tls.uncaught=%d coro.uncaught=%d  => %s\n",
                Uncaught(tls_globals), Uncaught(coro_eh),
                (Uncaught(tls_globals)==0 && Uncaught(coro_eh)==0) ? "OK" : "CORRUPTED");
}

$ g++ -O2 -std=c++17 repro.cpp -o repro -lboost_context && ./repro
tls.uncaught=0 coro.uncaught=-1  => CORRUPTED      # boost >= 1.88
                                                   # (OK on boost <= 1.87)

Why USERVER_FEATURE_UBOOST_CORO=ON does not fix it as-is

The vendored third_party/uboost_coro already neutralizes manage_exception_state
(uboost_coro/context/fiber_fcontext.hpp:67 is committed as #if 1 || ..., i.e. always the
dummy struct — commit 823a03770 "update boost ... to 1.88"). Good.

But core/src/engine/coro/marked_allocator.hpp bypasses the
<coroutines/coroutine.hpp> abstraction:

// core/src/engine/coro/marked_allocator.hpp
#include <boost/coroutine2/protected_fixedsize_stack.hpp>   // <-- direct, not the abstraction

Under USERVER_FEATURE_UBOOST_CORO=ON, core/uboost_coro/include only provides
coroutines/coroutine.hpp (→ uboost_coro/coroutine2/...); there is no boost/-named shim.
So this direct <boost/coroutine2/...> falls through to system /usr/include/boost
(boost 1.91, real manage_exception_state). Since pool.hpp pulls the coroutine type via
marked_allocator.hpp, the entire coroutine2 template code is compiled against system boost,
and the abort persists. Confirmed by preprocessing:

$ g++ <uboost build flags> -E core/src/engine/coro/marked_allocator.hpp | grep fiber_fcontext
# 1 "/usr/include/boost/context/fiber_fcontext.hpp"        # <-- system, not vendored

Proposed fix

Route marked_allocator.hpp through the coroutine abstraction so
USERVER_FEATURE_UBOOST_CORO=ON truly uses the vendored copy (both sys_coro and
uboost_coro variants of <coroutines/coroutine.hpp> already provide
boost::coroutines2::protected_fixedsize_stack):

 // core/src/engine/coro/marked_allocator.hpp
-#include <boost/coroutine2/protected_fixedsize_stack.hpp>
+#include <coroutines/coroutine.hpp>

With this change + USERVER_FEATURE_UBOOST_CORO=ON, the full suite is green on GCC 16 /
boost 1.91:

[  PASSED  ] 1889 tests        (0 aborts, 0 segv; reproduced across multiple runs)

Notes / open questions

  • This makes USERVER_FEATURE_UBOOST_CORO=ON a reliable workaround for boost.context ≥ 1.88
    on libstdc++. System-boost builds (USERVER_FEATURE_UBOOST_CORO=OFF) with boost ≥ 1.88
    are still affected. For those, the real fix is to make userver's interposition
    cooperate with (or stand down for) boost's manage_exception_state, or to require
    USERVER_FEATURE_UBOOST_CORO=ON / boost ≤ 1.87.
  • Removing the interposition entirely and relying solely on boost's manage_exception_state
    is not sufficient: it preserves the switching thread's globals but not each suspended
    coroutine's own in-flight exception state, and it segfaults on a normal coroutine resume
    (at c = std::move(c).resume() in pull_control_block_cc.ipp, reached from
    TaskContext::CoroFunc).
  • clang + libc++ builds are unaffected (they use USERVER_EHGLOBALS_SWAP, a different
    mechanism).
