GCC/release: coroutine engine abort with boost.context ≥ 1.88 #1247

# Release build aborts with "active exception in flight" on GCC + boost.context ≥ 1.88 (manage_exception_state vs userver's `__cxa_get_globals` interposition)

## Summary

On a toolchain with **boost.context ≥ 1.88.0** and **libstdc++** (GCC), `userver-core-unittest`
(RelWithDebInfo) aborts non-deterministically with:

```
Unable to start coroutine engine with an active exception in flight
```

Root cause: boost.context **1.88** introduced `detail::manage_exception_state`, which
**saves/restores `*__cxa_get_globals()`** around every fiber `resume()`/`resume_with()`
(and, since ~1.91, also in `~fiber()`). userver **interposes `__cxa_get_globals`** to keep
per-coroutine C++ exception state. The two mechanisms now both manage the same
`__cxa_eh_globals`, and when the current task context changes inside boost's
save/restore window, the thread's `uncaughtExceptions` counter **underflows to -1**, which
later trips the `std::uncaught_exceptions() != 0` guard in `engine::RunStandalone`.

A secondary problem makes the existing escape hatch (`USERVER_FEATURE_UBOOST_CORO=ON`)
**not actually work**: `engine/coro/marked_allocator.hpp` includes `<boost/coroutine2/...>`
directly instead of going through the `<coroutines/coroutine.hpp>` abstraction, so even
with `USERVER_FEATURE_UBOOST_CORO=ON` the system boost headers (with the real
`manage_exception_state`) are compiled in.

## Environment

- OS: Manjaro Linux (x86_64), kernel 6.18
- Compiler: GCC 16.1.1, libstdc++ 6.0.35
- boost: system 1.91.0 (reproduces on any boost.context ≥ 1.88.0)
- userver: `develop` (release 3.0)
- Build type: RelWithDebInfo (release-only; Debug is unaffected)

## Symptom

The guard in `core/src/engine/run_standalone.cpp`:

```cpp
if (std::uncaught_exceptions() != 0) {
    // We are probably inside a destructor, UINVARIANT would `std::terminate`.
    utils::AbortWithStacktrace("Unable to start coroutine engine with an active exception in flight");
}
```

fires on a "random" test — typically the first `RunStandalone` after a heavy suite. Example:

```
[ RUN      ] NWayLRU.Ctr
Unable to start coroutine engine with an active exception in flight. Stacktrace:
 0# userver::utils::AbortWithStacktrace(std::basic_string_view<...>)
 1# userver::engine::RunStandalone(unsigned long, ...)
 2# userver::utest::impl::DoRunTest(...)
 3# userver::NWayLRU_Ctr_Test::TestBody()
```

A test run alone in its own process passes; the failure appears only after earlier tests in
the same process have already corrupted the counter, which is why it looks flaky.
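
For reference, the guard relies on `std::uncaught_exceptions()` being nonzero only while an exception is propagating. A minimal sketch of that normal behavior, using only the standard library (the `Probe` type and the two helper functions are hypothetical names introduced for illustration, not userver code):

```cpp
#include <exception>
#include <stdexcept>

// Hypothetical helper: records std::uncaught_exceptions() from its destructor.
struct Probe {
    int* out;
    ~Probe() { *out = std::uncaught_exceptions(); }
};

// Returns the counter value observed while the stack is being unwound.
int CountDuringUnwind() {
    int seen = -1;
    try {
        Probe probe{&seen};
        throw std::runtime_error("boom");  // probe is destroyed during unwinding
    } catch (const std::exception&) {
        // Once the handler is entered, the exception no longer counts as uncaught.
    }
    return seen;
}

// Returns the counter value observed on an ordinary scope exit.
int CountOnNormalExit() {
    int seen = -1;
    { Probe probe{&seen}; }
    return seen;
}
```

With balanced throw/catch bookkeeping the probe sees 1 during unwinding and 0 otherwise, so a nonzero value on an idle thread points at corrupted bookkeeping rather than at a real in-flight exception.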

## Root cause (evidence)

### 1. It is an underflow, not a leak

At the abort, on the main (non-coroutine) thread:

```
GetCurrentTaskContextUnchecked() == nullptr      # genuine bare-thread tls_globals path
*(int*)((char*)__cxa_get_globals() + offsetof(__cxa_eh_globals, uncaughtExceptions)) == -1   # 0xffffffff
caughtExceptions == 0
```

So a `__cxa_begin_catch` decremented `uncaughtExceptions` without a matching `__cxa_throw`
increment **on the same `__cxa_eh_globals` instance**.
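
The arithmetic can be sketched with a toy model. This is not the real ABI: `EhGlobalsModel`, `ModelThrow`, `ModelBeginCatch` and `Underflowed` are hypothetical names, and the struct assumes only the two leading fields of libstdc++'s `__cxa_eh_globals` matter here.

```cpp
#include <limits>

// Simplified stand-in for libstdc++'s __cxa_eh_globals (assumption: two leading fields only).
struct EhGlobalsModel {
    void* caught_exceptions = nullptr;
    unsigned int uncaught_exceptions = 0;
};

// Models the bookkeeping done on throw: one more exception in flight.
void ModelThrow(EhGlobalsModel& g) { ++g.uncaught_exceptions; }

// Models the bookkeeping done on entering a handler: one fewer exception in flight.
void ModelBeginCatch(EhGlobalsModel& g) { --g.uncaught_exceptions; }

// True when the unsigned counter has wrapped to 0xffffffff, which reads as -1 through an int.
bool Underflowed(const EhGlobalsModel& g) {
    return g.uncaught_exceptions == std::numeric_limits<unsigned int>::max();
}
```

Calling `ModelThrow` and `ModelBeginCatch` on the same instance returns it to 0. Calling them on two different instances leaves one at 1 and wraps the other to `0xffffffff`, which matches the value observed at the abort.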

### 2. boost.context 1.88 `manage_exception_state`

`/usr/include/boost/context/fiber_fcontext.hpp` (libstdc++ branch):

```cpp
class manage_exception_state {
public:
    manage_exception_state()  { exception_state_ = *__cxa_get_globals(); }   // SAVE
    ~manage_exception_state() { *__cxa_get_globals() = exception_state_; }   // RESTORE
private:
    __cxa_eh_globals exception_state_;
};
```

used on every switch:

```cpp
fiber resume() && {
    detail::manage_exception_state exstate;          // SAVE
    return { detail::jump_fcontext( ... ) };          // switch fiber
}                                                     // RESTORE
```

This assumes that `__cxa_get_globals()` returns **thread-stable** storage, i.e. the same
address at SAVE and at RESTORE.

### 3. userver's interposition makes it non-stable

`core/src/engine/task/cxxabi_eh_globals.cpp` (`USERVER_EHGLOBALS_INTERPOSE`):

```cpp
abi::__cxa_eh_globals* GetGlobals() throw() {
    constinit thread_local EhGlobals tls_globals;
    auto* globals = &tls_globals;
    auto* context = current_task::GetCurrentTaskContextUnchecked();
    if (context) globals = context->GetEhGlobals();   // <-- per-coroutine, changes with context
    return reinterpret_cast<abi::__cxa_eh_globals*>(globals);
}
```

When the current task context changes between boost's SAVE and RESTORE (e.g. a re-entrant
destructor during coroutine teardown — `CoroFunc` even notes "dtors may want to schedule"),
the `__cxa_throw` (+1) and the `__cxa_begin_catch` (-1) of boost's `forced_unwind` land in
**different** `__cxa_eh_globals` instances, leaving the thread counter at -1.

A gdb watchpoint on the main thread's `uncaughtExceptions` shows exactly this: a
`__cxa_begin_catch` from `boost::context::detail::fiber_entry` (`fiber_fcontext.hpp:147`)
takes it `0 -> -1`, after a `forced_unwind` whose throw landed elsewhere.
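
The interaction between a save/restore guard and an accessor whose result can change can be sketched in isolation. This is a toy model, not boost or userver code: `EhGlobalsModel`, `CurrentGlobals`, `SaveRestoreModel` and `RunWindow` are hypothetical names, and the model only reproduces the copy-in/copy-out shape of `manage_exception_state` shown above.

```cpp
// Simplified stand-in for __cxa_eh_globals (assumption: two leading fields only).
struct EhGlobalsModel {
    void* caught_exceptions = nullptr;
    unsigned int uncaught_exceptions = 0;
};

// Hypothetical stand-in for an interposed __cxa_get_globals(): whichever instance is "current".
inline EhGlobalsModel*& CurrentGlobals() {
    static EhGlobalsModel* current = nullptr;
    return current;
}

// Same shape as manage_exception_state: copy out on construction, copy back on destruction.
class SaveRestoreModel {
public:
    SaveRestoreModel() : saved_(*CurrentGlobals()) {}    // SAVE from the instance current now
    ~SaveRestoreModel() { *CurrentGlobals() = saved_; }  // RESTORE into the instance current then
private:
    EhGlobalsModel saved_;
};

// Runs one save/restore window in which the current instance changes mid-window.
void RunWindow(EhGlobalsModel& at_save, EhGlobalsModel& at_restore) {
    CurrentGlobals() = &at_save;
    {
        SaveRestoreModel guard;
        CurrentGlobals() = &at_restore;  // context change inside the window
    }                                    // snapshot of at_save is written into at_restore
    CurrentGlobals() = nullptr;
}
```

When `at_save` and `at_restore` are the same object the guard is a no-op. When they differ, the snapshot taken from one instance overwrites the other, so whatever count the second instance held is lost.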

### 4. Version matrix (verified by rebuilding userver-core-unittest against each)

`manage_exception_state` was introduced in **boost.context 1.88.0** (absent in
1.74/1.83/1.86/1.87, present in 1.88/1.89/1.91).

| boost.context | `manage_exception_state` | Full `userver-core-unittest` (GCC, RelWithDebInfo) |
|---|---|---|
| ≤ 1.87 | none | ✅ exit 0, 1889 passed, 0 abort/segv |
| 1.88.0 | `resume()`/`resume_with()` only | ❌ **deterministic** abort at `MutexDeathTest.SelfDeadlock` (3/3) |
| 1.91 | + also `~fiber()` | ❌ **non-deterministic** abort (~`NWayLRU.Ctr`) |

(`≤1.87` was tested by feeding the build a `fiber_fcontext.hpp` with `manage_exception_state`
forced to the empty dummy struct — the only difference between 1.87 and 1.88.)

## Minimal standalone reproduction

About 40 lines, no userver, just boost.context ≥ 1.88 plus a userver-style `__cxa_get_globals`
interposition. Prints a corrupted `uncaughtExceptions` when the "current context" changes
inside boost's `manage_exception_state` window:

```cpp
#include <cxxabi.h>
#include <cstdio>
#include <cstring>
#include <utility>
#include <boost/context/fiber.hpp>
namespace ctx = boost::context;

struct EhGlobals { void* data[4] = {}; };
thread_local EhGlobals tls_globals;
thread_local EhGlobals* current_ctx = nullptr;
static EhGlobals* CurrentEh() { return current_ctx ? current_ctx : &tls_globals; }

extern "C" {
abi::__cxa_eh_globals* __cxa_get_globals() throw()      { return reinterpret_cast<abi::__cxa_eh_globals*>(CurrentEh()); }
abi::__cxa_eh_globals* __cxa_get_globals_fast() throw() { return reinterpret_cast<abi::__cxa_eh_globals*>(CurrentEh()); }
}
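// libstdc++ layout on x86_64: caughtExceptions pointer (8 bytes), then uncaughtExceptions (4 bytes).
static int Uncaught(const EhGlobals& g) { int v; std::memcpy(&v, (const char*)&g + 8, 4); return v; }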

struct UnwindContextShift { EhGlobals* to; ~UnwindContextShift() { current_ctx = to; } };

int main() {
    static EhGlobals coro_eh;
    {
        ctx::fiber f{[&](ctx::fiber&& m) {
            UnwindContextShift guard{&coro_eh};   // flips current ctx during forced_unwind
            current_ctx = nullptr;
            m = std::move(m).resume();
            return std::move(m);
        }};
        f = std::move(f).resume();                // run to first suspend
        current_ctx = nullptr;
        // fiber destroyed here -> manage_exception_state SAVE / forced_unwind / RESTORE
    }
    std::printf("tls.uncaught=%d coro.uncaught=%d  => %s\n",
                Uncaught(tls_globals), Uncaught(coro_eh),
                (Uncaught(tls_globals)==0 && Uncaught(coro_eh)==0) ? "OK" : "CORRUPTED");
}
```

```
$ g++ -O2 -std=c++17 repro.cpp -o repro -lboost_context && ./repro
tls.uncaught=0 coro.uncaught=-1  => CORRUPTED      # boost >= 1.88
                                                   # (OK on boost <= 1.87)
```

## Why `USERVER_FEATURE_UBOOST_CORO=ON` does not fix it as-is

The vendored `third_party/uboost_coro` already neutralizes `manage_exception_state`
(`uboost_coro/context/fiber_fcontext.hpp:67` is committed as `#if 1 || ...`, i.e. always the
dummy struct — commit `823a03770` "update boost ... to 1.88"). Good.

But `core/src/engine/coro/marked_allocator.hpp` bypasses the
`<coroutines/coroutine.hpp>` abstraction:

```cpp
// core/src/engine/coro/marked_allocator.hpp
#include <boost/coroutine2/protected_fixedsize_stack.hpp>   // <-- direct, not the abstraction
```

Under `USERVER_FEATURE_UBOOST_CORO=ON`, `core/uboost_coro/include` only provides
`coroutines/coroutine.hpp` (→ `uboost_coro/coroutine2/...`); there is no `boost/`-named shim.
So this direct `<boost/coroutine2/...>` falls through to system `/usr/include/boost`
(boost 1.91, real `manage_exception_state`). Since `pool.hpp` pulls the coroutine type via
`marked_allocator.hpp`, the entire coroutine2 template code is compiled against system boost,
and the abort persists. Confirmed by preprocessing:

```
$ g++ <uboost build flags> -E core/src/engine/coro/marked_allocator.hpp | grep fiber_fcontext
# 1 "/usr/include/boost/context/fiber_fcontext.hpp"        # <-- system, not vendored
```

## Proposed fix

Route `marked_allocator.hpp` through the coroutine abstraction so
`USERVER_FEATURE_UBOOST_CORO=ON` truly uses the vendored copy (both `sys_coro` and
`uboost_coro` variants of `<coroutines/coroutine.hpp>` already provide
`boost::coroutines2::protected_fixedsize_stack`):

```diff
 // core/src/engine/coro/marked_allocator.hpp
-#include <boost/coroutine2/protected_fixedsize_stack.hpp>
+#include <coroutines/coroutine.hpp>
```

With this change + `USERVER_FEATURE_UBOOST_CORO=ON`, the full suite is green on GCC 16 /
boost 1.91:

```
[  PASSED  ] 1889 tests        (0 aborts, 0 segv; reproduced across multiple runs)
```

### Notes / open questions

- This makes `USERVER_FEATURE_UBOOST_CORO=ON` a reliable workaround for boost.context ≥ 1.88
  on libstdc++. **System-boost builds (`USERVER_FEATURE_UBOOST_CORO=OFF`) with boost ≥ 1.88
  are still affected** — for those, the real fix is to make userver's interposition
  cooperate with (or stand down for) boost's `manage_exception_state`, or to require
  `USERVER_FEATURE_UBOOST_CORO=ON` / boost ≤ 1.87.
- Removing the interposition entirely and relying solely on boost's `manage_exception_state`
  is **not** sufficient: it preserves the switching thread's globals but not each suspended
  coroutine's own in-flight exception state, and **segfaults** on a normal coroutine resume
  (`pull_control_block_cc.ipp` `c = std::move(c).resume()` from `TaskContext::CoroFunc`).
- clang + libc++ builds are unaffected (they use `USERVER_EHGLOBALS_SWAP`, a different
  mechanism).
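
The affected combinations described in these notes can be summarized as a small predicate. This is only a sketch for reasoning about configurations: `IsAffected` and its parameters are hypothetical and not part of userver, and it assumes the include fix above has been applied so that the vendored copy is really used. The version encoding is the one used by Boost's `BOOST_VERSION` macro.

```cpp
// BOOST_VERSION encodes major * 100000 + minor * 100 + patch, so 1.88.0 is 108800.
constexpr int kFirstAffectedBoost = 108800;

// Hypothetical predicate, not userver API.
//   system_boost_version   - BOOST_VERSION of the system boost headers
//   interposes_eh_globals  - build uses the __cxa_get_globals interposition (libstdc++ path)
//   vendored_uboost_coro   - USERVER_FEATURE_UBOOST_CORO=ON, with the include fix applied
constexpr bool IsAffected(int system_boost_version,
                          bool interposes_eh_globals,
                          bool vendored_uboost_coro) {
    return interposes_eh_globals
        && !vendored_uboost_coro
        && system_boost_version >= kFirstAffectedBoost;
}
```

A configure-time or compile-time check built on the same condition could reject or warn about the system-boost case until the interposition itself is reworked.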

