
Conversation

@jpnurmi jpnurmi commented Sep 29, 2025

When chaining signal handlers in AOT mode, detect whether the .NET runtime converts a signal to a managed exception and transfers execution to the managed exception handler. In this case, Sentry Native should abort crash handling because the exception is caught and handled in managed code.

try
{
    // dereferencing null raises a SIGSEGV, which the AOT runtime converts
    // into a managed NullReferenceException
    var s = default(string);
    var c = s.Length;
}
catch (NullReferenceException exception)
{
    // The exception is caught and handled in managed code. In AOT mode,
    // execution should continue normally, without sentry-native's crash
    // handling kicking in.
}

See also:


Note

Detect .NET runtime converting signals to managed exceptions and skip native crash handling; add JIT/AOT tests and changelog entry.

  • Inproc backend (Linux):
    • Add get_stack_pointer and get_instruction_pointer for multiple architectures to read from ucontext_t.
    • When CHAIN_AT_START, compare IP/SP before/after invoking prior handler; if changed, treat as managed exception and abort native handling (see the sketch below).
  • Tests:
    • Refactor JIT runners (run_jit_*) and add AOT runners (run_aot_*), including AOT publish and execution.
    • Update fixture Program.cs to use null-forgiving s! and conditionally rethrow via args ("managed-exception").
    • Add separate test_aot_signals_inproc and rename JIT test; adjust skip reasons/messages.
  • Changelog:
    • Add Unreleased note: fix AOT interop with managed .NET runtimes.

Written by Cursor Bugbot for commit 060b18b.
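
For reference, a minimal sketch of what the IP/SP check summarized above could look like on Linux. The register accesses are standard glibc ucontext_t fields; invoke_previous_handler() and the surrounding handler code are hypothetical placeholders rather than the actual PR code.

#define _GNU_SOURCE /* for REG_RIP / REG_RSP on x86_64 */
#include <stdint.h>
#include <ucontext.h>

static uintptr_t get_instruction_pointer(const ucontext_t *uc)
{
#if defined(__x86_64__)
    return (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
#elif defined(__aarch64__)
    return (uintptr_t)uc->uc_mcontext.pc;
#else
    return 0; /* architecture not covered by this sketch */
#endif
}

static uintptr_t get_stack_pointer(const ucontext_t *uc)
{
#if defined(__x86_64__)
    return (uintptr_t)uc->uc_mcontext.gregs[REG_RSP];
#elif defined(__aarch64__)
    return (uintptr_t)uc->uc_mcontext.sp;
#else
    return 0;
#endif
}

/* inside the signal handler, with CHAIN_AT_START (illustrative only):
 *
 *   const uintptr_t ip = get_instruction_pointer(uc);
 *   const uintptr_t sp = get_stack_pointer(uc);
 *   invoke_previous_handler(signum, info, uc);
 *   if (ip != get_instruction_pointer(uc) || sp != get_stack_pointer(uc)) {
 *       // the runtime redirected execution to a managed exception handler
 *       return;
 *   }
 */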

@jpnurmi jpnurmi marked this pull request as ready for review September 29, 2025 12:10
@jpnurmi jpnurmi requested a review from supervacuus September 29, 2025 13:53

@supervacuus supervacuus left a comment


Thanks for finding the diff to make the other implementations work, @jpnurmi. Since this deviates significantly, we should either document the change as clearly as possible (for our internal use), making it explicit that we have shifted the focus to the current AOT signal/exception interface, or adapt the tests to cover the relevant area.

Co-authored-by: Mischan Toosarani-Hausberger <[email protected]>
@jpnurmi jpnurmi changed the title from "fix: interop with managed .NET runtimes" to "fix: AOT interop with managed .NET runtimes" Sep 30, 2025

@supervacuus supervacuus left a comment


I wonder if we can isolate the SIGABRT in case of an unhandled exception on AOT/Mono as well.

        SENTRY_DEBUG("runtime converted the signal to a managed "
                     "exception, we do not handle the signal");
        return;
    }
Collaborator

This is absolutely correct, but the only side-effect currently visible is for the logging toggle. Similar to how we "leave" the signal handler before chaining, we must also re-enable logging immediately after "leaving" and disable it again before re-entering, because if it were a managed code exception, we want logging to remain enabled.

We can also move the entire sig_slot assignment down below the CHAIN_AT_START code, to make the path dependencies more obvious.

However, I think both have a lower priority than figuring out the signaling sequence of both runtimes and how we can align them.
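
A rough, purely illustrative sketch of the suggested ordering; every helper name below is hypothetical and stands in for the real sentry-native internals.

#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t g_logging_enabled = 1;

static void enable_safe_logging(void) { g_logging_enabled = 1; }
static void disable_safe_logging(void) { g_logging_enabled = 0; }

/* placeholders for the real enter/leave/chaining/detection logic */
static void leave_signal_handler(void) {}
static void enter_signal_handler(void) {}
static void invoke_previous_handler(int sig, siginfo_t *info, void *uc)
{
    (void)sig; (void)info; (void)uc;
}
static bool runtime_handled_signal(const void *uc) { (void)uc; return false; }

static void handle_signal(int signum, siginfo_t *info, void *ucontext)
{
    disable_safe_logging();     /* no logging while we own the signal */

    /* "leave" before chaining, and re-enable logging: if the runtime turns
     * the signal into a managed exception and resumes, logging stays on */
    leave_signal_handler();
    enable_safe_logging();
    invoke_previous_handler(signum, info, ucontext);

    if (runtime_handled_signal(ucontext)) {
        return;                 /* managed exception; nothing left to do */
    }

    /* it really is a native crash: disable logging again and re-enter */
    disable_safe_logging();
    enter_signal_handler();
    /* ... crash handling ... */
}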


jpnurmi commented Oct 1, 2025

I'm trying to fix the scenario where Mono's signal handler detects a managed exception, modifies the context to transfer execution to a managed exception handler, and then returns execution to Sentry Native's signal handler. In this case, Sentry Native needs to detect that Mono wanted execution to continue, and abort crash handling.

In case of a real native crash, though, if we invoke Mono's signal handler first and Mono's native crash handling decides to call _exit(), then Sentry Native misses the crash. 🙁


supervacuus commented Oct 1, 2025

I'm trying to fix the scenario where Mono's signal handler detects a managed exception, modifies the context to transfer execution to a managed exception handler, and then returns execution to Sentry Native's signal handler. In this case, Sentry Native needs to detect that Mono wanted execution to continue, and abort crash handling.

Isn't that what you're trying to do here all along? Or is there yet another difference when you use pure Mono?

In case of a real native crash, though, if we invoke Mono's signal handler first and Mono's native crash handling decides to call _exit(), then Sentry Native misses the crash. 🙁

Were you able to observe this? Because this only happens when crash_chaining is disabled. I cannot imagine that crash or signal chaining is off by default (especially not on Android or Linux).

@supervacuus

Isn't that what you're trying to do here all along? Or is there yet another difference when you use pure Mono?

Btw, if it is the latter, then this is also the reason why I suggested that CLR JIT support can be dropped altogether. When I started this project (which was over a year ago), the primary goal was to determine how much the handler interaction between the various runtime implementations converges. I started with CLR JIT as a baseline. However, if we primarily have downstream usage for another implementation that diverges entirely in signal handling, then we can either drop the current implementation or add another handler strategy.


jpnurmi commented Oct 1, 2025

Were you able to observe this? Because this only happens when crash_chaining is disabled. I cannot imagine that crash or signal chaining is off by default (especially not on Android or Linux).

I tried creating a test case using mcs + mono --aot on Linux. Mono's native crash reporter kicks in when we call Mono's signal handler for a native crash, and execution ends there...

=================================================================
        Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

=================================================================
        Native stacktrace:
=================================================================
        0x62c7d67295fa - mono :
        0x62c7d66c7e8a - mono :
        0x62c7d671cad0 - mono :
        0x728ba68491a6 - /tmp/pytest-of-jpnurmi/pytest-55/cmake0/libcrash.so : native_crash
        0x40961618 - Unknown

=================================================================
        Native Crash Reporting
=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

=================================================================
        Native stacktrace:
=================================================================
        0x62c7d67295fa - mono :
        0x62c7d66c7e8a - mono :
        0x62c7d671cad0 - mono :
        0x728ba68491a6 - /tmp/pytest-of-jpnurmi/pytest-55/cmake0/libcrash.so : native_crash
        0x40961618 - Unknown

=================================================================
        Telemetry Dumper:
=================================================================
Pkilling 0x125944111036096x from 0x125944120812288x
Entering thread summarizer pause from 0x125944120812288x
Finished thread summarizer pause from 0x125944120812288x.
Failed to create breadcrumb file (null)/crash_hash_0x3652010b5

Waiting for dumping threads to resume

=================================================================
        Basic Fault Address Reporting
=================================================================
Memory around native instruction pointer (0x728ba68491a6):
0x728ba6849196  ff ff ff f3 0f 1e fa 55 48 89 e5 b8 0a 00 00 00  .......UH.......
0x728ba68491a6  c7 00 64 00 00 00 90 5d c3 f3 0f 1e fa 55 48 89  ..d....].....UH.
0x728ba68491b6  e5 48 83 ec 30 64 48 8b 04 25 28 00 00 00 48 89  .H..0dH..%(...H.
0x728ba68491c6  45 f8 31 c0 48 c7 45 d8 00 40 00 00 48 8b 45 d8  E.1.H.E..@..H.E.

=================================================================
        Managed Stacktrace:
=================================================================
          at <unknown> <0xffffffff>
          at dotnet_signal.Program:native_crash <0x000a7>
          at dotnet_signal.Program:Main <0x000e8>
          at <Module>:runtime_invoke_void_object <0x00091>
=================================================================


jpnurmi commented Oct 1, 2025

Were you able to observe this? Because this only happens when crash_chaining is disabled. I cannot imagine that crash or signal chaining is off by default (especially not on Android or Linux).

I tried creating a test case using mcs + mono --aot on Linux. Mono's native crash reporter kicks in when we call Mono's signal handler for a native crash, and execution ends there...

No wait, it's the newly added IP/SP check that prevents native crash handling, too. 🤦 How the heck do we distinguish between these.......

@supervacuus

No wait, it's the newly added IP/SP check that prevents native crash handling, too. 🤦 How the heck do we distinguish between these.......

I was wary of checking ucontext modifications along the signal chain. I didn't have the time to review the implementation, but I can.


supervacuus commented Oct 1, 2025

No wait, it's the newly added IP/SP check that prevents native crash handling, too. 🤦 How the heck do we distinguish between these.......

Try, as a first step, to switch the order of the handler chain for Mono (and drop your current IP/SP check or even the CHAIN_AT_START strategy entirely). The way it seems to be operating makes more sense if their handlers get installed last. In "old" Mono, there were managed-language-side functions that could (un)install signal handlers at specific points (which could be controlled from sentry-dotnet around the native SDK initialization) to control the chain being:

DFL <- Native SDK <- mono handler

rather than what we have now:

DFL <- mono handler <- Native SDK

Not sure if they are still exposed in the dotnet/runtime mono fork, but we can certainly try to achieve something similar. Then we would have their handler first and might not need an alternative strategy inside our handler; maybe not even for CLR (but one step at a time).
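
For context, a generic sketch of how installation order determines such a chain on Linux; this is not the actual sentry-native or Mono code, just the standard sigaction() pattern where whoever installs last runs first and decides whether to forward to the previously installed handler.

#include <signal.h>
#include <string.h>

static struct sigaction g_previous;

static void chaining_handler(int signum, siginfo_t *info, void *ucontext)
{
    /* ... this handler's own logic would run here ... */

    /* forward to whichever handler was installed before ours */
    if (g_previous.sa_flags & SA_SIGINFO) {
        if (g_previous.sa_sigaction) {
            g_previous.sa_sigaction(signum, info, ucontext);
        }
    } else if (g_previous.sa_handler == SIG_DFL) {
        /* restore the default action and re-raise so it runs on return */
        sigaction(signum, &g_previous, NULL);
        raise(signum);
    } else if (g_previous.sa_handler != SIG_IGN) {
        g_previous.sa_handler(signum);
    }
}

static void install_chaining_handler(int signum)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
    sa.sa_sigaction = chaining_handler;
    /* the previously installed handler (e.g. the runtime's) ends up in
     * g_previous; installing later therefore means running earlier */
    sigaction(signum, &sa, &g_previous);
}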

@jpnurmi jpnurmi marked this pull request as draft October 1, 2025 12:36

jpnurmi commented Oct 2, 2025

Swapping the order of the signal handlers would work. I was able to confirm the theory on Linux, even though I had to patch Mono to either

to make it possible to swap the order in either managed or native code, respectively. However, that's just Linux, which is not relevant for Sentry .NET on Android or iOS. The problem is that there's no such type as Mono.Runtime on either Android or iOS... 🤔

@supervacuus

to make it possible to swap the order in either managed or native code, respectively. However, that's just Linux, which is not relevant for Sentry .NET on Android or iOS. The problem is that there's no such type as Mono.Runtime on either Android or iOS...

We should do this in the Native SDK, similar to how we can change the invocation sequence in the handler; we can construct the signal chain up to a point during the setup of the signal handlers (rather than at signal-handling time). I can follow up on this topic next week.

jpnurmi added a commit to getsentry/sentry-dotnet that referenced this pull request Oct 21, 2025

jpnurmi commented Oct 21, 2025

The main purpose of debugging on Linux was just to understand Mono's behavior. 🙂

Anyway, sentry-dotnet has new integration tests for Android:

I have also temporarily hacked sentry-dotnet's build system to pick sentry-android from a local Maven repo instead of downloading from remotes:

This assumes locally built and published (gradlew publishToMavenLocal) versions of both:

Furthermore, I reverted this old change and switched sentry-dotnet back to CHAIN_AT_START for testing purposes:

With local builds and all above changes temporarily combined in sentry-dotnet's jpnurmi/android-chain-at-start branch, I can confirm that this PR fixes the NullReferenceException test case while the CrashType.Native test case still passes on both arm64 and x86_64.

@jpnurmi jpnurmi marked this pull request as ready for review October 21, 2025 09:38
They are irrelevant for sentry-dotnet or .NET on Android, and there
are no tests checking if they even work.


@supervacuus supervacuus left a comment


Happy to see that the adaptation works downstream ❤️

Please move the log flush (Flush logs in a crash-safe...) below the point where we call sentry__page_allocator_enable() but outside the #ifdef because that should happen on all platforms. There is no reason to flush the logs if we don't know there is a terminal signal to handle, and it also requires the allocator, which isn't safe before we enable the page-allocator.
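
A rough sketch of the requested ordering; only sentry__page_allocator_enable() is taken from the comment above, while the guard macro and the flush helper are placeholders.

/* illustrative only: ordering inside the terminal-signal path */
void sentry__page_allocator_enable(void);
static void flush_logs_crash_safe(void) { /* placeholder */ }

static void on_terminal_signal(void)
{
#ifdef SENTRY_PLATFORM_LINUX /* placeholder guard */
    /* crash-safe allocations are only possible from this point on */
    sentry__page_allocator_enable();
#endif

    /* flush logs only now: we know there is a terminal signal to handle and
     * the allocator is ready; kept outside the #ifdef so it runs on all
     * platforms */
    flush_logs_crash_safe();
}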

Otherwise, I would either document why we hide the behavior for managed exceptions that aren't handled in C# code now (since this has a more limited scope than "all managed exceptions" and essentially hides behavior) or adapt the test handling accordingly.


jpnurmi commented Oct 22, 2025

P.S. These tests were only executed in Release mode, but I've prepared a patch to make them execute in both Debug and Release:

For what it's worth, they did seem to pass locally on both x86_64 and arm64, even though the NullReferenceException leakage only occurs in Release-optimized code. It's good to have Debug mode integration tests, nevertheless, to make sure we capture native crashes as expected, even with the IP/SP check in place.

@supervacuus

P.S. These tests were only executed in Release mode

Yes, I have seen this. Release is more critical because that is where most problems appear. However, it is sensible to track the behavior on Debug too, so we don't surprise users with changing behavior during development.

but I've prepared a patch to make them execute in both Debug and Release:

Perfection 💯

It's good to have Debug mode integration tests, nevertheless, to make sure we capture native crashes as expected, even with the IP/SP check in place.

Agreed, and also because we want to track changes in behavior or have feedback when we add or extend the test dimensions.

@supervacuus

The only thing left now is to add the unhandled-managed-exception run to the AOT test case.


jpnurmi commented Oct 22, 2025

The only thing left now is to add the unhandled-managed-exception run to the AOT test case.

Well, this is interesting. Unwinding the chained SIGABRT on Linux crashes in backtrace when called as a fallback from here:

    // if unwinding from a ucontext didn't yield any results, try again with a
    // direct unwind. this is most likely the case when using `libbacktrace`,
    // since that does not allow to unwind from a ucontext at all.
    if (!frame_count) {
        frame_count = sentry_unwind_stack(NULL, &backtrace[0], MAX_FRAMES);
    }

Not sure how to debug this, because running in a debugger changes the behavior. 🤨


jpnurmi commented Oct 22, 2025

Not sure if skipping this fallback in case of chained signal handlers is the right thing to do, but it helps avoid the crash...
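
For illustration, the skip could look roughly like this; handler_chaining_enabled is a hypothetical flag for this sketch, not an existing sentry-native variable.

    // if unwinding from a ucontext didn't yield any results, try again with a
    // direct unwind -- but skip the fallback when signal-handler chaining is
    // in use, where it has been seen to crash and would mostly describe the
    // handler stack rather than the crash site anyway
    if (!frame_count && !handler_chaining_enabled) {
        frame_count = sentry_unwind_stack(NULL, &backtrace[0], MAX_FRAMES);
    }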


codecov bot commented Oct 22, 2025

Codecov Report

❌ Patch coverage is 67.85714% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.35%. Comparing base (516c150) to head (ac6877e).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1392      +/-   ##
==========================================
- Coverage   83.48%   83.35%   -0.14%     
==========================================
  Files          58       58              
  Lines        9648     9660      +12     
  Branches     1511     1512       +1     
==========================================
- Hits         8055     8052       -3     
- Misses       1439     1451      +12     
- Partials      154      157       +3     


@supervacuus supervacuus left a comment


It is fair to exclude the non-user-context backtrace fallback. It is questionable whether that stack trace, had it not crashed, would provide any helpful information. I would add a comment explaining why you skip it there.


jpnurmi commented Oct 27, 2025

Looking forward to making this available as an opt-in for starters:

Thanks so much for the help and guidance! ❤️

@jpnurmi jpnurmi merged commit 9895a5c into master Oct 27, 2025
41 checks passed
@jpnurmi jpnurmi deleted the fix/dotnet-interop branch October 27, 2025 16:24