-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make longjmp over managed frames work #111259
Make longjmp over managed frames work #111259
Conversation
The new exception handling has broken hosting Lua on Windows when a Lua code throws an exception that ends up crossing managed frames. Lua uses longjmp for its exception handling mechanism and the new exception handling reported the longjmp as an unhandled `SEHException`. This change makes the new EH ignore the `STATUS_LONGJUMP` exception when it passes through managed frames, so it can reach the target location. It is a best effort fix though, as finallys in the skipped frames won't be called. Close dotnet#111242
src/coreclr/vm/exceptionhandling.cpp
Outdated
@@ -932,6 +932,13 @@ ProcessCLRExceptionNew(IN PEXCEPTION_RECORD pExceptionRecord, | |||
|
|||
Thread* pThread = GetThread(); | |||
|
|||
if (pExceptionRecord->ExceptionCode == STATUS_LONGJUMP) | |||
{ | |||
// This is an exception used to unwind during longjmp function. We just let it pass through managed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to unwind Frames to avoid corrupting runtime data structures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if we are unwinding Frames, is it that hard to call finallys too to match the old behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had it implemented by propagating the exception through the managed frames as usual, so the finallys were called, and then re-issued the longjmp after processing them. The only missing part to do was preventing calling catch handlers. The old EH doesn't call those "naturally", as the longjmp runs only the unwind pass.
And then you've mentioned it was not supported, so I have deleted that code :-).
But anyways, I can change it back to that, the change wasn't complicated. The only thing is that the Unix handling would then be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing is that the Unix handling would then be different.
longjump over managed code does not and cannot work well on Unix, so I do not think it is a problem
If we want to "support this", should we add a test? |
34375a6
to
b2659c3
Compare
b2659c3
to
ec133dc
Compare
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run runtime-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
It may be worth noting that this is best effort for backward compatibility. Skipping over managed frames using setjmp/longjmp is unsupported. |
@@ -8151,6 +8192,13 @@ extern "C" BOOL QCALLTYPE EHEnumNext(EH_CLAUSE_ENUMERATOR* pEHEnum, RhEHClause* | |||
{ | |||
result = FALSE; | |||
} | |||
#ifdef HOST_WINDOWS | |||
// When processing longjmp, only finally clauses are considered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one of the reasons for not supporting setjmp/longjmp. setjmp/longjmp behaves like uncatchable exception - managed code is not generally robust against uncatchable exceptions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I completely agree. The intent is really just to fix the regression on Windows w.r.t. the legacy EH. It is a best effort and not something people should use in new code. I hope that the fact that it doesn't work on Linux is enough reason for trying to stay away from using longjmp over managed code.
LGTM otherwise |
I've added a comment to the code and updated the PR description to be explicit about this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
I now see the test added here failing on native AOT. @janvorli should we track a fix for native AOT as well or should I just disable the test against a "we're never going to fix this" catchall issue? |
@MichalStrehovsky this fix was made just for backward compatibility with the old exception handling, we don't want to fix it for NativeAOT. I am sorry for forgetting disabling the test for NativeAOT. |
Will this fix make it to the .NET 9 releases? We need it for Lua support in Garnet to handle unknown commands properly. |
@badrishc It is hard to justify backporting this fix to .NET 9. setjmp/longjmp interop is not supported by .NET: https://learn.microsoft.com/en-us/dotnet/standard/native-interop/exceptions-interoperability#setjmplongjmp-behaviors . This fix does not make it work or supported. It just makes it about as broken as it was, on Windows only. It was and it is still completely broken on Linux. It is not possible to fix that on non-Windows as described in the linked doc. We had a lot of debate whether to bother with fixing this on Windows at all. My understanding is that https://github.com/microsoft/Garnet supports both Windows and non-Windows platforms. Is that correct? If it is the case, you will need to avoid depending on setjmp/longjmp interop to make the Lua integration work on non-Windows platforms. |
@jkotas / @janvorli I'm trying to get a sense of what the old behavior actually was here, as in my testing Linux seems... fine? I assume with some clever try/catch/finally stuff or throwing exceptions across native frames all hell breaks loose, but just skipping managed frames appears to have worked fine on Linux in .NET 8 (and still does on .NET 9). Eliding a bit of detail, but basically with... Lua.PushCFunction((nint)&SomeManagedFunction);
Lua.PCall(0, 0); // this is the bit that calls setjmp
// ...
[UnmanagedCallersOnly(CallConvs = [typeof(CallConvCdecl)])]
internal static unsafe int SomeManagedFunction(nint luaStateHandle)
{
if(SomeCheck())
{
Lua.LuaError(); // this bit calls longjmp
}
else
{
// do some other things that might _indirectly_ call lonjmp
}
return 0;
} Seems to work fine on Linux in all permutations I've tried (w and w/o try/catch/finally, .NET 8 and .NET 9) provided I don't let an exception bubble up out of |
The runtime keeps track of the managed/native and native/managed transitions. The transitions are structures allocated on the stack and the runtime Thread data structure has pointer (
|
Well, that's fffffffffff-un 🫠. I'll spend some more time trying to come up with a design that lets Garnet embed Lua on Linux w/o the possibility of |
Thanks for the details and context! I want to mention that beyond Garnet, there is fairly wide production use of Lua across Microsoft and outside, and often via the NLua C# library that interops into Lua, and often on Windows. This is also evidenced by @stephentoub's comment and PR into NLua last year. Since Lua uses |
The proper reliable way to implement NET interop with code based on setjmp/longjmp is by building C wrappers that deals with setjmp/longjmp and prevents setjmp/longjmp leaking into .NET code. It is very similar to interop with C++ exception handling - it requires C wrappers that prevent C++ exceptions leaking into .NET code. |
The new exception handling has broken hosting Lua on Windows when a Lua code throws an exception that ends up crossing managed frames. Lua uses longjmp for its exception handling mechanism and the new exception handling reported the longjmp as an unhandled
SEHException
.This change makes the new EH behave like the legacy one w.r.t. the longjmp propagation over managed frames on Windows. It is a best effort fix though just to get rid of the regression related to the legacy exception handling. New code should not rely on this. And it doesn't work on non-windows targets (neither did it work with the legacy exception handling).
Close #111242