Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make longjmp over managed frames work #111259

Merged
merged 8 commits into from
Jan 21, 2025

Conversation

janvorli
Copy link
Member

@janvorli janvorli commented Jan 10, 2025

The new exception handling has broken hosting Lua on Windows when a Lua code throws an exception that ends up crossing managed frames. Lua uses longjmp for its exception handling mechanism and the new exception handling reported the longjmp as an unhandled SEHException.

This change makes the new EH behave like the legacy one w.r.t. the longjmp propagation over managed frames on Windows. It is a best effort fix though just to get rid of the regression related to the legacy exception handling. New code should not rely on this. And it doesn't work on non-windows targets (neither did it work with the legacy exception handling).

Close #111242

The new exception handling has broken hosting Lua on Windows when
a Lua code throws an exception that ends up crossing managed frames.
Lua uses longjmp for its exception handling mechanism and the new
exception handling reported the longjmp as an unhandled `SEHException`.

This change makes the new EH ignore the `STATUS_LONGJUMP` exception
when it passes through managed frames, so it can reach the target
location. It is a best effort fix though, as finallys in the skipped
frames won't be called.

Close dotnet#111242
@janvorli janvorli added this to the 9.0.x milestone Jan 10, 2025
@janvorli janvorli requested a review from jkotas January 10, 2025 00:41
@janvorli janvorli self-assigned this Jan 10, 2025
@@ -932,6 +932,13 @@ ProcessCLRExceptionNew(IN PEXCEPTION_RECORD pExceptionRecord,

Thread* pThread = GetThread();

if (pExceptionRecord->ExceptionCode == STATUS_LONGJUMP)
{
// This is an exception used to unwind during longjmp function. We just let it pass through managed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to unwind Frames to avoid corrupting runtime data structures?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if we are unwinding Frames, is it that hard to call finallys too to match the old behavior?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had it implemented by propagating the exception through the managed frames as usual, so the finallys were called, and then re-issued the longjmp after processing them. The only missing part to do was preventing calling catch handlers. The old EH doesn't call those "naturally", as the longjmp runs only the unwind pass.
And then you've mentioned it was not supported, so I have deleted that code :-).
But anyways, I can change it back to that, the change wasn't complicated. The only thing is that the Unix handling would then be different.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing is that the Unix handling would then be different.

longjump over managed code does not and cannot work well on Unix, so I do not think it is a problem

@jkotas
Copy link
Member

jkotas commented Jan 10, 2025

If we want to "support this", should we add a test?

@janvorli janvorli force-pushed the fix-longjmp-over-managed-frames branch from 34375a6 to b2659c3 Compare January 17, 2025 22:03
@janvorli janvorli force-pushed the fix-longjmp-over-managed-frames branch from b2659c3 to ec133dc Compare January 18, 2025 01:38
@janvorli
Copy link
Member Author

/azp run runtime-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@janvorli
Copy link
Member Author

/azp run runtime-coreclr outerloop

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jkotas
Copy link
Member

jkotas commented Jan 21, 2025

It may be worth noting that this is best effort for backward compatibility. Skipping over managed frames using setjmp/longjmp is unsupported.

@@ -8151,6 +8192,13 @@ extern "C" BOOL QCALLTYPE EHEnumNext(EH_CLAUSE_ENUMERATOR* pEHEnum, RhEHClause*
{
result = FALSE;
}
#ifdef HOST_WINDOWS
// When processing longjmp, only finally clauses are considered.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one of the reasons for not supporting setjmp/longjmp. setjmp/longjmp behaves like uncatchable exception - managed code is not generally robust against uncatchable exceptions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely agree. The intent is really just to fix the regression on Windows w.r.t. the legacy EH. It is a best effort and not something people should use in new code. I hope that the fact that it doesn't work on Linux is enough reason for trying to stay away from using longjmp over managed code.

@jkotas
Copy link
Member

jkotas commented Jan 21, 2025

LGTM otherwise

@janvorli
Copy link
Member Author

It may be worth noting that this is best effort for backward compatibility. Skipping over managed frames using setjmp/longjmp is unsupported.

I've added a comment to the code and updated the PR description to be explicit about this.

Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@janvorli janvorli merged commit abf8e94 into dotnet:main Jan 21, 2025
95 of 97 checks passed
@janvorli janvorli deleted the fix-longjmp-over-managed-frames branch January 21, 2025 20:48
@MichalStrehovsky
Copy link
Member

I now see the test added here failing on native AOT. @janvorli should we track a fix for native AOT as well or should I just disable the test against a "we're never going to fix this" catchall issue?

@janvorli
Copy link
Member Author

janvorli commented Feb 3, 2025

@MichalStrehovsky this fix was made just for backward compatibility with the old exception handling, we don't want to fix it for NativeAOT. I am sorry for forgetting disabling the test for NativeAOT.

@badrishc
Copy link

Will this fix make it to the .NET 9 releases? We need it for Lua support in Garnet to handle unknown commands properly.

@jkotas
Copy link
Member

jkotas commented Feb 23, 2025

@badrishc It is hard to justify backporting this fix to .NET 9. setjmp/longjmp interop is not supported by .NET: https://learn.microsoft.com/en-us/dotnet/standard/native-interop/exceptions-interoperability#setjmplongjmp-behaviors . This fix does not make it work or supported. It just makes it about as broken as it was, on Windows only. It was and it is still completely broken on Linux. It is not possible to fix that on non-Windows as described in the linked doc. We had a lot of debate whether to bother with fixing this on Windows at all.

My understanding is that https://github.com/microsoft/Garnet supports both Windows and non-Windows platforms. Is that correct? If it is the case, you will need to avoid depending on setjmp/longjmp interop to make the Lua integration work on non-Windows platforms.

@kevin-montrose
Copy link

[...] This fix does not make it work or supported. It just makes it about as broken as it was, on Windows only. It was and it is still completely broken on Linux. It is not possible to fix that on non-Windows as described in the linked doc. We had a lot of debate whether to bother with fixing this on Windows at all.

@jkotas / @janvorli I'm trying to get a sense of what the old behavior actually was here, as in my testing Linux seems... fine? I assume with some clever try/catch/finally stuff or throwing exceptions across native frames all hell breaks loose, but just skipping managed frames appears to have worked fine on Linux in .NET 8 (and still does on .NET 9).

Eliding a bit of detail, but basically with...

Lua.PushCFunction((nint)&SomeManagedFunction);

Lua.PCall(0, 0); // this is the bit that calls setjmp 

// ...

[UnmanagedCallersOnly(CallConvs = [typeof(CallConvCdecl)])]
internal static unsafe int SomeManagedFunction(nint luaStateHandle)
{
   if(SomeCheck())
   {
       Lua.LuaError(); // this bit calls longjmp
   }
   else
   {
     // do some other things that might _indirectly_ call lonjmp
   }

  return 0;
}

Seems to work fine on Linux in all permutations I've tried (w and w/o try/catch/finally, .NET 8 and .NET 9) provided I don't let an exception bubble up out of SomeManagedFunction. Put something similar in a stress case (bunch of threads, random allocs, forced GCs, etc.) and couldn't repro an issue - but perhaps I'm poking at the wrong bits?

@jkotas
Copy link
Member

jkotas commented Feb 25, 2025

The runtime keeps track of the managed/native and native/managed transitions. The transitions are structures allocated on the stack and the runtime Thread data structure has pointer (Thread::m_pFrame) to the inner-most transition.

longjmp on Unix will unwind the stack, without updating the m_pFrame pointer in the Thread structure. It means that it is going to be a dangling pointer for a while that violates the runtime invariants. I guess you are getting lucky in your tests that the stack region where this dangling pointer points to is not overwritten and/or that the dangling pointer gets overwritten with a valid transition pointer before you managed to hit a crash.

@kevin-montrose
Copy link

longjmp on Unix will unwind the stack, without updating the m_pFrame pointer in the Thread structure. It means that it is going to be a dangling pointer for a while that violates the runtime invariants. I guess you are getting lucky in your tests that the stack region where this dangling pointer points to is not overwritten and/or that the dangling pointer gets overwritten with a valid transition pointer before you managed to hit a crash.

Well, that's fffffffffff-un 🫠.

I'll spend some more time trying to come up with a design that lets Garnet embed Lua on Linux w/o the possibility of longjmping over any managed frames. The tricky bit is Lua uses it for implicit failures, like OOMs - which are hard to avoid, and basically impossible to patch out.

@badrishc
Copy link

@badrishc It is hard to justify backporting this fix to .NET 9. setjmp/longjmp interop is not supported by .NET: https://learn.microsoft.com/en-us/dotnet/standard/native-interop/exceptions-interoperability#setjmplongjmp-behaviors . This fix does not make it work or supported. It just makes it about as broken as it was, on Windows only. It was and it is still completely broken on Linux. It is not possible to fix that on non-Windows as described in the linked doc. We had a lot of debate whether to bother with fixing this on Windows at all.

My understanding is that https://github.com/microsoft/Garnet supports both Windows and non-Windows platforms. Is that correct? If it is the case, you will need to avoid depending on setjmp/longjmp interop to make the Lua integration work on non-Windows platforms.

Thanks for the details and context! I want to mention that beyond Garnet, there is fairly wide production use of Lua across Microsoft and outside, and often via the NLua C# library that interops into Lua, and often on Windows. This is also evidenced by @stephentoub's comment and PR into NLua last year.

Since Lua uses longjmp as its exception handling mechanism (other C libraries might also do the same, perhaps), it would be great to see: (1) how best we can support this on windows, e.g., by backporting this fix to .NET 9; and (2) what, if anything, can .NET do as a best effort to make specifically Lua's style of exception handling work on unix (even though I fully understand your point about it being generally unsupported). Just my 2 cents.

@jkotas
Copy link
Member

jkotas commented Feb 26, 2025

The proper reliable way to implement NET interop with code based on setjmp/longjmp is by building C wrappers that deals with setjmp/longjmp and prevents setjmp/longjmp leaking into .NET code. It is very similar to interop with C++ exception handling - it requires C wrappers that prevent C++ exceptions leaking into .NET code.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 29, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SEHException with new Exception behavior in .NET 9
5 participants