
Intermittent hang/deadlock in .net core 5.0.400 while debugging #58471


Closed
jeff-simeon opened this issue Aug 31, 2021 · 72 comments

@jeff-simeon

Description

This is a duplicate of #42375 as far as symptoms and behavior go but I am still encountering the exact same symptoms on 5.0.400. I can reproduce this on Mac OS and Windows.

When debugging, our dev team encounters sporadic hangs (about 30% of the time). There does not seem to be any specific reproducible pattern of when in the program execution the hang occurs. When it happens, the diagnostics logger stops updating and I cannot break or terminate the program.

If I try to dotnet trace collect on a hung process, dotnet trace hangs as well.
I have tried taking and analyzing a memory dump using wpr as described here, but I have not been able to find anything meaningful.

Configuration

Reproduced on 5.0.400 on Mac OS and Windows. In Visual Studio and Rider IDE.

Regression?

This issue seems to have started when we upgraded from netcoreapp3.1 to net5.0.

Other information

The amount of logging and the amount of asynchronous operations seem to make the issue more or less prevalent. For example, turning down the log level makes the issue happen about 20% of the time instead of 30% of the time.

@ghost

ghost commented Aug 31, 2021

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label Aug 31, 2021
@ghost

ghost commented Aug 31, 2021

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.


@hoyosjs
Member

hoyosjs commented Aug 31, 2021

Is there any way you can share a dump from when everything seems stuck (both the debugger and the target process), or a trace from ETW with CPU stacks? If dumps, we'll likely need one dump of MSVSMON and one of your app.
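
For reference, a minimal way to capture an ETW trace with CPU stacks using WPR; the built-in CPU profile and the output file name here are illustrative, not something prescribed in this thread:

wpr -start CPU
rem ...reproduce the hang...
wpr -stop cpu_stacks.etl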

@jeff-simeon
Author

@hoyosjs - sure - how would you like me to deliver it to you? Do you have a preference of Mac vs Windows, Rider vs Visual Studio?

@hoyosjs
Member

hoyosjs commented Sep 1, 2021

You can open a feedback ticket on https://developercommunity.visualstudio.com/ and attach them there if possible (it ensures proper deletion of customer data), then post the link here. Windows + VS would probably be the easiest for us to examine.

@tommcdon added this to the 7.0.0 milestone Sep 1, 2021
@tommcdon removed the untriaged label Sep 1, 2021
@jeff-simeon
Author

jeff-simeon commented Sep 6, 2021

Any update @hoyosjs ?

@jeff-simeon
Author

@hoyosjs bump...

@hoyosjs
Member

hoyosjs commented Sep 8, 2021

Sorry @jeff-simeon, things are building up around the .NET 6 release and with the long weekend I haven't had a chance to get to this yet. I'll try my best to sink some time into the analysis this week and come back to you.

@jeff-simeon
Author

thanks @hoyosjs - this is a really painful experience for us

@jeff-simeon
Author

jeff-simeon commented Sep 13, 2021

@hoyosjs any luck? thanks very much

@hoyosjs
Member

hoyosjs commented Sep 14, 2021

Hey @jeff-simeon. I've started taking a look and spent most of today on it. I still don't have a good sense of why it's locked. I see that there were three threads sending notifications of the thread starting up; the thread sends the event over to the debugger side and then all threads are stopped, waiting for the debugger to notify them that it's OK to continue. The runtime never got the event to continue, so there's not much to go on there. I need to think about the debugger side, and I am still thinking about how to look at it. Given that it repros on both macOS and Windows, and in both Rider and VS, I highly doubt that it's a bug in those layers (if they are all the same bug). I just need to think of a way to see what's happening without having a repro.

What I see:

  • A bunch of locked threads for DI
  • Some tests that are waiting to run managed code, but they can't as the debugger is trapping them.
  • AppInsights trying to sleep
  • A bunch of threads running their termination routines.
  • Some threads notifying of startup and waiting on a continuation event; the threads were acked by the debugger.

@jeff-simeon
Author

Thanks @hoyosjs

is there anything I can do to help? Or anything I can do to narrow down the cause at runtime when debugging?

we really would love to get this resolved.

@jeff-simeon
Author

@hoyosjs - any thoughts?

@hoyosjs
Member

hoyosjs commented Sep 15, 2021

Sorry @jeff-simeon; I took a look the other day, but since then I haven't been able to look at it much as .NET 6 is about to ship and it has required my full attention. Without a repro I expect progress on this one to be slow. Not sure how big your project is, but does it only happen on one project, or on anything you try? The only thing I can think of that could help without a repro is logging.

@jeff-simeon
Author

This is the core software for our product, so it puts us in an untenable situation.

This happens randomly about 30% to 50% of the time we debug.

Should we downgrade to a version of .NET that works? It would be a shame to determine that .NET 5.0 isn’t suitable for use.

@hoyosjs
Member

hoyosjs commented Sep 16, 2021

I understand it's painful, and I'm sorry it got there. 30-50% is definitely REALLY high. I am surprised we haven't gotten any other reports of something like this. Looping @dotnet/dotnet-diag to see if someone has more time to check this one. The problem is most threads are in Thread::WaitSuspendEventsHelper waiting on m_DebugSuspendEvent. The only part where we set it is ReleaseFromSuspension called from SysResumeFromDebug. The only way to end up there is DB_IPCE_CONTINUE or DB_IPCE_DETACH_FROM_PROCESS getting received in the DebuggerRCThread. That thread didn't report any messages not getting passed down.

@jeff-simeon
Copy link
Author

We really appreciate any help you can provide.

I am surprised that this isn’t more widespread as well. I have to think it has to do with the highly asynchronous nature of our software. It processes many hundreds of http calls to remote APIs in parallel and tracks thousands of TaskCompletionSources to orchestrate the concurrency. The debugger hang seems to happen when the process is spawning off all these tasks and then WhenAll’ing on them

I’m not sure if that helps shed any light on where we can look.
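
For anyone trying to reproduce this, here is a minimal sketch of the kind of workload described above; the class names, URL, and counts are illustrative assumptions, not our actual code. It fans out many HTTP calls, tracks each one with a TaskCompletionSource, and then WhenAll's on everything.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch only: fan out many HTTP calls and track them with TaskCompletionSource.
class WorkloadSketch
{
    static readonly HttpClient Client = new HttpClient();

    static async Task Main()
    {
        var completions = new List<TaskCompletionSource<string>>();
        var workers = new List<Task>();

        for (int i = 0; i < 500; i++)
        {
            var tcs = new TaskCompletionSource<string>(TaskCreationOptions.RunContinuationsAsynchronously);
            completions.Add(tcs);
            workers.Add(CallRemoteApiAsync($"https://example.com/api/items/{i}", tcs));
        }

        // The hang reportedly tends to occur while the debugger is attached and the
        // process is in this fan-out / WhenAll phase.
        await Task.WhenAll(workers);
        await Task.WhenAll(completions.Select(c => c.Task));
        Console.WriteLine("All requests completed.");
    }

    static async Task CallRemoteApiAsync(string url, TaskCompletionSource<string> tcs)
    {
        try
        {
            string body = await Client.GetStringAsync(url);
            tcs.TrySetResult(body);
        }
        catch (Exception ex)
        {
            tcs.TrySetException(ex);
        }
    }
}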

@davmason
Contributor

@hoyosjs do we leave the debugger logging in release mode? It seems like if we had logs of the left and right sides it would be helpful here

@hoyosjs
Copy link
Member

hoyosjs commented Sep 16, 2021

I'd say that's the only thing that could really help. I haven't gone fishing for what stress log statements could help us, or if a private will be needed.

@davmason
Contributor

I took a quick look; we have more stress log statements than I expected, and they might be enough.

@jeff-simeon could you collect more data? We're probably going to have to iterate a bit on exactly what data to collect; usually we like to have a local repro we can debug ourselves. The logging should be enough to get to the bottom of this, but we may have to go back and forth, and potentially even ask you to run a privately built runtime, before we have the right set of data.

For right now, we should start with the StressLog data, because it doesn't require any private builds. StressLog is an in memory circular buffer we use to collect additional data for hard to debug scenarios. Setting the following environment variables before startup on the debuggee process (testhost.exe) will tell it to create as large of a log as possible and only collect debugger events:

set COMPlus_StressLog=1
set COMPlus_LogFacility=8200
set COMPlus_LogLevel=10
set COMPlus_StressLogSize=80000
set COMPlus_TotalStressLogSize=8000000

Then with these all set, run your repro and once it hangs collect another dump and send it to us. We can extract the logging with SOS, our debugger extension. Depending on how curious you are, you are free to look at the logs yourself by loading it in WinDBG and running the !dumplog command - it will create a StressLog.txt in the directory with the StressLog contents.
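
If you want to pull the log out yourself, a minimal sketch of the steps (assuming SOS is installed via the dotnet-sos global tool; the dump file name is illustrative):

dotnet tool install --global dotnet-sos
dotnet-sos install
rem then open the dump in WinDbg (e.g. windbg -z testhost_hang.dmp) and run:
!dumplog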

There should be lots of CORDB statements:

63bc  13.513426000 : `CORDB`              D::SendIPCEvent DB_IPCE_SYNC_COMPLETE to outofproc appD 0x0,

63bc  13.513425400 : `CORDB`              GetIPCEventSendBuffer called in SendSyncCompleteIPCEvent

63bc  13.513425100 : `CORDB`              D::SSCIPCE: sync complete.

63bc  13.513424900 : `CORDB`              D::SC: suspension complete

63bc  13.513424700 : `CORDB`              DRCT::ML:: wait set empty after sweep.

1954  13.513412400 : `CORDB`              D::SSCIPCE: Calling IsRCThreadReady()

1954  13.513198100 : `CORDB`              D::TART: Trapping all Runtime threads.

@jeff-simeon
Author

Thanks - I checked out the dmp file in windbg but still don't see any red flags that I can recognize. I'd be very interested in learning from what you find.

Here is the new ticket link: https://developercommunity.visualstudio.com/t/intermittent-hangdeadlock-in-net-core-50400-while-1/1533767

@davmason
Contributor

@jeff-simeon I don't see the dump attached to the new issue. It's entirely possible I'm doing something wrong, can you verify if it succeeded in uploading so I can figure out if the problem is on my end?

@jeff-simeon
Author

...I see it's not there as well....seems like something is broken with the site.

I tried to upload again, this time by adding a comment with an attachment. Please let me know if you see it now.

Thanks.

@davmason
Contributor

Thanks, I can see the file is uploaded now. We will take a look, I'm not sure exactly when but hopefully early this week

@jeff-simeon
Author

Thanks very much @davmason

@hoyosjs
Member

hoyosjs commented Nov 17, 2021

(Sorry - this seems to have gotten stuck in my outbox limbo. That's what I get for trying to reply to GitHub from my email.)

Hey @jeff-simeon. I think we might have an idea of what is causing this issue. While it might take a while for me to make sure I am on the right trail, there's something that might help as a workaround, and it will definitely be easier for you to confirm whether it helps than anything I can do on my side.

I was talking to @davmason and he realized that my suggestion to disable tiering is not complete. There's a feature in the profiler that uses that same mechanism, and I believe it is a player in the issue you are seeing. So in addition to setting DOTNET_TieredCompilation=0/COMPLUS_TieredCompilation=0, you should set COMPlus_ProfApi_RejitOnAttach=0 and see if that helps.
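
For convenience, the suggested workaround in the same form as the StressLog variables earlier in this thread (set these before launching the debuggee; either the DOTNET_ or COMPlus_ prefix works per the comment above):

set DOTNET_TieredCompilation=0
set COMPlus_TieredCompilation=0
set COMPlus_ProfApi_RejitOnAttach=0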

@amandal1810

@hoyosjs I had already tried this (setting both COMPlus_TieredCompilation and COMPlus_ProfApi_RejitOnAttach to 0 in system environment variables) after seeing this comment. But it did not solve the issue. :(

@amandal1810

Ok! I was fumbling around and I think I fixed the issue. But I cannot confirm what exactly fixed it. Maybe someone else can try what I did and confirm if it worked. Note that I am using Visual Studio 2022 17.0.1

After the fix:

  • I can successfully hit breakpoints in the code
  • I was able to Stop debugging without VS freezing up
  • I let VS run in debug mode for 30 mins, and it did not freeze up
    image

What I did:

  • Add the two environment variables like @hoyosjs mentioned above.
  • Clean up old SDK and runtime installs. FWIW, this step might be optional and might break your dev setup, so tread carefully: I kept only the latest revision of each major version and simply deleted the older folders from Windows Explorer. Thankfully nothing broke. This is the output of dotnet --info after the cleanup. Sorry, I don't have a "before".
.NET SDK (reflecting any global.json):
 Version:   6.0.100
 Commit:    9e8b04bbff

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22000
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\6.0.100\

Host (useful for support):
  Version: 6.0.0
  Commit:  4822e3c3aa

.NET SDKs installed:
  3.1.415 [C:\Program Files\dotnet\sdk]
  5.0.403 [C:\Program Files\dotnet\sdk]
  6.0.100 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET runtimes or SDKs:
  https://aka.ms/dotnet-download
  • In Visual Studio, disable resource usage limits from the Diagnostic Tools Property Pages. Steps:
    • As soon as debugging starts, from the Diagnostic Tools tab, click Select Tools -> Settings
    • Then uncheck Enable resource usage limits. Apply -> Ok.

Tagging the other open issue here for reference

@jeff-simeon
Author

jeff-simeon commented Dec 14, 2021

@hoyosjs - confirmed this is not resolving the issue for us....what is the status here please?

@noahfalk
Member

noahfalk commented Jan 6, 2022

Hi @jeff-simeon, I am a coworker of @hoyosjs. He has been out on vacation for the Christmas holidays, but now that I am back from my own holiday vacation I'm going to fill in for him and help get this moving. I assisted with some of the earlier investigation, so I think I am already mostly up to speed on this. My understanding so far is that:

  • You sent us four sets of dumps (thanks!). The first 3 all had tiered compilation enabled and then the last one from Oct 22nd had tiered compilation disabled.
  • When we investigated the dumps with tiered compilation enabled we saw each of them had an identical deadlock involving a thread doing call counting for tiered compilation inside a region of code where the debugger is forbidden from stopping the app. This is a bug in the runtime and I think we understand this part of the problem.

Dump 1 (thread 17):

02 0000006f`0467f570 00007ffc`dffba656     coreclr!ThreadSuspend::SuspendEE+0x228 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6097] 
03 0000006f`0467f710 00007ffc`dfe5f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 

Dump 2 (thread 30):

07 00000006`6b49f600 00007ffc`cd57a656     coreclr!ThreadSuspend::SuspendEE+0x449 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6236] 
08 00000006`6b49f7a0 00007ffc`cd41f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 

Dump 3 (thread 17):

03 000000ed`b157f670 00007ffe`57baa656     coreclr!ThreadSuspend::SuspendEE+0x283 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6144] 
04 000000ed`b157f810 00007ffe`57a4f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 
  • Then you disabled tiered compilation in the 4th dump and the problem still reproduced. When we investigated the dump that time we saw the situation was different. This means disabling tiered compilation did have an effect, it just wasn't sufficient because there were additional variations of the problem. We also have an understanding of this 2nd variation now and it too is a runtime bug.
Dump 4, thread 27
06 00000004`f1778030 00007ffc`85992b0a     coreclr!CrstBase::Enter+0x5a [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.cpp @ 330] 
07 (Inline Function) --------`--------     coreclr!CrstBase::AcquireLock+0x5 [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.h @ 187] 
08 (Inline Function) --------`--------     coreclr!CrstBase::CrstAndForbidSuspendForDebuggerHolder::{ctor}+0x5db [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.cpp @ 819] 
09 (Inline Function) --------`--------     coreclr!MethodDescBackpatchInfoTracker::ConditionalLockHolderForGCCoop::{ctor}+0x5db [D:\workspace\_work\1\s\src\coreclr\src\vm\methoddescbackpatchinfo.h @ 134] 
0a 00000004`f1778060 00007ffc`85991f6c     coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x8ba [D:\workspace\_work\1\s\src\coreclr\src\vm\codeversion.cpp @ 1762] 
  • Since we underestimated the scope of the problem the first time, we tried to identify all possible codepaths that enter the problematic ForbidSuspendForDebugger region so we could eliminate it completely. We think disabling tiered compilation and rejit together should have been sufficient to accomplish that, but we have reports from you and @amandal1810 that it still didn't work, and we don't yet know why. It is possible our analysis missed something, or that there is yet another variation of the problem we are still unaware of, or that there was some mistake in how we had you set up the most recent experiment. I'm sorry to keep asking for dumps, but if you can capture one where the app is deadlocked and both tiered compilation and RejitOnAttach are disabled, that will help us resolve this part of the puzzle.

In the meantime I am working on a fix for the portions of the bug we do understand from the dumps you already provided. However, the fact that disabling both tiered compilation and rejit didn't work suggests our understanding of the issue is incomplete, and anything I do to fix the part we do understand isn't going to be sufficient to fully solve this for you.

Next steps:

  • For me I'll be working on a code fix in the runtime for the portion of the bug we currently understand
  • For you my request is to capture a dump with the app deadlocked after setting the environment variables that disable both tiered compilation and RejitOnAttach. As soon as you can upload that I'll investigate it and let you know what it shows. Apologies again for the trouble I imagine this is causing. Your help solving it by collecting these dumps is really appreciated. Thanks!
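
For anyone else following along, one common way to capture the requested dump once the process is deadlocked; Sysinternals procdump is just one option here, and the tool choice and file name are assumptions rather than anything prescribed in this thread:

procdump -ma <pid> app_deadlocked.dmp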

@jeff-simeon
Author

Sorry for the delayed reply @noahfalk. Ultimately, we decided to move to Rider on MacOS for dev along with AVD VMs where Windows is strictly required. While expensive, the cost of the hardware is nominal in comparison to the productivity lost or the effort in downgrading to an earlier version of dotnet.

I still would like to help get you the information you need, but it will take some time to get a new dev environment set up where I can reproduce.

@noahfalk
Member

No worries on the timing at all @jeff-simeon and sorry that it came to a new hardware purchase just to avoid this issue : ( I certainly appreciate any time you choose to spend helping diagnose the issue whenever that is.

@ghost

ghost commented Apr 12, 2022

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.


@tommcdon modified the milestones: 7.0.0, 6.0.x Apr 12, 2022
@tommcdon
Member

Thanks to @kouvel, #67160 should have fixed the issue in 7.0. @noahfalk is working on a 6.0-servicing version of the fix.

@noahfalk
Member

The fix @kouvel made thus far addresses the issues that were caused by TieredCompilation and RejitOnAttach. Some of the folks on this thread said that was sufficient to resolve the issue for them but others said they could still reproduce deadlocks after those two features were disabled. We did identify a likely 3rd culprit which is theorized to produce a similar looking deadlock but it hasn't yet been fixed.

@kouvel
Contributor

kouvel commented Apr 12, 2022

I've looked over a couple of options for the issue theorized to remain after TieredCompilation and RejitOnAttach are disabled, though it's not clear yet what is actually causing that deadlock. There is a promising option for the theorized issue, with more still to look at.

@tommcdon
Member

Closing via #69121

@ghost locked as resolved and limited conversation to collaborators Aug 10, 2022