Intermittent hang/deadlock in .net core 5.0.400 while debugging #58471
Comments
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
Tagging subscribers to this area: @tommcdon
Issue Details
Description
This is a duplicate of #42375 as far as symptoms and behavior go but I am still encountering the exact same symptoms on 5.0.400. I can reproduce this on Mac OS and Windows.
When debugging, our dev team encounters sporadic hangs (about 30% of the time). There does not seem to be any specific reproducible pattern of when in the program execution the hang occurs. When it happens, the diagnostics logger stops updating and I cannot break or terminate the program:
If I try to dotnet trace collect on a hung process, dotnet trace hangs as well.
I have tried taking and analyzing a memory dump using wpr as described here, but I have not been able to find anything meaningful.
Configuration
Reproduced on 5.0.400 on Mac OS and Windows. In Visual Studio and Rider IDE.
Regression?
This issue seems to have started when we upgraded from netcoreapp3.1 to net5.0
Other information
The amount of logging and amount of asynchronous operations seems to make the issue more/less prevalent. For example, turning down the log level makes the issue happen about 20% of the time instead of 30% of the time.
|
Is there any way you can share a dump taken when everything seems stuck (of both the debugger and the target process), or an ETW trace with CPU stacks? If you go with dumps, we'll likely need one dump of MSVSMON and one of your app. |
@hoyosjs - sure - how would you like me to deliver it to you? Do you have a preference of Mac vs Windows, Rider vs Visual Studio? |
You can open a feedback ticket on https://developercommunity.visualstudio.com/ and attach them there if possible (ensures proper deletion of customer data) and post it here. And Windows VS would probably be the easiest for us to examine. |
Any update @hoyosjs ? |
@hoyosjs bump... |
Sorry @jeff-simeon, things are building up around the .NET 6 release and with the long weekend I haven't had a chance to get to this yet. I'll try my best to sink some time into the analysis this week and come back to you. |
thanks @hoyosjs - this is a real painful experience for us |
@hoyosjs any luck? thanks very much |
Hey @jeff-simeon. I've started taking a look and spent most of today on it. I still don't have a good sense of why it's locked. I see that there were three threads sending notifications of the thread starting up, then the thread sends the event over to the debugger side, and then all threads are stopped waiting for the debugger to notify that it's OK to continue. The runtime never got the event to continue, so there is not much more to see on that side. I need to think about the debugger side, and I am still working out how to look at it. Given that it repros on both macOS and Windows, and in both Rider and VS, I highly doubt that it's a bug in those layers (if they are all the same bug). I just need to think of a way to see what's happening without having a repro. What I see:
|
Thanks @hoyosjs. Is there anything I can do to help? Or anything I can do to narrow down the cause at runtime when debugging? We really would love to get this resolved. |
@hoyosjs - any thoughts? |
Sorry @jeff-simeon; I took a look the other day, but since then I haven't been able to spend much time on it, as .NET 6 is about to ship and it has required my full attention. Without a repro I expect progress on this one to be slow. Not sure how big your project is, but does it only happen on one project, or on anything you try? The only thing I can think of that could help without a repro is logging. |
This is the core software for our product, so it puts us in an untenable situation. This happens randomly about 30% to 50% of the time we debug. Should we downgrade to a version of .NET that works? It would be a shame to determine that .NET 5.0 isn’t suitable for use. |
I understand it's painful, and I'm sorry it got there. 30-50% is definitely REALLY high. I am surprised we haven't gotten any other reports of something like this. Looping @dotnet/dotnet-diag to see if someone has more time to check this one. The problem is most threads are in |
We really appreciate any help you can provide. I am surprised that this isn’t more widespread as well. I have to think it has to do with the highly asynchronous nature of our software. It processes many hundreds of HTTP calls to remote APIs in parallel and tracks thousands of TaskCompletionSources to orchestrate the concurrency. The debugger hang seems to happen when the process is spawning off all these tasks and then WhenAll’ing on them. I’m not sure if that helps shed any light on where we can look. |
@hoyosjs do we leave the debugger logging in release mode? It seems like if we had logs of the left and right sides it would be helpful here |
I'd say that's the only thing that could really help. I haven't gone fishing for which stress log statements could help us, or whether a private build will be needed. |
I took a quick look; we have more stress log statements than I expected, and they might be enough. @jeff-simeon could you collect more data? We're probably going to have to iterate a bit on what data exactly to collect, since usually we like to have a local repro we can debug ourselves. The logging should be enough to get to the bottom of this, but we may have to go back and forth and potentially even ask you to run a privately built runtime before we have the right set of data. For right now, we should start with the StressLog data, because it doesn't require any private builds. StressLog is an in-memory circular buffer we use to collect additional data for hard-to-debug scenarios. Setting the following environment variables before startup on the debuggee process (testhost.exe) will tell it to create as large a log as possible and only collect debugger events:
Then with these all set, run your repro and once it hangs collect another dump and send it to us. We can extract the logging with SOS, our debugger extension. Depending on how curious you are, you are free to look at the logs yourself by loading it in WinDBG and running the !dumplog command - it will create a StressLog.txt in the directory with the StressLog contents. There should be lots of CORDB statements:
|
Thanks - I checked out the dmp file in windbg but still don't see any red flags that I can recognize. I'd be very interested in learning from what you find. Here is the new ticket link: https://developercommunity.visualstudio.com/t/intermittent-hangdeadlock-in-net-core-50400-while-1/1533767 |
@jeff-simeon I don't see the dump attached to the new issue. It's entirely possible I'm doing something wrong, can you verify if it succeeded in uploading so I can figure out if the problem is on my end? |
...I see it's not there as well....seems like something is broken with the site. I tried to upload again, this time by adding a comment with an attachment. Please let me know if you see it now. Thanks. |
Thanks, I can see the file is uploaded now. We will take a look, I'm not sure exactly when but hopefully early this week |
Thanks very much @davmason |
(Sorry - this seems to have gotten stuck in my outbox limbo. That's what I get for trying to reply to GitHub from my email.) Hey @jeff-simeon. I think we might have an idea of what is causing this issue. It might take me a while to make sure I am on the right trail, but there's something that might help as a workaround, and it will definitely be easier for you to confirm whether it helps than anything I can do on my side. I was talking to @davmason and he realized that my suggestion to disable tiering is not complete. There's a feature in the profiler that uses that same feature, and I believe it is a player in the issue you currently see. So in addition of needing |
Ok! I was fumbling around and I think I fixed the issue. But I cannot confirm what exactly fixed it. Maybe someone else can try what I did and confirm if it worked. Note that I am using Visual Studio 2022 17.0.1 After the fix:
What I did:
Tagging the other open issue here for reference |
@hoyosjs - confirmed this is not resolving the issue for us....what is the status here please? |
Hi @jeff-simeon, I am a coworker of @hoyosjs. He has been out on vacation for the Christmas holidays, but now that I am back from my own holiday vacation I'm going to fill in for him and help get this moving. I assisted with some of the earlier investigation, so I think I am already mostly up to speed on this. My understanding so far is that:
Dump 1 (thread 17):
Dump 2 (thread 30):
Dump 3 (thread 17):
In the meantime I am working on a fix for the portions of the bug we do understand from the dumps you already provided. However, the fact that disabling both tiered compilation and rejit didn't work suggests our understanding of the issue is incomplete, so anything I do to fix the part we do understand isn't going to be sufficient to fully solve this for you. Next steps:
|
Sorry for the delayed reply @noahfalk. Ultimately, we decided to move to Rider on MacOS for dev along with AVD VMs where Windows is strictly required. While expensive, the cost of the hardware is nominal in comparison to the productivity lost or the effort in downgrading to an earlier version of dotnet. I still would like to help get you the information you need, but it will take some time to get a new dev environment set up where I can reproduce. |
No worries on the timing at all @jeff-simeon and sorry that it came to a new hardware purchase just to avoid this issue : ( I certainly appreciate any time you choose to spend helping diagnose the issue whenever that is. |
The fix @kouvel made thus far addresses the issues that were caused by TieredCompilation and RejitOnAttach. Some of the folks on this thread said that was sufficient to resolve the issue for them but others said they could still reproduce deadlocks after those two features were disabled. We did identify a likely 3rd culprit which is theorized to produce a similar looking deadlock but it hasn't yet been fixed. |
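For readers who want to check whether those two features are involved in their own hangs, a minimal sketch of disabling them before launching the debuggee might look like the following. The variable names are assumptions, since they are not spelled out in this thread: COMPlus_TieredCompilation is the runtime's documented knob for tiered compilation, while COMPlus_ProfApi_RejitOnAttach is believed to be the rejit-on-attach setting and should be verified against the runtime's configuration list. As noted above, this did not resolve the hang for everyone.

```shell
#!/bin/sh
# Sketch only: both variable names are assumptions (see the note above).
# Set them in the environment that launches the debuggee, e.g. the shell
# that starts the IDE or the launch profile of the project being debugged.

# Disable tiered compilation (assumed knob name).
export COMPlus_TieredCompilation=0

# Disable profiler rejit-on-attach (assumed knob name; verify before relying on it).
export COMPlus_ProfApi_RejitOnAttach=0

# Show the COMPlus_ settings that a child process would inherit.
env | grep '^COMPlus_' | sort
```

The variables only take effect for processes started after they are set, so an already-running IDE or test host would need to be restarted from this environment.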
I've looked over a couple of options for the theorized issue that remains after TieredCompilation and RejitOnAttach are disabled, though it's not clear yet what is actually causing that deadlock. One option looks promising, but there is more to look at. |
Closing via #69121 |
Description
This is a duplicate of #42375 as far as symptoms and behavior go but I am still encountering the exact same symptoms on 5.0.400. I can reproduce this on Mac OS and Windows.
When debugging, our dev team encounters sporadic hangs (about 30% of the time). There does not seem to be any specific reproducible pattern of when in the program execution the hang occurs. When it happens, the diagnostics logger stops updating and I cannot break or terminate the program:
If I try to dotnet trace collect on a hung process, dotnet trace hangs as well.
I have tried taking and analyzing a memory dump using wpr as described here, but I have not been able to find anything meaningful.
Configuration
Reproduced on 5.0.400 on Mac OS and Windows. In Visual Studio and Rider IDE.
Regression?
This issue seems to have started when we upgraded from netcoreapp3.1 to net5.0
Other information
The amount of logging and amount of asynchronous operations seems to make the issue more/less prevalent. For example, turning down the log level makes the issue happen about 20% of the time instead of 30% of the time.