Long GC Suspend and hard faults #111201
Comments
Tagging subscribers to this area: @dotnet/gc
Tagging subscribers to this area: @mangod9
Looping in @kouvel for the suspension issue. Thanks for the detailed investigation!
Just a few observations. I don't know the details of the flush/trim part, but yes, if a normal memory operation faults, then that thread won't have been put into a "GC ready" state (as opposed to the flush itself, which the DllImport mechanism can do). (I'm glossing over details about fully interruptible code vs. not.) Presumably the file size (seeing the file offset of 815,000,000,000) is part of it.
Yes, the file in question is pretty large - 862 GB. Here are details of a sample sync operation on that file ( According to our own metrics we had 49.5 MB of written data that had not been synced. It's literally the measurement of the following piece of code:
We also have another PerfView trace, collected about 15 minutes earlier, where a GC suspension also took 6.8 seconds. During that time we can also see very long hard page faults (though there are some, even at similar offsets, that take below 1 ms. I know it doesn't mean anything, but it's interesting): Dropping
Something I have trouble understanding is that we see a lot of Can it be that, as shown in one of the screenshots from yesterday, most of our
It seems very likely that the long-duration hard faults are causing the long GC pauses. When a page fault occurs directly from managed code (such as from I'm not sure as to why the hard faults are taking so long to resolve. Typically a hard fault entails reading from disk. There appears to be some kind of lock being taken in the stack trace ( As for the trimming from the flushes, it seems like it could trim some sections of memory from the working set of the process, but I'm not sure if/why it would evict those sections from the cache. If the memory is cached but not in the working set, it would trigger a page fault but I would imagine it would be a soft fault that wouldn't take so long to resolve. It's plausible that the flushes could be interacting by increasing disk latency. Have you looked at disk latencies around the long pauses?
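To make the reasoning above concrete, here is a minimal sketch of why one stalled thread stretches the whole suspension. It is only a simplified model, not the CLR's actual suspension mechanism: each worker sleeps to stand in for a hard fault and then signals a hypothetical "safe point", and the coordinator can finish suspending only once every worker has signalled. All names (`suspend_all`, `worker`, `simulate`) are made up for illustration.

```python
import threading
import time


def suspend_all(safe_point_events, timeout=5.0):
    """Wait until every worker has signalled its safe point.

    Returns the elapsed time in seconds, i.e. the simulated suspend duration.
    """
    start = time.perf_counter()
    for event in safe_point_events:
        event.wait(timeout)
    return time.perf_counter() - start


def worker(blocking_seconds, reached_safe_point):
    # Stand-in for a hard page fault: while blocked here the thread
    # cannot reach a safe point, so the coordinator has to wait for it.
    time.sleep(blocking_seconds)
    reached_safe_point.set()


def simulate(blocking_times):
    """Run one worker per blocking time and return the simulated suspend duration."""
    events = [threading.Event() for _ in blocking_times]
    threads = [
        threading.Thread(target=worker, args=(seconds, event))
        for seconds, event in zip(blocking_times, events)
    ]
    for thread in threads:
        thread.start()
    pause = suspend_all(events)
    for thread in threads:
        thread.join()
    return pause
```

Calling `simulate([0.01, 0.01, 0.3])` gives a pause of roughly the slowest worker's blocking time: the suspend duration is governed by the single longest stall, not by the typical one.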
Maybe there should also be ReadInit/Read and WriteInit/Write events depending on what's happening. Events can be dropped heuristically, such as if raising them entails some overhead like page faults, disk IO, etc. It may help to pass the
Description
We (RavenDB) have a customer that occasionally experiences very long GC pauses. It's not a new issue or a regression; it has been occurring for a long time and was already happening on .NET 6. Some time ago RavenDB was updated to use .NET 8 (in order to be on an LTS version). More specifically, version 8.0.400 is deployed on the affected environment.
The pauses affect the application's requests to RavenDB. The monitoring also shows connectivity issues between the nodes (the setup is a 3-node RavenDB cluster).
Using PerfView, we have so far narrowed it down to GC Suspend events taking a long time.
Reproduction Steps
This happens very occasionally and only in a production environment. RavenDB is configured as a 3-node (3-machine) cluster, and the issue occurs randomly on any node. The machines have 256 GB of memory and run Windows Server 2019 Datacenter (OS build number: 17763.5696.amd64fre.rs5_release.180914-1434). They are hosted in Azure.
Expected behavior
GC pauses don't take this long.
Actual behavior
We'd like to share our analysis of the most recent occurrence of the issue. We have three PerfView traces that were collected using the following command:
The longest GC suspend took 18,724.797 mSec (time range: 244,459.607 - 263,347.329):
The analysis below is about this GC suspend event (although the other two PerfView outputs are very similar).
Events
2908
Thread Time (with Ready Threads) Stacks
Analysis of coreclr!ThreadSuspend::SuspendEE
There are a lot of READIED BY TID(0) Idle (0) CPU Wait < 1ms IdleCPUs events, which mostly point to ntoskrnl!??KiProcessExpiredTimerList (READIED_BY):
Below we can also find two RavenDB threads - 5108 and 6524:
Looking deeper we can find that both threads are performing some queries:
When looking at the stacks of thread 5108, we can see that it's reading documents from disk (we use memory-mapped files), causing a page fault:
This is about the following code:
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/LowLevelTransaction.cs#L648-L655
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Impl/Paging/AbstractPager.cs#L332
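For readers less familiar with memory-mapped I/O, the sketch below shows the general pattern, not RavenDB's actual code. A read through a mapping is an ordinary memory access with no explicit read call, so if the page is not resident, the read latency is paid inside that access as a page fault. The helper names (`timed_mapped_read`, `make_sample_file`) are hypothetical, and the sample file is tiny, so in practice it will be served from the cache.

```python
import mmap
import os
import tempfile
import time

# Offsets below are multiples of this granularity so each block starts on a page boundary.
PAGE = mmap.ALLOCATIONGRANULARITY


def timed_mapped_read(path, offset, length):
    """Read `length` bytes at `offset` through a read-only mapping.

    Returns (data, seconds). The slice is a plain memory access; if the page
    is not resident, the time spent resolving the page fault is included.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            start = time.perf_counter()
            data = view[offset:offset + length]
            elapsed = time.perf_counter() - start
    return data, elapsed


def make_sample_file(pages=4):
    """Create a temporary file where block i is filled with byte value i."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(pages):
            f.write(bytes([i]) * PAGE)
    return path
```

Timing the access itself, as `timed_mapped_read` does, is one way to see that the cost of a fault lands on whichever thread happens to touch the page.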
Events (ReadyThread)
The ReadyThread events for AwakenedThreadId - 2908 (the GC thread doing the suspension) show that it was awakened by the RavenDB threads mentioned above - 5108 and 6524, but also by the Idle (0) process, so I assume it's the System process.
CPU Stacks
From CPU Stacks we know that before the long GC suspend, we had started FlushFileBuffers() on our Raven.voron file, where documents are saved (the file that threads 5108 and 6524 are reading from). We call it explicitly and periodically here:
https://github.com/ravendb/ravendb/blob/1c79a8d9131b248cfe129f7ad516495f31942584/src/Voron/Platform/Win32/WindowsMemoryMapPager.cs#L344
It was still running during the GC suspend. Under the covers we can see a MmTrimSection() call, which, as we understand it, will evict some pages, so subsequent access to the trimmed pages will result in a page fault, requiring the system to re-read the data from the file (which is what we see in threads 5108 and 6524).
Events (Suspend EE and Hard Faults)
Going back to Events view and adding Hard Faults events:
So from our analysis it looks like the GC suspend was caused by Hard Faults taking 18-19 seconds (not sure why) after a recent FlushFileBuffers().
Questions
MmTrimSection()?
Regression?
No response
Known Workarounds
No response
Configuration
No response
Other information
No response