8369622: GlobalChunkPoolMutex is recursively locked during error handling #27869
/contributor add @jdksjolen
👋 Welcome back coleenp! A progress list of the required criteria for merging this PR into
@coleenp This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be:
You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 127 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. ➡️ To integrate this PR with the above commit message to the
This is the same kind of strategy used by ZGC. It seems a good idiom to use for dealing with error reporting.
Looks good.
Thanks
ChunkPoolLocker::LockStrategy ls = ChunkPoolLocker::LockStrategy::Lock;
if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
  ls = ChunkPoolLocker::LockStrategy::Try;
}
Thinking more, we could simply always do this check in the constructor and do away with the "strategy" flag altogether. Arguably this would be reasonable behaviour for every MutexLocker (though it may slow things down a little).
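To make the suggestion above concrete, here is a minimal sketch of a locker whose constructor performs the error-reporting check itself, so callers pass no strategy flag. It is standalone C++ rather than HotSpot code: `SimpleLock`, `ErrorAwareLocker` and the thread-local `t_reporting_error` flag are hypothetical stand-ins (the flag plays the role of the `VMError::is_error_reported_in_current_thread()` check), and this is not what the PR implements.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for "this thread is currently reporting an error".
static thread_local bool t_reporting_error = false;

// Hypothetical non-recursive lock. An atomic flag is used so that a failed
// try_lock() by the thread that already holds the lock is well defined.
class SimpleLock {
  std::atomic<bool> _held{false};
 public:
  bool try_lock() { return !_held.exchange(true, std::memory_order_acquire); }
  void lock()     { while (!try_lock()) { /* spin */ } }
  void unlock()   { _held.store(false, std::memory_order_release); }
};

// Locker that decides between blocking and trying inside its constructor.
class ErrorAwareLocker {
  SimpleLock& _lock;
  bool        _locked;
 public:
  explicit ErrorAwareLocker(SimpleLock& lock) : _lock(lock), _locked(false) {
    if (t_reporting_error) {
      // During error reporting, never block: the lock may already be
      // held further up this thread's own call stack.
      _locked = _lock.try_lock();
    } else {
      _lock.lock();
      _locked = true;
    }
  }
  ~ErrorAwareLocker() {
    if (_locked) _lock.unlock();  // release only what this object acquired
  }
  bool locked() const { return _locked; }

  ErrorAwareLocker(const ErrorAwareLocker&) = delete;
  ErrorAwareLocker& operator=(const ErrorAwareLocker&) = delete;
};
```

The cost mentioned in the comment is visible here: every construction pays for the extra check, whether or not an error is ever reported.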
I had a version that did this but Johan was worried about global behavior so wanted to limit it to just NMT reporting on error to be safe.
Okay. Worth having a discussion whether all "lockers" should adopt this error reporting behaviour.
Yeah. Not sure about that for this lock or in general. Right now it's ad hoc.
Thank you for taking this PR.
I could reproduce the deadlock with this change:
--- a/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
+++ b/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
@@ -23,6 +23,7 @@
  */
 #include "memory/allocation.hpp"
+#include "memory/arena.hpp"
 #include "nmt/memTracker.hpp"
 #include "runtime/os.hpp"
 #include "sanitizers/address.hpp"
@@ -142,6 +143,21 @@ DEFINE_TEST(test_corruption_on_realloc_growing, COMMON_NMT_HEAP_CORRUPTION_MESSA
 static void test_corruption_on_realloc_shrinking() { test_corruption_on_realloc(0x11, 0x10); }
 DEFINE_TEST(test_corruption_on_realloc_shrinking, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
+
+static void test_chunkpool_lock() {
+  if (!MemTracker::enabled()) {
+    tty->print_cr("Skipped");
+    return;
+  }
+  PrintNMTStatistics = true;
+  {
+    ChunkPoolLocker cpl;
+    char* mem = (char*)os::malloc(100, mtTest);
+    memset(mem - 16, 0, 100 + 16 + 2);
+    os::free(mem);
+  }
+}
+DEFINE_TEST(test_chunkpool_lock, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
///////
We can add it to the tests if you find it useful.
Thank you Afshin for the test. I'll add it.
/contributor add @afshin-zafari
Thank you for looking into this! Looks good.
A couple of nitty suggestions but nothing essential.
Thanks for reviewing David and Paul.
@coleenp This pull request has not yet been marked as ready for integration.
Thanks for the tweaks
Thank you Coleen, for taking this PR.
All good.
  ls = ChunkPoolLocker::LockStrategy::Try;
}
ChunkPoolLocker cpl(ls);
ms = MallocMemorySummary::as_snapshot();
Preexisting:
The MMS::as_snapshot() call just returns the pointer to the snapshot structure and does not update or access anything there. The lifetime of the ChunkPoolLocker cpl should be the whole body of the function.
I don't think this should change with this PR. It could be that the lock is needed to gather the chunk pool information but the NMT reporting and subsequent adjustments should only be local to NMT and not lock the chunk pool. I'll leave this to another CR to investigate further.
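One possible shape of the narrower locking floated in the reply above is to copy the pool data while holding the lock and to do all formatting afterwards. The sketch below is standalone C++ under stated assumptions, not NMT code: `PoolStats`, `g_pool_mutex`, `snapshot_pool_stats` and `format_report` are hypothetical names, and whether such a split is safe for NMT is exactly what the follow-up CR would need to establish.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>

// Hypothetical summary of the chunk pool state.
struct PoolStats {
  std::size_t chunks;
  std::size_t bytes;
};

static std::mutex g_pool_mutex;         // stand-in for the chunk pool lock
static PoolStats  g_pool_stats{0, 0};   // state protected by g_pool_mutex

// Gather: the lock is held only while the data is copied.
PoolStats snapshot_pool_stats() {
  std::lock_guard<std::mutex> guard(g_pool_mutex);
  return g_pool_stats;
}

// Report: works on the private copy, so no pool lock is held while formatting.
std::string format_report() {
  const PoolStats s = snapshot_pool_stats();
  return "chunks=" + std::to_string(s.chunks) +
         " bytes=" + std::to_string(s.bytes);
}
```

With this split, an error raised while formatting the report would not find the pool lock held by the reporting code itself.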
OK
 private:
  bool _locked;
 public:
  ChunkPoolLocker(LockStrategy ls = LockStrategy::Lock);
If the LockStrategy is defaulted to Lock, then all the instances of this lock used in ChunkPool's cleaning functions (return_to_pool, take_from_pool, prune and deallocate_chunk) would try to lock this explicitly. So, when any of these is called while NMT is reporting (and has acquired the lock), we have a deadlock again.
This isn't the problem that we've seen though. These shouldn't be called during error reporting explicitly like the NMT code. The NMT code is reporting the error while holding the lock, thus needing the lock to be taken again.
OK.
Thank you for reviewing and your comments, Afshin.
Thank you for the reviews and test, Afshin, David and Paul.
Going to push as commit 3fdb15f.
Your commit was automatically rebased without conflicts.
This change disables recursive locking for the ChunkPoolLocker during error handling for NMT callers. The patch was written by Johan as an alternative to supporting another recursive locker for this lock.
Tested with tier1-4, tier5 on aarch64 (product and debug).
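The idea in the description, a scoped locker that either blocks or merely tries depending on a caller-supplied strategy, can be sketched in standalone C++ as follows. Only the `LockStrategy` enum with `Lock` and `Try`, the `_locked` member and the defaulted constructor argument come from the snippets quoted in this review; `SimpleLock`, `g_pool_lock`, `PoolLocker` and the method bodies are assumptions made for illustration, not the HotSpot implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical non-recursive lock standing in for the chunk pool mutex.
// An atomic flag keeps a failed try_lock() by the holding thread well defined.
class SimpleLock {
  std::atomic<bool> _held{false};
 public:
  bool try_lock() { return !_held.exchange(true, std::memory_order_acquire); }
  void lock()     { while (!try_lock()) { /* spin */ } }
  void unlock()   { _held.store(false, std::memory_order_release); }
  bool is_held() const { return _held.load(std::memory_order_relaxed); }
};

static SimpleLock g_pool_lock;

class PoolLocker {
 public:
  enum class LockStrategy { Lock, Try };
 private:
  bool _locked;  // true only if this object acquired the lock
 public:
  explicit PoolLocker(LockStrategy ls = LockStrategy::Lock) : _locked(false) {
    if (ls == LockStrategy::Lock) {
      g_pool_lock.lock();                  // normal path: block until acquired
      _locked = true;
    } else {
      _locked = g_pool_lock.try_lock();    // error path: never block
    }
  }
  ~PoolLocker() {
    if (_locked) g_pool_lock.unlock();     // never release a lock we do not own
  }
  bool locked() const { return _locked; }

  PoolLocker(const PoolLocker&) = delete;
  PoolLocker& operator=(const PoolLocker&) = delete;
};
```

A caller that is reporting an error while the lock is already held on its own stack would construct `PoolLocker(PoolLocker::LockStrategy::Try)`, find `locked()` false, and continue without blocking; the outer locker still releases the lock when it goes out of scope.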
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27869/head:pull/27869
$ git checkout pull/27869
Update a local copy of the PR:
$ git checkout pull/27869
$ git pull https://git.openjdk.org/jdk.git pull/27869/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 27869
View PR using the GUI difftool:
$ git pr show -t 27869
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27869.diff