
Conversation

coleenp
Contributor

@coleenp coleenp commented Oct 17, 2025

This change avoids recursive locking by the ChunkPoolLocker during error handling for NMT callers. The patch was written by Johan as an alternative to supporting another recursive locker for this lock.
Tested with tier1-4, tier5 on aarch64 (product and debug).
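
For context, a minimal sketch of the idea, simplified from the hunks quoted in the review threads below. The class shape (StackObj base), the GlobalChunkPoolMutex accessor and its lock()/try_lock()/unlock() calls are assumptions for illustration, not the exact code in the patch:

// Sketch only (not the actual patch): a ChunkPoolLocker that can fall back to
// try_lock() so that NMT error reporting does not self-deadlock when the
// reporting thread already holds the chunk pool lock.
class ChunkPoolLocker : public StackObj {
 public:
  enum class LockStrategy { Lock, Try };
 private:
  bool _locked;
 public:
  ChunkPoolLocker(LockStrategy ls = LockStrategy::Lock) : _locked(false) {
    if (ls == LockStrategy::Try) {
      // Only take the lock if it is free; otherwise continue unlocked
      // rather than blocking forever during error reporting.
      _locked = GlobalChunkPoolMutex->try_lock();   // assumed accessor/API
    } else {
      GlobalChunkPoolMutex->lock();                 // assumed accessor/API
      _locked = true;
    }
  }
  ~ChunkPoolLocker() {
    if (_locked) {
      GlobalChunkPoolMutex->unlock();
    }
  }
};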


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8369622: GlobalChunkPoolMutex is recursively locked during error handling (Bug - P3)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27869/head:pull/27869
$ git checkout pull/27869

Update a local copy of the PR:
$ git checkout pull/27869
$ git pull https://git.openjdk.org/jdk.git pull/27869/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27869

View PR using the GUI difftool:
$ git pr show -t 27869

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27869.diff

Using Webrev

Link to Webrev Comment

@coleenp
Contributor Author

coleenp commented Oct 17, 2025

/contributor add @jdksjolen

@bridgekeeper

bridgekeeper bot commented Oct 17, 2025

👋 Welcome back coleenp! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Oct 17, 2025

@coleenp This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8369622: GlobalChunkPoolMutex is recursively locked during error handling

Co-authored-by: Johan Sjölen <[email protected]>
Co-authored-by: Afshin Zafari <[email protected]>
Reviewed-by: dholmes, azafari, phubner

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 127 new commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk

openjdk bot commented Oct 17, 2025

@coleenp
Contributor Johan Sjölen <[email protected]> successfully added.

@openjdk

openjdk bot commented Oct 17, 2025

@coleenp The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@coleenp coleenp marked this pull request as ready for review October 17, 2025 20:09
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 17, 2025
@mlbridge

mlbridge bot commented Oct 17, 2025

Webrevs

Member

@dholmes-ora dholmes-ora left a comment

This is the same kind of strategy used by ZGC. It seems a good idiom to use for dealing with error reporting.

Looks good.

Thanks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 19, 2025
Comment on lines +68 to +71
ChunkPoolLocker::LockStrategy ls = ChunkPoolLocker::LockStrategy::Lock;
if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
ls = ChunkPoolLocker::LockStrategy::Try;
}
Member

Thinking more, we could simply always do this check in the constructor and do away with the "strategy" flag altogether. Arguably this would be reasonable behaviour for every MutexLocker (though it may slow things down a little).
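
For comparison, a rough sketch of that alternative (hypothetical only; the PR keeps the explicit strategy flag, and the mutex accessor is assumed as in the earlier sketch):

// Hypothetical variant: no LockStrategy parameter; every locker checks for
// error reporting itself and falls back to try_lock() in that case.
ChunkPoolLocker::ChunkPoolLocker() : _locked(false) {
  if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
    _locked = GlobalChunkPoolMutex->try_lock();   // assumed accessor/API
  } else {
    GlobalChunkPoolMutex->lock();
    _locked = true;
  }
}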

Contributor Author

I had a version that did this, but Johan was worried about the global behavior change, so we wanted to limit it to just NMT reporting on error to be safe.

Member

Okay. Worth having a discussion whether all "lockers" should adopt this error reporting behaviour.

Contributor Author

Yeah. Not sure about that for this lock or in general. Right now it's ad hoc.

Contributor

@afshin-zafari afshin-zafari left a comment

Thank you for taking this PR.
I could reproduce the deadlock with the following change:

--- a/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
+++ b/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
@@ -23,6 +23,7 @@
  */
 
 #include "memory/allocation.hpp"
+#include "memory/arena.hpp"
 #include "nmt/memTracker.hpp"
 #include "runtime/os.hpp"
 #include "sanitizers/address.hpp"
@@ -142,6 +143,21 @@ DEFINE_TEST(test_corruption_on_realloc_growing, COMMON_NMT_HEAP_CORRUPTION_MESSA
 static void test_corruption_on_realloc_shrinking()  { test_corruption_on_realloc(0x11, 0x10); }
 DEFINE_TEST(test_corruption_on_realloc_shrinking, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
 
+
+static void test_chunkpool_lock() {
+  if (!MemTracker::enabled()) {
+    tty->print_cr("Skipped");
+    return;
+  }
+  PrintNMTStatistics = true;
+  {
+    ChunkPoolLocker cpl;
+    char* mem = (char*)os::malloc(100, mtTest);
+    memset(mem - 16, 0, 100 + 16 + 2);
+    os::free(mem);
+  }
+}
+DEFINE_TEST(test_chunkpool_lock, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
 ///////

We can add it to the tests if you find it useful.

@coleenp
Contributor Author

coleenp commented Oct 21, 2025

Thank you Afshin for the test. I'll add it.

@coleenp
Contributor Author

coleenp commented Oct 21, 2025

/contributor add @afshin-zafari

@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 21, 2025
@openjdk

openjdk bot commented Oct 21, 2025

@coleenp
Contributor Afshin Zafari <[email protected]> successfully added.

Member

@Arraying Arraying left a comment

Thank you for looking into this! Looks good.

Member

@dholmes-ora dholmes-ora left a comment

A couple of nitty suggestions but nothing essential.

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 22, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 22, 2025
@coleenp
Contributor Author

coleenp commented Oct 22, 2025

Thanks for reviewing, David and Paul.
/integrate

@openjdk

openjdk bot commented Oct 22, 2025

@coleenp This pull request has not yet been marked as ready for integration.

Member

@dholmes-ora dholmes-ora left a comment

Thanks for the tweaks

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 23, 2025
Contributor

@afshin-zafari afshin-zafari left a comment

Thank you, Coleen, for taking this PR.
All good.

ls = ChunkPoolLocker::LockStrategy::Try;
}
ChunkPoolLocker cpl(ls);
ms = MallocMemorySummary::as_snapshot();
Contributor

Preexisting:
MallocMemorySummary::as_snapshot() just returns a pointer to the snapshot structure and does not update or access anything there. The lifetime of the ChunkPoolLocker cpl should be the whole body of the function.
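
Purely to illustrate that scoping point (the function and parameter names here are hypothetical, not the actual reporting code):

// Illustrative sketch: hold the ChunkPoolLocker for the whole function body,
// not just around the call that fetches the snapshot pointer.
void report_malloc_summary(outputStream* out, ChunkPoolLocker::LockStrategy ls) {  // hypothetical
  ChunkPoolLocker cpl(ls);                         // stays locked until the end of the function
  auto* ms = MallocMemorySummary::as_snapshot();   // just returns a pointer to the snapshot
  // ... walk the snapshot and print to out while the lock is still held ...
}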

Contributor Author

I don't think this should change with this PR. It could be that the lock is needed to gather the chunk pool information, but the NMT reporting and subsequent adjustments should only be local to NMT and not lock the chunk pool. I'll leave this to another CR to investigate further.

Contributor

OK

private:
bool _locked;
public:
ChunkPoolLocker(LockStrategy ls = LockStrategy::Lock);
Contributor

If LockStrategy is defaulted to Lock, then all the instances of this locker used in ChunkPool's cleaning functions (return_to_pool, take_from_pool, prune and deallocate_chunk) would take the lock unconditionally. So, when any of these is called while NMT is reporting (and has already acquired the lock), we have a deadlock again.

Contributor Author

This isn't the problem that we've seen, though. These shouldn't be called explicitly during error reporting the way the NMT code is. The NMT code is reporting the error while holding the lock, thus needing the lock to be taken again.

Contributor

OK.

Contributor Author

@coleenp coleenp left a comment

Thank you for your review and comments, Afshin.

@coleenp
Contributor Author

coleenp commented Oct 23, 2025

Thank you for the reviews and test, Afshin, David and Paul.
/integrate

@openjdk

openjdk bot commented Oct 23, 2025

Going to push as commit 3fdb15f.
Since your change was applied there have been 131 commits pushed to the master branch.

Your commit was automatically rebased without conflicts.

@openjdk openjdk bot added the integrated Pull request has been integrated label Oct 23, 2025
@openjdk openjdk bot closed this Oct 23, 2025
@openjdk openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 23, 2025
@openjdk

openjdk bot commented Oct 23, 2025

@coleenp Pushed as commit 3fdb15f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@coleenp coleenp deleted the chunk-pool branch October 23, 2025 13:00