8369622: GlobalChunkPoolMutex is recursively locked during error handling #27869

coleenp · 2025-10-17T17:04:58Z

This change avoids recursive locking by the ChunkPoolLocker during error handling for NMT callers. The patch was written by Johan as an alternative to supporting another recursive locker for this lock.
Tested with tier1-4, tier5 on aarch64 (product and debug).
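
A minimal sketch of the idea described above, not the actual patch: an RAII locker that either blocks or only tries the lock, and that unlocks in its destructor only if it really acquired the lock. `SketchMutex`, `SketchLocker`, `g_sketch_mutex` and `nested_try_does_not_block` are hypothetical names for illustration; only the `LockStrategy` values `Lock` and `Try` and the `_locked` flag are taken from the diff quoted in this PR.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for GlobalChunkPoolMutex: a simple non-recursive lock.
class SketchMutex {
  std::atomic<bool> _held{false};
public:
  // Returns true if the lock was free and is now held by the caller.
  bool try_lock() { return !_held.exchange(true, std::memory_order_acquire); }
  void lock()     { while (!try_lock()) { /* spin until free */ } }
  void unlock()   { _held.store(false, std::memory_order_release); }
};

static SketchMutex g_sketch_mutex;

// Hypothetical RAII locker mirroring the shape seen in the diff.
class SketchLocker {
public:
  enum class LockStrategy { Lock, Try };

  explicit SketchLocker(LockStrategy ls = LockStrategy::Lock) : _locked(false) {
    if (ls == LockStrategy::Lock) {
      g_sketch_mutex.lock();                 // normal callers block
      _locked = true;
    } else {
      _locked = g_sketch_mutex.try_lock();   // error-path callers never block
    }
  }
  ~SketchLocker() {
    if (_locked) g_sketch_mutex.unlock();    // only release what we acquired
  }
  SketchLocker(const SketchLocker&) = delete;
  SketchLocker& operator=(const SketchLocker&) = delete;

  bool locked() const { return _locked; }

private:
  bool _locked;
};

// Returns true if a nested Try locker skips the lock instead of deadlocking.
bool nested_try_does_not_block() {
  SketchLocker outer;                                   // blocking acquire
  SketchLocker inner(SketchLocker::LockStrategy::Try);  // same thread, lock already held
  return outer.locked() && !inner.locked();
}
```

In this sketch the inner locker proceeds without the lock when the attempt fails, which trades strict mutual exclusion for forward progress while an error is being reported.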

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8369622: GlobalChunkPoolMutex is recursively locked during error handling (Bug - P3)

Reviewers

David Holmes (@dholmes-ora - Reviewer)
Afshin Zafari (@afshin-zafari - Reviewer)
Paul Hübner (@Arraying - Author)

Contributors

Johan Sjölen <[email protected]>
Afshin Zafari <[email protected]>

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27869/head:pull/27869
$ git checkout pull/27869

Update a local copy of the PR:
$ git checkout pull/27869
$ git pull https://git.openjdk.org/jdk.git pull/27869/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27869

View PR using the GUI difftool:
$ git pr show -t 27869

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27869.diff

Using Webrev

Link to Webrev Comment

coleenp · 2025-10-17T17:05:36Z

/contributor add @jdksjolen

bridgekeeper · 2025-10-17T17:06:42Z

👋 Welcome back coleenp! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-10-17T17:07:11Z

@coleenp This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8369622: GlobalChunkPoolMutex is recursively locked during error handling

Co-authored-by: Johan Sjölen <[email protected]>
Co-authored-by: Afshin Zafari <[email protected]>
Reviewed-by: dholmes, azafari, phubner

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 127 new commits pushed to the master branch:

027aea9: 8370325: G1: Disallow GC for TLAB allocation
ffcb158: 8320677: Printer tests use invalid '@run main/manual=yesno
3e20a93: 8370156: Fix jpackage IconTest
... and 124 more: https://git.openjdk.org/jdk/compare/c9cbd31f8575a25c4decd68dc645378c5ba2bad0...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

openjdk · 2025-10-17T17:07:49Z

@coleenp
Contributor Johan Sjölen <[email protected]> successfully added.

openjdk · 2025-10-17T17:08:32Z

@coleenp The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-10-17T20:13:23Z

Webrevs

dholmes-ora

This is the same kind of strategy used by ZGC. It seems a good idiom to use for dealing with error reporting.

Looks good.

Thanks

dholmes-ora · 2025-10-20T00:03:36Z

src/hotspot/share/nmt/mallocTracker.cpp

+  ChunkPoolLocker::LockStrategy ls = ChunkPoolLocker::LockStrategy::Lock;
+  if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
+    ls = ChunkPoolLocker::LockStrategy::Try;
+  }


Thinking more, we could simply always do this check in the constructor and do away with the "strategy" flag altogether. Arguably this would be reasonable behaviour for every MutexLocker (though it may slow things down a little).

I had a version that did this but Johan was worried about global behavior so wanted to limit it to just NMT reporting on error to be safe.

Okay. Worth having a discussion whether all "lockers" should adopt this error reporting behaviour.

yeah. Not sure about that for this lock or in general. Right now it's ad-hoc.
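
For comparison, the alternative raised in this thread (doing the error check inside the locker's constructor, with no strategy flag) could look roughly like the sketch below. It is an illustration only: `PoolMutex`, `ErrorAwareLocker`, `reenter_during_error` and the thread-local flag `t_reporting_error` are hypothetical, and the flag merely stands in for the `VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()` check shown in the diff.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the VMError checks: true while the current
// thread is reporting an error.
static thread_local bool t_reporting_error = false;

// Hypothetical non-recursive lock.
class PoolMutex {
  std::atomic<bool> _held{false};
public:
  bool try_lock() { return !_held.exchange(true, std::memory_order_acquire); }
  void lock()     { while (!try_lock()) { /* spin until free */ } }
  void unlock()   { _held.store(false, std::memory_order_release); }
};

static PoolMutex g_pool_mutex;

// The locker decides by itself whether to block or to try, so call sites
// do not pass a strategy.
class ErrorAwareLocker {
  bool _locked;
public:
  ErrorAwareLocker() : _locked(false) {
    if (t_reporting_error) {
      _locked = g_pool_mutex.try_lock();  // never block while reporting an error
    } else {
      g_pool_mutex.lock();
      _locked = true;
    }
  }
  ~ErrorAwareLocker() {
    if (_locked) g_pool_mutex.unlock();
  }
  ErrorAwareLocker(const ErrorAwareLocker&) = delete;
  ErrorAwareLocker& operator=(const ErrorAwareLocker&) = delete;

  bool locked() const { return _locked; }
};

// Simulates an error being reported while the lock is already held.
bool reenter_during_error() {
  ErrorAwareLocker outer;        // normal path: blocking acquire
  t_reporting_error = true;      // pretend error reporting starts here
  ErrorAwareLocker inner;        // would self-deadlock with a blocking lock
  bool ok = outer.locked() && !inner.locked();
  t_reporting_error = false;
  return ok;
}
```

The cost noted in the thread is visible here: every construction pays for the extra check, and the behaviour changes for all users of the locker rather than only the NMT callers.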

src/hotspot/share/memory/arena.cpp

src/hotspot/share/memory/arena.hpp

afshin-zafari

Thank you for taking this PR.
I could reproduce the deadlock with this change:

--- a/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
+++ b/test/hotspot/gtest/nmt/test_nmt_buffer_overflow_detection.cpp
@@ -23,6 +23,7 @@
  */
 
 #include "memory/allocation.hpp"
+#include "memory/arena.hpp"
 #include "nmt/memTracker.hpp"
 #include "runtime/os.hpp"
 #include "sanitizers/address.hpp"
@@ -142,6 +143,21 @@ DEFINE_TEST(test_corruption_on_realloc_growing, COMMON_NMT_HEAP_CORRUPTION_MESSA
 static void test_corruption_on_realloc_shrinking()  { test_corruption_on_realloc(0x11, 0x10); }
 DEFINE_TEST(test_corruption_on_realloc_shrinking, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
 
+
+static void test_chunkpool_lock() {
+  if (!MemTracker::enabled()) {
+    tty->print_cr("Skipped");
+    return;
+  }
+  PrintNMTStatistics = true;
+  {
+    ChunkPoolLocker cpl;
+    char* mem = (char*)os::malloc(100, mtTest);
+    memset(mem - 16, 0, 100 + 16 + 2);
+    os::free(mem);
+  }
+}
+DEFINE_TEST(test_chunkpool_lock, COMMON_NMT_HEAP_CORRUPTION_MESSAGE_PREFIX);
 ///////

We can add it to the tests if you found it useful.
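
The failure mode that the test provokes can be sketched outside HotSpot as follows. This is an assumption-laden illustration, not HotSpot code: `OwnedLock` and `reporting_would_self_deadlock` are hypothetical, and the owner tracking exists only so the sketch can show that the thread wanting the lock is the one already holding it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical non-recursive lock that remembers which thread holds it.
class OwnedLock {
  std::atomic<bool> _held{false};
  std::atomic<std::thread::id> _owner{std::thread::id()};
public:
  bool try_lock() {
    if (_held.exchange(true, std::memory_order_acquire)) {
      return false;  // already held, possibly by the calling thread
    }
    _owner.store(std::this_thread::get_id(), std::memory_order_relaxed);
    return true;
  }
  void unlock() {
    _owner.store(std::thread::id(), std::memory_order_relaxed);
    _held.store(false, std::memory_order_release);
  }
  // True if the calling thread is the current holder.
  bool owned_by_self() const {
    return _held.load(std::memory_order_acquire) &&
           _owner.load(std::memory_order_relaxed) == std::this_thread::get_id();
  }
};

// Mimics the shape of the reproducer: take the lock, then reach an
// "error reporting" step on the same thread that wants the same lock.
bool reporting_would_self_deadlock() {
  OwnedLock lock;
  bool acquired = lock.try_lock();                  // like `ChunkPoolLocker cpl;`
  bool would_block_forever = lock.owned_by_self();  // a blocking re-lock would never return
  bool retry = lock.try_lock();                     // a try-lock simply fails instead
  lock.unlock();
  return acquired && would_block_forever && !retry;
}
```

A blocking acquire at the second step would wait on a lock that only the waiting thread can release, which is why a try-lock on the error path avoids the hang.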

coleenp · 2025-10-21T12:33:11Z

Thank you Afshin for the test. I'll add it.

coleenp · 2025-10-21T13:45:19Z

/contributor add @afshin-zafari

openjdk · 2025-10-21T13:46:41Z

@coleenp
Contributor Afshin Zafari <[email protected]> successfully added.

Arraying

Thank you for looking into this! Looks good.

dholmes-ora

A couple of nitty suggestions but nothing essential.

src/hotspot/share/memory/arena.cpp

src/hotspot/share/memory/arena.hpp

dholmes-ora · 2025-10-22T07:52:56Z

src/hotspot/share/nmt/mallocTracker.cpp

+  ChunkPoolLocker::LockStrategy ls = ChunkPoolLocker::LockStrategy::Lock;
+  if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
+    ls = ChunkPoolLocker::LockStrategy::Try;
+  }


Okay. Worth having a discussion whether all "lockers" should adopt this error reporting behaviour.

coleenp · 2025-10-22T13:20:24Z

Thanks for reviewing David and Paul.
/integrate

openjdk · 2025-10-22T13:21:21Z

@coleenp This pull request has not yet been marked as ready for integration.

dholmes-ora

Thanks for the tweaks

afshin-zafari

Thank you Coleen, for taking this PR.
All good.

afshin-zafari · 2025-10-21T11:40:52Z

src/hotspot/share/nmt/nmtUsage.cpp

+      ls = ChunkPoolLocker::LockStrategy::Try;
+    }
+    ChunkPoolLocker cpl(ls);
    ms = MallocMemorySummary::as_snapshot();


Preexisting:
MMS::as_snapshot() just returns the pointer to the snapshot structure and does not update or access anything there. The lifetime of the ChunkPoolLocker cpl should be the whole body of the function.

coleenp · 2025-10-23T11:20:30Z

I don't think this should change with this PR. It could be that the lock is needed to gather the chunk pool information but the NMT reporting and subsequent adjustments should only be local to NMT and not lock the chunk pool. I'll leave this to another CR to investigate further.

afshin-zafari · 2025-10-21T11:48:58Z

src/hotspot/share/memory/arena.hpp

+private:
+  bool _locked;
+public:
+  ChunkPoolLocker(LockStrategy ls = LockStrategy::Lock);


If the LockStrategy is defaulted to Lock, then all the instances of this lock used in ChunkPool's cleaning functions (return_to_pool, take_from_pool, prune and deallocate_chunk) would try to lock this explicitly. So, when any of these is called while NMT is reporting (having acquired the lock), we have a deadlock again.

coleenp · 2025-10-23T11:15:20Z

This isn't the problem that we've seen though. These shouldn't be called during error reporting explicitly like the NMT code. The NMT code is reporting the error while holding the lock, thus needing the lock to be taken again.

coleenp

Thank you for reviewing and your comments, Afshin.

coleenp · 2025-10-23T11:45:03Z

Thank you for the reviews and test, Afshin, David and Paul.
/integrate

openjdk · 2025-10-23T11:46:04Z

Going to push as commit 3fdb15f.
Since your change was applied there have been 131 commits pushed to the master branch:

5a83d6a: 8370406: Parallel: Refactor ParCompactionManager::mark_and_push
da968dc: 8370227: Migrate micros-javac benchmarks from jmh-jdk-microbenchmarks
aec1388: 8313770: jdk/internal/platform/docker/TestSystemMetrics.java fails on Ubuntu
... and 128 more: https://git.openjdk.org/jdk/compare/c9cbd31f8575a25c4decd68dc645378c5ba2bad0...master

Your commit was automatically rebased without conflicts.

openjdk · 2025-10-23T11:46:12Z

@coleenp Pushed as commit 3fdb15f.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

coleenp added 2 commits October 17, 2025 11:52

8369622: GlobalChunkPoolMutex is recursively locked during error handling

03d769e

Fix compilation errors.

9a27e59

openjdk bot added the hotspot-runtime [email protected] label Oct 17, 2025

coleenp marked this pull request as ready for review October 17, 2025 20:09

openjdk bot added the rfr Pull request is ready for review label Oct 17, 2025

dholmes-ora approved these changes Oct 19, 2025

openjdk bot added the ready Pull request is ready to be integrated label Oct 19, 2025

dholmes-ora reviewed Oct 20, 2025

Arraying reviewed Oct 21, 2025

src/hotspot/share/memory/arena.cpp

src/hotspot/share/memory/arena.hpp Outdated

afshin-zafari reviewed Oct 21, 2025

Add assert, fix access modifiers and add Afshin's test.

9f1525d

openjdk bot removed the ready Pull request is ready to be integrated label Oct 21, 2025

Arraying approved these changes Oct 21, 2025

dholmes-ora approved these changes Oct 22, 2025

openjdk bot added the ready Pull request is ready to be integrated label Oct 22, 2025

Small things.

dabd208

openjdk bot removed the ready Pull request is ready to be integrated label Oct 22, 2025

Arraying approved these changes Oct 22, 2025

dholmes-ora approved these changes Oct 23, 2025

openjdk bot added the ready Pull request is ready to be integrated label Oct 23, 2025

afshin-zafari approved these changes Oct 23, 2025

coleenp mentioned this pull request Oct 23, 2025

8369622: GlobalChunkPoolMutex needs to be recursive #27759

Open

3 tasks

coleenp commented Oct 23, 2025

openjdk bot added the integrated Pull request has been integrated label Oct 23, 2025

openjdk bot closed this Oct 23, 2025

openjdk bot removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Oct 23, 2025

coleenp deleted the chunk-pool branch October 23, 2025 13:00
