Skip to content

Conversation

jdksjolen
Copy link
Contributor

@jdksjolen jdksjolen commented Oct 12, 2025

The GlobalChunkPoolMutex replaced ThreadCritical in JDK-8356173. In NMT, we can unfortunately reach a recursive locking situation in situations when the VM is crashing and NMT attempts to perform a error report. I suggest that we make this particular mutex recursive, until we can resolve the design issues that causes this.

This PR also replaces the static initialization with DeferredStatic, and fixes the indentation of access modifiers to be consistently 0-indented.

Please note that we do not take the lock in NMT in order to allocate memory, but in order to have a "static picture" of the arena counters.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Integration blocker

 ⚠️ Title mismatch between PR and JBS for issue JDK-8369622

Issue

  • JDK-8369622: GlobalChunkPoolMutex is recursively locked during error handling (Bug - P3) ⚠️ Title mismatch between PR and JBS. ⚠️ Issue is already resolved. Consider making this a "backport pull request" by setting the PR title to Backport <hash> with the hash of the original commit. See Backports.

Reviewers

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/27759/head:pull/27759
$ git checkout pull/27759

Update a local copy of the PR:
$ git checkout pull/27759
$ git pull https://git.openjdk.org/jdk.git pull/27759/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 27759

View PR using the GUI difftool:
$ git pr show -t 27759

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/27759.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Oct 12, 2025

👋 Welcome back jsjolen! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Oct 12, 2025

@jdksjolen This change is no longer ready for integration - check the PR body for details.

@openjdk openjdk bot changed the title 8369622 8369622: GlobalChunkPoolMutex needs to be recursive Oct 12, 2025
@openjdk
Copy link

openjdk bot commented Oct 12, 2025

@jdksjolen The following label will be automatically applied to this pull request:

  • hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@jdksjolen jdksjolen marked this pull request as ready for review October 13, 2025 06:16
@openjdk openjdk bot added the rfr Pull request is ready for review label Oct 13, 2025
@mlbridge
Copy link

mlbridge bot commented Oct 13, 2025

Webrevs

// code and other areas. For many calls, the current thread has not
// been created so we cannot use Mutex.
static PlatformMutex* GlobalChunkPoolMutex = nullptr;
class RecursivePlatformMutex : public PlatformMutex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's already recursive platform Mutex in MutexLocker.hpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh, I didn't know that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you can't use that one because we may not have a current thread and we don't want safepoint interactions for JavaThreads IIUC.

Copy link
Member

@dholmes-ora dholmes-ora Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a RecursiveMutex not RecursivePlatformMutex

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecursiveMutex has safepoint interactions only if the current thread is a JavaThread. It could be modified to have a no-owner sentinel and maybe use os::current_thread() like this equivalent one does.

With this deferred static mechanism, I think it can allocate the semaphore which might not be allowed to be allocated with static linkage.

This code is almost a copy of that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried making RecursiveLock a template but it's sort of a pain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is basically a copy. The reason I went the longer (lines of code-wise) is because I'd prefer it if we didn't spread this implementation around, so having this handmade one in arena.cpp seems like an acceptable trade-off to having the RecursiveLock being templated and so on.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this. Templating RecursiveLock is pretty horrible.

}

ChunkPoolLocker::ChunkPoolLocker() {
assert(GlobalChunkPoolMutex != nullptr, "must be initialized");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an equivalent assertion to replace this with?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's inside of the getter in DeferredStatic (so it's "fixed" by the class abstraction)

// code and other areas. For many calls, the current thread has not
// been created so we cannot use Mutex.
static PlatformMutex* GlobalChunkPoolMutex = nullptr;
class RecursivePlatformMutex : public PlatformMutex {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a shame that we needed to (re)invent yet-another lock class.

Comment on lines 52 to 55
if (_owner == nullptr || t != _owner) {
PlatformMutex::lock();
_owner = t;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per the comment:

// For many calls, the current thread has not
// been created so we cannot use Mutex.

this ownership test won't allow for recursive use in those circumstances. So we have slightly odd semantics here that the mutex is recursive for attached threads but not for un-attached. For this specialised usecase maybe we can tolerate that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how it can work without recursion count ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh! Yep Zhengyu is right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aaah right, we must only unlock on recursion_count == 0. I completely missed this.

Copy link
Contributor

@coleenp coleenp Oct 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thread == nullptr is pre-initialization so is only one thread (the null thread) I think.

Edit: No this doesn't work.

Comment on lines 229 to 250
public:
public:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style guide doesn't mention this but I thought the unofficial preference was for one character indent so that we don't have lines at the same level as the outermost class definition.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style guide doesn't mention it because I don't think we have consistent
usage, and I haven't felt like stirring that pot. (Others are, of course, free
to do so.) Personally I use no-indent unless the surrounding code uses
1sp-indent. And only change an existing 1sp-indent to no-indent if it seems
out of place compared to surrounding code and I happen to be touching
something nearby. (switch/case labels are similarly inconsistent in our code.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, I've been removing one-space indents for years now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my experience that the no-space indent for access modifiers is what we're moving towards. Obviously, this has clearly been done through an informal and ad-hoc process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pity this was never discussed. I loathe having constructs that mean something not being indented. I have been using one-space indent for access modifiers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using one-space indent for access modifiers also, and sometimes find two space indentation which doesn't really bother me that much. 0 space indentation doesn't seem that bothersome either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I don't think the number of spaces for the access modifiers should be part of this change here and in arena.hpp.

Copy link
Member

@stefank stefank left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In NMT, we can unfortunately reach a recursive locking situation in situations when the VM is crashing and NMT attempts to perform a error report. I suggest that we make this particular mutex recursive, until we can resolve the design issues that causes this.

FWIW, in ZGC we deal with similar situations by using this construct instead of a recursive lock:

static bool try_lock_on_error(ZLock* lock) {
  if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
    return lock->try_lock();
  }

  lock->lock();

  return true;
}

void ZPageAllocator::print_usage_on(outputStream* st) const {
  const bool locked = try_lock_on_error(&_lock);

  if (!locked) {
    st->print_cr("<Without lock>");
  }

  // Print information even though we may not have successfully taken the lock.
  // This is thread-safe, but may produce inconsistent results.

  print_total_usage_on(st);

  StreamIndentor si(st, 1);
  print_partition_usage_on(st);

  if (locked) {
    _lock.unlock();
  }
}

I don't know the reason for NMT trying to take the look recursively in the error reporter, but maybe something similar to the above would work without adding a new RecursivePlatformMutex?

Here's another example that simply skips some hs_err logging if the lock is being held (This time with Mutex, though):

void GCLogPrecious::print_on_error(outputStream* st) {
  st->print_cr("GC Precious Log:");

  if (_lines == nullptr) {
    st->print_cr("<Not initialized>\n");
    return;
  }

  if (!_lock->try_lock_without_rank_check()) {
    st->print_cr("<Skipped>\n");
    return;
  }

  if (_lines->size() == 0) {
    st->print_cr("<Empty>\n");
  } else {
    st->print_cr("%s", _lines->base());
  }

  _lock->unlock();
}

@zhengyu123
Copy link
Contributor

What about using os::current_thread_id() for lock owner? it avoids the dependence on the existing of Thread* and more semantically close to the original ThreadCritical.

@afshin-zafari
Copy link
Contributor

I don't know the reason for NMT trying to take the look recursively

NMT report gets the current amounts of allocations for Chunks, so it uses the lock to stop others from changing the values.
The deadlock happens, for example, when ChunkPool::prune() is free-ing a memory that is already corrupted. At os::free() time, NMT detects the corruption and tries to report it while the lock is already taken by ChunkPool.

@jdksjolen
Copy link
Contributor Author

Afshin's description is correct. It would be preferable to have a recursive mutex in this case, imho. It's unfortunate that we have to implement another type of lock.

@jdksjolen
Copy link
Contributor Author

In NMT, we can unfortunately reach a recursive locking situation in situations when the VM is crashing and NMT attempts to perform a error report. I suggest that we make this particular mutex recursive, until we can resolve the design issues that causes this.

FWIW, in ZGC we deal with similar situations by using this construct instead of a recursive lock:

static bool try_lock_on_error(ZLock* lock) {
  if (VMError::is_error_reported() && VMError::is_error_reported_in_current_thread()) {
    return lock->try_lock();
  }

  lock->lock();

  return true;
}

void ZPageAllocator::print_usage_on(outputStream* st) const {
  const bool locked = try_lock_on_error(&_lock);

  if (!locked) {
    st->print_cr("<Without lock>");
  }

  // Print information even though we may not have successfully taken the lock.
  // This is thread-safe, but may produce inconsistent results.

  print_total_usage_on(st);

  StreamIndentor si(st, 1);
  print_partition_usage_on(st);

  if (locked) {
    _lock.unlock();
  }
}

I don't know the reason for NMT trying to take the look recursively in the error reporter, but maybe something similar to the above would work without adding a new RecursivePlatformMutex?

Here's another example that simply skips some hs_err logging if the lock is being held (This time with Mutex, though):

void GCLogPrecious::print_on_error(outputStream* st) {
  st->print_cr("GC Precious Log:");

  if (_lines == nullptr) {
    st->print_cr("<Not initialized>\n");
    return;
  }

  if (!_lock->try_lock_without_rank_check()) {
    st->print_cr("<Skipped>\n");
    return;
  }

  if (_lines->size() == 0) {
    st->print_cr("<Empty>\n");
  } else {
    st->print_cr("%s", _lines->base());
  }

  _lock->unlock();
}

That pattern may work for us, as we do not take the lock often. I'll look into this as an alternative path forward.

Copy link
Contributor

@coleenp coleenp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it can use RecursiveMutex if you allow RecursiveMutex to have a null Thread*.

// code and other areas. For many calls, the current thread has not
// been created so we cannot use Mutex.
static PlatformMutex* GlobalChunkPoolMutex = nullptr;
class RecursivePlatformMutex : public PlatformMutex {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RecursiveMutex has safepoint interactions only if the current thread is a JavaThread. It could be modified to have a no-owner sentinel and maybe use os::current_thread() like this equivalent one does.

With this deferred static mechanism, I think it can allocate the semaphore which might not be allowed to be allocated with static linkage.

This code is almost a copy of that.

Comment on lines 229 to 250
public:
public:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been using one-space indent for access modifiers also, and sometimes find two space indentation which doesn't really bother me that much. 0 space indentation doesn't seem that bothersome either.

Comment on lines 229 to 250
public:
public:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I don't think the number of spaces for the access modifiers should be part of this change here and in arena.hpp.

: PlatformMutex(), _recursions(0), _owner(no_owner_sentinel) {}

void lock() {
intx current = os::current_thread_id();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is okay but not Thread::current() ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this becomes e.g. pthread_self()

Copy link
Contributor

@zhengyu123 zhengyu123 Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it also prevents the problem from the scenarios that having nullptr current thread from the first call, then current thread becomes not nullptr in subsequent recursive calls.

Copy link
Member

@dholmes-ora dholmes-ora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unclear if this particular lock will only encounter a null thread during VM init, or whether it can also be encountered during thread attach? If the former and we really don't need mutual exclusion when the thread is null, then a ConditionalMutexLocker might help with that aspect and we could go back to using Thread::current(). If RecursiveMutex were actually a Mutex then we might have been able to use it in conjunction with a ConditionalMutexLocker. Otherwise this custom implementation seems okay to me now.

Copy link
Contributor

@zhengyu123 zhengyu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@openjdk openjdk bot added the ready Pull request is ready to be integrated label Oct 16, 2025
@openjdk openjdk bot removed the ready Pull request is ready to be integrated label Oct 16, 2025
@jdksjolen
Copy link
Contributor Author

Here's a sketch of Stefan's suggestion, maybe we should go for that instead? I'm vacationing for 2 weeks, so if someone can take over the PR then that'd be highly appreciated.

jdksjolen@80f58d3

@coleenp
Copy link
Contributor

coleenp commented Oct 17, 2025

I like this last suggestion. I'll take it over for you while on vacation.

@coleenp
Copy link
Contributor

coleenp commented Oct 23, 2025

Note that PR #27869 overrides this one. This one should be closed.

@tstuefe
Copy link
Member

tstuefe commented Oct 23, 2025

I like this last suggestion. I'll take it over for you while on vacation.

I like this too.

@openjdk
Copy link

openjdk bot commented Oct 23, 2025

@jdksjolen this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout try-lock
git fetch https://git.openjdk.org/jdk.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict Pull request has merge conflict with target branch label Oct 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

hotspot-runtime [email protected] merge-conflict Pull request has merge conflict with target branch rfr Pull request is ready for review

Development

Successfully merging this pull request may close these issues.

8 participants