8359820: Improve handshake/safepoint timeout diagnostic messages #26309

toxaart · 2025-07-15T08:37:38Z

Hi, please consider the following changes:

The problem in the issue description is not a problem by itself: the behavior is expected, but it is somewhat difficult to find out what caused the SIGILL to be fired.

We propagate this information from handshake::handle_timeout() to VMError::report() with the help of a global variable. The same mechanism is used to address a similar issue in the safepoint timeout handler.
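
As a rough illustration of the mechanism described above, the following standalone sketch records the address of the timed-out thread in a global before the signal is sent, and lets the error reporter compare it with the current thread. The names `Thread`, `note_handshake_timeout`, `describe_sigill` and the message text are assumptions made for this example, not the identifiers used in the patch.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a VM thread; not HotSpot's Thread class.
struct Thread { int id; };

// Address of the thread the timeout handler signalled, or 0 if none.
static std::atomic<std::intptr_t> g_handshake_timed_out_thread{0};

// Called by the (sketched) timeout handler right before it sends SIGILL.
void note_handshake_timeout(const Thread* target) {
  g_handshake_timed_out_thread.store(reinterpret_cast<std::intptr_t>(target));
}

// Called by the (sketched) error reporter on the thread that got SIGILL.
// Only the recorded target gets the extra annotation.
std::string describe_sigill(const Thread* current) {
  const std::intptr_t recorded = g_handshake_timed_out_thread.load();
  if (recorded != 0 && recorded == reinterpret_cast<std::intptr_t>(current)) {
    return "SIGILL (sent by handshake timeout handler)";
  }
  return "SIGILL";
}
```

A thread that receives SIGILL for some unrelated reason does not match the recorded address, so it keeps the plain message.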

Tested in tiers 1-3.

Progress

Change must be properly reviewed (1 review required, with at least 1 Reviewer)
Change must not contain extraneous whitespace
Commit message must refer to an issue

Issue

JDK-8359820: Improve handshake/safepoint timeout diagnostic messages (Bug - P4)

Reviewers

Thomas Stuefe (@tstuefe - Reviewer)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26309/head:pull/26309
$ git checkout pull/26309

Update a local copy of the PR:
$ git checkout pull/26309
$ git pull https://git.openjdk.org/jdk.git pull/26309/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26309

View PR using the GUI difftool:
$ git pr show -t 26309

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26309.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-07-15T08:38:17Z

👋 Welcome back toxaart! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-07-15T08:39:04Z

@toxaart This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

8359820: Improve handshake/safepoint timeout diagnostic messages

Reviewed-by: stuefe

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been 19 new commits pushed to the master branch:

9dc6282: 8362169: Pointer passed to upcall may get wrong scope
6949e34: 8362592: Remove unused argument in nmethod::oops_do
7da274d: 8361961: Typo in ProtectionDomain.implies
... and 16 more: https://git.openjdk.org/jdk/compare/18190519e73705281adf3f94d710d000e75b1729...master

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@dholmes-ora, @tstuefe) but any other Committer may sponsor as well.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

openjdk · 2025-07-15T08:39:37Z

@toxaart The following label will be automatically applied to this pull request:

hotspot-runtime

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

mlbridge · 2025-07-15T10:55:52Z

Webrevs

dholmes-ora · 2025-07-16T02:06:21Z

@toxaart I'm really looking for something in the fatal error handler so that instead of seeing just:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by kill), pid=329828, tid=329852
#
# JRE version: Java(TM) SE Runtime Environment (26.0+3) (fastdebug build 26-ea+3-153)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (fastdebug 26-ea+3-153, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C [libc.so.6+0x98d71]

there is something that indicates it was a handshake timeout. E.g.

# SIGILL (0x4) at pc=0x00007d9dc5a98d71 (sent by handshake timeout handler), pid=329828, tid=329852

We may need the handshake code to set a flag on the target Thread that the error code can query if it sees a SIGILL.

openjdk · 2025-07-17T08:54:42Z

⚠️ @toxaart This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

src/hotspot/share/runtime/handshake.cpp

src/hotspot/share/utilities/vmError.cpp

src/hotspot/share/utilities/globalDefinitions.hpp

src/hotspot/share/utilities/vmError.cpp

tstuefe · 2025-07-18T08:15:36Z

BTW, for artificially generated signals we already have a clear indication in hs_err files. We print the sigaction structure associated with the signal.

e.g.

 siginfo: si_signo: 4 (SIGILL), si_code: 0 (SI_USER), si_pid: 13281, si_uid: 1027

SI_USER => sent via kill command or pthread_kill
si_pid = sending process or thread id
si_uid = sending user (in case of outside process)

See also: https://pubs.opengroup.org/onlinepubs/007904875/functions/sigaction.html

I have nothing against making this clearer, just saying that the info is already kind of there.
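
For readers unfamiliar with these fields, here is a minimal POSIX sketch of how a handler installed with SA_SIGINFO can read si_code and si_pid. It is not HotSpot code, and all function names are made up for the example. It sends the signal with kill() so that the SI_USER path is exercised; on Linux a signal sent with pthread_kill is typically reported with si_code SI_TKILL instead.

```cpp
#include <signal.h>
#include <unistd.h>

// Fields copied out of siginfo_t by the handler (illustrative globals).
static volatile sig_atomic_t g_seen = 0;
static volatile sig_atomic_t g_si_code = 0;
static volatile sig_atomic_t g_si_pid = 0;

// SA_SIGINFO-style handler: does only async-signal-safe work.
static void record_siginfo(int /*signo*/, siginfo_t* info, void* /*ucontext*/) {
  g_si_code = info->si_code;
  g_si_pid = static_cast<sig_atomic_t>(info->si_pid);
  g_seen = 1;
}

// Install the recording handler for `signo`; returns true on success.
bool install_recorder(int signo) {
  struct sigaction sa {};
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = SA_SIGINFO;
  sa.sa_sigaction = record_siginfo;
  return sigaction(signo, &sa, nullptr) == 0;
}

// Send `signo` to our own process with kill() and wait (bounded) for delivery.
bool signal_self_and_wait(int signo) {
  g_seen = 0;
  if (kill(getpid(), signo) != 0) return false;
  for (int i = 0; i < 1000 && !g_seen; ++i) usleep(1000);
  return g_seen != 0;
}

// True if the last recorded signal was generated by kill() or similar.
bool last_signal_was_user_sent() { return g_seen && g_si_code == SI_USER; }

// Process id reported as the sender of the last recorded signal.
long last_sender_pid() { return g_si_pid; }
```

Note that si_pid identifies the sending process, not the sending thread, so siginfo alone cannot say which VM thread sent the signal.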

dholmes-ora · 2025-07-18T09:09:39Z

> I have nothing against making this clearer, just saying that the info is already kind of there.

Yes, but there are a few reasons a SIGILL may have been fired at a thread, so it would be useful to get a clear indication of exactly why it happened. Not everyone will realize/know that SIGILL happens because of a timeout in handshake/safepoint.

tstuefe · 2025-07-18T09:53:33Z

> > I have nothing against making this clearer, just saying that the info is already kind of there.

> Yes, but there are a few reasons a SIGILL may have been fired at a thread, so it would be useful to get a clear indication of exactly why it happened. Not everyone will realize/know that SIGILL happens because of a timeout in handshake/safepoint.

I still don't get it.

Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?
Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporarily sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?
Is the SIGILL completely unrelated to the safepoint? Then why print the information?

tstuefe · 2025-07-18T09:59:29Z

To clarify, I am concerned about misleading printouts. Signal issues are difficult to analyze. Signals can interleave, overlap, get lost, or be blocked. I think it's important to be precise.

For example, saving global information A, sending a signal, and then printing "signal X relates to global information A" in the handler can be wrong. The signal can have some other cause.

It would be good to save both the sender's and the recipient's thread id in the global information before sending the signal, then in the signal handler to correlate this with what sigaction_t says about who sent the signal, and with the receiving thread's thread id.

That way we can be sure that, yes, X sent a signal A to Y, I am Y, I got signal A from X, so this is safe information I could now print. It would make investigating issues like the one motivating this change a lot easier, especially when multiple signals are involved.
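
The correlation idea above could look roughly like the sketch below. It is a standalone illustration under stated assumptions: the function names are hypothetical, thread identities are plain integers, and `reported_sender` stands for whatever sender identity the handler is able to obtain (siginfo itself only carries a process id).

```cpp
#include <atomic>
#include <cstdint>

// Illustrative global record of the signal that is about to be sent.
// The recipient is stored last and acts as the "published" marker.
static std::atomic<std::intptr_t> g_sender{0};
static std::atomic<int> g_signo{0};
static std::atomic<std::intptr_t> g_recipient{0};

// Sender side: publish who signals whom, and with which signal, before sending.
void publish_pending_signal(std::intptr_t sender, std::intptr_t recipient, int signo) {
  g_sender.store(sender, std::memory_order_relaxed);
  g_signo.store(signo, std::memory_order_relaxed);
  g_recipient.store(recipient, std::memory_order_release);
}

// Receiver side: print the extra information only if every piece matches,
// i.e. "X sent signal A to Y, I am Y, I got signal A from X".
bool is_expected_signal(std::intptr_t self, std::intptr_t reported_sender, int signo) {
  if (g_recipient.load(std::memory_order_acquire) != self) return false;
  return g_signo.load(std::memory_order_relaxed) == signo &&
         g_sender.load(std::memory_order_relaxed) == reported_sender;
}
```

Any mismatch (different receiver, different sender, different signal number) leaves the report unannotated rather than risking a misleading message.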

toxaart · 2025-07-18T10:46:45Z

> Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?

Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS.

> Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporarily sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?

I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread. However, it may happen that any other thread receives SIGILL for any other reason within that time interval. But the "busy" thread is already in the "communicative" variable, which will not be the signal receiver in this particular case. I do not really know if this situation is just hypothetical or ever occurred in practice.

> Is the SIGILL completely unrelated to the safepoint? Then why print the information?

No, it is intentionally fired by the timeout handler. Quote from Mr. Shipilev, see the issue discussion: "The intent for SIGILL is to trigger the crash at the thread that blocks handshake/safepoint sync. E.g. a Java thread that is stuck on miscompiled loop without safepoint checks. Or some VM code that spins without VM transitions. See JDK-8219584. This feature is remarkably useful in the field, used this dozens of times. So whatever we do, we need to keep printing the instructions block and hopefully a backtrace."

dholmes-ora · 2025-07-18T12:22:58Z

Not sure what is so hard to understand here @tstuefe . A thread is hit with a SIGILL and we report that now, but we don't report why it was hit with the SIGILL. If there were only one reason (like it executed an illegal instruction) then it would be obvious, but we have hijacked SIGILL as a generic "something happened" signal. So the proposal here is to record the identity of the thread being sent a SIGILL due to a handshake or safepoint timeout, so that when that thread responds to the SIGILL it can see that is why it got it and report that fact. If a different thread also got a SIGILL for a different reason we don't want it reporting it was due to the timeout mechanism.

tstuefe · 2025-07-18T13:22:02Z

@toxaart Thank you for laying it out to me.

So SDE slowed things down, we did not reach the safepoint, the timeout mechanism fired a SIGILL to the slow thread, the slow thread was not fast enough to end the JVM, and the sending thread then executed the fatal("Handshake timeout")?

Which thread won the hs-err pid writing? Was (A) the winner, and it started error handling? And you maybe saw a "Thread XXX also had an error" line from the sending thread?

Or (B) did the slow thread not even get the signal, and the hs_err file you got was from the fatal("Handshake timeout") in the sending thread?

(B) would be odd; a signal sent with pthread_kill had not been delivered to the target thread for three seconds :-( Is signal delivery on SDE broken, or is it just really slow?

In any case, I think I understand now that you try to improve the hs_err printout for case (A), right? If so, sure, that makes sense.

What confused me was your printout timed out thread XXXXX, which suggested to me that the receiving thread could be a different one from the one that should have been interrupted. But I missed the if (handshakeTimedOutThread == p2i(_thread)) condition right above that.

> > Is the receiving thread the one meant to receive the SIGILL? Then why print this at all, we have callstack and thread info already?

> Yes, the receiving thread is the one to receive the SIGILL. I agree that the changes introduce a degree of redundancy, but it is difficult to see by looking at the thread callstack that it was killed by the timeout mechanism of the handshake. I found it by looking at events log, see the discussion in JBS.

No problem, it's fine to make things clearer.

> > Is the receiving thread not the originally intended recipient? But how? This can only happen either if the original recipient thread blocked - which we don't do in hotspot code AFAIK, so it could only be a library method that temporarily sets a signal mask - or if there is a bug in the sending code - in which case we should fix it?

> I think I already described a possible situation: if the receiver does not report the crash within 3 seconds, then a fatal error will be reported by the calling thread.

To help with this case, I suggest a simple addition in handshake.cpp:

-  fatal("Handshake timeout");
+  fatal("Thread " PTR_FORMAT " has not cleared handshake op: " PTR_FORMAT ", then failed to terminate JVM",
+        p2i(target), p2i(op));
 }

which will show a clearer message at the start of the hs-err file, in case we don't have the VM output.

> However, it may happen that any other thread receives SIGILL for any other reason within that time interval. But the "busy" thread is already in the "communicative" variable, which will not be the signal receiver in this particular case. I do not really know if this situation is just hypothetical or ever occurred in practice.

Yes, it could happen. The mechanism could be improved by storing the fact that a SIGILL has been sent to thread X not in a global variable but in the Thread structure of X. Then, in VMError, one checks if the current thread had been the target of a recent pthread_kill, and only write "sent by xxx" in that case. I ignore here the possible case of multiple senders one receiver, because I think that is extremely unlikely.

Not saying you have to do this. Can also be done in a later RFE.
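
A minimal sketch of that per-thread variant, assuming a hypothetical `Thread` structure with one extra field. The enum, the function names and the annotation strings are illustrative only and do not come from the patch.

```cpp
#include <atomic>
#include <string>

// Why a signal was sent to a thread (illustrative enum).
enum class SignalReason { none, handshake_timeout, safepoint_timeout };

// Hypothetical stand-in for the VM's Thread structure: the reason lives in
// the target thread itself instead of in a global variable.
struct Thread {
  std::atomic<SignalReason> pending_signal_reason{SignalReason::none};
};

// Sender side: mark the target before sending the signal to it.
void mark_before_signal(Thread& target, SignalReason why) {
  target.pending_signal_reason.store(why);
}

// Error reporter side: only the current thread's own field is consulted,
// so an unrelated SIGILL on another thread is never annotated.
std::string sender_annotation(const Thread& current) {
  switch (current.pending_signal_reason.load()) {
    case SignalReason::handshake_timeout: return "sent by handshake timeout handler";
    case SignalReason::safepoint_timeout: return "sent by safepoint timeout handler";
    default: return "";
  }
}
```

Compared with a single global, this removes the window in which a different thread's SIGILL could be compared against someone else's recorded address, at the cost of touching the thread structure.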

src/hotspot/share/utilities/vmError.cpp

src/hotspot/share/runtime/safepoint.cpp

tstuefe · 2025-07-18T13:34:06Z

> Not sure what is so hard to understand here @tstuefe . A thread is hit with a SIGILL and we report that now, but we don't report why it was hit with the SIGILL. If there were only one reason (like it executed an illegal instruction) then it would be obvious, but we have hijacked SIGILL as a generic "something happened" signal. So the proposal here is to record the identity of the thread being sent a SIGILL due to a handshake or safepoint timeout, so that when that thread responds to the SIGILL it can see that is why it got it and report that fact. If a different thread also got a SIGILL for a different reason we don't want it reporting it was due to the timeout mechanism.

Thank you, @dholmes-ora . I already answered Anton, but I get that now.

toxaart · 2025-07-18T15:02:57Z

> Was (A) the winner, and it started error handling? And you maybe saw a "Thread XXX also had an error" line from the sending thread?

Yes, the slow thread started reporting, and I think I also observed the latter message. Note that the fatal error is still processed at the end of the timeout handler, but not reported by VMError, as it can report only one such error.

So yes, we want to improve the reporting for case A: when a slow thread receives a SIGILL and dies being able to handle the error, we want to know if SIGILL came from handshake/safepoint timeout and print extra info if that is the case.

> To help with this case, I suggest a simple addition in handshake.cpp:

Thanks, added to the latest change.

> Yes, it could happen. The mechanism could be improved by storing the fact that a SIGILL has been sent to thread X not in a global variable but in the Thread structure of X. Then, in VMError, one checks if the current thread had been the target of a recent pthread_kill, and only write "sent by xxx" in that case. I ignore here the possible case of multiple senders one receiver, because I think that is extremely unlikely.

I think this would be a more invasive change, we can do it when there is a real need.

tstuefe

Ok thanks.

dholmes-ora

The structure of this looks good, but I have a few remaining nits. Thanks.

dholmes-ora · 2025-07-20T21:39:24Z

src/hotspot/share/runtime/handshake.cpp

    if (os::signal_thread(target, SIGILL, "cannot be handshaked")) {
      // Give target a chance to report the error and terminate the VM.
      os::naked_sleep(3000);
    }
  } else {
    log_error(handshake)("No thread with an unfinished handshake op(" INTPTR_FORMAT ") found.", p2i(op));
  }
-  fatal("Handshake timeout");
+  if (target != nullptr) {
+    fatal("Thread " PTR_FORMAT " has not cleared handshake op: " PTR_FORMAT ", then failed to terminate JVM", p2i(target), p2i(op));


Suggested change

-    fatal("Thread " PTR_FORMAT " has not cleared handshake op: " PTR_FORMAT ", then failed to terminate JVM", p2i(target), p2i(op));
+    fatal("Thread " PTR_FORMAT " has not cleared handshake op %s, and failed to terminate the JVM", p2i(target), op->name());

The earlier logging statement that uses p2i(op) relies on an even earlier logging statement (line 189/190) that reports the name and the p2i value. But the fatal error message can't rely on using logging information to map the p2i value back to a name, so we need the name directly.
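
To make the point concrete, here is a small standalone sketch of building such a message with the operation's name rather than its address. `HandshakeOp`, `timeout_message` and the pointer format are assumptions for the example; HotSpot's own `fatal` and `PTR_FORMAT` are not used here.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a handshake operation that carries a name.
struct HandshakeOp { const char* name; };

// Build a self-describing message: the thread is shown as an address,
// the operation by name, so no earlier log line is needed to decode it.
std::string timeout_message(const void* target, const HandshakeOp& op) {
  char buf[256];
  std::snprintf(buf, sizeof(buf),
                "Thread 0x%016" PRIxPTR
                " has not cleared handshake op %s, and failed to terminate the JVM",
                reinterpret_cast<std::uintptr_t>(target), op.name);
  return std::string(buf);
}
```

With only an address for the operation, a reader of the hs_err file would need the earlier log output to find out which operation timed out.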

dholmes-ora · 2025-07-20T21:41:42Z

src/hotspot/share/utilities/vmError.cpp

+volatile intptr_t VMError::handshakeTimedOutThread = p2i(nullptr);
+volatile intptr_t VMError::safepointTimedOutThread = p2i(nullptr);


Suggested change

-volatile intptr_t VMError::handshakeTimedOutThread = p2i(nullptr);
-volatile intptr_t VMError::safepointTimedOutThread = p2i(nullptr);
+volatile intptr_t VMError::_handshake_timeout_thread = p2i(nullptr);
+volatile intptr_t VMError::_safepoint_timeout_thread = p2i(nullptr);

dholmes-ora · 2025-07-20T21:43:20Z

src/hotspot/share/utilities/vmError.cpp

@@ -1329,6 +1337,14 @@ void VMError::report(outputStream* st, bool _verbose) {
 # undef END
 }

+void VMError::set_handshake_timed_out_thread(intptr_t x) {


Suggested change

-void VMError::set_handshake_timed_out_thread(intptr_t x) {
+void VMError::set_handshake_timed_out_thread(intptr_t thread_addr) {

dholmes-ora · 2025-07-20T21:43:47Z

src/hotspot/share/utilities/vmError.cpp

+  handshakeTimedOutThread = x;
+}
+
+void VMError::set_safepoint_timed_out_thread(intptr_t x) {


Suggested change

-void VMError::set_safepoint_timed_out_thread(intptr_t x) {
+void VMError::set_safepoint_timed_out_thread(intptr_t thread_addr) {

openjdk bot added the hotspot-runtime label Jul 15, 2025

toxaart marked this pull request as ready for review July 15, 2025 10:51

openjdk bot added the rfr Pull request is ready for review label Jul 15, 2025

toxaart marked this pull request as draft July 16, 2025 09:49

openjdk bot removed the rfr Pull request is ready for review label Jul 16, 2025

8359820: Explicitly report SIGILL fired by handshake timeout handler …

c764efb

…in VMError::report()

toxaart closed this Jul 16, 2025

toxaart force-pushed the JDK-8359820-SIGILL-with-low-handshake-timeout-on-intel-sde branch from a8179fc to 310ef85 Compare July 16, 2025 12:19

8359820: Fixed newline

8a794a6

toxaart reopened this Jul 16, 2025

toxaart added 3 commits July 16, 2025 13:54

Merge remote-tracking branch 'origin/master' into JDK-8359820-SIGILL-…

b90afb6

…with-low-handshake-timeout-on-intel-sde

8359820: Improved safepoint and handshake timeout report

48de6d8

Merge branch 'JDK-8359820-SIGILL-with-low-handshake-timeout-on-intel-…

65e19dc

…sde' of https://github.com/toxaart/jdk into JDK-8359820-SIGILL-with-low-handshake-timeout-on-intel-sde

toxaart changed the title ~~8359820: SIGILL with low -XX:HandshakeTimeout~~ 8359820: Improve handshake/safepoint timeout diagnostic messages Jul 17, 2025

toxaart added 3 commits July 17, 2025 11:03

8359820: Removed extra line

044355b

8359820: Fixed test

4e46531

Merge remote-tracking branch 'origin/master' into JDK-8359820-SIGILL-…

689e614

…with-low-handshake-timeout-on-intel-sde

toxaart marked this pull request as ready for review July 17, 2025 14:04

openjdk bot added the rfr Pull request is ready for review label Jul 17, 2025

dholmes-ora reviewed Jul 18, 2025

src/hotspot/share/runtime/handshake.cpp (outdated)

dholmes-ora suggested changes Jul 18, 2025

src/hotspot/share/utilities/vmError.cpp (outdated)

8359820: Addressed reviewer's comments

4064563

tstuefe reviewed Jul 18, 2025

src/hotspot/share/utilities/globalDefinitions.hpp (outdated)

src/hotspot/share/utilities/vmError.cpp

toxaart closed this Jul 18, 2025

toxaart reopened this Jul 18, 2025

8359820: Addressed reviewer's comments

27cb77d

tstuefe reviewed Jul 18, 2025

src/hotspot/share/utilities/vmError.cpp (outdated)

src/hotspot/share/runtime/safepoint.cpp (outdated)

8359820: Addressed reviewer's comments

80a0b05

8359820: Fixed spaces

9ccf096

tstuefe approved these changes Jul 18, 2025

openjdk bot added the ready Pull request is ready to be integrated label Jul 18, 2025

dholmes-ora suggested changes Jul 20, 2025
