Skip to content

Conversation

marc-chevalier
Copy link
Member

@marc-chevalier marc-chevalier commented Jul 17, 2025

Some crashes are consequences of earlier misshaped ideal graphs, which could be detected earlier, closer to the source, before the possibly many transformations that lead to the crash.

Let's verify that the ideal graph is well-shaped earlier then! I propose here such a feature. This runs after IGVN, because at this point, the graph, should be cleaned up for any weirdness happening earlier or during IGVN.

This feature is enabled with the develop flag VerifyIdealStructuralInvariants. Open to renaming. No problem with me! This feature is only available in debug builds, and most of the code is even not compiled in product, since it uses some debug-only functions, such as Node::dump or Node::Name.

For now, only local checks are implemented: they are checks that only look at a node and its neighborhood, wherever it happens in the graph. Typically: under a If node, we have a IfTrue and a IfFalse. To ease development, each check is implemented in its own class, independently of the others. Nevertheless, one needs to do always the same kind of things: checking there is an output of such type, checking there is N inputs, that the k-th input has such type... To ease writing such checks, in a readable way, and in a less error-prone way than pile of copy-pasted code that manually traverse the graph, I propose a set of compositional helpers to write patterns that can be matched against the ideal graph. Since these patterns are... patterns, so not related to a specific graph, they can be allocated once and forever. When used, one provides the node (called center) around which one want to check if the pattern holds.

On top of making the description of pattern easier, these helpers allows nice printing in case of error, by showing the path from the center to the violating node. For instance (made up for the purpose of showing the formatting), a violation with a path climbing only inputs:

1 failure for node
 211  OuterStripMinedLoopEnd  === 215 39  [[ 212 198 ]] P=0,948966, C=23799,000000
At node
    209  CountedLoopEnd  === 182 208  [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100)
  From path:
    [center] 211  OuterStripMinedLoopEnd  === 215 39  [[ 212 198 ]] P=0,948966, C=23799,000000
      <-(0)- 215  SafePoint  === 210 1 7 1 1 216 37 54 185  [[ 211 ]]  SafePoint  !orig=186 !jvms: StringLatin1::equals @ bci:29 (line 100)
      <-(0)- 210  IfFalse  === 209  [[ 215 216 ]] #0 !orig=198 !jvms: StringLatin1::equals @ bci:12 (line 100)
      <-(0)- 209  CountedLoopEnd  === 182 208  [[ 210 197 ]] [lt] P=0,948966, C=23799,000000 !orig=[196] !jvms: StringLatin1::equals @ bci:12 (line 100)
# OuterStripMinedLoopInvariants:
Unexpected type: CountedLoopEnd.

or with outputs:

1 failure for node
 413  OuterStripMinedLoopEnd  === 417 41  [[ 414 399 ]] P=0,960468, C=22887,000000
At node
    415  OuterStripMinedLoop  === 415 180 414  [[ 415 416 ]]
  From path:
    [center] 413  OuterStripMinedLoopEnd  === 417 41  [[ 414 399 ]] P=0,960468, C=22887,000000
         --> 414  IfTrue  === 413  [[ 415 ]] #1
         --> 415  OuterStripMinedLoop  === 415 180 414  [[ 415 416 ]]
# OuterStripMinedLoopInvariants:
Non-unique output of expected type. Found: 0.

So far a small set of checks are implemented:

  • IfProjections: check that If nodes have a IfTrue and IfFalse
  • PhiArity: check that Phi nodes have a Region node of the same arity as 0th input
  • ControlSuccessor: check that control nodes have the right amount of successors (usually 1, but 2 for if-related nodes...)
  • RegionSelfLoop: check that regions are either copy, or have a self loop as 0th input
  • CountedLoopInvariants: check the structure around the backedge of a counted loop
  • OuterStripMinedLoopInvariants: check the structure around OuterStripMinedLoopEnd
  • MultiBranchNodeOut: check that for MultiBranch, outcnt is smaller than or equal to required_outcnt (it is legitimate to have a smaller number of output, especially after some optimizations).

Some of these checks have an additional subtlety: it's ok to have some wrong shape in dead code, for instance IfProjections. After a lot of investigation, it seems that some dead loops are not always detected eagerly and can make some control path survive longer, until being removed before loop opts. This seems to be by design to avoid traversing the whole graph everytime a region lose an input. It seems such misshape is harmless because they are not reachable from the inputs, and the cost of removing them would be prohibitive. To deal with such cases, when such a check fails, we check whether it happened in dead code. The dead of unreachable control nodes is lazily computed to answer that, and it's shared across checkers. While computing unreachable nodes is somewhat expensive, it seems to happen rarely in practice.

This verification has found JDK-8359344 and JDK-8359121. It has been run on tiers 1 to 3, plus some internal testing and, after fixing the above-mentioned, it seems all passing!

Related future: add more checks, should be easy.

Less related future: could we imagine using similar patterns (without the error reporting mechanism) to use for optimizations, instead of manual traversing? It could make the code clearer to understand. We could also imagine optionally using such things in idealization to declare which patterns nodes are looking for, and if they have depth greater than 1, automatically adapting the enqueuing strategy without having to pimp PhaseIterGVN::add_users_of_use_to_worklist everytime. Could at least cover some basic (but numerous) cases.


Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8350864: C2: verify structural invariants of the Ideal graph (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/26362/head:pull/26362
$ git checkout pull/26362

Update a local copy of the PR:
$ git checkout pull/26362
$ git pull https://git.openjdk.org/jdk.git pull/26362/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 26362

View PR using the GUI difftool:
$ git pr show -t 26362

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/26362.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper
Copy link

bridgekeeper bot commented Jul 17, 2025

👋 Welcome back mchevalier! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk
Copy link

openjdk bot commented Jul 17, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk
Copy link

openjdk bot commented Jul 17, 2025

@marc-chevalier The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@marc-chevalier marc-chevalier marked this pull request as ready for review July 17, 2025 08:43
@openjdk openjdk bot added the rfr Pull request is ready for review label Jul 17, 2025
@mlbridge
Copy link

mlbridge bot commented Jul 17, 2025

Webrevs

Copy link
Contributor

@iwanowww iwanowww left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Some high-level comments:

  • IMO it's better to have node-specific invariant checks co-located with corresponding node (as Node::verify() maybe?); it would make it clearer what are the expectations when changing the implementation.
  • on naming: IMO VerifyIdealGraph would clearly describe what the logic does, fits existing conventions well, and easy to find

@marc-chevalier
Copy link
Member Author

IMO it's better to have node-specific invariant checks co-located with corresponding node (as Node::verify() maybe?); it would make it clearer what are the expectations when changing the implementation.

I understand the motivation, but I'm not sure what to do in every case. For instance, when the pattern is not so small (like strip mining), it's hard to associate the invariant with a single node: many are involved, and it's not really describable as the expectation of a single node. Of course, we could split the pattern into a lot of sub-patterns, centered each around each node type, but then, we lose the overview of the structure, and it becomes context-free (e.g. a IfFalse must have a CountedLoopEnd input only when it comes before a safepoint before a OuterStripMinedLoopEnd, but not in general).

Another problematic case is the control successor check that has special handling for some kinds, that could be relocated to the said node types, but for the general case, it simply tests whether the node is a CFG node. One could still do that in Node, but it's then not tied to a specific node type, and I feel it bloats the Node class (that is already not so small). And then, I fear it makes the invariant harder to read since it will be distributed across many node types: I would need to find overrides of Verify. When working on a given node, it may be easier to see what I need to guarantee (or change the invariant), but when working on something else, it makes harder to find the invariants I can actually rely on, because it could always be overridden in a derived class.

I also have a readability concern. Even if we sort them by node types, then we mix the implementation of all invariants of a given node in a single method, making it extra-hard to understand the big picture, and when looking for overrides, I will find some, but maybe they won't be about the invariant I'm interested in.

And a code/maintenance concern. If I have a default implementation of the control successor check in Node, among other such general checks, I'm tempted to override it in IfNode to accept having more than one successor, but then, how do I perform the other general checks that still hold? I can't call Node::Verify since it will enforce the wrong number of successor. I could put these checks in another method, and call it from IfNode::Verify, but it has other annoying consequences: if they are all in the same side method, I can't customize another of these checks in another node type; if each check is in its own method all called from Node::Verify, I need to repeat the call in the overrides of Verify for all the checks one will add in the future... Overall, it seems risky to maintain. We could also just call Node::Verify and have here some handling to skip some steps for some node types, but I feel like that defeats the point of having invariants closer from the type.

Overall, it seems to me that it's beneficial to move checks to the node types if:

  • the pattern is small and clearly has a privileged node, so we won't be surprised by having the invariant implemented in another node type
  • the pattern doesn't have special cases for sub-types.

For instance, PhiArity would be a good candidate (about a special kind of node, no context needed, no exception). So, maybe a solution would be to split the checks in two sources: some that are like this, implemented directly in the node, and some that are less local (not about a node, but about bigger shapes), or needs more special cases, and that we keep standalone. I don't think having two sources of invariants is a problem at all.

@marc-chevalier
Copy link
Member Author

on naming: IMO VerifyIdealGraph would clearly describe what the logic does, fits existing conventions well, and easy to find

Sure, fine with me! I'd be curious to see if somebody has other ideas.

@vnkozlov
Copy link
Contributor

I am fine with VerifyIdealGraph flag. The main concern is we have tons of Verify* flags but I don't think we use them in CI testing. So we are forgetting about them, they will brake and few years later we are removing them like we did with VerifyOpto.

Copy link
Contributor

@benoitmaillard benoitmaillard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work, and great explanation as well! The invariants that are already implemented seem quite useful already, and it seems there is a lot of potential.

Having recently worked on a few missed optimizations related to PhaseIterGVN::add_users_of_use_to_worklist, I agree that it would be interesting to use such patterns for automatic notifications. The way I see it, we would need to somehow "reverse" the patterns, as they would be expressed from the point of view of the node on which the optimizations is applied, and would require notification when dependencies changes. Probably quite non-trivial, but interesting nonetheless.

I only have a few basic remarks/questions.

@marc-chevalier
Copy link
Member Author

@vnkozlov: That is true. I think the idea was to use the in tests (typically stress tests) once integrated. Or at least, use it in the issues I found thanks to this flag. That should make it not totally at least.

Copy link
Contributor

@eme64 eme64 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, very nice work @marc-chevalier !

Cool that you tried a pattern-matching approach. I really do wonder if we could use that more widely 😊

// An invariant that needs only a local view of the graph, around a given node.
class LocalGraphInvariant : public ResourceObj {
public:
static constexpr int OutputStep = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add a quick comment what this is for? After all, it is a public static constant ;)

bool is_node_dead(const Node*);
private:
void fill();
Unique_Node_List live_nodes;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the hotspot convention is to have fields with an underscore _live_nodes. Especially if they are private.

/* Check whether the invariant is true around the node [center]. The argument [steps] and [path] are initially empty.
*
* If the check fails steps and path must be filled with the path from the center to the failing node (where it's relevant to show).
* Given a list of node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Given a list of node
* Given a list of nodes

* - path must have length k, and contain rk ... r1 where ri is:
* - a non-negative integer p for each step such that N{i-1} has Ni as p-th input (we need to follow an input edge)
* - the OUTPUT_STEP value in case N{i-1} has Ni as an output (we need to follow an output edge)
* The list are reversed to allow to easily fill them lazily on failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The list are reversed to allow to easily fill them lazily on failure.
* The lists are reversed to allow to easily fill them lazily on failure.

Comment on lines +69 to +70
* The parameter [live_nodes] is used to share the lazily computed set of CFG nodes reachable from root. This is because some
* checks don't apply to dead code, suppress their error if a violation is detected in dead code.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that mean we only cache the result if it is reachable, but not if it is not reachable? Does that mean we may check reachability for non-reachable nodes many many times?

return result;
}
assert(counted_loop != nullptr, "sanity");
if (is_long) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you cache the value? Seems is_long is only used once ... and center should not change pointers around.

Comment on lines +556 to +562
ResourceMark rm;

if (_checks.is_empty()) {
return true;
}

VectorSet enqueued;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would move the ResourceMark to the beginning of the allocations, and do the fast bail-out first.

Comment on lines +569 to +575
// Sometimes, we get weird structure in dead code that will be cleaned up later. It typically happens
// when data dies, but control is not cleanup right away, possibly kept alive by un unreachable loop.
// Since we don't want to eagerly traverse the whole graph to remove dead code in IGVN, we can accept
// weird structure in dead code.
// For CFG-related errors, we will compute the set of reachable CFG nodes and decide whether to keep
// the issue if the problematic node is reachable. This set of reachable node is thus computed lazily
// (and it seems not to happen often in practice), and shared across checks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Sometimes, we get weird structure in dead code that will be cleaned up later. It typically happens
// when data dies, but control is not cleanup right away, possibly kept alive by un unreachable loop.
// Since we don't want to eagerly traverse the whole graph to remove dead code in IGVN, we can accept
// weird structure in dead code.
// For CFG-related errors, we will compute the set of reachable CFG nodes and decide whether to keep
// the issue if the problematic node is reachable. This set of reachable node is thus computed lazily
// (and it seems not to happen often in practice), and shared across checks.
// Sometimes, we get weird structures in dead code that will be cleaned up later. It typically happens
// when data dies, but control is not cleaned up right away, possibly kept alive by an unreachable loop.
// Since we don't want to eagerly traverse the whole graph to remove dead code in IGVN, we can accept
// weird structures in dead code.
// For CFG-related errors, we will compute the set of reachable CFG nodes and decide whether to keep
// the issue if the problematic node is reachable. This set of reachable nodes is thus computed lazily
// (and it seems not to happen often in practice), and shared across checks.

Comment on lines +583 to +585
if (in != nullptr && !enqueued.test_set(in->_idx)) {
worklist.push(in);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make a Unique_Node_List? It would already have a VectorSet included, and you could just push without checking if we already pushed the node. Very nice for BFS traversals.
You would then not even pop nodes, but just traverse over the worklist, as it grows.

Comment on lines +607 to +611
ttyLocker ttyl;
tty->print("%d failure%s for node\n", failures, failures == 1 ? "" : "s");
center->dump();
tty->print_cr("%s", ss.base());
ss.reset();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really want to use the ttyLocker here? I thought we were trying to get away from it because it sometimes leads to lock-priority issues / dead-locks.
Why not just use yet another stringStream ss3, and do it all via that one?

@eme64
Copy link
Contributor

eme64 commented Aug 25, 2025

I am fine with VerifyIdealGraph flag. The main concern is we have tons of Verify* flags but I don't think we use them in CI testing. So we are forgetting about them, they will brake and few years later we are removing them like we did with VerifyOpto.

Yes. What you need is at least a "Hello World" test, where the flag is enabled.
And then we should try to add it to stress and fuzzer tests, so file an RFE for that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hotspot-compiler [email protected] rfr Pull request is ready for review
Development

Successfully merging this pull request may close these issues.

5 participants