
Handle DelayFree for HW_Category_SIMDByIndexedElement intrinsics #114525


Merged: kunalspathak merged 2 commits into dotnet:main on Apr 12, 2025


@kunalspathak (Member) commented Apr 11, 2025

In #107459, we refactored the code that builds RefPositions for HWIntrinsic tree nodes. As part of that, we dropped the correct handling of intrinsics that fall under HW_Category_SIMDByIndexedElement (e.g. MLS) and whose operands are restricted to registers V0-V15. We previously marked the RefPositions for such operands as "delay-free", but after the refactoring we stopped doing so, and as a result we could assign the same register to targetReg and to the 2nd or 3rd operand. Codegen has to emit `mov targetReg, op1Reg` whenever targetReg != op1Reg, but if targetReg also holds the value of the 2nd/3rd operand, that move overwrites it.
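A minimal C++ sketch of the hazard, under a toy model: each vector register is reduced to a single integer lane, and `emitMls` / `allocationIsSafe` are hypothetical names, not JIT functions. It shows how the `mov targetReg, op1Reg` fix-up destroys op3 when the allocator gives targetReg the same register as op3.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model: one integer "lane" per vector register V0..V31.
// Hypothetical illustration only; this is not the JIT's code.
using RegFile = std::array<int64_t, 32>;

// Models codegen for an RMW multiply-subtract (MLS-like):
//   target = op1 - op2 * op3
// The instruction accumulates into its destination, so when the allocator
// gives target a different register than op1, codegen first copies op1
// into target ("mov targetReg, op1Reg").
int64_t emitMls(RegFile& regs, int target, int op1, int op2, int op3) {
    assert(target >= 0 && target < 32 && op1 >= 0 && op1 < 32);
    assert(op2 >= 0 && op2 < 32 && op3 >= 0 && op3 < 32);
    if (target != op1) {
        regs[target] = regs[op1];  // clobbers op2/op3 if target aliases them
    }
    regs[target] = regs[target] - regs[op2] * regs[op3];
    return regs[target];
}

// The invariant that delay-free uses are meant to guarantee: target may
// share op1's register, but otherwise must not share a register with
// op2 or op3.
bool allocationIsSafe(int target, int op1, int op2, int op3) {
    return target == op1 || (target != op2 && target != op3);
}
```

With V1 = 10, V2 = 3 and V3 = 2, allocating the target to V4 yields 10 - 3 * 2 = 4, while allocating it to V3 (the op3 register) first overwrites V3 with 10 and yields -20. `allocationIsSafe` mirrors the condition in the linked assertion, `(targetReg == op1Reg) || (targetReg != op3Reg)`, extended to op2.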

Here is the portion of the changes that altered the behavior.

Fixes: #114358, #114322

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 11, 2025
@kunalspathak (Member, Author):

cc: @a74nh

(Contributor):

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@kunalspathak (Member, Author):

This might need a backport because it can lead to silent bad codegen for many intrinsics in the HW_Category_SIMDByIndexedElement category.

@@ -1481,6 +1481,21 @@ int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree, int* pDstCou

```cpp
    {
        srcCount += BuildContainedCselUses(containedCselOp, delayFreeOp, candidates);
    }
    else if ((intrin.category == HW_Category_SIMDByIndexedElement) && (genTypeSize(intrin.baseType) == 2) && !HWIntrinsicInfo::HasImmediateOperand(intrin.id))
```
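The new `else if` fires only for SIMDByIndexedElement intrinsics whose base type is 2 bytes wide and that have no immediate operand. On ARM64, the by-element encodings for 16-bit lanes spend one register-field bit on the lane index, which is why the indexed operand is limited to V0-V15. The predicate is restated below as a standalone sketch; `HWCategory`, `IntrinsicInfo` and `needsDelayFreeIndexedUse` are hypothetical stand-ins for the JIT's metadata, and the claim that matching uses are built as delay-free is an assumption based on the PR description, since the body of the new branch is not shown here.

```cpp
#include <cstddef>

// Hypothetical stand-ins for JIT intrinsic metadata; names are illustrative.
enum class HWCategory { SIMD, SIMDByIndexedElement, Scalar };

struct IntrinsicInfo {
    HWCategory  category;
    std::size_t baseTypeSize;        // plays the role of genTypeSize(intrin.baseType)
    bool        hasImmediateOperand; // plays the role of HWIntrinsicInfo::HasImmediateOperand(intrin.id)
};

// Same condition as the new else-if branch: 16-bit by-element intrinsics
// without an immediate operand, whose uses (per the PR description) must be
// delay-free so the target cannot reuse one of their registers.
bool needsDelayFreeIndexedUse(const IntrinsicInfo& intrin) {
    return intrin.category == HWCategory::SIMDByIndexedElement
        && intrin.baseTypeSize == 2
        && !intrin.hasImmediateOperand;
}
```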
Review comment (Contributor):

Thanks for finding this. To keep with the existing style, could you:

1. Expand getDelayFreeOperand() to `GenTree* LinearScan::getDelayFreeOperand(GenTreeHWIntrinsic* intrinsicTree, bool embedded, bool* forceDelay)`, where forceDelay is an out parameter.
2. Move your else if into getDelayFreeOperand() (I guess it'll go into the default case) and, if it passes, set `*forceDelay = true` and return nullptr.
3. Then, at line 1500 ("Only build as delay free use if register types match"), add `|| forceDelay == true`.

@kunalspathak (Member, Author), Apr 11, 2025:

That was my first choice, but getDelayFreeOperand() is called outside this for loop, and it would then return nullptr even for op1. If that happens, we can't set tgtPrefUse = delayUse, i.e. we can't prefer op1's register for the target as well. When targetReg == op1Reg, we usually avoid an extra move before the RMW instruction.
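The saving the author wants to keep can be sketched with a hypothetical emitter (`emitMlsByElement` is an illustrative name, not a JIT function) for a 16-bit by-element MLS, which computes targetReg = op1 - op2 * op3[lane]. Because MLS accumulates into its destination, a fix-up `mov` is needed only when the allocator did not place the target in op1's register, which is exactly what preferring op1's register (tgtPrefUse) tries to arrange.

```cpp
#include <string>
#include <vector>

// Hypothetical emitter for an RMW by-element multiply-subtract with 16-bit
// lanes: targetReg = op1 - op2 * op3[lane]. Illustrative only.
std::vector<std::string> emitMlsByElement(int targetReg, int op1Reg, int op2Reg,
                                          int op3Reg, int lane) {
    auto v = [](int r) { return "v" + std::to_string(r); };
    std::vector<std::string> code;
    // MLS accumulates into its destination, so op1 must already be in targetReg.
    if (targetReg != op1Reg) {
        code.push_back("mov " + v(targetReg) + ".16b, " + v(op1Reg) + ".16b");
    }
    code.push_back("mls " + v(targetReg) + ".8h, " + v(op2Reg) + ".8h, " +
                   v(op3Reg) + ".h[" + std::to_string(lane) + "]");
    return code;
}
```

For example, `emitMlsByElement(1, 1, 2, 3, 0)` produces only `mls v1.8h, v2.8h, v3.h[0]`, while `emitMlsByElement(4, 1, 2, 3, 5)` needs `mov v4.16b, v1.16b` first. That leading move is the one that corrupts op2 or op3 when the target aliases their register.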

(Contributor):

Yes, I hadn't really taken the for loop into account properly.

For all these SIMDByIndexedElement intrinsics, will op1 always be a delay operand?

@kunalspathak (Member, Author):

If you look at the code, op1 isn't delayed as such; rather, it is how we indicate the preferred register for the target. It is not marked as delay-free.

@kunalspathak (Member, Author):

/azp run Antigen, Fuzzlyn, runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs


Azure Pipelines successfully started running 4 pipeline(s).

@kunalspathak (Member, Author):

Antigen, Fuzzlyn and libraries-jitstressregs failures are unrelated.
@dotnet/jit-contrib @BruceForstall / @amanasifkhalid PTAL

@kunalspathak kunalspathak marked this pull request as ready for review April 11, 2025 21:27
@kunalspathak (Member, Author):

/ba-g timeout issue

@kunalspathak kunalspathak merged commit 46c3ac3 into dotnet:main Apr 12, 2025
144 of 163 checks passed
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)

Development: successfully merging this pull request may close these issues:
Assertion failed '(targetReg == op1Reg) || (targetReg != op3Reg)' during 'Generate code'

3 participants