
[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions #140694


Open
wants to merge 16 commits into base: main

Conversation

Contributor

@chrisjbris commented May 20, 2025

Make use of s_or_b64/s_and_b64/s_xor_b64 for v2i32. Legalising these causes a number of test regressions, so extra work in the combiner and Tablegen patterns was necessary.

This addresses #124776 too.

I realise this patch is imperfect, but I'd really appreciate some feedback on some of the regressions, in particular shl64_reduce.ll and fneg-combines.new.ll, especially as some of my combiner modifications have contradicted existing combiner behaviour. Thanks for any observations.

Rebasing broke a couple of the regression fixes, and some refactoring of the DAGCombine modifications is needed to tidy things up, so I'm keeping this in draft for now while I fix up those issues.

Test regression and combiner modifications:

  • fix fneg-modifier-casting for gfx1100 - fixed.
  • Refactor / simplify performXorCombine() modifications - done, could probably be generalised to more cases.
  • Refactor / simplify performOrCombine() modifications - done.

Test regressions encountered:

LLVM :: CodeGen/AMDGPU/bf16.ll --- definite regression in v_select_v4bf16 - fixed
LLVM :: CodeGen/AMDGPU/bfi_int.ll -- Also seems to be a regression: v_bfi_b32 etc. have been replaced with multiple xor/and/or - fixed
LLVM :: CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll -- Fixed, but needs refactor for patch
LLVM :: CodeGen/AMDGPU/fneg-modifier-casting.ll - fixed, needs refactor
LLVM :: CodeGen/AMDGPU/fneg.ll -- fixed, refactor needed to complete.
LLVM :: CodeGen/AMDGPU/fshr.ll -- Some register renaming differences and reordering, but looks ok
LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll --fixed.
LLVM :: CodeGen/AMDGPU/insert_vector_elt.ll -- fixed.
LLVM :: CodeGen/AMDGPU/or.ll --- improved / fixed
LLVM :: CodeGen/AMDGPU/rotl.ll -- benign reordering
LLVM :: CodeGen/AMDGPU/rotr.ll -- fixed, subtask and patch created
LLVM :: CodeGen/AMDGPU/select-vectors.ll -- previously handwritten and sparse; the update_llc_test_checks version will require careful checking
LLVM :: CodeGen/AMDGPU/select.f16.ll -- fixed
LLVM :: CodeGen/AMDGPU/widen-smrd-loads.ll
LLVM :: CodeGen/AMDGPU/xor.ll -- improved / fixed
LLVM :: CodeGen/AMDGPU/shl64_reduce.ll - producing extraneous AND instructions, fixing now.

Also updated tests or.ll and xor.ll to demonstrate the generation of the 64-bit ops in the v2i32 cases. xor.ll also had a dead check-prefix (GCN), which I have removed.
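
For reference, the new v2i32 cases are roughly of the following shape (an illustrative sketch, not the exact test added to or.ll; the kernel name and signature are made up):

define amdgpu_kernel void @scalar_or_v2i32(ptr addrspace(1) %out, <2 x i32> %a, <2 x i32> %b) {
  ; With v2i32 or now legal and the operands uniform (SGPRs), this should
  ; select a single s_or_b64 rather than two s_or_b32 instructions.
  %r = or <2 x i32> %a, %b
  store <2 x i32> %r, ptr addrspace(1) %out
  ret void
}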


Member

@llvmbot commented May 20, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Chris Jackson (chrisjbris)

Changes


Patch is 45.28 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/140694.diff

19 Files Affected:

  • (modified) llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp (+3)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp (+1-1)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+97-4)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+3)
  • (modified) llvm/lib/Target/AMDGPU/SIInstructions.td (+64-2)
  • (modified) llvm/lib/Target/AMDGPU/SOPInstructions.td (+15)
  • (modified) llvm/lib/Target/AMDGPU/VOP2Instructions.td (+7-2)
  • (modified) llvm/test/CodeGen/AMDGPU/and.ll (+1-2)
  • (modified) llvm/test/CodeGen/AMDGPU/bf16-conversions.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/bfi_int.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll (+26-10)
  • (modified) llvm/test/CodeGen/AMDGPU/fneg-modifier-casting.ll (+6-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fshr.ll (+54-134)
  • (modified) llvm/test/CodeGen/AMDGPU/or.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/rotl.ll (+28-14)
  • (added) llvm/test/CodeGen/AMDGPU/rotr-v2i32.ll (+32)
  • (modified) llvm/test/CodeGen/AMDGPU/rotr.ll (+23-11)
  • (modified) llvm/test/CodeGen/AMDGPU/xor.ll (+2-2)
diff --git a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
index affcd78ea61b0..6f0c524d33c5d 100644
--- a/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/LegalizeVectorOps.cpp
@@ -235,6 +235,7 @@ bool VectorLegalizer::Run() {
 
   LegalizedNodes.clear();
 
+
   // Remove dead nodes now.
   DAG.RemoveDeadNodes();
 
@@ -1282,6 +1283,8 @@ void VectorLegalizer::Expand(SDNode *Node, SmallVectorImpl<SDValue> &Results) {
   }
 
   SDValue Unrolled = DAG.UnrollVectorOp(Node);
+  LLVM_DEBUG(dbgs() << "\nUnrolled node: "; Unrolled->dump());
+  LLVM_DEBUG(dbgs() << "\n");
   if (Node->getNumValues() == 1) {
     Results.push_back(Unrolled);
   } else {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
index 7ed055e8da2b6..388efe0368256 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
@@ -152,7 +152,7 @@ AMDGPUTargetLowering::AMDGPUTargetLowering(const TargetMachine &TM,
 
   setOperationAction(ISD::LOAD, MVT::i128, Promote);
   AddPromotedToType(ISD::LOAD, MVT::i128, MVT::v4i32);
-
+  
   // TODO: Would be better to consume as directly legal
   setOperationAction(ISD::ATOMIC_LOAD, MVT::f32, Promote);
   AddPromotedToType(ISD::ATOMIC_LOAD, MVT::f32, MVT::i32);
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index ba7e11a853347..7338abd67dfe5 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -40,6 +40,7 @@
 #include "llvm/IR/IntrinsicsR600.h"
 #include "llvm/IR/MDBuilder.h"
 #include "llvm/Support/CommandLine.h"
+
 #include "llvm/Support/KnownBits.h"
 #include "llvm/Support/ModRef.h"
 #include "llvm/Transforms/Utils/LowerAtomic.h"
@@ -430,6 +431,12 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     setOperationAction(ISD::VECTOR_SHUFFLE, {MVT::v2i32, MVT::v2f32}, Legal);
   }
 
+  setOperationAction({ISD::AND, ISD::OR, ISD::XOR}, MVT::v2i32, Legal);
+  // Prevent SELECT from being implemented with the above bitwise ops and instead use cndmask.
+  setOperationAction(ISD::SELECT, MVT::v2i32, Custom);
+ // Enable MatchRotate to produce ISD::ROTR, which is later transformed to alignbit.
+  setOperationAction(ISD::ROTR, MVT::v2i32, Legal);
+
   setOperationAction(ISD::BUILD_VECTOR, {MVT::v4f16, MVT::v4i16, MVT::v4bf16},
                      Custom);
 
@@ -835,6 +842,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
     AddPromotedToType(ISD::SELECT, MVT::v2f16, MVT::i32);
   } else {
     // Legalization hack.
+    // Hmm.
     setOperationAction(ISD::SELECT, {MVT::v2i16, MVT::v2f16}, Custom);
 
     setOperationAction({ISD::FNEG, ISD::FABS}, MVT::v2f16, Custom);
@@ -1986,6 +1994,13 @@ bool SITargetLowering::shouldConvertConstantLoadToIntImm(const APInt &Imm,
   return true;
 }
 
+bool SITargetLowering::shouldFoldSelectWithIdentityConstant(unsigned BinOpcode,
+                                                            EVT VT) const {
+  return (BinOpcode == ISD::AND || BinOpcode == ISD::OR ||
+          BinOpcode == ISD::XOR) &&
+         VT.getScalarType() == MVT::i64;
+}
+
 bool SITargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                                unsigned Index) const {
   if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
@@ -12872,6 +12887,52 @@ SDValue SITargetLowering::performOrCombine(SDNode *N,
     }
   }
 
+  // Detect identity v2i32 OR and replace with identity source node.
+  // Specifically an Or that has operands constructed from the same source node
+  // via extract_vector_elt and build_vector.
+  if (VT == MVT::v2i32) {
+    if (LHS->getOpcode() == ISD::BUILD_VECTOR &&
+        RHS->getOpcode() == ISD::BUILD_VECTOR) {
+      // DAG.canonicalizeCommutativeBinop(ISD::OR, RHS, LHS);
+      SDValue BVLHS = LHS->getOperand(0);
+      SDValue CLHS = LHS->getOperand(1);
+      SDValue CRHS = RHS->getOperand(0);
+      SDValue BVRHS = RHS->getOperand(1);
+      LLVM_DEBUG(
+          dbgs()
+              << "### Performing v2i32 SIISelLowering DAGCombine::CombineOR\n";);
+
+      auto *LC = dyn_cast<ConstantSDNode>(LHS->getOperand(1));
+      auto *RC = dyn_cast<ConstantSDNode>(RHS->getOperand(0));
+
+      if (LC && RC) {
+
+        // Test for and normalise build vectors.
+        if (LHS->getOpcode() == ISD::BUILD_VECTOR &&
+            RHS->getOpcode() == ISD::BUILD_VECTOR &&
+            // Check cast to constantnode here
+            LHS->getConstantOperandVal(1) == 0 &&
+            RHS->getConstantOperandVal(0) == 0) {
+
+          // Get the extract_vector_element operands.
+          SDValue LEVE = LHS->getOperand(0);
+          SDValue REVE = RHS->getOperand(1);
+
+          if (LEVE->getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
+              REVE->getOpcode() == ISD::EXTRACT_VECTOR_ELT) {
+            // Check that the the elements from the same vector are extracted.
+            if (LEVE->getOperand(0) == REVE->getOperand(0) &&
+                LEVE->getOperand(1) != REVE->getOperand(1)) {
+              LLVM_DEBUG(dbgs() << "### Found identity OR, folding...\n";);
+              SDValue IdentitySrc = LEVE.getOperand(0);
+              return IdentitySrc;
+            }
+          }
+        }
+      }
+    }
+  }
+
   if (VT != MVT::i64 || DCI.isBeforeLegalizeOps())
     return SDValue();
 
@@ -12915,20 +12976,52 @@ SDValue SITargetLowering::performXorCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
   if (SDValue RV = reassociateScalarOps(N, DCI.DAG))
     return RV;
-
+  
+  SelectionDAG &DAG = DCI.DAG;
+  EVT VT = N->getValueType(0);
   SDValue LHS = N->getOperand(0);
   SDValue RHS = N->getOperand(1);
 
-  const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS);
-  SelectionDAG &DAG = DCI.DAG;
+  if (VT == MVT::v2i32 && LHS.getNumOperands() > 1) {
+
+    const ConstantSDNode *CRHS_0 = dyn_cast<ConstantSDNode>(RHS.getOperand(0));
+    const ConstantSDNode *CRHS_1 = dyn_cast<ConstantSDNode>(RHS.getOperand(1));
+    SDValue LHS_0 = LHS.getOperand(0);
+    SDValue LHS_1 = LHS.getOperand(1);
+
+    if (LHS.getOpcode() == ISD::VSELECT && VT == MVT::v2i32) {
+      if (CRHS_0 && CRHS_0->getAPIntValue().isSignMask() &&
+          shouldFoldFNegIntoSrc(N, LHS_0))
+        if (CRHS_1 && CRHS_1->getAPIntValue().isSignMask() &&
+            shouldFoldFNegIntoSrc(N, LHS_1)) {
+          SDLoc DL(N);
+          SDValue CastLHS =
+              DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(1));
+          SDValue CastRHS =
+              DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(2));
+          SDValue FNegLHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastLHS);
+          SDValue FNegRHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastRHS);
+          SDValue NewSelect = DAG.getNode(ISD::VSELECT, DL, MVT::v2f32,
+                                          LHS->getOperand(0), FNegLHS, FNegRHS);
+          return DAG.getNode(ISD::BITCAST, DL, VT, NewSelect);
+        }
+    }
+    // Possibly split vector here if one side does have a constant RHS.
+  }
 
-  EVT VT = N->getValueType(0);
+  // Add test for when only one of the RHS vector elements is a const. Might be possible to optimise this case.
+
+
+  const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS);
+ 
+ 
   if (CRHS && VT == MVT::i64) {
     if (SDValue Split =
             splitBinaryBitConstantOp(DCI, SDLoc(N), ISD::XOR, LHS, CRHS))
       return Split;
   }
 
+
   // Make sure to apply the 64-bit constant splitting fold before trying to fold
   // fneg-like xors into 64-bit select.
   if (LHS.getOpcode() == ISD::SELECT && VT == MVT::i32) {
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index c42366a1c04c8..b651c2530d628 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -366,6 +366,9 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
                                         Type *Ty) const override;
 
+  bool shouldFoldSelectWithIdentityConstant(unsigned BinOpcode,
+                                            EVT VT) const override;
+
   bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                unsigned Index) const override;
   bool isExtractVecEltCheap(EVT VT, unsigned Index) const override;
diff --git a/llvm/lib/Target/AMDGPU/SIInstructions.td b/llvm/lib/Target/AMDGPU/SIInstructions.td
index 2e2913d88cc54..7944f783c82ba 100644
--- a/llvm/lib/Target/AMDGPU/SIInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SIInstructions.td
@@ -2334,9 +2334,9 @@ def : AMDGPUPatIgnoreCopies <
                 (COPY_TO_REGCLASS VSrc_b32:$z, VGPR_32))
 >;
 
-// 64-bit version
+foreach vt = [i64, v2i32] in {
 def : AMDGPUPatIgnoreCopies <
-  (DivergentBinFrag<xor> i64:$z, (and i64:$x, (xor i64:$y, i64:$z))),
+  (DivergentBinFrag<xor> vt:$z, (and vt:$x, (xor vt:$y, vt:$z))),
   (REG_SEQUENCE VReg_64,
     (V_BFI_B32_e64 (i32 (EXTRACT_SUBREG VReg_64:$x, sub0)),
               (i32 (EXTRACT_SUBREG VReg_64:$y, sub0)),
@@ -2345,6 +2345,7 @@ def : AMDGPUPatIgnoreCopies <
               (i32 (EXTRACT_SUBREG VReg_64:$y, sub1)),
               (i32 (EXTRACT_SUBREG VReg_64:$z, sub1))), sub1)
 >;
+}
 
 def : AMDGPUPat <
   (fcopysign f32:$src0, f32:$src1),
@@ -2378,6 +2379,24 @@ def : AMDGPUPat <
 let True16Predicate = NotHasTrue16BitInsts in {
 def : ROTRPattern <V_ALIGNBIT_B32_e64>;
 
+def : AMDGPUPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0))), sub0,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1))), sub1)
+>;
+
+// Prevents regression in fneg-modifier-casting.ll along with modifications to XorCombine().
+def : AMDGPUPat <
+  (fneg (select i1:$src0, (f32 (bitconvert i32:$src1)), (f32 (bitconvert i32:$src2)))),
+    (V_CNDMASK_B32_e64 (i32 1), $src2, (i32 1), $src1, $src0)>;
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
           (V_ALIGNBIT_B32_e64 (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub0)), $src1)>;
@@ -2385,6 +2404,20 @@ def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
 def : GCNPat<(i32 (trunc (srl i64:$src0, (i32 ShiftAmt32Imm:$src1)))),
           (V_ALIGNBIT_B32_e64 (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub0)), $src1)>;
+
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0))), sub0,
+    (V_ALIGNBIT_B32_e64
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1))), sub1)
+>;
+
 } // end True16Predicate = NotHasTrue16BitInsts
 
 let True16Predicate = UseRealTrue16Insts in {
@@ -2397,6 +2430,20 @@ def : GCNPat <
                           /* clamp */ 0, /* op_sel */ 0)
 >;
 
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_t16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (EXTRACT_SUBREG (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)) ,lo16),0,0), sub0,
+    (V_ALIGNBIT_B32_t16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      0, (EXTRACT_SUBREG (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)) ,lo16),0,0), sub1)
+>;
+
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (i32 ShiftAmt32Imm:$src1)))),
           (V_ALIGNBIT_B32_t16_e64 0, /* src0_modifiers */
                           (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
@@ -2423,6 +2470,20 @@ def : GCNPat <
                              $src1, /* clamp */ 0, /* op_sel */ 0)
 >;
 
+def : GCNPat <
+  (rotr v2i32:$src0, v2i32:$src1),
+  (REG_SEQUENCE VReg_64,
+    (V_ALIGNBIT_B32_fake16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub0)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src1, sub0)),0,0), sub0,
+    (V_ALIGNBIT_B32_fake16_e64
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)),
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src0, sub1)), 
+      0, (i32 (EXTRACT_SUBREG VReg_64:$src1, sub1)),0,0), sub1)
+>;
+
+
 def : GCNPat<(i32 (trunc (srl i64:$src0, (and i32:$src1, (i32 31))))),
      (V_ALIGNBIT_B32_fake16_e64 0, /* src0_modifiers */
                                (i32 (EXTRACT_SUBREG (i64 $src0), sub1)),
@@ -2449,6 +2510,7 @@ def : GCNPat<(fshr i32:$src0, i32:$src1, i32:$src2),
 >;
 } // end True16Predicate = UseFakeTrue16Insts
 
+
 /********** ====================== **********/
 /**********   Indirect addressing  **********/
 /********** ====================== **********/
diff --git a/llvm/lib/Target/AMDGPU/SOPInstructions.td b/llvm/lib/Target/AMDGPU/SOPInstructions.td
index 40b3dfb94ce2f..f2e1a27644afb 100644
--- a/llvm/lib/Target/AMDGPU/SOPInstructions.td
+++ b/llvm/lib/Target/AMDGPU/SOPInstructions.td
@@ -1779,6 +1779,21 @@ def : GCNPat <
   (S_MOV_B32 imm:$imm)
 >;
 
+def : GCNPat <
+  (v2i32 (UniformBinFrag<and> v2i32:$x, v2i32:$y)),
+  (S_AND_B64 SReg_64:$x, SReg_64:$y)
+>;
+
+def : GCNPat <
+  (v2i32 (UniformBinFrag<or> v2i32:$x, v2i32:$y)),
+  (S_OR_B64 SReg_64:$x, SReg_64:$y)
+>;
+
+def : GCNPat <
+  (v2i32 (UniformBinFrag<xor> v2i32:$x, v2i32:$y)),
+  (S_XOR_B64 SReg_64:$x, SReg_64:$y)
+>;
+
 // Same as a 32-bit inreg
 def : GCNPat<
   (i32 (UniformUnaryFrag<sext> i16:$src)),
diff --git a/llvm/lib/Target/AMDGPU/VOP2Instructions.td b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
index 0c7e20fc1ebf3..0fe09a66ada58 100644
--- a/llvm/lib/Target/AMDGPU/VOP2Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
@@ -954,9 +954,9 @@ def : DivergentClampingBinOp<sub, V_SUB_CO_U32_e64>;
 def : DivergentBinOp<adde, V_ADDC_U32_e32>;
 def : DivergentBinOp<sube, V_SUBB_U32_e32>;
 
-class divergent_i64_BinOp <SDPatternOperator Op, Instruction Inst> :
+class divergent_i64_BinOp <SDPatternOperator Op, Instruction Inst, ValueType vt = i64> :
   GCNPat<
-      (DivergentBinFrag<Op> i64:$src0, i64:$src1),
+      (DivergentBinFrag<Op> vt:$src0, vt:$src1),
       (REG_SEQUENCE VReg_64,
         (Inst
           (i32 (EXTRACT_SUBREG $src0, sub0)),
@@ -973,6 +973,11 @@ def :  divergent_i64_BinOp <and, V_AND_B32_e64>;
 def :  divergent_i64_BinOp <or,  V_OR_B32_e64>;
 def :  divergent_i64_BinOp <xor, V_XOR_B32_e64>;
 
+def :  divergent_i64_BinOp <and, V_AND_B32_e64, v2i32>;
+def :  divergent_i64_BinOp <or,  V_OR_B32_e64, v2i32>;
+def :  divergent_i64_BinOp <xor, V_XOR_B32_e64, v2i32>;
+
+
 // mul24 w/ 64 bit output.
 class mul24_64_Pat<SDPatternOperator Op, Instruction InstLo, Instruction InstHi> : GCNPat<
   (i64 (Op i32:$src0, i32:$src1)),
diff --git a/llvm/test/CodeGen/AMDGPU/and.ll b/llvm/test/CodeGen/AMDGPU/and.ll
index c6233642110ea..05402b3c89409 100644
--- a/llvm/test/CodeGen/AMDGPU/and.ll
+++ b/llvm/test/CodeGen/AMDGPU/and.ll
@@ -8,8 +8,7 @@ declare i32 @llvm.amdgcn.workitem.id.x() #0
 ; EG: AND_INT {{\*? *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
 ; EG: AND_INT {{\*? *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
 
-; SI: s_and_b32 s{{[0-9]+, s[0-9]+, s[0-9]+}}
-; SI: s_and_b32 s{{[0-9]+, s[0-9]+, s[0-9]+}}
+; SI: s_and_b64
 
 define amdgpu_kernel void @test2(ptr addrspace(1) %out, ptr addrspace(1) %in) {
   %b_ptr = getelementptr <2 x i32>, ptr addrspace(1) %in, i32 1
diff --git a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
index a597faa028f22..ca8f7736f6093 100644
--- a/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
+++ b/llvm/test/CodeGen/AMDGPU/bf16-conversions.ll
@@ -151,25 +151,25 @@ define amdgpu_ps float @v_test_cvt_v2f64_v2bf16_v(<2 x double> %src) {
 ; GFX-950-LABEL: v_test_cvt_v2f64_v2bf16_v:
 ; GFX-950:       ; %bb.0:
 ; GFX-950-NEXT:    v_cvt_f32_f64_e32 v6, v[2:3]
+; GFX-950-NEXT:    v_and_b32_e32 v4, 1, v6
+; GFX-950-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v4
 ; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[4:5], v6
-; GFX-950-NEXT:    v_and_b32_e32 v7, 1, v6
 ; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[2:3]|, |v[4:5]|
-; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[2:3], v[4:5]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v7
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], v[2:3], v[4:5]
+; GFX-950-NEXT:    v_cvt_f32_f64_e32 v7, v[0:1]
 ; GFX-950-NEXT:    v_cndmask_b32_e64 v2, -1, 1, s[2:3]
 ; GFX-950-NEXT:    v_add_u32_e32 v2, v6, v2
-; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cvt_f32_f64_e32 v5, v[0:1]
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
 ; GFX-950-NEXT:    v_cndmask_b32_e32 v4, v2, v6, vcc
-; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v5
-; GFX-950-NEXT:    v_and_b32_e32 v6, 1, v5
+; GFX-950-NEXT:    v_cvt_f64_f32_e32 v[2:3], v7
+; GFX-950-NEXT:    v_and_b32_e32 v8, 1, v7
 ; GFX-950-NEXT:    v_cmp_gt_f64_e64 s[2:3], |v[0:1]|, |v[2:3]|
-; GFX-950-NEXT:    v_cmp_nlg_f64_e32 vcc, v[0:1], v[2:3]
-; GFX-950-NEXT:    v_cmp_eq_u32_e64 s[0:1], 1, v6
+; GFX-950-NEXT:    v_cmp_ne_u32_e32 vcc, 0, v8
+; GFX-950-NEXT:    v_cmp_nlg_f64_e64 s[0:1], v[0:1], v[2:3]
 ; GFX-950-NEXT:    v_cndmask_b32_e64 v0, -1, 1, s[2:3]
-; GFX-950-NEXT:    v_add_u32_e32 v0, v5, v0
-; GFX-950-NEXT:    s_or_b64 vcc, vcc, s[0:1]
-; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v5, vcc
+; GFX-950-NEXT:    v_add_u32_e32 v0, v7, v0
+; GFX-950-NEXT:    s_or_b64 vcc, s[0:1], vcc
+; GFX-950-NEXT:    v_cndmask_b32_e32 v0, v0, v7, vcc
 ; GFX-950-NEXT:    v_cvt_pk_bf16_f32 v0, v0, v4
 ; GFX-950-NEXT:    ; return to shader part epilog
   %res = fptrunc <2 x double> %src to <2 x bfloat>
diff --git a/llvm/test/CodeGen/AMDGPU/bfi_int.ll b/llvm/test/CodeGen/AMDGPU/bfi_int.ll
index 201b97d479c68..d76ecbd73fe6e 100644
--- a/llvm/test/CodeGen/AMDGPU/bfi_int.ll
+++ b/llvm/test/CodeGen/AMDGPU/bfi_int.ll
@@ -582,15 +582,15 @@ define <2 x i32> @v_bitselect_v2i32_pat1(<2 x i32> %a, <2 x i32> %b, <2 x i32> %
 ; GFX7-LABEL: v_bitselect_v2i32_pat1:
 ; GFX7:       ; %bb.0:
 ; GFX7-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX7-NEXT:    v_bfi_b32 v1, v3, v1, v5
+; GFX7-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX7-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_bitselect_v2i32_pat1:
 ; GFX8:       ; %bb.0:
 ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX8-NEXT:    v_bfi_b32 v1, v3, v1, v5
+; GFX8-NEXT:    v_bfi_b32 v0, v2, v0, v4
 ; GFX8-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_bitselect_v2i32_pat1:
diff --git a/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll b/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
index a01c2fa152ab3..2d73f17d74d8b 100644
--- a/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
+++ b/llvm/test/CodeGen/AMDGPU/copysign-simplify-demanded-bits.ll
@@ -31,8 +31,8 @@ define <2 x half> @test_pown_reduced_fast_v2f16_known_odd(<2 x half> %x, <2 x i3
 ; GFX9-LABEL: test_pown_reduced_fast_v2f16_known_odd:
 ; GFX9:       ; %bb.0:
 ; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX9-NEXT:    v_or_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    v_or_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_or_b32_e32 v1, 1, v1
 ; GFX9-NEXT:    v_cvt_f32_i32_e32 v2, v2
 ; GFX9-NEXT:    v_cvt_f32_i32_e32 v1, v1
 ; GFX9-NEXT:    v_and_b32_e32 v3, 0x7fff7fff, v0
diff --git a/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll b/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
index d63a36c4b2958..7e2e8b577e085 100644
--- a/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
+++ b/llvm/test/CodeGen/AMDGPU/dag-preserve-disjoint-flag.ll
@@ -28,12 +28,15 @@ define amdgpu_ps <2 x i32> @s_or_v2i32_disjoint(<2 x i32> inreg %a, <2 x i32> in
   ; CHECK-NEXT:   [[COPY1:%[0-9]+]]:sgpr_32 = COPY $sgpr2
   ; CHECK-NEX...
[truncated]


github-actions bot commented May 20, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 125a6d4 to c6fac80 on May 20, 2025 10:16
Comment on lines 434 to 444
// Prevent SELECT from being implemented with the above bitwise ops and
// instead use cndmask.
setOperationAction(ISD::SELECT, MVT::v2i32, Custom);
Contributor

Probably should fix whatever combiner logic is doing this. In the future we probably should make a similar change for select to better make use of s_cselect_b64

Contributor Author

I set custom here so that SITargetLowering::LowerSELECT() could be applied.
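
For reference, a minimal example of the select being kept on the cndmask path (an illustrative sketch, not a test from the suite):

define <2 x i32> @select_v2i32(i1 %c, <2 x i32> %a, <2 x i32> %b) {
  ; With v2i32 and/or/xor now legal, a generic lowering could expand this
  ; select into those bitwise ops; marking SELECT as Custom routes it through
  ; LowerSELECT() so it is still emitted as v_cndmask_b32.
  %r = select i1 %c, <2 x i32> %a, <2 x i32> %b
  ret <2 x i32> %r
}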

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from bcb5393 to 29e949b on May 21, 2025 09:37
@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 7f18b35 to 957644f on May 22, 2025 12:45
@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from d1afe09 to 06c6c6e on May 28, 2025 09:20
@chrisjbris (Contributor Author)

Combined existing commits and modified commit message for easier review.

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 06c6c6e to 995e7fb on May 28, 2025 09:46
@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 022b765 to 548e519 on June 4, 2025 13:30
@chrisjbris marked this pull request as ready for review on June 4, 2025 13:35
Contributor

@JanekvO left a comment

Couple of low hanging fruit comments for now

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from edf0ee6 to 359fde6 on June 6, 2025 09:56
Contributor

@JanekvO left a comment

Should also add the issue you're fixing: #124776

Comment on lines -21 to +25
; CHECK-NEXT: flat_load_dword v4, v[2:3]
; CHECK-NEXT: flat_load_dword v5, v[0:1]
; CHECK-NEXT: v_mov_b32_e32 v1, 48
; CHECK-NEXT: flat_load_dwordx2 v[4:5], v[0:1]
; CHECK-NEXT: flat_load_dwordx2 v[6:7], v[2:3]
; CHECK-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; CHECK-NEXT: v_or_b32_e32 v0, v5, v4
; CHECK-NEXT: v_or_b32_e32 v1, v5, v7
; CHECK-NEXT: v_or_b32_e32 v0, v4, v6
Contributor

What's going on here? Why was it previously able to deduce v1 as 48, but now it can't?

Contributor Author

atm, I'm not sure.

Contributor Author

Almost certainly the v2i32 got scalarised by the legalizer, etc. I don't think it matters, as we don't have any extra instructions, but I'll take a look at llc anyway and see which rule got applied.

Contributor Author

Yep, the expansion of the non-legal v2i32 disjoint or was reduced down to a constant through various legalisations and combines, but now that the v2i32 disjoint or is legal that no longer happens; instead, ISel breaks the v2i32 disjoint or into two i32 or instructions.
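
For reference, the v2i32 disjoint or has roughly this IR shape (an illustrative sketch, not the actual test):

define <2 x i32> @disjoint_or_sketch(<2 x i32> %a, <2 x i32> %b) {
  ; 'disjoint' asserts the operands share no set bits. Previously the illegal
  ; v2i32 or was scalarised early and the resulting i32 ors could be folded
  ; further (down to a constant in the original test); now the legal v2i32 or
  ; survives to ISel and is simply split into two v_or_b32 instructions.
  %r = or disjoint <2 x i32> %a, %b
  ret <2 x i32> %r
}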

Contributor

Having a bit of a better look into this, something still feels wrong here. From what I understand, 48 is materialized through the propagation of range metadata so this basically undoes the exact thing it's testing. But I have to admit I'm not sure if this is something that should be fixed here or would be more suitable for a follow up, mostly because there's more happening in this test that I don't exactly understand (e.g., v2i16 can be wholly deduced while v2i64 can only deduce one of its elements through the metadata range?)

All in all, I'd at least want to have some visibility on it (i.e., ticket, gh issue)
@LU-JOHN considering you've added the test, any opinions on this test changing?

Comment on lines +70 to +71
; CHECK-NEXT: [[DEF:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
; CHECK-NEXT: [[DEF1:%[0-9]+]]:sgpr_32 = IMPLICIT_DEF
Contributor

I don't completely understand these implicit defs

Contributor Author

I'll be honest. Nor do I.

Contributor Author

Investigating now.

Contributor

I saw some other implicit_defs occurring in another test (https://github.com/llvm/llvm-project/blob/main/llvm/test/CodeGen/AMDGPU/legalize-amdgcn.raw.ptr.buffer.store.format.f32.ll) and it seems to be caused by some juggling of sgpr/vgpr with reg_sequence in SIFixSGPRCopies.

sgpr = reg_sequence vgpr, ...., vgpr, ....

to

sgpr = copy vgpr
sgpr = copy vgpr
sgpr = reg_sequence sgpr, ...., sgpr, ....

to

sgpr = implicit_def
sgpr = implicit_def
vgpr = reg_sequence vgpr, ...., vgpr, ....
<uses changes to vgpr>

I assume it's a non-issue? I'd like to see it not emit implicit_def but I think it's a separate issue if it does matter.

Contributor Author

Looks like perhaps we need a fix in SIFixSGPRCopies

Contributor

I've created #144518

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 2 times, most recently from 71b1e29 to b3b53fe on June 10, 2025 10:44
@chrisjbris (Contributor Author)

Rebased commits and updated the commit message for easier review.

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 3 times, most recently from 6245978 to d77c537 on June 19, 2025 13:05
Contributor Author

@chrisjbris commented Jun 19, 2025

Just rebased on main.

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from d77c537 to ccb196c on June 19, 2025 13:26
@@ -4056,6 +4056,53 @@ SDValue AMDGPUTargetLowering::performShlCombine(SDNode *N,
SDLoc SL(N);
SelectionDAG &DAG = DCI.DAG;

// When the shl64_reduce optimisation code is passed through vector
Contributor

Is this working around something missing in SimplifyDemandedBits?

Contributor Author

@chrisjbris commented Jun 23, 2025

I don't believe so; my understanding was that ISel was struggling with the operand of the scalar shl being an extract_vector_elt of the vector AND instruction's result.
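
As an illustrative sketch (not taken from shl64_reduce.ll; names invented), one IR analogue of that shape is:

define i64 @shl64_reduce_sketch(i64 %x, <2 x i32> %amt) {
  ; The mask to 63 is normally redundant because the 64-bit shift instruction
  ; masks its amount anyway, but that elision fires when ISel sees
  ; (shl x, (and y, 63)). With v2i32 and legal, the DAG instead contains
  ; (shl x, (extract_vector_elt (and v, 63), 0)), so the combiner change
  ; re-scalarises the and to keep the fold working.
  %masked = and <2 x i32> %amt, <i32 63, i32 63>
  %amt0 = extractelement <2 x i32> %masked, i32 0
  %ext = zext i32 %amt0 to i64
  %shift = shl i64 %x, %ext
  ret i64 %shift
}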

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch 3 times, most recently from e17b081 to 137d58d on June 23, 2025 16:22
[AMDGPU][SDAG] Legalise v2i32 or/xor/and instructions to make use of 64-bit wide instructions

Make use of s_or_b64/s_and_b64/s_xor_b64 for v2i32. Legalising these
causes a number of test regressions, so extra work in the combiner and
Tablegen patterns was necessary.

- Use Custom lowering for v2i32 rotr instead of additional patterns. Modify
performOrCombine() to remove some identity or operations.

- Fix rotr regression by adding lowerRotr() on the legalizer codepath.

- Add test case to rotr.ll

- Extend performFNEGCombine() for the SELECT case.

- Modify performSelectCombine() and foldFreeOpFromSelect to prevent the
performFNEGCombine() changes from being unwound.

- Add cases to or.ll and xor.ll to demonstrate the generation of the
  s_or_64 and s_xor_64 instructions for the v2i32 cases. Previously
  this was inhibited by "-amdgpu-scalarize-global-loads=false".

- Fix shl/srl64_reduce regression by performing the scalarisation
previously performed by the vector legaliser in the combiner.
@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 816ef66 to 50ecdff on June 24, 2025 11:13
@chrisjbris (Contributor Author)

Rebased on main.

Comment on lines +13269 to +13279
SDLoc DL(N);
SDValue CastLHS =
    DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(1));
SDValue CastRHS =
    DAG.getNode(ISD::BITCAST, DL, MVT::v2f32, LHS->getOperand(2));
SDValue FNegLHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastLHS);
SDValue FNegRHS = DAG.getNode(ISD::FNEG, DL, MVT::v2f32, CastRHS);
SDValue NewSelect = DAG.getNode(ISD::VSELECT, DL, MVT::v2f32,
                                LHS->getOperand(0), FNegLHS, FNegRHS);
return DAG.getNode(ISD::BITCAST, DL, VT, NewSelect);
}
Contributor

I'd prefer to handle using fneg/fabs on integer select as a separate patch

Contributor Author

How do you suggest we prevent the test regression without including this code in this diff?
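
For context, the pattern that the quoted combine matches is roughly this IR shape (an illustrative sketch; the constant is the per-lane sign mask):

define <2 x i32> @fneg_of_select_sketch(<2 x i1> %c, <2 x i32> %a, <2 x i32> %b) {
  ; xor with the sign mask negates the sign bit of each lane; the combine
  ; rewrites this as an fneg of an f32 vselect (via bitcasts) so the negation
  ; can be folded into the source modifiers of the resulting v_cndmask_b32.
  %sel = select <2 x i1> %c, <2 x i32> %a, <2 x i32> %b
  %neg = xor <2 x i32> %sel, <i32 -2147483648, i32 -2147483648>
  ret <2 x i32> %neg
}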

Comment on lines +4796 to +4812
// select c, (fneg (f32 bitcast i32 x)), (fneg (f32 bitcast i32 y)) can be
// lowered directly to a V_CNDMASK_. So prevent the fneg from being pulled
// out in this case. For now I've made the logic as specific to the case as
// possible, hopefully this can be relaxed in future.
if (LHS.getOpcode() == ISD::FNEG && RHS.getOpcode() == ISD::FNEG) {
  SDValue LHSB = LHS.getOperand(0);
  SDValue RHSB = RHS.getOperand(0);
  if (LHSB.getOpcode() == ISD::BITCAST &&
      RHSB->getOpcode() == ISD::BITCAST) {
    EVT LHSBOpTy = LHSB->getOperand(0).getValueType();
    EVT RHSBOpTy = RHSB->getOperand(0).getValueType();
    if (LHSB.getValueType() == MVT::f32 &&
        RHSB.getValueType() == MVT::f32 && LHSBOpTy == MVT::i32 &&
        RHSBOpTy == MVT::i32)
      return SDValue();
  }
}
Contributor

I think this is hacking around allUsesHaveSourceMods not knowing about this particular select case?

Contributor Author

Thanks, I think you're correct. I'll have a look at modifying hasSourceMods() and selectSupportsSourceMods() as an alternative.

Contributor Author

@chrisjbris commented Jun 25, 2025

So I had a good go at this. hasSourceMods() has to be modified to return false for BUILD_VECTOR and BITCAST. For each case I inspect the users and guard on type and opcode. However, this causes regressions in other tests in fneg-modifier-casting.ll, because other (non-foldFreeOpFromSelect()) callers of allUsesHaveSourceMods() hit the new code paths. For example, for the bitcast case in hasSourceMods() I end up with:

LLVM_READONLY
static bool bitcastSupportsSourceMods(const SDNode *N) {
  for (const SDNode *U : N->users())
    if ((U->getOpcode() == ISD::BUILD_VECTOR &&
         U->getValueType(0) == MVT::v2i32) ||
        (U->getOpcode() == ISD::CopyToReg && U->getValueType(0) == MVT::Other))
      return false;

  return true;
}

and the case in hasSourceMods() becomes:

case ISD::BITCAST: {
  if (Src->getOpcode() != ISD::SELECT)
    return bitcastSupportsSourceMods(N);

Which still isn't a specific enough guard to limit it to this case.

Contributor Author

I could simply move my existing logic from foldFreeOpFromSelect() into allUsesHaveSourceMods(), but I don't know if that is beneficial.

Comment on lines 4060 to 4062
// legalization some scalarising occurs. After ISD::AND was legalised, this
// resulted in the AND instructions no longer being elided, as mentioned
// below. The following code should make sure this takes place.
Contributor

This is explaining some historical debugging and not directly what this is doing (DAG-style comments are also helpful).

This also feels like it's working around a generic combiner issue

Contributor Author

Thanks, I'll remove the comments. The combiner issue is common to all the new shift64_reduce optimisations, e.g. chrisjbris@c4caf00

@@ -4189,6 +4234,53 @@ SDValue AMDGPUTargetLowering::performSrlCombine(SDNode *N,
SDLoc SL(N);
unsigned RHSVal;

// When the shl64_reduce optimisation code is passed through vector
Contributor

Same here: this looks like the same long combine repeated a second time?

Contributor Author

@chrisjbris commented Jun 24, 2025

Yes, it's the same combine correction, so that ISel sees (shl (and)) rather than (shl (extract_element (and))).

@chrisjbris force-pushed the 124775_AMDGPU_v2i32_bitwise_ops branch from 310d14e to fdc5c33 on June 25, 2025 18:20