[LV] Use VPReductionRecipe for partial reductions #144908
Conversation
Partial reductions can easily be represented by the VPReductionRecipe class by setting their scale factor to something greater than 1. This PR merges the two together and gives VPReductionRecipe a VFScaleFactor so that it can choose to generate the partial reduction intrinsic at execute time. This also leads to partial reductions naturally being included in VPMulAccumulateRecipe, which is nice for hiding the cost of the extends and mul, but it does have the side effect of generating an unnecessary extend for chained partial reduction cases. I don't think this can be avoided nicely, and it should be eliminated by DCE anyway.
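To make the scale-factor framing concrete, here is a minimal sketch (not taken from the patch; the function name is made up) of the kind of scalar loop that becomes a partial reduction: the i8 inputs are widened to i32 before the multiply, so at VF = 16 the inputs occupy <16 x i32> while the accumulator only needs <4 x i32>, i.e. a scale factor of 4.

// Hypothetical example, not from the PR: a dot-product reduction that the
// loop vectorizer can lower to the partial reduction intrinsic. The
// accumulator vector is narrower than the input vectors by the scale factor
// described above.
int dot_product(const signed char *a, const signed char *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += (int)a[i] * (int)b[i]; // sext(a[i]) * sext(b[i]), accumulated into i32
  return sum;
}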
@llvm/pr-subscribers-llvm-transforms @llvm/pr-subscribers-vectorizers
Author: Sam Tebbs (SamTebbs33)
Changes: (same as the PR description above)
Patch is 247.38 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/144908.diff
15 Files Affected:
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index 8f4ce80ada5ed..588e58f053cb7 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -223,6 +223,8 @@ class TargetTransformInfo {
/// Get the kind of extension that an instruction represents.
LLVM_ABI static PartialReductionExtendKind
getPartialReductionExtendKind(Instruction *I);
+ LLVM_ABI static PartialReductionExtendKind
+ getPartialReductionExtendKind(Instruction::CastOps CastOpc);
/// Construct a TTI object using a type implementing the \c Concept
/// API below.
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 2d053e55bdfa9..06f43a4303fa8 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -994,11 +994,22 @@ InstructionCost TargetTransformInfo::getShuffleCost(
}
TargetTransformInfo::PartialReductionExtendKind
-TargetTransformInfo::getPartialReductionExtendKind(Instruction *I) {
- if (isa<SExtInst>(I))
- return PR_SignExtend;
- if (isa<ZExtInst>(I))
+TargetTransformInfo::getPartialReductionExtendKind(
+ Instruction::CastOps CastOpc) {
+ switch (CastOpc) {
+ case Instruction::CastOps::ZExt:
return PR_ZeroExtend;
+ case Instruction::CastOps::SExt:
+ return PR_SignExtend;
+ default:
+ return PR_None;
+ }
+}
+
+TargetTransformInfo::PartialReductionExtendKind
+TargetTransformInfo::getPartialReductionExtendKind(Instruction *I) {
+ if (auto *Cast = dyn_cast<CastInst>(I))
+ return getPartialReductionExtendKind(Cast->getOpcode());
return PR_None;
}
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f1470fd1f7314..fff8a856b3248 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -7050,7 +7050,8 @@ static bool planContainsAdditionalSimplifications(VPlan &Plan,
}
// The VPlan-based cost model is more accurate for partial reduction and
// comparing against the legacy cost isn't desirable.
- if (isa<VPPartialReductionRecipe>(&R))
+ if (auto *VPR = dyn_cast<VPReductionRecipe>(&R);
+ VPR && VPR->isPartialReduction())
return true;
/// If a VPlan transform folded a recipe to one producing a single-scalar,
@@ -8278,11 +8279,15 @@ VPRecipeBase *VPRecipeBuilder::tryToCreateWidenRecipe(VPSingleDefRecipe *R,
Phi->getIncomingValueForBlock(OrigLoop->getLoopPreheader()));
// If the PHI is used by a partial reduction, set the scale factor.
- unsigned ScaleFactor =
- getScalingForReduction(RdxDesc.getLoopExitInstr()).value_or(1);
- PhiRecipe = new VPReductionPHIRecipe(
- Phi, RdxDesc, *StartV, CM.isInLoopReduction(Phi),
- CM.useOrderedReductions(RdxDesc), ScaleFactor);
+ bool UseInLoopReduction = CM.isInLoopReduction(Phi);
+ bool UseOrderedReductions = CM.useOrderedReductions(RdxDesc);
+ auto ScaleFactor = ElementCount::getFixed(
+ (UseOrderedReductions || UseInLoopReduction)
+ ? 0
+ : getScalingForReduction(RdxDesc.getLoopExitInstr()).value_or(1));
+ PhiRecipe = new VPReductionPHIRecipe(Phi, RdxDesc, *StartV,
+ CM.isInLoopReduction(Phi),
+ UseOrderedReductions, ScaleFactor);
} else {
// TODO: Currently fixed-order recurrences are modeled as chains of
// first-order recurrences. If there are no users of the intermediate
@@ -8315,7 +8320,8 @@ VPRecipeBase *VPRecipeBuilder::tryToCreateWidenRecipe(VPSingleDefRecipe *R,
return tryToWidenMemory(Instr, Operands, Range);
if (std::optional<unsigned> ScaleFactor = getScalingForReduction(Instr))
- return tryToCreatePartialReduction(Instr, Operands, ScaleFactor.value());
+ return tryToCreatePartialReduction(
+ Instr, Operands, ElementCount::getFixed(ScaleFactor.value()));
if (!shouldWiden(Instr, Range))
return nullptr;
@@ -8338,7 +8344,7 @@ VPRecipeBase *VPRecipeBuilder::tryToCreateWidenRecipe(VPSingleDefRecipe *R,
VPRecipeBase *
VPRecipeBuilder::tryToCreatePartialReduction(Instruction *Reduction,
ArrayRef<VPValue *> Operands,
- unsigned ScaleFactor) {
+ ElementCount ScaleFactor) {
assert(Operands.size() == 2 &&
"Unexpected number of operands for partial reduction");
@@ -8346,7 +8352,8 @@ VPRecipeBuilder::tryToCreatePartialReduction(Instruction *Reduction,
VPValue *Accumulator = Operands[1];
VPRecipeBase *BinOpRecipe = BinOp->getDefiningRecipe();
if (isa<VPReductionPHIRecipe>(BinOpRecipe) ||
- isa<VPPartialReductionRecipe>(BinOpRecipe))
+ (isa<VPReductionRecipe>(BinOpRecipe) &&
+ cast<VPReductionRecipe>(BinOpRecipe)->isPartialReduction()))
std::swap(BinOp, Accumulator);
unsigned ReductionOpcode = Reduction->getOpcode();
@@ -8367,12 +8374,10 @@ VPRecipeBuilder::tryToCreatePartialReduction(Instruction *Reduction,
"Expected an ADD or SUB operation for predicated partial "
"reductions (because the neutral element in the mask is zero)!");
Cond = getBlockInMask(Builder.getInsertBlock());
- VPValue *Zero =
- Plan.getOrAddLiveIn(ConstantInt::get(Reduction->getType(), 0));
- BinOp = Builder.createSelect(Cond, BinOp, Zero, Reduction->getDebugLoc());
}
- return new VPPartialReductionRecipe(ReductionOpcode, Accumulator, BinOp, Cond,
- ScaleFactor, Reduction);
+
+ return new VPReductionRecipe(RecurKind::Add, FastMathFlags(), Reduction,
+ Accumulator, BinOp, Cond, false, ScaleFactor);
}
void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
@@ -9139,9 +9144,12 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
FastMathFlags FMFs = isa<FPMathOperator>(CurrentLinkI)
? RdxDesc.getFastMathFlags()
: FastMathFlags();
+ bool UseOrderedReductions = CM.useOrderedReductions(RdxDesc);
+ ElementCount VFScaleFactor =
+ ElementCount::getFixed(!UseOrderedReductions);
auto *RedRecipe = new VPReductionRecipe(
Kind, FMFs, CurrentLinkI, PreviousLink, VecOp, CondOp,
- CM.useOrderedReductions(RdxDesc), CurrentLinkI->getDebugLoc());
+ UseOrderedReductions, VFScaleFactor, CurrentLinkI->getDebugLoc());
// Append the recipe to the end of the VPBasicBlock because we need to
// ensure that it comes after all of it's inputs, including CondOp.
// Delete CurrentLink as it will be invalid if its operand is replaced
@@ -9175,8 +9183,9 @@ void LoopVectorizationPlanner::adjustRecipesForReductions(
// Don't output selects for partial reductions because they have an output
// with fewer lanes than the VF. So the operands of the select would have
// different numbers of lanes. Partial reductions mask the input instead.
+ auto *RR = dyn_cast<VPReductionRecipe>(OrigExitingVPV->getDefiningRecipe());
if (!PhiR->isInLoop() && CM.foldTailByMasking() &&
- !isa<VPPartialReductionRecipe>(OrigExitingVPV->getDefiningRecipe())) {
+ (!RR || !RR->isPartialReduction())) {
VPValue *Cond = RecipeBuilder.getBlockInMask(PhiR->getParent());
std::optional<FastMathFlags> FMFs =
PhiTy->isFloatingPointTy()
diff --git a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
index 8369c78a2d78f..ac345a0d020e1 100644
--- a/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
+++ b/llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
@@ -172,7 +172,7 @@ class VPRecipeBuilder {
/// along with binary operation and reduction phi operands.
VPRecipeBase *tryToCreatePartialReduction(Instruction *Reduction,
ArrayRef<VPValue *> Operands,
- unsigned ScaleFactor);
+ ElementCount ScaleFactor);
/// Set the recipe created for given ingredient.
void setRecipe(Instruction *I, VPRecipeBase *R) {
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 5a3c4a514a5dd..9e38771e746df 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -553,7 +553,6 @@ class VPSingleDefRecipe : public VPRecipeBase, public VPValue {
case VPRecipeBase::VPWidenIntOrFpInductionSC:
case VPRecipeBase::VPWidenPointerInductionSC:
case VPRecipeBase::VPReductionPHISC:
- case VPRecipeBase::VPPartialReductionSC:
return true;
case VPRecipeBase::VPBranchOnMaskSC:
case VPRecipeBase::VPInterleaveSC:
@@ -2189,18 +2188,21 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
/// When expanding the reduction PHI, the plan's VF element count is divided
/// by this factor to form the reduction phi's VF.
- unsigned VFScaleFactor = 1;
+ ElementCount VFScaleFactor;
public:
/// Create a new VPReductionPHIRecipe for the reduction \p Phi described by \p
/// RdxDesc.
VPReductionPHIRecipe(PHINode *Phi, const RecurrenceDescriptor &RdxDesc,
VPValue &Start, bool IsInLoop = false,
- bool IsOrdered = false, unsigned VFScaleFactor = 1)
+ bool IsOrdered = false,
+ ElementCount VFScaleFactor = ElementCount::getFixed(1))
: VPHeaderPHIRecipe(VPDef::VPReductionPHISC, Phi, &Start),
RdxDesc(RdxDesc), IsInLoop(IsInLoop), IsOrdered(IsOrdered),
VFScaleFactor(VFScaleFactor) {
assert((!IsOrdered || IsInLoop) && "IsOrdered requires IsInLoop");
+ assert(((!IsInLoop && !IsOrdered) || VFScaleFactor.isZero()) &&
+ "Invalid VFScaleFactor");
}
~VPReductionPHIRecipe() override = default;
@@ -2219,7 +2221,7 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
void execute(VPTransformState &State) override;
/// Get the factor that the VF of this recipe's output should be scaled by.
- unsigned getVFScaleFactor() const { return VFScaleFactor; }
+ ElementCount getVFScaleFactor() const { return VFScaleFactor; }
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
@@ -2237,6 +2239,10 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
/// Returns true, if the phi is part of an in-loop reduction.
bool isInLoop() const { return IsInLoop; }
+ bool isPartialReduction() const {
+ return ElementCount::isKnownGT(VFScaleFactor, ElementCount::getFixed(1));
+ }
+
/// Returns true if the recipe only uses the first lane of operand \p Op.
bool onlyFirstLaneUsed(const VPValue *Op) const override {
assert(is_contained(operands(), Op) &&
@@ -2408,23 +2414,32 @@ class VPInterleaveRecipe : public VPRecipeBase {
Instruction *getInsertPos() const { return IG->getInsertPos(); }
};
-/// A recipe to represent inloop reduction operations, performing a reduction on
-/// a vector operand into a scalar value, and adding the result to a chain.
-/// The Operands are {ChainOp, VecOp, [Condition]}.
+/// A recipe to represent inloop, ordered or partial reduction operations. It
+/// performs a reduction on a vector operand into a scalar (vector in the case
+/// of a partial reduction) value, and adds the result to a chain. The Operands
+/// are {ChainOp, VecOp, [Condition]}.
class VPReductionRecipe : public VPRecipeWithIRFlags {
/// The recurrence kind for the reduction in question.
RecurKind RdxKind;
bool IsOrdered;
/// Whether the reduction is conditional.
bool IsConditional = false;
+ /// The scaling factor, relative to the VF, that this recipe's output is
+ /// divided by.
+ /// For outer-loop reductions this is equal to 1.
+ /// For in-loop reductions this is equal to 0, to specify that this is equal
+ /// to the VF (which may not be known yet). For partial-reductions this is
+ /// equal to another scalar value.
+ ElementCount VFScaleFactor;
protected:
VPReductionRecipe(const unsigned char SC, RecurKind RdxKind,
FastMathFlags FMFs, Instruction *I,
ArrayRef<VPValue *> Operands, VPValue *CondOp,
- bool IsOrdered, DebugLoc DL)
+ bool IsOrdered, ElementCount VFScaleFactor, DebugLoc DL)
: VPRecipeWithIRFlags(SC, Operands, FMFs, DL), RdxKind(RdxKind),
- IsOrdered(IsOrdered) {
+ IsOrdered(IsOrdered), VFScaleFactor(VFScaleFactor) {
+ assert((!IsOrdered || VFScaleFactor.isZero()) && "Invalid scale factor");
if (CondOp) {
IsConditional = true;
addOperand(CondOp);
@@ -2436,9 +2451,11 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
/// Note that the debug location is from the extend.
VPReductionRecipe(const unsigned char SC, const RecurKind RdxKind,
ArrayRef<VPValue *> Operands, VPValue *CondOp,
- bool IsOrdered, DebugLoc DL)
+ bool IsOrdered, ElementCount VFScaleFactor, DebugLoc DL)
: VPRecipeWithIRFlags(SC, Operands, DL), RdxKind(RdxKind),
- IsOrdered(IsOrdered), IsConditional(CondOp) {
+ IsOrdered(IsOrdered), IsConditional(CondOp),
+ VFScaleFactor(VFScaleFactor) {
+ assert((!IsOrdered || VFScaleFactor.isZero()) && "Invalid scale factor");
if (CondOp)
addOperand(CondOp);
}
@@ -2447,9 +2464,12 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
/// Note that the NUW/NSW flags and the debug location are from the Mul.
VPReductionRecipe(const unsigned char SC, const RecurKind RdxKind,
ArrayRef<VPValue *> Operands, VPValue *CondOp,
- bool IsOrdered, WrapFlagsTy WrapFlags, DebugLoc DL)
+ bool IsOrdered, ElementCount VFScaleFactor,
+ WrapFlagsTy WrapFlags, DebugLoc DL)
: VPRecipeWithIRFlags(SC, Operands, WrapFlags, DL), RdxKind(RdxKind),
- IsOrdered(IsOrdered), IsConditional(CondOp) {
+ IsOrdered(IsOrdered), IsConditional(CondOp),
+ VFScaleFactor(VFScaleFactor) {
+ assert((!IsOrdered || VFScaleFactor.isZero()) && "Invalid scale factor");
if (CondOp)
addOperand(CondOp);
}
@@ -2457,24 +2477,26 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
public:
VPReductionRecipe(RecurKind RdxKind, FastMathFlags FMFs, Instruction *I,
VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
- bool IsOrdered, DebugLoc DL = {})
+ bool IsOrdered, ElementCount VFScaleFactor,
+ DebugLoc DL = {})
: VPReductionRecipe(VPDef::VPReductionSC, RdxKind, FMFs, I,
ArrayRef<VPValue *>({ChainOp, VecOp}), CondOp,
- IsOrdered, DL) {}
+ IsOrdered, VFScaleFactor, DL) {}
VPReductionRecipe(const RecurKind RdxKind, FastMathFlags FMFs,
VPValue *ChainOp, VPValue *VecOp, VPValue *CondOp,
- bool IsOrdered, DebugLoc DL = {})
+ bool IsOrdered, ElementCount VFScaleFactor,
+ DebugLoc DL = {})
: VPReductionRecipe(VPDef::VPReductionSC, RdxKind, FMFs, nullptr,
ArrayRef<VPValue *>({ChainOp, VecOp}), CondOp,
- IsOrdered, DL) {}
+ IsOrdered, VFScaleFactor, DL) {}
~VPReductionRecipe() override = default;
VPReductionRecipe *clone() override {
- return new VPReductionRecipe(RdxKind, getFastMathFlags(),
- getUnderlyingInstr(), getChainOp(), getVecOp(),
- getCondOp(), IsOrdered, getDebugLoc());
+ return new VPReductionRecipe(
+ RdxKind, getFastMathFlags(), getUnderlyingInstr(), getChainOp(),
+ getVecOp(), getCondOp(), IsOrdered, VFScaleFactor, getDebugLoc());
}
static inline bool classof(const VPRecipeBase *R) {
@@ -2508,6 +2530,10 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
bool isOrdered() const { return IsOrdered; };
/// Return true if the in-loop reduction is conditional.
bool isConditional() const { return IsConditional; };
+ /// Return true if the reduction is a partial reduction.
+ bool isPartialReduction() const {
+ return ElementCount::isKnownGT(VFScaleFactor, ElementCount::getFixed(1));
+ }
/// The VPValue of the scalar Chain being accumulated.
VPValue *getChainOp() const { return getOperand(0); }
/// The VPValue of the vector value to be reduced.
@@ -2516,65 +2542,8 @@ class VPReductionRecipe : public VPRecipeWithIRFlags {
VPValue *getCondOp() const {
return isConditional() ? getOperand(getNumOperands() - 1) : nullptr;
}
-};
-
-/// A recipe for forming partial reductions. In the loop, an accumulator and
-/// vector operand are added together and passed to the next iteration as the
-/// next accumulator. After the loop body, the accumulator is reduced to a
-/// scalar value.
-class VPPartialReductionRecipe : public VPReductionRecipe {
- unsigned Opcode;
-
- /// The divisor by which the VF of this recipe's output should be divided
- /// during execution.
- unsigned VFScaleFactor;
-
-public:
- VPPartialReductionRecipe(Instruction *ReductionInst, VPValue *Op0,
- VPValue *Op1, VPValue *Cond, unsigned VFScaleFactor)
- : VPPartialReductionRecipe(ReductionInst->getOpcode(), Op0, Op1, Cond,
- VFScaleFactor, ReductionInst) {}
- VPPartialReductionRecipe(unsigned Opcode, VPValue *Op0, VPValue *Op1,
- VPValue *Cond, unsigned ScaleFactor,
- Instruction *ReductionInst = nullptr)
- : VPReductionRecipe(VPDef::VPPartialReductionSC, RecurKind::Add,
- FastMathFlags(), ReductionInst,
- ArrayRef<VPValue *>({Op0, Op1}), Cond, false, {}),
- Opcode(Opcode), VFScaleFactor(ScaleFactor) {
- [[maybe_unused]] auto *AccumulatorRecipe =
- getChainOp()->getDefiningRecipe();
- assert((isa<VPReductionPHIRecipe>(AccumulatorRecipe) ||
- isa<VPPartialReductionRecipe>(AccumulatorRecipe)) &&
- "Unexpected operand order for partial reduction recipe");
- }
- ~VPPartialReductionRecipe() override = default;
-
- VPPartialReductionRecipe *clone() override {
- return new VPPartialReductionRecipe(Opcode, getOperand(0), getOperand(1),
- getCondOp(), VFScaleFactor,
- getUnderlyingInstr());
- }
-
- VP_CLASSOF_IMPL(VPDef::VPPartialReductionSC)
-
- /// Generate the reduction in the loop.
- void execute(VPTransformState &State) override;
-
- /// Return the cost of this VPPartialReductionRecipe.
- InstructionCost computeCost(ElementCount VF,
- VPCostContext &Ctx) const override;
-
- /// Get the binary op's opcode.
- unsigned getOpcode() const { return Opcode; }
-
/// Get the factor that the VF of this recipe's output should be scaled by.
- unsigned getVFScaleFactor() const { return VFScaleFactor; }
-
-#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
- /// Print the recipe.
- void print(raw_ostream &O, const Twine &Indent,
- VPSlotTracker &SlotTracker) const override;
-#endif
+ ElementCount getVFScaleFactor() const { return VFScaleFactor; }
};
/// A recipe to represent inloop reduction operations with vector-predication
@@ -2590,7 +2559,7 @@ class VPReductionEVLRecipe : public VPReductionRecipe {
R.getFastMathFlags(),
cast_or_null<Instruction>(R.getUnderlyingValue()),
ArrayRef<VPValue *>({R.getChainOp(), R.getVecOp(), &EVL}), CondOp,
- R.isOrdered(), DL) {}
+ R.isOrdered(), ElementCount::getFixed(0), DL) {}
~VPReductionEVLRecipe() override = default;
@@ -2634,10 +2603,11 @@ class VPExtendedReductionRecipe : public VPReductionRecipe {
/// For cloning VPExtendedReductionRecipe.
VPExtendedReductionRecipe(VPExtendedReductionRecipe *ExtRed)
- ...
[truncated]
class VPReductionRecipe : public VPRecipeWithIRFlags {
/// The recurrence kind for the reduction in question.
RecurKind RdxKind;
bool IsOrdered;
/// Whether the reduction is conditional.
bool IsConditional = false;
/// The scaling factor, relative to the VF, that this recipe's output is
/// divided by.
/// For outer-loop reductions this is equal to 1.
Aren't VPReductionRecipes for in-loop reductions only? Should this say for unordered reductions?
I'm also a bit confused since I thought both unordered and ordered reductions also produced a scalar result, so the VFScaleFactor would always be 1 for them?
Aren't VPReductionRecipes for in-loop reductions only?
Partial-reductions are a form of in-loop reductions. They partially reduce (in-loop) to a smaller vector, and in the outer-loop further reduce to a scalar. The reason for making this change is so that we benefit from the code we'd otherwise have to duplicate for partial reductions. Given that in-loop/ordered and unordered reductions are represented by VPReductionRecipe, it seems like a natural extension to use this class to represent partial reductions as well.
Should this say for unordered reductions?
Ordered reductions must be in-loop reductions although the inverse is not true, i.e. in-loop reductions are not required to be ordered reductions. It's a target's choice to implement unordered reductions in-loop. Similarly, it's a target's choice to implement unordered reductions with a partial-reduction.
I'm also a bit confused since I thought both unordered and ordered reductions also produced a scalar result, so the VFScaleFactor would be always be 1 for them?
- In-order and in-loop reductions use a scalar PHI
- partial-reductions use a vector PHI (albeit scaled down by the VFScaleFactor)
- 'regular' out-of-order reductions use a vector PHI (not scaled <=> scaled by a factor of 1)

For ordered/in-loop reductions, the result type is a scalar, so we'd scale the result VF down to a scalar (1), meaning that the scaling factor must be equal to the VF (VF/VF=1). But the way VPlan works, the VF is not necessarily known yet at this point; it may still be a range of VFs, so we instead represent that with a VFScaleFactor of 0.
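A small sketch (not part of the patch) of how the three cases above map onto a result VF, using a plain unsigned factor for brevity; the helper name getResultVF is made up for illustration.

#include "llvm/Support/TypeSize.h"
using llvm::ElementCount;

// Hypothetical helper, only to illustrate the encoding described above:
//  0  -> in-loop/ordered reduction, the result collapses to a scalar,
//  1  -> regular unordered reduction, the result keeps the full VF,
//  >1 -> partial reduction, the result VF is divided by the factor.
static ElementCount getResultVF(ElementCount VF, unsigned VFScaleFactor) {
  if (VFScaleFactor == 0)
    return ElementCount::getFixed(1);           // e.g. <vscale x 4 x i32> -> i32
  return VF.divideCoefficientBy(VFScaleFactor); // e.g. <16 x i32> -> <4 x i32> for factor 4
}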
I still need to take a closer look to see how everything fits together, but I am not sure if it is a good idea to make the existing VPReductionRecipe even more complicated.
The main motivation is to use VPMulAccumulateReductionRecipe for partial reductions?
Yeah, bundling comes mostly for free if partial reductions are just reductions, and otherwise it doesn't really make sense to differentiate them, considering that the only difference is the scale factor being > 1. I think lowering complexity by reducing the number of different recipes in existence is worth the small bit of extra complexity in the reduction class.
Out-of-order in-loop reductions are a special kind of partial reduction, in that they reduce to a VF of 1. Conversely, a partial reduction is an out-of-order in-loop reduction reducing to a smaller VF. Adding partial reductions is therefore a natural extension and both can be described with a single property (VFScaleFactor).
With hindsight, I believe that it was a mistake to keep partial reductions in a separate subclass from all the other ones. If this all would have used the same class from the start, the work done for VPMulAccumulateReductionRecipe, VPExtendedReductionRecipe and VPBundle would all have benefited partial reductions.
I'm not saying that we should always keep this as a single class, but I would first like to see the behaviours of partial reductions merged with the functionality for in-loop reductions, so that we don't have to keep adding new code specifically for partial reductions when it can just piggy-back on existing code.
/// For in-loop reductions this is equal to 0, to specify that this is equal
/// to the VF (which may not be known yet). For partial-reductions this is
/// equal to another scalar value.
ElementCount VFScaleFactor;
I might have missed it, but can VFScaleFactor be scalable (since it's now changed to ElementCount)?
I'm not sure if there's currently a compelling enough use-case for it, but there's no reason why we shouldn't be able to support a partial reduction of e.g. <vscale x 16 x i32> to <4 x i32>, which would require a VFScaleFactor of vscale x 4. An example of an instruction that does this for AArch64 would be addqv, which partially reduces e.g. a <vscale x 4 x i32> to a <4 x i32>.
There are several places (including the VPlan printing) that call .getFixedValue() (which asserts that the value is not scalable), so it seemed like an oversight in this PR.
Changing from unsigned -> ElementCount should be a separate change, once it is needed.
Agreed, I've removed the ElementCount changes.
? 0
: getScalingForReduction(RdxDesc.getLoopExitInstr()).value_or(1);
PhiRecipe = new VPReductionPHIRecipe(Phi, RdxDesc, *StartV,
CM.isInLoopReduction(Phi),
Suggested change:
- CM.isInLoopReduction(Phi),
+ UseInLoopReduction,
Thanks. It's no longer necessary with Sander's suggestion.
This also leads to partial reductions naturally being included in VPMulAccumulateRecipe, which is nice for hiding the cost of the extends and mul, but it does have the side effect of generating an unnecessary extend for chained partial reduction cases due to how those extends are re-generated. I don't think this can be avoided nicely, and it should be eliminated by DCE anyway.
I can confirm that this is an artifact resulting from the VPMulAccumulateRecipe, and is not an artifact of partial reductions. I tried to run one of the tests with the prefer-inloop-reductions flag, and see that we basically get the same codegen (albeit using a reduction-to-scalar rather than a partial-reduction)
if (auto *VPR = dyn_cast<VPReductionRecipe>(&R);
VPR && VPR->isPartialReduction())
nit: I know it's a style thing, so feel free to ignore, but below you wrote it like this:
Suggested change:
- if (auto *VPR = dyn_cast<VPReductionRecipe>(&R);
-     VPR && VPR->isPartialReduction())
+ if (isa<VPReductionRecipe>(R) &&
+     cast<VPReductionRecipe>(R).isPartialReduction())
Are you talking about here? That's different because it's checking if it's either a VPReductionRecipe or VPReductionPHIRecipe, and you can't declare two variables in an if-statement. I think this method is cleaner when there's just one variable to declare, since you don't have to check the type and then cast, you just cast.
}

TargetTransformInfo::PartialReductionExtendKind
TargetTransformInfo::getPartialReductionExtendKind(Instruction *I) {
This function can be removed. There's still one use in VPRecipeBuilder::getScaledReductions, but there the code can assume the input value is a zero/sign-extend, because of the match() clause above.
Can you pull out that change into a separate NFC commit?
The call to getPartialReductionCost in getScaledReductions needs to know the extend kind because, for example, mixed extension partial reductions are only valid on AArch64 with the i8mm extension. In the future, if we change the interface of getPartialReductionCost to not need the extend kinds, I can remove this one.
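For context, a minimal sketch (not from the patch; the function name is made up) of the mixed-extension pattern being referred to: one operand is sign-extended and the other zero-extended before the multiply, which on AArch64 maps onto the usdot-style instructions guarded by the i8mm extension.

// Hypothetical source pattern only. sext(i8) * zext(i8) accumulated into i32
// is the mixed-extension partial reduction whose cost query needs both
// extend kinds.
int mixed_extend_dot(const signed char *a, const unsigned char *b, int n) {
  int sum = 0;
  for (int i = 0; i < n; ++i)
    sum += (int)a[i] * (int)b[i]; // signed char sign-extends, unsigned char zero-extends
  return sum;
}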
/// to the VF (which may not be known yet). For partial-reductions this is
/// equal to another scalar value.
formatting nit:
Suggested change:
- /// to the VF (which may not be known yet). For partial-reductions this is
- /// equal to another scalar value.
+ /// to the VF (which may not be known yet).
+ /// For partial-reductions this is equal to another scalar value.
Done.
@@ -2201,6 +2200,8 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe,
RdxDesc(RdxDesc), IsInLoop(IsInLoop), IsOrdered(IsOrdered),
VFScaleFactor(VFScaleFactor) {
assert((!IsOrdered || IsInLoop) && "IsOrdered requires IsInLoop");
assert(((!IsInLoop && !IsOrdered) || VFScaleFactor == 0) &&
Because VFScaleFactor now encodes whether it is an inloop reduction or not, it's possible to remove IsInLoop now. Given that we're changing the interface, probably best to make that change in this PR.
Sounds good, done.
if (unsigned ScaleFactor = getVFScaleFactor(R); ScaleFactor > 1)
VF = VF.divideCoefficientBy(ScaleFactor);
nit: This can be:
Suggested change:
- if (unsigned ScaleFactor = getVFScaleFactor(R); ScaleFactor > 1)
-   VF = VF.divideCoefficientBy(ScaleFactor);
+ if (unsigned ScaleFactor = getVFScaleFactor(R))
+   VF = VF.divideCoefficientBy(ScaleFactor);
because scaling by 1 should result in the same VF.
Done.
match(Mul, m_Binary<Instruction::Sub>(m_SpecificInt(0), m_VPValue(Mul)));
if (match(Mul,
m_Mul(m_ZExtOrSExt(m_VPValue()), m_ZExtOrSExt(m_VPValue())))) {
The case you're trying to match is the case already implemented by the VPMulAccumulateReductionRecipe, so there's no need to try and match this again. For this default case, the call to getPartialReductionCost should assume no extends and no opcode.
Done. Using the simple case stopped mixed extension partial reductions from being turned into VPMulAccumulateRecipe, so I had to add support for mixed extension types to that class, which has made the change a bit bigger, but I hope it's more correct now.
@@ -2598,6 +2554,16 @@ VPExtendedReductionRecipe::computeCost(ElementCount VF,
InstructionCost
VPMulAccumulateReductionRecipe::computeCost(ElementCount VF,
VPCostContext &Ctx) const {
if (isPartialReduction())
Should VPExtendedReductionRecipe::computeCost also be updated?
Done.
ScaleFactor, Reduction);

return new VPReductionRecipe(RecurKind::Add, FastMathFlags(), Reduction,
Accumulator, BinOp, Cond, false, ScaleFactor);
Just sharing an observation in case other reviewers notice changes in CodeGen in partial-reduce-dot-product-neon.ll. Passing in the condition to the VPReductionRecipe rather than creating a select on the input and passing that into the VPReductionRecipe results in different, but better, code for partial-reduce-dot-product-neon.ll. The test loads from two pointers a and b. Before this change, it would load lane a[i] and b[i] under two separate predicated blocks, i.e. if(cond[i]) { load(a[i]) }, if(cond[i]) { load(b[i]) }, whereas by moving the condition to the VPReductionRecipe it now bundles these two predicated loads together in a single condition block: if(cond[i]) { load(a[i]), load(b[i]) }.
if (Red->isPartialReduction())
match(VecOp,
m_Binary<Instruction::Sub>(m_SpecificInt(0), m_VPValue(VecOp)));
It still needs to somehow encode that this operation is a sub-reduction instead of an add-reduction. At the moment, this just generates a regular add reduction, which is not correct.
Yes you're right, thanks. I'll work on fixing that.
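For reference, a small sketch (not from the patch; the function name is made up) of the identity the matcher above peels apart, and why the sign matters: a chained sub-reduction step acc -= x is equivalent to acc += (0 - x), so if the negation is stripped off to recover the mul, the reduction must still subtract (or the negation must be re-applied), otherwise the accumulated value flips sign.

// Hypothetical scalar illustration only.
int reduce_sub(const int *x, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i)
    acc += (0 - x[i]); // same as acc -= x[i]; dropping the "0 -" would negate the result
  return acc;
}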
// Some chained partial reductions used for complex numbers will have a
// negation between the mul and reduction. This extracts the mul from that
// pattern to use it for further checking.
nit: Instead of "Some chained partial reductions used for complex numbers" you can just say 'sub reductions'?
@@ -2591,24 +2528,46 @@ VPExtendedReductionRecipe::computeCost(ElementCount VF,
cast<VectorType>(toVectorTy(Ctx.Types.inferScalarType(getVecOp()), VF));
assert(RedTy->isIntegerTy() &&
"ExtendedReduction only support integer type currently.");
if (isPartialReduction())
return Ctx.TTI.getPartialReductionCost(Opcode, RedTy, SrcVecTy, SrcVecTy,
VF, TargetTransformInfo::PR_None,
Does this now assume that there are no extends? Does this assume that for the real use cases, it will get folded into a VPMulAccumulateRecipe?
Exactly, see here for the request to use the base case of no extends: #144908 (comment)
@@ -223,6 +223,8 @@ class TargetTransformInfo {
/// Get the kind of extension that an instruction represents.
LLVM_ABI static PartialReductionExtendKind
getPartialReductionExtendKind(Instruction *I);
LLVM_ABI static PartialReductionExtendKind
This change looks like it could be split off to reduce the diff a bit?
This could be an NFC, but I don't know if it's worth it: it wouldn't meaningfully reduce the diff, and committing something unused doesn't seem like a good idea in case it ends up being unneeded in this PR for whatever reason.