[VPlan] Handle FirstActiveLane when unrolling. #145394
Conversation
@llvm/pr-subscribers-vectorizers @llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes

Currently FirstActiveLane is not handled correctly during unrolling. This is causing mis-compiles when vectorizing early-exit loops with interleaving forced.

This patch updates the handling of FirstActiveLane to be analogous to computing final reduction results: during unrolling, the created copies of its original operand are added as additional operands, and FirstActiveLane will always produce the index of the first active lane across all unrolled iterations.

Note that some of the generated code is still incorrect, as we also need to handle ExtractElement with FirstActiveLane operands. I will share patches for those soon as well.

Patch is 44.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145394.diff

6 Files Affected:
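For illustration, here is a sketch of the recipe before and after unrolling with UF=4 (the VPlan value names are invented for this example, not taken from the patch):

Before: EMIT vp<%fal> = first-active-lane vp<%mask>
After:  EMIT vp<%fal> = first-active-lane vp<%mask>, vp<%mask.1>, vp<%mask.2>, vp<%mask.3>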
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index f4163b0743a9a..71373037ea9eb 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -955,7 +955,9 @@ class VPInstruction : public VPRecipeWithIRFlags,
// Returns a scalar boolean value, which is true if any lane of its (only
// boolean) vector operand is true.
AnyOf,
- // Calculates the first active lane index of the vector predicate operand.
+ // Calculates the first active lane index of the vector predicate operands.
+ // It produces the lane index across all unrolled iterations. Unrolling will
+ // add all copies of its original operand as additional operands.
FirstActiveLane,
// The opcodes below are used for VPInstructionWithType.
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index 805cd04c5ce35..e48ac2fde23cd 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -765,9 +765,35 @@ Value *VPInstruction::generate(VPTransformState &State) {
return Builder.CreateOrReduce(A);
}
case VPInstruction::FirstActiveLane: {
- Value *Mask = State.get(getOperand(0));
- return Builder.CreateCountTrailingZeroElems(Builder.getInt64Ty(), Mask,
- true, Name);
+ if (getNumOperands() == 1) {
+ Value *Mask = State.get(getOperand(0));
+ return Builder.CreateCountTrailingZeroElems(Builder.getInt64Ty(), Mask,
+ true, Name);
+ }
+ // If there are multiple operands, create a chain of selects to pick the
+ // first operand with an active lane and add the number of lanes of the
+ // preceding operands.
+ Value *RuntimeVF =
+ getRuntimeVF(State.Builder, State.Builder.getInt64Ty(), State.VF);
+ Type *ElemTy = State.TypeAnalysis.inferScalarType(getOperand(0));
+ Value *RuntimeBitwidth = Builder.CreateMul(
+ Builder.getInt64(ElemTy->getScalarSizeInBits()), RuntimeVF);
+ unsigned LastOpIdx = getNumOperands() - 1;
+ Value *Res = nullptr;
+ for (int Idx = LastOpIdx; Idx >= 0; --Idx) {
+ Value *Current = Builder.CreateCountTrailingZeroElems(
+ Builder.getInt64Ty(), State.get(getOperand(Idx)), true, Name);
+ Current = Builder.CreateAdd(
+ Builder.CreateMul(RuntimeVF, Builder.getInt64(Idx)), Current);
+ if (Res) {
+ Value *Cmp = Builder.CreateICmpNE(Current, RuntimeBitwidth);
+ Res = Builder.CreateSelect(Cmp, Current, Res);
+ } else {
+ Res = Current;
+ }
+ }
+
+ return Res;
}
default:
llvm_unreachable("Unsupported opcode for instruction");
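To see what the multi-operand case produces for a fixed-width VF=4, UF=4 plan, here is a condensed version of the sequence checked in the VF4IC4 tests below (names shortened; %m0..%m3 stand for the per-part masks). As the review discussion further down notes, the icmp in this revision compares the offset sum rather than the raw trailing-zero count, so for parts 1..3 it is always true:

%tz3 = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> %m3, i1 true)
%c3  = add i64 12, %tz3                      ; 3 * VF + trailing zeros of part 3
%tz2 = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> %m2, i1 true)
%c2  = add i64 8, %tz2
%ne2 = icmp ne i64 %c2, 4                    ; intended: "part 2 has an active lane"
%r2  = select i1 %ne2, i64 %c2, i64 %c3
%tz1 = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> %m1, i1 true)
%c1  = add i64 4, %tz1
%ne1 = icmp ne i64 %c1, 4
%r1  = select i1 %ne1, i64 %c1, i64 %r2
%tz0 = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> %m0, i1 true)
%c0  = add i64 0, %tz0
%ne0 = icmp ne i64 %c0, 4
%res = select i1 %ne0, i64 %c0, i64 %r1      ; first active lane across all parts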
diff --git a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
index 0bc683e557e70..532539ff5cb00 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanUnroll.cpp
@@ -344,10 +344,12 @@ void UnrollState::unrollBlock(VPBlockBase *VPB) {
if (ToSkip.contains(&R) || isa<VPIRInstruction>(&R))
continue;
- // Add all VPValues for all parts to ComputeReductionResult which combines
- // the parts to compute the final reduction value.
+ // Add all VPValues for all parts to Compute*Result and FirstActiveLane
+ // which combine the parts to compute the final value.
VPValue *Op1;
- if (match(&R, m_VPInstruction<VPInstruction::ComputeAnyOfResult>(
+ if (match(&R, m_VPInstruction<VPInstruction::FirstActiveLane>(
+ m_VPValue(Op1))) ||
+ match(&R, m_VPInstruction<VPInstruction::ComputeAnyOfResult>(
m_VPValue(), m_VPValue(), m_VPValue(Op1))) ||
match(&R, m_VPInstruction<VPInstruction::ComputeReductionResult>(
m_VPValue(), m_VPValue(Op1))) ||
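For context, the code following this match (not shown in the hunk, paraphrased from the surrounding file and therefore approximate) is what appends the per-part copies as extra operands:

// Approximate sketch of the surrounding VPlanUnroll.cpp logic, not verbatim.
addUniformForAllParts(cast<VPInstruction>(&R));
for (unsigned Part = 1; Part != UF; ++Part)
  R.addOperand(getValueForPart(Op1, Part));
continue;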
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll b/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
index 9dfe70ddf1b05..a0d00b7d4b438 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/single-early-exit-interleave.ll
@@ -31,11 +31,38 @@ define i64 @same_exit_block_pre_inc_use1() #0 {
; CHECK-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX1]]
; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[OFFSET_IDX]]
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i32 0
+; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP19:%.*]] = mul nuw i64 [[TMP18]], 16
+; CHECK-NEXT: [[TMP29:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP19]]
+; CHECK-NEXT: [[TMP33:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP34:%.*]] = mul nuw i64 [[TMP33]], 32
+; CHECK-NEXT: [[TMP35:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP34]]
+; CHECK-NEXT: [[TMP36:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP37:%.*]] = mul nuw i64 [[TMP36]], 48
+; CHECK-NEXT: [[TMP38:%.*]] = getelementptr inbounds i8, ptr [[TMP7]], i64 [[TMP37]]
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 16 x i8>, ptr [[TMP8]], align 1
+; CHECK-NEXT: [[WIDE_LOAD5:%.*]] = load <vscale x 16 x i8>, ptr [[TMP29]], align 1
+; CHECK-NEXT: [[WIDE_LOAD3:%.*]] = load <vscale x 16 x i8>, ptr [[TMP35]], align 1
+; CHECK-NEXT: [[WIDE_LOAD4:%.*]] = load <vscale x 16 x i8>, ptr [[TMP38]], align 1
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i8, ptr [[P2]], i64 [[OFFSET_IDX]]
; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i32 0
+; CHECK-NEXT: [[TMP20:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP21:%.*]] = mul nuw i64 [[TMP20]], 16
+; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP21]]
+; CHECK-NEXT: [[TMP23:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP24:%.*]] = mul nuw i64 [[TMP23]], 32
+; CHECK-NEXT: [[TMP25:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP24]]
+; CHECK-NEXT: [[TMP26:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP27:%.*]] = mul nuw i64 [[TMP26]], 48
+; CHECK-NEXT: [[TMP28:%.*]] = getelementptr inbounds i8, ptr [[TMP9]], i64 [[TMP27]]
; CHECK-NEXT: [[WIDE_LOAD2:%.*]] = load <vscale x 16 x i8>, ptr [[TMP10]], align 1
+; CHECK-NEXT: [[WIDE_LOAD6:%.*]] = load <vscale x 16 x i8>, ptr [[TMP22]], align 1
+; CHECK-NEXT: [[WIDE_LOAD7:%.*]] = load <vscale x 16 x i8>, ptr [[TMP25]], align 1
+; CHECK-NEXT: [[WIDE_LOAD8:%.*]] = load <vscale x 16 x i8>, ptr [[TMP28]], align 1
; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD]], [[WIDE_LOAD2]]
+; CHECK-NEXT: [[TMP30:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD6]]
+; CHECK-NEXT: [[TMP31:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
+; CHECK-NEXT: [[TMP32:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD4]], [[WIDE_LOAD8]]
; CHECK-NEXT: [[INDEX_NEXT3]] = add nuw i64 [[INDEX1]], [[TMP5]]
; CHECK-NEXT: [[TMP12:%.*]] = call i1 @llvm.vector.reduce.or.nxv16i1(<vscale x 16 x i1> [[TMP11]])
; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i64 [[INDEX_NEXT3]], [[N_VEC]]
@@ -47,8 +74,28 @@ define i64 @same_exit_block_pre_inc_use1() #0 {
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 510, [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[LOOP_END:%.*]], label [[SCALAR_PH]]
; CHECK: vector.early.exit:
+; CHECK-NEXT: [[TMP63:%.*]] = call i64 @llvm.vscale.i64()
+; CHECK-NEXT: [[TMP42:%.*]] = mul nuw i64 [[TMP63]], 16
+; CHECK-NEXT: [[TMP43:%.*]] = mul i64 1, [[TMP42]]
+; CHECK-NEXT: [[TMP44:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP32]], i1 true)
+; CHECK-NEXT: [[TMP62:%.*]] = mul i64 [[TMP42]], 3
+; CHECK-NEXT: [[TMP45:%.*]] = add i64 [[TMP62]], [[TMP44]]
+; CHECK-NEXT: [[TMP46:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP31]], i1 true)
+; CHECK-NEXT: [[TMP58:%.*]] = mul i64 [[TMP42]], 2
+; CHECK-NEXT: [[TMP50:%.*]] = add i64 [[TMP58]], [[TMP46]]
+; CHECK-NEXT: [[TMP47:%.*]] = icmp ne i64 [[TMP50]], [[TMP43]]
+; CHECK-NEXT: [[TMP51:%.*]] = select i1 [[TMP47]], i64 [[TMP50]], i64 [[TMP45]]
+; CHECK-NEXT: [[TMP52:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP30]], i1 true)
+; CHECK-NEXT: [[TMP64:%.*]] = mul i64 [[TMP42]], 1
+; CHECK-NEXT: [[TMP56:%.*]] = add i64 [[TMP64]], [[TMP52]]
+; CHECK-NEXT: [[TMP53:%.*]] = icmp ne i64 [[TMP56]], [[TMP43]]
+; CHECK-NEXT: [[TMP57:%.*]] = select i1 [[TMP53]], i64 [[TMP56]], i64 [[TMP51]]
; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP11]], i1 true)
-; CHECK-NEXT: [[TMP16:%.*]] = add i64 [[INDEX1]], [[TMP15]]
+; CHECK-NEXT: [[TMP65:%.*]] = mul i64 [[TMP42]], 0
+; CHECK-NEXT: [[TMP60:%.*]] = add i64 [[TMP65]], [[TMP15]]
+; CHECK-NEXT: [[TMP59:%.*]] = icmp ne i64 [[TMP60]], [[TMP43]]
+; CHECK-NEXT: [[TMP61:%.*]] = select i1 [[TMP59]], i64 [[TMP60]], i64 [[TMP57]]
+; CHECK-NEXT: [[TMP16:%.*]] = add i64 [[INDEX1]], [[TMP61]]
; CHECK-NEXT: [[TMP17:%.*]] = add i64 3, [[TMP16]]
; CHECK-NEXT: br label [[LOOP_END]]
; CHECK: scalar.ph:
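To make the scalable arithmetic above concrete: TMP42 is the per-part lane count (vscale * 16), and TMP43 = 1 * TMP42 is the sentinel (for an i1 mask the scalar size is 1 bit, so RuntimeBitwidth equals the lane count). Assuming vscale = 2 at runtime (an illustrative value only):

per-part lanes : TMP42 = 2 * 16 = 32
part offsets   : mul TMP42, {0, 1, 2, 3} = 0, 32, 64, 96
first mismatch at lane 5 of part 2 -> combined index = 64 + 5 = 69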
diff --git a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
index 1f8cfa1bfd11c..68f25e92af866 100644
--- a/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
+++ b/llvm/test/Transforms/LoopVectorize/single-early-exit-interleave.ll
@@ -91,11 +91,26 @@ define i64 @same_exit_block_pre_inc_use1() {
; VF4IC4-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX]]
; VF4IC4-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[OFFSET_IDX]]
; VF4IC4-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 0
+; VF4IC4-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 4
+; VF4IC4-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 8
+; VF4IC4-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 12
; VF4IC4-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP1]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD4:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD2:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD3:%.*]] = load <4 x i8>, ptr [[TMP16]], align 1
; VF4IC4-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P2]], i64 [[OFFSET_IDX]]
; VF4IC4-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
+; VF4IC4-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 4
+; VF4IC4-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 8
+; VF4IC4-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 12
; VF4IC4-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[TMP3]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD5:%.*]] = load <4 x i8>, ptr [[TMP17]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD6:%.*]] = load <4 x i8>, ptr [[TMP18]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD7:%.*]] = load <4 x i8>, ptr [[TMP19]], align 1
; VF4IC4-NEXT: [[TMP4:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD]], [[WIDE_LOAD1]]
+; VF4IC4-NEXT: [[TMP11:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD4]], [[WIDE_LOAD5]]
+; VF4IC4-NEXT: [[TMP12:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
+; VF4IC4-NEXT: [[TMP13:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
; VF4IC4-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; VF4IC4-NEXT: [[TMP5:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP4]])
; VF4IC4-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 64
@@ -106,7 +121,20 @@ define i64 @same_exit_block_pre_inc_use1() {
; VF4IC4: middle.block:
; VF4IC4-NEXT: br i1 true, label [[LOOP_END:%.*]], label [[SCALAR_PH]]
; VF4IC4: vector.early.exit:
-; VF4IC4-NEXT: [[TMP8:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP4]], i1 true)
+; VF4IC4-NEXT: [[TMP33:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP13]], i1 true)
+; VF4IC4-NEXT: [[TMP34:%.*]] = add i64 12, [[TMP33]]
+; VF4IC4-NEXT: [[TMP35:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP12]], i1 true)
+; VF4IC4-NEXT: [[TMP24:%.*]] = add i64 8, [[TMP35]]
+; VF4IC4-NEXT: [[TMP23:%.*]] = icmp ne i64 [[TMP24]], 4
+; VF4IC4-NEXT: [[TMP25:%.*]] = select i1 [[TMP23]], i64 [[TMP24]], i64 [[TMP34]]
+; VF4IC4-NEXT: [[TMP26:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP11]], i1 true)
+; VF4IC4-NEXT: [[TMP28:%.*]] = add i64 4, [[TMP26]]
+; VF4IC4-NEXT: [[TMP27:%.*]] = icmp ne i64 [[TMP28]], 4
+; VF4IC4-NEXT: [[TMP29:%.*]] = select i1 [[TMP27]], i64 [[TMP28]], i64 [[TMP25]]
+; VF4IC4-NEXT: [[TMP30:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP4]], i1 true)
+; VF4IC4-NEXT: [[TMP32:%.*]] = add i64 0, [[TMP30]]
+; VF4IC4-NEXT: [[TMP31:%.*]] = icmp ne i64 [[TMP32]], 4
+; VF4IC4-NEXT: [[TMP8:%.*]] = select i1 [[TMP31]], i64 [[TMP32]], i64 [[TMP29]]
; VF4IC4-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], [[TMP8]]
; VF4IC4-NEXT: [[TMP10:%.*]] = add i64 3, [[TMP9]]
; VF4IC4-NEXT: br label [[LOOP_END]]
@@ -170,8 +198,17 @@ define ptr @same_exit_block_pre_inc_use1_ivptr() {
; VF4IC4-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; VF4IC4-NEXT: [[NEXT_GEP:%.*]] = getelementptr i8, ptr [[P1]], i64 [[INDEX]]
; VF4IC4-NEXT: [[TMP1:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 0
+; VF4IC4-NEXT: [[TMP9:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 4
+; VF4IC4-NEXT: [[TMP10:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 8
+; VF4IC4-NEXT: [[TMP11:%.*]] = getelementptr i8, ptr [[NEXT_GEP]], i32 12
; VF4IC4-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP1]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[TMP9]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD2:%.*]] = load <4 x i8>, ptr [[TMP10]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD3:%.*]] = load <4 x i8>, ptr [[TMP11]], align 1
; VF4IC4-NEXT: [[TMP2:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD]], splat (i8 72)
+; VF4IC4-NEXT: [[TMP15:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD1]], splat (i8 72)
+; VF4IC4-NEXT: [[TMP16:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD2]], splat (i8 72)
+; VF4IC4-NEXT: [[TMP17:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD3]], splat (i8 72)
; VF4IC4-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; VF4IC4-NEXT: [[TMP3:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP2]])
; VF4IC4-NEXT: [[TMP4:%.*]] = icmp eq i64 [[INDEX_NEXT]], 1024
@@ -182,7 +219,20 @@ define ptr @same_exit_block_pre_inc_use1_ivptr() {
; VF4IC4: middle.block:
; VF4IC4-NEXT: br i1 true, label [[LOOP_END:%.*]], label [[SCALAR_PH]]
; VF4IC4: vector.early.exit:
-; VF4IC4-NEXT: [[TMP6:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP2]], i1 true)
+; VF4IC4-NEXT: [[TMP28:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP17]], i1 true)
+; VF4IC4-NEXT: [[TMP29:%.*]] = add i64 12, [[TMP28]]
+; VF4IC4-NEXT: [[TMP30:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP16]], i1 true)
+; VF4IC4-NEXT: [[TMP19:%.*]] = add i64 8, [[TMP30]]
+; VF4IC4-NEXT: [[TMP18:%.*]] = icmp ne i64 [[TMP19]], 4
+; VF4IC4-NEXT: [[TMP20:%.*]] = select i1 [[TMP18]], i64 [[TMP19]], i64 [[TMP29]]
+; VF4IC4-NEXT: [[TMP21:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP15]], i1 true)
+; VF4IC4-NEXT: [[TMP23:%.*]] = add i64 4, [[TMP21]]
+; VF4IC4-NEXT: [[TMP22:%.*]] = icmp ne i64 [[TMP23]], 4
+; VF4IC4-NEXT: [[TMP24:%.*]] = select i1 [[TMP22]], i64 [[TMP23]], i64 [[TMP20]]
+; VF4IC4-NEXT: [[TMP25:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP2]], i1 true)
+; VF4IC4-NEXT: [[TMP27:%.*]] = add i64 0, [[TMP25]]
+; VF4IC4-NEXT: [[TMP26:%.*]] = icmp ne i64 [[TMP27]], 4
+; VF4IC4-NEXT: [[TMP6:%.*]] = select i1 [[TMP26]], i64 [[TMP27]], i64 [[TMP24]]
; VF4IC4-NEXT: [[TMP7:%.*]] = add i64 [[INDEX]], [[TMP6]]
; VF4IC4-NEXT: [[TMP8:%.*]] = getelementptr i8, ptr [[P1]], i64 [[TMP7]]
; VF4IC4-NEXT: br label [[LOOP_END]]
@@ -240,11 +290,26 @@ define i64 @same_exit_block_post_inc_use() {
; VF4IC4-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX]]
; VF4IC4-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[OFFSET_IDX]]
; VF4IC4-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 0
+; VF4IC4-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 4
+; VF4IC4-NEXT: [[TMP15:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 8
+; VF4IC4-NEXT: [[TMP16:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i32 12
; VF4IC4-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, ptr [[TMP1]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD4:%.*]] = load <4 x i8>, ptr [[TMP14]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD2:%.*]] = load <4 x i8>, ptr [[TMP15]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD3:%.*]] = load <4 x i8>, ptr [[TMP16]], align 1
; VF4IC4-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, ptr [[P2]], i64 [[OFFSET_IDX]]
; VF4IC4-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 0
+; VF4IC4-NEXT: [[TMP17:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 4
+; VF4IC4-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 8
+; VF4IC4-NEXT: [[TMP19:%.*]] = getelementptr inbounds i8, ptr [[TMP2]], i32 12
; VF4IC4-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i8>, ptr [[TMP3]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD5:%.*]] = load <4 x i8>, ptr [[TMP17]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD6:%.*]] = load <4 x i8>, ptr [[TMP18]], align 1
+; VF4IC4-NEXT: [[WIDE_LOAD7:%.*]] = load <4 x i8>, ptr [[TMP19]], align 1
; VF4IC4-NEXT: [[TMP4:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD]], [[WIDE_LOAD1]]
+; VF4IC4-NEXT: [[TMP11:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD4]], [[WIDE_LOAD5]]
+; VF4IC4-NEXT: [[TMP12:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD2]], [[WIDE_LOAD6]]
+; VF4IC4-NEXT: [[TMP13:%.*]] = icmp ne <4 x i8> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
; VF4IC4-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
; VF4IC4-NEXT: [[TMP5:%.*]] = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> [[TMP4]])
; VF4IC4-NEXT: [[TMP6:%.*]] = icmp eq i64 [[INDEX_NEXT]], 64
@@ -255,7 +320,20 @@ define i64 @same_exit_block_post_inc_use() {
; VF4IC4: middle.block:
; VF4IC4-NEXT: br i1 true, label [[LOOP_END:%.*]], label [[SCALAR_PH]]
; VF4IC4: vector.early.exit:
-; VF4IC4-NEXT: [[TMP8:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP4]], i1 true)
+; VF4IC4-NEXT: [[TMP33:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP13]], i1 true)
+; VF4IC4-NEXT: [[TMP34:%.*]] = add i64 12, [[TMP33]]
+; VF4IC4-NEXT: [[TMP35:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP12]], i1 true)
+; VF4IC4-NEXT: [[TMP24:%.*]] = add i64 8, [[TMP35]]
+; VF4IC4-NEXT: [[TMP23:%.*]] = icmp ne i64 [[TMP24]], 4
+; VF4IC4-NEXT: [[TMP25:%.*]] = select i1 [[TMP23]], i64 [[TMP24]], i64 [[TMP34]]
+; VF4IC4-NEXT: [[TMP26:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP11]], i1 true)
+; VF4IC4-NEXT: [[TMP28:%.*]] = add i64 4, [[TMP26]]
+; VF4IC4-NEXT: [[TMP27:%.*]] = icmp ne i64 [[TMP28]], 4
+; VF4IC4-NEXT: [[TMP29:%.*]] = select i1 [[TMP27]], i64 [[TMP28]], i64 [[TMP25]]
+; VF4IC4-NEXT: [[TMP30:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> [[TMP4]], i1 true)
+; VF4IC4-NEXT: [[TMP32:%.*]] = add i64 0, [[TMP30]]
+; VF4IC4-NEXT: [[TMP31:%.*]] = icmp ne i64 [[TMP32]], 4
+; VF4IC4-NEXT: [[TMP8:%.*]] = select i1 [[TMP31]], i64 [[TMP32]], i64 [[TMP29]]
; VF4IC4-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], [[TMP8]]
; VF4IC4-NEXT: [[TMP10:%.*]] = add i64 3, [[TMP9]]
; VF4IC4-NEXT: br label [[LOOP_END]]
@@ -320,11 +398,26 @@ define i64 @diff_exit_block_pre_inc_use1() {
; VF4IC4-NEXT: [[OFFSET_IDX:%.*]] = add i64 3, [[INDEX]]
; VF4IC4-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, ptr [[P1]], i64 [[OFFSET_IDX]]
; VF4IC4-NEXT: [[TMP1:...
[truncated]
; CHECK-NEXT: [[TMP11:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD]], [[WIDE_LOAD2]]
; CHECK-NEXT: [[TMP30:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD5]], [[WIDE_LOAD6]]
; CHECK-NEXT: [[TMP31:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD3]], [[WIDE_LOAD7]]
; CHECK-NEXT: [[TMP32:%.*]] = icmp ne <vscale x 16 x i8> [[WIDE_LOAD4]], [[WIDE_LOAD8]]
The early exit condition below is also totally broken. We should be performing reductions across all vectors and or'ing them together.
Yes, there's #145340 for that.
ExtractElement with a FirstActiveLane operand is also broken; all three can be fixed independently, just for FirstActiveLane I still need to share a patch.
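For reference, a minimal sketch of the combined exit condition the reviewer describes (hypothetical value names; this is what #145340 addresses, not part of this patch):

%or01 = or <4 x i1> %cmp0, %cmp1
%or23 = or <4 x i1> %cmp2, %cmp3
%any  = or <4 x i1> %or01, %or23
%exit = call i1 @llvm.vector.reduce.or.v4i1(<4 x i1> %any)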
; CHECK-NEXT: [[TMP46:%.*]] = call i64 @llvm.experimental.cttz.elts.i64.nxv16i1(<vscale x 16 x i1> [[TMP31]], i1 true)
; CHECK-NEXT: [[TMP58:%.*]] = mul i64 [[TMP42]], 2
; CHECK-NEXT: [[TMP50:%.*]] = add i64 [[TMP58]], [[TMP46]]
; CHECK-NEXT: [[TMP47:%.*]] = icmp ne i64 [[TMP50]], [[TMP43]]
I don't understand the logic behind this. We're comparing TMP43 (the number of lanes in a vector with VF=vscale x 16) with TMP50 ((2 * number of lanes in a vector) + first active lane of part 2). The result is always going to be false.
Ah yes, that was mixed up when I reordered the code... Should compare the trailing zeros; updated.
This detects the case where no lane is set.
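A minimal sketch of the corrected per-part check described above (hypothetical names; the trailing-zero count is compared against the lane count before the part offset is added):

%tz  = call i64 @llvm.experimental.cttz.elts.i64.v4i1(<4 x i1> %mask2, i1 true)
%any = icmp ne i64 %tz, 4                    ; part 2 has an active lane iff tz != VF
%idx = add i64 8, %tz                        ; global index = 2 * VF + tz
%res = select i1 %any, i64 %idx, i64 %later  ; otherwise fall through to later parts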
Force-pushed from 708edb1 to 6bd5857:
Currently FirstActiveLane is not handled correctly during
unrolling. This is currently causing mis-compiles when
vectorizing early-exit loops with interleaving forced.
This patch updates handling of FirstActiveLane to be analogous to computing final reduction results: during unrolling, the created copies for its original operand are added as additional operands, and FirstActiveLane will always produce the index of the first active lane across all unrolled iterations.
Note that some of the generated code is still incorrect, as we also need to handle ExtractElement with FirstActiveLane operands. I will share patches for those soon as well.