[AArch64][CostModel] Lower cost of dupq (SVE2.1) #144918
@llvm/pr-subscribers-llvm-analysis
Author: Graham Hunter (huntergr-arm)
Changes
With codegen in place to match shuffles to dupq, we can now lower the cost to something reasonable.
Full diff: https://github.com/llvm/llvm-project/pull/144918.diff
2 Files Affected:
diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index ed051f295752e..6ff0efd117dbd 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
Kind = TTI::SK_PermuteSingleSrc;
}
+ // Segmented shuffle matching.
+ if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&
+ Kind == TTI::SK_PermuteSingleSrc && isa<FixedVectorType>(Tp) &&
+ Tp->getPrimitiveSizeInBits().isKnownMultipleOf(128)) {
+
+ FixedVectorType *VTy = cast<FixedVectorType>(Tp);
+ unsigned Segments = VTy->getPrimitiveSizeInBits() / 128;
+ unsigned SegmentElts = VTy->getNumElements() / Segments;
+
+ // dupq zd.t, zn.t[idx]
+ unsigned Lane = (unsigned)Mask[0];
+ if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {
+ bool IsDupQ = true;
+ for (unsigned I = 1; I < Mask.size(); ++I)
+ IsDupQ &= (unsigned)Mask[I] == Lane + ((I / SegmentElts) * SegmentElts);
+ if (IsDupQ)
+ return LT.first;
+ }
+ }
+
// Check for broadcast loads, which are supported by the LD1R instruction.
// In terms of code-size, the shuffle vector is free when a load + dup get
// folded into a LD1R. That's what we check and return here. For performance
diff --git a/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll b/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll
new file mode 100644
index 0000000000000..e6a57d1687254
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll
@@ -0,0 +1,25 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s | FileCheck %s
+
+;; Broadcast indexed lane within 128b segments (dupq zd.t, zn.t[idx])
+define void @dup_within_each_segment() #0 {
+; CHECK-LABEL: 'dup_within_each_segment'
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %dupq_b11 = shufflevector <32 x i8> poison, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %dupq_h2 = shufflevector <16 x i16> poison, <16 x i16> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %dupq_s3 = shufflevector <8 x i32> poison, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %dupq_d0 = shufflevector <4 x i64> poison, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %dupq_512b_d1 = shufflevector <8 x i64> poison, <8 x i64> poison, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
+; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+ %dupq_b11 = shufflevector <32 x i8> poison, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
+ i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
+ %dupq_h2 = shufflevector <16 x i16> poison, <16 x i16> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2,
+ i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+ %dupq_s3 = shufflevector <8 x i32> poison, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+ i32 7, i32 7, i32 7, i32 7>
+ %dupq_d0 = shufflevector <4 x i64> poison, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+ %dupq_512b_d1 = shufflevector <8 x i64> poison, <8 x i64> poison, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
+ ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1" }
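The mask check in the patch above can be modelled outside of LLVM. The sketch below is not the patch itself: it assumes a plain std::vector<int> in place of the ArrayRef<int> mask, takes the element width as a parameter instead of reading it from the vector type, and uses the hypothetical name isSegmentedBroadcast. It also rejects empty masks and negative indices up front, which the original hunk does not do.

```cpp
#include <cassert>
#include <vector>

// Segment width used by the patch (one 128-bit block).
constexpr unsigned SegmentBits = 128;

// Returns true if every 128-bit segment of the mask broadcasts the same
// in-segment lane, i.e. the pattern the patch costs as a single dupq.
bool isSegmentedBroadcast(const std::vector<int> &Mask, unsigned EltBits) {
  assert(EltBits != 0 && SegmentBits % EltBits == 0 &&
         "element size must divide the segment size");
  const unsigned SegmentElts = SegmentBits / EltBits;

  // The mask must be non-empty and cover a whole number of segments.
  if (Mask.empty() || Mask.size() % SegmentElts != 0)
    return false;

  // The first index selects the lane and must lie inside the first segment.
  if (Mask[0] < 0 || static_cast<unsigned>(Mask[0]) >= SegmentElts)
    return false;
  const unsigned Lane = static_cast<unsigned>(Mask[0]);

  // Same accumulate-and-test loop shape as the original hunk.
  bool IsDupQ = true;
  for (unsigned I = 1; I < Mask.size(); ++I)
    IsDupQ &= Mask[I] >= 0 && static_cast<unsigned>(Mask[I]) ==
                                  Lane + (I / SegmentElts) * SegmentElts;
  return IsDupQ;
}
```

For example, the %dupq_s3 mask from the test, <3, 3, 3, 3, 7, 7, 7, 7> with 32-bit elements, is accepted, while <3, 3, 3, 3, 6, 6, 6, 6> is rejected because the second segment selects a different in-segment lane.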
// Segmented shuffle matching.
if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&
    Kind == TTI::SK_PermuteSingleSrc && isa<FixedVectorType>(Tp) &&
    Tp->getPrimitiveSizeInBits().isKnownMultipleOf(128)) {

nit: please use AArch64::SVEBitsPerBlock instead of hard-coding 128.
if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {
  bool IsDupQ = true;
  for (unsigned I = 1; I < Mask.size(); ++I)
    IsDupQ &= (unsigned)Mask[I] == Lane + ((I / SegmentElts) * SegmentElts);
  if (IsDupQ)
    return LT.first;
}

nit: to benefit from an early exit, you could use something like this?

if (all_of(enumerate(Mask), [](unsigned I, unsigned M) { return M == Lane + ((I / SegmentElts) * SegmentElts); })
  return ..
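A standalone sketch of the early-exit form suggested above, for anyone who wants to try it outside of LLVM. It assumes std::all_of over an explicit index vector as a stand-in for llvm::all_of with llvm::enumerate; the function name is hypothetical. The lambda captures by reference so that Lane and SegmentElts are visible inside it.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Early-exit variant of the segmented-broadcast check. std::all_of can stop
// at the first mismatching lane instead of accumulating over the whole mask.
bool isSegmentedBroadcastEarlyExit(const std::vector<int> &Mask,
                                   std::size_t SegmentElts) {
  if (Mask.empty() || SegmentElts == 0 || Mask.size() % SegmentElts != 0)
    return false;
  if (Mask[0] < 0 || static_cast<std::size_t>(Mask[0]) >= SegmentElts)
    return false;
  const std::size_t Lane = static_cast<std::size_t>(Mask[0]);

  // Explicit indices 0..N-1; llvm::enumerate would supply these in-tree.
  std::vector<std::size_t> Indices(Mask.size());
  std::iota(Indices.begin(), Indices.end(), std::size_t{0});

  return std::all_of(Indices.begin(), Indices.end(), [&](std::size_t I) {
    return Mask[I] >= 0 && static_cast<std::size_t>(Mask[I]) ==
                               Lane + (I / SegmentElts) * SegmentElts;
  });
}
```

The index vector only exists to keep the sketch free of LLVM headers; in-tree code would not need the extra allocation.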
@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
  Kind = TTI::SK_PermuteSingleSrc;
}

// Segmented shuffle matching.
if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&

This instruction is also enabled under FEAT_SME2p1.
// dupq zd.t, zn.t[idx]
unsigned Lane = (unsigned)Mask[0];
if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {

Can this re-use the isDUPQMask method added recently?
@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
  Kind = TTI::SK_PermuteSingleSrc;
}

// Segmented shuffle matching.
if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&

Can it be handled for any CostKind? Having them all match is often good enough.
* Support SME2p1
* Remove hardcoded magic number
* Return the same result for other cost kinds
if (Subtarget->hasSVE2p1() ||
    (Subtarget->hasSME2p1() && Subtarget->isSVEorStreamingSVEAvailable())) {

Sorry, I should have said isStreaming(), i.e.

if (Subtarget->hasSVE2p1() ||
    (Subtarget->hasSME2p1() && Subtarget->isStreaming())) {
Actually, looking at the instruction description it seems your original code (using isSVEorStreamingSVEAvailable) was the right code, because with +sme2p1 the instruction is also enabled in non-streaming mode. This can be seen from the pseudo-code for DUPQ, which has IsSVEEnabled(). This is actually something that @paulwalker-arm is trying to fix in other places (#145322).
After having revisited this once more, the correct interpretation of:

if !Feature_SVE2p1 && !Feature_SME2p1 then Instruction = UNDEFINED
CheckSVEEnabled()
// can be in streaming mode
// can be in non-streaming mode

is that:

* +sve,+sme2p1 enables DUPQ in both streaming and non-streaming mode
* +sve2p1,+sme enables DUPQ in both streaming and non-streaming mode

Meaning that for DUPQ this expression should be:

if ((Subtarget->hasSVE2p1() || Subtarget->hasSME2p1()) &&
    Subtarget->isSVEorStreamingSVEAvailable())

We previously worked under the assumption that +sme* features would add features in streaming mode, and +sve* features would add features in non-streaming mode. But this is not (or no longer) the case. There may be other places where this needs fixing as well.
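To make the final predicate concrete, here is a small standalone model. It is only a sketch: SubtargetModel and dupqAvailable are hypothetical names, and the three booleans stand in for the hasSVE2p1(), hasSME2p1() and isSVEorStreamingSVEAvailable() queries mentioned in this thread.

```cpp
// Hypothetical stand-in for the three subtarget queries discussed above.
struct SubtargetModel {
  bool HasSVE2p1;                  // models hasSVE2p1()
  bool HasSME2p1;                  // models hasSME2p1()
  bool SVEOrStreamingSVEAvailable; // models isSVEorStreamingSVEAvailable()
};

// Either feature enables DUPQ, provided SVE instructions can execute at all
// in the current mode.
bool dupqAvailable(const SubtargetModel &ST) {
  return (ST.HasSVE2p1 || ST.HasSME2p1) && ST.SVEOrStreamingSVEAvailable;
}
```

Under this model, {HasSVE2p1 = false, HasSME2p1 = true, SVEOrStreamingSVEAvailable = true} enables the instruction, while any combination with SVEOrStreamingSVEAvailable = false does not.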
4690b2b to 6899d1c
This reverts commit 57aaebe.
The scores look OK to me. A couple of other suggestions but looks good otherwise, if @sdesmalen-arm is happy.
@@ -5599,6 +5599,23 @@ AArch64TTIImpl::getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy,
  SrcTy = DstTy;
}

// Segmented shuffle matching.
if ((ST->hasSVE2p1() ||

Check !Mask.empty() too, in case it is not passed in.
// Check that all lanes match the first, adjusted for segment.
if (all_of(enumerate(M), [&](auto P) {
      return (unsigned)P.value() == Lane + (P.index() / NumElts) * NumElts;

Add P.value() == PoisonMaskElem || if undef elements are OK to ignore.
I was going to leave that alone for now, but since this only applies to SVE2p1+ we have time to figure out whether there's a better match for edge cases later.
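For reference, a poison-tolerant check could look roughly like the sketch below. This is not what the patch does. It assumes poison lanes are encoded as -1 (the value LLVM uses for PoisonMaskElem), takes the lane from the first non-poison element rather than from Mask[0], and rejects all-poison masks; the function name and those choices are illustrative only.

```cpp
#include <cstddef>
#include <vector>

// Local mirror of LLVM's PoisonMaskElem value.
constexpr int PoisonElem = -1;

// Segmented-broadcast check that skips poison lanes. The broadcast lane is
// taken from the first non-poison element, since Mask[0] may be poison.
bool isSegmentedBroadcastIgnoringPoison(const std::vector<int> &Mask,
                                        std::size_t SegmentElts) {
  if (Mask.empty() || SegmentElts == 0 || Mask.size() % SegmentElts != 0)
    return false;

  bool HaveLane = false;
  std::size_t Lane = 0;
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    if (Mask[I] == PoisonElem)
      continue; // Poison lanes match anything.
    if (Mask[I] < 0)
      return false; // Any other negative index is invalid.

    // Each defined index must stay inside its own segment.
    const std::size_t Base = (I / SegmentElts) * SegmentElts;
    const std::size_t Idx = static_cast<std::size_t>(Mask[I]);
    if (Idx < Base || Idx >= Base + SegmentElts)
      return false;

    const std::size_t InSegment = Idx - Base;
    if (!HaveLane) {
      Lane = InSegment;
      HaveLane = true;
    } else if (InSegment != Lane) {
      return false; // Early exit on the first mismatch.
    }
  }
  // An all-poison mask names no lane; this sketch treats it as no match.
  return HaveLane;
}
```

Whether an all-poison mask should count as a match, and whether relaxing the check changes which shuffles codegen actually lowers to dupq, are open questions this sketch does not settle.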
LGTM, but I'll leave the approval to experts 😄
With codegen in place to match shuffles to dupq, we can now lower the cost to something reasonable.