[AArch64][CostModel] Lower cost of dupq (SVE2.1) #144918

huntergr-arm · 2025-06-19T15:48:20Z

With codegen in place to match shuffles to dupq, we can now lower the cost to something reasonable.

llvmbot · 2025-06-19T15:48:53Z

@llvm/pr-subscribers-llvm-analysis

Author: Graham Hunter (huntergr-arm)

Changes

With codegen in place to match shuffles to dupq, we can now lower the cost to something reasonable.

Full diff: https://github.com/llvm/llvm-project/pull/144918.diff

2 Files Affected:

(modified) llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp (+20)
(added) llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll (+25)

diff --git a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
index ed051f295752e..6ff0efd117dbd 100644
--- a/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
     Kind = TTI::SK_PermuteSingleSrc;
   }
 
+  // Segmented shuffle matching.
+  if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&
+      Kind == TTI::SK_PermuteSingleSrc && isa<FixedVectorType>(Tp) &&
+      Tp->getPrimitiveSizeInBits().isKnownMultipleOf(128)) {
+
+    FixedVectorType *VTy = cast<FixedVectorType>(Tp);
+    unsigned Segments = VTy->getPrimitiveSizeInBits() / 128;
+    unsigned SegmentElts = VTy->getNumElements() / Segments;
+
+    // dupq zd.t, zn.t[idx]
+    unsigned Lane = (unsigned)Mask[0];
+    if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {
+      bool IsDupQ = true;
+      for (unsigned I = 1; I < Mask.size(); ++I)
+        IsDupQ &= (unsigned)Mask[I] == Lane + ((I / SegmentElts) * SegmentElts);
+      if (IsDupQ)
+        return LT.first;
+    }
+  }
+
   // Check for broadcast loads, which are supported by the LD1R instruction.
   // In terms of code-size, the shuffle vector is free when a load + dup get
   // folded into a LD1R. That's what we check and return here. For performance
diff --git a/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll b/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll
new file mode 100644
index 0000000000000..e6a57d1687254
--- /dev/null
+++ b/llvm/test/Analysis/CostModel/AArch64/segmented-shufflevector-patterns.ll
@@ -0,0 +1,25 @@
+; NOTE: Assertions have been autogenerated by utils/update_analyze_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes="print<cost-model>" -cost-kind=throughput 2>&1 -disable-output -mtriple=aarch64--linux-gnu < %s | FileCheck %s
+
+;; Broadcast indexed lane within 128b segments (dupq zd.t, zn.t[idx])
+define void @dup_within_each_segment() #0 {
+; CHECK-LABEL: 'dup_within_each_segment'
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %dupq_b11 = shufflevector <32 x i8> poison, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %dupq_h2 = shufflevector <16 x i16> poison, <16 x i16> poison, <16 x i32> <i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %dupq_s3 = shufflevector <8 x i32> poison, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %dupq_d0 = shufflevector <4 x i64> poison, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %dupq_512b_d1 = shufflevector <8 x i64> poison, <8 x i64> poison, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
+;
+  %dupq_b11 = shufflevector <32 x i8> poison, <32 x i8> poison, <32 x i32> <i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11, i32 11,
+                                                                            i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27, i32 27>
+  %dupq_h2  = shufflevector <16 x i16> poison, <16 x i16> poison, <16 x i32> <i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2, i32  2,
+                                                                              i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
+  %dupq_s3  = shufflevector <8 x i32> poison, <8 x i32> poison, <8 x i32> <i32 3, i32 3, i32 3, i32 3,
+                                                                           i32 7, i32 7, i32 7, i32 7>
+  %dupq_d0  = shufflevector <4 x i64> poison, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 2>
+  %dupq_512b_d1 = shufflevector <8 x i64> poison, <8 x i64> poison, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
+  ret void
+}
+
+attributes #0 = { noinline vscale_range(2,2) "target-features"="+sve2p1" }
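
The mask test in the diff can be tried outside LLVM. The following is a minimal stand-alone sketch, assuming plain `std::vector<int>` masks in place of `ArrayRef<int>`; the function name `isSegmentedSplat` is hypothetical and is not part of the patch. It accepts a mask only when every 128-bit segment broadcasts the same lane index relative to its own segment base, which is the pattern `dupq zd.t, zn.t[idx]` produces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-alone helper, not the LLVM implementation.
// Returns true if `mask` broadcasts one lane within every segment of
// `segmentElts` elements.
bool isSegmentedSplat(const std::vector<int> &mask, unsigned segmentElts) {
  assert(segmentElts > 0 && "a segment must hold at least one element");
  if (mask.empty() || mask.size() % segmentElts != 0)
    return false;
  // The first mask element selects the lane; it must lie in segment 0.
  if (mask[0] < 0 || static_cast<unsigned>(mask[0]) >= segmentElts)
    return false;
  const unsigned lane = static_cast<unsigned>(mask[0]);
  for (std::size_t i = 1; i < mask.size(); ++i) {
    // Index of the first element of the segment that lane i belongs to.
    const unsigned segmentBase =
        static_cast<unsigned>(i / segmentElts) * segmentElts;
    if (mask[i] < 0 || static_cast<unsigned>(mask[i]) != lane + segmentBase)
      return false;
  }
  return true;
}
```

With four 32-bit elements per 128-bit segment, the `%dupq_s3` mask `<3, 3, 3, 3, 7, 7, 7, 7>` from the test passes, while an ordinary splat `<3, 3, 3, 3, 3, 3, 3, 3>` does not, because the second segment would need lane 7.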

llvmbot · 2025-06-19T15:48:54Z

@llvm/pr-subscribers-backend-aarch64


sdesmalen-arm · 2025-06-20T13:44:59Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+  // Segmented shuffle matching.
+  if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&
+      Kind == TTI::SK_PermuteSingleSrc && isa<FixedVectorType>(Tp) &&
+      Tp->getPrimitiveSizeInBits().isKnownMultipleOf(128)) {


nit: please use AArch64::SVEBitsPerBlock instead of hard-coding 128.

sdesmalen-arm · 2025-06-20T13:54:04Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+    if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {
+      bool IsDupQ = true;
+      for (unsigned I = 1; I < Mask.size(); ++I)
+        IsDupQ &= (unsigned)Mask[I] == Lane + ((I / SegmentElts) * SegmentElts);
+      if (IsDupQ)
+        return LT.first;
+    }


nit: to benefit from an early exit, you could use something like this:

Suggested change

-    if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {
-      bool IsDupQ = true;
-      for (unsigned I = 1; I < Mask.size(); ++I)
-        IsDupQ &= (unsigned)Mask[I] == Lane + ((I / SegmentElts) * SegmentElts);
-      if (IsDupQ)
-        return LT.first;
-    }
+    if (all_of(enumerate(Mask), [](unsigned I, unsigned M) { return M == Lane + ((I / SegmentElts) * SegmentElts); })
+      return ..

?
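
The early-exit idea can be sketched with the standard library alone. This is a hedged illustration, assuming `std::all_of` over an index range as a stand-in for `llvm::all_of(llvm::enumerate(Mask), ...)`; the name `isSegmentedSplatEarlyExit` is hypothetical. Note that the lambda has to capture the lane and segment size (here with `[&]`), since it refers to both.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-alone version of the early-exit check; std::all_of
// stands in for llvm::all_of(llvm::enumerate(...)).
bool isSegmentedSplatEarlyExit(const std::vector<int> &mask,
                               unsigned segmentElts) {
  if (segmentElts == 0 || mask.empty() || mask.size() % segmentElts != 0)
    return false;
  if (mask[0] < 0 || static_cast<unsigned>(mask[0]) >= segmentElts)
    return false;
  const std::size_t lane = static_cast<std::size_t>(mask[0]);

  // Build 0, 1, ..., N-1 so the predicate sees each element's index.
  std::vector<std::size_t> indices(mask.size());
  std::iota(indices.begin(), indices.end(), std::size_t{0});

  // The predicate captures mask, lane and segmentElts by reference.
  return std::all_of(indices.begin(), indices.end(), [&](std::size_t i) {
    const std::size_t expected = lane + (i / segmentElts) * segmentElts;
    return mask[i] >= 0 && static_cast<std::size_t>(mask[i]) == expected;
  });
}
```

Unlike the `IsDupQ &= ...` loop, which always walks the whole mask, `std::all_of` implementations typically stop at the first element for which the predicate returns false.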

sdesmalen-arm · 2025-06-20T13:55:39Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
    Kind = TTI::SK_PermuteSingleSrc;
  }

+  // Segmented shuffle matching.
+  if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&


This instruction is also enabled under FEAT_SME2p1

davemgreen · 2025-06-20T14:09:40Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

+
+    // dupq zd.t, zn.t[idx]
+    unsigned Lane = (unsigned)Mask[0];
+    if (SegmentElts * Segments == Mask.size() && Lane < SegmentElts) {


Can this re-use the isDUPQMask method added recently?

davemgreen · 2025-06-20T14:09:51Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ -5583,6 +5583,26 @@ InstructionCost AArch64TTIImpl::getShuffleCost(
    Kind = TTI::SK_PermuteSingleSrc;
  }

+  // Segmented shuffle matching.
+  if (ST->hasSVE2p1() && CostKind == TTI::TCK_RecipThroughput &&


Can it be handled for any CostKind? Having them all match is often OK (enough).


sdesmalen-arm · 2025-06-23T12:21:28Z

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

+    if (Subtarget->hasSVE2p1() ||
+        (Subtarget->hasSME2p1() && Subtarget->isSVEorStreamingSVEAvailable())) {


Sorry, I should have said isStreaming(), i.e.

Suggested change

-    if (Subtarget->hasSVE2p1() ||
-        (Subtarget->hasSME2p1() && Subtarget->isSVEorStreamingSVEAvailable())) {
+    if (Subtarget->hasSVE2p1() ||
+        (Subtarget->hasSME2p1() && Subtarget->isStreaming())) {

Actually, looking at the instruction description it seems your original code (using isSVEorStreamingSVEAvailable) was the right code, because with +sme2p1 the instruction is also enabled in non-streaming mode. This can be seen from the pseudo-code for DUPQ, which has IsSVEEnabled(). This is actually something that @paulwalker-arm is trying to fix in other places (#145322).

After having revisited this once more, the correct interpretation of:

    if !Feature_SVE2p1 && !Feature_SME2p1 then
      Instruction = UNDEFINED
    CheckSVEEnabled() // can be in streaming mode
                      // can be in non-streaming mode

Is that:

+sve,+sme2p1 enables DUPQ in both streaming and non-streaming mode

+sve2p1,+sme enables DUPQ in both streaming and non-streaming mode

Meaning that for DUPQ this expression should be:

if ((Subtarget->hasSVE2p1() || Subtarget->hasSME2p1()) && Subtarget->isSVEorStreamingSVEAvailable())

We previously worked under the assumption that +sme* features would add features in streaming mode, and +sve* features would add features in non-streaming mode. But this is not (or no longer) the case. There may be other places where this needs fixing as well.
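
The resulting predicate can be written as a small truth-table check. This is only a sketch: `MockSubtarget` and `canUseDupQ` are hypothetical stand-ins for the real `AArch64Subtarget` queries named above, and the three booleans simply model what `hasSVE2p1()`, `hasSME2p1()` and `isSVEorStreamingSVEAvailable()` would return.

```cpp
// Hypothetical stand-in for the subtarget queries named in the review.
struct MockSubtarget {
  bool sve2p1 = false;                     // models hasSVE2p1()
  bool sme2p1 = false;                     // models hasSME2p1()
  bool sveOrStreamingSVEAvailable = false; // models isSVEorStreamingSVEAvailable()
};

// Either feature enables DUPQ, as long as SVE (streaming or not) is usable.
bool canUseDupQ(const MockSubtarget &st) {
  return (st.sve2p1 || st.sme2p1) && st.sveOrStreamingSVEAvailable;
}
```

Under this reading, a target with only `+sme2p1` qualifies whenever SVE or streaming SVE is available, and neither feature alone is sufficient when it is not.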


davemgreen

The scores look OK to me. A couple of other suggestions but looks good otherwise, if @sdesmalen-arm is happy.

davemgreen · 2025-06-23T15:26:14Z

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

@@ -5599,6 +5599,23 @@ AArch64TTIImpl::getShuffleCost(TTI::ShuffleKind Kind, VectorType *DstTy,
    SrcTy = DstTy;
  }

+  // Segmented shuffle matching.
+  if ((ST->hasSVE2p1() ||


Check !Mask.empty() too, in case it is not passed in.

davemgreen · 2025-06-23T15:27:43Z

llvm/lib/Target/AArch64/AArch64PerfectShuffle.h

+
+  // Check that all lanes match the first, adjusted for segment.
+  if (all_of(enumerate(M), [&](auto P) {
+        return (unsigned)P.value() == Lane + (P.index() / NumElts) * NumElts;


Add P.value() == PoisonMaskElem || if undef elements are OK to ignore

I was going to leave that alone for now, but since this only applies to SVE2p1+, we have time to figure out later whether there's a better match for edge cases.
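
One possible way to ignore poison lanes is sketched below. It is an assumption-laden illustration rather than the code that landed: the name `isSegmentedSplatAllowingPoison` is hypothetical, masks are plain `std::vector<int>`, and poison lanes are represented by -1, matching LLVM's `PoisonMaskElem`. The sketch takes the lane from the first non-poison element and, as a design choice made here, rejects an all-poison mask.

```cpp
#include <cstddef>
#include <vector>

constexpr int kPoisonLane = -1; // same value as LLVM's PoisonMaskElem

// Hypothetical helper: like the dupq mask check, but poison lanes match
// anything. Rejecting an all-poison mask is an assumption of this sketch.
bool isSegmentedSplatAllowingPoison(const std::vector<int> &mask,
                                    unsigned segmentElts) {
  if (segmentElts == 0 || mask.empty() || mask.size() % segmentElts != 0)
    return false;
  bool haveLane = false;
  unsigned lane = 0;
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i] == kPoisonLane)
      continue; // poison lanes place no constraint on the result
    if (mask[i] < 0)
      return false;
    const unsigned value = static_cast<unsigned>(mask[i]);
    const unsigned segmentBase =
        static_cast<unsigned>(i / segmentElts) * segmentElts;
    // The source lane must come from the same segment as the destination.
    if (value < segmentBase || value - segmentBase >= segmentElts)
      return false;
    const unsigned offset = value - segmentBase;
    if (!haveLane) {
      lane = offset; // first defined element fixes the lane
      haveLane = true;
    } else if (offset != lane) {
      return false;
    }
  }
  return haveLane;
}
```

For example, `<poison, 3, 3, 3, 7, poison, 7, 7>` with four elements per segment is accepted, while `<3, 3, 3, 3, 6, 7, 7, 7>` is rejected.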

gbossu

LGTM, but I'll leave the approval to experts 😄


huntergr-arm requested review from davemgreen, sdesmalen-arm and gbossu June 19, 2025 15:48

llvmbot added the backend:AArch64 and llvm:analysis labels Jun 19, 2025

sdesmalen-arm reviewed Jun 20, 2025


davemgreen reviewed Jun 20, 2025


sdesmalen-arm reviewed Jun 20, 2025


huntergr-arm added 5 commits June 23, 2025 11:26

Test precommit

a42ffba

Return lower cost for dupq

e160c70

* Refactor to share isDUPQMask

542faf0

* Support SME2p1 * Remove hardcoded magic number * Return the same result for other cost kinds

Improve SME check, add runline to test for it

e72d339

Update ISel and codegen test too

a03c040

sdesmalen-arm reviewed Jun 23, 2025


Rebase, getShuffleCost params changed

6899d1c

huntergr-arm force-pushed the segmented-lane-splat-costmodel branch from 4690b2b to 6899d1c Compare June 23, 2025 12:41

huntergr-arm added 2 commits June 23, 2025 12:50

Only check for isStreaming()

57aaebe

Revert "Only check for isStreaming()"

3a7c14c

This reverts commit 57aaebe.

davemgreen reviewed Jun 23, 2025


gbossu reviewed Jun 24, 2025


huntergr-arm added 3 commits June 24, 2025 09:20

* Check for mask being empty

f45e0b2

* Handle poison lanes

Make isDUPQMask clearer, add 512b function to cost test

0c1cdff

Correction to feature checking

80b073c
