
[AMDGPU] A SCHED_BARRIER in a bundle should not prevent other SCHED_BARRIERs to be considered #152627

Open · yoonseoch wants to merge 2 commits into main
Conversation

@yoonseoch commented Aug 8, 2025

A SCHED_BARRIER within a bundle prevented another SCHED_BARRIER outside the bundle from being considered.

This happened because, when the src or dest SUnits of a target SCHED_BARRIER are collected, a bundled SUnit is added only if all of its constituent MIs can be classified against the target SCHED_BARRIER. A SCHED_BARRIER inside a bundle is never classified against the target SCHED_BARRIER, so the target SCHED_BARRIER was not honored even though all non-meta instructions in the bundle could be classified.

In the new function extra_sched_barrier_in_bundle in llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir, without the changes in this PR, the second bundle of global loads was scheduled before the V_MULs.

An alternative fix would be to not bundle a SCHED_BARRIER with other memory operations in the post-RA bundler. That approach was not taken because the post-RA bundler already does not bundle sched_group_barriers or iglp_opt.
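
For reference, a minimal sketch of the fixed check, reconstructed from the diff further down this page (the iterator setup ahead of the all_of is paraphrased and may differ from the surrounding source):

// Sketch of SchedGroup::canAddSU in llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
// after this change; see the actual diff below.
bool SchedGroup::canAddSU(SUnit &SU) const {
  MachineInstr &MI = *SU.getInstr();
  if (MI.getOpcode() != TargetOpcode::BUNDLE)
    return canAddMI(MI);

  // Walk past the BUNDLE header so that [B, E) brackets the bundled MIs.
  auto B = std::next(MI.getIterator());
  auto E = B;
  while (E != MI.getParent()->instr_end() && E->isBundledWithPred())
    ++E;

  // Previously every bundled MI had to satisfy canAddMI, so a meta
  // instruction such as a SCHED_BARRIER inside the bundle vetoed the whole
  // bundle. Meta instructions are now skipped instead.
  return std::all_of(B, E, [this](MachineInstr &BundledMI) {
    return BundledMI.isMetaInstruction() || canAddMI(BundledMI);
  });
}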

github-actions bot commented Aug 8, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot (Member) commented Aug 8, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Yoonseo Choi (yoonseoch)

Changes

Patch is 27.05 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152627.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp (+3-1)
  • (modified) llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir (+286)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
index dbe74b1b08f8c..8c92060128a12 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUIGroupLP.cpp
@@ -2508,7 +2508,9 @@ bool SchedGroup::canAddSU(SUnit &SU) const {
     ++E;
 
   // Return true if all of the bundled MIs can be added to this group.
-  return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });
+  return std::all_of(B, E, [this](MachineInstr &MI) {
+    return (MI.isMetaInstruction()) || canAddMI(MI);
+  });
 }
 
 void SchedGroup::initSchedGroup() {
diff --git a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
index 7bdb8f5b35ec5..a49d2188ce704 100644
--- a/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
+++ b/llvm/test/CodeGen/AMDGPU/sched-barrier-post-RA.mir
@@ -5,6 +5,10 @@
   define amdgpu_kernel void @no_sched_barrier(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
   define amdgpu_kernel void @sched_barrier_mask_0(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
   define amdgpu_kernel void @sched_barrier_mask_1(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @no_sched_barrier_many_loads(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @sched_barrier_mask_1924(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @extra_sched_barrier_in_bundle(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
+  define amdgpu_kernel void @extra_sched_barrier_in_bundle_2(ptr addrspace(1) noalias %out, ptr addrspace(1) noalias %in) { ret void }
 
   !0 = distinct !{!0}
   !1 = !{!1, !0}
@@ -120,3 +124,285 @@ body: |
     }
     S_ENDPGM 0
 ...
+
+
+# No SCHED_BARRIER - The first two bundles of global loads get clustered and VALU instructions are scheduled later.
+
+---
+name: no_sched_barrier_many_loads
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: no_sched_barrier_many_loads
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    S_ENDPGM 0
+...
+
+# MASK = 0b 0111 1000 0100  SALU, MFMA/WMMA, All DS, All DS Read, All DS Write, All Trans may be
+#                     scheduled across SCHED_BARRIER. VALU and all VMEM, VMEM Read/Write cannot be
+#                     scheduled across SCHED_BARRIER. V_MULs are not scheduled after the second bundle of loads.
+
+---
+name: sched_barrier_mask_1924
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: sched_barrier_mask_1924
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: SCHED_BARRIER 1924
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+
+    SCHED_BARRIER 1924
+
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    }
+    S_ENDPGM 0
+...
+
+# MASK = 0b 0111 1000 0100  SALU, MFMA/WMMA, All DS, All DS Read, All DS Write, All Trans may be
+#                     scheduled across SCHED_BARRIER. VALU and all VMEM, VMEM Read/Write cannot be
+#                     scheduled across SCHED_BARRIER. V_MULs are not scheduled after the second bundle of loads.
+#
+#                     A SCHED_BARRIER before the second bundle is honored.
+
+
+---
+name: extra_sched_barrier_in_bundle
+tracksRegLiveness: true
+body: |
+  bb.0:
+    ; CHECK-LABEL: name: extra_sched_barrier_in_bundle
+    ; CHECK: renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit killed $sgpr0_sgpr1, implicit killed $vgpr0, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR killed renamable $sgpr0_sgpr1, killed renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr10 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr0 = IMPLICIT_DEF
+    ; CHECK-NEXT: renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+    ; CHECK-NEXT: SCHED_BARRIER 1924
+    ; CHECK-NEXT: BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+    ; CHECK-NEXT:   renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT:   SCHED_BARRIER 1924
+    ; CHECK-NEXT:   renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    ; CHECK-NEXT: renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamable $sgpr2_sgpr3, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: BUNDLE implicit killed $vgpr0, implicit killed $vgpr1, implicit killed $sgpr4_sgpr5, implicit $exec, implicit killed $vgpr2 {
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR renamable $vgpr0, killed renamable $vgpr1, renamable $sgpr4_sgpr5, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT:   GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr0, killed renamable $vgpr2, killed renamable $sgpr4_sgpr5, 512, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+    ; CHECK-NEXT: }
+    ; CHECK-NEXT: S_ENDPGM 0
+    renamable $sgpr0_sgpr1 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    BUNDLE implicit-def $vgpr1, implicit-def $vgpr1_lo16, implicit-def $vgpr1_hi16, implicit-def $vgpr2, implicit-def $vgpr2_lo16, implicit-def $vgpr2_hi16, implicit $sgpr0_sgpr1, implicit $vgpr0, implicit $exec {
+      renamable $vgpr1 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      renamable $vgpr2 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr0_sgpr1, renamable $vgpr0, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr2_sgpr3 = IMPLICIT_DEF
+    renamable $vgpr10 = IMPLICIT_DEF
+    renamable $vgpr1 = nsw V_MUL_LO_U32_e64 killed $vgpr1, $vgpr1, implicit $exec
+    renamable $vgpr2 = nsw V_MUL_LO_U32_e64 killed $vgpr2, $vgpr2, implicit $exec
+
+    SCHED_BARRIER 1924
+
+    BUNDLE implicit-def $vgpr11, implicit-def $vgpr11_lo16, implicit-def $vgpr11_hi16, implicit-def $vgpr12, implicit-def $vgpr12_lo16, implicit-def $vgpr12_hi16, implicit $sgpr2_sgpr3, implicit $vgpr10, implicit $exec {
+      renamable $vgpr11 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 0, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+      SCHED_BARRIER 1924
+      renamable $vgpr12 = GLOBAL_LOAD_DWORD_SADDR renamable $sgpr2_sgpr3, renamable $vgpr10, 512, 0, implicit $exec :: (load (s32) from %ir.in, !alias.scope !0, addrspace 1)
+    }
+    renamable $sgpr4_sgpr5 = IMPLICIT_DEF
+    renamable $vgpr0 = IMPLICIT_DEF
+    renamable $vgpr11 = nsw V_MUL_LO_U32_e64 killed $vgpr11, $vgpr11, implicit $exec
+    renamable $vgpr12 = nsw V_MUL_LO_U32_e64 killed $vgpr12, $vgpr12, implicit $exec
+    BUNDLE implicit killed $vgpr10, implicit killed $vgpr11, implicit killed $sgpr2_sgpr3, implicit $exec, implicit killed $vgpr12 {
+      GLOBAL_STORE_DWORD_SADDR renamable $vgpr10, killed renamable $vgpr11, renamable $sgpr2_sgpr3, 0, 0, implicit $exec :: (store (s32) into %ir.out, !noalias !0, addrspace 1)
+      GLOBAL_STORE_DWORD_SADDR killed renamable $vgpr10, killed renamable $vgpr12, killed renamab...
[truncated]

@@ -2508,7 +2508,9 @@ bool SchedGroup::canAddSU(SUnit &SU) const {
     ++E;
 
   // Return true if all of the bundled MIs can be added to this group.
-  return std::all_of(B, E, [this](MachineInstr &MI) { return canAddMI(MI); });
+  return std::all_of(B, E, [this](MachineInstr &MI) {
+    return (MI.isMetaInstruction()) || canAddMI(MI);
Contributor commented:

Suggested change
-    return (MI.isMetaInstruction()) || canAddMI(MI);
+    return MI.isMetaInstruction() || canAddMI(MI);

Contributor commented:

The basic change LGTM, though the code structure here is now unnecessarily confusing and could be cleaned up in a follow-up. canAddMI already tries to handle isMetaInstruction internally, and the code here pre-finds the bounds of the bundle before doing the all_of. You could remove the isMetaInstruction handling inside canAddMI and directly check canAddMI in the initial walk through the bundle above.

@yoonseoch (Author) commented Aug 8, 2025:

I see, will do so.
Initially I did not remove the isMetaInstruction handling inside canAddMI because that could lead to edges between two sched_barriers, which I thought was unintentional (lines 2501, 2502).

@yoonseoch (Author) commented Aug 8, 2025:

Making canAddMI() return true for a meta instruction in effect leads to more intrusive changes to existing lit tests. I got about 14 lit-test failures in CodeGen/AMDGPU, mostly in tests of iglp_opt, sched-barrier, and sched-group-barrier.

For example, in llvm.amdgcn.sched.group.barrier.ll:

// sequence after making canAddMI return true on a meta instruction

385 define amdgpu_kernel void @test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE(ptr addrspace(1) noalias %in,
386 ; GCN-LABEL: test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE:                                           
387 ; GCN:       ; %bb.0:                                                                                                 
388 ; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24                                                                    
389 ; GCN-NEXT:    v_and_b32_e32 v0, 0x3ff, v0                                                                            
390 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 7, v0                                                                           
    ----------------------------------------------------------------------------------------------------------------------
391 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)                                                                                   
392 ; GCN-NEXT:    global_load_dwordx4 v[12:15], v16, s[0:1] offset:32                                                    
393 ; GCN-NEXT:    ; kill: killed $sgpr0_sgpr1                                                                            
394 ; GCN-NEXT:    global_load_dwordx4 v[0:3], v16, s[0:1]                                                                
395 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)                                               
396 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000002) size(2) SyncID(0)                                               
397 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000040) size(1) SyncID(0)                                               
398 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)                                               
399 ; GCN-NEXT:    s_waitcnt vmcnt(1)                                                                                     
400 ; GCN-NEXT:    v_mul_lo_u32 v13, v13, v13                                                                             
401 ; GCN-NEXT:    v_mul_lo_u32 v12, v12, v12                                                                             
402 ; GCN-NEXT:    v_mul_lo_u32 v15, v15, v15                                                                             
403 ; GCN-NEXT:    v_mul_lo_u32 v14, v14, v14                                                                             
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
    ----------------------------------------------------------------------------------------------------------------------
404 ; GCN-NEXT:    s_waitcnt vmcnt(0)                                                                                     
405 ; GCN-NEXT:    v_mul_lo_u32 v3, v3, v3                                                                                
406 ; GCN-NEXT:    v_mul_lo_u32 v2, v2, v2                                                                                
407 ; GCN-NEXT:    global_store_dwordx4 v16, v[12:15], s[2:3] offset:32                                                   
408 ; GCN-NEXT:    global_load_dwordx4 v[4:7], v16, s[0:1] offset:48                                                      
409 ; GCN-NEXT:    v_mul_lo_u32 v1, v1, v1             

// original sequence in the test

 385 define amdgpu_kernel void @test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE(ptr addrspace(1) noalias %in
 386 ; GCN-LABEL: test_sched_group_barrier_pipeline_alternating_READ_VALU_WRITE:
 387 ; GCN:       ; %bb.0:
 388 ; GCN-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 389 ; GCN-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 390 ; GCN-NEXT:    v_lshlrev_b32_e32 v16, 7, v0
 391 ; GCN-NEXT:    ; kill: killed $sgpr0_sgpr1
 392 ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
 393 ; GCN-NEXT:    global_load_dwordx4 v[12:15], v16, s[0:1] offset:32
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 394 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)
 395 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000002) size(2) SyncID(0)
 396 ; GCN-NEXT:    s_waitcnt vmcnt(0)
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 397 ; GCN-NEXT:    v_mul_lo_u32 v13, v13, v13
 398 ; GCN-NEXT:    v_mul_lo_u32 v12, v12, v12
 399 ; GCN-NEXT:    v_mul_lo_u32 v15, v15, v15
 400 ; GCN-NEXT:    v_mul_lo_u32 v14, v14, v14
 401 ; GCN-NEXT:    global_store_dwordx4 v16, v[12:15], s[2:3] offset:32
 402 ; GCN-NEXT:    global_load_dwordx4 v[0:3], v16, s[0:1]
 403 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000040) size(1) SyncID(0)
 404 ; GCN-NEXT:    ; sched_group_barrier mask(0x00000020) size(1) SyncID(0)
 405 ; GCN-NEXT:    s_waitcnt vmcnt(0)
 406 ; GCN-NEXT:    v_mul_lo_u32 v3, v3, v3
 407 ; GCN-NEXT:    v_mul_lo_u32 v2, v2, v2
     ---------------------------------------------------------------------------------------------------------------------
     ---------------------------------------------------------------------------------------------------------------------
 408 ; GCN-NEXT:    v_mul_lo_u32 v1, v1, v1      

I think keeping the original logic of canAddMI is safer, given that it is also used for a single non-bundled MI.
I can still simplify the code by moving the check into the initial scan of the range, avoiding the second scan in all_of (a sketch follows below).
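
A hypothetical sketch of that simplification (not committed code; canAddMI keeps its existing isMetaInstruction handling for the single-MI path):

// Hypothetical follow-up: classify each bundled MI during the single walk
// that discovers the bundle's extent, instead of first finding [B, E) and
// then running std::all_of over the same range.
bool SchedGroup::canAddSU(SUnit &SU) const {
  MachineInstr &MI = *SU.getInstr();
  if (MI.getOpcode() != TargetOpcode::BUNDLE)
    return canAddMI(MI);

  for (auto I = std::next(MI.getIterator()),
            End = MI.getParent()->instr_end();
       I != End && I->isBundledWithPred(); ++I)
    if (!I->isMetaInstruction() && !canAddMI(*I))
      return false;
  return true;
}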

@yoonseoch (Author) commented:
@kerbowa, @jrbyrnes, @bcahoon

@bcahoon requested a review from kerbowa, August 8, 2025
@yoonseoch changed the title from "A SCHED_BARRIER in a bundle should not prevent other SCHED_BARRIERs to be considered" to "[AMDGPU] A SCHED_BARRIER in a bundle should not prevent other SCHED_BARRIERs to be considered", Aug 8, 2025