[AMDGPU] Add pass to align inner loops #152356

Closed
wants to merge 2 commits into from

Conversation

@hjagasiaAMD hjagasiaAMD commented Aug 6, 2025

Loops smaller than 32 bytes are aggressively aligned to a 32-byte boundary. For larger loops, check whether the loop starts at an offset greater than 16 bytes into a 32-byte window and, if so, align it to a 32-byte boundary. This is done to improve performance.


github-actions bot commented Aug 6, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by pinging the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.


llvmbot commented Aug 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

Changes

Loops less than 32B are aggressively aligned to 32B boundary. Otherwise for larger loops, check if loop starts at an offset >16B and if so align to 32B boundary. This is done to improve performance.


Patch is 174.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152356.diff

85 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+3)
  • (added) llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp (+153)
  • (added) llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h (+24)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+6)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/cf-loop-on-constant.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/coalescer_distribute.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline-npm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+11-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.and.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.or.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.xor.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+19-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cfg-loop-assert.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/simplifydemandedbits-recursion.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/undefined-subreg-liverange.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 007b481f84960..a9270eadf1232 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -231,6 +231,9 @@ extern char &AMDGPUPerfHintAnalysisLegacyID;
 void initializeGCNRegPressurePrinterPass(PassRegistry &);
 extern char &GCNRegPressurePrinterID;
 
+void initializeAMDGPULoopAlignLegacyPass(PassRegistry &);
+extern char &AMDGPULoopAlignLegacyID;
+
 void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernArgPrologLegacyID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
new file mode 100644
index 0000000000000..409b3e47bf2a8
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
@@ -0,0 +1,153 @@
+//===----- AMDGPULoopAlign.cpp - Generate loop alignment directives  -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Inspect a basic block and if certain conditions are met then align to 32
+// bytes.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPULoopAlign.h"
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-loop-align"
+
+static cl::opt<bool>
+    DisableLoopAlign("disable-amdgpu-loop-align", cl::init(false), cl::Hidden,
+                     cl::desc("Disable AMDGPU loop alignment pass"));
+
+namespace {
+
+class AMDGPULoopAlign {
+private:
+  MachineLoopInfo &MLI;
+
+public:
+  AMDGPULoopAlign(MachineLoopInfo &MLI) : MLI(MLI) {}
+
+  struct BasicBlockInfo {
+    // Offset - Distance from the beginning of the function to the beginning
+    // of this basic block.
+    uint64_t Offset = 0;
+    // Size - Size of the basic block in bytes
+    uint64_t Size = 0;
+  };
+
+  void generateBlockInfo(MachineFunction &MF,
+                         SmallVectorImpl<BasicBlockInfo> &BlockInfo) {
+    BlockInfo.clear();
+    BlockInfo.resize(MF.getNumBlockIDs());
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    for (const MachineBasicBlock &MBB : MF) {
+      BlockInfo[MBB.getNumber()].Size = 0;
+      for (const MachineInstr &MI : MBB) {
+        BlockInfo[MBB.getNumber()].Size += TII->getInstSizeInBytes(MI);
+      }
+    }
+    uint64_t PrevNum = MF.begin()->getNumber();
+    for (auto &MBB : make_range(std::next(MF.begin()), MF.end())) {
+      uint64_t Num = MBB.getNumber();
+      BlockInfo[Num].Offset =
+          BlockInfo[PrevNum].Offset + BlockInfo[PrevNum].Size;
+      unsigned BlockAlign = MBB.getAlignment().value();
+      unsigned ParentAlign = MBB.getParent()->getAlignment().value();
+      if (BlockAlign <= ParentAlign)
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlign);
+      else
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlign) +
+                                BlockAlign - ParentAlign;
+      PrevNum = Num;
+    }
+  }
+
+  bool run(MachineFunction &MF) {
+    if (DisableLoopAlign)
+      return false;
+
+    // The starting address of all shader programs must be 256 bytes aligned.
+    // Regular functions just need the basic required instruction alignment.
+    const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
+    MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
+    if (MF.getAlignment().value() < 32)
+      return false;
+
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    SmallVector<BasicBlockInfo, 16> BlockInfo;
+    generateBlockInfo(MF, BlockInfo);
+
+    bool Changed = false;
+    for (MachineLoop *ML : MLI.getLoopsInPreorder()) {
+      // Check if loop is innermost
+      if (!ML->isInnermost())
+        continue;
+      MachineBasicBlock *Header = ML->getHeader();
+      // Check if loop is already evaluated for prefetch & aligned
+      if (Header->getAlignment().value() == 64 ||
+          ML->getTopBlock()->getAlignment().value() == 64)
+        continue;
+
+      // If the loop is smaller than 8 dwords (32 bytes), align it aggressively
+      // to a 0 mod 32-byte boundary. Otherwise align it only when the header
+      // would start more than 16 bytes into a 32-byte fetch window, i.e. fewer
+      // than 4 dwords of the window remain before the loop begins.
+      unsigned LoopSizeInBytes = 0;
+      for (MachineBasicBlock *MBB : ML->getBlocks())
+        for (MachineInstr &MI : *MBB)
+          LoopSizeInBytes += TII->getInstSizeInBytes(MI);
+
+      if (LoopSizeInBytes < 32 ||
+          BlockInfo[Header->getNumber()].Offset % 32 > 16) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      }
+    }
+    return Changed;
+  }
+};
+
+class AMDGPULoopAlignLegacy : public MachineFunctionPass {
+public:
+  static char ID;
+
+  AMDGPULoopAlignLegacy() : MachineFunctionPass(ID) {}
+
+  bool runOnMachineFunction(MachineFunction &MF) override {
+    return AMDGPULoopAlign(getAnalysis<MachineLoopInfoWrapperPass>().getLI())
+        .run(MF);
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineLoopInfoWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+} // namespace
+
+char AMDGPULoopAlignLegacy::ID = 0;
+
+char &llvm::AMDGPULoopAlignLegacyID = AMDGPULoopAlignLegacy::ID;
+
+INITIALIZE_PASS(AMDGPULoopAlignLegacy, DEBUG_TYPE, "AMDGPU Loop Align", false,
+                false)
+
+PreservedAnalyses
+AMDGPULoopAlignPass::run(MachineFunction &MF,
+                         MachineFunctionAnalysisManager &MFAM) {
+  auto &MLI = MFAM.getResult<MachineLoopAnalysis>(MF);
+  if (AMDGPULoopAlign(MLI).run(MF))
+    return PreservedAnalyses::none();
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
new file mode 100644
index 0000000000000..12b9f13926415
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
@@ -0,0 +1,24 @@
+//===--- AMDGPULoopAlign.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+
+#include "llvm/CodeGen/MachinePassManager.h"
+
+namespace llvm {
+
+class AMDGPULoopAlignPass : public PassInfoMixin<AMDGPULoopAlignPass> {
+public:
+  PreservedAnalyses run(MachineFunction &MF,
+                        MachineFunctionAnalysisManager &MFAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index b6c6d927d0e89..35c41d1d73a59 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -117,6 +117,7 @@ MACHINE_FUNCTION_PASS("amdgpu-preload-kern-arg-prolog", AMDGPUPreloadKernArgProl
 MACHINE_FUNCTION_PASS("amdgpu-prepare-agpr-alloc", AMDGPUPrepareAGPRAllocPass())
 MACHINE_FUNCTION_PASS("amdgpu-nsa-reassign", GCNNSAReassignPass())
 MACHINE_FUNCTION_PASS("amdgpu-wait-sgpr-hazards", AMDGPUWaitSGPRHazardsPass())
+MACHINE_FUNCTION_PASS("amdgpu-loop-align", AMDGPULoopAlignPass())
 MACHINE_FUNCTION_PASS("gcn-create-vopd", GCNCreateVOPDPass())
 MACHINE_FUNCTION_PASS("gcn-dpp-combine", GCNDPPCombinePass())
 MACHINE_FUNCTION_PASS("si-fix-sgpr-copies", SIFixSGPRCopiesPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c1f17033d04a8..848968a4da88a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -22,6 +22,7 @@
 #include "AMDGPUExportKernelRuntimeHandles.h"
 #include "AMDGPUIGroupLP.h"
 #include "AMDGPUISelDAGToDAG.h"
+#include "AMDGPULoopAlign.h"
 #include "AMDGPUMacroFusion.h"
 #include "AMDGPUPerfHintAnalysis.h"
 #include "AMDGPUPreloadKernArgProlog.h"
@@ -570,6 +571,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeGCNRegPressurePrinterPass(*PR);
   initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
   initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+  initializeAMDGPULoopAlignLegacyPass(*PR);
   initializeAMDGPUPreloadKernelArgumentsLegacyPass(*PR);
 }
 
@@ -1764,6 +1766,8 @@ void GCNPassConfig::addPreEmitPass() {
     addPass(&AMDGPUInsertDelayAluID);
 
   addPass(&BranchRelaxationPassID);
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(&AMDGPULoopAlignLegacyID);
 }
 
 void GCNPassConfig::addPostBBSections() {
@@ -2352,6 +2356,8 @@ void AMDGPUCodeGenPassBuilder::addPreEmitPass(AddMachinePass &addPass) const {
   }
 
   addPass(BranchRelaxationPass());
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(AMDGPULoopAlignPass());
 }
 
 bool AMDGPUCodeGenPassBuilder::isPassEnabled(const cl::opt<bool> &Opt,
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c466f9cf0f359..482b71d910a21 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -70,6 +70,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPULibCalls.cpp
   AMDGPUImageIntrinsicOptimizer.cpp
   AMDGPULibFunc.cpp
+  AMDGPULoopAlign.cpp
   AMDGPULowerBufferFatPointers.cpp
   AMDGPULowerKernelArguments.cpp
   AMDGPULowerKernelAttributes.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..1b81022a273a2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 2b595b9bbecc0..5d8cd9af8a7b1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
-; RUN: not llc -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
+; RUN: not llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
 
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f32(uint node_ptr, float ray_extent, float3 ray_origin, float3 ray_dir, float3 ray_inv_dir, uint4 texture_descr)
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
 
 declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
 declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]


llvmbot commented Aug 6, 2025

@llvm/pr-subscribers-llvm-globalisel

  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+19-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cfg-loop-assert.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/simplifydemandedbits-recursion.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/undefined-subreg-liverange.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 007b481f84960..a9270eadf1232 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -231,6 +231,9 @@ extern char &AMDGPUPerfHintAnalysisLegacyID;
 void initializeGCNRegPressurePrinterPass(PassRegistry &);
 extern char &GCNRegPressurePrinterID;
 
+void initializeAMDGPULoopAlignLegacyPass(PassRegistry &);
+extern char &AMDGPULoopAlignLegacyID;
+
 void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernArgPrologLegacyID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
new file mode 100644
index 0000000000000..409b3e47bf2a8
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
@@ -0,0 +1,153 @@
+//===----- AMDGPULoopAlign.cpp - Generate loop alignment directives  -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Inspect a basic block and if certain conditions are met then align to 32
+// bytes.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPULoopAlign.h"
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-loop-align"
+
+static cl::opt<bool>
+    DisableLoopAlign("disable-amdgpu-loop-align", cl::init(false), cl::Hidden,
+                     cl::desc("Disable AMDGPU loop alignment pass"));
+
+namespace {
+
+class AMDGPULoopAlign {
+private:
+  MachineLoopInfo &MLI;
+
+public:
+  AMDGPULoopAlign(MachineLoopInfo &MLI) : MLI(MLI) {}
+
+  struct BasicBlockInfo {
+    // Offset - Distance from the beginning of the function to the beginning
+    // of this basic block.
+    uint64_t Offset = 0;
+    // Size - Size of the basic block in bytes
+    uint64_t Size = 0;
+  };
+
+  void generateBlockInfo(MachineFunction &MF,
+                         SmallVectorImpl<BasicBlockInfo> &BlockInfo) {
+    BlockInfo.clear();
+    BlockInfo.resize(MF.getNumBlockIDs());
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    for (const MachineBasicBlock &MBB : MF) {
+      BlockInfo[MBB.getNumber()].Size = 0;
+      for (const MachineInstr &MI : MBB) {
+        BlockInfo[MBB.getNumber()].Size += TII->getInstSizeInBytes(MI);
+      }
+    }
+    uint64_t PrevNum = (&MF)->begin()->getNumber();
+    for (auto &MBB :
+         make_range(std::next(MachineFunction::iterator((&MF)->begin())),
+                    (&MF)->end())) {
+      uint64_t Num = MBB.getNumber();
+      BlockInfo[Num].Offset =
+          BlockInfo[PrevNum].Offset + BlockInfo[PrevNum].Size;
+      unsigned blockAlignment = MBB.getAlignment().value();
+      unsigned ParentAlign = MBB.getParent()->getAlignment().value();
+      if (blockAlignment <= ParentAlign)
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, blockAlignment);
+      else
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, blockAlignment) +
+                                blockAlignment - ParentAlign;
+      PrevNum = Num;
+    }
+  }
+
+  bool run(MachineFunction &MF) {
+    if (DisableLoopAlign)
+      return false;
+
+    // The starting address of all shader programs must be 256 bytes aligned.
+    // Regular functions just need the basic required instruction alignment.
+    const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
+    MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
+    if (MF.getAlignment().value() < 32)
+      return false;
+
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    SmallVector<BasicBlockInfo, 16> BlockInfo;
+    generateBlockInfo(MF, BlockInfo);
+
+    bool Changed = false;
+    for (MachineLoop *ML : MLI.getLoopsInPreorder()) {
+      // Check if loop is innermost
+      if (!ML->isInnermost())
+        continue;
+      MachineBasicBlock *Header = ML->getHeader();
+      // Check if loop is already evaluated for prefetch & aligned
+      if (Header->getAlignment().value() == 64 ||
+          ML->getTopBlock()->getAlignment().value() == 64)
+        continue;
+
+      // If loop is < 8-dwords, align aggressively to 0 mod 8 dword boundary.
+      // else align to 0 mod 8 dword boundary only if less than 4 dwords of
+      // instructions are available
+      unsigned loopSizeInBytes = 0;
+      for (MachineBasicBlock *MBB : ML->getBlocks())
+        for (MachineInstr &MI : *MBB)
+          loopSizeInBytes += TII->getInstSizeInBytes(MI);
+
+      if (loopSizeInBytes < 32) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      } else if (BlockInfo[Header->getNumber()].Offset % 32 > 16) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      }
+    }
+    return Changed;
+  }
+};
+
+class AMDGPULoopAlignLegacy : public MachineFunctionPass {
+public:
+  static char ID;
+
+  AMDGPULoopAlignLegacy() : MachineFunctionPass(ID) {}
+
+  bool runOnMachineFunction(MachineFunction &MF) override {
+    return AMDGPULoopAlign(getAnalysis<MachineLoopInfoWrapperPass>().getLI())
+        .run(MF);
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineLoopInfoWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+} // namespace
+
+char AMDGPULoopAlignLegacy::ID = 0;
+
+char &llvm::AMDGPULoopAlignLegacyID = AMDGPULoopAlignLegacy::ID;
+
+INITIALIZE_PASS(AMDGPULoopAlignLegacy, DEBUG_TYPE, "AMDGPU Loop Align", false,
+                false)
+
+PreservedAnalyses
+AMDGPULoopAlignPass::run(MachineFunction &MF,
+                         MachineFunctionAnalysisManager &MFAM) {
+  auto &MLI = MFAM.getResult<MachineLoopAnalysis>(MF);
+  if (AMDGPULoopAlign(MLI).run(MF))
+    return PreservedAnalyses::none();
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
new file mode 100644
index 0000000000000..12b9f13926415
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
@@ -0,0 +1,24 @@
+//===--- AMDGPULoopAlign.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+
+#include "llvm/CodeGen/MachinePassManager.h"
+
+namespace llvm {
+
+class AMDGPULoopAlignPass : public PassInfoMixin<AMDGPULoopAlignPass> {
+public:
+  PreservedAnalyses run(MachineFunction &MF,
+                        MachineFunctionAnalysisManager &MFAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index b6c6d927d0e89..35c41d1d73a59 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -117,6 +117,7 @@ MACHINE_FUNCTION_PASS("amdgpu-preload-kern-arg-prolog", AMDGPUPreloadKernArgProl
 MACHINE_FUNCTION_PASS("amdgpu-prepare-agpr-alloc", AMDGPUPrepareAGPRAllocPass())
 MACHINE_FUNCTION_PASS("amdgpu-nsa-reassign", GCNNSAReassignPass())
 MACHINE_FUNCTION_PASS("amdgpu-wait-sgpr-hazards", AMDGPUWaitSGPRHazardsPass())
+MACHINE_FUNCTION_PASS("amdgpu-loop-align", AMDGPULoopAlignPass())
 MACHINE_FUNCTION_PASS("gcn-create-vopd", GCNCreateVOPDPass())
 MACHINE_FUNCTION_PASS("gcn-dpp-combine", GCNDPPCombinePass())
 MACHINE_FUNCTION_PASS("si-fix-sgpr-copies", SIFixSGPRCopiesPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c1f17033d04a8..848968a4da88a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -22,6 +22,7 @@
 #include "AMDGPUExportKernelRuntimeHandles.h"
 #include "AMDGPUIGroupLP.h"
 #include "AMDGPUISelDAGToDAG.h"
+#include "AMDGPULoopAlign.h"
 #include "AMDGPUMacroFusion.h"
 #include "AMDGPUPerfHintAnalysis.h"
 #include "AMDGPUPreloadKernArgProlog.h"
@@ -570,6 +571,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeGCNRegPressurePrinterPass(*PR);
   initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
   initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+  initializeAMDGPULoopAlignLegacyPass(*PR);
   initializeAMDGPUPreloadKernelArgumentsLegacyPass(*PR);
 }
 
@@ -1764,6 +1766,8 @@ void GCNPassConfig::addPreEmitPass() {
     addPass(&AMDGPUInsertDelayAluID);
 
   addPass(&BranchRelaxationPassID);
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(&AMDGPULoopAlignLegacyID);
 }
 
 void GCNPassConfig::addPostBBSections() {
@@ -2352,6 +2356,8 @@ void AMDGPUCodeGenPassBuilder::addPreEmitPass(AddMachinePass &addPass) const {
   }
 
   addPass(BranchRelaxationPass());
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(AMDGPULoopAlignPass());
 }
 
 bool AMDGPUCodeGenPassBuilder::isPassEnabled(const cl::opt<bool> &Opt,
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c466f9cf0f359..482b71d910a21 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -70,6 +70,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPULibCalls.cpp
   AMDGPUImageIntrinsicOptimizer.cpp
   AMDGPULibFunc.cpp
+  AMDGPULoopAlign.cpp
   AMDGPULowerBufferFatPointers.cpp
   AMDGPULowerKernelArguments.cpp
   AMDGPULowerKernelAttributes.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..1b81022a273a2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 2b595b9bbecc0..5d8cd9af8a7b1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
-; RUN: not llc -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
+; RUN: not llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
 
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f32(uint node_ptr, float ray_extent, float3 ray_origin, float3 ray_dir, float3 ray_inv_dir, uint4 texture_descr)
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
 
 declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
 declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]

@hjagasiaAMD hjagasiaAMD changed the title Add pass to align inner loops. Loops less than 32B are aggressively Add pass to align inner loops for AMDGPU. Aug 6, 2025
@ronlieb ronlieb requested review from arsenm and ronlieb August 6, 2025 18:48
@hjagasiaAMD hjagasiaAMD changed the title Add pass to align inner loops for AMDGPU. [AMDGPU] Add pass to align inner loops. Aug 6, 2025
@hjagasiaAMD hjagasiaAMD changed the title [AMDGPU] Add pass to align inner loops. [AMDGPU] Add pass to align inner loops Aug 6, 2025
Contributor

@arsenm arsenm left a comment


Why does this need to be a custom pass? MachineBlockPlacement can already align loops based on getPrefLoopAlignment.

Before doing anything with manually specified block alignments, we need to add the missing MIR serialization for block alignment.

Comment on lines +76 to +79
// The starting address of all shader programs must be 256 bytes aligned.
// Regular functions just need the basic required instruction alignment.
const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
Contributor

This is already handled when emitting code; you don't need to handle it again.

uint64_t Size = 0;
};

void generateBlockInfo(MachineFunction &MF,
Contributor

I really don't want to add yet another instance of a pass doing its own code-size computations. For the purpose of this pass, you should be able to just set the alignment on the block. getInstSizeInBytes isn't completely accurate either.

@hjagasiaAMD
Author

Will address in MachineBlockPlacement. Closing this PR.

With -stop-after=block-placement

bb.1.bb2 (align 64):
successors: %bb.2(0x04000000), %bb.1(0x7c000000)
liveins: $sgpr0

The block alignment is there, so I don't think there is a problem with MIR serialization of block alignment.

@hjagasiaAMD hjagasiaAMD deleted the amdgpu-loopalign branch August 11, 2025 17:28