[AMDGPU] Add pass to align inner loops #152356

Closed
wants to merge 2 commits into from

Conversation

@hjagasiaAMD hjagasiaAMD commented Aug 6, 2025

Loops smaller than 32 bytes are aggressively aligned to a 32-byte boundary. For larger loops, check whether the loop starts at an offset greater than 16 bytes into a 32-byte window and, if so, align it to a 32-byte boundary. This is done to improve performance.


github-actions bot commented Aug 6, 2025

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by pinging the PR with a comment saying "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.


llvmbot commented Aug 6, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: None (hjagasiaAMD)

Changes

Loops less than 32B are aggressively aligned to 32B boundary. Otherwise for larger loops, check if loop starts at an offset >16B and if so align to 32B boundary. This is done to improve performance.


Patch is 174.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152356.diff

85 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPU.h (+3)
  • (added) llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp (+153)
  • (added) llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h (+24)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (+1)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (+6)
  • (modified) llvm/lib/Target/AMDGPU/CMakeLists.txt (+1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+26-26)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+14-14)
  • (modified) llvm/test/CodeGen/AMDGPU/branch-relaxation.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-memcpy.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/cf-loop-on-constant.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/cgp-addressing-modes-smem.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/coalescer_distribute.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/copy-to-reg-frameindex.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/dynamic_stackalloc.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/fix-sgpr-copies-phi-regression-issue130646-issue130119.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i32_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/flat_atomics_i64_system_noprivate.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global-load-saddr-to-vaddr.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-atomics-min-max-system.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/global-saddr-load.ll (+8-8)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i32_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_i64_system.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fadd.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fmin.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/global_atomics_scan_fsub.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/iglp-no-clobber.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/indirect-addressing-si.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/infer-addrspace-flat-atomic.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/infinite-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/insert-delay-alu-bug.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline-npm.ll (+5-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llc-pipeline.ll (+11-5)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.pops.exiting.wave.id.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.add.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.and.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.max.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.min.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.or.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.sub.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umax.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.umin.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.reduce.xor.ll (+12-12)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.struct.ptr.atomic.buffer.load.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.wqm.demote.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/local-atomicrmw-fadd.ll (+11-11)
  • (modified) llvm/test/CodeGen/AMDGPU/local-stack-alloc-block-sp-reference.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch-data.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop-prefetch.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+19-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cfg-loop-assert.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/simplifydemandedbits-recursion.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/undefined-subreg-liverange.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 007b481f84960..a9270eadf1232 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -231,6 +231,9 @@ extern char &AMDGPUPerfHintAnalysisLegacyID;
 void initializeGCNRegPressurePrinterPass(PassRegistry &);
 extern char &GCNRegPressurePrinterID;
 
+void initializeAMDGPULoopAlignLegacyPass(PassRegistry &);
+extern char &AMDGPULoopAlignLegacyID;
+
 void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernArgPrologLegacyID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
new file mode 100644
index 0000000000000..409b3e47bf2a8
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
@@ -0,0 +1,153 @@
+//===----- AMDGPULoopAlign.cpp - Generate loop alignment directives  -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Inspect a basic block and if certain conditions are met then align to 32
+// bytes.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPULoopAlign.h"
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-loop-align"
+
+static cl::opt<bool>
+    DisableLoopAlign("disable-amdgpu-loop-align", cl::init(false), cl::Hidden,
+                     cl::desc("Disable AMDGPU loop alignment pass"));
+
+namespace {
+
+class AMDGPULoopAlign {
+private:
+  MachineLoopInfo &MLI;
+
+public:
+  AMDGPULoopAlign(MachineLoopInfo &MLI) : MLI(MLI) {}
+
+  struct BasicBlockInfo {
+    // Offset - Distance from the beginning of the function to the beginning
+    // of this basic block.
+    uint64_t Offset = 0;
+    // Size - Size of the basic block in bytes
+    uint64_t Size = 0;
+  };
+
+  void generateBlockInfo(MachineFunction &MF,
+                         SmallVectorImpl<BasicBlockInfo> &BlockInfo) {
+    BlockInfo.clear();
+    BlockInfo.resize(MF.getNumBlockIDs());
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    for (const MachineBasicBlock &MBB : MF) {
+      BlockInfo[MBB.getNumber()].Size = 0;
+      for (const MachineInstr &MI : MBB) {
+        BlockInfo[MBB.getNumber()].Size += TII->getInstSizeInBytes(MI);
+      }
+    }
+    uint64_t PrevNum = MF.begin()->getNumber();
+    for (auto &MBB : make_range(std::next(MF.begin()), MF.end())) {
+      uint64_t Num = MBB.getNumber();
+      BlockInfo[Num].Offset =
+          BlockInfo[PrevNum].Offset + BlockInfo[PrevNum].Size;
+      unsigned BlockAlign = MBB.getAlignment().value();
+      unsigned ParentAlign = MBB.getParent()->getAlignment().value();
+      if (BlockAlign <= ParentAlign)
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlign);
+      else
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlign) +
+                                BlockAlign - ParentAlign;
+      PrevNum = Num;
+    }
+  }
+
+  bool run(MachineFunction &MF) {
+    if (DisableLoopAlign)
+      return false;
+
+    // The starting address of all shader programs must be 256 bytes aligned.
+    // Regular functions just need the basic required instruction alignment.
+    const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
+    MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
+    if (MF.getAlignment().value() < 32)
+      return false;
+
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    SmallVector<BasicBlockInfo, 16> BlockInfo;
+    generateBlockInfo(MF, BlockInfo);
+
+    bool Changed = false;
+    for (MachineLoop *ML : MLI.getLoopsInPreorder()) {
+      // Check if loop is innermost
+      if (!ML->isInnermost())
+        continue;
+      MachineBasicBlock *Header = ML->getHeader();
+      // Check if loop is already evaluated for prefetch & aligned
+      if (Header->getAlignment().value() == 64 ||
+          ML->getTopBlock()->getAlignment().value() == 64)
+        continue;
+
+      // If the loop is smaller than 8 dwords (32 bytes), align it aggressively
+      // to a 0 mod 32-byte boundary. Otherwise align it only when the header
+      // would start more than 16 bytes into a 32-byte fetch window, i.e. fewer
+      // than 4 dwords of the window remain before the loop begins.
+      unsigned LoopSizeInBytes = 0;
+      for (MachineBasicBlock *MBB : ML->getBlocks())
+        for (MachineInstr &MI : *MBB)
+          LoopSizeInBytes += TII->getInstSizeInBytes(MI);
+
+      if (LoopSizeInBytes < 32 ||
+          BlockInfo[Header->getNumber()].Offset % 32 > 16) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      }
+    }
+    return Changed;
+  }
+};
+
+class AMDGPULoopAlignLegacy : public MachineFunctionPass {
+public:
+  static char ID;
+
+  AMDGPULoopAlignLegacy() : MachineFunctionPass(ID) {}
+
+  bool runOnMachineFunction(MachineFunction &MF) override {
+    return AMDGPULoopAlign(getAnalysis<MachineLoopInfoWrapperPass>().getLI())
+        .run(MF);
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineLoopInfoWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+} // namespace
+
+char AMDGPULoopAlignLegacy::ID = 0;
+
+char &llvm::AMDGPULoopAlignLegacyID = AMDGPULoopAlignLegacy::ID;
+
+INITIALIZE_PASS(AMDGPULoopAlignLegacy, DEBUG_TYPE, "AMDGPU Loop Align", false,
+                false)
+
+PreservedAnalyses
+AMDGPULoopAlignPass::run(MachineFunction &MF,
+                         MachineFunctionAnalysisManager &MFAM) {
+  auto &MLI = MFAM.getResult<MachineLoopAnalysis>(MF);
+  if (AMDGPULoopAlign(MLI).run(MF))
+    return PreservedAnalyses::none();
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
new file mode 100644
index 0000000000000..12b9f13926415
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
@@ -0,0 +1,24 @@
+//===--- AMDGPULoopAlign.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+
+#include "llvm/CodeGen/MachinePassManager.h"
+
+namespace llvm {
+
+class AMDGPULoopAlignPass : public PassInfoMixin<AMDGPULoopAlignPass> {
+public:
+  PreservedAnalyses run(MachineFunction &MF,
+                        MachineFunctionAnalysisManager &MFAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index b6c6d927d0e89..35c41d1d73a59 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -117,6 +117,7 @@ MACHINE_FUNCTION_PASS("amdgpu-preload-kern-arg-prolog", AMDGPUPreloadKernArgProl
 MACHINE_FUNCTION_PASS("amdgpu-prepare-agpr-alloc", AMDGPUPrepareAGPRAllocPass())
 MACHINE_FUNCTION_PASS("amdgpu-nsa-reassign", GCNNSAReassignPass())
 MACHINE_FUNCTION_PASS("amdgpu-wait-sgpr-hazards", AMDGPUWaitSGPRHazardsPass())
+MACHINE_FUNCTION_PASS("amdgpu-loop-align", AMDGPULoopAlignPass())
 MACHINE_FUNCTION_PASS("gcn-create-vopd", GCNCreateVOPDPass())
 MACHINE_FUNCTION_PASS("gcn-dpp-combine", GCNDPPCombinePass())
 MACHINE_FUNCTION_PASS("si-fix-sgpr-copies", SIFixSGPRCopiesPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c1f17033d04a8..848968a4da88a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -22,6 +22,7 @@
 #include "AMDGPUExportKernelRuntimeHandles.h"
 #include "AMDGPUIGroupLP.h"
 #include "AMDGPUISelDAGToDAG.h"
+#include "AMDGPULoopAlign.h"
 #include "AMDGPUMacroFusion.h"
 #include "AMDGPUPerfHintAnalysis.h"
 #include "AMDGPUPreloadKernArgProlog.h"
@@ -570,6 +571,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeGCNRegPressurePrinterPass(*PR);
   initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
   initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+  initializeAMDGPULoopAlignLegacyPass(*PR);
   initializeAMDGPUPreloadKernelArgumentsLegacyPass(*PR);
 }
 
@@ -1764,6 +1766,8 @@ void GCNPassConfig::addPreEmitPass() {
     addPass(&AMDGPUInsertDelayAluID);
 
   addPass(&BranchRelaxationPassID);
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(&AMDGPULoopAlignLegacyID);
 }
 
 void GCNPassConfig::addPostBBSections() {
@@ -2352,6 +2356,8 @@ void AMDGPUCodeGenPassBuilder::addPreEmitPass(AddMachinePass &addPass) const {
   }
 
   addPass(BranchRelaxationPass());
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(AMDGPULoopAlignPass());
 }
 
 bool AMDGPUCodeGenPassBuilder::isPassEnabled(const cl::opt<bool> &Opt,
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c466f9cf0f359..482b71d910a21 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -70,6 +70,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPULibCalls.cpp
   AMDGPUImageIntrinsicOptimizer.cpp
   AMDGPULibFunc.cpp
+  AMDGPULoopAlign.cpp
   AMDGPULowerBufferFatPointers.cpp
   AMDGPULowerKernelArguments.cpp
   AMDGPULowerKernelAttributes.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..1b81022a273a2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 2b595b9bbecc0..5d8cd9af8a7b1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
-; RUN: not llc -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
+; RUN: not llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
 
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f32(uint node_ptr, float ray_extent, float3 ray_origin, float3 ray_dir, float3 ray_inv_dir, uint4 texture_descr)
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
 
 declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
 declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]


llvmbot commented Aug 6, 2025

@llvm/pr-subscribers-llvm-globalisel

  • (modified) llvm/test/CodeGen/AMDGPU/loop_break.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/machine-sink-temporal-divergence-swdev407790.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mdt-preserving-crash.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/mfma-loop.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/nested-loop-conditions.ll (+19-1)
  • (modified) llvm/test/CodeGen/AMDGPU/no-dup-inst-prefetch.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/no-fold-accvgpr-mov.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/noclobber-barrier.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/optimize-negated-cond.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/select-undef.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/should-not-hoist-set-inactive.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cf.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/si-annotate-cfg-loop-assert.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/simplifydemandedbits-recursion.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/skip-if-dead.ll (+4-4)
  • (modified) llvm/test/CodeGen/AMDGPU/srem64.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/trap-abis.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/undefined-subreg-liverange.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-cfg.ll (+2-2)
  • (modified) llvm/test/CodeGen/AMDGPU/valu-i1.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/vni8-across-blocks.ll (+1-1)
  • (modified) llvm/test/CodeGen/AMDGPU/wave32.ll (+5-5)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 007b481f84960..a9270eadf1232 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -231,6 +231,9 @@ extern char &AMDGPUPerfHintAnalysisLegacyID;
 void initializeGCNRegPressurePrinterPass(PassRegistry &);
 extern char &GCNRegPressurePrinterID;
 
+void initializeAMDGPULoopAlignLegacyPass(PassRegistry &);
+extern char &AMDGPULoopAlignLegacyID;
+
 void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
 extern char &AMDGPUPreloadKernArgPrologLegacyID;
 
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
new file mode 100644
index 0000000000000..409b3e47bf2a8
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
@@ -0,0 +1,153 @@
+//===----- AMDGPULoopAlign.cpp - Generate loop alignment directives  -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Inspect a basic block and if certain conditions are met then align to 32
+// bytes.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPULoopAlign.h"
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-loop-align"
+
+static cl::opt<bool>
+    DisableLoopAlign("disable-amdgpu-loop-align", cl::init(false), cl::Hidden,
+                     cl::desc("Disable AMDGPU loop alignment pass"));
+
+namespace {
+
+class AMDGPULoopAlign {
+private:
+  MachineLoopInfo &MLI;
+
+public:
+  AMDGPULoopAlign(MachineLoopInfo &MLI) : MLI(MLI) {}
+
+  struct BasicBlockInfo {
+    // Offset - Distance from the beginning of the function to the beginning
+    // of this basic block.
+    uint64_t Offset = 0;
+    // Size - Size of the basic block in bytes
+    uint64_t Size = 0;
+  };
+
+  void generateBlockInfo(MachineFunction &MF,
+                         SmallVectorImpl<BasicBlockInfo> &BlockInfo) {
+    BlockInfo.clear();
+    BlockInfo.resize(MF.getNumBlockIDs());
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    for (const MachineBasicBlock &MBB : MF) {
+      BlockInfo[MBB.getNumber()].Size = 0;
+      for (const MachineInstr &MI : MBB) {
+        BlockInfo[MBB.getNumber()].Size += TII->getInstSizeInBytes(MI);
+      }
+    }
+    uint64_t PrevNum = (&MF)->begin()->getNumber();
+    for (auto &MBB :
+         make_range(std::next(MachineFunction::iterator((&MF)->begin())),
+                    (&MF)->end())) {
+      uint64_t Num = MBB.getNumber();
+      BlockInfo[Num].Offset =
+          BlockInfo[PrevNum].Offset + BlockInfo[PrevNum].Size;
+      unsigned blockAlignment = MBB.getAlignment().value();
+      unsigned ParentAlign = MBB.getParent()->getAlignment().value();
+      if (blockAlignment <= ParentAlign)
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, blockAlignment);
+      else
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, blockAlignment) +
+                                blockAlignment - ParentAlign;
+      PrevNum = Num;
+    }
+  }
+
+  bool run(MachineFunction &MF) {
+    if (DisableLoopAlign)
+      return false;
+
+    // The starting address of all shader programs must be 256 bytes aligned.
+    // Regular functions just need the basic required instruction alignment.
+    const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
+    MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
+    if (MF.getAlignment().value() < 32)
+      return false;
+
+    const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+    SmallVector<BasicBlockInfo, 16> BlockInfo;
+    generateBlockInfo(MF, BlockInfo);
+
+    bool Changed = false;
+    for (MachineLoop *ML : MLI.getLoopsInPreorder()) {
+      // Check if loop is innermost
+      if (!ML->isInnermost())
+        continue;
+      MachineBasicBlock *Header = ML->getHeader();
+      // Check if loop is already evaluated for prefetch & aligned
+      if (Header->getAlignment().value() == 64 ||
+          ML->getTopBlock()->getAlignment().value() == 64)
+        continue;
+
+      // If loop is < 8-dwords, align aggressively to 0 mod 8 dword boundary.
+      // else align to 0 mod 8 dword boundary only if less than 4 dwords of
+      // instructions are available
+      unsigned loopSizeInBytes = 0;
+      for (MachineBasicBlock *MBB : ML->getBlocks())
+        for (MachineInstr &MI : *MBB)
+          loopSizeInBytes += TII->getInstSizeInBytes(MI);
+
+      if (loopSizeInBytes < 32) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      } else if (BlockInfo[Header->getNumber()].Offset % 32 > 16) {
+        Header->setAlignment(llvm::Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      }
+    }
+    return Changed;
+  }
+};
+
+class AMDGPULoopAlignLegacy : public MachineFunctionPass {
+public:
+  static char ID;
+
+  AMDGPULoopAlignLegacy() : MachineFunctionPass(ID) {}
+
+  bool runOnMachineFunction(MachineFunction &MF) override {
+    return AMDGPULoopAlign(getAnalysis<MachineLoopInfoWrapperPass>().getLI())
+        .run(MF);
+  }
+
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<MachineLoopInfoWrapperPass>();
+    MachineFunctionPass::getAnalysisUsage(AU);
+  }
+};
+
+} // namespace
+
+char AMDGPULoopAlignLegacy::ID = 0;
+
+char &llvm::AMDGPULoopAlignLegacyID = AMDGPULoopAlignLegacy::ID;
+
+INITIALIZE_PASS(AMDGPULoopAlignLegacy, DEBUG_TYPE, "AMDGPU Loop Align", false,
+                false)
+
+PreservedAnalyses
+AMDGPULoopAlignPass::run(MachineFunction &MF,
+                         MachineFunctionAnalysisManager &MFAM) {
+  auto &MLI = MFAM.getResult<MachineLoopAnalysis>(MF);
+  if (AMDGPULoopAlign(MLI).run(MF))
+    return PreservedAnalyses::none();
+  return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
new file mode 100644
index 0000000000000..12b9f13926415
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
@@ -0,0 +1,24 @@
+//===--- AMDGPULoopAlign.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+
+#include "llvm/CodeGen/MachinePassManager.h"
+
+namespace llvm {
+
+class AMDGPULoopAlignPass : public PassInfoMixin<AMDGPULoopAlignPass> {
+public:
+  PreservedAnalyses run(MachineFunction &MF,
+                        MachineFunctionAnalysisManager &MFAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index b6c6d927d0e89..35c41d1d73a59 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -117,6 +117,7 @@ MACHINE_FUNCTION_PASS("amdgpu-preload-kern-arg-prolog", AMDGPUPreloadKernArgProl
 MACHINE_FUNCTION_PASS("amdgpu-prepare-agpr-alloc", AMDGPUPrepareAGPRAllocPass())
 MACHINE_FUNCTION_PASS("amdgpu-nsa-reassign", GCNNSAReassignPass())
 MACHINE_FUNCTION_PASS("amdgpu-wait-sgpr-hazards", AMDGPUWaitSGPRHazardsPass())
+MACHINE_FUNCTION_PASS("amdgpu-loop-align", AMDGPULoopAlignPass())
 MACHINE_FUNCTION_PASS("gcn-create-vopd", GCNCreateVOPDPass())
 MACHINE_FUNCTION_PASS("gcn-dpp-combine", GCNDPPCombinePass())
 MACHINE_FUNCTION_PASS("si-fix-sgpr-copies", SIFixSGPRCopiesPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c1f17033d04a8..848968a4da88a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -22,6 +22,7 @@
 #include "AMDGPUExportKernelRuntimeHandles.h"
 #include "AMDGPUIGroupLP.h"
 #include "AMDGPUISelDAGToDAG.h"
+#include "AMDGPULoopAlign.h"
 #include "AMDGPUMacroFusion.h"
 #include "AMDGPUPerfHintAnalysis.h"
 #include "AMDGPUPreloadKernArgProlog.h"
@@ -570,6 +571,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
   initializeGCNRegPressurePrinterPass(*PR);
   initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
   initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+  initializeAMDGPULoopAlignLegacyPass(*PR);
   initializeAMDGPUPreloadKernelArgumentsLegacyPass(*PR);
 }
 
@@ -1764,6 +1766,8 @@ void GCNPassConfig::addPreEmitPass() {
     addPass(&AMDGPUInsertDelayAluID);
 
   addPass(&BranchRelaxationPassID);
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(&AMDGPULoopAlignLegacyID);
 }
 
 void GCNPassConfig::addPostBBSections() {
@@ -2352,6 +2356,8 @@ void AMDGPUCodeGenPassBuilder::addPreEmitPass(AddMachinePass &addPass) const {
   }
 
   addPass(BranchRelaxationPass());
+  if (getOptLevel() > CodeGenOptLevel::Less)
+    addPass(AMDGPULoopAlignPass());
 }
 
 bool AMDGPUCodeGenPassBuilder::isPassEnabled(const cl::opt<bool> &Opt,
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c466f9cf0f359..482b71d910a21 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -70,6 +70,7 @@ add_llvm_target(AMDGPUCodeGen
   AMDGPULibCalls.cpp
   AMDGPUImageIntrinsicOptimizer.cpp
   AMDGPULibFunc.cpp
+  AMDGPULoopAlign.cpp
   AMDGPULowerBufferFatPointers.cpp
   AMDGPULowerKernelArguments.cpp
   AMDGPULowerKernelAttributes.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..1b81022a273a2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
 
 ; Simples case, if - then, that requires lane mask merging,
 ; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 2b595b9bbecc0..5d8cd9af8a7b1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
-; RUN: not llc -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
+; RUN: not llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
 
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f32(uint node_ptr, float ray_extent, float3 ray_origin, float3 ray_dir, float3 ray_inv_dir, uint4 texture_descr)
 ; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
 
 define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
 ; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
 
 declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
 
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
 
 ; if instruction is uniform and there is available instruction, select SALU instruction
 define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
 
 declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
 declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]

@hjagasiaAMD hjagasiaAMD changed the title Add pass to align inner loops. Loops less than 32B are aggressively Add pass to align inner loops for AMDGPU. Aug 6, 2025
@ronlieb ronlieb requested review from arsenm and ronlieb August 6, 2025 18:48
@hjagasiaAMD hjagasiaAMD changed the title Add pass to align inner loops for AMDGPU. [AMDGPU] Add pass to align inner loops. Aug 6, 2025
@hjagasiaAMD hjagasiaAMD changed the title [AMDGPU] Add pass to align inner loops. [AMDGPU] Add pass to align inner loops Aug 6, 2025
Contributor

@arsenm arsenm left a comment


Why does this need to be a custom pass? MachineBlockPlacement can already align loops based on getPrefLoopAlignment.

Before doing anything with manually specified block alignments, we need to add the missing MIR serialization for block alignment.

Comment on lines +76 to +79
// The starting address of all shader programs must be 256 bytes aligned.
// Regular functions just need the basic required instruction alignment.
const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
Contributor

This is already handled when emitting code; you don't need to handle it again.

uint64_t Size = 0;
};

void generateBlockInfo(MachineFunction &MF,
Contributor

I really don't want to add yet another instance of a pass doing its own code-size computations. For the purpose of this pass, you should be able to just set the alignment on the block. getInstSizeInBytes isn't completely accurate either.

@hjagasiaAMD
Author

Will address in MachineBlockPlacement. Closing this PR.

With -stop-after=block-placement

bb.1.bb2 (align 64):
successors: %bb.2(0x04000000), %bb.1(0x7c000000)
liveins: $sgpr0

The block alignment is there, so I don't think there is a problem with MIR serialization of block alignment.

@hjagasiaAMD hjagasiaAMD deleted the amdgpu-loopalign branch August 11, 2025 17:28