[AMDGPU] Add pass to align inner loops #152356
aligned to 32B boundary. Otherwise for larger loops, check if loop starts at an offset >16B and if so align to 32B boundary. This is done to improve performance.
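The heuristic described above can be sketched as a standalone predicate (hypothetical helper name; the actual pass operates on `MachineLoop` and `MachineBasicBlock`):

```cpp
#include <cstdint>

// Sketch of the PR's alignment decision: small innermost loops
// (< 32 bytes) are always realigned to a 32-byte boundary; larger
// loops are realigned only when the header begins more than 16 bytes
// past the previous 32-byte boundary.
bool shouldAlignTo32(uint64_t LoopSizeInBytes, uint64_t HeaderOffset) {
  if (LoopSizeInBytes < 32)
    return true;
  return HeaderOffset % 32 > 16;
}
```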
@llvm/pr-subscribers-backend-amdgpu Author: None (hjagasiaAMD)

Changes: aligned to 32B boundary. Otherwise for larger loops, check if loop starts at an offset >16B and if so align to 32B boundary. This is done to improve performance.

Patch is 174.55 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/152356.diff

85 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.h b/llvm/lib/Target/AMDGPU/AMDGPU.h
index 007b481f84960..a9270eadf1232 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.h
@@ -231,6 +231,9 @@ extern char &AMDGPUPerfHintAnalysisLegacyID;
void initializeGCNRegPressurePrinterPass(PassRegistry &);
extern char &GCNRegPressurePrinterID;
+void initializeAMDGPULoopAlignLegacyPass(PassRegistry &);
+extern char &AMDGPULoopAlignLegacyID;
+
void initializeAMDGPUPreloadKernArgPrologLegacyPass(PassRegistry &);
extern char &AMDGPUPreloadKernArgPrologLegacyID;
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
new file mode 100644
index 0000000000000..409b3e47bf2a8
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.cpp
@@ -0,0 +1,153 @@
+//===----- AMDGPULoopAlign.cpp - Generate loop alignment directives -----===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+// Inspect innermost loops and, if certain conditions are met, align their
+// headers to 32 bytes.
+//===----------------------------------------------------------------------===//
+
+#include "AMDGPULoopAlign.h"
+#include "AMDGPU.h"
+#include "AMDGPUTargetMachine.h"
+#include "GCNSubtarget.h"
+#include "SIMachineFunctionInfo.h"
+#include "llvm/CodeGen/MachineLoopInfo.h"
+using namespace llvm;
+
+#define DEBUG_TYPE "amdgpu-loop-align"
+
+static cl::opt<bool>
+ DisableLoopAlign("disable-amdgpu-loop-align", cl::init(false), cl::Hidden,
+ cl::desc("Disable AMDGPU loop alignment pass"));
+
+namespace {
+
+class AMDGPULoopAlign {
+private:
+ MachineLoopInfo &MLI;
+
+public:
+ AMDGPULoopAlign(MachineLoopInfo &MLI) : MLI(MLI) {}
+
+ struct BasicBlockInfo {
+ // Offset - Distance from the beginning of the function to the beginning
+ // of this basic block.
+ uint64_t Offset = 0;
+ // Size - Size of the basic block in bytes
+ uint64_t Size = 0;
+ };
+
+ void generateBlockInfo(MachineFunction &MF,
+ SmallVectorImpl<BasicBlockInfo> &BlockInfo) {
+ BlockInfo.clear();
+ BlockInfo.resize(MF.getNumBlockIDs());
+ const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+ for (const MachineBasicBlock &MBB : MF) {
+ BlockInfo[MBB.getNumber()].Size = 0;
+ for (const MachineInstr &MI : MBB) {
+ BlockInfo[MBB.getNumber()].Size += TII->getInstSizeInBytes(MI);
+ }
+ }
+    uint64_t PrevNum = MF.begin()->getNumber();
+    for (MachineBasicBlock &MBB :
+         make_range(std::next(MF.begin()), MF.end())) {
+ uint64_t Num = MBB.getNumber();
+ BlockInfo[Num].Offset =
+ BlockInfo[PrevNum].Offset + BlockInfo[PrevNum].Size;
+      unsigned BlockAlignment = MBB.getAlignment().value();
+      unsigned ParentAlign = MBB.getParent()->getAlignment().value();
+      if (BlockAlignment <= ParentAlign)
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlignment);
+      else
+        BlockInfo[Num].Offset = alignTo(BlockInfo[Num].Offset, BlockAlignment) +
+                                BlockAlignment - ParentAlign;
+ PrevNum = Num;
+ }
+ }
+
+ bool run(MachineFunction &MF) {
+ if (DisableLoopAlign)
+ return false;
+
+    // The starting address of all shader programs must be 256-byte aligned.
+ // Regular functions just need the basic required instruction alignment.
+ const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
+ MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
+ if (MF.getAlignment().value() < 32)
+ return false;
+
+ const SIInstrInfo *TII = MF.getSubtarget<GCNSubtarget>().getInstrInfo();
+ SmallVector<BasicBlockInfo, 16> BlockInfo;
+ generateBlockInfo(MF, BlockInfo);
+
+ bool Changed = false;
+ for (MachineLoop *ML : MLI.getLoopsInPreorder()) {
+ // Check if loop is innermost
+ if (!ML->isInnermost())
+ continue;
+ MachineBasicBlock *Header = ML->getHeader();
+      // Skip loops that were already evaluated for prefetch and aligned to 64.
+ if (Header->getAlignment().value() == 64 ||
+ ML->getTopBlock()->getAlignment().value() == 64)
+ continue;
+
+      // If the loop is smaller than 8 dwords (32 bytes), align it to a
+      // 32-byte boundary unconditionally. For larger loops, align only when
+      // the header starts more than 16 bytes into a 32-byte window, i.e.
+      // fewer than 4 dwords of instructions fit before the next boundary.
+      unsigned LoopSizeInBytes = 0;
+      for (MachineBasicBlock *MBB : ML->getBlocks())
+        for (MachineInstr &MI : *MBB)
+          LoopSizeInBytes += TII->getInstSizeInBytes(MI);
+
+      if (LoopSizeInBytes < 32 ||
+          BlockInfo[Header->getNumber()].Offset % 32 > 16) {
+        Header->setAlignment(Align(32));
+        generateBlockInfo(MF, BlockInfo);
+        Changed = true;
+      }
+ }
+ return Changed;
+ }
+};
+
+class AMDGPULoopAlignLegacy : public MachineFunctionPass {
+public:
+ static char ID;
+
+ AMDGPULoopAlignLegacy() : MachineFunctionPass(ID) {}
+
+ bool runOnMachineFunction(MachineFunction &MF) override {
+ return AMDGPULoopAlign(getAnalysis<MachineLoopInfoWrapperPass>().getLI())
+ .run(MF);
+ }
+
+ void getAnalysisUsage(AnalysisUsage &AU) const override {
+ AU.addRequired<MachineLoopInfoWrapperPass>();
+ MachineFunctionPass::getAnalysisUsage(AU);
+ }
+};
+
+} // namespace
+
+char AMDGPULoopAlignLegacy::ID = 0;
+
+char &llvm::AMDGPULoopAlignLegacyID = AMDGPULoopAlignLegacy::ID;
+
+INITIALIZE_PASS(AMDGPULoopAlignLegacy, DEBUG_TYPE, "AMDGPU Loop Align", false,
+ false)
+
+PreservedAnalyses
+AMDGPULoopAlignPass::run(MachineFunction &MF,
+ MachineFunctionAnalysisManager &MFAM) {
+ auto &MLI = MFAM.getResult<MachineLoopAnalysis>(MF);
+ if (AMDGPULoopAlign(MLI).run(MF))
+ return PreservedAnalyses::none();
+ return PreservedAnalyses::all();
+}
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
new file mode 100644
index 0000000000000..12b9f13926415
--- /dev/null
+++ b/llvm/lib/Target/AMDGPU/AMDGPULoopAlign.h
@@ -0,0 +1,24 @@
+//===--- AMDGPULoopAlign.h --------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+#define LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
+
+#include "llvm/CodeGen/MachinePassManager.h"
+
+namespace llvm {
+
+class AMDGPULoopAlignPass : public PassInfoMixin<AMDGPULoopAlignPass> {
+public:
+ PreservedAnalyses run(MachineFunction &MF,
+ MachineFunctionAnalysisManager &MFAM);
+};
+
+} // namespace llvm
+
+#endif // LLVM_LIB_TARGET_AMDGPU_AMDGPULOOPALIGN_H
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
index b6c6d927d0e89..35c41d1d73a59 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def
@@ -117,6 +117,7 @@ MACHINE_FUNCTION_PASS("amdgpu-preload-kern-arg-prolog", AMDGPUPreloadKernArgProl
MACHINE_FUNCTION_PASS("amdgpu-prepare-agpr-alloc", AMDGPUPrepareAGPRAllocPass())
MACHINE_FUNCTION_PASS("amdgpu-nsa-reassign", GCNNSAReassignPass())
MACHINE_FUNCTION_PASS("amdgpu-wait-sgpr-hazards", AMDGPUWaitSGPRHazardsPass())
+MACHINE_FUNCTION_PASS("amdgpu-loop-align", AMDGPULoopAlignPass())
MACHINE_FUNCTION_PASS("gcn-create-vopd", GCNCreateVOPDPass())
MACHINE_FUNCTION_PASS("gcn-dpp-combine", GCNDPPCombinePass())
MACHINE_FUNCTION_PASS("si-fix-sgpr-copies", SIFixSGPRCopiesPass())
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
index c1f17033d04a8..848968a4da88a 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
@@ -22,6 +22,7 @@
#include "AMDGPUExportKernelRuntimeHandles.h"
#include "AMDGPUIGroupLP.h"
#include "AMDGPUISelDAGToDAG.h"
+#include "AMDGPULoopAlign.h"
#include "AMDGPUMacroFusion.h"
#include "AMDGPUPerfHintAnalysis.h"
#include "AMDGPUPreloadKernArgProlog.h"
@@ -570,6 +571,7 @@ extern "C" LLVM_ABI LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeGCNRegPressurePrinterPass(*PR);
initializeAMDGPUPreloadKernArgPrologLegacyPass(*PR);
initializeAMDGPUWaitSGPRHazardsLegacyPass(*PR);
+ initializeAMDGPULoopAlignLegacyPass(*PR);
initializeAMDGPUPreloadKernelArgumentsLegacyPass(*PR);
}
@@ -1764,6 +1766,8 @@ void GCNPassConfig::addPreEmitPass() {
addPass(&AMDGPUInsertDelayAluID);
addPass(&BranchRelaxationPassID);
+ if (getOptLevel() > CodeGenOptLevel::Less)
+ addPass(&AMDGPULoopAlignLegacyID);
}
void GCNPassConfig::addPostBBSections() {
@@ -2352,6 +2356,8 @@ void AMDGPUCodeGenPassBuilder::addPreEmitPass(AddMachinePass &addPass) const {
}
addPass(BranchRelaxationPass());
+ if (getOptLevel() > CodeGenOptLevel::Less)
+ addPass(AMDGPULoopAlignPass());
}
bool AMDGPUCodeGenPassBuilder::isPassEnabled(const cl::opt<bool> &Opt,
diff --git a/llvm/lib/Target/AMDGPU/CMakeLists.txt b/llvm/lib/Target/AMDGPU/CMakeLists.txt
index c466f9cf0f359..482b71d910a21 100644
--- a/llvm/lib/Target/AMDGPU/CMakeLists.txt
+++ b/llvm/lib/Target/AMDGPU/CMakeLists.txt
@@ -70,6 +70,7 @@ add_llvm_target(AMDGPUCodeGen
AMDGPULibCalls.cpp
AMDGPUImageIntrinsicOptimizer.cpp
AMDGPULibFunc.cpp
+ AMDGPULoopAlign.cpp
AMDGPULowerBufferFatPointers.cpp
AMDGPULowerKernelArguments.cpp
AMDGPULowerKernelAttributes.cpp
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
index fd08ab88990ed..1b81022a273a2 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/divergence-structurizer.ll
@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 3
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=GFX10 %s
; Simples case, if - then, that requires lane mask merging,
; %phi lane mask will hold %val_A at %A. Lanes that are active in %B
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
index 2b595b9bbecc0..5d8cd9af8a7b1 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.intersect_ray.ll
@@ -1,8 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
-; RUN: not llc -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1030 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1030 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1013 < %s | FileCheck -check-prefixes=GCN,GFX10,GFX1013 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 < %s | FileCheck -check-prefixes=GCN,GFX11 %s
+; RUN: not llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1012 < %s -o /dev/null 2>&1 | FileCheck -check-prefix=ERR %s
; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f32(uint node_ptr, float ray_extent, float3 ray_origin, float3 ray_dir, float3 ray_inv_dir, uint4 texture_descr)
; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
; if instruction is uniform and there is available instruction, select SALU instruction
define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]
@llvm/pr-subscribers-llvm-globalisel Author: None (hjagasiaAMD)
; uint4 llvm.amdgcn.image.bvh.intersect.ray.i32.v4f16(uint node_ptr, float ray_extent, float3 ray_origin, half3 ray_dir, half3 ray_inv_dir, uint4 texture_descr)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
index 8a53c862371cf..e3842a124985b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.amdgcn.wqm.demote.ll
@@ -1,8 +1,8 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
-; RUN: llc -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=SI %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck -check-prefix=GFX9 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10-32 %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 < %s | FileCheck -check-prefix=GFX10-64 %s
define amdgpu_ps void @static_exact(float %arg0, float %arg1) {
; SI-LABEL: static_exact:
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
index e0016b0a5a64d..ba70628ea1e0b 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memcpy.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=35 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -amdgpu-memcpy-loop-unroll=2 -mem-intrinsic-expand-size=37 %s -o - | FileCheck -check-prefix=UNROLL %s
declare void @llvm.memcpy.p1.p1.i32(ptr addrspace(1), ptr addrspace(1), i32, i1 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
index 04652af147f9b..cf1da9ae2e6b7 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/llvm.memset.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
-; RUN: llc -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=3 %s -o - | FileCheck -check-prefix=LOOP %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-- -mem-intrinsic-expand-size=5 %s -o - | FileCheck -check-prefix=UNROLL %s
declare void @llvm.memset.p1.i32(ptr addrspace(1), i8, i32, i1)
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
index 5240bf4f3a1d7..905b6ec81b98c 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/regbankselect-mui.ll
@@ -1,6 +1,6 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
-; RUN: llc -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 < %s | FileCheck -check-prefix=OLD_RBS %s
+; RUN: llc -disable-amdgpu-loop-align=true -global-isel -mtriple=amdgcn-amd-amdpal -mcpu=gfx1010 -new-reg-bank-select < %s | FileCheck -check-prefix=NEW_RBS %s
; if instruction is uniform and there is available instruction, select SALU instruction
define amdgpu_ps void @uniform_in_vgpr(float inreg %a, i32 inreg %b, ptr addrspace(1) %ptr) {
diff --git a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
index e03c9ca34b825..6dc12caf234de 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic-optimizer-strict-wqm.ll
@@ -1,5 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
+; RUN: llc -disable-amdgpu-loop-align=true -mtriple=amdgcn -mcpu=gfx1010 < %s | FileCheck %s -check-prefixes=GFX10
declare void @llvm.amdgcn.exp.f32(i32 immarg, i32 immarg, float, float, float, float, i1 immarg, i1 immarg)
declare i32 @llvm.amdgcn.raw.ptr.buffer.atomic.and.i32(i32, ptr addrspace(8), i32, i32, i32 immarg)
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 4cc39d93854a0..db4ccd2fd44cb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -1,30 +1,30 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
-; RUN: llc -mtriple=amdgcn -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX7LESS,GFX7LESS_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=tonga -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX8,GFX8_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX9,GFX9_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1064,GFX1064_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1032,GFX1032_ITERATIVE %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-TRUE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize64 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1164,GFX1164-FAKE16,GFX1164_ITERATIVE,GFX1164_ITERATIVE-FAKE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=+real-true16 -mattr=+wavefrontsize32 -mattr=-flat-for-global -amdgpu-atomic-optimizer-strategy=Iterative < %s | FileCheck -enable-var-scope -check-prefixes=GFX1132,GFX1132-TRUE16,GFX1132_ITERATIVE,GFX1132_ITERATIVE-TRUE16 %s
-; RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -mattr=-real-true16 -mattr=+wavefrontsize32 -mattr=-fla...
[truncated]
Why does this need to be a custom pass? MachineBlockPlacement can already align loops depending on getPrefLoopAlignment?
Before doing anything with manually specified block alignments, we need to add the missing MIR serialization for the block alignment.
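For context on the reviewer's suggestion: MachineBlockPlacement queries the target's preferred loop alignment, and the emitter then pads the loop header out to that boundary. A rough standalone model of that padding cost (a simplified illustration only, not LLVM's actual code — the function names here are made up):

```cpp
#include <cassert>
#include <cstdint>

// Round Offset up to the next multiple of Alignment. Alignment is a power of
// two in practice; this version works for any positive Alignment.
uint64_t alignTo(uint64_t Offset, uint64_t Alignment) {
  return (Offset + Alignment - 1) / Alignment * Alignment;
}

// Number of padding (nop) bytes inserted before a loop header that starts at
// LoopHeaderOffset when the preferred loop alignment is PrefLoopAlign.
uint64_t paddingFor(uint64_t LoopHeaderOffset, uint64_t PrefLoopAlign) {
  return alignTo(LoopHeaderOffset, PrefLoopAlign) - LoopHeaderOffset;
}
```

For example, a loop header at byte offset 68 needs 28 bytes of padding to reach the next 32-byte boundary, while a header already at offset 96 needs none.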
// The starting address of all shader programs must be 256 bytes aligned.
// Regular functions just need the basic required instruction alignment.
const AMDGPUMachineFunction *MFI = MF.getInfo<AMDGPUMachineFunction>();
MF.setAlignment(MFI->isEntryFunction() ? Align(256) : Align(4));
This is already handled when emitting code, you don't need to handle it again
uint64_t Size = 0;
};

void generateBlockInfo(MachineFunction &MF,
I really don't want to add yet another instance of a pass doing its own code size computations. For the purpose of this pass you should be able to just set the alignment on the block; getInstSizeInBytes isn't completely accurate either.
Will address in MachineBlockPlacement. Closing this PR. With -stop-after=block-placement, the block is now printed with its alignment:

bb.1.bb2 (align 64):
Loops smaller than 32B are aggressively aligned to a 32B boundary. For larger loops, check whether the loop starts at an offset greater than 16B and, if so, align it to a 32B boundary. This is done to improve performance.
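The heuristic described above can be sketched as a standalone function (the function and parameter names are illustrative, not the pass's actual API; the real pass operates on MachineBasicBlocks and sets their alignment):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the alignment decision: loops smaller than 32 bytes are always
// aligned to a 32-byte boundary; larger loops are aligned only when their
// header would otherwise start more than 16 bytes past a 32-byte boundary.
// Returns the alignment in bytes to apply to the loop header block.
unsigned preferredLoopAlignment(uint64_t LoopSizeInBytes,
                                uint64_t HeaderOffsetInBytes) {
  const uint64_t Boundary = 32;
  if (LoopSizeInBytes < Boundary)
    return Boundary;                  // small loop: align aggressively
  if (HeaderOffsetInBytes % Boundary > 16)
    return Boundary;                  // large loop starting deep into a line
  return 4;                           // keep the default 4-byte alignment
}
```

A loop of 64 bytes whose header falls at offset 20 (20 mod 32 = 20 > 16) would be aligned, while the same loop starting at offset 8 would be left alone.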