[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU #127212
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: None (mihajlovicana)

Changes: Patch is 196.78 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/127212.diff

51 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp b/llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp
index 3f2bb5df8836b..7eb608fc93e63 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUInsertDelayAlu.cpp
@@ -371,7 +371,10 @@ class AMDGPUInsertDelayAlu : public MachineFunctionPass {
for (MCRegUnit Unit : TRI->regunits(Op.getReg())) {
auto It = State.find(Unit);
if (It != State.end()) {
- Delay.merge(It->second);
+ if (!(SII->isSALU(MI.getOpcode())) ||
+ !AMDGPU::isSGPR(Op.getReg(), TRI) ||
+ It->second.VALUCycles == 0)
+ Delay.merge(It->second);
State.erase(Unit);
}
}
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
index cd405fabf002d..4b68f8a4bd194 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_buffer.ll
@@ -777,7 +777,6 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX11W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX11W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX11W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX11W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W64-NEXT: s_add_i32 s2, s2, s8
; GFX11W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX11W64-NEXT: s_cbranch_scc1 .LBB2_1
@@ -822,7 +821,6 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX11W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX11W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX11W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX11W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W32-NEXT: s_add_i32 s0, s0, s3
; GFX11W32-NEXT: s_cmp_lg_u32 s1, 0
; GFX11W32-NEXT: s_cbranch_scc1 .LBB2_1
@@ -864,7 +862,6 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX12W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX12W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX12W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX12W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W64-NEXT: s_add_co_i32 s2, s2, s8
; GFX12W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX12W64-NEXT: s_cbranch_scc1 .LBB2_1
@@ -910,7 +907,6 @@ define amdgpu_kernel void @add_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX12W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX12W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX12W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX12W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W32-NEXT: s_add_co_i32 s0, s0, s3
; GFX12W32-NEXT: s_wait_alu 0xfffe
; GFX12W32-NEXT: s_cmp_lg_u32 s1, 0
@@ -1178,7 +1174,6 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
; GFX11W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX11W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX11W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX11W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W64-NEXT: s_add_i32 s2, s2, s8
; GFX11W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX11W64-NEXT: s_cbranch_scc1 .LBB3_1
@@ -1226,7 +1221,6 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
; GFX11W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX11W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX11W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX11W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W32-NEXT: s_add_i32 s0, s0, s3
; GFX11W32-NEXT: s_cmp_lg_u32 s1, 0
; GFX11W32-NEXT: s_cbranch_scc1 .LBB3_1
@@ -1270,7 +1264,6 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
; GFX12W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX12W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX12W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX12W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W64-NEXT: s_add_co_i32 s2, s2, s8
; GFX12W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX12W64-NEXT: s_cbranch_scc1 .LBB3_1
@@ -1319,7 +1312,6 @@ define amdgpu_kernel void @struct_add_i32_varying_vdata(ptr addrspace(1) %out, p
; GFX12W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX12W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX12W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX12W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W32-NEXT: s_add_co_i32 s0, s0, s3
; GFX12W32-NEXT: s_wait_alu 0xfffe
; GFX12W32-NEXT: s_cmp_lg_u32 s1, 0
@@ -2246,7 +2238,6 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX11W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX11W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX11W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX11W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W64-NEXT: s_add_i32 s2, s2, s8
; GFX11W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX11W64-NEXT: s_cbranch_scc1 .LBB7_1
@@ -2291,7 +2282,6 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX11W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX11W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX11W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX11W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX11W32-NEXT: s_add_i32 s0, s0, s3
; GFX11W32-NEXT: s_cmp_lg_u32 s1, 0
; GFX11W32-NEXT: s_cbranch_scc1 .LBB7_1
@@ -2334,7 +2324,6 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX12W64-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX12W64-NEXT: v_writelane_b32 v0, s2, s3
; GFX12W64-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX12W64-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W64-NEXT: s_add_co_i32 s2, s2, s8
; GFX12W64-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX12W64-NEXT: s_cbranch_scc1 .LBB7_1
@@ -2380,7 +2369,6 @@ define amdgpu_kernel void @sub_i32_varying_vdata(ptr addrspace(1) %out, ptr addr
; GFX12W32-NEXT: s_lshl_b32 s6, 1, s2
; GFX12W32-NEXT: v_writelane_b32 v0, s0, s2
; GFX12W32-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX12W32-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX12W32-NEXT: s_add_co_i32 s0, s0, s3
; GFX12W32-NEXT: s_wait_alu 0xfffe
; GFX12W32-NEXT: s_cmp_lg_u32 s1, 0
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
index 8bb8ecb079a34..2bcce6c04c0bb 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll
@@ -899,7 +899,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s7
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s7
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164_ITERATIVE-NEXT: s_add_i32 s6, s6, s8
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB2_1
@@ -950,7 +949,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1132_ITERATIVE-NEXT: s_and_not1_b32 s0, s0, s3
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1132_ITERATIVE-NEXT: s_add_i32 s6, s6, s2
; GFX1132_ITERATIVE-NEXT: s_cmp_lg_u32 s0, 0
; GFX1132_ITERATIVE-NEXT: s_cbranch_scc1 .LBB2_1
@@ -999,7 +997,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1264_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s7
; GFX1264_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s7
; GFX1264_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1264_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1264_ITERATIVE-NEXT: s_add_co_i32 s6, s6, s8
; GFX1264_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1264_ITERATIVE-NEXT: s_cbranch_scc1 .LBB2_1
@@ -1049,7 +1046,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1232_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s1
; GFX1232_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1232_ITERATIVE-NEXT: s_and_not1_b32 s0, s0, s3
-; GFX1232_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1232_ITERATIVE-NEXT: s_add_co_i32 s6, s6, s2
; GFX1232_ITERATIVE-NEXT: s_wait_alu 0xfffe
; GFX1232_ITERATIVE-NEXT: s_cmp_lg_u32 s0, 0
@@ -2576,17 +2572,16 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1164_ITERATIVE-NEXT: .LBB5_1: ; %ComputeLoop
; GFX1164_ITERATIVE-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1164_ITERATIVE-NEXT: s_ctz_i32_b64 s2, s[0:1]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s3, v2, s2
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s8, v3, s2
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s2
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v1, s7, s2
; GFX1164_ITERATIVE-NEXT: s_add_u32 s6, s6, s3
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_addc_u32 s7, s7, s8
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s2
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB5_1
; GFX1164_ITERATIVE-NEXT: ; %bb.2: ; %ComputeEnd
@@ -2639,7 +2634,6 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s3, v3, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v1, s7, s1
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX1132_ITERATIVE-NEXT: s_add_u32 s6, s6, s2
; GFX1132_ITERATIVE-NEXT: s_addc_u32 s7, s7, s3
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s1, 1, s1
@@ -4454,7 +4448,6 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s7
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s7
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164_ITERATIVE-NEXT: s_add_i32 s6, s6, s8
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB8_1
@@ -4505,7 +4498,6 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1132_ITERATIVE-NEXT: s_and_not1_b32 s0, s0, s3
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1132_ITERATIVE-NEXT: s_add_i32 s6, s6, s2
; GFX1132_ITERATIVE-NEXT: s_cmp_lg_u32 s0, 0
; GFX1132_ITERATIVE-NEXT: s_cbranch_scc1 .LBB8_1
@@ -4554,7 +4546,6 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1264_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s7
; GFX1264_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s7
; GFX1264_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1264_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1264_ITERATIVE-NEXT: s_add_co_i32 s6, s6, s8
; GFX1264_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1264_ITERATIVE-NEXT: s_cbranch_scc1 .LBB8_1
@@ -4604,7 +4595,6 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1232_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s1
; GFX1232_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1232_ITERATIVE-NEXT: s_and_not1_b32 s0, s0, s3
-; GFX1232_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1232_ITERATIVE-NEXT: s_add_co_i32 s6, s6, s2
; GFX1232_ITERATIVE-NEXT: s_wait_alu 0xfffe
; GFX1232_ITERATIVE-NEXT: s_cmp_lg_u32 s0, 0
@@ -6164,17 +6154,16 @@ define amdgpu_kernel void @sub_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1164_ITERATIVE-NEXT: .LBB11_1: ; %ComputeLoop
; GFX1164_ITERATIVE-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1164_ITERATIVE-NEXT: s_ctz_i32_b64 s2, s[0:1]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s3, v2, s2
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s8, v3, s2
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s2
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v1, s7, s2
; GFX1164_ITERATIVE-NEXT: s_add_u32 s6, s6, s3
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_addc_u32 s7, s7, s8
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[2:3], 1, s2
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB11_1
; GFX1164_ITERATIVE-NEXT: ; %bb.2: ; %ComputeEnd
@@ -6227,7 +6216,6 @@ define amdgpu_kernel void @sub_i64_varying(ptr addrspace(1) %out, ptr addrspace(
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s3, v3, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s6, s1
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v1, s7, s1
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX1132_ITERATIVE-NEXT: s_add_u32 s6, s6, s2
; GFX1132_ITERATIVE-NEXT: s_addc_u32 s7, s7, s3
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s1, 1, s1
diff --git a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
index 3c0646c46efd0..eb5353e928682 100644
--- a/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll
@@ -669,7 +669,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out) {
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s2, s3
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164_ITERATIVE-NEXT: s_add_i32 s2, s2, s8
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB2_1
@@ -715,7 +714,6 @@ define amdgpu_kernel void @add_i32_varying(ptr addrspace(1) %out) {
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s6, 1, s2
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s0, s2
; GFX1132_ITERATIVE-NEXT: s_and_not1_b32 s1, s1, s6
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1132_ITERATIVE-NEXT: s_add_i32 s0, s0, s3
; GFX1132_ITERATIVE-NEXT: s_cmp_lg_u32 s1, 0
; GFX1132_ITERATIVE-NEXT: s_cbranch_scc1 .LBB2_1
@@ -1215,7 +1213,7 @@ define amdgpu_kernel void @add_i32_varying_nouse() {
; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s6, v0, s3
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[4:5], 1, s3
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[4:5]
; GFX1164_ITERATIVE-NEXT: s_add_i32 s2, s2, s6
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
@@ -1248,7 +1246,7 @@ define amdgpu_kernel void @add_i32_varying_nouse() {
; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s3, v0, s2
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s2, 1, s2
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(VALU_DEP_1)
+; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1132_ITERATIVE-NEXT: s_and_not1_b32 s1, s1, s2
; GFX1132_ITERATIVE-NEXT: s_add_i32 s0, s0, s3
; GFX1132_ITERATIVE-NEXT: s_cmp_lg_u32 s1, 0
@@ -2217,17 +2215,16 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out) {
; GFX1164_ITERATIVE-NEXT: .LBB6_1: ; %ComputeLoop
; GFX1164_ITERATIVE-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1164_ITERATIVE-NEXT: s_ctz_i32_b64 s6, s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_3) | instid1(VALU_DEP_4)
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s7, v2, s6
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s8, v3, s6
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s0, s6
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v1, s1, s6
; GFX1164_ITERATIVE-NEXT: s_add_u32 s0, s0, s7
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_addc_u32 s1, s1, s8
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[6:7], 1, s6
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(NEXT) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[2:3], s[2:3], s[6:7]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[2:3], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB6_1
; GFX1164_ITERATIVE-NEXT: ; %bb.2: ; %ComputeEnd
@@ -2275,7 +2272,6 @@ define amdgpu_kernel void @add_i64_varying(ptr addrspace(1) %out) {
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s7, v3, s3
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v0, s0, s3
; GFX1132_ITERATIVE-NEXT: v_writelane_b32 v1, s1, s3
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_4) | instskip(NEXT) | instid1(VALU_DEP_3)
; GFX1132_ITERATIVE-NEXT: s_add_u32 s0, s0, s6
; GFX1132_ITERATIVE-NEXT: s_addc_u32 s1, s1, s7
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s3
@@ -3019,11 +3015,10 @@ define amdgpu_kernel void @add_i64_varying_nouse() {
; GFX1164_ITERATIVE-NEXT: .LBB7_1: ; %ComputeLoop
; GFX1164_ITERATIVE-NEXT: ; =>This Inner Loop Header: Depth=1
; GFX1164_ITERATIVE-NEXT: s_ctz_i32_b64 s4, s[2:3]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_1) | instid1(VALU_DEP_2)
+; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(SALU_CYCLE_1) | instskip(SKIP_4) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s5, v0, s4
; GFX1164_ITERATIVE-NEXT: v_readlane_b32 s6, v1, s4
; GFX1164_ITERATIVE-NEXT: s_add_u32 s0, s0, s5
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(SALU_CYCLE_1)
; GFX1164_ITERATIVE-NEXT: s_addc_u32 s1, s1, s6
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[4:5], 1, s4
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[2:3], s[2:3], s[4:5]
@@ -3059,7 +3054,6 @@ define amdgpu_kernel void @add_i64_varying_nouse() {
; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s4, v0, s3
; GFX1132_ITERATIVE-NEXT: v_readlane_b32 s5, v1, s3
-; GFX1132_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX1132_ITERATIVE-NEXT: s_add_u32 s0, s0, s4
; GFX1132_ITERATIVE-NEXT: s_addc_u32 s1, s1, s5
; GFX1132_ITERATIVE-NEXT: s_lshl_b32 s3, 1, s3
@@ -4091,7 +4085,6 @@ define amdgpu_kernel void @sub_i32_varying(ptr addrspace(1) %out) {
; GFX1164_ITERATIVE-NEXT: s_lshl_b64 s[6:7], 1, s3
; GFX1164_ITERATIVE-NEXT: v_writelane_b32 v0, s2, s3
; GFX1164_ITERATIVE-NEXT: s_and_not1_b64 s[0:1], s[0:1], s[6:7]
-; GFX1164_ITERATIVE-NEXT: s_delay_alu instid0(VALU_DEP_2)
; GFX1164_ITERATIVE-NEXT: s_add_i32 s2, s2, s8
; GFX1164_ITERATIVE-NEXT: s_cmp_lg_u64 s[0:1], 0
; GFX1164_ITERATIVE-NEXT: s_cbranch_scc1 .LBB10_1
@@ -4137,7 +4130,6 @@ define amdg...
[truncated]
Please provide some justification!
@@ -428,7 +428,6 @@ define amdgpu_kernel void @divergent_value(ptr addrspace(1) %out, i32 %in) {
; GFX1164DAGISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) | instid1(SALU_CYCLE_1)
; GFX1164DAGISEL-NEXT: v_readlane_b32 s6, v0, s5
; GFX1164DAGISEL-NEXT: s_bitset0_b64 s[2:3], s5
; GFX1164DAGISEL-NEXT: s_delay_alu instid0(VALU_DEP_1) |
Not sure if this patch should be applied to v_readlane instructions
Yes, v_readlane behaves like v_cmp in this regard, so this change is good.
Thanks, this is a good first cut! However, I suspect that we can do better.
Merely skipping this merge doesn't exploit all the information we have. Do we have a test like this (probably best as a .mir test):
v_mul_f32 v1, v1, v1
v_cmp s0, ...
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1 ; no delay alu needed
In this case, no delay ALU is needed because the automatic wait for the v_cmp implies that the first v_mul is also done.
On the other hand:
v_cmp s0, ...
v_mul_f32 v1, v1, v1
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1 ; delay alu needed here
Here, the automatic wait only waits for the SGPR write and not for the first v_mul, so we still want a delay_alu.
On the third hand:
v_cmp s0, ...
v_mul_f32 v1, v1, v1
v_cmp s2, ...
s_or_b32 s0, s0, s1
v_mul_f32 v1, v1, v1 ; delay alu NOT needed here
In this case, even though the S_OR only depends on the first v_cmp, it waits for all SGPR writes to complete, including the one to s2. So it implicitly waits for completion of the first v_mul, and so we don't need a delay_alu here.
So there are a bunch of additional cases to consider. I suggest you look into writing .mir test cases for them that only run the insert delay ALU pass (and check if perhaps similar tests already exist), and take another look at the DelayState and DelayInfo data structures to see how we can handle these cases best.
Yes, v_readlane behaves like v_cmp in this regard, so this change is good.
Right, good point! That is similar to the way that certain instructions wait for VA_VDST==0, which is handled here:
So I think the SALU wait for VA_SDST==0 should be handled in a similar way.
New handwritten MIR tests should go in |
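A minimal sketch of what such a standalone test could look like, assuming the pass can be run by itself as amdgpu-insert-delay-alu; the run line, function name, and CHECK lines here are illustrative placeholders rather than part of the patch:

# RUN: llc -mtriple=amdgcn -mcpu=gfx1100 -run-pass=amdgpu-insert-delay-alu %s -o - | FileCheck %s
---
name: valu_sgpr_salu_no_delay
body: |
  bb.0:
    ; The S_OR_B32 read of $sgpr0 stalls until VA_SDST==0, which also covers the
    ; first V_MUL, so ideally no S_DELAY_ALU is emitted before the last V_MUL.
    ; CHECK-LABEL: name: valu_sgpr_salu_no_delay
    ; CHECK-NOT: S_DELAY_ALU
    $vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
    $sgpr0 = V_CMP_EQ_U32_e64 3, $sgpr2, implicit $exec
    $sgpr0 = S_OR_B32 $sgpr0, $sgpr1, implicit-def $scc
    $vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
...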
$vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
...

# Check if s_delay_alu is added |
State after $sgpr0 = V_CMP_EQ_U32_e64 3, $sgpr2, implicit $exec
SGPR0_LO16 VALUCycles=4 VALUNum=1
SGPR0_HI16 VALUCycles=4 VALUNum=1
State after $vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
SGPR0_LO16 VALUCycles=3 VALUNum=2
SGPR0_HI16 VALUCycles=3 VALUNum=2
VGPR0_LO16 VALUCycles=4 VALUNum=1
VGPR0_HI16 VALUCycles=4 VALUNum=1
State after $sgpr0 = S_OR_B32 $sgpr0, $sgpr1, implicit-def $scc
SGPR0_LO16 SALUCycles=1
SGPR0_HI16 SALUCycles=1
State after $vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
VGPR0_LO16 VALUCycles=4 VALUNum=1
VGPR0_HI16 VALUCycles=4 VALUNum=1
I am confused about why we need delay_alu here. When I look at the state after each instruction, it says that the second sgpr write waits for 3 cycles, reducing the first vgpr write's cycles by 3 and leaving it at 1. After the second sgpr write is issued, the cycles for the vgpr write drop to 0, removing it from the map.
$vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
...

# Check if reduntant delay_alu is removed |
Typo "redundant"
$vgpr0 = V_MUL_F32_e64 0, $vgpr0, 0, $vgpr0, 0, 0, implicit $mode, implicit $exec
...

# Check if reduntant delay_alu is removed |
Typo "redundant"
# Check if reduntant delay_alu is removed
---
name: perserved_delay |
Typo "preserved"
(State.find(longestWait) == State.end())
? std::max(deletedCyclesNum, (unsigned)0)
: std::max(State[longestWait].VALUCycles,
State[longestWait].SALUCycles);
lastWrite = |
Avoid repeated lookups in State
Please still address the issue of repeated lookups in State.
✅ With the latest revision this PR passed the C/C++ code formatter.
iterator Next;
for (auto I = begin(), E = end(); I != E; I = Next) {
Next = std::next(I);
if (I->second.VALUNum >= VALUNum && I->second.advance(Type, Cycles))
if (I->second.VALUNum >= SGPRWriteVALUNum && I->second.VALUCycles > 0){ |
advance increments VALUNum, and we don't want to do that here because it would result in a bad VALU_DEP in some cases.
Should I save I->second in a variable before the if?
ping
@@ -2145,12 +2145,11 @@ define amdgpu_kernel void @add_i64_uniform(ptr addrspace(1) %out, ptr addrspace(
; GFX1164-NEXT: s_waitcnt lgkmcnt(0)
; GFX1164-NEXT: v_readfirstlane_b32 s3, v1
; GFX1164-NEXT: v_readfirstlane_b32 s2, v0
; GFX1164-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_2) | instid1(VALU_DEP_2)
; GFX1164-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_3) | instid1(VALU_DEP_1)
; GFX1164-NEXT: v_mad_u64_u32 v[0:1], null, s4, v2, s[2:3]
; GFX1164-NEXT: s_mov_b32 s3, 0x31016000 |
I am not sure whether this s_mov waits for VA_SDST, because it has a literal operand.
SIInstrFlags::MIMG | SIInstrFlags::VIMAGE |
SIInstrFlags::VSAMPLE;

const uint64_t VA_SDST_0 = SIInstrFlags::SALU | SIInstrFlags::SMRD; |
Other instructions are already covered by the instructionWaitsForVALU function.
}
if (AMDGPU::isSGPR(Reg, TRI)) {
lastSGPRfromVALU = *(TRI->regunits(Reg).begin());
break; |
eliminating the outer loop by only taking the first unit of the SGPR operand
@@ -340,6 +366,7 @@ class AMDGPUInsertDelayAlu : public MachineFunctionPass {
bool Changed = false;
MachineInstr *LastDelayAlu = nullptr;

MCRegUnit lastSGPRfromVALU = 0; |
Capitalize variable names:
- MCRegUnit lastSGPRfromVALU = 0;
+ MCRegUnit LastSGPRfromVALU = 0;
The F should be capitalized as well.
if (MI.getDesc().TSFlags & VA_SDST_0)
return true;

return false; |
- if (MI.getDesc().TSFlags & VA_SDST_0)
-   return true;
- return false;
+ return MI.getDesc().TSFlags & VA_SDST_0;
for (const auto &Op : MI.defs()) {
Register Reg = Op.getReg();
if (AMDGPU::isSGPR(Reg, TRI)) {
lastSGPRfromVALU = *(TRI->regunits(Reg).begin()); |
- lastSGPRfromVALU = *(TRI->regunits(Reg).begin());
+ lastSGPRfromVALU = TRI->regunits(Reg).front();
Or use [0]
I tried doing that but it doesn't compile: TRI->regunits(Reg) is a range iterator and doesn't have front(), and I can't use [0] either since regunits is not random access.
- lastSGPRfromVALU = *(TRI->regunits(Reg).begin());
+ lastSGPRfromVALU = *MCRegUnitIterator(Reg, TRI);
Then you can just construct the iterator which will figure out the first element. That way you avoid make_range + begin.
Clearly there are many ways to do this :)
Quite frankly, I think it's fine as is. Though the parentheses are redundant, and it's more common in LLVM style to avoid redundant parentheses.
#include "SIDefines.h" | ||
#include "SIInstrInfo.h" | ||
#include "Utils/AMDGPUBaseInfo.h" | ||
#include "llvm/ADT/SetVector.h" | ||
#include "llvm/CodeGen/MachineInstr.h" | ||
#include "llvm/MC/MCRegister.h" | ||
#include "llvm/Support/ErrorHandling.h" |
New #includes are either included through other ones or unnecessary.
void advanceByNum(DelayType Type, unsigned Cycles,
unsigned SGPRWriteVALUNum) {
iterator Next;
for (auto I = begin(), E = end(); I != E; I = Next) {
Next = std::next(I);
if (I->second.VALUNum >= SGPRWriteVALUNum && I->second.VALUCycles > 0) {
erase(I);
}
}
} |
Type and Cycles aren't actually used, which makes this method misleading.
You could rename it to advanceByVALUNum and only pass the last argument (renaming it to just VALUNum, since the method itself doesn't care why you're calling it).
I'm not sure what longestWait does; and lastSgprWrite should really be lastSgprWriteFromVALU, right?
}
}
}
}
unsigned maxCycles = 0; |
Add an empty line before this to group the code a bit better.
}
}
}
}
unsigned maxCycles = 0;
unsigned lastWrite = 0;
if (Type != OTHER) { |
Instead of the redundant Type != OTHER check, you can move this code into the if above.
bool VALUSALUStall = false;
MCRegUnit lastSgprWrite = 0;
MCRegUnit longestWait = 0;
unsigned deletedCyclesNum = 0; |
Reduce the scope of variables as much as possible.
For example, VALUSALUStall doesn't need to be defined outside the loop.
MCRegUnit lastSgprWrite = 0;
MCRegUnit longestWait = 0;
unsigned deletedCyclesNum = 0; |
In LLVM style, variable names start with an upper-case letter:
- MCRegUnit lastSgprWrite = 0;
- MCRegUnit longestWait = 0;
- unsigned deletedCyclesNum = 0;
+ MCRegUnit LastSgprWrite = 0;
+ MCRegUnit LongestWait = 0;
+ unsigned DeletedCyclesNum = 0;
PING
Thanks. One nit, rest LGTM.
#include "SIInstrInfo.h" | ||
#include "llvm/ADT/SetVector.h" | ||
|
Keep an empty line between #includes and the rest.
We have a VALU->SGPR->SALU sequence (a VALU instruction writing to an SGPR and an SALU instruction reading from it). When the VALU instruction is issued, it increments the internal counter VA_SDST, which is used to track the outstanding write to this SGPR. The SALU instruction will not issue until VA_SDST is zero, that is, until the VALU has finished writing. Therefore, the delays added by s_delay_alu are not needed in this situation.
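As a rough illustration of the pattern (the registers and exact opcodes below are made up for the example, not taken from a specific test):

v_cmp_eq_u32_e64 s0, v0, v1        ; VALU writes SGPR s0, incrementing VA_SDST
s_delay_alu instid0(VALU_DEP_1)    ; the delay previously inserted here is redundant
s_or_b32 s0, s0, s1                ; SALU read of s0 already stalls until VA_SDST == 0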