Skip to content

[LLVM][Intrinsics] Reduce stack size for Intrinsic::getAttributes #152219

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 6, 2025

Conversation

jurahul
Copy link
Contributor

@jurahul jurahul commented Aug 5, 2025

This change fixes a stack size regression that got introduced in 0de0354. That change did 2 independent things:

  1. Uniquify argument and function attributes separately so that we generate a smaller number of unique sets as opposed to uniquifying them together. This is beneficial for code size.
  2. Eliminate the fixed size array AS and NumAttrs variable and instead build the returned AttribteList in each case using an initializer list.

The second part seems to have caused a regression in the stack size usage of this function for Windows. This change essentially undoes part 2 and reinstates the use of the fixed size array AS which fixes this stack size regression. The actual measured stack frame size for this function before/after this change is as follows:

  Current trunk data for release build (x86_64 builds for Linux, x86 build for Windows):
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  DLLVM_ENABLE_ASSERTIONS=OFF           0x120           0x110             0x54B0   
  DLLVM_ENABLE_ASSERTIONS=ON            0x2880          0x110             0x5250

  After applying the fix:
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  DLLVM_ENABLE_ASSERTIONS=OFF           0x120           0x118             0x1240h                                               
  DLLVM_ENABLE_ASSERTIONS=ON            0x120           0x118             0x1240h  

Note that for Windows builds with assertions disabled, the stack frame size for this function reduces from 21680 to 4672 which is a 4.6x reduction. Stack frame size for GCC build with assertions also improved and clang builds are unimpacted. The speculation is that clang and gcc is able to reuse the stack space across these switch cases better with existing code, but MSVC is not, and re-introducing the AS variable forces all cases to use the same local variable, addressing the stack space regression.

@jurahul
Copy link
Contributor Author

jurahul commented Aug 5, 2025

This PR addresses a stack size regression that we see in some graphics workloads internally at NVIDIA which seems to be particularly sensitive to stack size. Thanks @Nadharm for triaging this.

@jurahul jurahul requested review from arsenm and nikic August 6, 2025 00:13
@jurahul jurahul marked this pull request as ready for review August 6, 2025 00:13
@llvmbot
Copy link
Member

llvmbot commented Aug 6, 2025

@llvm/pr-subscribers-tablegen

Author: Rahul Joshi (jurahul)

Changes

This change fixes a stack size regression that got introduced in 0de0354. That change did 2 independent things:

  1. Uniquify argument and function attributes separately so that we generate a smaller number of unique sets as opposed to uniquifying them together. This is beneficial for code size.
  2. Eliminate the fixed size array AS and NumAttrs variable and instead build the returned AttribteList in each case using an initializer list.

The second part seems to have caused a regression in the stack size usage of this function for Windows. This change essentially undoes part 2 and reinstates the use of the fixed size array AS which fixes this stack size regression. The actual measured stack frame size for this function before/after this change is as follows:

  Current trunk data for release build (x86_64 builds for Linux, x86 build for Windows):
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  DLLVM_ENABLE_ASSERTIONS=OFF           0x120           0x110             0x54B0   
  DLLVM_ENABLE_ASSERTIONS=ON            0x2880          0x110             0x5250

  After applying the fix:
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  DLLVM_ENABLE_ASSERTIONS=OFF           0x120           0x118             0x1240h                                               
  DLLVM_ENABLE_ASSERTIONS=ON            0x120           0x118             0x1240h  

Note that for Windows builds with assertions disabled, the stack frame size for this function reduces from 21680 to 4672 which is a 4.6x reduction. Stack frame size for GCC build with assertions also improved and clang builds are unimpacted. The speculation is that clang and gcc is able to reuse the stack space across these switch cases better with existing code, but MSVC is not, and re-introducing the AS variable forces all cases to use the same local variable, addressing the stack space regression.


Full diff: https://github.com/llvm/llvm-project/pull/152219.diff

2 Files Affected:

  • (modified) llvm/test/TableGen/intrinsic-attrs.td (+7-8)
  • (modified) llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp (+30-22)
diff --git a/llvm/test/TableGen/intrinsic-attrs.td b/llvm/test/TableGen/intrinsic-attrs.td
index 18309d7419994..92a90dcafd48c 100644
--- a/llvm/test/TableGen/intrinsic-attrs.td
+++ b/llvm/test/TableGen/intrinsic-attrs.td
@@ -30,12 +30,11 @@ def int_deref_ptr_ret : Intrinsic<[llvm_ptr_ty], [], [Dereferenceable<RetIndex,
 
 // CHECK: getAttributes(LLVMContext &C, ID id,
 // CHECK-NEXT: FunctionType *FT) {
-// CHECK: case 1:
-// CHECK-NEXT: return AttributeList::get(C, {
-// CHECK-NEXT:   {AttributeList::FunctionIndex, getIntrinsicFnAttributeSet(C, FnAttrID)}
-// CHECK-NEXT: });
+// CHECK:      case 1:
+// CHECK-NEXT:   HasFnAttr = true;
+// CHECK-NEXT:   break;
 // CHECK-NEXT: case 0:
-// CHECK-NEXT: return AttributeList::get(C, {
-// CHECK-NEXT:   {0, getIntrinsicArgAttributeSet(C, 0, FT->getContainedType(0))},
-// CHECK-NEXT:   {AttributeList::FunctionIndex, getIntrinsicFnAttributeSet(C, FnAttrID)}
-// CHECK-NEXT: });
+// CHECK-NEXT:   AS[0] = {0, getIntrinsicArgAttributeSet(C, 0, FT->getContainedType(0))};
+// CHECK-NEXT:   HasFnAttr = true;
+// CHECK-NEXT:   NumAttrs = 1
+// CHECK-NEXT:   break;
diff --git a/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp b/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
index ac5c455ed63ce..49f719453e0bb 100644
--- a/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
+++ b/llvm/utils/TableGen/Basic/IntrinsicEmitter.cpp
@@ -652,6 +652,16 @@ static constexpr uint16_t IntrinsicsToAttributesMap[] = {)";
 
 AttributeList Intrinsic::getAttributes(LLVMContext &C, ID id,
                                        FunctionType *FT) {)";
+  // Find the max number of attributes to create the local array.
+  unsigned MaxNumAttrs = 0;
+  for (const auto [IntPtr, UniqueID] : UniqAttributes) {
+    const CodeGenIntrinsic &Int = *IntPtr;
+    unsigned NumAttrs =
+        llvm::count_if(Int.ArgumentAttributes,
+                       [](const auto &Attrs) { return !Attrs.empty(); });
+    NumAttrs += hasFnAttributes(Int);
+    MaxNumAttrs = std::max(MaxNumAttrs, NumAttrs);
+  }
 
   OS << formatv(R"(
   if (id == 0)
@@ -659,46 +669,44 @@ AttributeList Intrinsic::getAttributes(LLVMContext &C, ID id,
 
   uint16_t PackedID = IntrinsicsToAttributesMap[id - 1];
   uint8_t FnAttrID = PackedID >> 8;
+  std::pair<unsigned, AttributeSet> AS[{}];
+  unsigned NumAttrs = 0;
+  bool HasFnAttr = false;
   switch(PackedID & 0xFF) {{
     default: llvm_unreachable("Invalid attribute number");
-)");
+)",
+                MaxNumAttrs);
 
   for (const auto [IntPtr, UniqueID] : UniqAttributes) {
     OS << formatv("  case {}:\n", UniqueID);
     const CodeGenIntrinsic &Int = *IntPtr;
 
-    // Keep track of the number of attributes we're writing out.
-    unsigned NumAttrs =
-        llvm::count_if(Int.ArgumentAttributes,
-                       [](const auto &Attrs) { return !Attrs.empty(); });
-    NumAttrs += hasFnAttributes(Int);
-    if (NumAttrs == 0) {
-      OS << "    return AttributeList();\n";
-      continue;
-    }
+    unsigned NumAttrs = 0;
 
-    OS << "    return AttributeList::get(C, {\n";
-    ListSeparator LS(",\n");
     for (const auto &[AttrIdx, Attrs] : enumerate(Int.ArgumentAttributes)) {
       if (Attrs.empty())
         continue;
 
       unsigned ArgAttrID = UniqArgAttributes.find(Attrs)->second;
-      OS << LS
-         << formatv("      {{{}, getIntrinsicArgAttributeSet(C, {}, "
-                    "FT->getContainedType({}))}",
-                    AttrIdx, ArgAttrID, AttrIdx);
+      OS << formatv("    AS[{}] = {{{}, getIntrinsicArgAttributeSet(C, {}, "
+                    "FT->getContainedType({}))};\n",
+                    NumAttrs++, AttrIdx, ArgAttrID, AttrIdx);
     }
 
-    if (hasFnAttributes(Int)) {
-      OS << LS
-         << "      {AttributeList::FunctionIndex, "
-            "getIntrinsicFnAttributeSet(C, FnAttrID)}";
-    }
-    OS << "\n    });\n";
+    if (hasFnAttributes(Int))
+      OS << "    HasFnAttr = true;\n";
+
+    if (NumAttrs)
+      OS << formatv("    NumAttrs = {};\n", NumAttrs);
+    OS << "    break;\n";
   }
 
   OS << R"(  }
+  if (HasFnAttr) {
+    AS[NumAttrs++] = {AttributeList::FunctionIndex,
+                      getIntrinsicFnAttributeSet(C, FnAttrID)};
+  }
+  return AttributeList::get(C, ArrayRef(AS, NumAttrs));
 }
 #endif // GET_INTRINSIC_ATTRIBUTES
 

Copy link
Contributor

@nikic nikic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jurahul jurahul merged commit 22af0cd into llvm:main Aug 6, 2025
12 checks passed
@jurahul jurahul deleted the intrinsic_getAttributes_stack_size branch August 6, 2025 14:09
jurahul added a commit that referenced this pull request Aug 8, 2025
This a follow on to #152219 to
reduce both code and frame size of `Intrinsic::getAttributes` further.

Currently, this function consists of several switch cases (one per
unique argument attributes) that populates the local `AS` array with all
non-empty argument attributes for that intrinsic by calling
`getIntrinsicArgAttributeSet`. This change makes this code table driven
and implements `Intrinsic::getAttributes` without any switch cases,
which reduces the code size of this function on all platforms and in
addition reduces the frame size by a factor of 10 on Windows.

This is achieved by:
1. Emitting table `ArgAttrIdTable` containing a concatenated list of
`<ArgNo, AttrID>` entries across all unique arguments.
2. Emitting table `ArgAttributesInfoTable` (indexed by unique
arguments-ID) to store the starting index and number of non-empty arg
attributes.
3. Reserving unique function-ID 255 to indicate that the intrinsic has
no function attributes (to replace `HasFnAttr` setup in each switch
case).
4. Using a simple table lookup and for loop to build the list of
argument and function attributes for a given intrinsic.

Experimental data shows that with release builds and assertions
disabled, this change reduces the code size for GCC and Clang builds on
Linux by ~9KB for a modest (80/152 byte) increase in frame size. For
Windows, it reduces the code size by 20KB and frame size from 4736 bytes
to 461 bytes which is 10x reduction. Actual data is as follows:

```
 Current trunk:
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  code size                             0x35a9          0x370c            0x5581
  frame size                            0x120           0x118             0x1280

 table driven Intrinsic::getAttributes:
  code size                             0xcfb            0xcd0            0x1cf
  frame size                            0x1b8            0x188            0x1A0
  Total savings (code + data)           9212 bytes       9790 bytes       20119 bytes
```

Total savings above accounts for the additional data size for the 2 new
tables, which in this experiment was: `ArgAttributesInfoTable` = 314
bytes and `ArgAttrIdTable` = 888 bytes. Coupled with the earlier
#152219, this achieves a 46x
reduction in frame size for this function in Windows release builds.
llvm-sync bot pushed a commit to arm/arm-toolchain that referenced this pull request Aug 8, 2025
…ven (#152349)

This a follow on to llvm/llvm-project#152219 to
reduce both code and frame size of `Intrinsic::getAttributes` further.

Currently, this function consists of several switch cases (one per
unique argument attributes) that populates the local `AS` array with all
non-empty argument attributes for that intrinsic by calling
`getIntrinsicArgAttributeSet`. This change makes this code table driven
and implements `Intrinsic::getAttributes` without any switch cases,
which reduces the code size of this function on all platforms and in
addition reduces the frame size by a factor of 10 on Windows.

This is achieved by:
1. Emitting table `ArgAttrIdTable` containing a concatenated list of
`<ArgNo, AttrID>` entries across all unique arguments.
2. Emitting table `ArgAttributesInfoTable` (indexed by unique
arguments-ID) to store the starting index and number of non-empty arg
attributes.
3. Reserving unique function-ID 255 to indicate that the intrinsic has
no function attributes (to replace `HasFnAttr` setup in each switch
case).
4. Using a simple table lookup and for loop to build the list of
argument and function attributes for a given intrinsic.

Experimental data shows that with release builds and assertions
disabled, this change reduces the code size for GCC and Clang builds on
Linux by ~9KB for a modest (80/152 byte) increase in frame size. For
Windows, it reduces the code size by 20KB and frame size from 4736 bytes
to 461 bytes which is 10x reduction. Actual data is as follows:

```
 Current trunk:
  Compiler                              gcc-13.3.0      clang-18.1.3      MSVC 19.43.34810.0
  code size                             0x35a9          0x370c            0x5581
  frame size                            0x120           0x118             0x1280

 table driven Intrinsic::getAttributes:
  code size                             0xcfb            0xcd0            0x1cf
  frame size                            0x1b8            0x188            0x1A0
  Total savings (code + data)           9212 bytes       9790 bytes       20119 bytes
```

Total savings above accounts for the additional data size for the 2 new
tables, which in this experiment was: `ArgAttributesInfoTable` = 314
bytes and `ArgAttrIdTable` = 888 bytes. Coupled with the earlier
llvm/llvm-project#152219, this achieves a 46x
reduction in frame size for this function in Windows release builds.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants