Commit 590066b

[NVPTX] Add family-specific architectures support (#141899)
This change adds support for the family-specific architecture variants introduced in [PTX ISA 8.8](https://docs.nvidia.com/cuda/parallel-thread-execution/#ptx-isa-version-8-8). These variants carry an "f" suffix, for example sm_100f. This change does not promote any existing features to the family-specific architectures.
1 parent 7b989ad commit 590066b

5 files changed

+169
-29
lines changed


llvm/docs/NVPTXUsage.rst

Lines changed: 51 additions & 1 deletion
@@ -147,7 +147,57 @@ Example: 32-bit PTX for CUDA Driver API: ``nvptx-nvidia-cuda``
147147

148148
Example: 64-bit PTX for CUDA Driver API: ``nvptx64-nvidia-cuda``
149149

150-
150+
.. _nvptx_arch_hierarchy:
151+
152+
NVPTX Architecture Hierarchy and Ordering
153+
=========================================
154+
155+
GPU architectures: sm_2Y/sm_3Y/sm_5Y/sm_6Y/sm_7Y/sm_8Y/sm_9Y/sm_10Y/sm_12Y
156+
('Y' represents version within the architecture)
157+
The architectures have names of the form ``sm_XYz``, where ``X`` represents the generation
158+
number, ``Y`` represents the version within the architecture, and ``z`` represents
159+
the optional feature suffix.
160+
If ``X1Y1 <= X2Y2``, then GPU capabilities of ``sm_X1Y1`` are included in ``sm_X2Y2``.
161+
For example, take ``sm_90`` (9 represents ``X``, 0 represents ``Y``, and no feature
162+
suffix) and ``sm_103`` architectures (10 represents ``X``, 3 represents ``Y``, and no
163+
feature suffix). Since 90 <= 103, ``sm_90`` is compatible with ``sm_103``.
164+
165+
The family-specific variants have the ``f`` feature suffix and follow the
166+
following order:
167+
``sm_X{Y2}f > sm_X{Y1}f`` iff ``Y2 > Y1``
168+
``sm_XY{f} > sm_{XY}{}``
169+
170+
For example, take ``sm_100f`` (10 represents ``X``, 0 represents ``Y``, and ``f``
171+
represents ``z``) and ``sm_103f`` (10 represents ``X``, 3 represents ``Y``, and ``f``
172+
represents ``z``) architecture variants. Since ``Y1 < Y2``, ``sm_100f`` is compatible with
173+
``sm_103f``. Similarly, based on the second rule, ``sm_90`` is compatible with ``sm_103f``.
174+
175+
As a counterexample, take ``sm_100f`` and ``sm_120f`` (12 represents ``X``, 0
176+
represents ``Y``, and ``f`` represents ``z``) architecture variants. Since both
177+
belong to different families, i.e. ``X1 != X2``, ``sm_100f`` is not compatible with
178+
``sm_120f``.
179+
180+
The architecture-specific variants have the ``a`` feature suffix and follow the
181+
following order:
182+
``sm_XY{a} > sm_XY{f} > sm_{XY}{}``
183+
184+
For example, take ``sm_103a`` (10 represents ``X``, 3 represents ``Y``, and ``a``
185+
represents ``z``), ``sm_103f``, and ``sm_103`` architecture variants. ``sm_103`` is
186+
compatible with ``sm_103a`` and ``sm_103f``, and ``sm_103f`` is compatible with ``sm_103a``.
187+
188+
Encoding := Arch * 10 + 2 (for 'f') + 1 (for 'a')
189+
Arch := X * 10 + Y
190+
191+
For example, ``sm_103f`` is encoded as 1032 (103 * 10 + 2) and ``sm_103a`` is
192+
encoded as 1033 (103 * 10 + 2 + 1).
193+
194+
This encoding allows simple partial ordering of the architectures.
195+
196+
* Compare Family and Arch by dividing FullSMVersion by 100 and 10
197+
respectively before the comparison.
198+
* Compare within the family by comparing FullSMVersion, given that both belong to
199+
the same family.
200+
* Detect ``a`` variants by checking FullSMVersion & 1.
151201

152202
.. _nvptx_intrinsics:
153203
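The encoding and the three comparison bullets documented in this hunk can be sketched in C++ as follows. This is a minimal illustration, not code from the commit: the `Suffix` enum and the helpers `encode`, `familyOf`, `archOf`, `isArchSpecific`, and `sameFamily` are hypothetical names chosen for the example. Only the arithmetic (`Arch * 10` plus 2 for `f` or 3 for `a`, division by 100 and 10, and the `& 1` check) comes from the documentation text.

```cpp
// Illustrative helpers only; none of these names exist in LLVM.
enum Suffix : unsigned { NoSuffix = 0, FamilySuffix = 2, ArchSuffix = 3 };

// Encoding := Arch * 10 + suffix value, where Arch := X * 10 + Y.
constexpr unsigned encode(unsigned Arch, Suffix S) { return Arch * 10 + S; }

// Family (X) and Arch (XY) are recovered by integer division.
constexpr unsigned familyOf(unsigned FullSMVersion) {
  return FullSMVersion / 100;
}
constexpr unsigned archOf(unsigned FullSMVersion) {
  return FullSMVersion / 10;
}

// Base encodings end in 0 and 'f' encodings in 2, so only 'a' encodings
// (ending in 3) have the low bit set.
constexpr bool isArchSpecific(unsigned FullSMVersion) {
  return (FullSMVersion & 1) != 0;
}

// Comparing FullSMVersion values is only meaningful within one family.
constexpr bool sameFamily(unsigned A, unsigned B) {
  return familyOf(A) == familyOf(B);
}

// The worked examples from the documentation text.
static_assert(encode(103, FamilySuffix) == 1032, "sm_103f");
static_assert(encode(103, ArchSuffix) == 1033, "sm_103a");
static_assert(familyOf(1032) == 10 && archOf(1032) == 103, "sm_103f parts");
static_assert(sameFamily(encode(100, FamilySuffix), encode(103, FamilySuffix)),
              "sm_100f and sm_103f share family 10");
static_assert(!sameFamily(encode(100, FamilySuffix), encode(120, FamilySuffix)),
              "sm_100f and sm_120f are in different families");
static_assert(isArchSpecific(1033) && !isArchSpecific(1032), "only 'a' is odd");
```

The sketch deliberately stops short of a full compatibility predicate, since the documentation describes the result as a partial ordering and gives only the three comparison steps.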

llvm/lib/Target/NVPTX/NVPTX.td

Lines changed: 65 additions & 11 deletions
@@ -33,20 +33,69 @@ class FeaturePTX<int version>:
3333
SubtargetFeature<"ptx"# version, "PTXVersion",
3434
"" # version,
3535
"Use PTX version " # version>;
36-
36+
// NVPTX Architecture Hierarchy and Ordering:
37+
//
38+
// GPU architectures: sm_2Y/sm_3Y/sm_5Y/sm_6Y/sm_7Y/sm_8Y/sm_9Y/sm_10Y/sm_12Y
39+
// ('Y' represents version within the architecture)
40+
// The architectures have names of the form sm_XYz, where 'X' represents the generation
41+
// number, 'Y' represents the version within the architecture, and 'z' represents
42+
// the optional feature suffix.
43+
// If X1Y1 <= X2Y2, then GPU capabilities of sm_X1Y1 are included in sm_X2Y2.
44+
// For example, take sm_90 (9 represents 'X', 0 represents 'Y', and no feature
45+
// suffix) and sm_103 architectures (10 represents 'X', 3 represents 'Y', and no
46+
// feature suffix). Since 90 <= 103, sm_90 is compatible with sm_103.
47+
//
48+
// The family-specific variants have the 'f' feature suffix and follow the
49+
// following order:
50+
// sm_X{Y2}f > sm_X{Y1}f iff Y2 > Y1
51+
// sm_XY{f} > sm_{XY}{}
52+
//
53+
// For example, take sm_100f (10 represents 'X', 0 represents 'Y', and 'f'
54+
// represents 'z') and sm_103f (10 represents 'X', 3 represents 'Y', and 'f'
55+
// represents 'z') architecture variants. Since Y1 < Y2, sm_100f is compatible with
56+
// sm_103f. Similarly, based on the second rule, sm_90 is compatible with sm_103f.
57+
//
58+
// As a counterexample, take sm_100f and sm_120f (12 represents 'X', 0
59+
// represents 'Y', and 'f' represents 'z') architecture variants. Since both
60+
// belong to different families, i.e. X1 != X2, sm_100f is not compatible with
61+
// sm_120f.
62+
//
63+
// The architecture-specific variants have the 'a' feature suffix and follow the
64+
// following order:
65+
// sm_XY{a} > sm_XY{f} > sm_{XY}{}
66+
//
67+
// For example, take sm_103a (10 represents 'X', 3 represents 'Y', and 'a'
68+
// represents 'z'), sm_103f, and sm_103 architecture variants. sm_103 is
69+
// compatible with sm_103a and sm_103f, and sm_103f is compatible with sm_103a.
70+
//
71+
// Encoding := Arch * 10 + 2 (for 'f') + 1 (for 'a')
72+
// Arch := X * 10 + Y
73+
//
74+
// For example, sm_103a is encoded as 1033 (103 * 10 + 2 + 1) and sm_103f is
75+
// encoded as 1032 (103 * 10 + 2).
76+
//
77+
// This encoding allows simple partial ordering of the architectures.
78+
// + Compare Family and Arch by dividing FullSMVersion by 100 and 10
79+
// respectively before the comparison.
80+
// + Compare within the family by comparing FullSMVersion, given that both belong to
81+
// the same family.
82+
// + Detect 'a' variants by checking FullSMVersion & 1.
3783
foreach sm = [20, 21, 30, 32, 35, 37, 50, 52, 53,
3884
60, 61, 62, 70, 72, 75, 80, 86, 87,
39-
89, 90, 100, 101, 103, 120, 121] in
40-
def SM#sm: FeatureSM<""#sm, !mul(sm, 10)>;
85+
89, 90, 100, 101, 103, 120, 121] in {
86+
// Base SM version (e.g. FullSMVersion for sm_100 is 1000)
87+
def SM#sm : FeatureSM<""#sm, !mul(sm, 10)>;
88+
89+
// Family-specific targets which are compatible within the same family
90+
// (e.g. FullSMVersion for sm_100f is 1002)
91+
if !ge(sm, 100) then
92+
def SM#sm#f : FeatureSM<""#sm#"f", !add(!mul(sm, 10), 2)>;
4193

42-
// Arch-specific targets. PTX for these is not compatible with any other
43-
// architectures.
44-
def SM90a : FeatureSM<"90a", 901>;
45-
def SM100a: FeatureSM<"100a", 1001>;
46-
def SM101a: FeatureSM<"101a", 1011>;
47-
def SM103a: FeatureSM<"103a", 1031>;
48-
def SM120a: FeatureSM<"120a", 1201>;
49-
def SM121a: FeatureSM<"121a", 1211>;
94+
// Architecture-specific targets which are incompatible across architectures
95+
// (e.g. FullSMVersion for sm_100a is 1003)
96+
if !ge(sm, 90) then
97+
def SM#sm#a : FeatureSM<""#sm#"a", !add(!mul(sm, 10), 3)>;
98+
}
5099

51100
foreach version = [32, 40, 41, 42, 43, 50, 60, 61, 62, 63, 64, 65,
52101
70, 71, 72, 73, 74, 75, 76, 77, 78,
@@ -83,14 +132,19 @@ def : Proc<"sm_90", [SM90, PTX78]>;
83132
def : Proc<"sm_90a", [SM90a, PTX80]>;
84133
def : Proc<"sm_100", [SM100, PTX86]>;
85134
def : Proc<"sm_100a", [SM100a, PTX86]>;
135+
def : Proc<"sm_100f", [SM100f, PTX88]>;
86136
def : Proc<"sm_101", [SM101, PTX86]>;
87137
def : Proc<"sm_101a", [SM101a, PTX86]>;
138+
def : Proc<"sm_101f", [SM101f, PTX88]>;
88139
def : Proc<"sm_103", [SM103, PTX88]>;
89140
def : Proc<"sm_103a", [SM103a, PTX88]>;
141+
def : Proc<"sm_103f", [SM103f, PTX88]>;
90142
def : Proc<"sm_120", [SM120, PTX87]>;
91143
def : Proc<"sm_120a", [SM120a, PTX87]>;
144+
def : Proc<"sm_120f", [SM120f, PTX88]>;
92145
def : Proc<"sm_121", [SM121, PTX88]>;
93146
def : Proc<"sm_121a", [SM121a, PTX88]>;
147+
def : Proc<"sm_121f", [SM121f, PTX88]>;
94148

95149
def NVPTXInstrInfo : InstrInfo {
96150
}
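To make the new `foreach` body easier to check by hand, the conditions and `FullSMVersion` values it produces can be mirrored in C++. This is a hedged sketch: the function names are hypothetical, and the thresholds and arithmetic are copied from the `!ge` and `!add(!mul(...))` expressions in the hunk above. It only applies to the SM numbers that actually appear in the TableGen list.

```cpp
// Hypothetical mirror of the TableGen conditions; not LLVM code.

// def SM#sm : FeatureSM<""#sm, !mul(sm, 10)>
constexpr unsigned baseVersion(unsigned SM) { return SM * 10; }

// if !ge(sm, 100) then def SM#sm#f ... !add(!mul(sm, 10), 2)
constexpr bool hasFamilyVariant(unsigned SM) { return SM >= 100; }
constexpr unsigned familyVersion(unsigned SM) { return SM * 10 + 2; }

// if !ge(sm, 90) then def SM#sm#a ... !add(!mul(sm, 10), 3)
constexpr bool hasArchVariant(unsigned SM) { return SM >= 90; }
constexpr unsigned archVersion(unsigned SM) { return SM * 10 + 3; }

// sm_100 gets all three records: 1000, 1002 (f), and 1003 (a).
static_assert(baseVersion(100) == 1000, "sm_100");
static_assert(familyVersion(100) == 1002, "sm_100f");
static_assert(archVersion(100) == 1003, "sm_100a");

// sm_90 gets an 'a' record but no 'f' record; sm_89 gets neither.
static_assert(hasArchVariant(90) && !hasFamilyVariant(90), "sm_90a only");
static_assert(!hasArchVariant(89) && !hasFamilyVariant(89), "sm_89 base only");
```

One consequence visible here is that `sm_90a` moves from the old literal 901 to 903, which is why the users of the old literals are updated elsewhere in this commit.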

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

Lines changed: 4 additions & 4 deletions
@@ -158,10 +158,10 @@ class hasPTX<int version>: Predicate<"Subtarget->getPTXVersion() >= " # version>
158158
class hasSM<int version>: Predicate<"Subtarget->getSmVersion() >= " # version>;
159159

160160
// Explicit records for arch-accelerated SM versions
161-
def hasSM90a : Predicate<"Subtarget->getFullSmVersion() == 901">;
162-
def hasSM100a : Predicate<"Subtarget->getFullSmVersion() == 1001">;
163-
def hasSM101a : Predicate<"Subtarget->getFullSmVersion() == 1011">;
164-
def hasSM120a : Predicate<"Subtarget->getFullSmVersion() == 1201">;
161+
def hasSM90a : Predicate<"Subtarget->getSmVersion() == 90 && Subtarget->hasArchAccelFeatures()">;
162+
def hasSM100a : Predicate<"Subtarget->getSmVersion() == 100 && Subtarget->hasArchAccelFeatures()">;
163+
def hasSM101a : Predicate<"Subtarget->getSmVersion() == 101 && Subtarget->hasArchAccelFeatures()">;
164+
def hasSM120a : Predicate<"Subtarget->getSmVersion() == 120 && Subtarget->hasArchAccelFeatures()">;
165165

166166
// non-sync shfl instructions are not available on sm_70+ in PTX6.4+
167167
def hasSHFL : Predicate<"!(Subtarget->getSmVersion() >= 70"
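The rewritten predicates no longer compare against a hard-coded `FullSmVersion` literal. The sketch below evaluates the new `hasSM90a` condition on a stand-in struct to show what changes. `PredicateSubtarget` and the free function `hasSM90a` are hypothetical names for illustration; the two accessor bodies are copied from the NVPTXSubtarget.h hunk of this commit.

```cpp
// Stand-in for the two NVPTXSubtarget accessors that the predicate uses.
struct PredicateSubtarget {
  unsigned FullSmVersion;
  unsigned PTXVersion;

  constexpr unsigned getSmVersion() const { return FullSmVersion / 10; }
  constexpr bool hasArchAccelFeatures() const {
    return (FullSmVersion & 1) && PTXVersion >= 80;
  }
};

// Same condition as the new hasSM90a predicate string.
constexpr bool hasSM90a(PredicateSubtarget S) {
  return S.getSmVersion() == 90 && S.hasArchAccelFeatures();
}

// sm_90a is now encoded as 903, so the old "== 901" check would never match.
static_assert(hasSM90a(PredicateSubtarget{903, 80}), "sm_90a with PTX 8.0");
static_assert(!hasSM90a(PredicateSubtarget{900, 80}), "plain sm_90");
// Because hasArchAccelFeatures() also tests PTXVersion, an older PTX version
// makes the predicate false even for an 'a' target.
static_assert(!hasSM90a(PredicateSubtarget{903, 78}), "PTX 7.8 is too old");
```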

llvm/lib/Target/NVPTX/NVPTXSubtarget.h

Lines changed: 29 additions & 13 deletions
@@ -108,8 +108,8 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
108108
switch (FullSmVersion) {
109109
default:
110110
break;
111-
case 1001: // sm_100a
112-
case 1011: // sm_101a
111+
case 1003: // sm_100a
112+
case 1013: // sm_101a
113113
HasTcgen05 = true;
114114
break;
115115
}
@@ -120,9 +120,15 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
120120
// TMA G2S copy with cta_group::1/2 support
121121
bool hasCpAsyncBulkTensorCTAGroupSupport() const {
122122
// TODO: Update/tidy-up after the family-conditional support arrives
123-
return ((FullSmVersion == 1001 || FullSmVersion == 1011) &&
124-
PTXVersion >= 86) ||
125-
(FullSmVersion == 1031 && PTXVersion >= 88);
123+
switch (FullSmVersion) {
124+
case 1003:
125+
case 1013:
126+
return PTXVersion >= 86;
127+
case 1033:
128+
return PTXVersion >= 88;
129+
default:
130+
return false;
131+
}
126132
}
127133

128134
// Prior to CUDA 12.3 ptxas did not recognize that the trap instruction
@@ -136,14 +142,24 @@ class NVPTXSubtarget : public NVPTXGenSubtargetInfo {
136142
bool hasCvtaParam() const { return SmVersion >= 70 && PTXVersion >= 77; }
137143
unsigned int getFullSmVersion() const { return FullSmVersion; }
138144
unsigned int getSmVersion() const { return getFullSmVersion() / 10; }
139-
// GPUs with "a" suffix have include architecture-accelerated features that
140-
// are supported on the specified architecture only, hence such targets do not
141-
// follow the onion layer model. hasArchAccelFeatures() allows
142-
// distinguishing such GPU variants from the base GPU architecture.
143-
// - 0 represents base GPU model,
144-
// - non-zero value identifies particular architecture-accelerated variant.
145-
bool hasArchAccelFeatures() const { return getFullSmVersion() % 10; }
146-
145+
// GPUs with "a" suffix have architecture-accelerated features that are
146+
// supported on the specified architecture only, hence such targets do not
147+
// follow the onion layer model. hasArchAccelFeatures() allows distinguishing
148+
// such GPU variants from the base GPU architecture.
149+
// - false represents non-accelerated architecture.
150+
// - true represents architecture-accelerated variant.
151+
bool hasArchAccelFeatures() const {
152+
return (getFullSmVersion() & 1) && PTXVersion >= 80;
153+
}
154+
// GPUs with 'f' suffix have architecture-accelerated features which are
155+
// portable across all future architectures under the same SM major. For example,
156+
// sm_100f features will work for sm_10X*f*/sm_10X*a* future architectures.
157+
// - false represents non-family-specific architecture.
158+
// - true represents family-specific variant.
159+
bool hasFamilySpecificFeatures() const {
160+
return getFullSmVersion() % 10 == 2 ? PTXVersion >= 88
161+
: hasArchAccelFeatures();
162+
}
147163
// If the user did not provide a target we default to the `sm_30` target.
148164
std::string getTargetName() const {
149165
return TargetName.empty() ? "sm_30" : TargetName;
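The interaction between `hasArchAccelFeatures()` and `hasFamilySpecificFeatures()` is easiest to see as a small truth table. The sketch below copies the two method bodies from the hunk above into a stand-in struct. `FeatureSubtarget` is a hypothetical name, and the real class has many more members.

```cpp
// Stand-in holding only the two fields the methods read.
struct FeatureSubtarget {
  unsigned FullSmVersion;
  unsigned PTXVersion;

  constexpr bool hasArchAccelFeatures() const {
    return (FullSmVersion & 1) && PTXVersion >= 80;
  }
  constexpr bool hasFamilySpecificFeatures() const {
    return FullSmVersion % 10 == 2 ? PTXVersion >= 88
                                   : hasArchAccelFeatures();
  }
};

// 'f' targets report family-specific features only from PTX 8.8 onward.
static_assert(FeatureSubtarget{1002, 88}.hasFamilySpecificFeatures(), "sm_100f");
static_assert(!FeatureSubtarget{1002, 86}.hasFamilySpecificFeatures(),
              "sm_100f with PTX 8.6");
// 'a' targets fall through to hasArchAccelFeatures(), so they also report
// family-specific features.
static_assert(FeatureSubtarget{1003, 86}.hasFamilySpecificFeatures(), "sm_100a");
// Base targets report neither kind of feature.
static_assert(!FeatureSubtarget{1000, 88}.hasFamilySpecificFeatures(), "sm_100");
// An 'f' target is not an 'a' target.
static_assert(!FeatureSubtarget{1002, 88}.hasArchAccelFeatures(), "sm_100f");
```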

llvm/test/CodeGen/NVPTX/sm-version.ll

Lines changed: 20 additions & 0 deletions
@@ -18,14 +18,19 @@
1818
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_90a | FileCheck %s --check-prefix=SM90a
1919
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_100 | FileCheck %s --check-prefix=SM100
2020
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_100a | FileCheck %s --check-prefix=SM100a
21+
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_100f | FileCheck %s --check-prefix=SM100f
2122
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_101 | FileCheck %s --check-prefix=SM101
2223
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_101a | FileCheck %s --check-prefix=SM101a
24+
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_101f | FileCheck %s --check-prefix=SM101f
2325
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_103 | FileCheck %s --check-prefix=SM103
2426
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_103a | FileCheck %s --check-prefix=SM103a
27+
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_103f | FileCheck %s --check-prefix=SM103f
2528
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_120 | FileCheck %s --check-prefix=SM120
2629
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_120a | FileCheck %s --check-prefix=SM120a
30+
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_120f | FileCheck %s --check-prefix=SM120f
2731
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_121 | FileCheck %s --check-prefix=SM121
2832
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_121a | FileCheck %s --check-prefix=SM121a
33+
; RUN: llc < %s -mtriple=nvptx -mcpu=sm_121f | FileCheck %s --check-prefix=SM121f
2934

3035
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_20 | FileCheck %s --check-prefix=SM20
3136
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_21 | FileCheck %s --check-prefix=SM21
@@ -47,14 +52,19 @@
4752
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_90a | FileCheck %s --check-prefix=SM90a
4853
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100 | FileCheck %s --check-prefix=SM100
4954
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100a | FileCheck %s --check-prefix=SM100a
55+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_100f | FileCheck %s --check-prefix=SM100f
5056
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_101 | FileCheck %s --check-prefix=SM101
5157
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_101a | FileCheck %s --check-prefix=SM101a
58+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_101f | FileCheck %s --check-prefix=SM101f
5259
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_103 | FileCheck %s --check-prefix=SM103
5360
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_103a | FileCheck %s --check-prefix=SM103a
61+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_103f | FileCheck %s --check-prefix=SM103f
5462
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_120 | FileCheck %s --check-prefix=SM120
5563
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_120a | FileCheck %s --check-prefix=SM120a
64+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_120f | FileCheck %s --check-prefix=SM120f
5665
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_121 | FileCheck %s --check-prefix=SM121
5766
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_121a | FileCheck %s --check-prefix=SM121a
67+
; RUN: llc < %s -mtriple=nvptx64 -mcpu=sm_121f | FileCheck %s --check-prefix=SM121f
5868

5969
; SM20: .version 3.2
6070
; SM21: .version 3.2
@@ -76,14 +86,19 @@
7686
; SM90a: .version 8.0
7787
; SM100: .version 8.6
7888
; SM100a: .version 8.6
89+
; SM100f: .version 8.8
7990
; SM101: .version 8.6
8091
; SM101a: .version 8.6
92+
; SM101f: .version 8.8
8193
; SM103: .version 8.8
8294
; SM103a: .version 8.8
95+
; SM103f: .version 8.8
8396
; SM120: .version 8.7
8497
; SM120a: .version 8.7
98+
; SM120f: .version 8.8
8599
; SM121: .version 8.8
86100
; SM121a: .version 8.8
101+
; SM121f: .version 8.8
87102

88103
; SM20: .target sm_20
89104
; SM21: .target sm_21
@@ -105,11 +120,16 @@
105120
; SM90a: .target sm_90a
106121
; SM100: .target sm_100
107122
; SM100a: .target sm_100a
123+
; SM100f: .target sm_100f
108124
; SM101: .target sm_101
109125
; SM101a: .target sm_101a
126+
; SM101f: .target sm_101f
110127
; SM103: .target sm_103
111128
; SM103a: .target sm_103a
129+
; SM103f: .target sm_103f
112130
; SM120: .target sm_120
113131
; SM120a: .target sm_120a
132+
; SM120f: .target sm_120f
114133
; SM121: .target sm_121
115134
; SM121a: .target sm_121a
135+
; SM121f: .target sm_121f
