Commit d447282: Merge pull request #11 from bstatcomp/opencl_kernel_generator (Kernel generator design doc)

- Feature Name: opencl_kernel_generator
- Start Date: 2019-10-21
- RFC PR: (leave this empty)
- Stan Issue: (leave this empty)

# Summary
[summary]: #summary

Define operators and functions on `matrix_cl` that use
[expression templates](https://en.wikipedia.org/wiki/Expression_templates)
to construct expressions. When assigned to a `matrix_cl`, an expression generates an
OpenCL kernel in a JIT fashion and executes it to calculate the result of the expression.

# Motivation
[motivation]: #motivation

OpenCL routines that require no communication between threads and decompose into simple
arithmetic operations all follow a similar form. A JIT compiler that takes in arithmetic operations
and creates these OpenCL routines on the fly would be much simpler to maintain and more efficient than writing
these OpenCL kernels by hand.
Using one kernel for multiple operations is faster than using one kernel per operation. Each
kernel must load its arguments from global GPU memory and store results back. Memory
transfers are slow relative to computation, so for simple kernels the majority
of execution time is spent on memory transfers. The kernel generator can fuse a sequence of
operations into a single kernel and is therefore more performant than launching one predefined
kernel per operation.

Any operation that requires no communication between threads can be implemented in the kernel
generator. While this means not every kernel can be rewritten with the kernel generator, many, if
not most, of those needed in Stan Math qualify.
# Explanation
[guide-level-explanation]: #explanation

A kernel generator expression consists of operations, for example addition, exponentiation or
transposition. While not operations in the mathematical sense, accesses to matrices and scalars are also operations
in the kernel generator. These operations are defined as classes using the curiously recurring
template pattern (CRTP) to build compound classes of templated expressions. See chapter 27 of *C++
Templates: The Complete Guide* for a more detailed explanation of this pattern.

Each operation is implemented as a templated class. When instantiated, these templates are given
the types of the subexpressions that are arguments to the particular operation. User-facing
functions simplify the construction of operations, since class template arguments cannot
be deduced in C++14.

Each operation object is responsible for generating kernel code for the operation it
represents. Each operation selects a unique variable name and generates kernel code that
declares a variable and assigns the result of the operation to it. If the operation has any
arguments that are themselves operations, it can access the variables that hold their results.
If the operation needs any other data, it can specify additional kernel arguments. For example, an operation
that accesses a matrix might need a pointer to the matrix data and the size of the matrix.

Some operations can also be used in expressions that are assigned to. The most basic case
is access to a matrix, i.e. storing the result of a kernel. Another example is the matrix sub-block
operation. Such operations can also generate
kernel code for appearing on the left-hand side of an assignment. While the code is different, the
process of generating it is similar to generating code for an operation on the right-hand side
of an assignment.

Kernel generator expressions can consist of operations, matrices and scalars. Operations have
the methods needed for constructing kernel source; matrices and scalars do
not. So they need to be wrapped in an appropriate wrapper operation whenever they are
used in a kernel generator expression.

If an expression is used multiple times, the kernel associated with it should be cached, so
that regenerating and recompiling the kernel is not necessary every time the expression is
evaluated. Instead, the kernel is generated only on first use. A cached kernel can also be reused
between instances of expressions consisting of the same operations, but operating on different data,
even if the matrices have different sizes.

Caching is global within a program; in other words, cached kernels are only deleted when the
program using them terminates. Through templated classes we achieve that every possible expression
has a distinct type, so any kernel generated by the kernel generator can be uniquely identified by
the pair of the expression type and the result type. This allows us to avoid
a map or similar object for the cache. Instead, the cached kernel can be a static member of
a struct templated on both the expression type and the result type.

# Example and reference-level explanation

Binary and unary operations are defined as templated classes inheriting from a base `operation`
class. The `operation` class is a CRTP interface layer which defines the necessary methods
that derived classes need to implement. The class `binary_operation` is the base that all
other binary operations are derived from. For example, addition is defined as

```cpp
template <typename T_a, typename T_b>
class addition__ : public binary_operation<addition__<T_a, T_b>, T_a, T_b> {
 public:
  addition__(T_a&& a, T_b&& b)
      : binary_operation<addition__<T_a, T_b>, T_a, T_b>(
            std::forward<T_a>(a), std::forward<T_b>(b), "+") {}
};

template <typename T_a, typename T_b,
          typename = require_all_valid_expressions_t<T_a, T_b>>
inline addition__<as_operation_t<T_a>, as_operation_t<T_b>> operator+(T_a&& a,
                                                                      T_b&& b) {
  return {as_operation(std::forward<T_a>(a)),
          as_operation(std::forward<T_b>(b))};
}
```

In the above we have the classic CRTP inheritance style for expression templates, a perfect-forwarding
constructor, and an `operator+` that the user calls. The constructor above calls the
`binary_operation` constructor while passing the operator string `"+"`. The `"+"` is stored as `op_` in the
`binary_operation` instantiation and used during kernel construction in the `generate()` method of
`binary_operation`. The `generate()` method for a binary operation is not called until the assignment operator
to a `matrix_cl` is called. Once the assignment operator is called, the expression template class's `eval()`
method is called, which in turn calls `operation`'s `evaluate_into()`, which finally calls
`generate()` to take the expression's matrices and operations and create the appropriate OpenCL kernel.

These templates are instantiated as shown in the following example:

```cpp
matrix_cl<double> a, b;
double c;
matrix_cl<double> d = c * (a + b);
```

In this example `a` and `b` are first wrapped in a matrix access operation called `load_`.
It is templated with the type of the argument, which can be either a value or a reference
(the class implements perfect forwarding). In this example both `a` and `b` are lvalues, so we
get `load_<matrix_cl<double>&>` in both cases. This happens within
`operator+`, which returns an addition object that references its arguments;
its type is `addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>`.

`operator*` wraps the scalar in a scalar access operation that is templated with the type of the scalar
(`scalar_<double>`) and constructs an element-wise multiplication object that
references the scalar and the addition as its arguments, resulting in an object of type
`elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>`.

When assigned to a matrix, the expression generates the OpenCL kernel. Internally, in `operator=()`,
`d` is also wrapped into `load_<matrix_cl<double>&>`. Matrix access operations generate code for
loading matrix elements from global memory. Addition generates code for adding them together.
Multiplication multiplies the result with the scalar. `d`'s wrapper generates code for storing the result
back to global memory.

When the `evaluate_into()` method is called for this kernel, it begins creating the kernel signature
and body. The `scalar_<double>` adds `[TYPE] [NAME]` to the kernel signature. The name is
generated to be unique for each operation in the kernel; the type is the template parameter of `scalar_`,
in this case `double`.

The `load_<matrix_cl<double>&>` adds `__global [TYPE]* [NAME]_global, int [NAME]_rows, int [NAME]_view` to
the kernel signature. The kernel's body defines a private variable and then assigns the thread's array
cell into that private variable. The assignment is skipped if the thread's array
cell lies in the part of a triangular matrix that is equal to 0. The pattern looks as follows:

```opencl
double [NAME] = 0;
if (!((!contains_nonzero([NAME]_view, LOWER) && j < i) ||
      (!contains_nonzero([NAME]_view, UPPER) && j > i))) {
  [NAME] = [NAME]_global[i + [NAME]_rows * j];
}
```

The `generate()` method for binary operations adds their `op_` to the OpenCL kernel, operating on the
named private variables of their arguments and creating a new private variable
for the compound expression. In the example above, `addition_` and `elewise_multiplication_` generate
the following code in the kernel body:

```opencl
double var4 = var2 + var3;
double var5 = var1 * var4;
```

Finally, the assignment to a `matrix_cl<T>` creates a `matrix_cl<T>` of the appropriate size, which is
added to the end of the signature and assigned into:

```opencl
var6_global[i + var6_rows * j] = var5;
```

Putting this all together, the kernel for
`elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>`
decomposes into:

```opencl
kernel void calculate(double var1,
                      __global double* var2_global, int var2_rows, int var2_view,
                      __global double* var3_global, int var3_rows, int var3_view,
                      __global double* var6_global, int var6_rows, int var6_view) {
  int i = get_global_id(0);
  int j = get_global_id(1);
  double var2 = 0;
  if (!((!contains_nonzero(var2_view, LOWER) && j < i) ||
        (!contains_nonzero(var2_view, UPPER) && j > i))) {
    var2 = var2_global[i + var2_rows * j];
  }
  double var3 = 0;
  if (!((!contains_nonzero(var3_view, LOWER) && j < i) ||
        (!contains_nonzero(var3_view, UPPER) && j > i))) {
    var3 = var3_global[i + var3_rows * j];
  }
  double var4 = var2 + var3;
  double var5 = var1 * var4;
  var6_global[i + var6_rows * j] = var5;
}
```

After the kernel is compiled, it is cached in
`static operation<elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>>::cache<load_<matrix_cl<double>&>>::kernel`.

# Drawbacks
[drawbacks]: #drawbacks

Even with caching, each kernel must be compiled the first time it is used. If many kernels are used and each
only once, long compilation times could make this slower than one predefined kernel per operation. This is not an issue
in Stan, since many leapfrog steps are executed, so each kernel is reused many times.

# Prior art
[prior-art]: #prior-art

Expression templates are widely used in Eigen.

Tensorflow XLA is an experimental kernel fusion feature of TensorFlow.
