- Feature Name: opencl_kernel_generator
- Start Date: 2019-10-21
- RFC PR: (leave this empty)
- Stan Issue: (leave this empty)

# Summary
[summary]: #summary

Define operators and functions on `matrix_cl` that use
[expression templates](https://en.wikipedia.org/wiki/Expression_templates)
to construct expressions. When assigned to a `matrix_cl`, an expression generates an
OpenCL kernel in a JIT fashion and executes it to calculate the result of the expression.

# Motivation
[motivation]: #motivation

OpenCL routines that do not require communication between threads and decompose into simple
arithmetic operations all follow a simple form. A JIT compiler that can take in arithmetic
operations and create these OpenCL routines on the fly is much simpler and more efficient than
writing these OpenCL kernels by hand.

Using one kernel for multiple operations is faster than using one kernel per operation. Each
kernel must load its arguments from global GPU memory and store results back. Memory
transfers are relatively slow compared to calculations, so for simple kernels the majority
of kernel execution time is spent on memory transfers. The kernel generator can produce a
single kernel from a sequence of operations and is therefore more performant than using one
predefined kernel for each operation.

Any operation that requires no communication between threads can be implemented in the kernel
generator. While that means not every kernel can be rewritten with the kernel generator, many,
if not most, of the ones needed in Stan Math should qualify.

# Explanation
[guide-level-explanation]: #explanation

A kernel generator expression consists of operations, for example addition, exponentiation, or
transposition. While not operations in the mathematical sense, accesses to matrices and scalars
are also operations in the kernel generator. We define these operations as classes using the
curiously recurring template pattern (CRTP) to build a compound class out of templated
expressions. See chapter 27 of C++ Templates - The Complete Guide for a more detailed
explanation of this pattern.

Each operation is implemented as a templated class. When instantiated, those templates are
given the types of the subexpressions that are arguments to the particular operation.
User-facing functions are used to simplify construction of operations, since class template
arguments cannot be deduced in C++14.
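
As a minimal sketch of this pattern (all names here, such as `operation_crtp`,
`addition_sketch`, and `make_addition`, are illustrative and not Stan Math's actual API), the
CRTP base and a deducing helper function might look like:

```cpp
#include <utility>

// CRTP base: gives access to the concrete operation without virtual dispatch.
template <typename Derived>
struct operation_crtp {
  Derived& derived() { return static_cast<Derived&>(*this); }
};

// A binary operation stores its subexpressions by value or by reference,
// depending on the deduced template parameters (perfect forwarding).
template <typename T_a, typename T_b>
struct addition_sketch : operation_crtp<addition_sketch<T_a, T_b>> {
  T_a a;
  T_b b;
  addition_sketch(T_a&& a, T_b&& b)
      : a(std::forward<T_a>(a)), b(std::forward<T_b>(b)) {}
};

// Helper function: deduces T_a and T_b, which class templates cannot do
// before C++17 class template argument deduction.
template <typename T_a, typename T_b>
addition_sketch<T_a, T_b> make_addition(T_a&& a, T_b&& b) {
  return {std::forward<T_a>(a), std::forward<T_b>(b)};
}
```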

Each operation object is responsible for generating kernel code for the operation it
represents. Each operation selects a unique variable name and generates kernel code that
declares a variable and assigns the result of the operation to it. If the operation has any
arguments that are operations themselves, it can access the variables they use for their
results. If the operation needs any other data, it can specify kernel arguments. For example,
an operation that accesses a matrix might need a pointer to the matrix data and the size of
the matrix.
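
One way this could work is sketched below; `next_var_name`, `load_sketch`, and the exact
argument layout are hypothetical illustrations of the scheme described above, not the real
implementation:

```cpp
#include <string>

// Hands out a fresh variable name each time, so every operation in a kernel
// gets a unique name for its result.
inline std::string next_var_name() {
  static int counter = 0;
  return "var" + std::to_string(counter++);
}

// A matrix-access operation: declares its kernel arguments and generates the
// body code that loads one element into a uniquely named private variable.
struct load_sketch {
  std::string name = next_var_name();

  // Kernel arguments this operation needs: a data pointer and the row count.
  std::string kernel_args() const {
    return "__global double* " + name + "_global, int " + name + "_rows";
  }

  // Body code: declare a private variable and load this thread's element.
  std::string generate(const std::string& i, const std::string& j) const {
    return "double " + name + " = " + name + "_global[" + i + " + " + name +
           "_rows * " + j + "];\n";
  }
};
```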

Some operations can also be used in expressions that are assigned to. The most basic case
is access to a matrix, i.e. storing the result of a kernel. Another example is the matrix
sub-block operation. Such operations can also generate kernel code for appearing on the
left-hand side of an assignment. While the code is different, the process of generating it
is similar to generating code for an operation on the right-hand side of an assignment.
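
For instance, a matrix-access operation could emit different code for each side of the
assignment. This sketch uses hypothetical names (`load_lhs_sketch`, `generate_lhs`,
`generate_rhs`) to illustrate the idea:

```cpp
#include <string>

// An operation usable on either side of an assignment. The name member plays
// the role of the unique variable name chosen for this operation.
struct load_lhs_sketch {
  std::string name;

  // Right-hand side: read the element into a private variable.
  std::string generate_rhs() const {
    return "double " + name + " = " + name + "_global[i + " + name +
           "_rows * j];\n";
  }

  // Left-hand side: store a computed value back to global memory.
  std::string generate_lhs(const std::string& value) const {
    return name + "_global[i + " + name + "_rows * j] = " + value + ";\n";
  }
};
```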

Kernel generator expressions can consist of operations, matrices, and scalars. Operations have
appropriate methods for constructing kernel source; matrices and scalars do not. So they need
to be wrapped in an appropriate wrapper that is an operation whenever they are used in a
kernel generator expression.
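
A sketch of this wrapping step, with illustrative stand-ins (`matrix_cl_like`, `load_wrap`,
`scalar_wrap`, `as_operation`) for the real classes:

```cpp
#include <type_traits>

struct matrix_cl_like {};  // stand-in for matrix_cl<double>

// Wrapper operations for matrices and scalars.
template <typename T> struct load_wrap { T mat; };
template <typename T> struct scalar_wrap { T value; };

// Scalars are wrapped in a scalar-access operation.
template <typename T,
          typename std::enable_if<std::is_arithmetic<T>::value, int>::type = 0>
scalar_wrap<T> as_operation(T x) {
  return {x};
}

// Matrices are wrapped in a matrix-access operation holding a reference.
inline load_wrap<matrix_cl_like&> as_operation(matrix_cl_like& m) {
  return {m};
}
```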

If an expression is used multiple times, the kernel associated with it should be cached, so
that regenerating and recompiling the kernel is not necessary every time the expression is
evaluated. Instead, the kernel is generated only on first use. A cached kernel can also be
reused between instances of expressions consisting of the same operations but operating on
different data, even if the matrices have different sizes.

Caching is global within a program. In other words, cached kernels are only deleted when the
program using them terminates. Through templated classes we achieve that every possible
expression has a different type. So any kernel generated by the kernel generator can be
uniquely identified by a pair that includes the type of the expression and the type of the
result. That allows us to avoid a map or a similar object to implement the cache. Instead, the
cached kernel can be a static member of a struct that is templated on both the type of the
expression and the type of the result.

# Example and reference-level explanation

Binary and unary operations are defined as templated classes inheriting from a base `operation`
class. The `operation` class is a CRTP interface layer which defines the necessary operations
that `Derived` classes need to implement. The class `binary_operation` is the base that all
other binary operations are derived from. For example, addition is defined as

```cpp
template <typename T_a, typename T_b>
class addition__ : public binary_operation<addition__<T_a, T_b>, T_a, T_b> {
 public:
  addition__(T_a&& a, T_b&& b)
      : binary_operation<addition__<T_a, T_b>, T_a, T_b>(
            std::forward<T_a>(a), std::forward<T_b>(b), "+") {}
};

template <typename T_a, typename T_b,
          typename = require_all_valid_expressions_t<T_a, T_b>>
inline addition__<as_operation_t<T_a>, as_operation_t<T_b>> operator+(T_a&& a, T_b&& b) {
  return {as_operation(std::forward<T_a>(a)),
          as_operation(std::forward<T_b>(b))};
}
```

In the above we have the classic CRTP inheritance style for expression templates, a
perfect-forwarding constructor, and an `operator+` that the user calls. The constructor above
calls the `binary_operation` constructor while passing the simple operator `+`. The `+` is
stored as `op_` in the `binary_operation` instantiation and used during kernel construction in
the `generate()` method of `binary_operation`. The `generate()` method for a binary operation
is not called until the assignment operator to a `matrix_cl` is called. Once the assignment
operator is called, the expression template class's `eval()` method is called, which in turn
calls `operation`'s `evaluate_into()`, which finally calls `generate()` to take the
expression's matrices and operations and create the appropriate OpenCL kernel.
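
That call chain can be sketched in miniature, with hypothetical `expression_sketch` and
`matrix_cl_sketch` classes standing in for the real ones:

```cpp
#include <string>

// Stand-in expression: eval() -> evaluate_into() -> generate(), mirroring the
// chain described above. generate() returns the kernel body it would emit.
struct expression_sketch {
  std::string generate() const { return "double var4 = var2 + var3;\n"; }
  std::string evaluate_into() const { return generate(); }
  std::string eval() const { return evaluate_into(); }
};

// Stand-in result matrix: assignment is what triggers kernel generation.
struct matrix_cl_sketch {
  std::string last_kernel;
  matrix_cl_sketch& operator=(const expression_sketch& e) {
    last_kernel = e.eval();
    return *this;
  }
};
```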
These templates are instantiated as shown in the following example:

```cpp
matrix_cl<double> a, b;
double c;
matrix_cl<double> d = c * (a + b);
```

In this example `a` and `b` are first wrapped in a matrix access operation called `load_`.
It is templated with the type of its argument, which can be either a value or a reference
(the class implements perfect forwarding). In this example both `a` and `b` are lvalues, so we
get `load_<matrix_cl<double>&>` in both cases. This happens within `operator+`. This operator
returns an addition object that references its parameters; its type is
`addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>`.

`operator*` wraps the scalar in a scalar-access operation that is templated with the type of
the scalar (`scalar_<double>`) and constructs an element-wise multiplication object that
references it and the addition as its arguments, resulting in an object of type
`elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>`.

When assigned to a matrix, the expression generates the OpenCL kernel. Internally in
`operator=()`, `d` is also wrapped into `load_<matrix_cl<double>&>`. Matrix access operations
generate code for loading matrix elements from global memory. Addition generates code for
adding them together. Multiplication multiplies the result with the scalar. `d`'s wrapper
generates code for storing the result back to global memory.

When the `evaluate_into()` method is called for this expression, it begins creating the kernel
signature and body. The `scalar_<double>` will add `[TYPE] [NAME]` to the kernel signature.
The name here is generated to be unique for each operation in the kernel. The type is the
template parameter of `scalar_`, in this case `double`.

The `load_<matrix_cl<double>&>` will add `__global [TYPE]* [NAME]_global, int [NAME]_rows,
int [NAME]_view` to the kernel signature. The kernel body will define a private variable and
then assign the thread's array cell into the private variable for that particular object. That
assignment is skipped if the thread's array cell is in the part of a triangular matrix that is
equal to 0. The pattern looks as follows:

```opencl
double [NAME] = 0;
if (!((!contains_nonzero([NAME]_view, LOWER) && j < i) ||
      (!contains_nonzero([NAME]_view, UPPER) && j > i))) {
  [NAME] = [NAME]_global[i + [NAME]_rows * j];
}
```

The `generate()` method for binary operations adds their `op_` to the OpenCL kernel, working
on the named private variables of their arguments and creating a private variable for the
result of the compound expression. In the example above, `elewise_multiplication_` and
`addition_` would generate the following code in the kernel body:

```opencl
double var4 = var2 + var3;
double var5 = var1 * var4;
```

Finally, the assignment to a `matrix_cl<T>` creates a `matrix_cl<T>` of the appropriate size,
which is added to the end of the signature and assigned into:

```opencl
var6_global[i + var6_rows * j] = var5;
```

Putting this all together, we can see that the kernel for
`elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>`
decomposes into:

```opencl
kernel void calculate(double var1,
                      __global double* var2_global, int var2_rows, int var2_view,
                      __global double* var3_global, int var3_rows, int var3_view,
                      __global double* var6_global, int var6_rows, int var6_view){
  int i = get_global_id(0);
  int j = get_global_id(1);
  double var2 = 0;
  if (!((!contains_nonzero(var2_view, LOWER) && j < i) ||
        (!contains_nonzero(var2_view, UPPER) && j > i))) {
    var2 = var2_global[i + var2_rows * j];
  }
  double var3 = 0;
  if (!((!contains_nonzero(var3_view, LOWER) && j < i) ||
        (!contains_nonzero(var3_view, UPPER) && j > i))) {
    var3 = var3_global[i + var3_rows * j];
  }
  double var4 = var2 + var3;
  double var5 = var1 * var4;
  var6_global[i + var6_rows * j] = var5;
}
```

After the kernel is compiled, it is cached in
`static operation<elewise_multiplication_<scalar_<double>, addition_<load_<matrix_cl<double>&>, load_<matrix_cl<double>&>>>>::cache<load_<matrix_cl<double>&>>::kernel`.

# Drawbacks
[drawbacks]: #drawbacks

Even with caching, each kernel must be compiled the first time it is used. If many kernels are
used and each only once, long compilation times could make this slower than one kernel per
operation. This is not an issue in Stan, since many leapfrog steps are executed and the same
kernels are reused across them.

# Prior art
[prior-art]: #prior-art

Expression templates are widely used in Eigen.

TensorFlow XLA is an experimental kernel fusion feature of TensorFlow.