@@ -88,6 +88,22 @@ platforms, including Windows, Linux, macOS, iOS[^1], Android, and the web[^2].
By using Rust GPU and `wgpu`, we have a clean, portable setup with everything written in
Rust.

+ ## GPU program basics
+
+ The smallest unit of execution is a thread, which executes the GPU program.
+
+ Workgroups are groups of threads that run in parallel and can access the same shared
+ memory. (They’re called [thread blocks in
+ CUDA](<https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)>).)
+
+ We can dispatch many of these workgroups at once. CUDA calls this a grid (which is made
+ of thread blocks).
+
+ Both workgroup sizes and dispatch sizes are defined in 3D. The size of a workgroup is
+ defined by `compute(threads(x, y, z))`, where the number of threads per workgroup is
+ x \* y \* z.
+
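As a rough sketch of that arithmetic, here is how the thread counts work out on the CPU side. The helper names (`volume`, `threads_per_workgroup`, `total_threads`) are hypothetical and not part of Rust GPU or `wgpu`; plain `[u32; 3]` arrays stand in for the 3D sizes.

```rust
/// Hypothetical helper: multiply the three dimensions of a 3D size.
/// Widens to u64 so the product of realistic sizes does not overflow.
fn volume(dims: [u32; 3]) -> u64 {
    dims.iter().map(|&d| u64::from(d)).product()
}

/// Threads in a single workgroup of size (x, y, z): x * y * z.
fn threads_per_workgroup(workgroup_size: [u32; 3]) -> u64 {
    volume(workgroup_size)
}

/// Threads launched by one dispatch: threads per workgroup times the
/// number of workgroups dispatched (also a 3D size).
fn total_threads(workgroup_size: [u32; 3], dispatch: [u32; 3]) -> u64 {
    volume(workgroup_size) * volume(dispatch)
}
```

For example, a workgroup of size (16, 16, 1) has 256 threads, and dispatching a (4, 4, 1) grid of those workgroups launches 4096 threads in total.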
## Writing the kernel

### Kernel 1: Naive kernel
@@ -159,6 +175,32 @@ examples.

:::

+ Since each workgroup is only one thread, it processes one `result[i, j]`.
+
+ To calculate the full matrix, we need to launch as many workgroups as there are entries
+ in the matrix. Here we specify that (`UVec3::new(m * n, 1, 1)`) on the CPU:
+
+ import { RustNaiveWorkgroupCount } from './snippets/naive.tsx';
+
+ <RustNaiveWorkgroupCount />
+
+ The `dispatch_count()` function runs on the CPU and is used by the CPU-to-GPU API (in
+ our case `wgpu`) to configure and dispatch work to the GPU:
+
+ import { RustNaiveDispatch } from './snippets/naive.tsx';
+
+ <RustNaiveDispatch />
+
+ ::: warning
+
+ This code appears more complicated than it needs to be. I abstracted the CPU-side code
+ that talks to the GPU using generics and traits so I could easily slot in different
+ kernels and their settings while writing the blog post.
+
+ You could just hardcode the value for simplicity.
+
+ :::
+
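To make the "one workgroup per entry" idea concrete, here is a minimal sketch of the hardcoded version. Both function names are hypothetical stand-ins, a plain `[u32; 3]` array replaces the vector type used in the post, and the row-major mapping from workgroup index to matrix entry is an assumption; the actual kernel may index differently.

```rust
/// Hypothetical stand-in for the post's `dispatch_count()`:
/// one workgroup per entry of the m x n result matrix.
/// Assumes m * n fits in a u32.
fn naive_dispatch_count(m: u32, n: u32) -> [u32; 3] {
    [m * n, 1, 1]
}

/// Assumed row-major mapping from a flat workgroup index to the
/// (row, column) of the `result` entry that workgroup computes.
fn entry_for_workgroup(index: u32, n: u32) -> (u32, u32) {
    (index / n, index % n)
}
```

For a 4x8 result this dispatches 32 workgroups, and under the assumed mapping workgroup 10 would compute `result[1, 2]`.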
### Kernel 2: Moarrr threads!

With the first kernel, we're only able to compute small square matrices due to limits on
@@ -187,26 +229,12 @@ import { RustWorkgroup256WorkgroupCount } from './snippets/workgroup_256.tsx';

<RustWorkgroup256WorkgroupCount />

- The `dispatch_count()` function runs on the CPU and is used by the CPU-to-GPU API (in
- our case `wgpu`) to configure and dispatch to the GPU:
-
- import { RustWorkgroup256WgpuDispatch } from './snippets/workgroup_256.tsx';
-
- <RustWorkgroup256WgpuDispatch />
-
- ::: warning
-
- This code appears more complicated than it needs to be. I abstracted the CPU-side code
- that talks to the GPU using generics and traits so I could easily slot in different
- kernels and their settings while writing the blog post.
-
- You could just hardcode a value for simplicity.
-
- :::
+ With these two small changes we can handle larger matrices without hitting hardware
+ workgroup limits.
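
A quick sketch of why this helps, assuming 256 threads per workgroup (inferred only from the `workgroup_256` snippet names) and one output entry per thread. The constant and function names are hypothetical, not taken from the post's code.

```rust
/// Assumed workgroup size, inferred from the `workgroup_256` snippet names.
const WORKGROUP_SIZE: u32 = 256;

/// Workgroups needed to cover every entry of an m x n result when each
/// thread handles one entry. Rounds up so a partial workgroup is still
/// dispatched. Assumes m * n fits in a u32.
fn workgroups_needed(m: u32, n: u32) -> u32 {
    (m * n).div_ceil(WORKGROUP_SIZE)
}

/// Threads in the last, partially filled workgroup fall outside the
/// matrix, so a kernel written this way would need a bounds check.
fn in_bounds(global_index: u32, m: u32, n: u32) -> bool {
    global_index < m * n
}
```

Under these assumptions a 256x256 result needs 256 workgroups rather than 65,536, which is the kind of reduction that keeps the dispatch under the workgroup count limit.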

### Kernel 3: Calculating with 2D workgroups

- However doing all the computation in "1 dimension" limits the matrix size we can
+ However, doing all the computation in "1 dimension" still limits the matrix size we can
calculate.

Although we don't change much about our code, if we distribute our work in 2 dimensions
@@ -257,24 +285,29 @@ import { RustTiling2dSimd } from './snippets/tiling_2d_simd.tsx';
Each thread now calculates a 4x4 grid of the output matrix and we see a slight
improvement over the last kernel.

+ To stay true to the spirit of Zach's original blog post, we'll wrap things up here and
+ leave the "fancier" experiments for another time.
+
## Reflections on porting to Rust GPU

Porting to Rust GPU went quickly, as the kernels Zach used were fairly simple. Most of
the time was spent with concerns that were not specifically about writing GPU code. For
example, deciding how much to abstract vs how much to make the code easy to follow, if
everything should be available at runtime or if each kernel should be a compilation
- target, etc. The code is not _great_ as it is still blog post code!
+ target, etc. [The
+ code](https://github.com/Rust-GPU/rust-gpu.github.io/tree/main/blog/2024-11-21-optimizing-matrix-mul/code)
+ is not _great_ as it is still blog post code!

My background is not in GPU programming, but I do have Rust experience. I joined the
Rust GPU project because I tried to use standard GPU languages and knew there must be a
better way. Writing these GPU kernels felt like writing any other Rust code (other than
- debugging, more on that later) which is a huge win to me. Not only the language itself,
+ debugging, more on that later) which is a huge win to me. Not just the language itself,
but the entire development experience.

## Rust-specific party tricks

Rust lets us write code for both the CPU and GPU in ways that are often impossible—or at
- least less elegant—with other languages. I'm going to highlight some benefits of Rust I
+ least less elegant—with other languages. I'm going to highlight some benefits I
experienced while working on this blog post.

### Shared code across GPU and CPU
@@ -351,8 +384,9 @@ Testing the kernel in isolation is useful, but it does not reflect how the GPU e
it with multiple invocations across workgroups and dispatches. To test the kernel
end-to-end, I needed a test harness that simulated this behavior on the CPU.

- Building the harness was straightforward. By enforcing the same invariants as the GPU I
- could validate the kernel under the same conditions the GPU would run it:
+ Building the harness was straightforward thanks to the borrow checker. By enforcing the
+ same invariants as the GPU, I could validate the kernel under the same conditions the
+ GPU would run it:

import { RustCpuBackendHarness } from './snippets/party.tsx';

@@ -484,10 +518,9 @@ future.
This kernel doesn't use conditional compilation, but it's a key feature of Rust that
works with Rust GPU. With `#[cfg(...)]`, you can adapt kernels to different hardware or
configurations without duplicating code. GPU languages like WGSL or GLSL offer
- preprocessor directives, but these tools lack standardization across ecosystems. Rust
- GPU leverages the existing Cargo ecosystem, so conditional compilation follows the same
- standards all Rust developers already know. This makes adapting kernels for different
- targets easier and more maintainable.
+ preprocessor directives, but these tools lack standardization across projects. Rust GPU
+ leverages the existing Cargo ecosystem, so conditional compilation follows the same
+ standards all Rust developers already know.
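
As an illustration only (the kernels in this post don't do this), here is a minimal sketch of how `#[cfg(...)]` could pick a setting per target. `TILE_SIZE` and `tiles_per_side` are hypothetical names and the values are arbitrary; `target_arch = "spirv"` is the cfg that is set when Rust GPU compiles code for the GPU.

```rust
/// Hypothetical tile size used when compiling for the GPU with Rust GPU.
#[cfg(target_arch = "spirv")]
pub const TILE_SIZE: u32 = 16;

/// Hypothetical tile size used when the same code is compiled for the CPU,
/// for example in tests.
#[cfg(not(target_arch = "spirv"))]
pub const TILE_SIZE: u32 = 4;

/// Shared logic that compiles for either target and picks up whichever
/// constant was selected above.
pub fn tiles_per_side(n: u32) -> u32 {
    n.div_ceil(TILE_SIZE)
}
```

The same mechanism works with Cargo features (`#[cfg(feature = "...")]`), so a kernel variant can be switched from `Cargo.toml` instead of through a custom preprocessing step.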

## Come join us!
