====
GPUs
====

.. .. contents::


GPUs are supported in Charm4py via the Charm++ HAPI (Hybrid API) interface.
Presently, this support allows asynchronous detection of GPU kernel completion via Charm4py futures,
using the function ``charm.hapiAddCudaCallback``.

The Charm4py HAPI interface is:

.. code-block:: python

    def hapiAddCudaCallback(stream, future)

.. note::

    For now, ``charm.hapiAddCudaCallback`` only supports numba and torch streams as input. This function inserts a callback
    into the stream such that when the callback is reached, the corresponding Charm4py future is set.

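In typical use, the caller creates a future, registers it on the stream after queuing GPU work, and then waits on it. A minimal sketch (``stream`` here stands for the stream handle, as in the full example below):

.. code-block:: python

    fut = charm.Future()
    charm.hapiAddCudaCallback(stream, fut)
    fut.get()  # resumes once the GPU reaches the callback
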
Enabling HAPI
-------------
To build Charm4py with HAPI support, add "cuda" to the Charm++ build options and follow the usual steps to build Charm4py from source:

.. code-block:: shell

    export CHARM_EXTRA_BUILD_OPTS="cuda"
    pip install .

.. warning::

    To ensure that the underlying Charm++ build has CUDA enabled, remove any pre-existing builds in ``charm_src/charm`` before setting the CUDA option and running the install.

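For example, a clean CUDA-enabled rebuild might look like the following sketch. The build directory name shown is hypothetical; it depends on your platform and chosen network layer:

.. code-block:: shell

    # run from the root of the Charm4py source tree
    rm -rf charm_src/charm/netlrts-linux-x86_64   # example build directory; yours may differ
    export CHARM_EXTRA_BUILD_OPTS="cuda"
    pip install .
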
Examples
--------

.. code-block:: python

    from charm4py import charm
    import time
    import numba.cuda as cuda
    import numpy as np

    @cuda.jit
    def elementwise_sum_kernel(x_in, x_out):
        # each thread doubles one element of the input array
        idx = cuda.grid(1)
        if idx < x_in.shape[0]:
            x_out[idx] = x_in[idx] + x_in[idx]

    def main(args):
        N = 1_000_000
        array_size = (N,)

        s = cuda.stream()
        stream_handle = s.handle.value  # raw CUDA stream handle passed to HAPI

        A_host = np.arange(N, dtype=np.float32)

        # allocate device arrays and copy the input to the GPU asynchronously
        A_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        B_gpu = cuda.device_array(array_size, dtype=np.float32, stream=s)
        A_gpu.copy_to_device(A_host, stream=s)

        threads_per_block = 128
        blocks_per_grid = (N + (threads_per_block - 1)) // threads_per_block

        print("Launching kernel and inserting callback...")
        start_time = time.perf_counter()
        elementwise_sum_kernel[blocks_per_grid, threads_per_block, s](A_gpu, B_gpu)

        # insert the callback into the stream; the future is set when it is reached
        return_fut = charm.Future()
        charm.hapiAddCudaCallback(stream_handle, return_fut)
        return_fut.get()  # suspends until the kernel has completed
        kernel_done_time = time.perf_counter()
        print(f"Callback received, kernel finished in {kernel_done_time - start_time:.6f} seconds.")

        B_host = B_gpu.copy_to_host(stream=s)

        s.synchronize()

        sum_result = np.sum(B_host)
        print(f"Sum of result is {sum_result}")

        charm.exit()

    charm.start(main)


The above example demonstrates how to use the Charm4py HAPI interface to insert a callback into a CUDA stream and track
completion of a numba kernel launch.
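
Since ``charm.hapiAddCudaCallback`` also accepts torch streams, the same pattern works with PyTorch. The following is a minimal sketch (not part of the original examples), assuming a CUDA-enabled PyTorch installation; mirroring the numba example above, it passes the stream's raw handle via the ``cuda_stream`` attribute:

.. code-block:: python

    from charm4py import charm
    import torch

    def main(args):
        s = torch.cuda.Stream()

        with torch.cuda.stream(s):
            x = torch.ones(1_000_000, device='cuda')
            y = x + x  # all work queued on stream s

        # set a future once all work queued on the stream has completed
        done = charm.Future()
        charm.hapiAddCudaCallback(s.cuda_stream, done)
        done.get()

        print(f"Sum of result is {y.sum().item()}")
        charm.exit()

    charm.start(main)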