Alternatively, when available, one could use a `Makefile` set up to use the [Cray compilers](https://docs.lumi-supercomputer.eu/development/compiling/prgenv/#using-hipcc).
## Running

In order to execute the HIP application on GPU nodes we submit it to the partition `dev-g` (`-p` flag). We also have to specify the running options, such as the number of GPUs needed, the number of MPI tasks, and the number of CPU cores per MPI task. As an example, consider a job with 2 GPUs (`--gpus`), running on 1 node (`-N`), with 2 MPI tasks (`-n`) and 4 cores per MPI task (`-c`).
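
A minimal sketch of such a submission is shown below; the executable name, project account, and time limit are placeholders to adapt to your own job:

```
srun -p dev-g --gpus 2 -N 1 -n 2 -c 4 --time=00:10:00 --account=<project_id> ./hello_world
```
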
Modify this according to the needs of the job. Note that the modules should be loaded in the terminal which is used for launching the job.

If a reservation is available, add `--reservation=<res_name>` to use the specific nodes dedicated to the course.
## Exercise instructions
The best way to learn programming is to get our hands dirty. Use the example codes in this folder to reproduce the problems presented in the [Non-portable kernel-based models](https://enccs.github.io/gpu-programming/9-non-portable-kernel-models/) episode.
Here are some suggestions for playing around:
* check the GPU assignment in the "Hello World" example.
    - try 1 MPI task with multiple GPUs and set the code to use something different from the default `device 0` (a minimal device-selection sketch is given after this list)
    - try P nodes with N (<=8) MPI tasks per node, with each MPI task being assigned a different GPU.
* check the vector addition with device memory and with unified memory
* implement the matrix transpose and compute the effective bandwidths achieved on LUMI GPUs
* implement a code using 1 GPU and do a reduction on a vector
* based on the [CUDA blog streams](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/) try to implement a code doing vector additions, using streams to overlap data transfers and computations.
* based on the [CUDA blog reduction](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) try to implement a code doing a sum reduction.
* check on LUMI the memory bandwidth in the [memory transpose](https://github.com/ENCCS/gpu-programming/tree/main/content/examples/cuda-hip/hip/04_matrix_transpose) example.
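
For the first suggestion above (checking the GPU assignment), a minimal hedged sketch of selecting a device per MPI rank is given here; the variable names and the round-robin mapping are illustrative choices, not the course's reference solution.

```
#include <cstdio>
#include <mpi.h>
#include <hip/hip_runtime.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int num_devices = 0;
    hipGetDeviceCount(&num_devices);

    // pick something other than the default device 0, e.g. round-robin over the ranks
    int my_device = (num_devices > 0) ? rank % num_devices : 0;
    hipSetDevice(my_device);

    printf("MPI rank %d uses GPU %d out of %d visible devices\n",
           rank, my_device, num_devices);

    MPI_Finalize();
    return 0;
}
```

Launched with 2 MPI tasks and 2 GPUs as in the `srun` example above, each rank should then report a different device.
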
# Matrix Transpose
Assuming a matrix `a` of size `(NxM)`, how can we improve matrix operations on the GPU? In particular, consider the transpose operation `b(i,j)=a(j,i)`. We will compare the execution times and the effective bandwidths of a simple `copy` kernel, a `naive` transpose implementation, and two more optimized versions using `shared memory` (with and without bank conflicts). The time is measured using events. The effective bandwidth is computed as the ratio between the total memory read and written by the kernel (`2 x` the total size of the matrix in GBytes) and the execution time in seconds.
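
As a hedged illustration of this measurement (the kernel, matrix size, and variable names below are placeholders rather than the code of this folder), the event timing and the bandwidth formula could be put together like this:

```
#include <cstdio>
#include <hip/hip_runtime.h>

__global__ void dummy_copy(const float *in, float *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i];
}

int main() {
  const int width = 4096, height = 4096;
  const int n = width * height;

  float *d_in, *d_out;
  hipMalloc((void **)&d_in, n * sizeof(float));
  hipMalloc((void **)&d_out, n * sizeof(float));

  hipEvent_t start, stop;
  hipEventCreate(&start);
  hipEventCreate(&stop);

  hipEventRecord(start, 0);
  dummy_copy<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
  hipEventRecord(stop, 0);
  hipEventSynchronize(stop);

  float ms = 0.0f;
  hipEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds

  // 2 x matrix size in GBytes (one read + one write) divided by the time in seconds
  double gbytes = 2.0 * n * sizeof(float) / 1.0e9;
  printf("time %g ms, effective bandwidth %g GB/s\n", ms, gbytes / (ms / 1.0e3));

  hipEventDestroy(start);
  hipEventDestroy(stop);
  hipFree(d_in);
  hipFree(d_out);
  return 0;
}
```

The same event pattern wraps each of the kernels below; only the kernel launch and the bytes-moved estimate change.
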
## Copy kernel
The baseline for our experiment is the simple copy kernel:
```
__global__ void copy_kernel(float *in, float *out, int width, int height) {
  // global indices of the element handled by this thread
  // (blocks of tile_dim x tile_dim threads are assumed)
  int x_index = blockIdx.x * tile_dim + threadIdx.x;
  int y_index = blockIdx.y * tile_dim + threadIdx.y;

  int index = y_index * width + x_index;

  out[index] = in[index];
}
```
This kernel only copies the data from the input matrix to the output matrix. No optimizations are needed except for minor tuning of the number of threads per block. All reads from and writes to the GPU memory are coalesced, so this is the maximum bandwidth one could achieve in a kernel on a given machine.
## Naive transpose
This is the first transpose version, in which the reads are done in a coalesced way but the writes are not.
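
A hedged sketch of what such a naive kernel could look like (the kernel name and the `tile_dim` constant are assumptions; see the sources in this folder for the exact code):

```
#include <hip/hip_runtime.h>

constexpr int tile_dim = 16;  // assumed: blocks of tile_dim x tile_dim threads

// reads from `in` are coalesced, writes to `out` are strided (uncoalesced)
__global__ void transpose_naive_kernel(float *in, float *out, int width, int height) {
  int x_index = blockIdx.x * tile_dim + threadIdx.x;
  int y_index = blockIdx.y * tile_dim + threadIdx.y;

  int in_index = y_index * width + x_index;
  int out_index = x_index * height + y_index;

  out[out_index] = in[in_index];
}
```
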
## Transpose with shared memory

The shared memory is local to each CU, with a latency about 100 times smaller than that of the global memory. While an extra synchronization is needed to ensure that the data has been saved locally, the gain from switching from uncoalesced to coalesced accesses outweighs the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free of bank conflicts, the second one `out[out_index] = tile[threadIdx.x][threadIdx.y];` is not. When bank conflicts occur the access to the data is serialized. Even so, the gain from using SM is quite big.
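
A hedged sketch of this shared-memory version, using exactly the two tile accesses quoted above (the kernel name and the `tile_dim` value are assumptions):

```
#include <hip/hip_runtime.h>

constexpr int tile_dim = 16;  // assumed: blocks of tile_dim x tile_dim threads

__global__ void transpose_SM_kernel(float *in, float *out, int width, int height) {
  __shared__ float tile[tile_dim][tile_dim];

  int x_in = blockIdx.x * tile_dim + threadIdx.x;
  int y_in = blockIdx.y * tile_dim + threadIdx.y;
  int in_index = y_in * width + x_in;

  int x_out = blockIdx.y * tile_dim + threadIdx.x;
  int y_out = blockIdx.x * tile_dim + threadIdx.y;
  int out_index = y_out * height + x_out;

  tile[threadIdx.y][threadIdx.x] = in[in_index];    // coalesced read from global memory
  __syncthreads();                                  // wait until the whole tile is loaded
  out[out_index] = tile[threadIdx.x][threadIdx.y];  // coalesced write to global memory
}
```
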
## Transpose with shared memory and no bank conflicts
The bank conflicts in this case can be solved in a very simple way: we pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts the data in the banks. Hopefully this does not create other bank conflicts!

For the optimization exercise, get acquainted with the code, compile it, and run it. For each case try to tune the threads per block (by changing `tile_dim`) and find the configurations which improve the performance the most, and also the ones which do not. As a reference, the `V100` has 84 Streaming Multiprocessors (the NVIDIA equivalent of CUs) and a peak bandwidth of `900 GB/s`.

In this exercise it is fairly intuitive what needs to be done to improve the performance. Measuring the time with events is sufficient, but in general, in order to obtain more information about how the various parts of the application behave, a **profiler** is recommended. `HIP` does not provide profilers itself; they are provided by the backend on top of which it runs. On NVIDIA platforms we can use the tools [Nsight Systems](https://docs.csc.fi/computing/nsys/) and [Nsight Compute](https://docs.csc.fi/computing/ncu/). On AMD platforms one can try [rocprof](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) or [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest/).
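
A minimal hedged invocation could look like the following (the binary name is only a placeholder):

```
# AMD backend: per-kernel timing statistics
rocprof --stats ./matrix_transpose

# NVIDIA backend: timeline trace for Nsight Systems
nsys profile -o transpose_report ./matrix_transpose
```
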