
Commit f897b46

Fix typos in README.md's
1 parent 7a8dcfd commit f897b46

2 files changed (+10 / -10 lines)

content/examples/cuda-hip/README.md

Lines changed: 5 additions & 5 deletions
@@ -15,25 +15,25 @@ hipcc -O2 --offload-arch=gfx90a <code>.cpp
```
Alternatively, when available, one could use the `Makefile` set up to use the [cray compilers](https://docs.lumi-supercomputer.eu/development/compiling/prgenv/#using-hipcc).
## Running
- In order to execute the HIP application on GPU nodes we submit it to the partition `dev-g` (`-p` flag). We have specify as well the running options like number of gpus needed, mpi tasks and CPU core per MPI taks. Below we have an example of a job with 2 GPUs (`--gpus`, 1 node (`-N`), 2 MPI tasks (`-n`) and 4 cores per MPI task (`-c`):
+ In order to execute the HIP application on GPU nodes, we submit it to the `dev-g` partition (`-p` flag). We also have to specify the running options, such as the number of GPUs needed, the number of MPI tasks, and the number of CPU cores per MPI task. Below is an example of a job with 2 GPUs (`--gpus`), 1 node (`-N`), 2 MPI tasks (`-n`) and 4 cores per MPI task (`-c`):

```
srun -p dev-g --gpus 2 -N 1 -n 2 -c 4 --time=00:10:00 --account=project_465001310 ./a.out
```
- Modify this according to the neeeds of the job. Note that the modules should be loaded in the terminal which is used for launching the job.
+ Modify this according to the needs of the job. Note that the modules should be loaded in the terminal which is used for launching the job.

If a reservation is available add `--reservation=<res_name>` to use the specific nodes dedicated to the course.
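Alternatively, the same resources can be requested through a batch script submitted with `sbatch`. A minimal sketch (reusing the account, partition and executable name from the example above; adjust them to your own job) could look like this:

```
#!/bin/bash
#SBATCH --partition=dev-g        # GPU development partition
#SBATCH --nodes=1                # 1 node
#SBATCH --ntasks=2               # 2 MPI tasks
#SBATCH --cpus-per-task=4        # 4 CPU cores per MPI task
#SBATCH --gpus=2                 # 2 GPUs for the job
#SBATCH --time=00:10:00
#SBATCH --account=project_465001310

srun ./a.out
```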

## Exercise instructions
- The best way to learn programming is to get our hands dirty. Use the example codes in this folder to repoduce the problems presented in the [Non-portable kernel-based models](https://enccs.github.io/gpu-programming/9-non-portable-kernel-models/) episode.
+ The best way to learn programming is to get our hands dirty. Use the example codes in this folder to reproduce the problems presented in the [Non-portable kernel-based models](https://enccs.github.io/gpu-programming/9-non-portable-kernel-models/) episode.

Here are some suggestions for playing around:
* check the GPU assignment in the "Hello World" example.
- - try 1 MPI taks with multiple GPUs and set the code to use something different from the default `device 0`
+ - try 1 MPI task with multiple GPUs and set the code to use something different from the default `device 0` (see the device-selection sketch after this list)
- try P nodes with N(<=8) MPI tasks per node with each MPI task being assigned a different GPU.
* check the vector addition with device memory and with unified memory
* implement the matrix transpose and compute the effective bandwidths achieved on LUMI GPUs
* implement a code using 1 GPU and do a reduction on a vector
- * based on the [CUDA blog streams](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/) tray to implement a code doing vector additions using streams ovelap data transfers and computations.
+ * based on the [CUDA blog streams](https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/) try to implement a code doing vector additions, using streams to overlap data transfers and computations.
* based on the [CUDA blog reduction](https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf) try to implement a code doing a sum reduction.
* check on LUMI the memory bandwidth in the [memory transpose](https://github.com/ENCCS/gpu-programming/tree/main/content/examples/cuda-hip/hip/04_matrix_transpose) example.
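As a starting point for the device-assignment suggestions above, here is a minimal sketch (not taken from the course examples; the names are illustrative) of how each MPI task can select a GPU other than the default `device 0`:

```
#include <cstdio>
#include <mpi.h>
#include <hip/hip_runtime.h>

// Minimal sketch: map each MPI rank to a GPU in round-robin fashion.
// With 1 MPI task and several GPUs, changing `my_device` selects a
// device different from the default device 0.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank = 0, num_devices = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    hipGetDeviceCount(&num_devices);

    int my_device = rank % num_devices;   // rank-to-GPU mapping
    hipSetDevice(my_device);

    printf("MPI rank %d is using GPU %d of %d\n", rank, my_device, num_devices);

    MPI_Finalize();
    return 0;
}
```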

content/examples/cuda-hip/hip/04_matrix_transpose/README.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
# Matrix Transpose

- Assuming a matrix `a` of size `(NxM)` how to improve matrix operations on GPU? In particular the transposing operation `b(i,j)=a(j,i)`. We will compare the execution times and the effective bandwidth between a simple `copy` kernel, a `naive` transpose implementation, and two more optimized versions using `shared memory` (with and without bank conflicts). The time is mesured using the `events`. The effective bandwidth is computed as the ratio between the total memory read and written by the kernel (`2 x Total size of the Matrix in Gbytes`) and the execution time in seconds.
+ Assuming a matrix `a` of size `(NxM)`, how can we improve matrix operations on the GPU, in particular the transpose operation `b(i,j)=a(j,i)`? We will compare the execution times and the effective bandwidths of a simple `copy` kernel, a `naive` transpose implementation, and two more optimized versions using `shared memory` (with and without bank conflicts). The time is measured using `events`. The effective bandwidth is computed as the ratio between the total memory read and written by the kernel (`2 x total size of the matrix in GBytes`) and the execution time in seconds.
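As a rough illustration of both points (event-based timing and the bandwidth formula), the host-side measurement could look like the sketch below. The helper name, the launch configuration and the use of the `copy_kernel` from the section below are placeholders, not the actual course code:

```
#include <hip/hip_runtime.h>

// Hypothetical helper: times one launch of the copy kernel and returns the
// effective bandwidth in GB/s, using the "2 x matrix size / time" definition
// from the text. Assumes copy_kernel(float*, float*, int, int) is declared.
float measure_bandwidth(dim3 grid, dim3 block, float *d_in, float *d_out,
                        int width, int height) {
    hipEvent_t start, stop;
    hipEventCreate(&start);
    hipEventCreate(&stop);

    hipEventRecord(start, 0);
    copy_kernel<<<grid, block>>>(d_in, d_out, width, height);
    hipEventRecord(stop, 0);
    hipEventSynchronize(stop);

    float ms = 0.0f;
    hipEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

    hipEventDestroy(start);
    hipEventDestroy(stop);

    // Total traffic: the matrix is read once and written once.
    double gbytes = 2.0 * width * height * sizeof(float) / 1.0e9;
    return static_cast<float>(gbytes / (ms / 1000.0));   // GB/s
}
```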

## Copy kernel
The baseline for our experiment is the simple copy kernel.
@@ -14,7 +14,7 @@ __global__ void copy_kernel(float *in, float *out, int width, int height) {
    out[index] = in[index];
}
```
- This kernel is only reading the data from the input matrix to the output matrix. No optimizations are needed except for minor tuning in the number of threads per block. All reads from and writes to the GPU memory are coalesced and it is maximum bandwidth that one could achive on a given machine in a kernel.
+ This kernel only copies the data from the input matrix to the output matrix. No optimizations are needed except for minor tuning of the number of threads per block. All reads from and writes to GPU memory are coalesced, so its bandwidth is the maximum that a kernel could achieve on a given machine.

## Naive transpose
This is the first transpose version, in which the reads are done in a coalesced way, but the writes are not.
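The actual kernel is in the example code; as a rough sketch of the idea (assuming row-major storage and the same argument convention as the copy kernel above), the naive transpose could look like this:

```
__global__ void transpose_naive_kernel(float *in, float *out, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column in the input
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row in the input

    if (x < width && y < height) {
        // The read in[y * width + x] is coalesced (consecutive threads read
        // consecutive addresses), but the write strides by `height`, so the
        // writes to global memory are not coalesced.
        out[x * height + y] = in[y * width + x];
    }
}
```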
@@ -55,7 +55,7 @@ __global__ void transpose_SM_kernel(float *in, float *out, int width,
    out[out_index] = tile[threadIdx.x][threadIdx.y];
}
```
- The shared memory is local to each CU with about 100 time smaller latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.
+ The shared memory is local to each CU, with about 100 times lower latency than the global memory. While an extra synchronization is needed to ensure that the data has been saved locally, the gain from switching from uncoalesced to coalesced accesses outweighs the loss. The reads and writes of SM can be done in any order as long as there are no bank conflicts. While the first SM access, `tile[threadIdx.y][threadIdx.x] = in[in_index];`, is free of bank conflicts, the second one, `out[out_index] = tile[threadIdx.x][threadIdx.y];`, is not. When bank conflicts occur, the access to the data is serialized. Even so, the gain of using SM is quite big.

## Transpose with shared memory and no bank conflicts
The bank conflicts in this case can be solved in a very simple way: we pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts the data in the banks, so that the elements of a column no longer fall into the same bank. Hopefully this does not create other bank conflicts!
@@ -80,8 +80,8 @@ __global__ void transpose_SM_nobc_kernel(float *in, float *out, int width,
}
```

- For the optimizations exercise get aquinted with the code, compile them and execute them. For each case try to tune the threads per block (by changing `tile_dim`) and find the configuration which improve the performance the most and also the ones which do not. As a reference the `V100` has 84 Streaming Multiprocessors (nvidia equivalent of CU) and a peak bandwidth of `900 GB/s`.
+ For the optimization exercise, get acquainted with the codes, compile them and execute them. For each case, try to tune the threads per block (by changing `tile_dim`) and find the configurations which improve the performance the most, as well as the ones which do not. As a reference, the `V100` has 84 Streaming Multiprocessors (the NVIDIA equivalent of CUs) and a peak bandwidth of `900 GB/s`.


- In this exercise it is pretty intuitive what is needed to be done to improve the performance. Measuring the time by events is suficient, but in general in order to obtain more information about how various parts of the application behave a **profiler** is recommended. `HIP` does not provide us with profilers, they are provided by the back end on top of which they are running. On Nvidia platforms we can use the tools [Nsight Systems](https://docs.csc.fi/computing/nsys/) and [Nsight Compute](https://docs.csc.fi/computing/ncu/). On AMD platforms one can try [rocprof](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) or [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest/).
+ In this exercise it is pretty intuitive what needs to be done to improve the performance. Measuring the time with events is sufficient, but in general, in order to obtain more information about how various parts of the application behave, a **profiler** is recommended. `HIP` does not provide profilers itself; they are provided by the back end on top of which it runs. On NVIDIA platforms we can use the tools [Nsight Systems](https://docs.csc.fi/computing/nsys/) and [Nsight Compute](https://docs.csc.fi/computing/ncu/). On AMD platforms one can try [rocprof](https://rocm.docs.amd.com/projects/rocprofiler/en/latest/) or [Omniperf](https://rocm.docs.amd.com/projects/omniperf/en/latest/).
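For reference, both tool families are typically invoked as command-line wrappers around the application binary. A minimal sketch follows (the output name is arbitrary and flags may differ between versions, so check the linked documentation):

```
# NVIDIA back end: timeline profile with Nsight Systems
nsys profile -o transpose_profile ./a.out

# AMD back end (LUMI): per-kernel statistics with rocprof
rocprof --stats ./a.out
```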
