Commit c7aff42

Update README.md
1 parent 1594a91 commit c7aff42

1 file changed: +1 -1 lines changed

content/examples/cuda-hip/hip/04_matrix_transpose/README.md

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ __global__ void transpose_SM_kernel(float *in, float *out, int width,
  out[out_index] = tile[threadIdx.x][threadIdx.y];
}
```
-The shared memory is local to each CU with about 100 time slower latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.
+The shared memory is local to each CU and has about 100 times lower latency than the global memory. While an extra synchronization is needed to ensure that the data has been saved locally, the gain from switching from uncoalesced to coalesced accesses outweighs the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free of bank conflicts, the second one `out[out_index] = tile[threadIdx.x][threadIdx.y];` is not. When bank conflicts occur, the access to the data is serialized. Even so, the gain of using SM is quite big.

## Transpose with shared memory and no bank conflicts
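The bank conflicts in this case can be solved in a very simple way: we pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts each row of the tile by one bank relative to the previous row, so the elements of a tile column no longer fall into the same bank. Whether the padding introduces other bank conflicts depends on the tile size and the number of banks, so it is worth checking for the configuration in use.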
