Update README.md

csccva · web-flow · commit c7aff428d2a5 · 2024-11-11T11:46:53.000+02:00
diff --git a/content/examples/cuda-hip/hip/04_matrix_transpose/README.md b/content/examples/cuda-hip/hip/04_matrix_transpose/README.md
@@ -55,7 +55,7 @@ __global__ void transpose_SM_kernel(float *in, float *out, int width,
   out[out_index] = tile[threadIdx.x][threadIdx.y];
 }
 ``` 
-The shared memory is local to each CU with about 100 time slower latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.  
+The shared memory is local to each CU with about 100 time smaller latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.  
 
 ## Transpose with shared memory and no bank conflicts
 The bank conflicts in this case can be solved in a very simple way. We pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts the data in the banks. Hopefully this does not create other banks conflicts!!!!

Original file line number	Diff line number	Diff line change
`@@ -55,7 +55,7 @@ __global__ void transpose_SM_kernel(float in, float out, int width,`
`55`	`55`	`out[out_index] = tile[threadIdx.x][threadIdx.y];`
`56`	`56`	`}`
`57`	`57`	```
`58`		-The shared memory is local to each CU with about 100 time slower latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.
	`58`	+The shared memory is local to each CU with about 100 time smaller latency than the global memory. While there is an extra synchronization needed to ensure that the data has been saved locally, the gain in switching from uncoalesced to coalesced accesses outweights the loss. The reading and writing of SM can be done in any order as long as there are no bank conflicts. While the first SM access `tile[threadIdx.y][threadIdx.x] = in[in_index];` is free on bank conflicts the secone one `out[out_index] = tile[threadIdx.x][threadIdx.y];`. When bank conflicts occur the access to the data is serialized. Even so the gain of using SM is quite big.
`59`	`59`
`60`	`60`	`## Transpose with shared memory and no bank conflicts`
`61`	`61`	The bank conflicts in this case can be solved in a very simple way. We pad the shared matrix. Instead of `__shared__ float tile[tile_dim][tile_dim];` we use `__shared__ float tile[tile_dim][tile_dim+1];`. Effectively this shifts the data in the banks. Hopefully this does not create other banks conflicts!!!!