This program uses CUDA to modify an image using a specified kernel.
The convolution implemented here is similar to the process explained in this video: How blurs and filters work
This implementation uses a simple uniformly weighted kernel to apply a box blur to an image. It doesn't prioritize any pixel over another, so every weight in the box is the same.
- Constant memory variables are similar to global variables in that they are visible to all threads.
- They are stored in the L2 cache. Access to this location is second only to the L1 cache (very fast access!), as seen in the simplified diagram below:
Programming Massively Parallel Processors: A Hands-on Approach by Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj. Excerpt available here
First, initialize the kernel that will be swept over the image. Since the kernel will be used repeatedly by the GPU to process each section of the original image, we might as well keep it in the GPU cache. By declaring it in constant memory, we make CUDA aggressively cache it.
```c
#define FILTER_SIZE 3
#define FILTER_OFFSET (FILTER_SIZE / 2)   // half the filter width, used to center the kernel on a pixel
__constant__ float FILTER[FILTER_SIZE * FILTER_SIZE];
```
We've indicated it won't change during runtime, so we copy it to the device as a "symbol" (a device variable):

```c
cudaMemcpyToSymbol(FILTER, h_filter, filter_byte_size);
```
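For context, h_filter and filter_byte_size are the host-side counterparts of the constant symbol. A minimal sketch of how they might be set up for a uniform box blur (this is an assumption about the host code, not a copy of it):

```c
// Host-side filter: a 3x3 box blur, every weight equal (1/9).
float h_filter[FILTER_SIZE * FILTER_SIZE];
for (int i = 0; i < FILTER_SIZE * FILTER_SIZE; i++)
    h_filter[i] = 1.0f / (FILTER_SIZE * FILTER_SIZE);

size_t filter_byte_size = FILTER_SIZE * FILTER_SIZE * sizeof(float);
```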
Images, especially camera photographs, tend to be very large. This implementation uses a very small file, but for much larger, very high-resolution images the CPU will run into some issues.
A GPU has much higher bandwidth for transferring data with its multiple control units.
Information locality & Actual physical distance
Though the cache capacity in CPUs is much larger and would allow all of our instructions to be accessed very fast, if we needed to share information with other cores we would have to go L1 cache -> L2 cache -> L3 cache. Each of these jumps implies more time spent transferring data than actually using it.
A GPU, on the other hand, can simply send it to the L2 cache from any of its cores, which makes it shared with every other core that might need that information.
Nvidia Docs, "CUDA C Programming Guide". Available here
A CPU would interpret these instructions and run them sequentially in its own process. The issue is that we would need to parallelize the tasks on the CPU, but these tasks are really quite simple; even splitting them up might generate more overhead than just running them sequentially.
Moreover, the small number of cores in comparison to the GPU means that these tasks can only be subdivided into fairly large chunks. We would fill our CPU with very long runs of simple instructions that could instead have been spread across thousands of GPU cores.
The PGM format allows me to dump the raw data of each pixel very simply. I draw a white canvas, paint a black line on it, and CUDA processes it with a kernel that box-blurs the input.
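As a rough illustration of how simple that dump is (the function name and exact I/O calls here are a sketch, not the project's actual code), a binary PGM (P5) file is just a short text header followed by the raw pixel bytes:

```c
#include <stdio.h>

// Illustrative sketch: write an 8-bit grayscale image as a binary PGM (P5) file.
void write_pgm(const char *path, const unsigned char *pixels,
               int width, int height) {
    FILE *f = fopen(path, "wb");
    fprintf(f, "P5\n%d %d\n255\n", width, height);  // header: magic number, dimensions, max value
    fwrite(pixels, 1, (size_t)width * height, f);   // raw data, one byte per pixel
    fclose(f);
}
```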
The implementation forces any data to be stored as a 1D array. We want to keep the memory locations contiguous so that reads don't jump from pointer to pointer.
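Concretely, the pixel at column x, row y of a width-by-height image lives at a single row-major offset; a small helper like this (illustrative, not necessarily the project's actual code) captures the mapping:

```c
// Map 2D pixel coordinates to an offset in the flattened, row-major buffer.
__host__ __device__ inline int pixel_index(int x, int y, int width) {
    return y * width + x;
}
```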
The task is divided into a grid of 64x64 blocks, each block holding 16x16 threads. These parameters can be changed later; the task can be subdivided much further, which could give better performance.
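Assuming one thread per pixel, that split covers a 1024x1024 image (64 blocks * 16 threads in each dimension). A launch along these lines is a sketch only; the kernel name and argument list are placeholders:

```c
dim3 block(16, 16);   // 256 threads per block
dim3 grid(64, 64);    // 64 * 16 = 1024 pixels covered in each dimension

// d_input / d_output are assumed device buffers holding the source and blurred image.
blur_kernel<<<grid, block>>>(d_input, d_output, width, height);
cudaDeviceSynchronize();   // wait for the blur to finish before copying the result back
```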
For comparison with a sequential implementation, see the following pseudocode example from one of Patrice's lectures at Auckland University.
Original presentation available here: Patrice's Lecture PDF presentation
This pseudocode has four nested loops, so for an N x N image and a K x K kernel the work is on the order of O(N^2 * K^2).
If the kernel size is similar to the input size... well, that is basically O(N^4).
Now, in this CUDA implementation, each output pixel is handled by its own thread, so the work left per thread is reduced to the two small filter loops, O(K^2).
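A minimal sketch of such a per-pixel kernel (an illustration of the idea under the definitions above, not the project's exact code) shows why: the loops over the image are replaced by the thread grid, and each thread only runs the two small filter loops.

```c
__global__ void blur_kernel(const unsigned char *in, unsigned char *out,
                            int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= width || row >= height) return;   // guard threads outside the image

    float acc = 0.0f;
    // Only O(FILTER_SIZE^2) work remains per thread.
    for (int fy = 0; fy < FILTER_SIZE; fy++) {
        for (int fx = 0; fx < FILTER_SIZE; fx++) {
            // Clamp neighbour coordinates at the image border.
            int x = min(max(col + fx - FILTER_OFFSET, 0), width - 1);
            int y = min(max(row + fy - FILTER_OFFSET, 0), height - 1);
            acc += FILTER[fy * FILTER_SIZE + fx] * in[y * width + x];
        }
    }
    out[row * width + col] = (unsigned char)acc;
}
```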
The test is visual, with a few assertions to make sure the image files are read and written correctly.
The resulting files test.pgm and blur.pgm can be seen in an online viewer or opened with an editor.
test.pgm has its middle section crossed by a black diagonal; even when opened in a text editor, you can clearly see there is a line through the middle.
blur.pgm, on the other hand, is the same image but with the diagonal blurred. The blurring might be difficult to spot in the image itself, BUT in a text editor the "numerical blurring" is very clear.
All parameters are hard-coded (for now). Simply compile the kernel.cu file and run it; nothing else is needed.
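Assuming the CUDA toolkit's compiler is on your path, something like `nvcc kernel.cu -o blur` followed by `./blur` should be all it takes to produce test.pgm and blur.pgm.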
The images can be viewed using an online viewer such as this one, or by downloading IrfanView, a pretty lightweight image viewer with no bloat and no ads.
Here's how the images should look: