The time profiling setup, as mentioned in the main README, runs a forward pass of the VGG and AlexNet architectures over multiple batch sizes for every convolution algorithm, repeating each setting several times (3 by default). For each run it measures convolution time, algorithmic overhead time, and total time, and compares them.
To capture the runtime of the different convolution methods, we use the CUDA Events API and add timing logic around each kernel launch. At each convolution layer, we break the time taken by the API function into 3 time metrics (a minimal sketch of the timing pattern follows this list):
- Convolution Time - This includes the time taken by all the crucial operations/kernels involved in the main convolution logic [the part that multiplies the matrices and computes the answer]
- Pre/post-process Overhead Time - This includes the algorithmic overheads incurred by functions/kernels for preprocessing (padding, filter size manipulation, etc.) or postprocessing (cropping, striding the output out of a bigger array, or longer C++ copy operations). This DOES NOT include times for CUDA memory operations like cudaMemcpy, cudaMalloc, etc.
  Note - Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
- Total Time - This is the total time taken for the function call, which includes Convolution Time and Pre/post-process Overhead Time AS WELL AS the time taken for memory operations like cudaMalloc, etc. Total Time = Convolution Time + Pre/post-process Overhead Time + any other overhead due to memory operations
  Note - Convolution and Overhead Time for the cuDNN API call were assumed to be 0 since we don't know how the split is done internally; only Total Time is considered
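The split above is obtained by wrapping each kernel launch with CUDA events. A minimal, self-contained sketch of the pattern (the dummy kernel and variable names are illustrative, not taken from timeProfiler.cc):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the convolution kernels.
__global__ void dummyConvKernel(float *out) { out[threadIdx.x] *= 2.0f; }

int main() {
    float *dBuf;
    cudaMalloc(&dBuf, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyConvKernel<<<1, 256>>>(dBuf);        // kernel counted towards Convolution Time
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time between the two events, in ms
    printf("kernel time: %.3f ms\n", ms);

    // Pre/post-processing kernels are wrapped the same way (Overhead Time), while
    // Total Time spans the whole layer call, including cudaMalloc/cudaMemcpy.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    return 0;
}
```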
- Logger [timeProfiler.cc] - This C++ script takes the following inputs:
  - Mode - "VGG"/"ALEX"
  - Path to the pretrained model
  - Number of runs for each kernel and batch-size setting
  - Algorithm - "DIRECT"/"FFT"/"IM2COL"/"WINOGRAD"/"CUDNN" [default]
  With these inputs, it runs a forward pass of the inferencing engine for the given architecture for batch sizes 1, 2, 4 and 8. For each setting, the forward pass is run < Number of Runs > times.
- Analyzer [analyzer.py] - Takes the log files generated by the logger and generates plots
- Compilation:
  $ make
  [Can be skipped if compiled through the global makefile]
- Running the C++ based logger:
  $ make run_vgg   # For VGG
  $ make run_alex  # For AlexNet
  By default, this code runs 3 iterations of each batch-size and kernel setting for more statistically sound results. This can be changed; however, we faced some issues on Colab when increasing it beyond 3, so keep that in mind. To change this default:
  $ make run_<vgg/alex> NUM_RUNS=<NUMBER>
  This will generate the log files logVGG.txt and logALEX.txt.
- Running the Analyzer: The analyzer can be run as follows:
  $ python analyzer.py
  This will take logVGG.txt and/or logALEX.txt present in the directory and generate plots in the profiling/TimeProfiling/Plots/ folder.
- Clean the test directory:
  $ make clean
  [Removes only the C++ based logger binary]
Note: The plots follow a log-scale on the Y-axis
Through these comparisons, we can draw the following important conclusions:
- Our Im2Col/GEMM kernel implementation works best by a margin for both batch-size = 1 and batch-size = 8
- For Batch-size = 1:
  - FFT and Direct Convolution perform roughly on par, with Direct Convolution beating FFT by a very small margin. The reasons for this are:
    - Reason: FFT theory for convolution tells us that FFT is only profitable for filter sizes larger than about 15x15, hence we expect it to be slow for the 3x3 filters
    - Reason: The cuFFT algorithm works best when the input size can be represented as $2^n \times 3^m \times 5^p \times 7^q$; however, due to padding, the input dimensions at each layer cannot be prime-factorized this way, leading cuFFT to fall back to a slower algorithm (see the smoothness-check sketch after this list)
  - For the first layer, the FFT kernel time fluctuates and is much greater than that of Direct Convolution.
    - Reason: This again has to do with cuFFT efficiency and the prime factorization of the input size. cuFFT works best when the prime factors, even among {2, 3, 5, 7}, are as small as possible (so powers of 2 alone are best). However, in the first layer the number of input channels is 3, while in the subsequent layers it is $2^n$, hence the higher time.
  - Our Winograd implementation is the slowest because:
    - Reason: Since each tile is processed in parallel, the advantage of the reduced number of operations is not reflected in the runtime
    - Reason: We use 4 times extra memory and a number of extra operations
- For Batch-size = 8:
  - Surprisingly, FFT outperforms Direct Convolution for almost all the layers. Keeping in mind that our FFT implementation actually loops over the batch size, it's easy to see that Direct Convolution scales much worse with batch size.
  - Winograd is still the slowest because it loops over the batch size as well
- The trend of convolution time across the layers remains the same for each algorithm
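As referenced above, a quick way to see whether a (padded) dimension falls on cuFFT's fast path is to check that it is 7-smooth, i.e. has no prime factor other than 2, 3, 5 and 7. A small illustrative helper (not part of the logger):

```cpp
#include <cstdio>

// Returns true if n = 2^a * 3^b * 5^c * 7^d for some non-negative a, b, c, d,
// i.e. n has no prime factor other than 2, 3, 5 and 7 -- the transform sizes
// cuFFT handles with its fastest code paths.
bool isSevenSmooth(int n) {
    if (n <= 0) return false;
    const int primes[] = {2, 3, 5, 7};
    for (int p : primes)
        while (n % p == 0)
            n /= p;
    return n == 1;
}

int main() {
    // 224 = 2^5 * 7 is 7-smooth; 226 = 2 * 113 is not, so a 226-point FFT is slower.
    printf("224 -> %d, 226 -> %d\n", isSevenSmooth(224), isSevenSmooth(226));
    return 0;
}
```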
[Post/Pre Processing] Overhead Time :
Note:
- The plots follow a log-scale on the Y-axis
- Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
From the above plots, we can draw the following important conclusions:
- The overhead for FFT is greater than that of the other algorithms by a margin:
  - Reason: FFT requires the filters to be the same size as the input, so they have to be padded to that (bigger) size and also flipped and aligned properly, which contributes to a larger pre-processing time. The output also has to be cropped, adding to the overhead (a minimal padding sketch follows this list).
- The somewhat larger overhead for Winograd (remember, this is a log scale, so FFT's overhead is still 8-10 times larger) is because we have to crop the output as well
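The padding step described above amounts to embedding each small filter into a zero buffer of the input's spatial size before transforming it. A minimal CPU-side illustration (hypothetical helper, not the actual preprocessing kernel):

```cpp
#include <cstdio>
#include <vector>

// Zero-pad a kH x kW filter into an H x W buffer (row-major) so it can be
// transformed with the same FFT size as the input. This, together with
// cropping the H x W result back to the valid output size, is the kind of
// pre/post-processing counted as Overhead Time above.
std::vector<float> padFilterToInput(const std::vector<float>& filter,
                                    int kH, int kW, int H, int W) {
    std::vector<float> padded(static_cast<size_t>(H) * W, 0.0f);
    for (int r = 0; r < kH; ++r)
        for (int c = 0; c < kW; ++c)
            padded[r * W + c] = filter[r * kW + c];
    return padded;
}

int main() {
    std::vector<float> k(3 * 3, 1.0f);                 // a 3x3 filter of ones
    std::vector<float> p = padFilterToInput(k, 3, 3, 8, 8);
    printf("padded size: %zu\n", p.size());            // 64
    return 0;
}
```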
Note: The plots follow a log-scale on the Y-axis
From the above graphs, the following conclusions can be drawn:
- Im2Col is our fastest implementation and is not far behind cuDNN convolution (where cuDNN is set to choose the best algorithm given the parameters of the layer); a small CPU sketch of the Im2Col idea follows this list
- In spite of having a lower convolution time, the pre/post-processing overhead as well as the memory overhead for FFT convolution is quite large, which is why its total time is so poor
- Winograd, due to its larger convolution time, is the slowest of our implementations
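For reference, the core idea behind Im2Col/GEMM is to unroll every receptive field of the input into a column so that the whole convolution becomes a single matrix multiplication. A simplified single-channel, stride-1, no-padding CPU sketch (illustrative only, not our CUDA kernel):

```cpp
#include <cstdio>
#include <vector>

// Unroll a single-channel H x W input (row-major) into a (K*K) x (outH*outW)
// matrix for a K x K filter, stride 1, no padding. Multiplying the flattened
// filter (1 x K*K) with this matrix gives the convolution output as one GEMM.
std::vector<float> im2col(const std::vector<float>& in, int H, int W, int K) {
    int outH = H - K + 1, outW = W - K + 1;
    std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
    for (int kr = 0; kr < K; ++kr)
        for (int kc = 0; kc < K; ++kc)
            for (int r = 0; r < outH; ++r)
                for (int c = 0; c < outW; ++c)
                    cols[(kr * K + kc) * (outH * outW) + r * outW + c] =
                        in[(r + kr) * W + (c + kc)];
    return cols;
}

int main() {
    std::vector<float> img(6 * 6, 1.0f);
    std::vector<float> cols = im2col(img, 6, 6, 3);   // yields a 9 x 16 matrix
    printf("im2col matrix: %d x %d\n", 3 * 3, 4 * 4);
    return 0;
}
```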
Convolution Time | [Pre/post processing] Overhead Time | Total Time
---|---|---
![]() | ![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the conclusions that:
- Im2Col is undoubtedly the fastest of all our implementations at every batch size
- Even though FFT loops over the images in a batch, its convolution time scales better with larger batch sizes than Direct Convolution does, even when the input sizes cannot be optimally prime-factorized
- The overhead time of FFT convolution is almost the same for all batch sizes, while the overhead time of the other kernels increases faster
- Winograd is the slowest kernel in terms of overall time for higher batch sizes
Note: Winograd Kernel does not work on AlexNet as the algorithm is only described in the literature for 3x3 kernels and scaling up is extremely tough. AlexNet has 11x11 and 5x5 filters, therefore, we don't use Winograd for it
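The 3x3-only restriction and the 4x extra memory noted earlier are consistent with the standard F(2x2, 3x3) formulation from the literature (Lavin and Gray), in which each 2x2 output tile $Y$ is computed from a 4x4 input tile $d$ and the 3x3 filter $g$ as

$$
Y = A^T \left[ (G\,g\,G^T) \odot (B^T d\,B) \right] A
$$

where $B^T$, $G$ and $A^T$ are small constant transform matrices and $\odot$ denotes element-wise multiplication. Every 2x2 block of outputs requires a full 4x4 transformed input tile, which is where the roughly 4x larger intermediate storage comes from.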
Convolution Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the following conclusions:
- Im2Col/GEMM convolution is the fastest, as in VGG
- Interestingly, for the first layer FFT convolution performs worse than Direct Convolution even though the kernel size is 11x11, for the same reason as stated in the VGG section regarding first-layer performance:
  - Reason: The cuFFT algorithm works best when the input size can be represented as $2^n \times 3^m \times 5^p \times 7^q$; however, due to padding, the input dimensions at each layer cannot be prime-factorized this way, leading cuFFT to fall back to a slower algorithm
- For the second layer (at both batch sizes), where the kernel is 5x5, FFT performs better than Direct Convolution, and then goes back to performing badly for the 3x3 filters
- Direct Convolution does not scale as badly as it did in VGG, perhaps because the input size and the number of filters in the later layers of AlexNet are much smaller
[Post/Pre Processing] Overhead Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note:
- The plots follow a log-scale on the Y-axis
- Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
From the above plots, we can draw the following conclusions:
- FFT overhead is still the largest, as was the case in VGG, but does not increase much with batch size
Total Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above graphs, the following conclusions can be drawn:
- Im2Col is our fastest implementation and is not far behind cuDNN convolution (where cuDNN is set to choose the best algorithm given the parameters of the layer)
- In spite of having a lower convolution time for some layers, the pre/post-processing overhead as well as the memory overhead for FFT is quite large, which is why its total time is so poor
Convolution Time | [Pre/post processing] Overhead Time | Total Time
---|---|---
![]() | ![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the conclusions that:
- Im2Col is undoubtedly the fastest of all our implementations at every batch size
- Unlike in VGG, the convolution time for Direct Convolution does not increase as badly, and it is therefore the second-best performing kernel throughout
- The increase in FFT pre/post-processing overhead with batch size is not sharp, even though the overhead itself is quite large. We can also infer that the other (memory) overheads increase with batch size.