The time profiling setup, as mentioned in the main README, runs a forward pass of the VGG and AlexNet architectures over multiple batch sizes for every convolution algorithm, repeating each setting several times (3 by default). For each run it measures convolution time, algorithmic overhead time, and total time, and compares them.
To capture the runtime of the different convolution methods, we use the CUDA Events API and add timing logic around each kernel launch. At each convolution layer, we break the time taken by the API function into 3 time metrics (a minimal sketch of the timing pattern follows this list):
- Convolution Time - This includes the time taken by all the crucial operations/kernels involved in the main convolution logic [the part that multiplies the matrices and computes the answer]
- Pre/post-process Overhead Time - This includes the algorithmic overheads incurred by functions/kernels for preprocessing (padding, filter size manipulation, etc.) or postprocessing (cropping, striding the output out of a bigger array, or longer C++ copy operations). This DOES NOT include times for CUDA memory operations like cudaMemcpy, cudaMalloc, etc.
  Note - Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
- Total Time - This is the total time taken for the function call, which includes Convolution Time and Pre/post-process Overhead Time AS WELL AS the time taken for memory operations like cudaMalloc, etc. Total Time = Convolution Time + Pre/post-process Overhead Time + any other overhead due to memory operations
  Note - Convolution and Overhead Time for the cuDNN API call were assumed to be 0 since we don't know how the split is done internally; only Total Time is considered
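The split above is obtained by wrapping each kernel launch with CUDA events. A minimal, self-contained sketch of the pattern (the dummy kernel and variable names are illustrative, not taken from timeProfiler.cc):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the convolution kernels.
__global__ void dummyConvKernel(float *out) { out[threadIdx.x] *= 2.0f; }

int main() {
    float *dBuf;
    cudaMalloc(&dBuf, 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyConvKernel<<<1, 256>>>(dBuf);        // kernel counted towards Convolution Time
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // elapsed time between the two events, in ms
    printf("kernel time: %.3f ms\n", ms);

    // Pre/post-processing kernels are wrapped the same way (Overhead Time), while
    // Total Time spans the whole layer call, including cudaMalloc/cudaMemcpy.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dBuf);
    return 0;
}
```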
- Logger [timeProfiler.cc] - This C++ script takes the following inputs:
  - Mode - "VGG"/"ALEX"
  - Path to the pretrained model
  - Number of runs for each kernel and batch-size setting
  - Algorithm - "DIRECT"/"FFT"/"IM2COL"/"WINOGRAD"/"CUDNN" [default]
  With these inputs, it runs a forward pass of the inferencing engine for the given architecture for batch sizes 1, 2, 4 and 8. For each setting, the forward pass is run < Number of Runs > times.
- Analyzer [analyzer.py] - Takes the log files generated by the logger and generates plots
- Compilation:
  $ make
  [Can be skipped if compiled through the global makefile]
- Running the C++ based logger:
  $ make run_vgg   # For VGG
  $ make run_alex  # For AlexNet
  By default, this code runs 3 iterations of each batch-size and kernel setting for more statistically sound results. This can be changed; however, we faced some issues on Colab when increasing it beyond 3, so keep that in mind. To change this default:
  $ make run_<vgg/alex> NUM_RUNS=<NUMBER>
  This will generate the log files logVGG.txt and logALEX.txt.
- Running the Analyzer: The analyzer can be run as follows:
  $ python analyzer.py
  This will take logVGG.txt and/or logALEX.txt present in the directory and generate plots in the profiling/TimeProfiling/Plots/ folder.
- Clean the test directory:
  $ make clean
  [Removes only the C++ based logger binary]
Note: The plots follow a log-scale on the Y-axis
Through these comparisons, we can draw the following important conclusions:
- Our Im2Col/GEMM kernel implementation works best by a margin for both batch-size = 1 and batch-size = 8
- For Batch-size = 1:
  - FFT and Direct Convolution perform roughly on par, with Direct Convolution beating FFT by a very small margin. The reasons for this are:
    - Reason: FFT theory for convolution tells us that FFT is only profitable for filter sizes larger than about 15x15, hence we expect it to be slow for the 3x3 filters
    - Reason: The cuFFT algorithm works best when the input size can be represented as $2^n \times 3^m \times 5^p \times 7^q$; however, due to padding, the input dimensions at each layer cannot be prime-factorized this way, leading cuFFT to fall back to a slower algorithm (see the smoothness-check sketch after this list)
  - For the first layer, the FFT kernel time fluctuates and is much greater than that of Direct Convolution.
    - Reason: This again has to do with cuFFT efficiency and the prime factorization of the input size. cuFFT works best when the prime factors, even among {2, 3, 5, 7}, are as small as possible (so powers of 2 alone are best). However, in the first layer the number of input channels is 3, while in the subsequent layers it is $2^n$, hence the higher time.
  - Our Winograd implementation is the slowest because:
    - Reason: Since each tile is processed in parallel, the advantage of the reduced number of operations is not reflected in the runtime
    - Reason: We use 4 times extra memory and a number of extra operations
- For Batch-size = 8:
  - Surprisingly, FFT outperforms Direct Convolution for almost all the layers. Keeping in mind that our FFT implementation actually loops over the batch size, it's easy to see that Direct Convolution scales much worse with batch size.
  - Winograd is still the slowest because it loops over the batch size as well
- The trend of convolution time across the layers remains the same for each algorithm
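As referenced above, a quick way to see whether a (padded) dimension falls on cuFFT's fast path is to check that it is 7-smooth, i.e. has no prime factor other than 2, 3, 5 and 7. A small illustrative helper (not part of the logger):

```cpp
#include <cstdio>

// Returns true if n = 2^a * 3^b * 5^c * 7^d for some non-negative a, b, c, d,
// i.e. n has no prime factor other than 2, 3, 5 and 7 -- the transform sizes
// cuFFT handles with its fastest code paths.
bool isSevenSmooth(int n) {
    if (n <= 0) return false;
    const int primes[] = {2, 3, 5, 7};
    for (int p : primes)
        while (n % p == 0)
            n /= p;
    return n == 1;
}

int main() {
    // 224 = 2^5 * 7 is 7-smooth; 226 = 2 * 113 is not, so a 226-point FFT is slower.
    printf("224 -> %d, 226 -> %d\n", isSevenSmooth(224), isSevenSmooth(226));
    return 0;
}
```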
[Post/Pre Processing] Overhead Time :
Note:
- The plots follow a log-scale on the Y-axis
- Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
From the above plots, we can draw the following important conclusions:
- The overhead for FFT is greater than that of the other algorithms by a margin:
  - Reason: FFT requires the filters to be the same size as the input, so they have to be padded to that (bigger) size and also flipped and aligned properly, which contributes to a larger pre-processing time. The output also has to be cropped, adding to the overhead (a minimal padding sketch follows this list).
- The somewhat larger overhead for Winograd (remember, this is a log scale, so FFT's overhead is still 8-10 times larger) is because we have to crop the output as well
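The padding step described above amounts to embedding each small filter into a zero buffer of the input's spatial size before transforming it. A minimal CPU-side illustration (hypothetical helper, not the actual preprocessing kernel):

```cpp
#include <cstdio>
#include <vector>

// Zero-pad a kH x kW filter into an H x W buffer (row-major) so it can be
// transformed with the same FFT size as the input. This, together with
// cropping the H x W result back to the valid output size, is the kind of
// pre/post-processing counted as Overhead Time above.
std::vector<float> padFilterToInput(const std::vector<float>& filter,
                                    int kH, int kW, int H, int W) {
    std::vector<float> padded(static_cast<size_t>(H) * W, 0.0f);
    for (int r = 0; r < kH; ++r)
        for (int c = 0; c < kW; ++c)
            padded[r * W + c] = filter[r * kW + c];
    return padded;
}

int main() {
    std::vector<float> k(3 * 3, 1.0f);                 // a 3x3 filter of ones
    std::vector<float> p = padFilterToInput(k, 3, 3, 8, 8);
    printf("padded size: %zu\n", p.size());            // 64
    return 0;
}
```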
Note: The plots follow a log-scale on the Y-axis
From the above graphs, the following conclusions can be drawn:
- Im2Col is our fastest implementation and is not far behind cuDNN convolution (where cuDNN is set to choose the best algorithm given the parameters of the layer); a small CPU sketch of the Im2Col idea follows this list
- In spite of having a lower convolution time, the pre/post-processing overhead as well as the memory overhead for FFT convolution is quite large, which is why its total time is so poor
- Winograd, due to its larger convolution time, is the slowest of our implementations
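For reference, the core idea behind Im2Col/GEMM is to unroll every receptive field of the input into a column so that the whole convolution becomes a single matrix multiplication. A simplified single-channel, stride-1, no-padding CPU sketch (illustrative only, not our CUDA kernel):

```cpp
#include <cstdio>
#include <vector>

// Unroll a single-channel H x W input (row-major) into a (K*K) x (outH*outW)
// matrix for a K x K filter, stride 1, no padding. Multiplying the flattened
// filter (1 x K*K) with this matrix gives the convolution output as one GEMM.
std::vector<float> im2col(const std::vector<float>& in, int H, int W, int K) {
    int outH = H - K + 1, outW = W - K + 1;
    std::vector<float> cols(static_cast<size_t>(K) * K * outH * outW);
    for (int kr = 0; kr < K; ++kr)
        for (int kc = 0; kc < K; ++kc)
            for (int r = 0; r < outH; ++r)
                for (int c = 0; c < outW; ++c)
                    cols[(kr * K + kc) * (outH * outW) + r * outW + c] =
                        in[(r + kr) * W + (c + kc)];
    return cols;
}

int main() {
    std::vector<float> img(6 * 6, 1.0f);
    std::vector<float> cols = im2col(img, 6, 6, 3);   // yields a 9 x 16 matrix
    printf("im2col matrix: %d x %d\n", 3 * 3, 4 * 4);
    return 0;
}
```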
Convolution Time | [Pre/post processing] Overhead Time | Total Time
---|---|---
![]() | ![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the conclusions that:
- Im2Col is undoubtedly the fastest of all our implementations at every batch size
- Even though FFT loops over the images in a batch, its convolution time scales better with larger batch sizes than Direct Convolution does, even when the input sizes cannot be optimally prime-factorized
- The overhead time of FFT convolution is almost the same for all batch sizes, while the overhead time of the other kernels increases faster
- Winograd is the slowest kernel in terms of overall time for higher batch sizes
Note: Winograd Kernel does not work on AlexNet as the algorithm is only described in the literature for 3x3 kernels and scaling up is extremely tough. AlexNet has 11x11 and 5x5 filters, therefore, we don't use Winograd for it
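The 3x3-only restriction and the 4x extra memory noted earlier are consistent with the standard F(2x2, 3x3) formulation from the literature (Lavin and Gray), in which each 2x2 output tile $Y$ is computed from a 4x4 input tile $d$ and the 3x3 filter $g$ as

$$
Y = A^T \left[ (G\,g\,G^T) \odot (B^T d\,B) \right] A
$$

where $B^T$, $G$ and $A^T$ are small constant transform matrices and $\odot$ denotes element-wise multiplication. Every 2x2 block of outputs requires a full 4x4 transformed input tile, which is where the roughly 4x larger intermediate storage comes from.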
Convolution Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the following conclusions:
- Im2Col/GEMM convolution is the fastest, as in VGG
- Interestingly, for the first layer FFT convolution performs worse than Direct Convolution even though the kernel size is 11x11, for the same reason as stated in the VGG section regarding first-layer performance:
  - Reason: The cuFFT algorithm works best when the input size can be represented as $2^n \times 3^m \times 5^p \times 7^q$; however, due to padding, the input dimensions at each layer cannot be prime-factorized this way, leading cuFFT to fall back to a slower algorithm
- For the second layer (at both batch sizes), where the kernel is 5x5, FFT performs better than Direct Convolution, and then goes back to performing badly for the 3x3 filters
- Direct Convolution does not scale as badly as it did in VGG, perhaps because the input size and the number of filters in the later layers of AlexNet are much smaller
[Post/Pre Processing] Overhead Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note:
- The plots follow a log-scale on the Y-axis
- Overhead Time for the Im2Col kernel was assumed to be 0.01 for the sake of graph plotting. In reality, since there are no explicit pre/post-processing steps, the overhead by our definition is 0
From the above plots, we can draw the following conclusions:
- FFT overhead is still the largest, as was the case in VGG, but does not increase much with batch size
Total Time :
Batchsize = 1 | Batchsize = 8
---|---
![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above graphs, the following conclusions can be drawn:
- Im2Col is our fastest implementation and is not far behind cuDNN convolution (where cuDNN is set to choose the best algorithm given the parameters of the layer)
- In spite of having a lower convolution time for some layers, the pre/post-processing overhead as well as the memory overhead for FFT is quite large, which is why its total time is so poor
Convolution Time | [Pre/post processing] Overhead Time | Total Time
---|---|---
![]() | ![]() | ![]()
Note: The plots follow a log-scale on the Y-axis
From the above plots, we can draw the conclusions that:
- Im2Col is undoubtedly the fastest of all our implementations at every batch size
- Unlike in VGG, the convolution time for Direct Convolution does not increase as badly, and it is therefore the second-best performing kernel throughout
- The increase in FFT pre/post-processing overhead with batch size is not sharp, even though the overhead itself is quite large. We can also infer that the other (memory) overheads increase with batch size.