🚀 The feature
Looking at the implementation of `roi_align_kernel`, it seems the CPU kernel can be further optimized with OpenMP parallelization.
Here's what can be done to get a performance boost:
- Add `#pragma omp parallel for` to the kernel's main loop (line 27)
- Add `-fopenmp` to the compilation flags (CFLAGS)
- Call `torch.set_num_threads()` with the desired number of OMP threads (on the test/workload side)
Motivation, pitch
I did some experimentation locally in which I:
- Added this optimization
- Built a small test case that calls `roi_align`
- Profiled `torchvision.ops.roi_align()` and measured the time of the current implementation vs. 18 threads on a simple CLX machine.

In these admittedly modest experiments it shows a 10x performance boost!
Alternatives
There may be other libraries/tools that could optimize this CPU kernel; oneTBB or something similar comes to mind.
Nevertheless, the current implementation is quite naive and could easily be made much more performant.
Additional context
No response