
torchvision.roi_align performance optimization with OpenMP #4935

Open
@gal-star

Description


🚀 The feature

Looking at the implementation of roi_align_kernel, it seems it can be further optimized using OpenMP parallelization. The pragma is already there in the kernel source, but commented out:

// #pragma omp parallel for num_threads(32)

Here's what can be done to get a performance boost:

  1. Add #pragma omp parallel for to the kernel (line 27); see the sketch after this list.
  2. Add -fopenmp to the CFLAGS for the compilation.
  3. Set torch.set_num_threads() to the desired number of OMP threads (on the test/workload side).
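
To make step 1 concrete, here is a minimal sketch, not the actual torchvision source: the function name, signature, and loop body are illustrative placeholders. The only point is where the pragma goes and why the iterations are independent, assuming the output is laid out as [n_rois, channels, pooled_height, pooled_width] so each ROI writes to its own slice.

    // Illustrative sketch of the outer ROI loop in the CPU roi_align kernel
    // (names and loop body simplified). Compile with -fopenmp (step 2).
    template <typename T>
    void roi_align_outer_loop_sketch(
        const T* rois,   // [n_rois, 5] = [batch_idx, x1, y1, x2, y2]
        T* output,       // [n_rois, channels, pooled_height, pooled_width]
        int n_rois, int channels, int pooled_height, int pooled_width) {
      // Proposed: enable the pragma that currently sits commented out as
      //   // #pragma omp parallel for num_threads(32)
      // Omitting num_threads(...) lets the OpenMP runtime's global setting
      // decide the team size, which torch.set_num_threads() (step 3) typically
      // controls when the extension links the same OpenMP runtime as PyTorch.
    #pragma omp parallel for
      for (int n = 0; n < n_rois; n++) {
        const T* roi = rois + n * 5;
        T* out = output + n * channels * pooled_height * pooled_width;
        // Stand-in for the real bilinear-interpolation pooling: each iteration
        // writes only to its own output slice, so there are no cross-iteration
        // data races.
        for (int i = 0; i < channels * pooled_height * pooled_width; i++) {
          out[i] = static_cast<T>(roi[0]);  // placeholder computation
        }
      }
    }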

Motivation, pitch

I did some experimentation locally in which I:

  • Added this optimization
  • Built a small test case that calls roi_align
  • Profiled torchvision.ops.roi_align() and compared the time of the current implementation against the OpenMP version with 18 threads on a simple CLX (Cascade Lake) machine.

In my humble experiments it shows a 10X performance boost!

Alternatives

Other libraries/tooling could be used to optimize this CPU kernel; one could think of oneTBB or something similar.
Nevertheless, the current implementation is really naive and can easily be made much more performant.
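
For comparison with the alternatives above, PyTorch's own at::parallel_for (from ATen/Parallel.h) is one such piece of tooling: it dispatches to OpenMP or a TBB/native thread pool depending on how PyTorch was built, and it respects torch.set_num_threads(). Below is a minimal sketch of the same loop expressed with it; again, the function name and loop body are illustrative, not existing torchvision code.

    #include <ATen/Parallel.h>
    #include <cstdint>

    // Illustrative sketch: the same outer ROI loop expressed with ATen's
    // parallel primitive instead of a raw OpenMP pragma.
    template <typename T>
    void roi_align_parallel_for_sketch(
        const T* rois, T* output,
        int64_t n_rois, int64_t channels,
        int64_t pooled_height, int64_t pooled_width) {
      at::parallel_for(0, n_rois, /*grain_size=*/1,
                       [&](int64_t begin, int64_t end) {
        for (int64_t n = begin; n < end; n++) {
          const T* roi = rois + n * 5;
          T* out = output + n * channels * pooled_height * pooled_width;
          // Stand-in for the real per-ROI pooling; the chunks handled by
          // different threads never overlap in the output.
          for (int64_t i = 0; i < channels * pooled_height * pooled_width; i++) {
            out[i] = static_cast<T>(roi[0]);  // placeholder computation
          }
        }
      });
    }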

Additional context

No response
