
threaded version of sideof(::Point, ::Ring) is significantly slower than serial for small rings #988

Closed
disberd opened this issue Aug 8, 2024 · 4 comments · Fixed by #993
Labels: enhancement (New feature or request), performance

Comments

disberd (Member) commented Aug 8, 2024

The default multithreading for sideof(::Point, ::Ring) only seems to pay off for rings with more than 1000 points (the exact number depends on Threads.nthreads() and probably on the system/Julia version).

Given that the number of threads used for this cannot be controlled without forcing the same number of threads for the whole Julia session, I think it would be good to default to the serial version for rings with 1000 points or fewer. This can significantly speed up operations like point in GeometrySet where the geometry set is composed of many small PolyAreas (as in the case of NaturalEarth borders at 110m resolution).

I'd be willing to help if you are interested. Alternative implementations are:

  1. Refactor the sideof code into a core part that performs the computation on a single segment, and have the outer sideof method call the core either threaded or serial depending on the number of vertices.
     • This has, in my opinion, the best potential performance and does not require additional deps, but it requires more code refactoring (though not that much, I think).
  2. Use something like ChunkSplitters.jl to make sure the threaded chunks have at least, e.g., 1000 elements. This still has some threading overhead but does not require as much refactoring.
  3. Use @batch from Polyester, which has lower threading overhead and allows specifying the minimum batch size.
  4. Use OhMyThreads.jl to do something similar to points 2/3 above, just as an alternative package for the implementation.
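For illustration, the first option could be sketched roughly as below. All names (segment_kernel, reduce_segments) and the 1000-element cutoff are illustrative stand-ins, not the actual Meshes.jl internals:

```julia
# Hypothetical sketch of option 1: a serial per-element kernel plus a
# size-based dispatch between a serial and a threaded outer loop.
const SERIAL_THRESHOLD = 1000

# stand-in for the real per-segment computation in sideof
segment_kernel(x) = x^2

function reduce_segments(xs)
    n = length(xs)
    # small inputs: plain serial reduction, zero threading overhead
    n <= SERIAL_THRESHOLD && return sum(segment_kernel, xs)
    # large inputs: one chunk per thread, combined at the end
    chunks = collect(Iterators.partition(eachindex(xs), cld(n, Threads.nthreads())))
    partials = Vector{Float64}(undef, length(chunks))
    Threads.@threads for c in eachindex(chunks)
        partials[c] = sum(i -> segment_kernel(xs[i]), chunks[c])
    end
    return sum(partials)
end
```

The real sideof kernel also needs early exit (when a point lies exactly on a segment), which is trivial in the serial branch and is one of the refactoring details the threaded branch would have to handle.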

Benchmarks

Here are some benchmarks below comparing the latest Meshes@0.48.1 (threaded) against Meshes@0.46.1 (serial):

using Meshes, BenchmarkTools

sampledcircle(n) = sample(Sphere((0, 0), 1), RegularSampling(n)) |> splat(Ring)

function bench(ns, p)
    for n in ns
        s = sampledcircle(n)
        b = @benchmark sideof($p, $s)
        println("#======  Benchmark for n = $n ======#")
        display(b)
        println()
    end
end

bench([100, 1000, 10000], Point(2, 2))

Threaded (v0.48.1) - Windows - Julia 1.10 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.186 μs …  12.127 ms  ┊ GC (min … max):  0.00% … 99.74%
 Time  (median):     6.471 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   7.912 μs ± 121.278 μs  ┊ GC (mean ± σ):  15.77% ±  1.38%

      ▂▄▇▇█▅▄▁  ▂▄▄▄▆▅▆▄▅▃▃
  ▁▂▆███████████████████████▇▆▅▄▄▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▁ ▄
  4.19 μs         Histogram: frequency by time        12.2 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):   7.567 μs …  18.097 ms  ┊ GC (min … max):  0.00% … 99.72%
 Time  (median):     11.850 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.869 μs ± 181.019 μs  ┊ GC (mean ± σ):  10.70% ±  1.00%

      ▇█▅▃▁
  ▁▂▃██████▆▄▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  7.57 μs         Histogram: frequency by time         41.4 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.500 μs … 520.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.234 μs ±  14.817 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▇██▄▁ ▁
  ▁▂▂▂▂▃▄▅█████████▅▅▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  20.5 μs         Histogram: frequency by time         89.7 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

Threaded (v0.48.1) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.557 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.03%
 Time  (median):     6.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.559 μs ± 37.753 μs  ┊ GC (mean ± σ):  8.52% ±  1.71%

              ▂▆▅█▇▇▃
  ▁▂▃▄▅▅▆▆▇▇██████████▇▆▄▄▂▂▂▂▂▂▂▁▁▁▁▂▂▁▁▁▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁ ▃
  4.56 μs        Histogram: frequency by time        12.7 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):   7.583 μs …  4.616 ms  ┊ GC (min … max): 0.00% … 99.22%
 Time  (median):     11.917 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.675 μs ± 56.185 μs  ┊ GC (mean ± σ):  5.25% ±  1.40%

        ▃██▇▂
  ▁▂▄▅▆███████▅▄▃▃▂▂▂▂▂▂▃▄▄▄▃▃▂▂▂▂▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  7.58 μs         Histogram: frequency by time        34.8 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.300 μs … 287.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.957 μs ±  14.075 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▄█▆▃
  ▁▁▂▃▄▃▃▄▇████▆▄▃▃▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  19.3 μs         Histogram: frequency by time          100 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

Threaded/Serial (v0.48.1) - Windows - Julia 1.11 - 1 thread

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.260 μs …  16.080 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.529 μs ± 405.729 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▅▆▅▅▅▄▄▃▃▂▁▂▁ ▁▁▁▂▁▂▁▂▂▁▁ ▁                              ▂
  ██████████████████████████████▇▇██▇▆██▇▇▆▇▇▆▅▆▅▆▅▅▅▅▅▅▄▅▄▂▄ █
  1.26 μs      Histogram: log(frequency) by time      3.11 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.814 μs … 29.029 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.957 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.257 μs ±  1.156 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▆▄▃▂▂ ▁▁▁                                                ▁
  ██████████████▇▇████▇██▇██▇▅▇▆▅▆▅▆▅▆▄▅▆▆▆▄▅▅▅▃▄▅▄▄▄▅▃▃▃▃▄▄ █
  4.81 μs      Histogram: log(frequency) by time     10.2 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.100 μs … 140.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.554 μs ±   7.712 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▇▄▂▁▂▁▁▁    ▁                                              ▁
  ███████████████▇▇▇▇▇▇▇▆█▇▇▇▇▇█▇▇▆▇▆▇▇▆▅▅▆▆▅▆▆▅▆▇▆▆▆▅▆▆▅▅▅▅▅▆ █
  40.1 μs       Histogram: log(frequency) by time      77.9 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

Serial (v0.46.1) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 201 evaluations.
 Range (min … max):  396.517 ns …  1.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     399.502 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   434.504 ns ± 77.214 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▆▁▄▁▂ ▁▁      ▁  ▁         ▁   ▁                           ▁
  ██████████▇█████████▇███▇▇█▇▇█▇▇▇█▇█▇█▇▇▇▇▇▇▆▆▅▅▆▅▅▅▅▅▅▄▄▄▅▆ █
  397 ns        Histogram: log(frequency) by time       716 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.850 μs …  20.913 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.862 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.043 μs ± 639.322 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▅▁▂▁ ▁                                                    ▁
  ██████▆█▇▇▇▇█▇▇▆▆▇▇▇▇▆▆▇▆▇▆▆▆▆▅▅▅▅▄▅▄▅▅▄▅▄▅▆▆▄▅▅▅▅▅▅▅▅▅▄▅▇▇ █
  3.85 μs      Histogram: log(frequency) by time      6.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.200 μs … 254.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.533 μs ±   6.719 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▅▁▃▁ ▂                                                     ▁
  ██████▇█▇██▇█▇▇▇▇▇▇▆▇█▇▇█▇▆▆▆▆▆▅▄▅▃▅▅▆▅▅▆▅▅▅▆▅▅▅▅▆▅▄▅▃▅▄▄▅▃▅ █
  38.2 μs       Histogram: log(frequency) by time      67.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
juliohm (Member) commented Aug 8, 2024

Really nice idea. Should we experiment with option (4) first? It is less disruptive and easy to try. I understand that we can simply do something like @tasks for i in 1:n batchsize=1000 to control the chunks.

juliohm added the enhancement (New feature or request) and performance labels on Aug 8, 2024
disberd (Member, Author) commented Aug 8, 2024

@juliohm I did try a bit with OhMyThreads, but it does not seem to help much here simply with @tasks (also, you can't easily break early within @tasks at the moment, see JuliaFolds2/OhMyThreads.jl#96).

Polyester, on the other hand, seems to reduce allocations overall and bring minimal overhead for a small number of points, with a speedup even compared to Threads.@threads for a high number of points.

Are you willing to have the PR use Polyester, or is there some reason you don't want to depend on it (or use @batch)?
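For reference, the minimum-batch-size idea behind Polyester's @batch minbatch=... can also be mimicked with no extra dependency by capping the number of spawned tasks so that each handles at least a minimum number of elements. A rough sketch with an illustrative name (foreach_minbatch is not a real API):

```julia
# Dependency-free sketch of the minimum-batch-size idea: spawn at most one
# task per `minbatch` elements, so small inputs take a serial fast path
# with no threading overhead at all.
function foreach_minbatch(f, xs; minbatch::Int=1000)
    n = length(xs)
    ntasks = min(Threads.nthreads(), max(1, n ÷ minbatch))
    if ntasks == 1
        foreach(f, xs)  # serial fast path: no tasks spawned
        return nothing
    end
    chunklen = cld(n, ntasks)
    @sync for chunk in Iterators.partition(xs, chunklen)
        Threads.@spawn foreach(f, chunk)
    end
    return nothing
end
```

Unlike this sketch, @batch also avoids allocating regular Tasks by reusing Polyester's lightweight threads, which is where its lower overhead comes from.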

juliohm (Member) commented Aug 8, 2024

@disberd in that case, I think it is safer to proceed with option (1). We never know how long these optimized threading packages will be maintained (e.g., LoopVectorization.jl).

disberd (Member, Author) commented Aug 8, 2024

OK, I'll do the PR with (1), but I'll post here for reference the timings I get with @batch:

Threaded (with @batch) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
 Range (min … max):  413.065 ns …  21.077 μs  ┊ GC (min … max): 0.00% … 97.52%
 Time  (median):     429.648 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   440.808 ns ± 210.052 ns  ┊ GC (mean ± σ):  0.47% ±  0.98%

    ███▇▄▃▃▁▁▁                                                  ▂
  ▃▂██████████▇█▇▆▇▆▅▆▆▆▅▅▅▅▅▅▅▅▄▄▆▆▆▆▅▆▆▆▆▆▅▆▅▄▅▆▅▆▄▆▆▆▆▆▅▅▆▇▆ █
  413 ns        Histogram: log(frequency) by time        648 ns <

 Memory estimate: 32 bytes, allocs estimate: 2.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.913 μs …  17.913 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.950 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.002 μs ± 334.257 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▆▃▂ ▂▅▅▃▁   ▁▃▃                                          ▂
  ██████▇█████▆▇▅███▇█▇▇▅▇▆▆▆▅▆▅▃▆▇█▆▆▆▅▁▃▃▅▃▄▁▄▃▅▄▅▅▃▄▅▄▄▄▅▅ █
  3.91 μs      Histogram: log(frequency) by time      4.78 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   9.067 μs … 66.767 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.300 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.774 μs ±  2.917 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄█▆▅▄▁                                                   ▁
  █▅▅█████████▇▇▇▅▆▆▆▆▅▅▆▄▅▅▄▅▄▃▄▄▄▄▅▄▄▄▅▃▄▄▄▄▃▅▃▁▄▁▄▁▄▃▁▃▄▄▅ █
  9.07 μs      Histogram: log(frequency) by time        25 μs <

 Memory estimate: 80 bytes, allocs estimate: 3.
