
threaded version of sideof(::Point, ::Ring) is significantly slower than serial for small rings #988

Closed
disberd opened this issue Aug 8, 2024 · 4 comments · Fixed by #993
Labels: enhancement (New feature or request), performance

Comments

disberd (Member) commented Aug 8, 2024

The default multithreading for sideof(::Point, ::Ring) only seems to pay off for rings with more than 1000 points (the exact number depends on Threads.nthreads() and probably on the system/Julia version).

Given that the number of threads used for this cannot be controlled without forcing the same number of threads for the whole Julia session, I think it would be good to default to the serial version for rings with 1000 points or fewer. This can significantly speed up operations like point in GeometrySet where the geometry set is composed of many small PolyAreas (as in the case of NaturalEarth borders at 110m resolution).

I'd be willing to help if you are interested. Alternative implementations are:

  1. Refactor the sideof code into a core part that performs the computation on a single segment, and have the outer sideof method call the core either threaded or serial depending on the number of vertices.
     • This has, in my opinion, the best potential performance and does not require additional deps, but it requires more code refactoring (though not that much, I think).
  2. Use something like ChunkSplitters.jl to make sure the threaded chunks have at least, e.g., 1000 elements. This still has some threading overhead but does not require as much refactoring.
  3. Use @batch from Polyester, which has lower threading overhead and allows specifying the minimum batch size.
  4. Use OhMyThreads.jl to do something similar to points 2/3 above, just as an alternative package for the implementation.
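For illustration, the first option could be sketched roughly as below. All names (segment_kernel, reduce_segments) and the 1000-element cutoff are illustrative stand-ins, not the actual Meshes.jl internals:

```julia
# Hypothetical sketch of option 1: a serial per-element kernel plus a
# size-based dispatch between a serial and a threaded outer loop.
const SERIAL_THRESHOLD = 1000

# stand-in for the real per-segment computation in sideof
segment_kernel(x) = x^2

function reduce_segments(xs)
    n = length(xs)
    # small inputs: plain serial reduction, zero threading overhead
    n <= SERIAL_THRESHOLD && return sum(segment_kernel, xs)
    # large inputs: one chunk per thread, combined at the end
    chunks = collect(Iterators.partition(eachindex(xs), cld(n, Threads.nthreads())))
    partials = Vector{Float64}(undef, length(chunks))
    Threads.@threads for c in eachindex(chunks)
        partials[c] = sum(i -> segment_kernel(xs[i]), chunks[c])
    end
    return sum(partials)
end
```

The real sideof kernel also needs early exit (when a point lies exactly on a segment), which is trivial in the serial branch and is one of the refactoring details the threaded branch would have to handle.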

Benchmarks

Here are some benchmarks below comparing the latest Meshes@0.48.1 (threaded) against Meshes@0.46.1 (serial):

using Meshes, BenchmarkTools

sampledcircle(n) = sample(Sphere((0, 0), 1), RegularSampling(n)) |> splat(Ring)

function bench(ns, p)
    for n in ns
        s = sampledcircle(n)
        b = @benchmark sideof($p, $s)
        println("#======  Benchmark for n = $n ======#")
        display(b)
        println()
    end
end

bench([100, 1000, 10000], Point(2, 2))

Threaded (v0.48.1) - Windows - Julia 1.10 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.186 μs …  12.127 ms  ┊ GC (min … max):  0.00% … 99.74%
 Time  (median):     6.471 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   7.912 μs ± 121.278 μs  ┊ GC (mean ± σ):  15.77% ±  1.38%

      ▂▄▇▇█▅▄▁  ▂▄▄▄▆▅▆▄▅▃▃
  ▁▂▆███████████████████████▇▆▅▄▄▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▁ ▄
  4.19 μs         Histogram: frequency by time        12.2 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):   7.567 μs …  18.097 ms  ┊ GC (min … max):  0.00% … 99.72%
 Time  (median):     11.850 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   16.869 μs ± 181.019 μs  ┊ GC (mean ± σ):  10.70% ±  1.00%

      ▇█▅▃▁
  ▁▂▃██████▆▄▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  7.57 μs         Histogram: frequency by time         41.4 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  20.500 μs … 520.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   36.234 μs ±  14.817 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▁▇██▄▁ ▁
  ▁▂▂▂▂▃▄▅█████████▅▅▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  20.5 μs         Histogram: frequency by time         89.7 μs <

 Memory estimate: 5.27 KiB, allocs estimate: 43.

Threaded (v0.48.1) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.557 μs …  2.430 ms  ┊ GC (min … max): 0.00% … 99.03%
 Time  (median):     6.700 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.559 μs ± 37.753 μs  ┊ GC (mean ± σ):  8.52% ±  1.71%

              ▂▆▅█▇▇▃
  ▁▂▃▄▅▅▆▆▇▇██████████▇▆▄▄▂▂▂▂▂▂▂▁▁▁▁▂▂▁▁▁▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁ ▃
  4.56 μs        Histogram: frequency by time        12.7 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):   7.583 μs …  4.616 ms  ┊ GC (min … max): 0.00% … 99.22%
 Time  (median):     11.917 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   14.675 μs ± 56.185 μs  ┊ GC (mean ± σ):  5.25% ±  1.40%

        ▃██▇▂
  ▁▂▄▅▆███████▅▄▃▃▂▂▂▂▂▂▃▄▄▄▃▃▂▂▂▂▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  7.58 μs         Histogram: frequency by time        34.8 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  19.300 μs … 287.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.957 μs ±  14.075 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▄█▆▃
  ▁▁▂▃▄▃▃▄▇████▆▄▃▃▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  19.3 μs         Histogram: frequency by time          100 μs <

 Memory estimate: 5.28 KiB, allocs estimate: 44.

Threaded/Serial (v0.48.1) - Windows - Julia 1.11 - 1 thread

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.260 μs …  16.080 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.529 μs ± 405.729 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▅▆▅▅▅▄▄▃▃▂▁▂▁ ▁▁▁▂▁▂▁▂▂▁▁ ▁                              ▂
  ██████████████████████████████▇▇██▇▆██▇▇▆▇▇▆▅▆▅▆▅▅▅▅▅▅▄▅▄▂▄ █
  1.26 μs      Histogram: log(frequency) by time      3.11 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.814 μs … 29.029 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.957 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.257 μs ±  1.156 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▆▆▄▃▂▂ ▁▁▁                                                ▁
  ██████████████▇▇████▇██▇██▇▅▇▆▅▆▅▆▅▆▄▅▆▆▆▄▅▅▅▃▄▅▄▄▄▅▃▃▃▃▄▄ █
  4.81 μs      Histogram: log(frequency) by time     10.2 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  40.100 μs … 140.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     41.300 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   43.554 μs ±   7.712 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▄▇▄▂▁▂▁▁▁    ▁                                              ▁
  ███████████████▇▇▇▇▇▇▇▆█▇▇▇▇▇█▇▇▆▇▆▇▇▆▅▅▆▆▅▆▆▅▆▇▆▆▆▅▆▆▅▅▅▅▅▆ █
  40.1 μs       Histogram: log(frequency) by time      77.9 μs <

 Memory estimate: 752 bytes, allocs estimate: 9.

Serial (v0.46.1) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 201 evaluations.
 Range (min … max):  396.517 ns …  1.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     399.502 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   434.504 ns ± 77.214 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂▆▁▄▁▂ ▁▁      ▁  ▁         ▁   ▁                           ▁
  ██████████▇█████████▇███▇▇█▇▇█▇▇▇█▇█▇█▇▇▇▇▇▇▆▆▅▅▆▅▅▅▅▅▅▄▄▄▅▆ █
  397 ns        Histogram: log(frequency) by time       716 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.850 μs …  20.913 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.862 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.043 μs ± 639.322 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █ ▅▁▂▁ ▁                                                    ▁
  ██████▆█▇▇▇▇█▇▇▆▆▇▇▇▇▆▆▇▆▇▆▆▆▆▅▅▅▅▄▅▄▅▅▄▅▄▅▆▆▄▅▅▅▅▅▅▅▅▅▄▅▇▇ █
  3.85 μs      Histogram: log(frequency) by time      6.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  38.200 μs … 254.800 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   40.533 μs ±   6.719 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▁▅▁▃▁ ▂                                                     ▁
  ██████▇█▇██▇█▇▇▇▇▇▇▆▇█▇▇█▇▆▆▆▆▆▅▄▅▃▅▅▆▅▅▆▅▅▅▆▅▅▅▅▆▅▄▅▃▅▄▄▅▃▅ █
  38.2 μs       Histogram: log(frequency) by time      67.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
juliohm (Member) commented Aug 8, 2024

Really nice idea. Should we experiment with option (4) first? It is less disruptive and easy to try. I understand that we can simply do something like @tasks for i in 1:n batchsize=1000 to control the chunks.

juliohm added the enhancement (New feature or request) and performance labels on Aug 8, 2024
disberd (Member, Author) commented Aug 8, 2024

@juliohm I did try a bit with OhMyThreads, but it does not seem to help much here simply with @tasks (also, you can't easily break early within @tasks at the moment, see JuliaFolds2/OhMyThreads.jl#96).

Polyester, on the other hand, seems to reduce allocations overall and bring minimal overhead for a small number of points, with a speedup even compared to Threads.@threads for a high number of points.

Are you willing to have the PR use Polyester, or is there some reason you don't want to depend on it (or use @batch)?
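For reference, the minimum-batch-size idea behind Polyester's @batch minbatch=... can also be mimicked with no extra dependency by capping the number of spawned tasks so that each handles at least a minimum number of elements. A rough sketch with an illustrative name (foreach_minbatch is not a real API):

```julia
# Dependency-free sketch of the minimum-batch-size idea: spawn at most one
# task per `minbatch` elements, so small inputs take a serial fast path
# with no threading overhead at all.
function foreach_minbatch(f, xs; minbatch::Int=1000)
    n = length(xs)
    ntasks = min(Threads.nthreads(), max(1, n ÷ minbatch))
    if ntasks == 1
        foreach(f, xs)  # serial fast path: no tasks spawned
        return nothing
    end
    chunklen = cld(n, ntasks)
    @sync for chunk in Iterators.partition(xs, chunklen)
        Threads.@spawn foreach(f, chunk)
    end
    return nothing
end
```

Unlike this sketch, @batch also avoids allocating regular Tasks by reusing Polyester's lightweight threads, which is where its lower overhead comes from.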

juliohm (Member) commented Aug 8, 2024

@disberd in that case, I think it is safer to proceed with option (1). We never know how long these optimized threading packages will be maintained (e.g., LoopVectorization.jl).

disberd (Member, Author) commented Aug 8, 2024

OK, I'll do the PR with (1), but I'll post here for reference the timings I get with @batch:

Threaded (with @batch) - Windows - Julia 1.11 - 8 threads

julia> bench([100, 1000, 10000], Point(2,2))
#======  Benchmark for n = 100 ======#
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
 Range (min … max):  413.065 ns …  21.077 μs  ┊ GC (min … max): 0.00% … 97.52%
 Time  (median):     429.648 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   440.808 ns ± 210.052 ns  ┊ GC (mean ± σ):  0.47% ±  0.98%

    ███▇▄▃▃▁▁▁                                                  ▂
  ▃▂██████████▇█▇▆▇▆▅▆▆▆▅▅▅▅▅▅▅▅▄▄▆▆▆▆▅▆▆▆▆▆▅▆▅▄▅▆▅▆▄▆▆▆▆▆▅▅▆▇▆ █
  413 ns        Histogram: log(frequency) by time        648 ns <

 Memory estimate: 32 bytes, allocs estimate: 2.

#======  Benchmark for n = 1000 ======#
BenchmarkTools.Trial: 10000 samples with 8 evaluations.
 Range (min … max):  3.913 μs …  17.913 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.950 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.002 μs ± 334.257 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▆▃▂ ▂▅▅▃▁   ▁▃▃                                          ▂
  ██████▇█████▆▇▅███▇█▇▇▅▇▆▆▆▅▆▅▃▆▇█▆▆▆▅▁▃▃▅▃▄▁▄▃▅▄▅▅▃▄▅▄▄▄▅▅ █
  3.91 μs      Histogram: log(frequency) by time      4.78 μs <

 Memory estimate: 32 bytes, allocs estimate: 2.

#======  Benchmark for n = 10000 ======#
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):   9.067 μs … 66.767 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     10.300 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.774 μs ±  2.917 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄█▆▅▄▁                                                   ▁
  █▅▅█████████▇▇▇▅▆▆▆▆▅▅▆▄▅▅▄▅▄▃▄▄▄▄▅▄▄▄▅▃▄▄▄▄▃▅▃▁▄▁▄▁▄▃▁▃▄▄▅ █
  9.07 μs      Histogram: log(frequency) by time        25 μs <

 Memory estimate: 80 bytes, allocs estimate: 3.
