Commit 7e9ab31

Full scheduler::Symbol support + Keyword argument forwarding (#81)
Squashed commit message:

- tmp
- tmp generic
- tests passing
- tests
- macro API update
- fix init bug
- fixes + docstrings
- changelog
- don't allow ntasks and nchunks at the same time
- Nothing -> NotGiven in scheduler.jl
- try to fix CI (despite bug)
- use init to fix CI
- readme/index.md example
- default dynamic scheduler to ntasks=nthreads
- doc/examples update
- update tls docs
- collapse docs
Parent: 2bd4772

19 files changed: +511 −330 lines

Diff for: CHANGELOG.md (+3)

```diff
@@ -4,16 +4,19 @@ OhMyThreads.jl Changelog
 Version 0.5.0
 -------------

+- ![Feature][badge-feature] The parallel functions (e.g. `tmapreduce` etc.) now support `scheduler::Symbol` besides `scheduler::Scheduler`. To configure the selected scheduler (e.g. set `nchunks` etc.) one may now pass keyword arguments directly into the parallel functions (they will get passed on to the scheduler constructor). Example: `tmapreduce(sin, +, 1:10; chunksize=2, scheduler=:static)`. Analogous support has been added to the macro API: (most) settings (`@set name = value`) will now be passed on to the parallel functions as keyword arguments (which then forward them to the scheduler constructor). Note that, to avoid ambiguity, we don't support this feature for `scheduler::Scheduler` but only for `scheduler::Symbol`.
 - ![Feature][badge-feature] Added a `SerialScheduler` that can be used to turn off any multithreading.
 - ![Feature][badge-feature] Added `OhMyThreads.WithTaskLocals`, which represents a closure over `TaskLocalValues` but can have those values materialized as an optimization (using `OhMyThreads.promise_task_local`).
 - ![Feature][badge-feature] In the case `nchunks > nthreads()`, the `StaticScheduler` now distributes chunks in a round-robin fashion (instead of either implicitly decreasing `nchunks` to `nthreads()` or throwing an error).
 - ![Feature][badge-feature] `@set init = ...` may now be used to specify an initial value for a reduction (only has an effect in conjunction with `@set reducer=...` and triggers a warning otherwise).
+- ![Enhancement][badge-enhancement] `SerialScheduler` and `DynamicScheduler` now support the keyword argument `ntasks` as an alias for `nchunks`.
 - ![Enhancement][badge-enhancement] Made `@tasks` use `OhMyThreads.WithTaskLocals` automatically as an optimization.
 - ![Enhancement][badge-enhancement] Uses of `@local` within `@tasks` no longer require users to declare the type of the task-local value; it can be inferred automatically if a type is not provided.
 - ![BREAKING][badge-breaking] The `DynamicScheduler` (default) and the `StaticScheduler` now support a `chunksize` argument to specify the desired size of chunks instead of the number of chunks (`nchunks`). Note that `chunksize` and `nchunks` are mutually exclusive. (This is unlikely to break existing code but technically could because the type parameter has changed from `Bool` to `ChunkingMode`.)
 - ![BREAKING][badge-breaking] `DynamicScheduler` and `StaticScheduler` don't support `nchunks=0` or `chunksize=0` any longer. Instead, chunking can now be turned off via an explicit new keyword argument `chunking=false`.
 - ![BREAKING][badge-breaking] Within a `@tasks` block, task-local values must from now on be defined via `@local` instead of `@init` (renamed).
 - ![BREAKING][badge-breaking] The (already deprecated) `SpawnAllScheduler` has been dropped.
+- ![BREAKING][badge-breaking] The default value for `ntasks`/`nchunks` for `DynamicScheduler` has been changed from `2*nthreads()` to `nthreads()`. With the new value we now align with `@threads :dynamic`. The old value wasn't giving good load balancing anyway, and choosing a higher value penalizes uniform use cases even more. To get the old behavior, set `nchunks=2*nthreads()`.

 Version 0.4.6
 -------------
```
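The first changelog entry above describes the headline feature in prose; a minimal sketch of the two call styles it enables follows. Everything here is taken from the changelog text itself (the `sin` reduction over `1:10` is the changelog's own example); `init = 0.0` is an illustrative value.

```julia
using OhMyThreads: tmapreduce, @tasks

# Function API: pick a scheduler via a Symbol; extra keyword arguments
# (here `chunksize`) are forwarded to the scheduler constructor.
tmapreduce(sin, +, 1:10; chunksize=2, scheduler=:static)

# Macro API: (most) `@set name = value` settings are passed on to the
# underlying parallel function as keyword arguments in the same way.
@tasks for i in 1:10
    @set begin
        reducer = +
        scheduler = :static
        chunksize = 2
    end
    sin(i)
end

# `@set init = ...` specifies the initial value of the reduction; per the
# changelog it only has an effect together with `@set reducer = ...`.
@tasks for i in 1:10
    @set begin
        reducer = +
        init = 0.0
    end
    sin(i)
end
```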

Diff for: README.md (+17 −14)

````diff
@@ -38,21 +38,26 @@ focus on [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism), tha
 ## Example

 ```julia
-using OhMyThreads
+using OhMyThreads: tmapreduce, @tasks
+using BenchmarkTools: @btime
+using Base.Threads: nthreads

 # Variant 1: function API
-function mc_parallel(N; kw...)
-    M = tmapreduce(+, 1:N; kw...) do i
+function mc_parallel(N; ntasks=nthreads())
+    M = tmapreduce(+, 1:N; ntasks) do i
         rand()^2 + rand()^2 < 1.0
     end
     pi = 4 * M / N
     return pi
 end

 # Variant 2: macro API
-function mc_parallel_macro(N)
+function mc_parallel_macro(N; ntasks=nthreads())
     M = @tasks for i in 1:N
-        @set reducer=+
+        @set begin
+            reducer=+
+            ntasks=ntasks
+        end
         rand()^2 + rand()^2 < 1.0
     end
     pi = 4 * M / N
@@ -62,19 +67,17 @@ end
 N = 100_000_000
 mc_parallel(N) # gives, e.g., 3.14159924

-using BenchmarkTools
-
-@show Threads.nthreads() # 5 in this example
-
-@btime mc_parallel($N; scheduler=DynamicScheduler(; nchunks=1)) # effectively using 1 thread
-@btime mc_parallel($N) # using all 5 threads
+@btime mc_parallel($N; ntasks=1) # use a single task (and hence a single thread)
+@btime mc_parallel($N) # using all threads
+@btime mc_parallel_macro($N) # using all threads
 ```

-Timings might be something like this:
+With 5 threads, timings might be something like this:

 ```
-447.093 ms (7 allocations: 624 bytes)
-89.401 ms (66 allocations: 5.72 KiB)
+417.282 ms (14 allocations: 912 bytes)
+83.578 ms (38 allocations: 3.08 KiB)
+83.573 ms (38 allocations: 3.08 KiB)
 ```

 (Check out the full [Parallel Monte Carlo](https://juliafolds2.github.io/OhMyThreads.jl/stable/literate/mc/mc/) example if you like.)
````

Diff for: docs/make.jl (+1 −1)

```diff
@@ -28,7 +28,7 @@ makedocs(;
         ]
     ],
     repo = "https://github.com/JuliaFolds2/OhMyThreads.jl/blob/{commit}{path}#{line}",
-    format = Documenter.HTML(repolink = "https://github.com/JuliaFolds2/OhMyThreads.jl"))
+    format = Documenter.HTML(repolink = "https://github.com/JuliaFolds2/OhMyThreads.jl"; collapselevel = 1))

 if ci
     @info "Deploying documentation to GitHub"
```

Diff for: docs/src/index.md (+17 −14)

````diff
@@ -14,21 +14,26 @@ to add the package to your Julia environment.
 ### Basic example

 ```julia
-using OhMyThreads
+using OhMyThreads: tmapreduce, @tasks
+using BenchmarkTools: @btime
+using Base.Threads: nthreads

 # Variant 1: function API
-function mc_parallel(N; kw...)
-    M = tmapreduce(+, 1:N; kw...) do i
+function mc_parallel(N; ntasks=nthreads())
+    M = tmapreduce(+, 1:N; ntasks) do i
         rand()^2 + rand()^2 < 1.0
     end
     pi = 4 * M / N
     return pi
 end

 # Variant 2: macro API
-function mc_parallel_macro(N)
+function mc_parallel_macro(N; ntasks=nthreads())
     M = @tasks for i in 1:N
-        @set reducer=+
+        @set begin
+            reducer=+
+            ntasks=ntasks
+        end
         rand()^2 + rand()^2 < 1.0
     end
     pi = 4 * M / N
@@ -38,19 +43,17 @@ end
 N = 100_000_000
 mc_parallel(N) # gives, e.g., 3.14159924

-using BenchmarkTools
-
-@show Threads.nthreads() # 5 in this example
-
-@btime mc_parallel($N; scheduler=DynamicScheduler(; nchunks=1)) # effectively using 1 thread
-@btime mc_parallel($N) # using all 5 threads
+@btime mc_parallel($N; ntasks=1) # use a single task (and hence a single thread)
+@btime mc_parallel($N) # using all threads
+@btime mc_parallel_macro($N) # using all threads
 ```

-Timings might be something like this:
+With 5 threads, timings might be something like this:

 ```
-447.093 ms (7 allocations: 624 bytes)
-89.401 ms (66 allocations: 5.72 KiB)
+417.282 ms (14 allocations: 912 bytes)
+83.578 ms (38 allocations: 3.08 KiB)
+83.573 ms (38 allocations: 3.08 KiB)
 ```

 (Check out the full [Parallel Monte Carlo](@ref) example if you like.)
````

Diff for: docs/src/literate/integration/integration.jl (+2 −1)

```diff
@@ -29,7 +29,6 @@ end
 # interval, as a multiple of the number of available Julia threads.

 using Base.Threads: nthreads
-@show nthreads()

 N = nthreads() * 1_000_000

@@ -82,3 +81,5 @@ using BenchmarkTools
 # Because the problem is trivially parallel - all threads do the same thing and don't need
 # to communicate - we expect an ideal speedup of (close to) the number of available threads.
+
+nthreads()
```

Diff for: docs/src/literate/integration/integration.md (+15 −4)

`````diff
@@ -46,13 +46,12 @@ interval, as a multiple of the number of available Julia threads.

 ````julia
 using Base.Threads: nthreads
-@show nthreads()

 N = nthreads() * 1_000_000
 ````

 ````
-5000000
+10000000
 ````

 Calling `trapezoidal` we do indeed find the (approximate) value of $\pi$.
@@ -101,6 +100,10 @@ end
 # end
 ````

+````
+trapezoidal_parallel (generic function with 1 method)
+````
+
 First, we check the correctness of our parallel implementation.

 ````julia
@@ -120,14 +123,22 @@ using BenchmarkTools
 ````

 ````
-12.782 ms (0 allocations: 0 bytes)
-2.563 ms (37 allocations: 3.16 KiB)
+24.348 ms (0 allocations: 0 bytes)
+2.457 ms (69 allocations: 6.05 KiB)

 ````

 Because the problem is trivially parallel - all threads do the same thing and don't need
 to communicate - we expect an ideal speedup of (close to) the number of available threads.

+````julia
+nthreads()
+````
+
+````
+10
+````
+
 ---

 *This page was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*
`````

Diff for: docs/src/literate/juliaset/juliaset.jl (+4 −4)

```diff
@@ -111,12 +111,12 @@ img = zeros(Int, N, N)
 # the load balancing of the default dynamic scheduler. The latter divides the overall
 # workload into tasks that can then be dynamically distributed among threads to adjust the
 # per-thread load. We can try to fine tune and improve the load balancing further by
-# increasing the `nchunks` parameter of the scheduler, that is, creating more and smaller
-# tasks.
+# increasing the `ntasks` parameter of the scheduler, that is, creating more tasks with
+# smaller per-task workload.

 using OhMyThreads: DynamicScheduler

-@btime compute_juliaset_parallel!($img; scheduler=DynamicScheduler(; nchunks=N)) samples=10 evals=3;
+@btime compute_juliaset_parallel!($img; ntasks=N, scheduler=:dynamic) samples=10 evals=3;

 # Note that while this turns out to be a bit faster, it comes at the expense of many more
 # allocations.
@@ -126,4 +126,4 @@ using OhMyThreads: DynamicScheduler

 using OhMyThreads: StaticScheduler

-@btime compute_juliaset_parallel!($img; scheduler=StaticScheduler()) samples=10 evals=3;
+@btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;
```
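Given the `@set` forwarding introduced in this commit, the same `ntasks=N` tuning could also be written with the macro API. A sketch under stated assumptions: `compute_pixel` is a hypothetical per-pixel kernel standing in for the loop body of the full example, which this diff does not show; only the `@set ntasks = N` forwarding is confirmed by the changelog.

```julia
using OhMyThreads: @tasks

function compute_juliaset_parallel_macro!(img)
    N = size(img, 1)
    @tasks for j in 1:N
        # Forwarded to the scheduler as a keyword argument, like `ntasks=N` above.
        @set ntasks = N
        for i in 1:N
            img[i, j] = compute_pixel(i, j, N) # hypothetical kernel
        end
    end
    return img
end
```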

Diff for: docs/src/literate/juliaset/juliaset.md (+9 −9)

`````diff
@@ -121,9 +121,9 @@ img = zeros(Int, N, N)
 ````

 ````
-nthreads() = 5
-138.157 ms (0 allocations: 0 bytes)
-40.373 ms (67 allocations: 6.20 KiB)
+nthreads() = 10
+131.295 ms (0 allocations: 0 bytes)
+31.422 ms (68 allocations: 6.09 KiB)

 ````
@@ -135,17 +135,17 @@ As stated above, the per-pixel computation is non-uniform. Hence, we do benefit
 the load balancing of the default dynamic scheduler. The latter divides the overall
 workload into tasks that can then be dynamically distributed among threads to adjust the
 per-thread load. We can try to fine tune and improve the load balancing further by
-increasing the `nchunks` parameter of the scheduler, that is, creating more and smaller
-tasks.
+increasing the `ntasks` parameter of the scheduler, that is, creating more tasks with
+smaller per-task workload.

 ````julia
 using OhMyThreads: DynamicScheduler

-@btime compute_juliaset_parallel!($img; scheduler=DynamicScheduler(; nchunks=N)) samples=10 evals=3;
+@btime compute_juliaset_parallel!($img; ntasks=N, scheduler=:dynamic) samples=10 evals=3;
 ````

 ````
-31.751 ms (12011 allocations: 1.14 MiB)
+17.438 ms (12018 allocations: 1.11 MiB)

 ````
@@ -158,11 +158,11 @@ To quantify the impact of load balancing we can opt out of dynamic scheduling an
 ````julia
 using OhMyThreads: StaticScheduler

-@btime compute_juliaset_parallel!($img; scheduler=StaticScheduler()) samples=10 evals=3;
+@btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;
 ````

 ````
-63.147 ms (37 allocations: 3.26 KiB)
+30.097 ms (73 allocations: 6.23 KiB)

 ````
`````

Diff for: docs/src/literate/mc/mc.jl (+2 −2)

```diff
@@ -74,8 +74,8 @@ using Base.Threads: nthreads

 using OhMyThreads: StaticScheduler

-@btime mc_parallel($N) samples=10 evals=3;
-@btime mc_parallel($N; scheduler = StaticScheduler()) samples=10 evals=3;
+@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default
+@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;

 # ## Manual parallelization
 #
```

Diff for: docs/src/literate/mc/mc.md (+14 −14)

`````diff
@@ -34,7 +34,7 @@ mc(N)
 ````

 ````
-3.14145748
+3.14171236
 ````

 ## Parallelization with `tmapreduce`
@@ -69,7 +69,7 @@ mc_parallel(N)
 ````

 ````
-3.14134792
+3.14156496
 ````

 Let's run a quick benchmark.
@@ -86,9 +86,9 @@ using Base.Threads: nthreads
 ````

 ````
-nthreads() = 5
-317.745 ms (0 allocations: 0 bytes)
-88.384 ms (66 allocations: 5.72 KiB)
+nthreads() = 10
+301.636 ms (0 allocations: 0 bytes)
+41.864 ms (68 allocations: 5.81 KiB)

 ````
@@ -100,13 +100,13 @@ and compare the performance of static and dynamic scheduling (with default param
 ````julia
 using OhMyThreads: StaticScheduler

-@btime mc_parallel($N) samples=10 evals=3;
-@btime mc_parallel($N; scheduler=StaticScheduler()) samples=10 evals=3;
+@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default
+@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;
 ````

 ````
-88.222 ms (66 allocations: 5.72 KiB)
-88.203 ms (36 allocations: 2.98 KiB)
+41.839 ms (68 allocations: 5.81 KiB)
+41.838 ms (68 allocations: 5.81 KiB)

 ````
@@ -121,7 +121,7 @@ simulation. Finally, we fetch the results and compute the average estimate for $
 using OhMyThreads: @spawn, chunks

 function mc_parallel_manual(N; nchunks = nthreads())
-    tasks = map(chunks(1:N; n = nchunks)) do idcs # TODO: replace by `tmap` once ready
+    tasks = map(chunks(1:N; n = nchunks)) do idcs
         @spawn mc(length(idcs))
     end
     pi = sum(fetch, tasks) / nchunks
@@ -132,7 +132,7 @@ mc_parallel_manual(N)
 ````

 ````
-3.1414609999999996
+3.14180504
 ````

 And this is the performance:
@@ -142,7 +142,7 @@ And this is the performance:
 ````

 ````
-64.042 ms (31 allocations: 2.80 KiB)
+30.224 ms (65 allocations: 5.70 KiB)

 ````
@@ -161,8 +161,8 @@ end samples=10 evals=3;
 ````

 ````
-88.041 ms (0 allocations: 0 bytes)
-63.427 ms (0 allocations: 0 bytes)
+41.750 ms (0 allocations: 0 bytes)
+30.148 ms (0 allocations: 0 bytes)

 ````
`````
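For reference, the manual variant touched by the hunk above, assembled into one self-contained sketch. The `mc_parallel_manual` lines are quoted from the diff context; the body of the serial `mc` kernel is not shown in this commit, so it is reconstructed here as an assumption consistent with the `rand()^2 + rand()^2 < 1.0` sampling used in the README example.

```julia
using OhMyThreads: @spawn, chunks
using Base.Threads: nthreads

# Serial Monte Carlo kernel (reconstructed; the diff only shows its name).
function mc(N)
    M = 0 # number of samples landing in the unit quarter-circle
    for _ in 1:N
        M += rand()^2 + rand()^2 < 1.0
    end
    return 4 * M / N
end

# Manual parallelization, as in the hunk above: spawn one task per chunk
# and average the per-chunk estimates of pi.
function mc_parallel_manual(N; nchunks = nthreads())
    tasks = map(chunks(1:N; n = nchunks)) do idcs
        @spawn mc(length(idcs))
    end
    pi = sum(fetch, tasks) / nchunks
    return pi
end

mc_parallel_manual(100_000_000) # gives, e.g., 3.14180504
```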
