Compressed LBFGS (forward) operator #258
base: main
Conversation
(force-pushed from 1e51035 to 6c08717)
@dpo, @tmigot, (@amontoison, @geoffroyleconte) I think this PR is pretty mature.

using LinearAlgebra, LinearOperators  # mul! and the operators
using BenchmarkTools                  # @benchmark

n = 100000
m = 15
mem = m
lbfgs = CompressedLBFGSOperator(n; m)
classic_lbfgs = LBFGSOperator(n; mem)
for i in 1:m
  s = rand(n)
  y = rand(n)
  push!(lbfgs, s, y)
  push!(classic_lbfgs, s, y)
end
s = rand(n)
y = rand(n)
@benchmark push!(lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 145.300 μs … 6.865 ms ┊ GC (min … max): 0.00% … 96.10%
# Time (median): 159.900 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 186.701 μs ± 206.696 μs ┊ GC (mean ± σ): 3.75% ± 3.42%
# █
# ▇█▇▃▂▂▂▂▆▆▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
# 145 μs Histogram: frequency by time 440 μs <
# Memory estimate: 86.28 KiB, allocs estimate: 16.
# BenchmarkTools.Trial: 6184 samples with 1 evaluation. # n=10000, m=15
# Range (min … max): 560.800 μs … 9.411 ms ┊ GC (min … max): 0.00% … 87.77%
# Time (median): 733.700 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 796.455 μs ± 299.321 μs ┊ GC (mean ± σ): 0.95% ± 3.07%
# ▅▆█▇▇▇▄▃▁
# ▂▂▄▆██████████▇▇▅▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
# 561 μs Histogram: frequency by time 1.55 ms <
# Memory estimate: 112.86 KiB, allocs estimate: 16.
# BenchmarkTools.Trial: 741 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 5.402 ms … 18.267 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 6.557 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 6.731 ms ± 1.057 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▂▆█▂▃▄▂ ▁▂▁ ▁▃
# ▃▂▃▄▅▃████████████████▇▇▆▆▆▅▄▆▅▅▄▃▄▃▃▃▃▃▂▂▂▂▃▂▁▁▃▁▁▁▂▁▁▁▁▂ ▄
# 5.4 ms Histogram: frequency by time 9.66 ms <
# Memory estimate: 112.86 KiB, allocs estimate: 16.
@benchmark push!(classic_lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 131.300 μs … 783.500 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 135.500 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 153.789 μs ± 51.113 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▆▄▄▂▁▁▁▁▁▂▂▁ ▁▁▁ ▁
# ███████████████▇▇▇██████▇▇▆▇▇▆▇▆▆▆▆▆▆▆▇▇▆▆▆▆▅▅▅▅▅▅▅▅▆▄▅▅▅▃▅▅▄ █
# 131 μs Histogram: log(frequency) by time 380 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 5073 samples with 1 evaluation. # n=10000, m = 15
# Range (min … max): 858.900 μs … 2.479 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 895.200 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 981.256 μs ± 176.964 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▆█▄▂▁▁▃▄▂▁▁▁▁▂▁▁ ▁▁▁ ▁
# ███████████████████████████▆▇▇▇▇▇▇▇▆▇▆▆▆▇▆▇▆▆▆▅▆▆▆▆▆▅▄▃▅▆▅▅▅▅ █
# 859 μs Histogram: log(frequency) by time 1.66 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 246 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 19.177 ms … 27.130 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 19.926 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 20.311 ms ± 1.151 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▁ ▁▁█▃▂▄
# ▅█▆████████▆▆▆▄▄▄▄▃▃▄▂▂▃▁▄▁▃▁▄▁▃▂▂▁▃▁▁▂▁▃▂▄▂▁▁▁▂▂▁▂▃▁▁▁▁▁▁▂ ▃
# 19.2 ms Histogram: frequency by time 24.6 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
Bv = similar(y)
v = ones(n)
@benchmark mul!(Bv, lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=5
# Range (min … max): 63.000 μs … 1.067 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 87.100 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 96.820 μs ± 41.151 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▁▄▇██▇▆▅▅▄▃▃▂▁▁▁ ▂
# ▂▅█████████████████████▇▇▇▇▆▇▇▅▅▆▅▆▄▆▅▆▆▅▆▅▆▆▅▆▆▅▆▅▆▅▆▅▅▆▅▆ █
# 63 μs Histogram: log(frequency) by time 273 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=15
# Range (min … max): 116.300 μs … 801.700 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 128.700 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 134.797 μs ± 29.547 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▄▆██▆▄▃▃▂▂▂▁▁ ▂
# ████████████████▇▇▇▆▇▆▆▆▄▄▆▅▄▆▅▃▅▄▅▅▅▂▄▂▅▄▄▅▅▅▄▄▂▄▄▄▅▃▄▄▄▄▆▃▅ █
# 116 μs Histogram: log(frequency) by time 295 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 3477 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 1.101 ms … 3.437 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 1.235 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 1.412 ms ± 353.924 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃▇█▂
# ▄████▆▄▃▃▃▃▂▃▃▂▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
# 1.1 ms Histogram: frequency by time 2.6 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark mul!(Bv, classic_lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 109.700 μs … 742.600 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 111.600 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 131.146 μs ± 49.643 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▄▄▃▂ ▁ ▁▁▂▁▁ ▁ ▁
# ████████████████▇▇▇████▇█▇▆▇▆▇▇▇▆▇▇▅▆▆▆▆▆▆▆▆▇▆▅▅▆▅▆▅▅▅▅▅▄▅▅▄▄ █
# 110 μs Histogram: log(frequency) by time 339 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 15
# Range (min … max): 311.800 μs … 25.272 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 327.500 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 394.421 μs ± 279.993 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▄▃▃▃▃▂▃▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▁
# ████████████████████████████▇██▇██▇▇▆▆▇▆▇▆▇▆▆▆▆▆▆▆▆▆▅▆▅▅▆▄▄▄▅ █
# 312 μs Histogram: log(frequency) by time 905 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 951 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 4.579 ms … 9.967 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 5.061 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 5.241 ms ± 560.678 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃█▅▄▂▃▁
# ▂▃▇███████▇▆▆▄▆▅▆▆▅▄▄▅▅▅▄▄▄▅▄▄▄▄▃▂▃▃▂▃▂▂▃▂▂▁▁▂▂▂▁▂▁▂▁▁▂▁▁▂▁ ▃
# 4.58 ms Histogram: frequency by time 6.88 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
Some allocations remain in the compressed operator's push!. Next, I plan to include some tests for the CUDA architecture. |
Hi, thanks for this PR. LinearOperators.jl/src/lbfgs.jl Line 201 in a204372
Is it possible to make a similar change here? And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add something for the user to specify directly the type they need if they want to use CUDA? See for example: LinearOperators.jl/src/constructors.jl Line 28 in a204372
|
It's more generic if you add an argument with the storage type |
src/compressed_lbfgs.jl (outdated)
# step 7
mul!(Bv, view(op.Yₖ, :, 1:op.k), view(op.sol, 1:op.k))
mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, (T)(-1))
Suggested change:
- mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, (T)(-1))
+ mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, -one(T))
Adding this, the user can directly specify the type they need, for both the |
You suppose that GPU == Nvidia, but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency? |
For now, yes.
I guess it will need oneAPI as a dependency.
In the end, it multiplies GPU packages as dependencies, but the exact same code will work on every architecture. Personally, I think it is better if the user doesn't need to spell out the type of the data structures, but the drawback is having GPU packages as dependencies :/ |
I agree with @amontoison here. Maybe you could use Requires.jl. If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR. |
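For illustration, a minimal sketch of the Requires.jl approach, assuming a default_vector_type helper (the helper name and bodies are illustrative, not the package's actual code):

using Requires

# CPU default, used when no GPU package is loaded.
default_vector_type(::Type{T}) where {T} = Vector{T}

# Inside the package module:
function __init__()
  @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
    # Evaluated only when the user loads CUDA.jl themselves, so CUDA
    # never becomes a hard dependency of this package.
    default_vector_type(::Type{T}) where {T} = CUDA.CuVector{T}
  end
end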
That's true, but I think it is the user's loss.
I don't mind, even if I don't like it. |
Then I would advise not adding CUDA to the dependencies. We used the following pattern multiple times in this package:
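(Roughly, the pattern is a storage-type keyword on each constructor; sketched here on a toy type, illustrative rather than the package's exact code:)

# The S keyword selects the storage type; a user can pass, e.g.,
# S = CuVector{Float64} when CUDA.jl is loaded, and the package
# itself never depends on CUDA.
struct ToyOperator{T, V <: AbstractVector{T}}
  buffer::V
end
ToyOperator(T::DataType, n::Int; S::DataType = Vector{T}) = ToyOperator{T, S}(S(undef, n))

op = ToyOperator(Float64, 10)  # CPU storage by default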
Using what you propose would mean changing all these functions too if we want to have a coherent API for the whole package. |
It would be better: the user would not have to change it manually. My motivation for implementing |
I don't see a big difference between |
My code needs neither of
If this corner case happens, the user will give their desired
I am not sure why I would get errors or slow conversions, which would not be the case if I implemented it the other way. Wanting structures not tied to the architecture is, for me, a small minority of use cases. |
(force-pushed from 5f48fc8 to 4a293cc)
@geoffroyleconte I added Requires and removed CUDA from the dependencies, and the tests pass (for the CPU). @amontoison unfortunately, I got (at least) one issue with the GPU :/ |
Paul, could you test with the master branch of CUDA.jl? |
It seems to work on GPU now, but it struggles to invert 3 intermediate matrices :/ |
Let's not add submodules, please. |
I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your |
It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). |
Codecov Report
Base: 97.32% // Head: 96.18% // Decreases project coverage by -1.14%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #258 +/- ##
==========================================
- Coverage 97.32% 96.18% -1.14%
==========================================
Files 14 15 +1
Lines 1009 1102 +93
==========================================
+ Hits 982 1060 +78
- Misses 27 42 +15
☔ View full report at Codecov. |
Paul, the code will also work on any architecture if you use what Geoffroy and I suggest. |
I know. Between CompressedLinearOperator(n), runnable on any architecture, and CompressedLinearOperator(n; T=..., V=CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}, M=CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer}), which must be changed for each architecture, the first option is clearly better (for me). |
The issue is that I didn't understand what you want to achieve. Your example has a big drawback: you will have a different behavior depending on whether you have a GPU or not, and the user doesn't have control over it. In practice, we don't want to always use an operator or routine specialized for GPUs. About your previous message, the type of the GPU buffer is not required in the constructor. So, the question is: do we prefer:
|
I tried without it, but it failed to type correctly
and adding
There is no drawback, because the user still has control:
CompressedLBFGSOperator(n::Int; mem::Int=5, T=Float64, M=default_matrix_type(; T), V=default_vector_type(; T))
the trick is that the defaults switch automatically when a GPU backend module is loaded.
I followed the suggestion. If several GPU backend modules are loaded, fair enough: the user will have to type |
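For reference, a sketch of what default_matrix_type / default_vector_type could look like (only the names come from the signature above; the bodies are illustrative):

# CPU fallbacks.
default_matrix_type(; T::DataType = Float64) = Matrix{T}
default_vector_type(; T::DataType = Float64) = Vector{T}

# In the Requires block, two lines per GPU backend would override the
# defaults when that backend's module is loaded, e.g. for CUDA.jl:
#   default_matrix_type(; T::DataType = Float64) = CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer}
#   default_vector_type(; T::DataType = Float64) = CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}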
The LBFGS operator in this package uses scalar indexing, so it is not made for GPUs anyway. LinearOperators.jl/src/constructors.jl Line 39 in 9591943
(you can use S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer} for example). Having multiple ways to do the same thing for different operators is confusing. I am not against changing the API for the whole package, but I think that this should not be done in this PR. |
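For comparison, the suggested explicit-storage call would look something like this (hypothetical; it assumes the compressed operator adopts the same S keyword as the existing constructors):

using CUDA  # loaded by the user, not by LinearOperators.jl
S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}
# op = CompressedLBFGSOperator(Float64, n; mem = 5, S = S)  # hypothetical signature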
view(data.inverse_intermediate_1, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_1[1:2*data.k, 1:2*data.k])
view(data.inverse_intermediate_2, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_2[1:2*data.k, 1:2*data.k])
inv has been used while waiting for a better solution. It is performed only when an update occurs, and the dimension of the inverted matrix is related to m, not to n.
You could use LAPACK.getrf! and LAPACK.getri! on data.intermediate_1[1:2*data.k, 1:2*data.k] and data.intermediate_2[1:2*data.k, 1:2*data.k]. getrf! computes a dense LU decomposition in-place; getri! uses the factors of the LU decomposition to compute the inverse.
I added a dispatch in CUDA.jl and AMDGPU.jl for these LAPACK calls.
https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusolver/dense.jl#L895-L926
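For concreteness, a minimal sketch of the suggested in-place inverse (the helper name is mine; the random matrix stands in for the 2k-by-2k intermediate buffers):

using LinearAlgebra

function inv_inplace!(A::StridedMatrix{Float64})
  # getrf! overwrites A with its LU factors and returns the pivot indices.
  _, ipiv, info = LinearAlgebra.LAPACK.getrf!(A)
  info == 0 || error("matrix is singular to working precision")
  # getri! reuses the LU factors and pivots to overwrite A with inv(A).
  LinearAlgebra.LAPACK.getri!(A, ipiv)
  return A
end

A = rand(6, 6) + 6I   # stand-in for data.intermediate_1[1:2*data.k, 1:2*data.k]
Ainv = inv_inplace!(copy(A))
@assert A * Ainv ≈ I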
@dpo, this is the first version of an implementation of a compressed LBFGS (forward) operator. I made a first structure, as well as a Matrix interface and a mul! method. There is a lot of room for improvement, but right now it is functional. I didn't add tests for now.
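For context, "compressed" here refers to the compact representation of limited-memory BFGS; the forward (Hessian) operator presumably follows Byrd, Nocedal and Schnabel (1994). Roughly, with $B_0 = \sigma I$:

$$B_k = B_0 - \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix} \begin{bmatrix} S_k^\top B_0 S_k & L_k \\ L_k^\top & -D_k \end{bmatrix}^{-1} \begin{bmatrix} S_k^\top B_0 \\ Y_k^\top \end{bmatrix},$$

where $S_k$ and $Y_k$ stack the last $m$ pairs $(s, y)$ columnwise, $L_k$ is the strictly lower triangular part of $S_k^\top Y_k$, and $D_k$ is its diagonal. The middle matrix is only $2m \times 2m$, which is why the inverses discussed above involve 1:2*data.k blocks rather than anything of size $n$.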