Compressed LBFGS (forward) operator #258
base: main
Conversation
(force-pushed from 1e51035 to 6c08717)
@dpo, @tmigot, (@amontoison, @geoffroyleconte) I think this PR is pretty mature.

using LinearAlgebra, LinearOperators  # mul! and the operators
using BenchmarkTools                  # @benchmark

n = 100000
m = 15
mem = m
lbfgs = CompressedLBFGSOperator(n; m)
classic_lbfgs = LBFGSOperator(n; mem)
for i in 1:m
  s = rand(n)
  y = rand(n)
  push!(lbfgs, s, y)
  push!(classic_lbfgs, s, y)
end
s = rand(n)
y = rand(n)
@benchmark push!(lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 145.300 μs … 6.865 ms ┊ GC (min … max): 0.00% … 96.10%
# Time (median): 159.900 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 186.701 μs ± 206.696 μs ┊ GC (mean ± σ): 3.75% ± 3.42%
# █
# ▇█▇▃▂▂▂▂▆▆▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
# 145 μs Histogram: frequency by time 440 μs <
# Memory estimate: 86.28 KiB, allocs estimate: 16.
# BenchmarkTools.Trial: 6184 samples with 1 evaluation. # n=10000, m=15
# Range (min … max): 560.800 μs … 9.411 ms ┊ GC (min … max): 0.00% … 87.77%
# Time (median): 733.700 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 796.455 μs ± 299.321 μs ┊ GC (mean ± σ): 0.95% ± 3.07%
# ▅▆█▇▇▇▄▃▁
# ▂▂▄▆██████████▇▇▅▅▄▄▃▃▃▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
# 561 μs Histogram: frequency by time 1.55 ms <
# Memory estimate: 112.86 KiB, allocs estimate: 16.
# BenchmarkTools.Trial: 741 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 5.402 ms … 18.267 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 6.557 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 6.731 ms ± 1.057 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▂▆█▂▃▄▂ ▁▂▁ ▁▃
# ▃▂▃▄▅▃████████████████▇▇▆▆▆▅▄▆▅▅▄▃▄▃▃▃▃▃▂▂▂▂▃▂▁▁▃▁▁▁▂▁▁▁▁▂ ▄
# 5.4 ms Histogram: frequency by time 9.66 ms <
# Memory estimate: 112.86 KiB, allocs estimate: 16.
@benchmark push!(classic_lbfgs, s, y)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 131.300 μs … 783.500 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 135.500 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 153.789 μs ± 51.113 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▆▄▄▂▁▁▁▁▁▂▂▁ ▁▁▁ ▁
# ███████████████▇▇▇██████▇▇▆▇▇▆▇▆▆▆▆▆▆▆▇▇▆▆▆▆▅▅▅▅▅▅▅▅▆▄▅▅▅▃▅▅▄ █
# 131 μs Histogram: log(frequency) by time 380 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 5073 samples with 1 evaluation. # n=10000, m = 15
# Range (min … max): 858.900 μs … 2.479 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 895.200 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 981.256 μs ± 176.964 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▆█▄▂▁▁▃▄▂▁▁▁▁▂▁▁ ▁▁▁ ▁
# ███████████████████████████▆▇▇▇▇▇▇▇▆▇▆▆▆▇▆▇▆▆▆▅▆▆▆▆▆▅▄▃▅▆▅▅▅▅ █
# 859 μs Histogram: log(frequency) by time 1.66 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 246 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 19.177 ms … 27.130 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 19.926 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 20.311 ms ± 1.151 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▁ ▁▁█▃▂▄
# ▅█▆████████▆▆▆▄▄▄▄▃▃▄▂▂▃▁▄▁▃▁▄▁▃▂▂▁▃▁▁▂▁▃▂▄▂▁▁▁▂▂▁▂▃▁▁▁▁▁▁▂ ▃
# 19.2 ms Histogram: frequency by time 24.6 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
Bv = similar(y)
v = ones(n)
@benchmark mul!(Bv, lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=5
# Range (min … max): 63.000 μs … 1.067 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 87.100 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 96.820 μs ± 41.151 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▁▄▇██▇▆▅▅▄▃▃▂▁▁▁ ▂
# ▂▅█████████████████████▇▇▇▇▆▇▇▅▅▆▅▆▄▆▅▆▆▅▆▅▆▆▅▆▆▅▆▅▆▅▆▅▅▆▅▆ █
# 63 μs Histogram: log(frequency) by time 273 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m=15
# Range (min … max): 116.300 μs … 801.700 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 128.700 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 134.797 μs ± 29.547 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▄▆██▆▄▃▃▂▂▂▁▁ ▂
# ████████████████▇▇▇▆▇▆▆▆▄▄▆▅▄▆▅▃▅▄▅▅▅▂▄▂▅▄▄▅▅▅▄▄▂▄▄▄▅▃▄▄▄▄▆▃▅ █
# 116 μs Histogram: log(frequency) by time 295 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 3477 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 1.101 ms … 3.437 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 1.235 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 1.412 ms ± 353.924 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃▇█▂
# ▄████▆▄▃▃▃▃▂▃▃▂▃▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
# 1.1 ms Histogram: frequency by time 2.6 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
@benchmark mul!(Bv, classic_lbfgs, v)
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 5
# Range (min … max): 109.700 μs … 742.600 μs ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 111.600 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 131.146 μs ± 49.643 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▄▄▃▂ ▁ ▁▁▂▁▁ ▁ ▁
# ████████████████▇▇▇████▇█▇▆▇▆▇▇▇▆▇▇▅▆▆▆▆▆▆▆▆▇▆▅▅▆▅▆▅▅▅▅▅▄▅▅▄▄ █
# 110 μs Histogram: log(frequency) by time 339 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 10000 samples with 1 evaluation. # n=10000, m = 15
# Range (min … max): 311.800 μs … 25.272 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 327.500 μs ┊ GC (median): 0.00%
# Time (mean ± σ): 394.421 μs ± 279.993 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# █▄▃▃▃▃▂▃▄▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▁
# ████████████████████████████▇██▇██▇▇▆▆▇▆▇▆▇▆▆▆▆▆▆▆▆▆▅▆▅▅▆▄▄▄▅ █
# 312 μs Histogram: log(frequency) by time 905 μs <
# Memory estimate: 0 bytes, allocs estimate: 0.
# BenchmarkTools.Trial: 951 samples with 1 evaluation. # n=100000, m=15
# Range (min … max): 4.579 ms … 9.967 ms ┊ GC (min … max): 0.00% … 0.00%
# Time (median): 5.061 ms ┊ GC (median): 0.00%
# Time (mean ± σ): 5.241 ms ± 560.678 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
# ▃█▅▄▂▃▁
# ▂▃▇███████▇▆▆▄▆▅▆▆▅▄▄▅▅▅▄▄▄▅▄▄▄▄▃▂▃▃▂▃▂▂▃▂▂▁▁▂▂▂▁▂▁▂▁▁▂▁▁▂▁ ▃
# 4.58 ms Histogram: frequency by time 6.88 ms <
# Memory estimate: 0 bytes, allocs estimate: 0.
Some allocations remain in the compressed operator's push!. Next, I plan to include some tests for the CUDA architecture. |
Hi, thanks for this PR. LinearOperators.jl/src/lbfgs.jl Line 201 in a204372
Is it possible to make a similar change here? And is it possible to remove CUDA from the deps? I can only see 3 lines where you use CUDA functions; maybe you could add something for the user to specify directly the type they need if they want to use CUDA? See for example: LinearOperators.jl/src/constructors.jl Line 28 in a204372
|
It's more generic if you add an argument with the storage type |
src/compressed_lbfgs.jl (outdated)
# step 7
mul!(Bv, view(op.Yₖ, :, 1:op.k), view(op.sol, 1:op.k))
mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, (T)(-1))
Suggested change:
- mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, (T)(-1))
+ mul!(Bv, view(op.Sₖ, :, 1:op.k), view(op.sol, op.k+1:2*op.k), -op.α, -one(T))
Adding this, the user can directly specify the type they need, for both the |
You suppose that GPU == Nvidia, but if tomorrow we want to use the same operator on Intel GPUs, will you add oneAPI.jl as a dependency? |
For now, yes.
I guess it will need oneAPI as a dependency.
In the end, it multiplies GPU packages as dependencies, but the exact same code will work on every architecture. Personally, I think it is better if the user doesn't need to spell out the type of the data structures, but the drawback is having GPU packages as dependencies :/ |
I agree with @amontoison here. Maybe you could use Requires.jl. If it is not possible, I suggest you make this PR work on the CPU only, and once it is merged we can discuss the GPU compatibility in another PR. |
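For illustration, a minimal sketch of the Requires.jl approach, assuming a default_vector_type helper (the helper name and bodies are illustrative, not the package's actual code):

using Requires

# CPU default, used when no GPU package is loaded.
default_vector_type(::Type{T}) where {T} = Vector{T}

# Inside the package module:
function __init__()
  @require CUDA="052768ef-5323-5732-b1bb-66c8b64840ba" begin
    # Evaluated only when the user loads CUDA.jl themselves, so CUDA
    # never becomes a hard dependency of this package.
    default_vector_type(::Type{T}) where {T} = CUDA.CuVector{T}
  end
end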
That's true, but I think it is the user's loss.
I don't mind, even if I don't like it. |
Then I would advise not adding CUDA to the dependencies. We used the following pattern multiple times in this package:
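(Roughly, the pattern is a storage-type keyword on each constructor; sketched here on a toy type, illustrative rather than the package's exact code:)

# The S keyword selects the storage type; a user can pass, e.g.,
# S = CuVector{Float64} when CUDA.jl is loaded, and the package
# itself never depends on CUDA.
struct ToyOperator{T, V <: AbstractVector{T}}
  buffer::V
end
ToyOperator(T::DataType, n::Int; S::DataType = Vector{T}) = ToyOperator{T, S}(S(undef, n))

op = ToyOperator(Float64, 10)  # CPU storage by default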
Using what you propose would mean changing all these functions too if we want to have a coherent API for the whole package. |
It would be better: the user would not have to change it manually. My motivation for implementing |
I don't see a big difference between |
My code needs neither of
If this corner case happens, the user will give their desired
I am not sure why I would get errors or slow conversions, which would not be the case if I implemented it the other way. Wanting structures not tied to the architecture is, for me, a small minority of use cases. |
(force-pushed from 5f48fc8 to 4a293cc)
@geoffroyleconte I added Requires and removed CUDA from the dependencies, and the tests pass (for the CPU). @amontoison unfortunately, I got (at least) one issue with the GPU :/ |
Paul, could you test with the master branch of CUDA.jl? |
It seems to work on GPU now, but it struggles to invert 3 intermediate matrices :/ |
Let's not add submodules, please. |
I still think that removing CUDA entirely is a better solution. As mentioned by @amontoison, you would also be able to use other GPU backends. You could add a constructor for your |
It may use other GPU backends, and it can be done automatically by adding 2 lines in the Requires part (for each GPU backend). |
Codecov Report
Base: 97.32% // Head: 96.18% // Decreases project coverage by -1.14%.
Additional details and impacted files:
@@ Coverage Diff @@
## main #258 +/- ##
==========================================
- Coverage 97.32% 96.18% -1.14%
==========================================
Files 14 15 +1
Lines 1009 1102 +93
==========================================
+ Hits 982 1060 +78
- Misses 27 42 +15
☔ View full report at Codecov. |
Paul, the code will also work on any architecture if you use what Geoffroy and I suggest. |
I know. Between CompressedLinearOperator(n), runnable on any architecture, and CompressedLinearOperator(n; T=..., V=CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}, M=CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer}), which must be changed for each architecture, the first option is clearly better (for me). |
The issue is that I didn't understand what you want to achieve. Your example has a big drawback: you will have a different behavior depending on whether you have a GPU or not, and the user doesn't have control over it. In practice, we don't want to always use an operator or routine specialized for GPUs. About your previous message, the type of the GPU buffer is not required in the constructor. So, the question is: do we prefer:
|
I tried without it, but it failed to type correctly
and adding
There is no drawback, because the user still has control:
CompressedLBFGSOperator(n::Int; mem::Int=5, T=Float64, M=default_matrix_type(; T), V=default_vector_type(; T))
the trick is that the defaults switch automatically when a GPU backend module is loaded.
I followed the suggestion. If several GPU backend modules are loaded, fair enough: the user will have to type |
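For reference, a sketch of what default_matrix_type / default_vector_type could look like (only the names come from the signature above; the bodies are illustrative):

# CPU fallbacks.
default_matrix_type(; T::DataType = Float64) = Matrix{T}
default_vector_type(; T::DataType = Float64) = Vector{T}

# In the Requires block, two lines per GPU backend would override the
# defaults when that backend's module is loaded, e.g. for CUDA.jl:
#   default_matrix_type(; T::DataType = Float64) = CUDA.CuMatrix{T, CUDA.Mem.DeviceBuffer}
#   default_vector_type(; T::DataType = Float64) = CUDA.CuVector{T, CUDA.Mem.DeviceBuffer}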
The LBFGS operator in this package uses scalar indexing, so it is not made for GPUs anyway. LinearOperators.jl/src/constructors.jl Line 39 in 9591943
(you can use S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer} for example). Having multiple ways to do the same thing for different operators is confusing. I am not against changing the API for the whole package, but I think that this should not be done in this PR. |
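For comparison, the suggested explicit-storage call would look something like this (hypothetical; it assumes the compressed operator adopts the same S keyword as the existing constructors):

using CUDA  # loaded by the user, not by LinearOperators.jl
S = CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}
# op = CompressedLBFGSOperator(Float64, n; mem = 5, S = S)  # hypothetical signature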
view(data.inverse_intermediate_1, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_1[1:2*data.k, 1:2*data.k])
view(data.inverse_intermediate_2, 1:2*data.k, 1:2*data.k) .= inv(data.intermediate_2[1:2*data.k, 1:2*data.k])
inv has been used while waiting for a better solution. It is performed only when an update occurs, and the dimension of the inverted matrix is related to m, not to n.
You could use LAPACK.getrf! and LAPACK.getri! on data.intermediate_1[1:2*data.k, 1:2*data.k] and data.intermediate_2[1:2*data.k, 1:2*data.k]. getrf! computes a dense LU decomposition in-place; getri! uses the factors of the LU decomposition to compute the inverse.
I added a dispatch in CUDA.jl and AMDGPU.jl for these LAPACK calls.
https://github.com/JuliaGPU/CUDA.jl/blob/master/lib/cusolver/dense.jl#L895-L926
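For concreteness, a minimal sketch of the suggested in-place inverse (the helper name is mine; the random matrix stands in for the 2k-by-2k intermediate buffers):

using LinearAlgebra

function inv_inplace!(A::StridedMatrix{Float64})
  # getrf! overwrites A with its LU factors and returns the pivot indices.
  _, ipiv, info = LinearAlgebra.LAPACK.getrf!(A)
  info == 0 || error("matrix is singular to working precision")
  # getri! reuses the LU factors and pivots to overwrite A with inv(A).
  LinearAlgebra.LAPACK.getri!(A, ipiv)
  return A
end

A = rand(6, 6) + 6I   # stand-in for data.intermediate_1[1:2*data.k, 1:2*data.k]
Ainv = inv_inplace!(copy(A))
@assert A * Ainv ≈ I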
@dpo, this is the first version of an implementation of a compressed LBFGS (forward) operator. I made a first structure, as well as a Matrix interface and a mul! method. There is a lot of room for improvement, but right now it is functional. I didn't add tests for now.
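For context, "compressed" here refers to the compact representation of limited-memory BFGS; the forward (Hessian) operator presumably follows Byrd, Nocedal and Schnabel (1994). Roughly, with $B_0 = \sigma I$:

$$B_k = B_0 - \begin{bmatrix} B_0 S_k & Y_k \end{bmatrix} \begin{bmatrix} S_k^\top B_0 S_k & L_k \\ L_k^\top & -D_k \end{bmatrix}^{-1} \begin{bmatrix} S_k^\top B_0 \\ Y_k^\top \end{bmatrix},$$

where $S_k$ and $Y_k$ stack the last $m$ pairs $(s, y)$ columnwise, $L_k$ is the strictly lower triangular part of $S_k^\top Y_k$, and $D_k$ is its diagonal. The middle matrix is only $2m \times 2m$, which is why the inverses discussed above involve 1:2*data.k blocks rather than anything of size $n$.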