improve ODE performance #128

Merged
merged 11 commits from roger/hamiltonian-expr into master on Feb 22, 2022
Conversation

@Roger-luo Roger-luo (Member) commented Feb 17, 2022

After some attempts, I decided to do a quick patch first instead of a complete rewrite of the Hamiltonian expr. This basically doesn't change any APIs for the performance issue, with some ugly workarounds. There are still a few things left to do:

  • For an individual pulse we are not using the high-performance intrinsics implemented in Yao, but naive matrix-vector multiplication. On the Yao side, those intrinsics do not make use of fma instructions for ODE solvers, because they were designed for gates previously (so I'd expect ~20% speedup in total from this), see obey mul! convention QuantumBFS/BQCESubroutine.jl#37
  • The current implementation for automatically deciding which term is constant is quite dumb (but works). Ideally, if we had more generic terms like Sum(i->Omega * i, X, 1:N) + Sum(i->2i, N, 1:N), we could implement this transform as merging similar terms (see the sketch after this list). But this would require more work and doesn't fit well with YaoBlocks at the moment (YaoBlocks can't do general pattern matching & rewriting)
  • I dropped real-layout support in this PR, since I'd like to have that supported automatically in a separate PR that switches to StructArrays
  • I'll fix the CUDA part in a separate PR later
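
As a hedged illustration of the "merging similar terms" idea mentioned above: the Sum type and merge_similar function below are hypothetical names for this sketch only, not part of YaoBlocks or this PR.

# Hypothetical sketch: a generic site-indexed sum, and a naive merge of two
# sums that share the same local operator and site range by adding their
# coefficient functions.
struct Sum{F,O}
    coeff::F                 # i -> coefficient on site i
    op::O                    # local operator, e.g. X or N
    sites::UnitRange{Int}
end

function merge_similar(a::Sum, b::Sum)
    if a.op === b.op && a.sites == b.sites
        return Sum(i -> a.coeff(i) + b.coeff(i), a.op, a.sites)
    end
    return nothing           # not similar; caller keeps the terms separate
end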

@jon-wurtz's QuSpin benchmark for this PR as a reference, tested on an AWS EC2 c5a.xlarge (AMD CPU). QuSpin:

/home/ubuntu/EaRyd/quspin-benchmark.py:77: UserWarning: Test for symmetries not implemented for <class 'quspin.basis.basis_1d.spin.spin_basis_1d'>, to turn off this warning set check_symm=False in hamiltonian
  ham = quspin.operators.hamiltonian(static,dynamic,N=N)
Time to compute evolved state: 24.525sec

EaRyd (including compilation, first-time execution):

julia> include("test.jl")
[ Info: Precompiling EaRyd [bd27d05e-4ce1-5e79-84dd-c5d7d508bbe1]
 33.241566 seconds (17.73 M allocations: 1.218 GiB, 1.84% gc time, 63.07% compilation time)

Excluding compilation time:

julia> @time emulate!(odesolve);
 11.981559 seconds (2.89 k allocations: 320.239 MiB, 1.91% gc time)

We can include some precompile statements to get rid of that compile time for the default solver, but I think that's going to be in another PR.
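
For reference, a minimal sketch of such a precompile statement; ODEEvolution here is a placeholder argument type, not necessarily the actual type emulate! is called with.

# Hypothetical sketch: inside the package's top-level module, force
# ahead-of-time compilation of the hot entry point for the default solver.
# `ODEEvolution` is a placeholder type name (assumption).
precompile(emulate!, (ODEEvolution,))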

Why is QuSpin slower?

It's actually not clear to me why QuSpin is slower. I think it's probably due to a different memory layout: QuSpin uses an array-of-structs layout, which I'd expect to be faster.

After some comparison, our equation evaluation is actually slightly slower than QuSpin's, since the sparse multiplication is slower (by ~5 ms). So the only remaining explanation is that the ODE solver is faster and uses far fewer steps to achieve similar precision.
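
A sketch (plain OrdinaryDiffEq, independent of EaRyd's API) of how to check how many adaptive steps a solver actually takes for a given tolerance, which is the quantity the comparison above hinges on:

# Sketch only: toy Schrödinger-like right-hand side solved with an adaptive
# solver; length(sol.t) shows how many time points the solver stepped through
# for the requested tolerances.
using LinearAlgebra, OrdinaryDiffEq

f!(du, u, p, t) = (du .= -1.0im .* u)        # placeholder RHS, not EaRyd's equation
u0 = normalize!(rand(ComplexF64, 16))
prob = ODEProblem(f!, u0, (0.0, 1.0))
sol = solve(prob, Vern8(); abstol=1e-8, reltol=1e-8)
@show length(sol.t)                           # number of accepted time points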


Some notes

Why Yao.cache doesn't work here:

  1. it doesn't work well with subspaces: the CacheServer is only element-type aware, not space-type aware
     • the key is not space-type aware, which is hard to change
     • overloading a new mat for the subspace cache is problematic, since the cache server does not know about the space
  2. it still has small allocations (which is fine, but not nice for profiling and benchmarking)

For an individual pulse, XTerm needs to be split into either a sum of put(i=>X) or individual matrices. Using apply! directly on each put(i=>X) is faster than summing the expression, since we can manually allocate only 3 arrays to do the reduction, instead of https://github.com/QuantumBFS/Yao.jl/blob/master/lib/YaoBlocks/src/composite/reduce.jl#L32, and it will be faster without the cache.

using Yao  # for apply!, put, X, rand_state

# Accumulate dst = sum(h * st for h in hs) while allocating only 3 state
# arrays, instead of one temporary per term.
function apply_hs(hs, dst, st)
    h1 = hs[1]
    src = copy(st)            # pristine copy of the input state
    dst.state .= st.state
    apply!(dst, h1)           # dst = h1 * st
    for idx in 2:length(hs)
        apply!(st, hs[idx])   # st = hs[idx] * src
        dst.state .+= st.state
        st.state .= src.state # restore st for the next term
    end
    return dst                # the accumulated sum lives in dst
end

using BenchmarkTools

hs = [put(10, i => X) for i in 1:10]   # ten single-site X terms
st = rand_state(10)
dst = copy(st)

@benchmark apply_hs($hs, $dst, $st)

julia> @benchmark apply_hs($hs, $dst, $st)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.590 μs …  54.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.401 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.952 μs ±  1.525 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

h = sum(hs)
@benchmark apply!($st, $h)

julia> @benchmark apply!($st, $h)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.561 μs …   5.458 ms  ┊ GC (min … max): 0.00% … 99.05%
 Time  (median):     29.360 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.105 μs ± 107.646 μs  ┊ GC (mean ± σ):  6.68% ±  1.98%

@GiggleLiu GiggleLiu (Collaborator)

What does "space-type aware" mean?

@Roger-luo Roger-luo (Member, Author) commented Feb 21, 2022

A side note for the backlog: since we now split out constant terms and use instruct! for the multiplication when applicable (local/individual pulse), part of this optimization is left to be done on the instruct! side. The instruct! interface was not designed for this type of application (closer to standard matrix multiplication), so its performance is not ideal; we can squeeze out more performance by solving QuantumBFS/BQCESubroutine.jl#37, which replaces the original switch+broadcast intrinsics with faster fma intrinsics when multiplying the Hamiltonian.
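
For context, a sketch (not EaRyd or BQCESubroutine code) of the five-argument mul! convention the linked issue asks for, which lets the ODE right-hand side accumulate in place instead of going through gate-style instruct! kernels:

# Sketch only: 5-argument mul! computes dst = α*H*src + β*dst in place,
# which is the shape an ODE RHS wants and lets the kernel fuse the multiply
# and the add.
using LinearAlgebra, SparseArrays

n = 10
H = sprand(ComplexF64, 1 << n, 1 << n, 0.03)   # stand-in sparse Hamiltonian
src = rand(ComplexF64, 1 << n)
dst = zeros(ComplexF64, 1 << n)

mul!(dst, H, src, -1.0im, 1.0 + 0.0im)          # dst .= -im .* (H * src) .+ dst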

@Roger-luo Roger-luo marked this pull request as ready for review February 21, 2022 10:32
@Roger-luo Roger-luo requested a review from GiggleLiu February 21, 2022 10:32
@Roger-luo Roger-luo (Member, Author)

This PR requires #136; don't merge before that one is merged.

@GiggleLiu GiggleLiu (Collaborator) commented Feb 22, 2022

I have merged PR #136. This PR does not seem to be compatible with the block system. Do you want to keep working in this PR or open a new PR for that?

@Roger-luo Roger-luo (Member, Author)

There's a separate PR for that: #137 @GiggleLiu

@Roger-luo Roger-luo enabled auto-merge (squash) February 22, 2022 11:48
@GiggleLiu GiggleLiu (Collaborator) left a comment

It is an OK PR, well tested, and should be good to merge. But I think you need to refactor the design a bit later.

@Roger-luo Roger-luo merged commit 8f29a98 into master Feb 22, 2022
@Roger-luo Roger-luo deleted the roger/hamiltonian-expr branch February 22, 2022 19:42