improve ODE performance #128

Merged
merged 11 commits from roger/hamiltonian-expr into master on Feb 22, 2022
Conversation

@Roger-luo Roger-luo (Member) commented Feb 17, 2022

After some attempts, I decided to do a quick patch first instead of a complete rewrite of the Hamiltonian expr. This basically doesn't change any APIs for the performance issue, with some ugly workarounds. There are still a few things left to do:

  • For an individual pulse we are not using the high-performance intrinsics implemented in Yao, but naive matrix-vector multiplication. On the Yao side, those intrinsics do not make use of fma instructions for ODE solvers, because they were designed for gates previously (so I'd expect ~20% speedup in total from this), see obey mul! convention QuantumBFS/BQCESubroutine.jl#37
  • The current implementation for automatically deciding which term is constant is quite dumb (but works). Ideally, if we had more generic terms like Sum(i->Omega * i, X, 1:N) + Sum(i->2i, N, 1:N), we could implement this transform as merging similar terms (see the sketch after this list). But this would require more work and doesn't fit well with YaoBlocks at the moment (YaoBlocks can't do general pattern matching & rewriting)
  • I dropped real-layout support in this PR, since I'd like to have that supported automatically in a separate PR that switches to StructArrays
  • I'll fix the CUDA part in a separate PR later
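
As a hedged illustration of the "merging similar terms" idea mentioned above: the Sum type and merge_similar function below are hypothetical names for this sketch only, not part of YaoBlocks or this PR.

# Hypothetical sketch: a generic site-indexed sum, and a naive merge of two
# sums that share the same local operator and site range by adding their
# coefficient functions.
struct Sum{F,O}
    coeff::F                 # i -> coefficient on site i
    op::O                    # local operator, e.g. X or N
    sites::UnitRange{Int}
end

function merge_similar(a::Sum, b::Sum)
    if a.op === b.op && a.sites == b.sites
        return Sum(i -> a.coeff(i) + b.coeff(i), a.op, a.sites)
    end
    return nothing           # not similar; caller keeps the terms separate
end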

@jon-wurtz's QuSpin benchmark for this PR as a reference, tested on an AWS EC2 c5a.xlarge (AMD CPU). QuSpin:

/home/ubuntu/EaRyd/quspin-benchmark.py:77: UserWarning: Test for symmetries not implemented for <class 'quspin.basis.basis_1d.spin.spin_basis_1d'>, to turn off this warning set check_symm=False in hamiltonian
  ham = quspin.operators.hamiltonian(static,dynamic,N=N)
Time to compute evolved state: 24.525sec

EaRyd (including compilation, first-time execution):

julia> include("test.jl")
[ Info: Precompiling EaRyd [bd27d05e-4ce1-5e79-84dd-c5d7d508bbe1]
 33.241566 seconds (17.73 M allocations: 1.218 GiB, 1.84% gc time, 63.07% compilation time)

Excluding compilation time:

julia> @time emulate!(odesolve);
 11.981559 seconds (2.89 k allocations: 320.239 MiB, 1.91% gc time)

We can include some precompile statements to get rid of that compile time for the default solver, but I think that's going to be in another PR.
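
For reference, a minimal sketch of such a precompile statement; ODEEvolution here is a placeholder argument type, not necessarily the actual type emulate! is called with.

# Hypothetical sketch: inside the package's top-level module, force
# ahead-of-time compilation of the hot entry point for the default solver.
# `ODEEvolution` is a placeholder type name (assumption).
precompile(emulate!, (ODEEvolution,))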

Why is QuSpin slower?

It's actually not clear to me why QuSpin is slower. I think it's probably due to a different memory layout: QuSpin uses an array-of-structs layout, which I'd expect to be faster.

After some comparison, our equation evaluation is actually slightly slower than QuSpin's, since the sparse multiplication is slower (by ~5 ms). So the only remaining explanation is that the ODE solver is faster and uses far fewer steps to achieve similar precision.
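
A sketch (plain OrdinaryDiffEq, independent of EaRyd's API) of how to check how many adaptive steps a solver actually takes for a given tolerance, which is the quantity the comparison above hinges on:

# Sketch only: toy Schrödinger-like right-hand side solved with an adaptive
# solver; length(sol.t) shows how many time points the solver stepped through
# for the requested tolerances.
using LinearAlgebra, OrdinaryDiffEq

f!(du, u, p, t) = (du .= -1.0im .* u)        # placeholder RHS, not EaRyd's equation
u0 = normalize!(rand(ComplexF64, 16))
prob = ODEProblem(f!, u0, (0.0, 1.0))
sol = solve(prob, Vern8(); abstol=1e-8, reltol=1e-8)
@show length(sol.t)                           # number of accepted time points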


Some notes

Why Yao.cache doesn't work here:

  1. it doesn't work well with subspaces: the CacheServer is only element-type aware, not space-type aware
     • the key is not space-type aware, which is hard to change
     • overloading a new mat for the subspace cache is problematic, since the cache server does not know about the space
  2. it still has small allocations (which is fine, but not nice for profiling and benchmarking)

For an individual pulse, XTerm needs to be split into either a sum of put(i=>X) or individual matrices. Using apply! directly on each put(i=>X) is faster than summing the expression, since we can manually allocate only 3 arrays to do the reduction, instead of https://github.com/QuantumBFS/Yao.jl/blob/master/lib/YaoBlocks/src/composite/reduce.jl#L32, and it will be faster without the cache.

using Yao  # for apply!, put, X, rand_state

# Accumulate dst = sum(h * st for h in hs) while allocating only 3 state
# arrays, instead of one temporary per term.
function apply_hs(hs, dst, st)
    h1 = hs[1]
    src = copy(st)            # pristine copy of the input state
    dst.state .= st.state
    apply!(dst, h1)           # dst = h1 * st
    for idx in 2:length(hs)
        apply!(st, hs[idx])   # st = hs[idx] * src
        dst.state .+= st.state
        st.state .= src.state # restore st for the next term
    end
    return dst                # the accumulated sum lives in dst
end

using BenchmarkTools

hs = [put(10, i => X) for i in 1:10]   # ten single-site X terms
st = rand_state(10)
dst = copy(st)

@benchmark apply_hs($hs, $dst, $st)

julia> @benchmark apply_hs($hs, $dst, $st)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  17.590 μs …  54.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     18.401 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   18.952 μs ±  1.525 μs   ┊ GC (mean ± σ):  0.00% ± 0.00%

h = sum(hs)
@benchmark apply!($st, $h)

julia> @benchmark apply!($st, $h)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  26.561 μs …   5.458 ms  ┊ GC (min … max): 0.00% … 99.05%
 Time  (median):     29.360 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.105 μs ± 107.646 μs  ┊ GC (mean ± σ):  6.68% ±  1.98%

@GiggleLiu GiggleLiu (Collaborator)

What does "space-type aware" mean?

@Roger-luo Roger-luo (Member, Author) commented Feb 21, 2022

A side note for the backlog: since we now split out constant terms and use instruct! for the multiplication when applicable (local/individual pulse), part of this optimization is left to be done on the instruct! side. The instruct! interface was not designed for this type of application (closer to standard matrix multiplication), so its performance is not ideal; we can squeeze out more performance by solving QuantumBFS/BQCESubroutine.jl#37, which replaces the original switch+broadcast intrinsics with faster fma intrinsics when multiplying the Hamiltonian.
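
For context, a sketch (not EaRyd or BQCESubroutine code) of the five-argument mul! convention the linked issue asks for, which lets the ODE right-hand side accumulate in place instead of going through gate-style instruct! kernels:

# Sketch only: 5-argument mul! computes dst = α*H*src + β*dst in place,
# which is the shape an ODE RHS wants and lets the kernel fuse the multiply
# and the add.
using LinearAlgebra, SparseArrays

n = 10
H = sprand(ComplexF64, 1 << n, 1 << n, 0.03)   # stand-in sparse Hamiltonian
src = rand(ComplexF64, 1 << n)
dst = zeros(ComplexF64, 1 << n)

mul!(dst, H, src, -1.0im, 1.0 + 0.0im)          # dst .= -im .* (H * src) .+ dst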

@Roger-luo Roger-luo marked this pull request as ready for review February 21, 2022 10:32
@Roger-luo Roger-luo requested a review from GiggleLiu February 21, 2022 10:32
@Roger-luo Roger-luo (Member, Author)

This PR requires #136; don't merge before that one is merged.

@GiggleLiu GiggleLiu (Collaborator) commented Feb 22, 2022

I have merged PR #136. This PR does not seem to be compatible with the block system. Do you want to keep working in this PR or open a new PR for that?

@Roger-luo Roger-luo (Member, Author)

There's a separate PR for that: #137 @GiggleLiu

@Roger-luo Roger-luo enabled auto-merge (squash) February 22, 2022 11:48
@GiggleLiu GiggleLiu (Collaborator) left a comment

It is an OK PR, well tested, and should be good to merge. But I think you need to refactor the design a bit later.

@Roger-luo Roger-luo merged commit 8f29a98 into master Feb 22, 2022
@Roger-luo Roger-luo deleted the roger/hamiltonian-expr branch February 22, 2022 19:42