Parallelization in [email protected], compared with [email protected] #236

Open
ZongYongyue opened this issue Jan 24, 2025 · 10 comments

@ZongYongyue

Hi Lukas,

I am exploring the new version of MPSKit. Compared with [email protected], [email protected] seems to have dropped some support for parallel computation, especially for finite-size systems and algorithms.

For example, setting MPSKit.Defaults.set_parallelization("derivatives" => true) was useful when I performed DMRG2 on a finite-size lattice in the previous version:

# julia -t 8 hubbard_m.jl
using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread; parallelism comes from the Julia threads
using TensorKit
using MPSKit
using MPSKitModels: FiniteCylinder, FiniteStrip
using DynamicalCorrelators
using JLD2: save, load

MPSKit.Defaults.set_parallelization("derivatives" => true)   # threaded derivatives ([email protected])
filling = (1, 1)
lattice = FiniteStrip(4, 12)
H = hubbard(Float64, U1Irrep, U1Irrep, lattice; filling=filling, t=1, U=8, μ=0)
N = length(lattice)
st = randFiniteMPS(Float64, U1Irrep, U1Irrep, N; filling=filling)
err = 1e-6
@time gs, envs, delta = find_groundstate(st, H, DMRG2(trscheme=truncerr(err)));
E0 = expectation_value(gs, H)

[ Info: DMRG2   1:      obj = -4.911146147560e+00       err = 9.9914573155e-01  time = 51.10 sec
[ Info: DMRG2   2:      obj = -4.913259207333e+00       err = 1.6855045657e-04  time = 1.29 min
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 2.7500646205e-10  time = 24.01 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.3298697960e-14  time = 3.27 min
209.001721 seconds (1.13 G allocations: 624.127 GiB, 21.34% gc time, 288 lock conflicts, 38.48% compilation time)
-4.913259209043462
# For a single thread, it costs
[ Info: DMRG2   1:      obj = -4.912078856370e+00       err = 9.9976380773e-01  time = 1.77 min
[ Info: DMRG2   2:      obj = -4.913259207169e+00       err = 1.0282147643e-04  time = 1.61 min
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 3.0078417534e-10  time = 57.36 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.4075854078e-14  time = 5.79 min
357.544613 seconds (914.24 M allocations: 572.318 GiB, 16.88% gc time, 4.63% compilation time)
-4.9132592090434555

But in the new version a lot has changed, and I noticed that the two-site derivative function ∂AC2 no longer uses multiple threads, which seems to remove the last remaining parallelization support for DMRG2. The timings indeed seem to confirm this:

# 8 threads in [email protected] with MPSKit.Defaults.set_scheduler!(:dynamic)
[ Info: DMRG2   1:      obj = -4.911500799784e+00       err = 9.4429522527e-01  time = 1.26 min
[ Info: DMRG2   2:      obj = -4.913259207112e+00       err = 2.2780024577e-04  time = 48.19 sec
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 2.9917901490e-10  time = 35.59 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.3964831775e-14  time = 4.68 min
291.332853 seconds (888.91 M allocations: 450.021 GiB, 11.08% gc time, 15.20% compilation time: <1% of which was recompilation)
-4.913259209043462
# single thread in [email protected]
[ Info: DMRG2   1:      obj = -4.911706979069e+00       err = 9.9925886189e-01  time = 1.06 min
[ Info: DMRG2   2:      obj = -4.913259207177e+00       err = 1.6968202792e-04  time = 52.09 sec
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 3.0106028781e-10  time = 34.27 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.4408920985e-14  time = 4.58 min
284.791112 seconds (762.70 M allocations: 432.346 GiB, 16.69% gc time, 8.58% compilation time: <1% of which was recompilation)
-4.913259209043463

For single-threaded computations the new version has a clear advantage. However, since it no longer supports multithreading for finite-size DMRG, it is at a disadvantage compared to the previous version when multiple threads are available.

@lkdvos
Member

lkdvos commented Jan 24, 2025

Hi Yue, thanks for bringing this up, this is really helpful!

The goal of the rewrite wasn't necessarily to remove the multithreading; it's more that I wanted to delegate some of that responsibility out of MPSKit, in the sense that this is just a block-sparse contraction, which should now be implemented by BlockTensorKit.jl. I'm definitely willing to spend the time to add it back in, and this should not be too much work, but before I do, would you be willing to also try the single-threaded MPSKit v0.12, but with the number of BLAS threads equal to the number of available threads? I feel that is a bit fairer of a comparison.
One of the reasons I didn't prioritize working on multithreading over the blocks is that in many cases I found that, compared to letting BLAS do the multithreading, trying to manually improve on it actually hindered performance instead of helping. Again, I do believe there are cases where multithreading over the blocks will outperform this, but it's not entirely clear when that happens. If nothing else, having more data about this is helpful.
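Concretely, what I have in mind is something like the sketch below; the core count of 8 is an assumption, chosen to match the -t 8 runs above, and the rest is just the script from the first comment:

# julia -t 1 hubbard_blas.jl  -- one Julia thread, let BLAS do the threading
using LinearAlgebra
BLAS.set_num_threads(8)   # assumption: 8 cores, matching the `-t 8` runs above
using TensorKit
using MPSKit
# ... then the same FiniteStrip / hubbard / DMRG2 script as in the first comment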

@ZongYongyue
Author

Yes, when I ran the single-threaded case, the number of BLAS threads was equal to the number of available threads. I do agree with you that the efficiency question is case-dependent. Would you mind teaching me how to add it back in v0.12? I am willing to try more cases for comparison and will share my data for the different cases.

@lkdvos
Member

lkdvos commented Jan 25, 2025

There is a tiny bit of infrastructure missing to make this fully customizable right now, but the general way to add it back consists of two steps:

  1. Adding support for a backend = ... keyword argument to all @plansor, @tensor and @planar calls, so that the backend can be switched dynamically. Alternatively, finding a way to specify a backend via ScopedValues.jl or a similar construction.
  2. Adding a parallelized (block-sparse) contraction backend implementation in BlockTensorKit.jl.

The first step is something we're actively trying to figure out for TensorKit.jl as well; see for example this draft PR, which would already add multithreading at the symmetry-block level.
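To make the ScopedValues.jl route of step 1 a bit more concrete, here is a minimal sketch; the names BACKEND, DefaultBackend and current_backend are hypothetical and not existing MPSKit/TensorKit API:

using ScopedValues

struct DefaultBackend end                          # hypothetical placeholder backend
const BACKEND = ScopedValue{Any}(DefaultBackend())

current_backend() = BACKEND[]                      # library code would read this inside its @tensor wrappers

# user code could then scope a different backend for a region of code:
# with(BACKEND => SomeThreadedBackend()) do
#     find_groundstate(st, H, DMRG2(trscheme=truncerr(1e-6)))
# end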

The second is a matter of defining a custom backend <: TensorOperations.AbstractBackend struct, and using it to specialize the linear-algebra implementation for block tensor maps. Once the TensorKit.jl implementation correctly passes the backend through, specialized implementations of mul! and add_transform! are presumably the main focus points for improving runtimes.
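To illustrate what step 2 could look like, a sketch only: ThreadedBlockBackend and threaded_blockmul! are made-up names, and a real implementation would dispatch on BlockTensorKit.jl's own types rather than on plain Dicts of dense blocks:

using TensorOperations, LinearAlgebra

struct ThreadedBlockBackend <: TensorOperations.AbstractBackend end

# illustrative pattern only: thread over a collection of dense blocks, the way a
# block-sparse mul! specialized on ThreadedBlockBackend might distribute its work
function threaded_blockmul!(Cblocks::Dict, Ablocks::Dict, Bblocks::Dict, α, β)
    Threads.@threads for key in collect(keys(Cblocks))
        mul!(Cblocks[key], Ablocks[key], Bblocks[key], α, β)
    end
    return Cblocks
end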

Additionally, any kind of benchmark or profiler setup is immensely helpful for gauging how well these implementations do compared to the base case, which simply uses BLAS multithreading. In particular, I have no real idea whether we should focus on multithreading at the symmetry level, at the BLAS level, at the level of the blocks in the Hamiltonian, or a combination of all of these.
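For example, a minimal benchmarking and profiling setup along those lines could look like the sketch below, reusing st and H from the first comment; it assumes BenchmarkTools.jl and the Profile stdlib, nothing MPSKit-specific:

using BenchmarkTools, Profile

@btime find_groundstate($st, $H, DMRG2(trscheme=truncerr(1e-6)));

Profile.clear()
@profile find_groundstate(st, H, DMRG2(trscheme=truncerr(1e-6)))
Profile.print(mincount=100)   # or inspect the profile with ProfileView.jl / PProf.jl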
Do let me know if there are any more specific things you would like to know!

@ZongYongyue
Author

ZongYongyue commented Feb 18, 2025

Since multithreading at the symmetry-block level has already been implemented in TensorKit on the ld-multithreading2 branch, I would like to know how I can enable these features when using MPSKit algorithms. Does this involve your second point, i.e. do I need to define a custom backend and then set it in MPSKit using something like set_backend!, similar to set_scheduler!? I also noticed TensorKitBackend in backends.jl, but I'm not sure how to configure it and make it work.

@lkdvos
Member

lkdvos commented Feb 18, 2025

Yes, this is very much related. I wouldn't say it has been fully implemented yet, but it is definitely an initial push towards making that work. I don't want to start recommending these things yet, because the actual interface is still subject to change, but it does outline some of the ideas we are working with.
(linking this PR for future reference: Jutho/TensorKit.jl#203)

@ZongYongyue
Author

ZongYongyue commented Feb 19, 2025

I see... so for now, if I want to use multithreading for some of my work, is it best to go back to [email protected], or to use [email protected] + TensorKit's ld-multithreading branch?

@lkdvos
Member

lkdvos commented Feb 19, 2025

I would advise against [email protected], mostly because the data you would compute is not forwards compatible: the structure of the tensors changed between these versions, so any data you save to disk now will not be trivially loadable in the future.

I'll try and spend some time this week to make the multithreading branch at least usable, if you are willing to accept that there might be some bugs that we'll have to fix as we go along?

@lkdvos
Member

lkdvos commented Feb 19, 2025

I think that branch should now have rudimentary support for selecting some multithreading over the different symmetry blocks. In particular, see the file backends.jl, where I added some functionality to easily switch out the default schedulers used.

Let me know if anything is not clear, or not behaving as expected?

As a small side note, make sure you update to the latest version of BlockTensorKit.jl as well; we recently found a rather significant performance bug there, so I expect that if you run the new version the timings (even without multithreading) should have improved.
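For reference, a hedged sketch of how one could point an environment at that branch and pick up the BlockTensorKit.jl fix; the branch name ld-multithreading2 is the one mentioned earlier in this thread, so check the draft PR for its current state:

using Pkg
Pkg.add(url="https://github.com/Jutho/TensorKit.jl", rev="ld-multithreading2")
Pkg.update("BlockTensorKit")   # pick up the recent performance fix mentioned above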

@ZongYongyue
Author

I would be very happy and truly grateful if you are willing to do so. Multithreading acceleration would be very helpful for my current work, so if you make the multithreading branch available, I can provide timely feedback on any issues I might encounter.

@ZongYongyue
Author

Thank you very much, I will try this now
