Parallelization in [email protected], compared with [email protected] #236

Open
ZongYongyue opened this issue Jan 24, 2025 · 10 comments

@ZongYongyue

Hi Lukas,

I am exploring the new version of MPSKit. Compared with [email protected], [email protected] seems to have dropped some support for parallel computation, especially for finite-size systems and algorithms.

For example, setting MPSKit.Defaults.set_parallelization("derivatives" => true) was useful when I performed DMRG2 on a finite-size lattice in the previous version:

# julia -t 8 hubbard_m.jl
using LinearAlgebra
BLAS.set_num_threads(1)   # one BLAS thread; parallelism comes from the Julia threads
using TensorKit
using MPSKit
using MPSKitModels: FiniteCylinder, FiniteStrip
using DynamicalCorrelators
using JLD2: save, load

MPSKit.Defaults.set_parallelization("derivatives" => true)   # threaded derivatives ([email protected])
filling = (1, 1)
lattice = FiniteStrip(4, 12)
H = hubbard(Float64, U1Irrep, U1Irrep, lattice; filling=filling, t=1, U=8, μ=0)
N = length(lattice)
st = randFiniteMPS(Float64, U1Irrep, U1Irrep, N; filling=filling)
err = 1e-6
@time gs, envs, delta = find_groundstate(st, H, DMRG2(trscheme=truncerr(err)));
E0 = expectation_value(gs, H)

[ Info: DMRG2   1:      obj = -4.911146147560e+00       err = 9.9914573155e-01  time = 51.10 sec
[ Info: DMRG2   2:      obj = -4.913259207333e+00       err = 1.6855045657e-04  time = 1.29 min
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 2.7500646205e-10  time = 24.01 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.3298697960e-14  time = 3.27 min
209.001721 seconds (1.13 G allocations: 624.127 GiB, 21.34% gc time, 288 lock conflicts, 38.48% compilation time)
-4.913259209043462
# For a single thread, it costs
[ Info: DMRG2   1:      obj = -4.912078856370e+00       err = 9.9976380773e-01  time = 1.77 min
[ Info: DMRG2   2:      obj = -4.913259207169e+00       err = 1.0282147643e-04  time = 1.61 min
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 3.0078417534e-10  time = 57.36 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.4075854078e-14  time = 5.79 min
357.544613 seconds (914.24 M allocations: 572.318 GiB, 16.88% gc time, 4.63% compilation time)
-4.9132592090434555

But in the new version a lot has changed, and I noticed that the two-site derivative function ∂AC2 no longer uses multiple threads, which seems to remove the last remaining parallelization support for DMRG2. The timings indeed seem to confirm this:

# 8 threads in [email protected] with MPSKit.Defaults.set_scheduler!(:dynamic)
[ Info: DMRG2   1:      obj = -4.911500799784e+00       err = 9.4429522527e-01  time = 1.26 min
[ Info: DMRG2   2:      obj = -4.913259207112e+00       err = 2.2780024577e-04  time = 48.19 sec
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 2.9917901490e-10  time = 35.59 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.3964831775e-14  time = 4.68 min
291.332853 seconds (888.91 M allocations: 450.021 GiB, 11.08% gc time, 15.20% compilation time: <1% of which was recompilation)
-4.913259209043462
# single thread in [email protected]
[ Info: DMRG2   1:      obj = -4.911706979069e+00       err = 9.9925886189e-01  time = 1.06 min
[ Info: DMRG2   2:      obj = -4.913259207177e+00       err = 1.6968202792e-04  time = 52.09 sec
[ Info: DMRG2   3:      obj = -4.913259209043e+00       err = 3.0106028781e-10  time = 34.27 sec
[ Info: DMRG2 conv 4:   obj = -4.913259209043e+00       err = 4.4408920985e-14  time = 4.58 min
284.791112 seconds (762.70 M allocations: 432.346 GiB, 16.69% gc time, 8.58% compilation time: <1% of which was recompilation)
-4.913259209043463

For single-threaded computations the new version has a clear advantage. However, since it no longer supports multithreading for finite-size DMRG, it is at a disadvantage compared to the previous version when multiple threads are available.

@lkdvos
Member

lkdvos commented Jan 24, 2025

Hi Yue, thanks for bringing this up, this is really helpful!

The goal of the rewrite wasn't necessarily to remove the multithreading; it's more that I wanted to delegate some of that responsibility out of MPSKit, in the sense that this is just a block-sparse contraction, which should now be implemented by BlockTensorKit.jl. I'm definitely willing to spend the time to add it back in, and this should not be too much work, but before I do, would you be willing to also try the single-threaded MPSKit v0.12, but with the number of BLAS threads equal to the number of available threads? I feel that is a bit fairer of a comparison.
One of the reasons I didn't prioritize working on multithreading over the blocks is that in many cases I found that, compared to letting BLAS do the multithreading, trying to manually improve on it actually hindered performance instead of helping. Again, I do believe there are cases where multithreading over the blocks will outperform this, but it's not entirely clear when that happens. If nothing else, having more data about this is helpful.
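Concretely, what I have in mind is something like the sketch below; the core count of 8 is an assumption, chosen to match the -t 8 runs above, and the rest is just the script from the first comment:

# julia -t 1 hubbard_blas.jl  -- one Julia thread, let BLAS do the threading
using LinearAlgebra
BLAS.set_num_threads(8)   # assumption: 8 cores, matching the `-t 8` runs above
using TensorKit
using MPSKit
# ... then the same FiniteStrip / hubbard / DMRG2 script as in the first comment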

@ZongYongyue
Author

Yes, when I ran the single-threaded case, the number of BLAS threads was equal to the number of available threads. I do agree with you that the efficiency question is case-dependent. Would you mind teaching me how to add it back in v0.12? I am willing to try more cases for comparison and will share my data for the different cases.

@lkdvos
Member

lkdvos commented Jan 25, 2025

There is a tiny bit of infrastructure missing to make this fully customizable right now, but the general way to add it back consists of two steps:

  1. Adding support for a backend = ... keyword argument to all @plansor, @tensor and @planar calls, so that the backend can be switched dynamically. Alternatively, finding a way to specify a backend via ScopedValues.jl or a similar construction.
  2. Adding a parallelized (block-sparse) contraction backend implementation in BlockTensorKit.jl.

The first step is something we're actively trying to figure out for TensorKit.jl as well; see for example this draft PR, which would already add multithreading at the symmetry-block level.
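To make the ScopedValues.jl route of step 1 a bit more concrete, here is a minimal sketch; the names BACKEND, DefaultBackend and current_backend are hypothetical and not existing MPSKit/TensorKit API:

using ScopedValues

struct DefaultBackend end                          # hypothetical placeholder backend
const BACKEND = ScopedValue{Any}(DefaultBackend())

current_backend() = BACKEND[]                      # library code would read this inside its @tensor wrappers

# user code could then scope a different backend for a region of code:
# with(BACKEND => SomeThreadedBackend()) do
#     find_groundstate(st, H, DMRG2(trscheme=truncerr(1e-6)))
# end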

The second is a matter of defining a custom backend <: TensorOperations.AbstractBackend struct, and using it to specialize the linear-algebra implementation for block tensor maps. Once the TensorKit.jl implementation correctly passes the backend through, specialized implementations of mul! and add_transform! are presumably the main focus points for improving runtimes.
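To illustrate what step 2 could look like, a sketch only: ThreadedBlockBackend and threaded_blockmul! are made-up names, and a real implementation would dispatch on BlockTensorKit.jl's own types rather than on plain Dicts of dense blocks:

using TensorOperations, LinearAlgebra

struct ThreadedBlockBackend <: TensorOperations.AbstractBackend end

# illustrative pattern only: thread over a collection of dense blocks, the way a
# block-sparse mul! specialized on ThreadedBlockBackend might distribute its work
function threaded_blockmul!(Cblocks::Dict, Ablocks::Dict, Bblocks::Dict, α, β)
    Threads.@threads for key in collect(keys(Cblocks))
        mul!(Cblocks[key], Ablocks[key], Bblocks[key], α, β)
    end
    return Cblocks
end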

Additionally, any kind of benchmark or profiler setup is immensely helpful for gauging how well these implementations do compared to the base case, which simply uses BLAS multithreading. In particular, I have no real idea whether we should focus on multithreading at the symmetry level, at the BLAS level, at the level of the blocks in the Hamiltonian, or a combination of all of these.
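For example, a minimal benchmarking and profiling setup along those lines could look like the sketch below, reusing st and H from the first comment; it assumes BenchmarkTools.jl and the Profile stdlib, nothing MPSKit-specific:

using BenchmarkTools, Profile

@btime find_groundstate($st, $H, DMRG2(trscheme=truncerr(1e-6)));

Profile.clear()
@profile find_groundstate(st, H, DMRG2(trscheme=truncerr(1e-6)))
Profile.print(mincount=100)   # or inspect the profile with ProfileView.jl / PProf.jl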
Do let me know if there are any more specific things you would like to know!

@ZongYongyue
Author

ZongYongyue commented Feb 18, 2025

Since multithreading at the symmetry-block level has already been implemented in TensorKit on the ld-multithreading2 branch, I would like to know how I can enable these features when using MPSKit algorithms. Does this involve your second point, i.e. do I need to define a custom backend and then set it in MPSKit using something like set_backend!, similar to set_scheduler!? I also noticed TensorKitBackend in backends.jl, but I'm not sure how to configure it and make it work.

@lkdvos
Member

lkdvos commented Feb 18, 2025

Yes, this is very much related. I wouldn't say it has been fully implemented yet, but it is definitely an initial push towards making that work. I don't want to start recommending these things yet, because the actual interface is still subject to change, but it does outline some of the ideas we are working with.
(linking this PR for future reference: Jutho/TensorKit.jl#203)

@ZongYongyue
Author

ZongYongyue commented Feb 19, 2025

I see... so for now, if I want to use multithreading for some of my work, is it best to go back to [email protected], or to use [email protected] + TensorKit's ld-multithreading branch?

@lkdvos
Member

lkdvos commented Feb 19, 2025

I would advise against [email protected], mostly because the data you would compute is not forwards compatible: the structure of the tensors changed between these versions, so any data you save to disk now will not be trivially loadable in the future.

I'll try and spend some time this week to make the multithreading branch at least usable, if you are willing to accept that there might be some bugs that we'll have to fix as we go along?

@lkdvos
Member

lkdvos commented Feb 19, 2025

I think that branch should now have rudimentary support for selecting some multithreading over the different symmetry blocks. In particular, see the file backends.jl, where I added some functionality to easily switch out the default schedulers used.

Let me know if anything is not clear, or not behaving as expected?

As a small side note, make sure you update to the latest version of BlockTensorKit.jl as well; we recently found a rather significant performance bug there, so I expect that if you run the new version the timings (even without multithreading) should have improved.
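For reference, a hedged sketch of how one could point an environment at that branch and pick up the BlockTensorKit.jl fix; the branch name ld-multithreading2 is the one mentioned earlier in this thread, so check the draft PR for its current state:

using Pkg
Pkg.add(url="https://github.com/Jutho/TensorKit.jl", rev="ld-multithreading2")
Pkg.update("BlockTensorKit")   # pick up the recent performance fix mentioned above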

@ZongYongyue
Author

I would be very happy and truly grateful if you are willing to do so. Multithreading acceleration would be very helpful for my current work, so if you make the multithreading branch available, I can provide timely feedback on any issues I might encounter.

@ZongYongyue
Author

Thank you very much, I will try this now
