Describe the bug
Running training with the Top-k feature on multiple GPUs fails with torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'epochs'.
Training works with one GPU, or with multiple GPUs when Top-k is disabled.
To Reproduce
Steps to reproduce the behavior:
- start the container with ./docker_run.sh
- run:
python train.py --outdir=/results --data=/images/ --resume=ffhq256 --gpus=2 --metrics=none --snap=1 --topk=0.9726
- result:
...
Setting up augmentation...
Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...
Traceback (most recent call last):
File "train.py", line 608, in <module>
main() # pylint: disable=no-value-for-parameter
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
return f(get_current_context(), *args, **kwargs)
File "train.py", line 603, in main
torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 247, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 205, in start_processes
while not context.join():
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 166, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/scratch/train.py", line 445, in subprocess_fn
training_loop.training_loop(rank=rank, **args)
File "/scratch/training/training_loop.py", line 305, in training_loop
loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
File "/scratch/training/loss.py", line 81, in accumulate_gradients
k_frac = np.maximum(self.G_top_k_gamma ** self.G.epochs, self.G_top_k_frac)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 795, in __getattr__
raise ModuleAttributeError("'{}' object has no attribute '{}'".format(
torch.nn.modules.module.ModuleAttributeError: 'DistributedDataParallel' object has no attribute 'epochs'
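For reference, this looks like standard DistributedDataParallel behaviour: custom Python attributes set on the wrapped network (here, epochs) are not forwarded by the DDP wrapper and are only reachable through its .module field. A minimal sketch outside this repo (single process, CPU, gloo backend; the names are illustrative, not this repo's code):

import os
import torch.nn as nn
import torch.distributed as dist

os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
dist.init_process_group('gloo', rank=0, world_size=1)

net = nn.Linear(4, 4)
net.epochs = 0                            # custom attribute, like G.epochs in the Top-k patch

ddp = nn.parallel.DistributedDataParallel(net)
print(ddp.module.epochs)                  # 0: the attribute is still on the wrapped module
try:
    print(ddp.epochs)                     # DDP does not forward custom attributes
except AttributeError as e:
    print(e)                              # same error as in the traceback above

dist.destroy_process_group()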
Expected behavior
Training with Top-k should work on multiple GPUs just as it does on a single GPU.
Alternatively, training should refuse to start (with an explicit error) if this combination is not supported.
Desktop (please complete the following information):
- OS: Linux Ubuntu 20.04
- NVIDIA driver version 460
- Docker: nvcr.io/nvidia/pytorch:20.12-py3
Additional context
I use the head of this repo (464100c for reference) plus a merge of NVlabs#3; there were minimal conflicts, and I checked that the code touched by #16 was not changed by that merge.
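For what it's worth, a hypothetical workaround (a sketch only, not a tested fix, and assuming self.G in the Top-k patch is the DDP-wrapped generator) would be to resolve the wrapper before reading the custom attribute in training/loss.py:

import numpy as np

def _unwrap(module):
    # DistributedDataParallel keeps the original network in .module;
    # on a single GPU the network is used directly, so fall back to it.
    return getattr(module, 'module', module)

# in accumulate_gradients, replacing the failing line (sketch):
# k_frac = np.maximum(self.G_top_k_gamma ** _unwrap(self.G).epochs, self.G_top_k_frac)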