This repository was archived by the owner on Aug 29, 2023. It is now read-only.
Hi,
Thanks for your great work! I have recently been trying to train the model myself, but moving the model from CPU to GPU takes a very long time (about an hour), and then training fails. Could you please give me some suggestions? Did I do something wrong?
Thanks in advance!
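One way to narrow down the slow CPU-to-GPU step is to time CUDA start-up separately from the actual weight copy. This is a minimal sketch, assuming a PyTorch model that is still on the CPU; `timed` and `time_cuda_setup` are hypothetical helpers, not detectron2 or MaskFormer APIs. Any driver-side JIT compilation of GPU kernels would show up in the start-up timing rather than in the copy.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Call fn() once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_cuda_setup(model: Any) -> Dict[str, float]:
    """Time CUDA initialization separately from moving a CPU `model` to the GPU.

    torch is imported lazily so `timed` can be used without it.
    """
    import torch

    def init_cuda() -> None:
        # In a fresh process, the first CUDA call creates the CUDA context.
        torch.zeros(1, device="cuda")
        torch.cuda.synchronize()

    def move_model() -> Any:
        moved = model.to("cuda")
        torch.cuda.synchronize()  # wait until all host-to-device copies finish
        return moved

    timings: Dict[str, float] = {}
    _, timings["cuda_init_s"] = timed(init_cuda)
    _, timings["model_to_cuda_s"] = timed(move_model)
    return timings
```

For example, `time_cuda_setup(torch.nn.Conv2d(3, 8, 3))` in a fresh Python process on the affected machine. If nearly all of the time lands in `cuda_init_s`, the delay comes from CUDA start-up rather than from copying weights.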
My environment is below:
```
sys.platform linux
Python 3.7.0 (default, Oct 9 2018, 10:31:47) [GCC 7.3.0]
numpy 1.21.5
detectron2 0.6 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 10.2
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5
DETECTRON2_ENV_MODULE
PyTorch 1.8.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0 NVIDIA GeForce RTX 3080 Laptop GPU (arch=8.6)
Driver version 510.60.02
CUDA_HOME /usr/local/cuda
Pillow 9.2.0
torchvision 0.9.2 @/home/mu/anaconda3/envs/maskformer/lib/python3.7/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5
fvcore 0.1.5.post20220512
iopath 0.1.9
cv2 4.6.0
```
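In the environment above, the GPU is reported as `arch=8.6`, while the detectron2 and torchvision arch flags stop at 7.5 and the CUDA compiler is 10.2 (CUDA 11.1 is the first toolkit that can target compute capability 8.6). That mismatch is one plausible lead, not a confirmed cause. Below is a minimal sketch that flags it from a pasted `python -m detectron2.utils.collect_env` report, assuming the `name value` line layout shown above; `arch_mismatches` is a hypothetical helper.

```python
import re
from typing import Dict, List


def arch_mismatches(report: str) -> Dict[str, List[str]]:
    """Find "<name> arch flags" lists that lack the GPU arch in a collect_env report.

    The GPU arch is read from a line containing "(arch=X.Y)"; with several GPUs
    the last one wins. This is a plain string comparison: it ignores binary
    compatibility within a major version (code built for 8.0 also runs on 8.6).
    """
    gpu_arch = None
    flags: Dict[str, List[str]] = {}
    for line in report.splitlines():
        gpu = re.search(r"\(arch=(\d+\.\d+)\)", line)
        if gpu:
            gpu_arch = gpu.group(1)
        m = re.match(r"\s*(\S+) arch flags\s+(.+)", line)
        if m:
            flags[m.group(1)] = [f.strip() for f in m.group(2).split(",")]
    if gpu_arch is None:
        return {}  # no GPU line, nothing to compare against
    return {name: fl for name, fl in flags.items() if gpu_arch not in fl}
```

Fed the environment listed above, this returns entries for both `detectron2` and `torchvision`, since neither flag list contains 8.6.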
The error is below:
```
res4.9.conv3.norm.num_batches_tracked
res5.0.conv1.norm.num_batches_tracked
res5.0.conv2.norm.num_batches_tracked
res5.0.conv3.norm.num_batches_tracked
res5.0.shortcut.norm.num_batches_tracked
res5.1.conv1.norm.num_batches_tracked
res5.1.conv2.norm.num_batches_tracked
res5.1.conv3.norm.num_batches_tracked
res5.2.conv1.norm.num_batches_tracked
res5.2.conv2.norm.num_batches_tracked
res5.2.conv3.norm.num_batches_tracked
stem.conv1.norm.num_batches_tracked
stem.conv2.norm.num_batches_tracked
stem.conv3.norm.num_batches_tracked
stem.fc.{bias, weight}
[08/21 20:18:39 d2.engine.train_loop]: Starting training from iteration 0
ERROR [08/21 20:20:24 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
[08/21 20:20:24 d2.engine.hooks]: Total training time: 0:01:45 (0:00:00 on hooks)
[08/21 20:20:24 d2.utils.events]: iter: 0 lr: N/A max_mem: 5604M
Traceback (most recent call last):
  File "train_net.py", line 270, in <module>
    args=(args,),
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/launch.py", line 82, in launch
    main_func(*args)
  File "train_net.py", line 258, in main
    return trainer.train()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/cloud/maskformer/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 285, in run_step
    losses.backward()
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/cloud/maskformer/lib/python3.7/site-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True) # allow_unreachable flag
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution
```
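Because the error is raised inside cuDNN during `losses.backward()`, a standalone check can help separate a PyTorch/CUDA build problem from a MaskFormer-specific one. This is a minimal sketch, assuming a CUDA-enabled PyTorch recent enough to provide `torch.cuda.get_arch_list()`; `kernel_support` and `cudnn_conv_check` are hypothetical helpers. It runs one small convolution forward and backward with `torch.backends.cudnn.enabled` set to `True` and then `False`. It also classifies the compiled architecture list against the device's compute capability using CUDA's compatibility rules: an `sm_XY` cubin runs on the same major version with an equal or higher minor version, and `compute_XY` PTX can be JIT-compiled for equal or newer GPUs.

```python
import re
from typing import Any, Dict, Iterable, Tuple


def kernel_support(capability: Tuple[int, int], arch_list: Iterable[str]) -> str:
    """Classify how a GPU with `capability` can run code built for `arch_list`.

    Entries look like torch.cuda.get_arch_list() output: "sm_75" is compiled
    machine code (cubin), "compute_75" is PTX the driver can JIT-compile.
    Returns "native", "ptx-jit" or "unsupported".
    """
    major, minor = capability
    native = ptx = False
    for arch in arch_list:
        m = re.fullmatch(r"(sm|compute)_(\d+)(\d)", arch)
        if not m:
            continue  # skip entries this sketch does not understand
        target = (int(m.group(2)), int(m.group(3)))
        if m.group(1) == "sm":
            # A cubin runs on the same major version with an equal or higher minor.
            native = native or (target[0] == major and target[1] <= minor)
        else:
            # PTX can be JIT-compiled for the same or any newer architecture.
            ptx = ptx or target <= (major, minor)
    if native:
        return "native"
    return "ptx-jit" if ptx else "unsupported"


def cudnn_conv_check() -> Dict[str, Any]:
    """Run one small conv forward/backward on GPU 0 with cuDNN on, then off."""
    import torch  # imported lazily; requires a CUDA-enabled PyTorch build

    info: Dict[str, Any] = {
        "torch_cuda": torch.version.cuda,
        "cudnn": torch.backends.cudnn.version(),
        "capability": torch.cuda.get_device_capability(0),
        "arch_list": torch.cuda.get_arch_list(),
    }
    info["kernel_support"] = kernel_support(info["capability"], info["arch_list"])
    conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
    x = torch.randn(2, 3, 32, 32, device="cuda")
    for enabled in (True, False):
        torch.backends.cudnn.enabled = enabled
        try:
            conv(x).sum().backward()
            torch.cuda.synchronize()
            info[f"cudnn_enabled={enabled}"] = "ok"
        except RuntimeError as exc:
            info[f"cudnn_enabled={enabled}"] = str(exc)
    torch.backends.cudnn.enabled = True  # restore PyTorch's default
    return info
```

Calling `cudnn_conv_check()` on the affected machine returns a dict. If the cuDNN-enabled run fails while the cuDNN-disabled run succeeds, that suggests a cuDNN build issue rather than a problem in the model code. A `ptx-jit` or `unsupported` classification would point at the installed PyTorch/CUDA build; `ptx-jit` would also be consistent with a slow first CUDA start-up. The same cuDNN message is also commonly reported when the GPU runs out of memory, so a smaller batch size is a cheap second check.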