RuntimeError: CUDA error: an illegal memory access was encountered #231

Open
eddiewrc opened this issue May 23, 2022 · 2 comments
@eddiewrc

Hi, first of all, thanks for sharing this library with all of us!
Unfortunately, I am encountering a few problems while trying to run it. In particular, I tried to build the following network, which is supposed to take as input a sparse tensor of shape (8192, 16384). Part of it is commented out because I tried to locate the origin of the problem, and apparently the error occurs with just the first Convolution module (so I have commented out the rest for now).

The error I get is pasted below. The GPU is a Quadro GV100, the system CUDA version is 11.4, and PyTorch is 1.11.0 (py3.9_cuda11.3_cudnn8.2.0_0).

import torch as t
import sparseconvnet as scn

class HCSparseConvNet1(t.nn.Module):
    def __init__(self, featSize, numOut, size, name="NN"):
        super(HCSparseConvNet1, self).__init__()
        print(size)
        self.inputLayer = scn.InputLayer(2, size, 2)

        # The remaining sparse layers are commented out while I narrow down the error;
        # the crash already happens with just this first Convolution:
        self.sparseModel = scn.Sequential(scn.Convolution(2, 1, 4, 8, 8, True))
        # , scn.Convolution(2, 4, 8, 8, 4, True), scn.LeakyReLU(),
        # scn.Convolution(2, 8, 16, 3, 2, True), scn.LeakyReLU(),
        # scn.Convolution(2, 16, 16, 3, 2, True), scn.SparseToDense(2, 16),
        # scn.MaxPooling(2, 16, 8), scn.Convolution(2, 10, 10, 64, 32, False)
        self.out1 = t.nn.Sequential(
            t.nn.GroupNorm(1, 16), t.nn.Tanh(),
            t.nn.Conv2d(16, 8, 3, 2), t.nn.GroupNorm(1, 8), t.nn.Tanh(),
            t.nn.Conv2d(8, 4, 3, 1, padding=1), t.nn.GroupNorm(1, 4), t.nn.Tanh())
        # self.spatial_size = self.sparseModel.input_spatial_size(size)
        self.final = t.nn.Sequential(
            t.nn.Linear(7812, 100), t.nn.LayerNorm(100), t.nn.Tanh(),
            t.nn.Linear(100, numOut))

    def forward(self, x, batchSize):
        # print(x[0].size(), x[1].size())
        x = self.inputLayer(x)
        x = self.sparseModel(x)
        print(x)
        # x = self.out1(x)
        # print(x.size())
        # x = self.final(x.view(batchSize, -1))
        return x

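For completeness, this is roughly how the model gets called (a minimal sketch; the real wrapper code is not shown, and the spatial size, coordinates, features, and constructor arguments below are illustrative placeholders):

```python
import torch as t
import sparseconvnet as scn

device = 'cuda'
size = t.LongTensor([8192, 16384])           # spatial size of the 2D input grid
model = HCSparseConvNet1(featSize=1, numOut=2, size=size).to(device)  # args illustrative

# scn.InputLayer expects [coordinates, features]:
#   coordinates: CPU LongTensor of shape (N, 3) -> (two spatial coordinates, batch index) for dimension=2
#   features:    float tensor of shape (N, 1) on the model's device (nIn=1 for the first Convolution)
coord = t.LongTensor([[0, 0, 0], [5, 7, 0], [100, 200, 0]])
features = t.ones(3, 1).to(device)

yp = model([coord, features], 1)             # batchSize=1 in this toy example
```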
The error:

Traceback (most recent call last):
  File "/home/eddiewrc/galiana2/galianaHCsparseConvNet.py", line 144, in <module>
    sys.exit(main(sys.argv))
  File "/home/eddiewrc/galiana2/galianaHCsparseConvNet.py", line 94, in main
    wrapper.fit(X, Y, device, epochs=50, batch_size = 11, LOG=False)
  File "/home/eddiewrc/galiana2/sources/HCModels.py", line 200, in fit
    yp = self.model.forward([coord, features], batchSize)
  File "/home/eddiewrc/galiana2/sources/HCModels.py", line 58, in forward
    print(x)
  File "/home/eddiewrc/SparseConvNet/sparseconvnet/sparseConvNetTensor.py", line 58, in __repr__
    'features=' + repr(self.features) + \
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor.py", line 305, in __repr__
    return torch._tensor_str._str(self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 434, in _str
    return _str_intern(self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 409, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 264, in _tensor_str
    formatter = _Formatter(get_summarized_data(self) if summarize else self)
  File "/home/eddiewrc/miniconda3/lib/python3.9/site-packages/torch/_tensor_str.py", line 296, in get_summarized_data
    return torch.stack([get_summarized_data(x) for x in (start + end)])
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
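As the message says, the reported line may not be where the kernel actually failed, because CUDA errors are raised asynchronously. One way to get a more accurate trace is to force synchronous launches (equivalently, run the script with `CUDA_LAUNCH_BLOCKING=1 python ...` from the shell):

```python
import os

# Must be set before CUDA is initialised (i.e. before the first CUDA call),
# so kernel launches become synchronous and errors surface at the real call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # imported after setting the environment variable on purpose
```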
@eddiewrc

eddiewrc commented May 23, 2022

I have an addition to make:
these are the GPU settings on my machine (3 GPUs).
Apparently the error happens only when I try to use GPUs 1 and 2, while the library works fine on what PyTorch recognizes as cuda:0 (which happens to be Quadro #1).

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN Xp     Off  | 00000000:09:00.0 Off |                  N/A |
| 30%   52C    P2    65W / 250W |   1521MiB / 12196MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro GV100        Off  | 00000000:83:00.0 Off |                  Off |
| 38%   52C    P2    40W / 250W |   3379MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Quadro GV100        Off  | 00000000:84:00.0 Off |                  Off |
| 36%   49C    P2    40W / 250W |      8MiB / 32508MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
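Note that PyTorch's device numbering does not have to match nvidia-smi's: the CUDA runtime defaults to CUDA_DEVICE_ORDER=FASTEST_FIRST, while nvidia-smi lists devices by PCI bus ID, which is presumably why cuda:0 maps to one of the GV100s here. A quick way to check the mapping:

```python
import torch

# Print how PyTorch's cuda:N indices map onto the physical cards;
# this ordering can differ from nvidia-smi's, which sorts by PCI bus ID.
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} -> {torch.cuda.get_device_name(i)}")
```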

@AndreGraca98

Hello, I also had this issue, but I found a workaround. If you call torch.cuda.set_device(1) before sending the model to the device with model.to('cuda:1'), it works fine :)
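Something like this (a minimal sketch; the constructor arguments are just placeholders taken from the issue above):

```python
import torch

# Make GPU 1 the current CUDA device *before* building / moving the model,
# so the kernels SparseConvNet launches run on the same GPU the tensors live on.
torch.cuda.set_device(1)

model = HCSparseConvNet1(featSize=1, numOut=2, size=size).to('cuda:1')  # args illustrative
```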
