Hi,
I'm facing issues with running Tinker-HP GPU with Deep-HP enabled on an HPC cluster. I prepared the system according to the instructions: I load the NVIDIA HPC SDK through its modulefile together with the GNU compilers from a modulefile provided by the HPC administrators, and I create the conda environment from the tinkerml.torch.yaml file, modified for the chosen CUDA version. The build through the installation script succeeds with the following settings (a rough sketch of my full build sequence is given after the list):
target_arch='gpu'
c_c=80
cuda_ver=11.0
FPA=1 [left as default]
build_plumed=0 [left as default]
build_colvars=0 [left as default]
NN=1
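For reference, here is roughly what my build sequence looks like. The module names are specific to our cluster, and the way install.sh consumes the settings above (edited in the script vs. prompted) may differ in your setup, so treat this as an approximate sketch rather than the exact commands:

```bash
# Load the compiler stack (module names are cluster-specific assumptions)
module load nvhpc/22.7       # NVIDIA HPC SDK modulefile
module load gcc/12.2.0       # GNU compiler modulefile from the HPC admins

# Create the conda environment from the repository's tinkerml.torch.yaml,
# edited beforehand to match the chosen CUDA version
conda env create -f tinkerml.torch.yaml -n DeepHP-torch
conda activate DeepHP-torch

# Run the Tinker-HP installation script, with target_arch, c_c, cuda_ver,
# FPA, build_plumed, build_colvars and NN set as listed above
./install.sh
```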
The build completes without error. 'Normal' tasks such as dynamic or analyze run without a problem. However, when I try to run the ML-potential tasks from the 'examples' directory, the run fails at library load with this error:
Exception: Fail to load modules with exception: /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /users/kdm/adrmalin/.conda/envs/DeepHP-torch/lib/python3.10/site-packages/torch/../../../libtorch_python.so)
It looks like the binaries link against the libstdc++.so in the default /usr/lib64/ location, which is outdated and therefore causes the error, instead of against the library provided by the HPC compiler module. The GNUROOT variable is correctly identified by the install.sh script, though, and as far as I was able to check, all variables inside the makefile are defined correctly by the install script.
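For what it's worth, this is how I have been checking which libstdc++ gets picked up and the workaround I have tried. The torch library path below is just where libtorch_python.so usually sits in a conda/pip torch install, and the $GNUROOT/lib64 layout is an assumption about the GCC module, so adjust as needed:

```bash
# List the GLIBCXX versions provided by the system libstdc++ (the one the
# error points at); GLIBCXX_3.4.30 does not show up there
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX

# Check which libstdc++ the torch shared library actually resolves at runtime
# (path is the usual location inside the conda env; adjust to your install)
ldd $CONDA_PREFIX/lib/python3.10/site-packages/torch/lib/libtorch_python.so | grep libstdc++

# Attempted workaround: put the newer libstdc++ from the GCC module first on
# the runtime search path ($GNUROOT is the same variable install.sh detects)
export LD_LIBRARY_PATH=$GNUROOT/lib64:$LD_LIBRARY_PATH
```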
In a first attempt I built with HPC-SDK 22.7 / CUDA 11.7 / GNU 12.2.0, since this combination of SDK and CUDA versions is mentioned as working correctly; I then tried downgrading to HPC-SDK 22.2 / CUDA 11.0 / GNU 9.3, with the same result. In each case I modified the conda .yaml file to match the CUDA version.
The HPC cluster runs Rocky Linux 8.10 and the GPU nodes contain NVIDIA A100 GPUs.
I've also encountered a minor issue with the examples provided in the GitHub package: the Deep-HP_example1 file failed to run with the error:
Error in dispersion neigbor list: max cutoff + buffer should be less than half one edge of the box
dispersion cutoff = 9.000
After adding the keyword disp-cutoff 7, the run fails with the above-mentioned GLIBCXX error instead (the change is shown below).
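For completeness, this is the only change I made to the example before running into the GLIBCXX error again (the key-file name is a placeholder for whichever .key file Deep-HP_example1 uses):

```bash
# Lower the dispersion cutoff so cutoff + buffer fits within half the box edge
# ("example1.key" is a placeholder name, not the actual file in the repository)
echo "disp-cutoff 7" >> example1.key
```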
Is there any way to make this work?