Failure to run ML-enabled example runs #26

Open
adrmalin opened this issue Jan 27, 2025 · 0 comments

Hi,
I'm facing issues running the Tinker-HP GPU build with Deep-HP enabled on an HPC cluster. I prepared the system according to the instructions: I load the NVIDIA HPC SDK through its modulefile and the GNU compilers from a modulefile provided by the HPC administrators, and I activate a conda environment created from the tinkerml.torch.yaml file, modified for the chosen CUDA version (a sketch of this setup follows the settings list below). The build via the installation script succeeds with the following settings:

target_arch='gpu', c_c=80
cuda_ver=11.0
FPA=1 [left as default]
build_plumed=0 [left as default]
build_colvars=0 [left as default]
NN=1
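
For completeness, the environment setup looks roughly like this (the module names and versions are placeholders for the site-specific ones; the conda env name is taken from the torch path in the error further down):

  module load nvhpc/22.7        # NVIDIA HPC SDK modulefile (site-specific name/version)
  module load gcc/12.2.0        # GNU compilers from the admins' modulefile
  conda env create -f tinkerml.torch.yaml
  conda activate DeepHP-torch   # env name as it appears in the error path below
  ./install.sh                  # build with the settings listed above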

The build completes without error, and ‘normal’ tasks such as dynamic or analyze run without a problem. When I try to run the ML-potential tasks from the ‘examples’ directory, however, the run fails at library load with the following error:

Exception: Fail to load modules with exception: /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /users/kdm/adrmalin/.conda/envs/DeepHP-torch/lib/python3.10/site-packages/torch/../../../libtorch_python.so)

It looks like the binaries link against the libstdc++.so library in the default /usr/lib64/ location, which is too old and therefore triggers the error, instead of the library provided by the HPC compiler module. The GNUROOT variable is correctly identified by the install.sh script, though, and as far as I was able to check, all variables inside the makefile are defined correctly by the install script.
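
For reference, these are the kinds of checks that can confirm the mismatch (the lib64 locations under $GNUROOT and $CONDA_PREFIX are assumptions about where a newer libstdc++ may live on this system):

  # does the system library provide the symbol version torch needs?
  strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX_3.4.30
  # does the compiler module's copy provide it?
  strings "$GNUROOT/lib64/libstdc++.so.6" | grep GLIBCXX_3.4.30
  # does conda's own copy provide it?
  strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep GLIBCXX_3.4.30
  # runtime workaround: make the dynamic linker prefer the newer library
  export LD_LIBRARY_PATH="$GNUROOT/lib64:$LD_LIBRARY_PATH"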
In a first attempt I built with HPC-SDK 22.7 / CUDA 11.7 / GNU 12.2.0, as this combination of SDK and CUDA versions is mentioned as working correctly; I then downgraded to HPC-SDK 22.2 / CUDA 11.0 / GNU 9.3 with the same result. In each case I modified the conda .yaml file to match the CUDA version.
The HPC cluster runs Rocky Linux 8.10, and the GPU nodes contain NVIDIA A100 GPUs.
I've also encountered a minor issue with the examples provided in the GitHub package: the Deep-HP_example1 file failed to run with the error:

Error in dispersion neigbor list: max cutoff + buffer should be less than half one edge of the box
  dispersion cutoff =          9.000

After adding the keyword disp-cutoff 7, the run fails with the GLIBCXX error mentioned above.
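
For reference, this is how I added the keyword (the key-file name example1.key is a placeholder; each example ships its own key file):

  # shrink the dispersion cutoff so cutoff + buffer fits in half the box edge
  echo "disp-cutoff 7.0" >> example1.key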
Is there any way to make this work?
