Failure to run ML-enabled example runs #26

Open
adrmalin opened this issue Jan 27, 2025 · 0 comments

Hi,
I'm facing issues running the Tinker-HP GPU build with Deep-HP enabled on an HPC cluster. I prepared the system according to the instructions: I load the NVIDIA HPC SDK through its modulefile and the GNU compilers from a modulefile provided by the HPC administrators, and I activate a conda environment created from the tinkerml.torch.yaml file, modified for the chosen CUDA version (a sketch of this setup follows the settings list below). The build via the installation script succeeds with the following settings:

target_arch='gpu', c_c=80
cuda_ver=11.0
FPA=1 [left as default]
build_plumed=0 [left as default]
build_colvars=0 [left as default]
NN=1
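
For completeness, the environment setup looks roughly like this (the module names and versions are placeholders for the site-specific ones; the conda env name is taken from the torch path in the error further down):

  module load nvhpc/22.7        # NVIDIA HPC SDK modulefile (site-specific name/version)
  module load gcc/12.2.0        # GNU compilers from the admins' modulefile
  conda env create -f tinkerml.torch.yaml
  conda activate DeepHP-torch   # env name as it appears in the error path below
  ./install.sh                  # build with the settings listed above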

The build completes without error, and ‘normal’ tasks such as dynamic or analyze run without a problem. When I try to run the ML-potential tasks from the ‘examples’ directory, however, the run fails at library load with the following error:

Exception: Fail to load modules with exception: /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /users/kdm/adrmalin/.conda/envs/DeepHP-torch/lib/python3.10/site-packages/torch/../../../libtorch_python.so)

It looks like the binaries link against the libstdc++.so library in the default /usr/lib64/ location, which is too old and therefore triggers the error, instead of the library provided by the HPC compiler module. The GNUROOT variable is correctly identified by the install.sh script, though, and as far as I was able to check, all variables inside the makefile are defined correctly by the install script.
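
For reference, these are the kinds of checks that can confirm the mismatch (the lib64 locations under $GNUROOT and $CONDA_PREFIX are assumptions about where a newer libstdc++ may live on this system):

  # does the system library provide the symbol version torch needs?
  strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX_3.4.30
  # does the compiler module's copy provide it?
  strings "$GNUROOT/lib64/libstdc++.so.6" | grep GLIBCXX_3.4.30
  # does conda's own copy provide it?
  strings "$CONDA_PREFIX/lib/libstdc++.so.6" | grep GLIBCXX_3.4.30
  # runtime workaround: make the dynamic linker prefer the newer library
  export LD_LIBRARY_PATH="$GNUROOT/lib64:$LD_LIBRARY_PATH"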
In a first attempt I built with HPC-SDK 22.7 / CUDA 11.7 / GNU 12.2.0, as this combination of SDK and CUDA versions is mentioned as working correctly; I then downgraded to HPC-SDK 22.2 / CUDA 11.0 / GNU 9.3 with the same result. In each case I modified the conda .yaml file to match the CUDA version.
The HPC cluster runs Rocky Linux 8.10, and the GPU nodes contain NVIDIA A100 GPUs.
I've also encountered a minor issue with the examples provided in the GitHub package: the Deep-HP_example1 file failed to run with the error:

Error in dispersion neigbor list: max cutoff + buffer should be less than half one edge of the box
  dispersion cutoff =          9.000

After adding the keyword disp-cutoff 7, the run fails with the GLIBCXX error mentioned above.
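
For reference, this is how I added the keyword (the key-file name example1.key is a placeholder; each example ships its own key file):

  # shrink the dispersion cutoff so cutoff + buffer fits in half the box edge
  echo "disp-cutoff 7.0" >> example1.key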
Is there any way to make this work?
