Bugs and issues running oq on a cluster #10377
Well, there is a reason why the engine support for SLURM clusters is still flagged experimental ;-)
The one in your $HOME will take precedence: this is the desired behavior since forever, independently from being on a cluster. You can run openquake calculations with […]

About the […]: this is strange and indeed looks like a bug.
Thank you very much for the response - of course I understand that it is a work in progress :) It is already really cool to be able to run it distributed on multiple nodes at all :) I'm mostly giving feedback for when you work on the feature again.

I went back and tried to reproduce the bug where some settings were taken from the oq config and others from the home directory. I cannot reproduce it. But I'm sure it didn't work before... I'll investigate when/if it appears again.

The `submit_cmd` parsing is really difficult to debug (I would have never figured it out if I didn't go look at the code), so I thought I'd report it. Running it using a processpool is obviously not what I'd want on a cluster. If you work on this feature again, it would be really appreciated to allow running it in a normal sbatch way. If that is not possible, a way of passing options through to slurm would be appreciated.
Feel free to give suggestions. For our use case we wanted to figure out some good SLURM options and nail them down in openquake.cfg or submit_cmd […]
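For context, a hedged sketch of what such nailed-down settings could look like in openquake.cfg (the `submit_cmd` and `slurm_time` keys are mentioned in this thread; the section name and the concrete values are assumptions, not verified engine defaults):

```ini
[distribution]
# use the experimental SLURM support (assumed key/value)
oq_distribute = slurm
# site-specific flags baked into the submit command once and for all
submit_cmd = sbatch --account=my_account --partition=normal
# wall-clock limit handed to SLURM (key mentioned in this thread)
slurm_time = 04:00:00
```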
I honestly would prefer just being able to write my own sbatch file normally. This would IMO streamline the execution, since "OQ things" are handled the oq way (ini file, cli options) and all "slurm configurations" would be handled the slurm way (sbatch file, cli options, srun cli). From a user perspective I don't see the advantage of putting an "oq wrapper" around slurm.
It was implemented as it is for users knowing nothing about SLURM and wanting to have the same experience as on their laptops. Another option (which we considered in the past) would be having the engine generate a batch file that the user can then customize and run with sbatch. Unfortunately, now is not a good moment for us to go back to SLURM, so it will have to wait.
I understand that. However, from what I have seen so far working with HPC clusters, each one has its own peculiarities, and in the end the user has to have a basic understanding anyway. Sure, feel free to let us know when you go back to SLURM; we are happy to share experiences. It is really working nicely for us right now and we are exploring the new possibilities it gives us. Thank you very much!
One more question: does a disaggregation calculation scale with more nodes, and if yes, how?
Disaggregation calculations normally use only 1 site and are much faster than classical calculations, so there is little interest in running them on a cluster; certainly they have not been optimized for use via SLURM. My guess is that they will work but not produce enough tasks to take advantage of the available cores. Disclaimer: I never tried to run a disaggregation via SLURM. Most likely, running disaggregations on hundreds/thousands of sites with any efficiency would require changes to the engine, and it is a nontrivial project.
Dear OQ team, I am running tests with OpenQuake on a cluster (CINECA currently). I have some feedback and ran into some issues. First, for reproducibility, my setup:

They preferred not installing OQ as a module. I am using a "spack" environment, in which I added and installed python and pip. Using the pip from this environment, I then installed OpenQuake by hand (installing the correct requirements file and then the engine as an editable install), as sketched below. Installing OQ in this way works very well for me and allows me the most flexibility.
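A minimal sketch of that kind of installation, assuming a checkout of the oq-engine repository; the requirements file name is a placeholder to be matched against your platform and python version:

```bash
# inside an activated spack environment that provides python and pip
git clone https://github.com/gem/oq-engine.git
cd oq-engine
# placeholder file name: pick the requirements file matching your
# platform and python version from the repository root
pip install -r requirements-py311-linux64.txt
# editable install of the engine itself
pip install -e .
```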
Bug No 1:
Possibly not even cluster related. I had an `openquake.cfg` in my user home directory. I however expected that the `oq-engine/openquake/engine/openquake.cfg` would be used. In this specific case, most of the configurations were in fact taken from the `oq-engine/...` configuration file (like `slurm_time`), however the `submit_cmd` was taken from the `openquake.cfg` in my user home!

Running the openquake command using `-c CONFIG_FILE_PATH` did not have any effect.
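A reproduction sketch of the mixed precedence described above; file contents, paths, the section name, and the exact placement of the `-c` flag are all illustrative:

```bash
# ~/openquake.cfg overrides only submit_cmd:
#   [distribution]
#   submit_cmd = sbatch --account=my_account
#
# oq-engine/openquake/engine/openquake.cfg sets slurm_time.
# Observed: slurm_time came from the repository file, submit_cmd from
# $HOME, and passing an explicit config file had no effect:
oq engine --run job.ini -c /path/to/openquake.cfg
```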
Bug No 2:
`openquake/baselib/slurm.py`, line 7, splits the `submit_cmd` using spaces, which results in flags like `--account=my_account` being ignored.
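For illustration, a small self-contained sketch of space splitting versus quote-aware splitting (this is not the actual slurm.py code, and the engine's real failure mode may differ; shlex is just one way to keep such flags intact):

```python
import shlex

# a submit command as it might appear in openquake.cfg
submit_cmd = 'sbatch --account=my_account --comment="test run"'

# naive splitting on spaces mangles quoted values and makes it easy
# to drop or misread flags downstream
print(submit_cmd.split(' '))
# ['sbatch', '--account=my_account', '--comment="test', 'run"']

# shlex.split follows shell quoting rules, so every flag survives
# as a single argument
print(shlex.split(submit_cmd))
# ['sbatch', '--account=my_account', '--comment=test run']
```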
Issue:
Passing slurm options to the openquake calculations is currently very cumbersome. A few have their own keys in the `cfg` file, like time and cpus-per-task; the number of nodes is passed directly to `oq engine --run`, and all other flags must be added to the `submit_cmd` in the `cfg` file. This is very inflexible, especially because some flags need to be adapted from run to run, and then I have to edit the openquake.cfg file every time!

Is there a way I could simply run openquake using `srun`, or at least create my own sbatch file and run openquake this way? If not, I would very much like such an option!
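For concreteness, a hedged sketch of the kind of hand-written batch script this request is about; the account, partition, resource values, and environment path are hypothetical, and running a plain `oq engine --run` inside the allocation is assumed to stay within one node's process pool rather than being a supported multi-node workflow:

```bash
#!/bin/bash
#SBATCH --account=my_account      # hypothetical account
#SBATCH --partition=normal        # hypothetical partition
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64
#SBATCH --time=04:00:00

# activate the environment where OQ was pip-installed (placeholder path)
source "$HOME/oq-env/bin/activate"

# run the calculation on the allocated node; without engine-side SLURM
# support this uses only the resources of this one node
oq engine --run job.ini
```

Submitted as usual with `sbatch my_oq_job.sbatch`.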