Bugs and issues running oq on a cluster #10377

Open
schmidni opened this issue Feb 25, 2025 · 8 comments

@schmidni
Contributor

Dear OQ team, I am running tests with OpenQuake on a cluster (currently CINECA). I have some feedback and I ran into a few issues. First, for reproducibility, here is my setup:

The cluster admins preferred not to install OQ as a module, so I am using a Spack environment to which I added Python and pip. Using the pip from this environment, I then installed OpenQuake by hand (first the correct requirements file, then the engine as an editable install). Installing OQ this way works very well for me and gives me the most flexibility.
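For reference, this is roughly what the manual install looks like; the exact requirements file name depends on your Python version and platform, so take it as a sketch:

$ git clone https://github.com/gem/oq-engine.git
$ cd oq-engine
$ pip install -r requirements-py311-linux64.txt   # pick the file matching your Python/platform
$ pip install -e .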

Bug No 1:
Possibly not even cluster related. I had an openquake.cfg in my home directory, but I expected the one in oq-engine/openquake/engine/openquake.cfg to be used. In this specific case most of the settings (like slurm_time) were in fact taken from the oq-engine/... configuration file, yet submit_cmd was taken from the openquake.cfg in my home directory!
Running the openquake command with -c CONFIG_FILE_PATH did not have any effect either.

Bug No 2:
openquake/baselib/slurm.py, line 7, splits submit_cmd on spaces, which causes flags such as --account=my_account to be ignored.

Issue:
Passing SLURM options to OpenQuake calculations is currently very cumbersome. A few options have their own keys in the cfg file (like time and cpus-per-task), the number of nodes is passed directly to oq engine --run, and all other flags must be added to submit_cmd in the cfg file. This is very inflexible, especially because some flags need to change from run to run, which means editing openquake.cfg every time.

Is there a way to simply run OpenQuake via srun, or at least to write my own sbatch file and run OpenQuake that way? If not, I would very much like such an option! Roughly what I have in mind is sketched below.
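For illustration only (account, resources and file names are placeholders), something like a plain sbatch script:

#!/bin/bash
#SBATCH --account=my_account
#SBATCH --nodes=1
#SBATCH --cpus-per-task=64
#SBATCH --time=04:00:00

oq engine --run job.ini

submitted in the usual way with sbatch.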

@micheles
Contributor

micheles commented Feb 25, 2025

Well, there is a reason why the engine support for SLURM clusters is still flagged as experimental ;-)
Regarding the configuration file: the command oq info cfg tells you which files are considered and in which order:

 $ oq info cfg  # on my machine
Looking at the following paths (the last wins)
/home/michele/oq-engine/openquake/engine/openquake.cfg
/home/michele/openquake/openquake.cfg
/home/michele/openquake.cfg

The one in your $HOME takes precedence: this has always been the intended behavior, independently of whether you are on a cluster.

You can run OpenQuake calculations with srun/sbatch (just use oq_distribute=processpool and not slurm), but then you lose the ability to split a single calculation across multiple nodes. This makes sense if you have many medium-sized calculations.
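For example, a minimal sketch (the SLURM flags are placeholders, and oq_distribute=processpool is assumed to be set in your openquake.cfg):

$ srun --nodes=1 --cpus-per-task=64 --time=01:00:00 oq engine --run job.ini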

About submit_cmd: we added the minimum amount of functionality that we needed, but we are willing to extend it and we accept contributions!

> Running the openquake command using -c CONFIG_FILE_PATH did not have any effect.

This is strange and indeed looks like a bug.

@micheles micheles added this to the Engine 3.24.0 milestone Feb 25, 2025
@micheles micheles self-assigned this Feb 25, 2025
@schmidni
Contributor Author

schmidni commented Feb 25, 2025

Thank you very much for the response. Of course I understand that it is a work in progress :) It is already really cool to be able to run it distributed on multiple nodes at all :) I'm mostly giving feedback for when you work on the feature again.

I went back and tried to reproduce the bug where some settings were taken from the oq-engine config and others from the one in my home directory. I cannot reproduce it, but I'm sure it didn't work before... I'll investigate if it appears again.

The submit_cmd parsing is really difficult to debug (I would never have figured it out if I hadn't gone to look at the code), so I thought I'd report it.

Running it with a processpool is obviously not what I'd want on a cluster. If you work on this feature again, it would be really appreciated if you allowed running it in a normal sbatch way. If that is not possible, a way of passing options through to SLURM would be appreciated.

@micheles
Contributor

micheles commented Feb 25, 2025

Feel free to give suggestions. For our use case we wanted to figure out some good SLURM options and nail them down in openquake.cfg or submit_cmd (submit_cmd predates SLURM; it was already there, so we reused it). If you need to change the SLURM parameters often, maybe we should add an option such as oq engine --run job.ini --slurm-file=slurm.conf? That way you could keep your own collection of SLURM configuration files for different situations.
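Purely as an illustration of the idea (this option does not exist yet; the file name and flags are placeholders), it could look like this:

$ cat cineca.conf
--account=my_account
--partition=my_partition
--time=02:00:00
$ oq engine --run job.ini --slurm-file=cineca.conf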

@schmidni
Contributor Author

I honestly would prefer just being able to write my own sbatch file normally. In my opinion this would streamline the execution, since "OQ things" are handled the OQ way (ini file, CLI options) and all "SLURM configuration" is handled the SLURM way (sbatch file, CLI options, srun CLI). From a user perspective I don't see the advantage of putting an "OQ wrapper" around SLURM.
But that is just my personal opinion :)

@micheles
Contributor

It was implemented as it is for users who know nothing about SLURM and want the same experience as on their laptops. Another option (which we considered in the past) would be to have the engine generate a batch file that the user can then customize and run with sbatch. Unfortunately, right now is not a good moment for us to go back to SLURM, so this will have to wait.

@schmidni
Contributor Author

I understand that. However, from what I have seen so far working with HPC clusters, each one has its own peculiarities and in the end the user needs a basic understanding anyway.
Also, any tools that sit between or around SLURM and the execution of a program (profiling, containerization, third-party frameworks, ...) will no longer work if OQ takes care of its own execution. Having the engine generate an sbatch file could work; ideally this would be an optional feature.

Sure, feel free to let us know when you go back to SLURM; we are happy to share experiences. It is really working nicely for us right now and we are exploring the new possibilities it gives us. Thank you very much!

@schmidni
Contributor Author

One more question: does a disaggregation calculation scale with more nodes, and if so, how?

@micheles
Contributor

micheles commented Feb 28, 2025

Disaggregation calculations normally use only one site and are much faster than classical calculations, so there is little interest in running them on a cluster; certainly they have not been optimized for use via SLURM. My guess is that they will work but will not produce enough tasks to take advantage of the available cores. Disclaimer: I have never tried to run a disaggregation via SLURM. Most likely, running disaggregations on hundreds or thousands of sites with any efficiency would require changes to the engine, and that is a nontrivial project.
