Description
Hi, we are currently using AWS ParallelCluster as a Slurm cluster with a capacity reservation of 3 p4de.24xlarge instances. We've run into a few issues and couldn't find clear guidance on how to address them, so we wanted to check here. I have included our cluster config YAML file and would appreciate feedback.
Problem 1: Custom Changes to config files
We would like to make changes to the Slurm config files to enable certain behavior. Specifically, we would like to set the following:
# temp environment changes
PrologFlags = Alloc,Contain,X11
JobContainerType = job_container/tmpfs
# this might help make it so that nvidia-smi is isolated
ConstrainDevices = yes
ConstrainRAMSpace = yes
# For OOM containment
JobAcctGatherType = jobacct_gather/cgroup
JobAcctGatherParams = NoOverMemoryKill
# make salloc call srun for interactive jobs
LaunchParameters = use_interactive_step
# or: LaunchParameters = use_interactive_step,enable_nss_slurm
However, we've found that we can't set these parameters through the CustomSlurmSettings option. For the tmpfs job containers (per-job temporary environments), it also seems we would need a custom job_container.conf file, but I currently see no way to create one via the cluster config.
Question: can we manually enable all of these options ourselves without repercussions? What would you suggest?
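For reference, here is a sketch of the kind of manual approach we have been considering, e.g. run from an OnNodeConfigured custom action. The file paths are the ParallelCluster defaults on our AMI and the job_container.conf values (AutoBasePath/BasePath) are illustrative, not something we have validated end to end:

```shell
#!/bin/bash
# Sketch: append the Slurm settings that CustomSlurmSettings rejects
# directly to the config files on the head node, then restart slurmctld.
# SLURM_ETC is the default ParallelCluster install path; adjust as needed.
SLURM_ETC="${SLURM_ETC:-/opt/slurm/etc}"

# slurm.conf-level parameters:
cat >> "${SLURM_ETC}/slurm.conf" <<'EOF'
PrologFlags=Alloc,Contain,X11
JobContainerType=job_container/tmpfs
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=NoOverMemoryKill
LaunchParameters=use_interactive_step
EOF

# The cgroup constraints live in cgroup.conf, not slurm.conf:
cat >> "${SLURM_ETC}/cgroup.conf" <<'EOF'
ConstrainDevices=yes
ConstrainRAMSpace=yes
EOF

# tmpfs job containers require their own job_container.conf
# (values here are placeholders):
cat > "${SLURM_ETC}/job_container.conf" <<'EOF'
AutoBasePath=true
BasePath=/tmp
EOF

systemctl restart slurmctld
```

Our concern is whether edits like this survive cluster update operations, or whether ParallelCluster will overwrite them.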
Problem 2: Separate partition for root (/) and how to enable usrquota.
We would like to mount root (/) on a separate file system, through Lustre or something else. However, the documentation states that only a single Lustre file system can be used as part of a given installation. Second, we would like to constrain each user's home directory to a particular size. Can you share how we could enable this programmatically? We could do it manually as described here: http://www.yolinux.com/TUTORIALS/LinuxTutorialQuotas.html, but we are wondering whether there are other alternatives.
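For context, the manual quota setup we would otherwise script ourselves looks roughly like this, assuming /home is an ext4 volume with its own line in /etc/fstab (the username and limits below are illustrative):

```shell
#!/bin/bash
# Sketch: enable per-user disk quotas on /home (requires root).

# 1. Add the usrquota mount option to the /home entry in /etc/fstab
#    (the sed pattern assumes a "defaults" options field; adjust to
#    your actual fstab line), then remount:
sed -i 's|\(/home .*defaults\)|\1,usrquota|' /etc/fstab
mount -o remount /home

# 2. Build the quota index files and turn quotas on:
quotacheck -cum /home
quotaon /home

# 3. Set soft/hard block limits for a user (in 1K blocks),
#    here ~10 GB soft / ~12 GB hard, no inode limits:
setquota -u someuser 10485760 12582912 0 0 /home
```

We would presumably have to run something like this per user as accounts are created, which is why we are asking whether ParallelCluster offers anything built in.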
Thanks for the help.