-
Notifications
You must be signed in to change notification settings - Fork 39
Add total memory to job info 878 #879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This is so strange. Locally I have to add And yet, in the CI I get different failures. So, we have divergence in the tests between local and remote, which is horrible from a code velocity perspective. Here's the other issue, Anyway, when I get the tests to somehow work, I'll finish this extremely simple ticket which has now taken weeks because of testing. testing. |
return nil if mem_str.nil? || mem_str.strip.empty? || !mem_str.match(/[KMGTP]/) | ||
|
||
unit = mem_str.match(/[KMGTP]/).to_s | ||
value = mem_str.match(/\d+/).to_s | ||
|
||
return nil unless unit && value | ||
|
||
factor = { | ||
"K" => 1024, | ||
"M" => 1024**2, | ||
"G" => 1024**3, | ||
"T" => 1024**4, | ||
"P" => 1024**5 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still a bit concerned about having to do this translation. I don't think anything other than M
is ever given. Have you seen other values out of squeue
? docs indicate it's M
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can only say I've seen the 'M' and 'G' in outputs, and I had a hard time figuring out what SLURM does here with this tbh. I haven't seen any 'K', and I don't know if we have any jobs large enough to see 'T' or 'P'. But, I have seen both 'M' and 'G' in responses from squeue
live on the OSC systems with squeue --Format=MinMemory
though it's usually only a few users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤦♂️ We use the --noconvert
flag here so it's always M
, that's why I can't see any G out of say active jobs page.
--noconvert
Don't convert units from their original type (e.g. 2048M won't be converted to 2G).
ood_core/lib/ood_core/job/adapters/slurm.rb
Line 221 in be9ede9
args = ["--all", "--states=all", "--noconvert"] |
def compute_total_memory(v, allocated_nodes) | ||
return nil unless v[:min_memory].to_s.match?(/\d+/) | ||
|
||
min_memory = parse_memory(v[:min_memory]) | ||
|
||
return nil if min_memory.nil? | ||
|
||
min_memory * allocated_nodes.count | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right. Here's a related ticket on Slurm memory and what min_memory is actually reporting. Seems like we need to muliply this by how many cores have been allocated. And if the whole node is allocated, and this value is 0, then I guess 0 is the best value we can get at this time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think min_memory * cores
is the right calculation. I just submitted exclusive jobs on a cluster and it still reported min memory per core. By nodes this is only 8G when it's in fact ~480G across 2 machines.
[johrstrom dashboard(upgrade-hterm)] 🐹 squeue -j 939211 -o %N,%m,%C
NODELIST,MIN_MEMORY,CPUS
a[0273-0274],4027M,240
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I'm now seeing that even assuming it's per core is wrong.
I found this job on ascend, with 100G memory and 8 cores. But the actual memory usage isn't 800G, it's just the 100G.
[johrstrom dashboard(lmod-ui-3402)] 🐺 squeue --all --noconvert -j 939727 -o %m,%C
MIN_MEMORY,CPUS
102400M,8
It's not clear to me how to find out if this is value per core or if it's total.
Addresses #878.
OodCore::Job::Info
to ensure we pass back memory in a consistent manner usingbytes
nil
--mem
or--mem-per-cpu
slurm_spec.rb
as well to ensure the adapter are correct.info_spec.rb
to ensure correctness.