
Add total memory to job info 878 #879


Open · wants to merge 19 commits into master

Conversation

@Oglopf (Contributor) commented Apr 18, 2025

Addresses #878.

  • Adds a top-level method to OodCore::Job::Info so that memory is always passed back in a consistent unit (bytes); see the sketch after this list
  • Attempts to prevent issues with schedulers other than Slurm when any of the fields are missing or nil
  • Handles jobs that were submitted with either --mem or --mem-per-cpu
  • Computes the total memory in use from the per-node or per-cpu value
  • Still need to write tests for slurm_spec.rb to ensure the adapter changes are correct.
  • I had trouble getting these tests to run today; I need to work more with what I have in info_spec.rb to ensure correctness.
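
As a rough illustration of the idea described above, here is a minimal sketch of a bytes-normalizing accessor. It is not the PR's actual implementation, and the attribute names (mem_per_node_bytes, mem_per_cpu_bytes, node_count, cpu_count) are hypothetical placeholders rather than real OodCore::Job::Info fields.

# Hedged sketch only: report total job memory in bytes whether the job was
# submitted with --mem (per node) or --mem-per-cpu. All attribute names here
# are illustrative, not OodCore attributes.
class MemoryInfoSketch
  def initialize(mem_per_node_bytes: nil, mem_per_cpu_bytes: nil, node_count: 1, cpu_count: 1)
    @mem_per_node_bytes = mem_per_node_bytes  # request made via --mem
    @mem_per_cpu_bytes  = mem_per_cpu_bytes   # request made via --mem-per-cpu
    @node_count = node_count
    @cpu_count  = cpu_count
  end

  # Total memory in bytes, or nil if the scheduler reported nothing usable.
  def total_memory
    if @mem_per_node_bytes
      @mem_per_node_bytes * @node_count
    elsif @mem_per_cpu_bytes
      @mem_per_cpu_bytes * @cpu_count
    end
  end
end

# e.g. --mem-per-cpu=4G on 8 CPUs => 34359738368 bytes (32 GiB)
MemoryInfoSketch.new(mem_per_cpu_bytes: 4 * 1024**3, cpu_count: 8).total_memory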

@Oglopf (Contributor, Author) commented May 5, 2025

This is so strange. Locally I have to add OodCore::Job::Adapters::Slurm::Batch::Error to clear the last failure. It's strange that it fails locally yet passes in CI. Strange isn't the right word, though; annoying is better.

And yet, in CI I get different failures. So we have divergence between the local and remote test runs, which is horrible from a code-velocity perspective.

Here's the other issue: rspec is radically over-engineered. The mocking we do here is outdated, with various ways of mocking state and all these intricate objects we have to mock and call into, and it isn't even consistent across tests. It's bad from a developer-ergonomics perspective, really bad. Why do we, the devs, allow this to continue? I see a ticket #844 about the migration where a lot of bad info is being suggested as to why we do this. I for one would love to see the tests ported. We destroy this project by burning devs out with these kinds of bizarre, outdated techniques and silly arguments that are littered with false premises, except maybe one valid point about the e2e tests.

Anyway, when I somehow get the tests to work, I'll finish this extremely simple ticket, which has now taken weeks because of testing.

@Oglopf marked this pull request as ready for review May 6, 2025 17:34
Comment on lines +876 to +889
return nil if mem_str.nil? || mem_str.strip.empty? || !mem_str.match(/[KMGTP]/)

unit = mem_str.match(/[KMGTP]/).to_s
value = mem_str.match(/\d+/).to_s

return nil unless unit && value

factor = {
  "K" => 1024,
  "M" => 1024**2,
  "G" => 1024**3,
  "T" => 1024**4,
  "P" => 1024**5
}
Contributor

I'm still a bit concerned about having to do this translation. I don't think anything other than M is ever given. Have you seen other values out of squeue? The docs indicate it's M.

Contributor Author

I can only say I've seen 'M' and 'G' in outputs, and honestly I had a hard time figuring out what SLURM does here. I haven't seen any 'K', and I don't know if we have any jobs large enough to see 'T' or 'P'. But I have seen both 'M' and 'G' in live responses from squeue on the OSC systems with squeue --Format=MinMemory, though it's usually only a few users.

Contributor

🤦‍♂️ We use the --noconvert flag here, so it's always M; that's why I can't see any G out of, say, the active jobs page.

--noconvert
    Don't convert units from their original type (e.g. 2048M won't be converted to 2G). 

args = ["--all", "--states=all", "--noconvert"]
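
To make that concrete, here is a hedged one-liner showing what parsing looks like once --noconvert guarantees an M suffix; parse_mb is a hypothetical helper for illustration, not part of this PR.

# Assumes --noconvert output such as "2048M"; parse_mb is illustrative only.
def parse_mb(mem_str)
  mem_str.to_i * 1024**2  # leading digits interpreted as MiB, converted to bytes
end

parse_mb("2048M")  # => 2147483648 (2 GiB)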

Comment on lines +931 to +939
def compute_total_memory(v, allocated_nodes)
  return nil unless v[:min_memory].to_s.match?(/\d+/)

  min_memory = parse_memory(v[:min_memory])

  return nil if min_memory.nil?

  min_memory * allocated_nodes.count
end
Contributor

I don't think this is right. Here's a related ticket on Slurm memory and what min_memory is actually reporting. It seems like we need to multiply this by how many cores have been allocated. And if the whole node is allocated and this value is 0, then I guess 0 is the best value we can get at this time?

#256
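
For reference, a minimal sketch of the per-core variant being suggested here; it reuses parse_memory from the diff above, the allocated_cores argument is an illustrative placeholder, and the comments below question whether even this is always right.

# Hedged sketch, not the PR code: multiply the per-core minimum by allocated cores.
def compute_total_memory_per_core(v, allocated_cores)
  min_memory = parse_memory(v[:min_memory])
  return nil if min_memory.nil?

  min_memory * allocated_cores
end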

Contributor

Yeah, I think min_memory * cores is the right calculation. I just submitted exclusive jobs on a cluster and it still reported min memory per core. Multiplied by nodes this is only 8G, when it's in fact ~480G across 2 machines.

[johrstrom dashboard(upgrade-hterm)] 🐹  squeue -j 939211 -o %N,%m,%C
NODELIST,MIN_MEMORY,CPUS
a[0273-0274],4027M,240
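
Working that example through: treating MIN_MEMORY as per CPU gives 4027 MB × 240 CPUs ≈ 966,480 MB (roughly 944 GiB total, about 470 GiB on each of the two nodes), while treating it as per node gives only 4027 MB × 2 ≈ 8 GiB, matching the "only 8G" figure above and clearly too small for an exclusive two-node job.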

Contributor

OK, I'm now seeing that even assuming it's per core is wrong.

I found this job on ascend, with 100G memory and 8 cores. But the actual memory usage isn't 800G; it's just 100G.

[johrstrom dashboard(lmod-ui-3402)] 🐺  squeue --all --noconvert -j 939727 -o %m,%C
MIN_MEMORY,CPUS
102400M,8

It's not clear to me how to find out whether this value is per core or a total.
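
Putting the two quoted jobs side by side shows the ambiguity: for 939727, 102400M × 8 CPUs would be 800 GiB, yet the job actually has only 100 GiB (exactly 102400M), so there MIN_MEMORY appears to already be the whole request, unlike the 939211 case above where only the per-CPU reading matched reality.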
