
Add total memory to job info 878 #879


Open · wants to merge 19 commits into master

Conversation

@Oglopf (Contributor) commented Apr 18, 2025

Addresses #878.

  • Adds a top-level method to OodCore::Job::Info so that memory is always passed back in a consistent unit (bytes); see the sketch after this list
  • Attempts to prevent issues with schedulers other than Slurm when any of the fields are missing or nil
  • Handles jobs that were submitted with either --mem or --mem-per-cpu
  • Computes the total memory in use from the per-node or per-cpu value
  • Still need to write tests for slurm_spec.rb to ensure the adapter changes are correct.
  • I had trouble getting these tests to run today; I need to work more with what I have in info_spec.rb to ensure correctness.
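
As a rough illustration of the idea described above, here is a minimal sketch of a bytes-normalizing accessor. It is not the PR's actual implementation, and the attribute names (mem_per_node_bytes, mem_per_cpu_bytes, node_count, cpu_count) are hypothetical placeholders rather than real OodCore::Job::Info fields.

# Hedged sketch only: report total job memory in bytes whether the job was
# submitted with --mem (per node) or --mem-per-cpu. All attribute names here
# are illustrative, not OodCore attributes.
class MemoryInfoSketch
  def initialize(mem_per_node_bytes: nil, mem_per_cpu_bytes: nil, node_count: 1, cpu_count: 1)
    @mem_per_node_bytes = mem_per_node_bytes  # request made via --mem
    @mem_per_cpu_bytes  = mem_per_cpu_bytes   # request made via --mem-per-cpu
    @node_count = node_count
    @cpu_count  = cpu_count
  end

  # Total memory in bytes, or nil if the scheduler reported nothing usable.
  def total_memory
    if @mem_per_node_bytes
      @mem_per_node_bytes * @node_count
    elsif @mem_per_cpu_bytes
      @mem_per_cpu_bytes * @cpu_count
    end
  end
end

# e.g. --mem-per-cpu=4G on 8 CPUs => 34359738368 bytes (32 GiB)
MemoryInfoSketch.new(mem_per_cpu_bytes: 4 * 1024**3, cpu_count: 8).total_memory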

@Oglopf (Contributor, Author) commented May 5, 2025

This is so strange. Locally I have to add OodCore::Job::Adapters::Slurm::Batch::Error to clear the last failure. It's strange that it fails locally yet passes in CI. Strange isn't the right word, though; annoying is better.

And yet, in CI I get different failures. So we have divergence between the local and remote test runs, which is horrible from a code-velocity perspective.

Here's the other issue: rspec is radically over-engineered. The mocking we do here is outdated, with various ways of mocking state and all these intricate objects we have to mock and call into, and it isn't even consistent across tests. It's bad from a developer-ergonomics perspective, really bad. Why do we, the devs, allow this to continue? I see a ticket #844 about the migration where a lot of bad info is being suggested as to why we do this. I for one would love to see the tests ported. We destroy this project by burning devs out with these kinds of bizarre, outdated techniques and silly arguments that are littered with false premises, except maybe one valid point about the e2e tests.

Anyway, when I somehow get the tests to work, I'll finish this extremely simple ticket, which has now taken weeks because of testing.

@Oglopf marked this pull request as ready for review May 6, 2025 17:34
Comment on lines +876 to +889
return nil if mem_str.nil? || mem_str.strip.empty? || !mem_str.match(/[KMGTP]/)

unit = mem_str.match(/[KMGTP]/).to_s
value = mem_str.match(/\d+/).to_s

return nil unless unit && value

factor = {
  "K" => 1024,
  "M" => 1024**2,
  "G" => 1024**3,
  "T" => 1024**4,
  "P" => 1024**5
}
Contributor

I'm still a bit concerned about having to do this translation. I don't think anything other than M is ever given. Have you seen other values out of squeue? The docs indicate it's M.

Contributor Author

I can only say I've seen 'M' and 'G' in outputs, and honestly I had a hard time figuring out what SLURM does here. I haven't seen any 'K', and I don't know if we have any jobs large enough to see 'T' or 'P'. But I have seen both 'M' and 'G' in live responses from squeue on the OSC systems with squeue --Format=MinMemory, though it's usually only a few users.

Contributor

🤦‍♂️ We use the --noconvert flag here, so it's always M; that's why I can't see any G out of, say, the active jobs page.

--noconvert
    Don't convert units from their original type (e.g. 2048M won't be converted to 2G). 

args = ["--all", "--states=all", "--noconvert"]
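
To make that concrete, here is a hedged one-liner showing what parsing looks like once --noconvert guarantees an M suffix; parse_mb is a hypothetical helper for illustration, not part of this PR.

# Assumes --noconvert output such as "2048M"; parse_mb is illustrative only.
def parse_mb(mem_str)
  mem_str.to_i * 1024**2  # leading digits interpreted as MiB, converted to bytes
end

parse_mb("2048M")  # => 2147483648 (2 GiB)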

Comment on lines +931 to +939
def compute_total_memory(v, allocated_nodes)
  return nil unless v[:min_memory].to_s.match?(/\d+/)

  min_memory = parse_memory(v[:min_memory])

  return nil if min_memory.nil?

  min_memory * allocated_nodes.count
end
Contributor

I don't think this is right. Here's a related ticket on Slurm memory and what min_memory is actually reporting. It seems like we need to multiply this by how many cores have been allocated. And if the whole node is allocated and this value is 0, then I guess 0 is the best value we can get at this time?

#256
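
For reference, a minimal sketch of the per-core variant being suggested here; it reuses parse_memory from the diff above, the allocated_cores argument is an illustrative placeholder, and the comments below question whether even this is always right.

# Hedged sketch, not the PR code: multiply the per-core minimum by allocated cores.
def compute_total_memory_per_core(v, allocated_cores)
  min_memory = parse_memory(v[:min_memory])
  return nil if min_memory.nil?

  min_memory * allocated_cores
end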

Contributor

Yeah, I think min_memory * cores is the right calculation. I just submitted exclusive jobs on a cluster and it still reported min memory per core. Multiplied by nodes this is only 8G, when it's in fact ~480G across 2 machines.

[johrstrom dashboard(upgrade-hterm)] 🐹  squeue -j 939211 -o %N,%m,%C
NODELIST,MIN_MEMORY,CPUS
a[0273-0274],4027M,240
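
Working that example through: treating MIN_MEMORY as per CPU gives 4027 MB × 240 CPUs ≈ 966,480 MB (roughly 944 GiB total, about 470 GiB on each of the two nodes), while treating it as per node gives only 4027 MB × 2 ≈ 8 GiB, matching the "only 8G" figure above and clearly too small for an exclusive two-node job.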

Contributor

OK, I'm now seeing that even assuming it's per core is wrong.

I found this job on ascend, with 100G memory and 8 cores. But the actual memory usage isn't 800G; it's just 100G.

[johrstrom dashboard(lmod-ui-3402)] 🐺  squeue --all --noconvert -j 939727 -o %m,%C
MIN_MEMORY,CPUS
102400M,8

It's not clear to me how to find out whether this value is per core or a total.
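
Putting the two quoted jobs side by side shows the ambiguity: for 939727, 102400M × 8 CPUs would be 800 GiB, yet the job actually has only 100 GiB (exactly 102400M), so there MIN_MEMORY appears to already be the whole request, unlike the 939211 case above where only the per-CPU reading matched reality.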
