
Node-labeller amd.com/gpu.vram undercounts addressable memory on dpx_nps1 (divides by compute-partition count, ignores memory-partition mode) #555

Description

@jitesh-gupta

Summary

The AMD GPU Operator's node-labeller emits amd.com/gpu.vram=144G on an MI355X node configured for DPX compute partitioning + NPS1 memory partitioning. The correct value is 288G: in NPS1, memory is a single NUMA domain shared across all compute partitions, so each DPX partition can address the full VRAM of the physical GPU.

The label appears to be computed as physical_vram / compute_partition_count without consulting memory_partition_mode. That is wrong whenever the compute partition count differs from the memory partition count, as with DPX on NPS1.

Environment

GPU: AMD Instinct MI355X OAM (PCI device-id 75a3)
Host OS / driver: Linux, AMDGPU driver 6.14.14
Kubernetes: v1.29
AMD GPU Operator: manually installed (~90 days old)
node-labeller image: bundled with the above operator release

Steps to reproduce

  1. Take two MI355X nodes in the same cluster with identical hardware (same device-id, same chassis SKU), differing only in partition mode:
    • Node A: dpx_nps1 (DPX compute, NPS1 memory)
    • Node B: spx_nps1 (SPX compute, NPS1 memory)
  2. Dump the amd.com/gpu.* labels on each:
    kubectl get node <node> -o jsonpath='{.metadata.labels}' \
      | jq '. | to_entries | map(select(.key | startswith("amd.com/gpu"))) | from_entries'
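
For anyone who wants to collect the same labels from a script, here is a minimal Python sketch of the filter above. The function names `gpu_labels` and `fetch_gpu_labels` are made up for illustration; the only external command assumed is `kubectl get node <node> -o json`.

```python
import json
import subprocess


def gpu_labels(labels: dict) -> dict:
    """Keep only the labels whose key starts with "amd.com/gpu" (same test as the jq filter)."""
    return {k: v for k, v in sorted(labels.items()) if k.startswith("amd.com/gpu")}


def fetch_gpu_labels(node: str) -> dict:
    """Ask kubectl for one node's JSON and return its amd.com/gpu* labels."""
    out = subprocess.run(
        ["kubectl", "get", "node", node, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return gpu_labels(json.loads(out)["metadata"]["labels"])
```

`fetch_gpu_labels` needs a working kubeconfig; `gpu_labels` can be used on its own against a saved label dump.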

Expected behavior

The amd.com/gpu.vram label should reflect per-partition addressable VRAM, which depends on the memory partition mode:

| memory partition | per-partition addressable VRAM |
| --- | --- |
| nps1 | full physical VRAM (memory is not partitioned) |
| nps2 | physical VRAM ÷ 2 |
| nps4 | physical VRAM ÷ 4 |

For dpx_nps1 on MI355X: vram=288G (same as SPX, since NPS1 memory is shared across DPX compute partitions).
For dpx_nps2: vram=144G.
For dpx_nps4: vram=72G.

Actual behavior

On the DPX+NPS1 node, vram=144G — i.e., physical VRAM divided by the compute partition count, ignoring the memory partition mode entirely.
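
As a quick check of that hypothesis, the sketch below models the suspected computation and shows it reproduces the labels seen on both nodes. This is a guess at the behavior, not the node-labeller's actual code. It assumes SPX exposes 1 compute partition per GPU and DPX exposes 2, and takes 288G as the physical VRAM from the SPX/NPS1 label.

```python
# Assumption: compute partitions per physical GPU for the two modes in this report.
COMPUTE_PARTITIONS = {"spx": 1, "dpx": 2}

# Physical VRAM per MI355X in G, taken from the SPX/NPS1 node's label.
PHYSICAL_VRAM_G = 288


def suspected_vram_label(partition_mode: str) -> str:
    """Model of the suspected bug: divide by compute partition count, ignore memory mode."""
    compute_mode, _memory_mode = partition_mode.split("_", 1)
    return f"{PHYSICAL_VRAM_G // COMPUTE_PARTITIONS[compute_mode]}G"


print(suspected_vram_label("spx_nps1"))  # matches the SPX/NPS1 node's label
print(suspected_vram_label("dpx_nps1"))  # matches the DPX/NPS1 node's label
```

The model matching both observations does not prove the implementation works this way, but it is the simplest explanation consistent with the data.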

Side-by-side diff (same hardware, different partition mode):

| Label | SPX/NPS1 node | DPX/NPS1 node | Halving correct? |
| --- | --- | --- | --- |
| amd.com/gpu.compute-memory-partition | spx_nps1 | dpx_nps1 | |
| amd.com/gpu.cu-count | 256 | 128 | Yes: compute IS partitioned by DPX |
| amd.com/gpu.simd-count | 1024 | 512 | Yes: same reason |
| amd.com/gpu.vram | 288G | 144G | No: NPS1 means memory is NOT partitioned |
| amd.com/gpu.device-id | 75a3 | 75a3 | Same hardware |
| amd.com/gpu.driver-version | 6.14.14 | 6.14.14 | Same driver |

Raw outputs:

SPX/NPS1 node:

{
  "amd.com/gpu.compute-memory-partition": "spx_nps1",
  "amd.com/gpu.compute-partitioning-supported": "true",
  "amd.com/gpu.cu-count": "256",
  "amd.com/gpu.device-id": "75a3",
  "amd.com/gpu.driver-version": "6.14.14",
  "amd.com/gpu.family": "AI",
  "amd.com/gpu.memory-partitioning-supported": "true",
  "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
  "amd.com/gpu.simd-count": "1024",
  "amd.com/gpu.vram": "288G"
}

DPX/NPS1 node:

{
  "amd.com/gpu.compute-memory-partition": "dpx_nps1",
  "amd.com/gpu.compute-partitioning-supported": "true",
  "amd.com/gpu.cu-count": "128",
  "amd.com/gpu.device-id": "75a3",
  "amd.com/gpu.driver-version": "6.14.14",
  "amd.com/gpu.family": "AI",
  "amd.com/gpu.memory-partitioning-supported": "true",
  "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
  "amd.com/gpu.simd-count": "512",
  "amd.com/gpu.vram": "144G"
}
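
To confirm that only the partition-dependent labels differ between the two dumps, a small Python sketch can diff them. The label values are copied from the raw outputs above; `diff_labels` is a hypothetical helper name.

```python
SPX_NPS1 = {
    "amd.com/gpu.compute-memory-partition": "spx_nps1",
    "amd.com/gpu.compute-partitioning-supported": "true",
    "amd.com/gpu.cu-count": "256",
    "amd.com/gpu.device-id": "75a3",
    "amd.com/gpu.driver-version": "6.14.14",
    "amd.com/gpu.family": "AI",
    "amd.com/gpu.memory-partitioning-supported": "true",
    "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
    "amd.com/gpu.simd-count": "1024",
    "amd.com/gpu.vram": "288G",
}

# The DPX/NPS1 dump is identical except for these four labels.
DPX_NPS1 = {
    **SPX_NPS1,
    "amd.com/gpu.compute-memory-partition": "dpx_nps1",
    "amd.com/gpu.cu-count": "128",
    "amd.com/gpu.simd-count": "512",
    "amd.com/gpu.vram": "144G",
}


def diff_labels(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(a.keys() | b.keys())
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


for key, (spx, dpx) in diff_labels(SPX_NPS1, DPX_NPS1).items():
    print(f"{key}: {spx} -> {dpx}")
```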

Impact

Anything that consumes amd.com/gpu.vram for capacity planning, scheduler hinting, or workload-fit decisions on DPX+NPS1 nodes will see half of the memory that is actually addressable. CU/SIMD labels are correct, so compute-fit decisions are unaffected.
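
To make the impact concrete, here is a sketch of a hypothetical consumer that gates placement on the label. Both `parse_vram_g` and `fits` are illustrative names, the 200G requirement is an invented example, and the parser only handles the `<integer>G` form seen in this report.

```python
def parse_vram_g(label: str) -> int:
    """Parse a label value such as "144G" into an integer number of G."""
    if not label.endswith("G"):
        raise ValueError(f"unexpected vram label format: {label!r}")
    return int(label[:-1])


def fits(required_g: int, vram_label: str) -> bool:
    """Hypothetical workload-fit check: does the job's memory need fit the label?"""
    return required_g <= parse_vram_g(vram_label)


required_g = 200  # illustrative workload size, not from the report
print(fits(required_g, "144G"))  # label currently emitted on the DPX/NPS1 node
print(fits(required_g, "288G"))  # label expected for DPX/NPS1
```

With the current label, such a job would be rejected from a node whose partitions could in fact address enough memory for it.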

Suggested fix

amd.com/gpu.vram should be derived from the memory partition mode, not the compute partition mode. Pseudocode:

physical_vram_per_gpu  = read from amdgpu sysfs
memory_partitions      = {nps1: 1, nps2: 2, nps4: 4}[memory_partition_mode]
vram_label             = physical_vram_per_gpu / memory_partitions

Compute-partition–related labels (cu-count, simd-count) remain divided by the compute partition count, as they are today.
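
A runnable Python sketch of the pseudocode above, for illustration only. It is not the node-labeller's code: `vram_label` is a made-up name, the physical VRAM is passed in as a parameter instead of being read from sysfs, and the mode string is assumed to have the `<compute>_<memory>` form used by the amd.com/gpu.compute-memory-partition label.

```python
# Number of memory partitions per physical GPU for each memory partition mode.
MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}


def vram_label(physical_vram_g: int, partition_mode: str) -> str:
    """Derive the vram label from the memory partition mode only.

    partition_mode is e.g. "dpx_nps1"; the compute part is deliberately ignored.
    """
    _compute_mode, memory_mode = partition_mode.lower().split("_", 1)
    if memory_mode not in MEMORY_PARTITIONS:
        raise ValueError(f"unknown memory partition mode: {memory_mode!r}")
    return f"{physical_vram_g // MEMORY_PARTITIONS[memory_mode]}G"


for mode in ("spx_nps1", "dpx_nps1", "dpx_nps2", "dpx_nps4"):
    print(mode, vram_label(288, mode))
```

For a 288G MI355X this yields the values listed under "Expected behavior": 288G for dpx_nps1, 144G for dpx_nps2, and 72G for dpx_nps4.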
