Device memory usage #1036

dacorvo · 2025-11-28T15:14:01Z

What does this PR do?

This pull request adds utilities to read and log Neuron device memory usage.
It also refactors and simplifies the sharding logic for attention heads and key/value heads across the codebase, and improves logging for cache allocation and head adjustment.
Finally, it cleans up unused or redundant code in the Makefile, adds an install target, and uses uv whenever possible.

If uv is installed, it will be used.

HuggingFaceDocBuilderDev · 2025-11-28T15:19:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copilot

Pull request overview

This PR adds utilities for reading and logging Neuron device memory usage, refactors the attention head sharding logic to be more straightforward, and cleans up the Makefile to better utilize uv for package management.

Key changes:

Introduces neuron_device_memory.py module to read device memory statistics from Neuron sysfs
Simplifies GQA sharding strategy determination by eliminating the desired_sharding_strategy parameter and consolidating logic into a single get_shardable_head_counts() function
Removes unused parameters (scatter_index, kvcache_buffer, inputs_embeds) from the decoder and KV cache manager interfaces
Adds logging for KV cache allocation and head count adjustments
Updates Makefile with new install target and better uv integration
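
To illustrate the kind of decision a helper such as get_shardable_head_counts() has to make, here is a minimal sketch. It is not the code from this PR: the function name sketch_shardable_head_counts, its parameters, and the padding and replication rules are all assumptions chosen for illustration, starting from the general constraint that every tensor-parallel rank needs a whole number of query heads and KV heads.

```python
import logging
from typing import Tuple

logger = logging.getLogger(__name__)


def sketch_shardable_head_counts(
    tp_degree: int, num_attention_heads: int, num_kv_heads: int
) -> Tuple[int, int]:
    """Hypothetical helper: return (query_heads, kv_heads) divisible across tp_degree ranks."""
    if tp_degree < 1 or num_attention_heads < 1 or num_kv_heads < 1:
        raise ValueError("all arguments must be positive integers")

    # Pad the query heads up to the next multiple of the tensor-parallel degree.
    padded_heads = ((num_attention_heads + tp_degree - 1) // tp_degree) * tp_degree
    heads_unchanged = padded_heads == num_attention_heads

    if heads_unchanged and num_kv_heads % tp_degree == 0:
        # KV heads already split evenly: nothing to adjust.
        kv_heads = num_kv_heads
    elif heads_unchanged and tp_degree % num_kv_heads == 0:
        # Fewer KV heads than ranks: replicate them so each rank owns one copy.
        kv_heads = tp_degree
    else:
        # Assumed fallback: one KV head per (padded) query head.
        kv_heads = padded_heads

    if (padded_heads, kv_heads) != (num_attention_heads, num_kv_heads):
        logger.info(
            "Adjusted heads for tp=%d: attention %d -> %d, kv %d -> %d",
            tp_degree, num_attention_heads, padded_heads, num_kv_heads, kv_heads,
        )
    return padded_heads, kv_heads
```

Under these assumed rules, a model with 32 attention heads and 4 KV heads sharded over 8 ranks would end up with 8 KV heads, each original head being replicated twice.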

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

File	Description
optimum/neuron/version.py	Bumps development version from 0.4.3.dev0 to 0.4.3.dev1
optimum/neuron/utils/neuron_device_memory.py	Adds new utility module to read and format Neuron device memory usage from sysfs
optimum/neuron/models/inference/backend/pretrained_model.py	Adds device memory logging after weight initialization
optimum/neuron/models/inference/backend/modules/kvcache/kv_cache_manager.py	Refactors to accept actual KV head count directly, removes unused parameters, adds cache size logging
optimum/neuron/models/inference/backend/modules/decoder/modeling_decoder.py	Integrates refactored head count logic, removes unused parameters and methods
optimum/neuron/models/inference/backend/modules/attention/gqa.py	Consolidates sharding strategy logic into `get_shardable_head_counts()`, removes `desired_sharding_strategy` parameter, adds logging for head adjustments
Makefile	Removes unused targets, adds `install` target with `uv` support, simplifies build command
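
For context on the cache size logging mentioned for kv_cache_manager.py, the size of a KV cache follows from standard arithmetic: the K and V tensors each hold one vector of head_dim elements per layer, per sequence in the batch, per KV head, and per position. The sketch below is illustrative only; the function name kv_cache_bytes and its parameters are hypothetical and are not taken from the PR.

```python
def kv_cache_bytes(
    num_layers: int,
    batch_size: int,
    num_kv_heads: int,
    max_seq_len: int,
    head_dim: int,
    bytes_per_element: int = 2,  # 2 bytes for bf16/fp16
) -> int:
    """Total bytes needed for the K and V caches of a decoder."""
    # Size of a single K (or V) tensor for one layer.
    per_tensor = batch_size * num_kv_heads * max_seq_len * head_dim * bytes_per_element
    # Two tensors (K and V) per layer.
    return 2 * num_layers * per_tensor


# 32 layers, batch 1, 8 KV heads, 4096 positions, head_dim 128, bf16
print(kv_cache_bytes(32, 1, 8, 4096, 128) / 2**20)  # 512.0 (MiB)
```

When the KV heads are sharded, passing the number of KV heads held by one rank gives the per-device figure instead of the global one.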


Copilot · 2025-11-28T15:20:32Z

Makefile

+# Installation
+install: $(PACKAGE_DIST)
+	# Force CPU version of torch to speed up installation time
+	$(UV) pip install --upgrade $(PACKAGE_DIST)[neuronx] \


The install target unconditionally uses $(UV) without checking if it exists. If uv is not installed, this will fail. Consider adding a check similar to line 41, or use $(if $(UV), $(UV) pip, python -m pip) to fall back to standard pip when uv is not available.

Suggested change

-	$(UV) pip install --upgrade $(PACKAGE_DIST)[neuronx] \
+	$(if $(UV),$(UV) pip,python -m pip) install --upgrade $(PACKAGE_DIST)[neuronx] \
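
The same fallback can be expressed outside of Make. As a rough Python equivalent of the suggestion above (the helper name pip_command is hypothetical and is not part of the PR), shutil.which plays the role of `command -v uv`:

```python
import shutil
import sys
from typing import List, Optional


def pip_command(uv_path: Optional[str]) -> List[str]:
    """Return the base installer command: `uv pip` if uv was found, else `python -m pip`."""
    if uv_path:
        return [uv_path, "pip"]
    return [sys.executable, "-m", "pip"]


if __name__ == "__main__":
    # shutil.which returns None when uv is not on PATH.
    base = pip_command(shutil.which("uv"))
    print(" ".join(base + ["install", "--upgrade"]))
```

Passing the lookup result in as an argument keeps the decision testable without depending on what is installed on the machine.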

Copilot · 2025-11-28T15:20:33Z

optimum/neuron/utils/neuron_device_memory.py

+    try:
+        with open(file_path, "r") as f:
+            return f.read().strip()
+    except (FileNotFoundError, PermissionError) as e:
+        raise e


The exception handling in lines 56-57 catches exceptions only to re-raise them immediately. This serves no purpose and should be simplified to just let the exceptions propagate naturally by removing the try-except block entirely.

Suggested change

-    try:
-        with open(file_path, "r") as f:
-            return f.read().strip()
-    except (FileNotFoundError, PermissionError) as e:
-        raise e
+    with open(file_path, "r") as f:
+        return f.read().strip()

Copilot · 2025-11-28T15:20:33Z

optimum/neuron/utils/neuron_device_memory.py

+    for category in DEVICE_MEM_CATEGORIES:
+        try:
+            categories[category] = read_memory_counter(device_mem_path, category)
+        except Exception:


Catching Exception here is too broad: it silently skips categories that fail for any reason, which makes debugging difficult. Consider either logging the exception or catching only the exceptions you expect (e.g., FileNotFoundError, PermissionError, ValueError).

Suggested change

-        except Exception:
+        except (FileNotFoundError, PermissionError, ValueError):

I agree with the comments. I would also add a constant EXPECTED_EXCEPTIONS = (FileNotFoundError, PermissionError, ValueError) to avoid repeating the tuple in every except clause.
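
A minimal sketch of how such a constant could be combined with per-category logging is shown here. The helpers read_counter and read_categories, the one-file-per-counter layout, and the integer file contents are assumptions made for illustration; they do not reproduce the actual module or the real sysfs layout.

```python
import logging
import os
from typing import Dict, Iterable

logger = logging.getLogger(__name__)

# Exceptions that are expected when a counter file is missing, unreadable or malformed.
EXPECTED_EXCEPTIONS = (FileNotFoundError, PermissionError, ValueError)


def read_counter(directory: str, name: str) -> int:
    """Read one integer counter from a file (hypothetical layout: one file per counter)."""
    with open(os.path.join(directory, name), "r") as f:
        return int(f.read().strip())


def read_categories(directory: str, names: Iterable[str]) -> Dict[str, int]:
    """Read every readable counter, logging and skipping the ones that fail."""
    categories: Dict[str, int] = {}
    for name in names:
        try:
            categories[name] = read_counter(directory, name)
        except EXPECTED_EXCEPTIONS as e:
            logger.warning("Could not read memory category %r in %s: %s", name, directory, e)
    return categories
```

Any exception outside the tuple still propagates, so an unexpected failure is not hidden behind a skipped category.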

Copilot · 2025-11-28T15:20:34Z

optimum/neuron/models/inference/backend/pretrained_model.py

            weights = sharder.shard_checkpoint()
        start_rank_tensor = torch.tensor([start_rank_id], dtype=torch.int32, device="cpu")
        self._traced_model.nxd_model.initialize(weights, start_rank_tensor)
+        logger.info(str(get_neuron_device_memory()))


The call to get_neuron_device_memory() will raise a RuntimeError if the Neuron sysfs is not found. This could cause weight loading to fail even when weights are loaded successfully. Consider wrapping this in a try-except block to log the memory usage only when available, or handle the case when Neuron driver is not present gracefully.

Suggested change

-        logger.info(str(get_neuron_device_memory()))
+        try:
+            logger.info(str(get_neuron_device_memory()))
+        except RuntimeError as e:
+            logger.info(f"Neuron device memory information unavailable: {e}")
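
The guard suggested above can also be factored into a small helper so that a diagnostic log never interferes with weight loading. This is only a sketch: log_device_memory is a hypothetical name, and the reader is passed in as a callable so the example does not depend on the Neuron driver being present.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def log_device_memory(read_memory: Callable[[], object]) -> bool:
    """Log the value returned by read_memory; return False if it is unavailable."""
    try:
        usage = read_memory()
    except RuntimeError as e:
        # Assumed failure mode from the review: RuntimeError when sysfs is missing.
        logger.info("Neuron device memory information unavailable: %s", e)
        return False
    logger.info("%s", usage)
    return True
```

A caller would pass the real reader, for example log_device_memory(get_neuron_device_memory), and could ignore the return value.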

Copilot · 2025-11-28T15:20:34Z

optimum/neuron/models/inference/backend/modules/kvcache/kv_cache_manager.py

        :param position_ids: tensor of size (batch_sz, seq_len)
        :param new_key_values: list of tuple, the latest kv obtained at the end of the network from forward pass
        :param seq_len: sequence length
-        :param scatter_index: tensor representing index to update
        :param active_mask: tensor representing index to update
-        :param kvcache_buffer: if passed key states are updates to this buffer.
-               kvcache_buffer is 2D list where, 1st dim for layer and the second denotes K and V.
-               For example,
-                    kvcache_buffer[1][0] is the K cache of the 1st layer
-                    kvcache_buffer[4][1] is the V cache of the 4th layer
        :return: list of tuple of (K, V)


The docstring references a removed parameter scatter_index at line 117. This outdated documentation should be removed as scatter_index is no longer a parameter of the update_cache method.

tengomucho

Thanks a lot! A few comments, otherwise it looks good!

tengomucho · 2025-11-28T16:29:48Z

Makefile

 SHELL := /bin/bash
 CURRENT_DIR = $(shell pwd)
+UV = $(shell command -v uv)
+UV_ARGS = $(if $(UV),--index-strategy unsafe-best-match)


We already assume a given system is available (the AMI, for example). If that is the case, I think we could install uv there and just use it, without adding extra checks.

uv is not installed by default on the AWS DLAMI, which I personally often use when bumping versions.

optimum/neuron/version.py

tengomucho · 2025-11-28T16:39:16Z

optimum/neuron/utils/neuron_device_memory.py

+    for category in DEVICE_MEM_CATEGORIES:
+        try:
+            categories[category] = read_memory_counter(device_mem_path, category)
+        except Exception:


I agree with the comments. I would also add a constant EXPECTED_EXCEPTIONS = (FileNotFoundError, PermissionError, ValueError) to avoid repeating the tuple in every except clause.

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Copilot · 2025-12-01T10:44:30Z

optimum/neuron/utils/neuron_device_memory.py

+        except Exception:
+            # Skip categories that can't be read


The broad except Exception: on line 133 silently catches all exceptions. Consider logging a warning when a category cannot be read, or at least catching only specific exceptions (e.g., OSError, ValueError) to avoid masking unexpected errors.

Suggested change

-        except Exception:
-            # Skip categories that can't be read
+        except (OSError, ValueError) as e:
+            logger.warning(f"Could not read memory category '{category}' in {device_mem_path}: {e}")

tengomucho

some nits, otherwise LGTM!

tengomucho · 2025-12-01T10:45:49Z

tests/decoder/test_device_memory.py

@@ -0,0 +1,12 @@
+import pytest
+
+from optimum.neuron.utils.neuron_device_memory import get_neuron_device_memory


not really a decoder test, but I understand it had to be placed somewhere

tengomucho · 2025-12-01T10:46:31Z

tests/decoder/test_device_memory.py

    assert neuron_device_memory is not None
    assert len(neuron_device_memory.devices) == get_neuron_devices_count()
-    assert neuron_device_memory.get_total_memory() > 0
+    assert neuron_device_memory.get_total_memory() >= 0


when is it going to be 0?

On the CI, it looks like it is zero when no other tests have been run
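
To make the zero case concrete, here is a small sketch of how per-device counters might be aggregated. Only the devices attribute and the get_total_memory() method appear in the test above; the class names DeviceMemory and NeuronMemory, the categories field, and the category names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DeviceMemory:
    # Bytes used per category on one device (category names are illustrative).
    categories: Dict[str, int] = field(default_factory=dict)

    def total(self) -> int:
        return sum(self.categories.values())


@dataclass
class NeuronMemory:
    devices: List[DeviceMemory] = field(default_factory=list)

    def get_total_memory(self) -> int:
        # Sum over all devices; an idle system legitimately reports 0.
        return sum(device.total() for device in self.devices)
```

With this shape, a machine on which nothing has been loaded yet reports a total of 0 while still listing every device, which is consistent with relaxing the assertion from > 0 to >= 0.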

dacorvo added 3 commits November 27, 2025 11:11

refactor(Makefile): remove obsolete variables

a05e78b

fix(Makefile): remove obsolete targets

2c26087

feat(Makefile): add install target

7f90003

If uv is installed, it will be used.

dacorvo requested review from JingyaHuang, Copilot, michaelbenayoun and tengomucho November 28, 2025 15:16

Copilot started reviewing on behalf of dacorvo November 28, 2025 15:16

Copilot finished reviewing on behalf of dacorvo November 28, 2025 15:19

Copilot AI reviewed Nov 28, 2025

tengomucho reviewed Nov 28, 2025

dacorvo and others added 14 commits December 1, 2025 07:40

chore: bump dev version

b6ef0de

Upcoming changes related to the sharding strategy should not modify the graphs, but bumping the version makes it easier to track regressions.

refactor(llm): remove unused parameters and methods

b7cc86b

feat(decoder): add log about the cache size

463270b

refactor(decoder): removed unused desired sharding strategy

23e064b

refactor(decoder): merge GQA sharding helpers

a3a2010

refactor(decoder): cleanup KV Cache initialization

7d58cb3

Also add a log about the replication of heads when sharding.

feat(Makefile): use uv for build if possible

8f5bec5

feat: add utility to read and aggregate neuron device memory

70d3e43

feat(llm): log device memory after loading weights

59a9333

review: more intuitive boundary behaviour with memory suffixes

8fed8b4

Co-authored-by: Copilot <[email protected]>

review: use a list to disambiguate suffixes processing order

9dd8531

Co-authored-by: Copilot <[email protected]>

review: remove scatter_index docstring

6b3b3f3

review: remove useless try/except

efec3fa

review: simplify code and add license header

ebc138a

dacorvo force-pushed the device_memory_logs branch 2 times, most recently from a90fe5d to 203b5b4 Compare December 1, 2025 09:25

dacorvo added 2 commits December 1, 2025 10:01

refactor(system): add get_neuron_device_count helper

7a4a98d

test: add device memory test

9d598f0

ci: add device memory test

0b97e83

dacorvo force-pushed the device_memory_logs branch from 203b5b4 to 0b97e83 Compare December 1, 2025 10:01

dacorvo requested review from Copilot and tengomucho December 1, 2025 10:40

Copilot started reviewing on behalf of dacorvo December 1, 2025 10:40

Copilot finished reviewing on behalf of dacorvo December 1, 2025 10:43

Copilot AI reviewed Dec 1, 2025

tengomucho approved these changes Dec 1, 2025

review: update copyright

b020708

Co-authored-by: Copilot <[email protected]>

dacorvo merged commit 433407f into main Dec 1, 2025
10 of 11 checks passed

dacorvo deleted the device_memory_logs branch December 1, 2025 12:35

4 participants