Device memory usage #1036
If uv is installed, it will be used.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Pull request overview
This PR adds utilities for reading and logging Neuron device memory usage, refactors the attention head sharding logic to be more straightforward, and cleans up the Makefile to better utilize uv for package management.
Key changes:
- Introduces `neuron_device_memory.py` module to read device memory statistics from Neuron sysfs
- Simplifies GQA sharding strategy determination by eliminating the `desired_sharding_strategy` parameter and consolidating logic into a single `get_shardable_head_counts()` function
- Removes unused parameters (`scatter_index`, `kvcache_buffer`, `inputs_embeds`) from the decoder and KV cache manager interfaces
- Adds logging for KV cache allocation and head count adjustments
- Updates Makefile with new `install` target and better `uv` integration
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| optimum/neuron/version.py | Bumps development version from 0.4.3.dev0 to 0.4.3.dev1 |
| optimum/neuron/utils/neuron_device_memory.py | Adds new utility module to read and format Neuron device memory usage from sysfs |
| optimum/neuron/models/inference/backend/pretrained_model.py | Adds device memory logging after weight initialization |
| optimum/neuron/models/inference/backend/modules/kvcache/kv_cache_manager.py | Refactors to accept actual KV head count directly, removes unused parameters, adds cache size logging |
| optimum/neuron/models/inference/backend/modules/decoder/modeling_decoder.py | Integrates refactored head count logic, removes unused parameters and methods |
| optimum/neuron/models/inference/backend/modules/attention/gqa.py | Consolidates sharding strategy logic into get_shardable_head_counts(), removes desired_sharding_strategy parameter, adds logging for head adjustments |
| Makefile | Removes unused targets, adds install target with uv support, simplifies build command |
    # Installation
    install: $(PACKAGE_DIST)
        # Force CPU version of torch to speed up installation time
        $(UV) pip install --upgrade $(PACKAGE_DIST)[neuronx] \
Copilot
AI
Nov 28, 2025
The install target unconditionally uses $(UV) without checking if it exists. If uv is not installed, this will fail. Consider adding a check similar to line 41, or use $(if $(UV), $(UV) pip, python -m pip) to fall back to standard pip when uv is not available.
Suggested change:
    - $(UV) pip install --upgrade $(PACKAGE_DIST)[neuronx] \
    + $(if $(UV),$(UV) pip,python -m pip) install --upgrade $(PACKAGE_DIST)[neuronx] \

    try:
        with open(file_path, "r") as f:
            return f.read().strip()
    except (FileNotFoundError, PermissionError) as e:
        raise e
Copilot
AI
Nov 28, 2025
The exception handling in lines 56-57 catches exceptions only to re-raise them immediately. This serves no purpose and should be simplified to just let the exceptions propagate naturally by removing the try-except block entirely.
Suggested change:
    - try:
    -     with open(file_path, "r") as f:
    -         return f.read().strip()
    - except (FileNotFoundError, PermissionError) as e:
    -     raise e
    + with open(file_path, "r") as f:
    +     return f.read().strip()

    for category in DEVICE_MEM_CATEGORIES:
        try:
            categories[category] = read_memory_counter(device_mem_path, category)
        except Exception:
Copilot
AI
Nov 28, 2025
Using a bare except Exception: is too broad and will silently skip categories that fail for any reason, making debugging difficult. Consider either logging the exception or being more specific about which exceptions to catch (e.g., FileNotFoundError, PermissionError, ValueError).
Suggested change:
    - except Exception:
    + except (FileNotFoundError, PermissionError, ValueError):
I agree with the comments, plus I would add a constant EXPECTED_EXCEPTIONS = (FileNotFoundError, PermissionError, ValueError) to avoid repeating over and over
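The shared-constant idea from the comment above could look roughly like the sketch below. Only `EXPECTED_EXCEPTIONS` is taken from the comment; `read_counter`, `read_categories`, and the file names are hypothetical stand-ins, not the PR's actual functions or the real Neuron sysfs layout.

```python
import os
import tempfile

# Tuple named in the review comment; reused wherever a read may fail.
EXPECTED_EXCEPTIONS = (FileNotFoundError, PermissionError, ValueError)


def read_counter(directory: str, name: str) -> int:
    """Hypothetical helper: read one integer counter from a file."""
    with open(os.path.join(directory, name), "r") as f:
        return int(f.read().strip())  # int() raises ValueError on bad content


def read_categories(directory: str, names: list[str]) -> dict[str, int]:
    """Collect the counters that can be read, skipping expected failures."""
    values = {}
    for name in names:
        try:
            values[name] = read_counter(directory, name)
        except EXPECTED_EXCEPTIONS:
            continue  # unreadable or malformed counter: skip it
    return values


# Exercise the helpers against a temporary directory (illustrative data only).
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "good"), "w") as f:
        f.write("1024\n")
    with open(os.path.join(tmp, "bad"), "w") as f:
        f.write("not-a-number\n")
    result = read_categories(tmp, ["good", "bad", "missing"])
```

Keeping the tuple in one place means a later change to the set of tolerated errors only has to be made once, while unexpected exceptions still propagate.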
    weights = sharder.shard_checkpoint()
    start_rank_tensor = torch.tensor([start_rank_id], dtype=torch.int32, device="cpu")
    self._traced_model.nxd_model.initialize(weights, start_rank_tensor)
    logger.info(str(get_neuron_device_memory()))
Copilot
AI
Nov 28, 2025
The call to get_neuron_device_memory() will raise a RuntimeError if the Neuron sysfs is not found. This could cause weight loading to fail even when weights are loaded successfully. Consider wrapping this in a try-except block to log the memory usage only when available, or handle the case when Neuron driver is not present gracefully.
Suggested change:
    - logger.info(str(get_neuron_device_memory()))
    + try:
    +     logger.info(str(get_neuron_device_memory()))
    + except RuntimeError as e:
    +     logger.info(f"Neuron device memory information unavailable: {e}")
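The guarded-logging pattern in this suggestion can be tried in isolation with the sketch below. `get_device_memory_stub` is a hypothetical stand-in for `get_neuron_device_memory()`, which is assumed (per the comment above) to raise `RuntimeError` when the Neuron sysfs is missing. `log_device_memory` is likewise an illustrative name.

```python
import logging

logger = logging.getLogger(__name__)


def get_device_memory_stub(available: bool) -> str:
    """Hypothetical stand-in for get_neuron_device_memory()."""
    if not available:
        raise RuntimeError("Neuron sysfs not found")
    return "device memory: 0 bytes used"


def log_device_memory(available: bool) -> bool:
    """Log memory usage if it can be read; return whether it was logged."""
    try:
        logger.info(get_device_memory_stub(available))
        return True
    except RuntimeError as e:
        # Reporting is best-effort: never let it break the caller.
        logger.info("Neuron device memory information unavailable: %s", e)
        return False


logged_when_present = log_device_memory(True)
logged_when_missing = log_device_memory(False)
```

With this shape, a failure to read memory statistics is reduced to a log line, so weight loading that has already succeeded is not reported as a failure.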

    :param position_ids: tensor of size (batch_sz, seq_len)
    :param new_key_values: list of tuple, the latest kv obtained at the end of the network from forward pass
    :param seq_len: sequence length
    :param scatter_index: tensor representing index to update
    :param active_mask: tensor representing index to update
    :param kvcache_buffer: if passed key states are updates to this buffer.
        kvcache_buffer is 2D list where, 1st dim for layer and the second denotes K and V.
        For example,
            kvcache_buffer[1][0] is the K cache of the 1st layer
            kvcache_buffer[4][1] is the V cache of the 4th layer
    :return: list of tuple of (K, V)
Copilot
AI
Nov 28, 2025
The docstring references a removed parameter scatter_index at line 117. This outdated documentation should be removed as scatter_index is no longer a parameter of the update_cache method.
tengomucho
left a comment
Thanks a lot! Few comments, otherwise seems good!
    SHELL := /bin/bash
    CURRENT_DIR = $(shell pwd)
    UV = $(shell command -v uv)
    UV_ARGS = $(if $(UV),--index-strategy unsafe-best-match)
We already assume a given system is available (the AMI for example). If that is the case, I think we could install uv there and just use it, without adding additional checks
uv is not installed by default on the AWS DLAMI, which I personally often use when bumping versions.
Upcoming changes related to the sharding strategy should not modify the graphs, but bumping the version makes it easier to track regressions.
Also add a log about the replication of heads when sharding.
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
a90fe5d to 203b5b4
203b5b4 to 0b97e83
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.
    except Exception:
        # Skip categories that can't be read
Copilot
AI
Dec 1, 2025
The bare except Exception: on line 133 silently catches all exceptions. Consider logging a warning when a category cannot be read, or at least being more specific about which exceptions to catch (e.g., OSError, ValueError) to avoid masking unexpected errors.
Suggested change:
    - except Exception:
    -     # Skip categories that can't be read
    + except (OSError, ValueError) as e:
    +     logger.warning(f"Could not read memory category '{category}' in {device_mem_path}: {e}")
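This later suggestion uses `OSError` where the earlier one listed `FileNotFoundError` and `PermissionError`. In Python's built-in exception hierarchy both are subclasses of `OSError`, so the shorter tuple covers them. `ValueError` is unrelated and still has to be listed. The short sketch below checks this. `try_read_int` is a hypothetical helper used only for illustration.

```python
import os
import tempfile

# Both file-related exceptions derive from OSError; ValueError does not.
oserror_subclasses = all(
    issubclass(exc, OSError) for exc in (FileNotFoundError, PermissionError)
)
valueerror_is_separate = not issubclass(ValueError, OSError)


def try_read_int(path):
    """Hypothetical helper: return the integer stored in path, or None."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


with tempfile.TemporaryDirectory() as tmp:
    # The file does not exist, so open() raises FileNotFoundError,
    # which is caught through its OSError base class.
    missing = try_read_int(os.path.join(tmp, "missing"))
```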
tengomucho
left a comment
some nits, otherwise LGTM!
    @@ -0,0 +1,12 @@
    import pytest


    from optimum.neuron.utils.neuron_device_memory import get_neuron_device_memory
not really a decoder test, but I understand it had to be placed somewhere
    assert neuron_device_memory is not None
    assert len(neuron_device_memory.devices) == get_neuron_devices_count()
    assert neuron_device_memory.get_total_memory() > 0
    assert neuron_device_memory.get_total_memory() >= 0
when is it going to be 0?
On the CI, it looks like it is zero when no other tests have been run
Co-authored-by: Copilot <[email protected]>
What does this PR do?
This pull request adds utilities to read and log Neuron device memory usage.
It also refactors and simplifies the sharding logic for attention heads and key/value heads across the codebase, and improves logging for cache allocation and head adjustment.
Finally, it cleans up unused or redundant code in the Makefile, adding an install target and using `uv` whenever possible.