Enable opt-in for high frequency GPU metrics #1893
Merged: sky333999 merged 1 commit into aws:main from yanhaoluo666:feature/high-frequency-gpu-metrics on Nov 4, 2025.
Conversation
Review timeline (the branch was force-pushed several times during review):
- Oct 14, 2025: yanhaoluo666 commented.
- Oct 16, 2025: spanaik reviewed (with comments on translator/translate/otel/receiver/awscontainerinsight/utils.go) and approved these changes.
- Oct 22, 2025: movence reviewed (translator/translate/otel/pipeline/containerinsights/translator.go).
- Oct 30, 2025: sky333999 reviewed (translator/translate/otel/pipeline/containerinsights/translator.go, translator/translate/otel/processor/batchprocessor/translator.go, translator/tocwconfig/sampleConfig/emf_and_kubernetes_with_gpu_config.yaml).
- Nov 3, 2025: sky333999 and movence previously approved these changes (before later force-pushes).
- Nov 4, 2025: sky333999 and movence approved these changes.
Note
This PR depends on PR 370 and PR 371; the go.mod file will be updated and the branch rebased once those are merged. Update: the go.mod file has been updated.
Description of the issue
Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.
Description of changes
This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. The changes are:
1. A new field, accelerated_compute_gpu_metrics_collection_interval, lets customers set the metrics collection interval; the default value is 60.
2. When high-frequency collection is configured:
   2.1 the batch processor's batch period changes from 5s to 60s;
   2.2 a groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
   2.3 the GPU sampling frequency uses the configured value in the awscontainerinsights receiver (PR 370);
   2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the EMF exporter (PR 371).
We also tried providing keys to the groupbyattrs processor so that it only compacts GPU metrics, but this showed hardly any improvement in CPU and memory usage.
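As a rough sketch of how the new field might be set, here is a minimal agent JSON configuration fragment. The field name comes from this PR; its placement under logs > metrics_collected > kubernetes and the neighboring keys are assumptions, not confirmed by this PR:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "accelerated_compute_metrics": true,
        "accelerated_compute_gpu_metrics_collection_interval": 1
      }
    }
  }
}
```

Setting the interval to 1 would correspond to the per-second sampling exercised in the tests below, while omitting the field keeps the 60-second default.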
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
- GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call.
- GPU metrics were in CloudWatch histogram format.
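The authoritative histogram conversion lives in the EMF exporter change (PR 371); purely as an illustrative sketch (the names and logic here are hypothetical, not the agent's code), collapsing a 60-sample window into histogram-style summary statistics looks roughly like this:

```go
package main

import "fmt"

// distribution mirrors the shape of a CloudWatch histogram-style metric value:
// summary statistics are emitted instead of every raw sample.
type distribution struct {
	Min, Max, Sum float64
	Count         int
}

// summarize collapses a window of per-second GPU samples (assumed non-empty)
// into a single distribution.
func summarize(samples []float64) distribution {
	d := distribution{Min: samples[0], Max: samples[0]}
	for _, s := range samples {
		if s < d.Min {
			d.Min = s
		}
		if s > d.Max {
			d.Max = s
		}
		d.Sum += s
		d.Count++
	}
	return d
}

func main() {
	// 60 one-second samples collected over one 60s batch window.
	samples := make([]float64, 60)
	for i := range samples {
		samples[i] = float64(i % 10) // hypothetical utilization values
	}
	d := summarize(samples)
	fmt.Printf("min=%.0f max=%.0f sum=%.0f count=%d\n", d.Min, d.Max, d.Sum, d.Count)
	// prints "min=0 max=9 sum=270 count=60"
}
```

Emitting one distribution per window instead of 60 individual datapoints keeps the PutLogEvents payload bounded while still preserving min, max, and average for the period.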
Requirements
Before committing your code, please do the following steps:
- make fmt and make fmt-sh - done
- make lint - done

Integration Tests
To run integration tests against this PR, add the "ready for testing" label.