@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR depends on PR 370 and PR 371; the go.mod file will be updated and the branch rebased once they are merged.

Update: go.mod file updated.

Description of the issue

Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.

Description of changes

This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. Changes are listed below:

  1. Introduce a new field, accelerated_compute_gpu_metrics_collection_interval, that lets the customer set the metrics collection interval; the default value is 60 seconds.
  2. If the customer sets it to a value less than 60, the following changes take effect:
    2.1 the batch period for the batch processor changes from 5s to 60s;
    2.2 a groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
    2.3 the awscontainerinsights receiver uses the configured value as the GPU sampling frequency (PR 370);
    2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the emf exporter (PR 371).

We also tried providing keys to the groupbyattrs processor so that only GPU metrics are compacted, but it showed hardly any improvement for CPU and memory metrics.
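For illustration, enabling per-second GPU collection would look roughly like the sketch below. The field name accelerated_compute_gpu_metrics_collection_interval comes from this PR; its exact placement in the agent configuration, and the surrounding kubernetes-section fields, are assumptions based on the usual CloudWatch agent Container Insights config shape, not a confirmed snippet from this change:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "enhanced_container_insights": true,
        "accelerated_compute_metrics": true,
        "accelerated_compute_gpu_metrics_collection_interval": 1
      }
    }
  }
}
```

With a value of 1 (below the default of 60), the behaviors in item 2 above would apply: 60s batching, groupbyattrs compaction, per-second sampling in the receiver, and histogram conversion in the emf exporter.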

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Deployed this PR along with PR 370 and PR 371 to a personal EKS cluster.
  2. Spun up an ML job, then checked CloudWatch logs and metrics and confirmed:
    2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call;
    2.2 GPU metrics were in CloudWatch histogram format.
  • logs sample
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}
  • metrics graph (screenshot not reproduced)
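The histogram fields in the log sample above (Values/Counts/Max/Min/Count/Sum) are the EMF structured-log histogram shape: identical datapoints are deduplicated into a value list with parallel counts. This is not the agent's actual Go implementation, just a minimal Python sketch of the compaction, using a made-up temperature series for illustration:

```python
from collections import Counter

def to_emf_histogram(samples):
    """Compact raw datapoints into the EMF histogram shape
    (Values/Counts/Max/Min/Count/Sum) used in the log sample."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)          # value -> occurrence count
    values = list(counts)              # distinct values, first-seen order
    return {
        "Values": values,
        "Counts": [counts[v] for v in values],
        "Max": max(samples),
        "Min": min(samples),
        "Count": len(samples),
        "Sum": sum(samples),
    }

# 60 per-second readings collapse to three distinct values,
# so the log line carries 3 entries instead of 60 datapoints.
temps = [42] * 12 + [43] * 32 + [44] * 16
hist = to_emf_histogram(temps)
```

The payload size now scales with the number of distinct values rather than the sampling frequency, which is what makes per-second collection affordable in a one-minute PutLogEvents batch.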

Requirements

Before committing your code, please do the following steps.

  1. Run make fmt and make fmt-sh. - done
  2. Run make lint. - done

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 4 times, most recently from a40f2b7 to 69ba416, October 16, 2025
@yanhaoluo666 added the "ready for testing" label (runs integration tests) Oct 17, 2025
@yanhaoluo666 force-pushed the branch from 6c0f9d7 to acbbe17, October 20, 2025
@yanhaoluo666 force-pushed the branch from b9ed82e to 7b33d8f, October 31, 2025
@yanhaoluo666 force-pushed the branch 4 times, most recently from 8cb0d6e to d06a96b, November 3, 2025
sky333999 previously approved these changes Nov 3, 2025
@yanhaoluo666 force-pushed the branch 2 times, most recently from 8424297 to 05c81f4, November 3, 2025
sky333999 previously approved these changes Nov 3, 2025
movence previously approved these changes Nov 3, 2025
@yanhaoluo666 dismissed stale reviews from movence and sky333999 via 8c2ccf6, November 4, 2025
@yanhaoluo666 force-pushed the branch from 21cf594 to 8c2ccf6, November 4, 2025
@yanhaoluo666 re-added the "ready for testing" label Nov 4, 2025
@yanhaoluo666 force-pushed the branch from 8c2ccf6 to aeee745, November 4, 2025
@sky333999 merged commit b95d897 into aws:main Nov 4, 2025 (15 of 18 checks passed)