@yanhaoluo666 yanhaoluo666 commented Oct 13, 2025

Note

This PR depends on PR 370 and PR 371; the go.mod file will be updated and the branch rebased once they are merged.

Update: go.mod file updated.

Description of the issue

Currently, GPU metrics are collected at a one-minute interval, which works well for most machine learning (ML) training jobs. However, for ML inference, where execution times can be as short as 2-3 seconds, this interval is insufficient.

Description of changes

This PR lets customers customize the GPU metrics collection interval by introducing a new configuration field. Changes are listed below:

  1. Introduce a new field, accelerated_compute_gpu_metrics_collection_interval, that lets the customer set the metrics collection interval; the default value is 60 seconds.
  2. If the customer sets it to a value less than 60, the following changes take effect:
    2.1 the batch period for the batch processor changes from 5s to 60s;
    2.2 a groupbyattrs processor is added to the awscontainerinsights pipeline to compact metrics from the same resource;
    2.3 the awscontainerinsights receiver uses the configured value as the GPU sampling frequency (PR 370);
    2.4 all GPU metrics are compressed and converted to the CloudWatch histogram type in the emf exporter (PR 371).

We also tried providing keys to the groupbyattrs processor so that only GPU metrics are compacted, but it showed hardly any improvement for CPU and memory metrics.
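For illustration, enabling per-second GPU collection would look roughly like the sketch below. The field name accelerated_compute_gpu_metrics_collection_interval comes from this PR; its exact placement in the agent configuration, and the surrounding kubernetes-section fields, are assumptions based on the usual CloudWatch agent Container Insights config shape, not a confirmed snippet from this change:

```json
{
  "logs": {
    "metrics_collected": {
      "kubernetes": {
        "enhanced_container_insights": true,
        "accelerated_compute_metrics": true,
        "accelerated_compute_gpu_metrics_collection_interval": 1
      }
    }
  }
}
```

With a value of 1 (below the default of 60), the behaviors in item 2 above would apply: 60s batching, groupbyattrs compaction, per-second sampling in the receiver, and histogram conversion in the emf exporter.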

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Deployed this PR along with PR 370 and PR 371 to a personal EKS cluster.
  2. Spun up an ML job, then checked CloudWatch logs and metrics and confirmed:
    2.1 GPU metrics were sampled every second, i.e. there were 60 datapoints in each PutLogEvents call;
    2.2 GPU metrics were in CloudWatch histogram format.
  • logs sample
{
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "Namespace",
                    "PodName"
                ],
                [
                    "ClusterName",
                    "ContainerName",
                    "FullPodName",
                    "GpuDevice",
                    "Namespace",
                    "PodName"
                ]
            ],
            "Metrics": [
                {
                    "Name": "container_gpu_temperature",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_power_draw",
                    "Unit": "None",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_utilization",
                    "Unit": "Percent",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_used",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "container_gpu_memory_total",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "cpipeline",
    "ContainerName": "main",
    "FullPodName": "gpu-burn-577f5d7468-4j54s",
    "GpuDevice": "nvidia0",
    "InstanceId": "i-0f01fff8faa360227",
    "InstanceType": "g4dn.xlarge",
    "Namespace": "kube-system",
    "NodeName": "ip-192-168-6-219.ec2.internal",
    "PodName": "gpu-burn",
    "Sources": [
        "dcgm",
        "pod",
        "calculated"
    ],
    "Timestamp": "1760375344178",
    "Type": "ContainerGPU",
    "UUID": "GPU-60efa417-4d26-c4ba-9e62-66249559952d",
    "Version": "0",
    "kubernetes": {
        "container_name": "main",
        "containerd": {
            "container_id": "5bfc51b6805d8bdc96e34f262394ae2702cc5d55ad186c660acbef414aa86223"
        },
        "host": "ip-192-168-6-219.ec2.internal",
        "labels": {
            "app": "gpu-burn",
            "pod-template-hash": "577f5d7468"
        },
        "pod_name": "gpu-burn-577f5d7468-4j54s",
        "pod_owners": [
            {
                "owner_kind": "Deployment",
                "owner_name": "gpu-burn"
            }
        ]
    },
    "container_gpu_memory_total": {
        "Values": [
            16006027360
        ],
        "Counts": [
            60
        ],
        "Max": 16006027360,
        "Min": 16006027360,
        "Count": 60,
        "Sum": 982473768960
    },
    "container_gpu_memory_used": {
        "Values": [
            0,
            176060768,
            245366784,
            14254342144,
            253755392,
            111149056,
            207608048,
            251658240
        ],
        "Counts": [
            8,
            1,
            1,
            46,
            1,
            1,
            1,
            1
        ],
        "Max": 14254342144,
        "Min": 0,
        "Count": 60,
        "Sum": 656945446912
    },
    "container_gpu_memory_utilization": {
        "Values": [
            1.185,
            0.9862,
            90.0607,
            1.609,
            0.6948,
            1.3572000000000002,
            1.5559999999999998,
            0
        ],
        "Counts": [
            1,
            1,
            46,
            1,
            1,
            1,
            1,
            8
        ],
        "Max": 90.0607,
        "Min": 0,
        "Count": 60,
        "Sum": 4150.226400000004
    },
    "container_gpu_power_draw": {
        "Values": [
            32.662,
            70.563,
            69.099,
            32.760,
            69.49,
            33.549,
            69.978,
            69.197,
            33.844,
            63.907,
            65.919,
            70.368,
            70.27,
            38.921,
            69.435,
            68.360,
            69.88,
            70.173,
            68.318,
            70.119,
            67.872,
            70.466,
            65.626,
            67.97,
            69.826,
            32.859,
            33.352,
            70.660,
            70.075,
            33.253,
            69.294,
            69.587,
            68.904,
            38.429,
            82.459,
            69.685,
            69.392,
            68.849,
            69.782,
            68.458
        ],
        "Counts": [
            2,
            2,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            3,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            3,
            2,
            2,
            1,
            1,
            1,
            1,
            1,
            4,
            1,
            1,
            2,
            1
        ],
        "Max": 82.459,
        "Min": 32.662,
        "Count": 60,
        "Sum": 3748.8209999999995
    },
    "container_gpu_temperature": {
        "Values": [
            42,
            43,
            44
        ],
        "Counts": [
            12,
            32,
            16
        ],
        "Max": 44,
        "Min": 42,
        "Count": 60,
        "Sum": 2628
    },
    "container_gpu_utilization": {
        "Values": [
            96,
            6,
            8,
            14,
            58,
            0,
            64,
            9,
            89,
            7,
            100
        ],
        "Counts": [
            1,
            1,
            1,
            1,
            1,
            6,
            1,
            1,
            1,
            2,
            44
        ],
        "Max": 100,
        "Min": 0,
        "Count": 60,
        "Sum": 4858
    }
}
  • metrics graph (screenshot not reproduced)
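The histogram fields in the log sample above (Values/Counts/Max/Min/Count/Sum) are the EMF structured-log histogram shape: identical datapoints are deduplicated into a value list with parallel counts. This is not the agent's actual Go implementation, just a minimal Python sketch of the compaction, using a made-up temperature series for illustration:

```python
from collections import Counter

def to_emf_histogram(samples):
    """Compact raw datapoints into the EMF histogram shape
    (Values/Counts/Max/Min/Count/Sum) used in the log sample."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)          # value -> occurrence count
    values = list(counts)              # distinct values, first-seen order
    return {
        "Values": values,
        "Counts": [counts[v] for v in values],
        "Max": max(samples),
        "Min": min(samples),
        "Count": len(samples),
        "Sum": sum(samples),
    }

# 60 per-second readings collapse to three distinct values,
# so the log line carries 3 entries instead of 60 datapoints.
temps = [42] * 12 + [43] * 32 + [44] * 16
hist = to_emf_histogram(temps)
```

The payload size now scales with the number of distinct values rather than the sampling frequency, which is what makes per-second collection affordable in a one-minute PutLogEvents batch.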

Requirements

Before committing your code, please do the following steps.

  1. Run make fmt and make fmt-sh. - done
  2. Run make lint. - done

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@yanhaoluo666 force-pushed the feature/high-frequency-gpu-metrics branch 4 times, most recently from a40f2b7 to 69ba416, October 16, 2025
@yanhaoluo666 added the "ready for testing" label (runs integration tests) Oct 17, 2025
@yanhaoluo666 force-pushed the branch from 6c0f9d7 to acbbe17, October 20, 2025
@yanhaoluo666 force-pushed the branch from b9ed82e to 7b33d8f, October 31, 2025
@yanhaoluo666 force-pushed the branch 4 times, most recently from 8cb0d6e to d06a96b, November 3, 2025
sky333999 previously approved these changes Nov 3, 2025
@yanhaoluo666 force-pushed the branch 2 times, most recently from 8424297 to 05c81f4, November 3, 2025
sky333999 previously approved these changes Nov 3, 2025
movence previously approved these changes Nov 3, 2025
@yanhaoluo666 dismissed stale reviews from movence and sky333999 via 8c2ccf6, November 4, 2025
@yanhaoluo666 force-pushed the branch from 21cf594 to 8c2ccf6, November 4, 2025
@yanhaoluo666 re-added the "ready for testing" label Nov 4, 2025
@yanhaoluo666 force-pushed the branch from 8c2ccf6 to aeee745, November 4, 2025
@sky333999 merged commit b95d897 into aws:main Nov 4, 2025 (15 of 18 checks passed)