Colliding values for container_fs_ container I/O metric series if device major:minor cannot be resolved to a device path #3693

@ringerc

Bug

cadvisor may emit semi-randomly selected values for container_fs_ metrics such as container_fs_reads_bytes_total, container_fs_writes_bytes_total, container_fs_reads_total, container_fs_writes_total, etc. for newly hot-added devices, or when there are issues with mapping device nodes to names.

If there is more than one device for a given container, this appears to the user as extremely rapidly increasing rate(...) values over these counters, because any decrease in the reported value is treated as a counter wrap-around by Prometheus. For example, when scrapes alternate between 450560 and 640986619904, the jump up contributes ~640 GB of apparent I/O, while the drop back down is treated as a counter reset rather than canceling it out, so rate() reports enormous throughput. Container I/O metrics become wildly inaccurate and extremely high.

Cause

This can occur whenever there's a device node that cadvisor can't resolve to a name, because:

  • cadvisor maps device major:minor to device names via a cache
  • If the cache refresh fails, the device was newly hot-added and is not in the cache yet, or reading the device state fails for some subset of devices, the default device name "" is used (since Device is never set to anything else). assignDeviceNamesToPerDiskStats does no eager lookup on mapping misses; it also ignores false returns from func DeviceName that should indicate a missed lookup, and caches the empty-string result anyway
  • When emitting metrics, the major:minor mapping does not treat an empty device name as an error, so cadvisor may emit multiple series with device="" in a single scrape
  • the ioValues helper's PerDiskStats input has no guaranteed sort order, so different scrapes might emit the values in a different order
  • promhttp will use the most recent value when there are conflicting values
  • whenever the chosen device's current stats value happens to be lower than the last-chosen one, the counter will be treated as having wrapped around and reset.

A cache miss for a newly hot-added device does not force a cache refresh, so I suspect this bug can affect hot-added devices such as new iSCSI targets or Kubernetes PV block devices until the next cache refresh.
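
A minimal, self-contained sketch of the collision (not cadvisor's actual code; the perDiskStats type is a stand-in, and the values are taken from the io.stat numbers in the diagnostics below). Two per-disk entries that both miss name resolution collapse onto the same device="" label set; which value "wins" depends on iteration order, so it can change from scrape to scrape:

    package main

    import "fmt"

    // perDiskStats is a simplified stand-in for cadvisor's PerDiskStats.
    type perDiskStats struct {
            Major, Minor uint64
            Device       string // stays "" when the name cache misses
            ReadBytes    uint64
    }

    func main() {
            // Two devices whose major:minor the name cache cannot resolve.
            stats := map[string]perDiskStats{
                    "253:1":  {Major: 253, Minor: 1, ReadBytes: 450560},
                    "253:19": {Major: 253, Minor: 19, ReadBytes: 640986619904},
            }

            // Emitting one sample per entry keyed only by the device label:
            // both entries collapse onto device="", the last write wins, and
            // map iteration order makes "last" nondeterministic across runs.
            series := map[string]uint64{}
            for _, s := range stats {
                    series[s.Device] = s.ReadBytes
            }
            fmt.Println(series) // map[:450560] or map[:640986619904], varies
    }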

Possible fixes

To prevent this from occurring, cadvisor should not add cache entries for missed device name lookups, and should:

  1. discard metrics where the device could not be mapped to a name;
  2. emit the major:minor device identity as the value of the device label if it cannot be resolved to a string device node name; or
  3. add another devicenode label with the major:minor on all metrics, so that if device="" the series are still disambiguated by devicenode="major:minor".

I think (2) would be the best compromise between backwards compatibility (no new labels) and usability, except that hot-added devices would likely appear as (e.g.) device="252:0" initially and then device="dm-0" after the next MachineInfo block-device mapping cache refresh. That breaks the series and causes unnecessary label churn, which adds Prometheus load and memory use.
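
A hedged sketch of option (2); deviceLabel is a hypothetical helper name, not anything in cadvisor's API:

    package main

    import "fmt"

    // deviceLabel falls back to the major:minor identity when the device
    // name is unresolved, so otherwise-colliding device="" series stay
    // distinct. The label value still changes (e.g. "252:0" -> "dm-0")
    // once the cache later resolves the name.
    func deviceLabel(name string, major, minor uint64) string {
            if name != "" {
                    return name
            }
            return fmt.Sprintf("%d:%d", major, minor)
    }

    func main() {
            fmt.Println(deviceLabel("", 252, 0))     // 252:0
            fmt.Println(deviceLabel("dm-0", 252, 0)) // dm-0
    }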

To fix that, it'd be desirable to trigger an eager lookup of device node -> name when the MachineInfo cache misses. But at an initial glance it looks like that'd require new methods on the interface, a negative cache to avoid hammering refreshes for devices that consistently cannot be mapped, etc. For example, on a miss the MachineInfo code could look up the major:minor under /sys/dev/block/[major:minor], avoiding the need to rebuild the whole cache.
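
A minimal sketch of that lookup, assuming sysfs is mounted at /sys; deviceNameFromSysfs is a hypothetical name, not a cadvisor function:

    package main

    import (
            "fmt"
            "os"
            "path/filepath"
    )

    // deviceNameFromSysfs resolves major:minor to a kernel device name by
    // following the /sys/dev/block/<major>:<minor> symlink.
    func deviceNameFromSysfs(major, minor uint64) (string, error) {
            link := fmt.Sprintf("/sys/dev/block/%d:%d", major, minor)
            // The link points at e.g. ../../devices/virtual/block/dm-0;
            // the final path component is the device name.
            target, err := os.Readlink(link)
            if err != nil {
                    // Device absent or sysfs unavailable; a caller would skip
                    // the sample or record a negative-cache entry here.
                    return "", err
            }
            return filepath.Base(target), nil
    }

    func main() {
            name, err := deviceNameFromSysfs(253, 0)
            if err != nil {
                    fmt.Println("lookup failed:", err)
                    return
            }
            fmt.Println(name) // e.g. dm-0
    }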

Option (3) wouldn't be any better than option (2) though; initially we'd emit device="" devicenode="252:0" and once the cache refreshed, we'd emit device="dm-0" devicenode="252:0" for example. This would still break the series, add churn, and have an initially-invalid device name.

So it seems there's a need for an eager lookup of device names on cache misses, and in the meantime option (1), just discarding such series, might be best.
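
A correspondingly small sketch of option (1), again with hypothetical names: drop any per-disk entry whose name could not be resolved instead of emitting it with device="":

    package main

    import "fmt"

    type perDiskStats struct {
            Device    string
            ReadBytes uint64
    }

    // emittable filters out entries with an unresolved device name so that
    // no colliding device="" series is produced at all.
    func emittable(stats []perDiskStats) []perDiskStats {
            var kept []perDiskStats
            for _, s := range stats {
                    if s.Device == "" {
                            continue // discard rather than emit a colliding series
                    }
                    kept = append(kept, s)
            }
            return kept
    }

    func main() {
            in := []perDiskStats{{Device: "", ReadBytes: 450560}, {Device: "dm-0", ReadBytes: 196608}}
            fmt.Println(emittable(in)) // [{dm-0 196608}]
    }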

The immediate cause of the cache failing to populate properly in my case looks like it was fixed by #3522 but the underlying weakness in the metrics label scheme remains.

Related

Diagnostic details

This issue was observed by running this command to grab both the raw /sys/fs/cgroup/io.stat and the currently-reported value of container_fs_reads_bytes_total for a specific pod.

for i in $(seq 1 10) ; do kubectl exec -it -n p-nklcmnw2oj pod/p-nklcmnw2oj-1 -- cat /sys/fs/cgroup/io.stat ; echo; promtool query instant http://localhost:9999 'container_fs_reads_bytes_total{pod=~"p-nklcmnw2oj-1"}'; echo; sleep 1m; done > p-nklcmnw2oj-1-io.stat-1m.txt

p-nklcmnw2oj-1-io.stat-1m.txt

The resulting output showed a surprising number of values like container_fs_reads_bytes_total{container="postgres",device=""} 450560 for the target pod+container. This didn't make sense for an aggregate counter over all devices, and it was interspersed with values that would've been consistent with other counters seen in io.stat.

$ awk -F '=>' '/^container_fs_reads_bytes_total/ { print $2 }' p-nklcmnw2oj-1-io.stat-1m.txt 
 450560 @[1748570525.479]
 210518016 @[1748570525.479]
 135649726464 @[1748570589.822]
 593107652608 @[1748570589.822]
 640986619904 @[1748570655.531]
 137259347968 @[1748570655.531]
 450560 @[1748570719.619]
 450560 @[1748570719.619]
 593107824640 @[1748570784.021]
 450560 @[1748570784.021]
 0 @[1748570847.817]
 450560 @[1748570847.817]
 0 @[1748570911.649]
 450560 @[1748570911.649]
 450560 @[1748570975.866]
 450560 @[1748570975.866]
 450560 @[1748571039.912]
 196608 @[1748571039.912]
 640986619904 @[1748571103.627]
 196608 @[1748571103.627]

This number happens to be the rbytes value in io.stat for a number of idle device nodes whose counters aren't increasing:

9:125 rbytes=450560 wbytes=4119887937536 rios=100 wios=49739282 dbytes=0 dios=0
253:1 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:2 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:19 rbytes=450560 wbytes=5006616088576 rios=100 wios=76139758 dbytes=0 dios=0

which is why it seems to repeat so often.

What seems to be happening is that cadvisor is (inadvertently) picking a random device node and emitting its rbytes value for each scrape of container_fs_reads_bytes_total because it cannot map the major:minor to a device name (probably because the fix from #3522 is missing from this build).

This is nonsensical behaviour. It shouldn't just emit device="" in these cases. It should instead visibly fail or add some disambiguating label.

Version info

This specific incident was encountered in the RHOS (OpenShift) cadvisor embedded in their kubelet rpm version:

Name        : openshift-kubelet
Version     : 4.18.0
Release     : 202505060638.p0.gaf98ede.assembly.stream.el9

... and is caused by their kubelet version omitting #3522.

But it looks like the underlying problem can occur in current master as of cadvisor v0.52.1 and HEAD commit f6e31a3 if anything else causes cadvisor to be unable to map the device major:minor to a name, or there's a race with newly hot-added devices.
