Bug
`cadvisor` may emit semi-randomly selected metric values for `container_fs_` metrics like `container_fs_reads_bytes_total`, `container_fs_writes_bytes_total`, `container_fs_reads_total`, `container_fs_writes_total`, etc. for newly hot-added devices, or when there are issues with device node to name mapping.

If there is more than one device for a given container, this will appear to the user as extremely rapidly increasing values for the `rate(...)`s over these counters, because decreases in the values will be treated as counter wrap-arounds by Prometheus. Values for container I/O metrics will be wildly inaccurate and extremely high.
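To make the wrap-around effect concrete, here's a minimal sketch in Go (an illustration of Prometheus's documented reset handling, not its actual source) of how alternating between two devices' counter values inflates the apparent increase; the sample values are taken from the diagnostic output further below:

```go
package main

import "fmt"

// increase mimics Prometheus-style counter handling: a decrease between
// consecutive samples is assumed to be a counter reset, so the post-drop
// value is counted as fresh increase on top of the running total.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		delta := samples[i] - samples[i-1]
		if delta < 0 {
			delta = samples[i] // assume reset: counter restarted from zero
		}
		total += delta
	}
	return total
}

func main() {
	// Values alternating between two devices' rbytes counters, neither of
	// which is actually growing by any meaningful amount.
	samples := []float64{450560, 593107652608, 450560, 593107824640}
	fmt.Printf("apparent increase: %.0f bytes\n", increase(samples))
	// Prints ~1.19e12, even though each underlying counter is nearly flat.
}
```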
Cause
This can occur whenever there's a device node that cadvisor can't resolve to a name, because:

- `cadvisor` maps device `major:minor` pairs to device names via a cache
- If the cache refresh fails, the device is newly hot-added and not in the cache yet, or reading the device state fails for some subset of devices, the default device name `""` is used (since `Device` is not set to anything else), and `assignDeviceNamesToPerDiskStats` does no eager lookup on mapping misses (it also ignores false returns from `func DeviceName` that should indicate a missed lookup, and caches the empty-string result anyway)
- the `major:minor` mapping when emitting metrics does not treat the empty device name as an error, and so it may emit multiple series with `device=""` in a single scrape
- the `ioValues` helper's `PerDiskStats` input has no guaranteed sort order, so different scrapes might emit the values in a different order
- promhttp will use the most recent value when there are conflicting values
- whenever the chosen device's current stats value happens to be lower than the last-chosen one, the counter will be treated as having wrapped around and reset.
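As a rough sketch of the lookup pattern described above (hypothetical names and signatures, not cadvisor's actual code), the core problem is a `DeviceName`-style lookup whose "found" flag is discarded, so a miss silently becomes an empty, cached device name:

```go
// Hypothetical sketch, not cadvisor's real types: a major:minor -> name
// cache whose miss flag is ignored by the caller.
package main

import "fmt"

type deviceKey struct{ major, minor uint64 }

type deviceNameCache map[deviceKey]string

// deviceName returns the cached name and whether the lookup hit.
func (c deviceNameCache) deviceName(k deviceKey) (string, bool) {
	name, ok := c[k]
	return name, ok
}

// assignName shows the buggy pattern: the boolean is dropped, so a miss
// silently yields "", and the empty result is cached, hiding the miss
// from every later lookup until the whole cache is refreshed.
func assignName(c deviceNameCache, k deviceKey) string {
	name, _ := c.deviceName(k) // ok == false on a miss, but it's ignored
	c[k] = name                // caches "" for this device
	return name
}

func main() {
	cache := deviceNameCache{{major: 253, minor: 1}: "dm-1"}
	fmt.Printf("known device: %q\n", assignName(cache, deviceKey{253, 1}))
	fmt.Printf("hot-added device: %q\n", assignName(cache, deviceKey{252, 0}))
	// Prints "" for 252:0 — the value that ends up in the device label.
}
```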
If there's a cache miss for a newly hot-added device, `cadvisor` won't force a cache refresh, so I suspect this bug can affect newly hot-added devices like new iSCSI targets or kubernetes PV block devices until the next cache refresh.
Possible fixes
To prevent this from occurring, `cadvisor` should not add cache entries for missed device name lookups, and should do one of the following (a sketch of options (1) and (2) follows the list):

1. discard metrics where the device could not be mapped to a name;
2. emit the `major:minor` device identity as the value of the `device` label if it cannot be resolved to a string device node name; or
3. add another `devicenode` label with the `major:minor` on all metrics, so that if `device=""` the series are still disambiguated by `devicenode="major:minor"`.
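As a minimal sketch (hypothetical helper names, not cadvisor's actual emit path), options (1) and (2) might look like:

```go
// Hypothetical helpers illustrating options (1) and (2); not cadvisor code.
package metrics

import "fmt"

// shouldEmit implements option (1): drop the series entirely when the
// device name could not be resolved, instead of emitting device="".
func shouldEmit(deviceName string) bool {
	return deviceName != ""
}

// deviceLabel implements option (2): fall back to the "major:minor"
// identity when no name is known, so series stay disambiguated.
func deviceLabel(deviceName string, major, minor uint64) string {
	if deviceName != "" {
		return deviceName
	}
	return fmt.Sprintf("%d:%d", major, minor) // e.g. device="252:0"
}
```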
I think (2) would be the best compromise for backwards compatibility (no new labels) and usability, except that hot-added devices are likely to appear as (e.g.) `device="252:0"` initially, then `device="dm-0"` after the next MachineInfo block device mapping cache refresh. This would break the series and cause unnecessary label churn that adds Prometheus load and memory use.
To fix that, it'd be desirable to trigger an eager lookup of the device node->name mapping when the MachineInfo cache misses. But at an initial glance, it looks like that'd require new methods on the interface, a negative cache to avoid hammering refreshes when there are devices that consistently cannot be mapped, etc. The MachineInfo could look up the `major:minor` in `/sys/dev/block/[major:minor]` if there's a miss, for example, avoiding the need to rebuild the whole cache.
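As an illustration, here's a minimal sketch of that single-device fallback, assuming the standard sysfs layout where `/sys/dev/block/<major>:<minor>` is a symlink whose last path element is the kernel device name (the function and package names are hypothetical, not an existing cadvisor API):

```go
package sysinfo

import (
	"fmt"
	"os"
	"path/filepath"
)

// deviceNameFromSysfs resolves a single major:minor pair by reading the
// /sys/dev/block/<major>:<minor> symlink, whose target ends in the device
// name (e.g. ../../devices/virtual/block/dm-0 -> "dm-0"). This avoids
// rebuilding the whole name cache on a miss.
func deviceNameFromSysfs(major, minor uint64) (string, error) {
	link := fmt.Sprintf("/sys/dev/block/%d:%d", major, minor)
	target, err := os.Readlink(link)
	if err != nil {
		return "", fmt.Errorf("no sysfs entry for %d:%d: %w", major, minor, err)
	}
	return filepath.Base(target), nil
}
```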
Option (3) wouldn't be any better than option (2), though; initially we'd emit `device="" devicenode="252:0"`, and once the cache refreshed, we'd emit `device="dm-0" devicenode="252:0"`, for example. This would still break the series, add churn, and have an initially-invalid device name.
So it seems like there's a need for an eager lookup of device names in case of cache misses, and in the meantime maybe option (1) of just discarding such series might be best.
The immediate cause of the cache failing to populate properly in my case looks like it was fixed by #3522, but the underlying weakness in the metrics label scheme remains.
Related
- Not all /sys/block devices will have a "dev" file #3340
- sysinfo: Ignore "hidden" sysfs device entries #3260
- sysfs: trim spaces in device hidden check #3522
Diagnostic details
This issue was observed by running this command to grab both the raw `/sys/fs/cgroup/io.stat` and the currently-reported value of `container_fs_reads_bytes_total` for a specific pod:

```sh
for i in $(seq 1 10) ; do kubectl exec -it -n p-nklcmnw2oj pod/p-nklcmnw2oj-1 -- cat /sys/fs/cgroup/io.stat ; echo; promtool query instant http://localhost:9999 'container_fs_reads_bytes_total{pod=~"p-nklcmnw2oj-1"}'; echo; sleep 1m; done > p-nklcmnw2oj-1-io.stat-1m.txt
```
The resulting output showed a surprising number of values like `container_fs_reads_bytes_total{container="postgres",device=""} 450560` for the target pod+container. This didn't make sense for an aggregate counter for all devices, and it was interspersed with values that would've been consistent for other counters seen in `io.stat`.
```console
$ awk -F '=>' '/^container_fs_reads_bytes_total/ { print $2 }' p-nklcmnw2oj-1-io.stat-1m.txt
450560 @[1748570525.479]
210518016 @[1748570525.479]
135649726464 @[1748570589.822]
593107652608 @[1748570589.822]
640986619904 @[1748570655.531]
137259347968 @[1748570655.531]
450560 @[1748570719.619]
450560 @[1748570719.619]
593107824640 @[1748570784.021]
450560 @[1748570784.021]
0 @[1748570847.817]
450560 @[1748570847.817]
0 @[1748570911.649]
450560 @[1748570911.649]
450560 @[1748570975.866]
450560 @[1748570975.866]
450560 @[1748571039.912]
196608 @[1748571039.912]
640986619904 @[1748571103.627]
196608 @[1748571103.627]
```
This number happens to be the `rbytes` value in `io.stat` for a number of idle device nodes whose counters aren't increasing:
```
9:125 rbytes=450560 wbytes=4119887937536 rios=100 wios=49739282 dbytes=0 dios=0
253:1 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:2 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:19 rbytes=450560 wbytes=5006616088576 rios=100 wios=76139758 dbytes=0 dios=0
```
which is why it seems to repeat so often.
What seems to be happening is that `cadvisor` is (inadvertently) picking a random device node and emitting its `rbytes` value on each scrape of `container_fs_reads_bytes_total`, because it cannot map the `major:minor` to a device name (probably due to the bug fixed by #3522).
This is nonsensical behaviour. It shouldn't just emit `device=""` in these cases. It should instead visibly fail or add some disambiguating label.
Version info
This specific incident was encountered in the RHOS (OpenShift) cadvisor embedded in their kubelet rpm version:
```
Name    : openshift-kubelet
Version : 4.18.0
Release : 202505060638.p0.gaf98ede.assembly.stream.el9
```
... and is caused by their kubelet version omitting #3522.
But it looks like the underlying problem can occur in current `master` as of cadvisor v0.52.1 and HEAD commit f6e31a3 if anything else causes cadvisor to be unable to map the device `major:minor` to a name, or there's a race with newly hot-added devices.