Colliding values for container_fs_ container I/O metric series if device major:minor cannot be resolved to a device path #3693

@ringerc

Bug

cadvisor may emit semi-randomly selected values for container_fs_ metrics such as container_fs_reads_bytes_total, container_fs_writes_bytes_total, container_fs_reads_total, container_fs_writes_total, etc. for newly hot-added devices, or when there are issues with mapping device nodes to names.

If there is more than one device for a given container, this appears to the user as extremely rapidly increasing rate(...) values over these counters, because any decrease in the reported value is treated as a counter wrap-around by Prometheus. For example, when scrapes alternate between 450560 and 640986619904, the jump up contributes ~640 GB of apparent I/O, while the drop back down is treated as a counter reset rather than canceling it out, so rate() reports enormous throughput. Container I/O metrics become wildly inaccurate and extremely high.

Cause

This can occur whenever there's a device node that cadvisor can't resolve to a name, because:

  • cadvisor maps device major:minor to device names via a cache
  • If the cache refresh fails, the device was newly hot-added and is not in the cache yet, or reading the device state fails for some subset of devices, the default device name "" is used (since Device is never set to anything else). assignDeviceNamesToPerDiskStats does no eager lookup on mapping misses; it also ignores false returns from func DeviceName that should indicate a missed lookup, and caches the empty-string result anyway
  • When emitting metrics, the major:minor mapping does not treat an empty device name as an error, so cadvisor may emit multiple series with device="" in a single scrape
  • the ioValues helper's PerDiskStats input has no guaranteed sort order, so different scrapes might emit the values in a different order
  • promhttp will use the most recent value when there are conflicting values
  • whenever the chosen device's current stats value happens to be lower than the last-chosen one, the counter will be treated as having wrapped around and reset.

A cache miss for a newly hot-added device does not force a cache refresh, so I suspect this bug can affect hot-added devices such as new iSCSI targets or Kubernetes PV block devices until the next cache refresh.
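
A minimal, self-contained sketch of the collision (not cadvisor's actual code; the perDiskStats type is a stand-in, and the values are taken from the io.stat numbers in the diagnostics below). Two per-disk entries that both miss name resolution collapse onto the same device="" label set; which value "wins" depends on iteration order, so it can change from scrape to scrape:

    package main

    import "fmt"

    // perDiskStats is a simplified stand-in for cadvisor's PerDiskStats.
    type perDiskStats struct {
            Major, Minor uint64
            Device       string // stays "" when the name cache misses
            ReadBytes    uint64
    }

    func main() {
            // Two devices whose major:minor the name cache cannot resolve.
            stats := map[string]perDiskStats{
                    "253:1":  {Major: 253, Minor: 1, ReadBytes: 450560},
                    "253:19": {Major: 253, Minor: 19, ReadBytes: 640986619904},
            }

            // Emitting one sample per entry keyed only by the device label:
            // both entries collapse onto device="", the last write wins, and
            // map iteration order makes "last" nondeterministic across runs.
            series := map[string]uint64{}
            for _, s := range stats {
                    series[s.Device] = s.ReadBytes
            }
            fmt.Println(series) // map[:450560] or map[:640986619904], varies
    }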

Possible fixes

To prevent this from occurring, cadvisor should not add cache entries for missed device name lookups, and should:

  1. discard metrics where the device could not be mapped to a name;
  2. emit the major:minor device identity as the value of the device label if it cannot be resolved to a string device node name; or
  3. add another devicenode label with the major:minor on all metrics, so that if device="" the series are still disambiguated by devicenode="major:minor".

I think (2) would be the best compromise between backwards compatibility (no new labels) and usability, except that hot-added devices would likely appear as (e.g.) device="252:0" initially and then device="dm-0" after the next MachineInfo block-device mapping cache refresh. That breaks the series and causes unnecessary label churn, which adds Prometheus load and memory use.
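
A hedged sketch of option (2); deviceLabel is a hypothetical helper name, not anything in cadvisor's API:

    package main

    import "fmt"

    // deviceLabel falls back to the major:minor identity when the device
    // name is unresolved, so otherwise-colliding device="" series stay
    // distinct. The label value still changes (e.g. "252:0" -> "dm-0")
    // once the cache later resolves the name.
    func deviceLabel(name string, major, minor uint64) string {
            if name != "" {
                    return name
            }
            return fmt.Sprintf("%d:%d", major, minor)
    }

    func main() {
            fmt.Println(deviceLabel("", 252, 0))     // 252:0
            fmt.Println(deviceLabel("dm-0", 252, 0)) // dm-0
    }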

To fix that, it'd be desirable to trigger an eager lookup of device node -> name when the MachineInfo cache misses. But at an initial glance it looks like that'd require new methods on the interface, a negative cache to avoid hammering refreshes for devices that consistently cannot be mapped, etc. For example, on a miss the MachineInfo code could look up the major:minor under /sys/dev/block/[major:minor], avoiding the need to rebuild the whole cache.
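
A minimal sketch of that lookup, assuming sysfs is mounted at /sys; deviceNameFromSysfs is a hypothetical name, not a cadvisor function:

    package main

    import (
            "fmt"
            "os"
            "path/filepath"
    )

    // deviceNameFromSysfs resolves major:minor to a kernel device name by
    // following the /sys/dev/block/<major>:<minor> symlink.
    func deviceNameFromSysfs(major, minor uint64) (string, error) {
            link := fmt.Sprintf("/sys/dev/block/%d:%d", major, minor)
            // The link points at e.g. ../../devices/virtual/block/dm-0;
            // the final path component is the device name.
            target, err := os.Readlink(link)
            if err != nil {
                    // Device absent or sysfs unavailable; a caller would skip
                    // the sample or record a negative-cache entry here.
                    return "", err
            }
            return filepath.Base(target), nil
    }

    func main() {
            name, err := deviceNameFromSysfs(253, 0)
            if err != nil {
                    fmt.Println("lookup failed:", err)
                    return
            }
            fmt.Println(name) // e.g. dm-0
    }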

Option (3) wouldn't be any better than option (2) though; initially we'd emit device="" devicenode="252:0" and once the cache refreshed, we'd emit device="dm-0" devicenode="252:0" for example. This would still break the series, add churn, and have an initially-invalid device name.

So it seems there's a need for an eager lookup of device names on cache misses, and in the meantime option (1), just discarding such series, might be best.
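
A correspondingly small sketch of option (1), again with hypothetical names: drop any per-disk entry whose name could not be resolved instead of emitting it with device="":

    package main

    import "fmt"

    type perDiskStats struct {
            Device    string
            ReadBytes uint64
    }

    // emittable filters out entries with an unresolved device name so that
    // no colliding device="" series is produced at all.
    func emittable(stats []perDiskStats) []perDiskStats {
            var kept []perDiskStats
            for _, s := range stats {
                    if s.Device == "" {
                            continue // discard rather than emit a colliding series
                    }
                    kept = append(kept, s)
            }
            return kept
    }

    func main() {
            in := []perDiskStats{{Device: "", ReadBytes: 450560}, {Device: "dm-0", ReadBytes: 196608}}
            fmt.Println(emittable(in)) // [{dm-0 196608}]
    }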

The immediate cause of the cache failing to populate properly in my case looks like it was fixed by #3522 but the underlying weakness in the metrics label scheme remains.

Related

Diagnostic details

This issue was observed by running this command to grab both the raw /sys/fs/cgroup/io.stat and the currently-reported value of container_fs_reads_bytes_total for a specific pod.

for i in $(seq 1 10) ; do kubectl exec -it -n p-nklcmnw2oj pod/p-nklcmnw2oj-1 -- cat /sys/fs/cgroup/io.stat ; echo; promtool query instant http://localhost:9999 'container_fs_reads_bytes_total{pod=~"p-nklcmnw2oj-1"}'; echo; sleep 1m; done > p-nklcmnw2oj-1-io.stat-1m.txt

p-nklcmnw2oj-1-io.stat-1m.txt

The resulting output showed a surprising number of values like container_fs_reads_bytes_total{container="postgres",device=""} 450560 for the target pod+container. This didn't make sense for an aggregate counter over all devices, and it was interspersed with values that would've been consistent with other counters seen in io.stat.

$ awk -F '=>' '/^container_fs_reads_bytes_total/ { print $2 }' p-nklcmnw2oj-1-io.stat-1m.txt 
 450560 @[1748570525.479]
 210518016 @[1748570525.479]
 135649726464 @[1748570589.822]
 593107652608 @[1748570589.822]
 640986619904 @[1748570655.531]
 137259347968 @[1748570655.531]
 450560 @[1748570719.619]
 450560 @[1748570719.619]
 593107824640 @[1748570784.021]
 450560 @[1748570784.021]
 0 @[1748570847.817]
 450560 @[1748570847.817]
 0 @[1748570911.649]
 450560 @[1748570911.649]
 450560 @[1748570975.866]
 450560 @[1748570975.866]
 450560 @[1748571039.912]
 196608 @[1748571039.912]
 640986619904 @[1748571103.627]
 196608 @[1748571103.627]

This number happens to be the rbytes value in io.stat for a number of idle device nodes whose counters aren't increasing:

9:125 rbytes=450560 wbytes=4119887937536 rios=100 wios=49739282 dbytes=0 dios=0
253:1 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:2 rbytes=450560 wbytes=4133210316800 rios=100 wios=49925374 dbytes=0 dios=0
253:19 rbytes=450560 wbytes=5006616088576 rios=100 wios=76139758 dbytes=0 dios=0

which is why it seems to repeat so often.

What seems to be happening is that cadvisor is (inadvertently) picking a random device node and emitting its rbytes value for each scrape of container_fs_reads_bytes_total because it cannot map the major:minor to a device name (probably because the fix from #3522 is missing from this build).

This is nonsensical behaviour. It shouldn't just emit device="" in these cases. It should instead visibly fail or add some disambiguating label.

Version info

This specific incident was encountered in the RHOS (OpenShift) cadvisor embedded in their kubelet rpm version:

Name        : openshift-kubelet
Version     : 4.18.0
Release     : 202505060638.p0.gaf98ede.assembly.stream.el9

... and is caused by their kubelet version omitting #3522.

But it looks like the underlying problem can occur in current master as of cadvisor v0.52.1 and HEAD commit f6e31a3 if anything else causes cadvisor to be unable to map the device major:minor to a name, or there's a race with newly hot-added devices.
