config-linux: Add security considerations for linux.devices raw block I/O #1313
Description
Problem
The linux.devices and linux.resources.devices sections of config-linux.md describe how to configure device access for containers but include no security guidance about the implications of granting r (read) or w (write) access to block devices.
When a block device is configured in linux.devices and linux.resources.devices grants access: "rw" or "rwm", the container process can perform raw block-level I/O via standard read() and write() syscalls — regardless of the process capabilities set.
Specifically:
- read() on a block device fd does not require CAP_SYS_RAWIO or any other capability
- write() on a block device fd does not require CAP_SYS_RAWIO or any other capability
- mount() correctly requires CAP_SYS_ADMIN
This means a container with a block device entry and only the default unprivileged capability set can read the entire contents of the host device (including all filesystem data, credentials, and keys) and potentially write to it (modifying or corrupting the host filesystem at the block level).
The specification does not document this behavior. As a result, runtime implementors and container orchestrators may assume that Linux capabilities serve as a security boundary for device access — which they do for mount(), but not for raw I/O.
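To illustrate the point about plain syscalls, here is a minimal Python sketch of the access pattern in question: open(), lseek(), and read(), with no ioctls and no capability checks in the code path. The helper name read_raw is hypothetical, and the demo reads a temporary regular file as a stand-in because a real block device node is not assumed to be present. Pointed at a block device node that the device cgroup and file permissions allow, the same three calls would apply.

```python
import os
import tempfile


def read_raw(path, offset, length):
    """Read up to `length` bytes at `offset` using only open/lseek/read.

    Hypothetical helper for illustration; nothing here is specific to
    block devices, which is exactly the point of the issue.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, length)
    finally:
        os.close(fd)


# Stand-in demo: a temporary regular file plays the role of the device node.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"A" * 4096 + b"root:x:0:0")
    stand_in = tmp.name

try:
    data = read_raw(stand_in, 4096, 10)
    print(data)
finally:
    os.unlink(stand_in)
```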
Impact
The gap affects the entire container ecosystem that consumes this specification:
- Container runtimes (runc, crun, youki) faithfully implement the spec and create device nodes with the specified access — no additional validation is performed on block devices
- Container orchestrators (containerd, CRI-O, Docker) populate linux.devices based on higher-level configuration (--device, device plugins, hostPath BlockDevice) without security warnings
- Kubernetes exposes block devices via hostPath type: BlockDevice, device plugins (GPU, FPGA, SR-IOV), and CSI raw block volumes, all of which result in linux.devices entries
- Security tooling (admission controllers, policy engines) commonly audits capabilities and seccomp profiles but rarely inspects device cgroup rules for block device access
Verified behavior
Tested with runc 1.3.4 on cgroup v2 (eBPF device controller), default seccomp profile active:
# Container capabilities (default set, no SYS_ADMIN, no SYS_RAWIO):
CapPrm: 0x00000000a80425fb
# mount() — correctly blocked:
mount: permission denied (are you root?)
# Raw read via dd — succeeds, extracts host /etc/shadow:
$ dd if=/dev/hostdisk bs=4096 count=38400 2>/dev/null | strings | grep '^root:'
root:x:0:0:root:/root:/bin/sh
root:*::0:::::
# Raw write via dd — succeeds:
$ echo TEST | dd of=/dev/hostdisk bs=1 seek=153000000 count=5 conv=notrunc
5+0 records in
5+0 records out
Proposed Changes
1. Add security note to linux.devices section
After the existing description of linux.devices, add:
Security consideration: Creating a block device node (type "b") and granting r or w access in linux.resources.devices allows the container process to perform raw block-level I/O on the underlying host device using standard read() and write() syscalls. These syscalls are not gated by any Linux capability; device cgroup permission and Unix file permissions are the only controls. Removing CAP_SYS_ADMIN prevents mount() but does not prevent raw data access.
Runtimes and orchestrators SHOULD warn when block devices are configured with read or write access. Effective defenses include user namespaces (remapped UID 0 cannot open root-owned device nodes) and running container processes as non-root users.
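As a rough idea of what such a warning could look like, the following Python sketch walks a parsed config.json and reports block device nodes whose effective cgroup access includes r or w. It is not taken from any runtime. The function names are hypothetical, and it assumes that rules are applied in listed order, that an unset type, major, or minor acts as a wildcard, that -1 is also treated as a wildcard (a runc-style convention), and that a missing access string means "rwm".

```python
def _matches(rule, dev):
    """Return True if a linux.resources.devices rule applies to a device."""
    rtype = rule.get("type", "a")  # unset type is assumed to mean "all"
    if rtype not in ("a", dev["type"]):
        return False
    for key in ("major", "minor"):
        want = rule.get(key)
        # Assumption: unset or -1 acts as a wildcard.
        if want is not None and want != -1 and want != dev.get(key):
            return False
    return True


def effective_access(rules, dev):
    """Apply rules in listed order and return the resulting permission set."""
    allowed = set()
    for rule in rules:
        if not _matches(rule, dev):
            continue
        perms = set(rule.get("access", "rwm"))  # assumed default
        if rule.get("allow", False):
            allowed |= perms
        else:
            allowed -= perms
    return allowed


def risky_block_devices(config):
    """List (path, perms) for block device nodes with raw r/w access."""
    linux = config.get("linux", {})
    rules = linux.get("resources", {}).get("devices", [])
    findings = []
    for dev in linux.get("devices", []):
        if dev.get("type") != "b":
            continue
        raw = effective_access(rules, dev) & {"r", "w"}
        if raw:
            findings.append((dev["path"], "".join(sorted(raw))))
    return findings


# Hypothetical config fragment: device numbers and paths are examples only.
example = {
    "linux": {
        "devices": [
            {"path": "/dev/hostdisk", "type": "b", "major": 8, "minor": 0},
            {"path": "/dev/null", "type": "c", "major": 1, "minor": 3},
        ],
        "resources": {
            "devices": [
                {"allow": False, "access": "rwm"},
                {"allow": True, "type": "b", "major": 8, "minor": 0, "access": "rw"},
                {"allow": True, "type": "c", "major": 1, "minor": 3, "access": "rwm"},
            ]
        },
    }
}

for path, perms in risky_block_devices(example):
    print(f"warning: block device {path} grants raw '{perms}' access")
```

The check deliberately ignores process capabilities, since the concern raised here is that they do not gate read() or write() on the device node.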
2. Add note to linux.resources.devices access field
After the access field description, add:
Note: The r and w permissions control access through the device cgroup controller (or eBPF device program on cgroup v2). When applied to block devices, these permissions enable raw block-level I/O that is independent of Linux capabilities. CAP_SYS_RAWIO is not required for read() or write() on block device file descriptors.
References
- GHSA-g54h-m393-cpwq: runc devices resource list treated as denylist (related)
- kernel.org, devices cgroup v1: r/w control read()/write() on device inodes
- superuser.com/q/842525: confirms dd uses read()/write(), not raw I/O ioctls
- config-linux.md#devices: current spec text
- PR #1214: ongoing work to deprecate device access denial (related device cgroup work)
- PR #1148: device node location clarification (related)