Feat/cgroup v2 support (#276) #277

Open
DhanashreePetare wants to merge 2 commits into HSF:main from DhanashreePetare:feat/cgroup-v2-support

Conversation

@DhanashreePetare

Cgroup v1/v2 Monitoring Support

Overview

Adds comprehensive cgroup (container) resource monitoring to prmon with automatic detection of cgroup v1, v2, and hybrid environments.

Changes

  • New monitor: cgroupmon with 17 metrics across CPU, memory, I/O, and process counts
  • Auto-detection: Detects cgroup version at runtime and adapts accordingly
  • Parser: Handles cpu.stat, memory.stat, io.stat for both v1 and v2
  • Tests: Python integration test + GTest skeleton with precooked fixtures
  • Docs: README section, ADDING_MONITORS example, CONTRIBUTING Windows guide
  • Disable flag: --disable cgroupmon if needed
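
As an illustration of the auto-detection item above, here is a minimal Python sketch of one common way to tell the layouts apart. It is not the code in this PR (prmon itself is C++); the function name `detect_cgroup_layout`, the `root` parameter, and the list of v1 controller directories are assumptions made for the example. It relies on two conventions: a cgroup v2 mount exposes `cgroup.controllers` at its root, and systemd's hybrid mode places a v2 tree at `<root>/unified` alongside the v1 controller directories.

```python
from pathlib import Path

# Hypothetical helper, not taken from the PR: classify a cgroup mount point.
V1_CONTROLLER_DIRS = ("memory", "cpu", "cpuacct", "blkio", "pids")


def detect_cgroup_layout(root: str = "/sys/fs/cgroup") -> str:
    """Return 'v2', 'hybrid', 'v1', or 'none' for the cgroup mount under root."""
    base = Path(root)

    # A unified (v2) hierarchy has cgroup.controllers in the mount root.
    if (base / "cgroup.controllers").is_file():
        return "v2"

    # v1 mounts one directory per controller.
    has_v1 = any((base / name).is_dir() for name in V1_CONTROLLER_DIRS)

    # Hybrid mode: v1 controllers plus a v2 tree mounted at <root>/unified.
    has_unified = (base / "unified" / "cgroup.controllers").is_file()

    if has_v1 and has_unified:
        return "hybrid"
    if has_v1:
        return "v1"
    return "none"


if __name__ == "__main__":
    print(detect_cgroup_layout())
```

Taking the mount point as a parameter keeps the check testable against a directory of fixture files, which is how a precooked test could exercise it without a real cgroup mount.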

Benefits

  • Better container/Kubernetes workload monitoring
  • More accurate I/O stats in containerized environments
  • Backward compatible with existing prmon deployments
  • Graceful degradation if cgroups unavailable

Testing

Python linting passes. CI will verify build and tests in Linux environments.

DhanashreePetare added 2 commits December 26, 2025 16:48
- Implement cgroupmon monitor with 17 metrics across CPU, memory, I/O, and process counts
- Support automatic detection of cgroup v1, v2, and hybrid environments
- Parse cgroup controllers: cpu.stat, memory.stat, io.stat for both versions
- Integrate with prmon build system and test infrastructure
- Add Python and GTest test harnesses with precooked cgroup v2 fixtures
- Update documentation: README, ADDING_MONITORS, CONTRIBUTING with usage and Windows dev notes
- Enable container resource tracking by default, with a --disable cgroupmon flag to opt out
- Provide more accurate I/O statistics for containerized workloads
…ead counting

- Remove unused cgroup_stat_update variable in update_stats()
- Fix read_single_value() to count lines for cgroup.procs and cgroup.threads
  instead of reading first value, since these are multi-line PID lists
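
To illustrate why the fix above matters, here is a small Python sketch (not the PR's C++ code). `cgroup.procs` and `cgroup.threads` list one ID per line, so reading only the first value returns an ID rather than a count. The helper names `read_first_value` and `count_listed_ids` are hypothetical.

```python
from pathlib import Path


def read_first_value(path: str) -> int:
    """Read the first integer in a file; suitable for single-value files."""
    return int(Path(path).read_text().split()[0])


def count_listed_ids(path: str) -> int:
    """Count non-empty lines; suitable for ID lists such as cgroup.procs."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())
```

For a `cgroup.procs` file containing the three lines `101`, `102`, and `250`, `read_first_value` yields the PID 101, while `count_listed_ids` yields the process count 3.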
@amete
Collaborator

amete commented Feb 3, 2026

Thanks a lot for the PR @DhanashreePetare.

Before we go further, I’d like to better understand the motivation and concrete use cases behind this change. The PR adds a fairly substantial amount of new code, and as maintainers we try to be cautious about introducing functionality unless there’s a clear and compelling need, since it also increases the long-term maintenance burden.

Could you elaborate a bit on the scenarios this is meant to support, and whether there are existing users or workflows that would benefit from it? That context would really help us evaluate how to move forward.

@DhanashreePetare
Author

Thanks for the thoughtful review and for asking about motivation.

This PR is aimed at containerized and scheduler‑managed workloads where /proc no longer reflects actual resource limits or usage. On modern Linux distributions, cgroup v2 is now the default (e.g., Ubuntu 22.04+, Alma/RHEL 9+), and many HEP/HTC workflows run under Kubernetes, HTCondor, Slurm, or LSF, all of which enforce limits via cgroups. In those environments, prmon’s current /proc view can over‑report memory and under‑report I/O. The cgroup data is the authoritative source for the container’s real quota/usage.

Concrete scenarios this helps:

  • Grid jobs in containers (CVMFS/HTCondor/Singularity/Apptainer): cgroup v2 is increasingly common, and without cgroup support prmon misreports memory and I/O relative to enforced limits.
  • Kubernetes batch workflows: resource requests/limits are enforced by cgroups, so cgroup stats provide the correct signal for monitoring and post‑mortem accounting.
  • WLCG/HEP production pipelines: large‑scale job monitoring benefits from accurate cgroup I/O and memory data to avoid false alarms.

The implementation is automatic and safe: it detects v1/v2/hybrid, gracefully degrades when cgroups are absent, and can be disabled with --disable cgroupmon if not desired. The aim is to make prmon accurate in modern container environments while remaining backward‑compatible.
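
For readers who want a concrete picture of graceful degradation, the following Python sketch shows one way to parse the stat files so that a missing or unreadable file yields an empty result instead of an error. It is illustrative only and is not the C++ code in this PR; the function names `parse_flat_keyed` and `parse_io_stat` are hypothetical. It assumes the flat `key value` line format of `cpu.stat` and `memory.stat`, and the v2 `io.stat` format of a device ID followed by `key=value` fields.

```python
from pathlib import Path
from typing import Dict


def parse_flat_keyed(path: str) -> Dict[str, int]:
    """Parse 'key value' lines (cpu.stat, memory.stat); {} if unreadable."""
    try:
        text = Path(path).read_text()
    except OSError:
        # File absent or not readable: report nothing rather than fail.
        return {}
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def parse_io_stat(path: str) -> Dict[str, int]:
    """Sum per-device counters from a v2 io.stat file; {} if unreadable."""
    try:
        text = Path(path).read_text()
    except OSError:
        return {}
    totals: Dict[str, int] = {}
    for line in text.splitlines():
        # Each line looks like: "8:0 rbytes=4096 wbytes=0 rios=1 wios=0"
        for field in line.split()[1:]:
            key, sep, value = field.partition("=")
            if sep and value.isdigit():
                totals[key] = totals.get(key, 0) + int(value)
    return totals
```

Returning an empty dictionary lets the caller skip the metric for that cycle, so the same code path runs inside and outside a cgroup.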

I'd be glad to hear your thoughts, and I'm happy to adjust the implementation based on the maintainers' guidance.

@graeme-a-stewart
Member

Hello @DhanashreePetare, thank you for the explanations. @amete and I will look in more detail at the code and get back to you.
