
Conversation

MarSik
Contributor

@MarSik MarSik commented Jul 28, 2025

This series adds a systemd service to the tuned system. The service runs in a CPU-manager-like reconcile loop: it monitors all pod cgroups' memory and sets limits on kubepods-burstable and kubepods-besteffort to protect the system from runaway pods that have no limits of their own.

kubepods-burstable limit = node allocatable memory - sum(memory limits of all guaranteed containers)
kubepods-besteffort limit = kubepods-burstable limit - kubepods-burstable current consumption

Note: The boilerplate code was generated by the claude-4-sonnet AI model as interpreted by the Cursor IDE. The Python service is my own.
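The two formulas can be sketched as a small piece of Python close in spirit to what the description says the service does. The cgroup path is the standard kubepods cgroup v2 layout; the function and variable names here are illustrative, not the actual service code:

```python
# Illustrative sketch of the limit calculation described above.
# The slice path follows the standard kubepods cgroup v2 layout;
# helper names are hypothetical, not taken from the PR.
import os.path

ROOT = "/sys/fs/cgroup/kubepods.slice"

def read_int(path):
    """Read a single integer from a cgroup v2 control file."""
    with open(path) as f:
        return int(f.read().strip())

def compute_limits(root_max, guaranteed_sum, burstable_current):
    # kubepods-burstable limit = node allocatable - sum(guaranteed limits)
    burstable_max = root_max - guaranteed_sum
    # kubepods-besteffort limit = burstable limit - burstable current usage
    besteffort_max = burstable_max - burstable_current
    return burstable_max, besteffort_max
```

A reconcile loop would re-read the inputs and re-apply the two results on every tick, which is what makes this safe against pods whose consumption drifts over time.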

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 28, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 28, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jul 28, 2025

@MarSik: This pull request references CNF-16642 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "4.20.0" version, but no target version was set.

In response to this:

This series adds a systemd service to the tuned system. The service runs in a CPU-manager-like reconcile loop: it monitors all pod cgroups' memory and sets limits on kubepods-burstable and kubepods-besteffort to protect the system from runaway pods that have no limits of their own.

kubepods-burstable limit = node allocatable memory - sum(memory limits of all guaranteed containers)
kubepods-besteffort limit = kubepods-burstable limit - kubepods-burstable current consumption

Note: The boilerplate code was generated by the claude-4-sonnet AI model as interpreted by the Cursor IDE. The Python service is my own.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from jmencak and swatisehgal July 28, 2025 12:01
Contributor

openshift-ci bot commented Jul 28, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MarSik

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 28, 2025
@MarSik
Contributor Author

MarSik commented Jul 28, 2025

/retitle CNF-18922: Dynamic memory enforcement

@openshift-ci openshift-ci bot changed the title WIP: CNF-16642: Dynamic memory enforcement CNF-18922: Dynamic memory enforcement Jul 28, 2025
@openshift-ci-robot
Contributor

openshift-ci-robot commented Jul 28, 2025

@MarSik: This pull request references CNF-18922 which is a valid jira issue.

In response to this:

This series adds a systemd service to the tuned system. The service runs in a CPU-manager-like reconcile loop: it monitors all pod cgroups' memory and sets limits on kubepods-burstable and kubepods-besteffort to protect the system from runaway pods that have no limits of their own.

kubepods-burstable limit = node allocatable memory - sum(memory limits of all guaranteed containers)
kubepods-besteffort limit = kubepods-burstable limit - kubepods-burstable current consumption

Note: The boilerplate code was generated by the claude-4-sonnet AI model as interpreted by the Cursor IDE. The Python service is my own.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 28, 2025
@MarSik
Contributor Author

MarSik commented Jul 28, 2025

/retitle WIP: CNF-18922: Dynamic memory enforcement

@openshift-ci openshift-ci bot changed the title CNF-18922: Dynamic memory enforcement WIP: CNF-18922: Dynamic memory enforcement Jul 28, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 28, 2025
@MarSik
Contributor Author

MarSik commented Jul 29, 2025

/retest-required

@jmencak
Contributor

jmencak commented Jul 29, 2025

So this is a quick workaround until kubernetes/enhancements/issues/5460 gets resolved, right?

@MarSik
Contributor Author

MarSik commented Jul 30, 2025

@jmencak A workaround and demonstrator. Yep.

@MarSik MarSik force-pushed the feat-memory-enforcement branch from f56b246 to 6dd716e on July 31, 2025 13:23
@MarSik
Contributor Author

MarSik commented Aug 4, 2025

/retest-required

Contributor

openshift-ci bot commented Aug 4, 2025

@MarSik: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@MarSik
Contributor Author

MarSik commented Aug 12, 2025

/relabel CNF-18922: Dynamic memory enforcement

import os.path

ROOT = "/sys/fs/cgroup/kubepods.slice"
ROOT_MAX = os.path.join(ROOT, "memory.max")
Contributor

@Tal-or Tal-or Aug 13, 2025


The baseline, the entry point from which we perform all the calculations, is /sys/fs/cgroup/kubepods.slice/memory.max.

Does this value already have the system reserved memory subtracted from it?

Contributor Author


msivak@msivak-thinkpadx1carbongen12 ~> oc describe node testnode | grep -i 'memory:'
memory: 130895396Ki # capacity
memory: 129768996Ki # allocatable

sh-5.1# cat /sys/fs/cgroup/kubepods.slice/memory.max
132988309504
msivak@msivak-thinkpadx1carbongen12 ~> expr 132988309504 / 1024
129871396

I would say: Yes, it already does.
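The session above can be turned into a quick sanity check: kubepods.slice memory.max should land between node allocatable and node capacity, which is what indicates the reserved memory has already been subtracted. The numbers below are the ones observed on the test node above:

```python
# Values copied from the oc/cgroup output above, in KiB.
capacity_kib = 130895396
allocatable_kib = 129768996
memory_max_kib = 132988309504 // 1024  # same conversion as the expr call

# kubepods.slice memory.max sits between allocatable and capacity,
# so system/kube reserved memory is already carved out of capacity.
assert allocatable_kib <= memory_max_kib <= capacity_kib
```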

if profilecomponent.IsEnforceReservedMemoryEnabled(profile) {
// Add dynamic memory enforcement service only if annotation is enabled and workload partitioning is enabled
clusterHasWorkloadPartitioning := opts.PinningMode != nil && *opts.PinningMode == apiconfigv1.CPUPartitioningAllNodes
if profilecomponent.IsEnforceReservedMemoryEnabled(profile) && clusterHasWorkloadPartitioning {
Contributor

@Tal-or Tal-or Aug 13, 2025


I think we should also check that the MemoryManager is configured to Static. That should be implied when the TopologyManager is configured to single-numa-node or restricted, but we're not checking it in this flow.

Contributor Author


Why? This has nothing to do with NUMA or logical memory tracking. The code enforces the runtime values only.

Contributor


But the whole point was to avoid consuming the reserved memory used for housekeeping tasks, right?

If the TopologyManager is not configured to single-numa-node/restricted, the memory manager won't be enabled, and it is the memory manager that requires configuring reserved memory for the system.

Contributor

@Tal-or Tal-or Aug 13, 2025


On second thought, in the worst case the script will still work, it just won't do anything useful.

burstable_current = int(open(BURSTABLE_CURRENT).read().strip())
guaranteed_current = guaranteed_max()

burstable_max = root_max - guaranteed_current
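Completing the step in the snippet above, the besteffort limit follows from the freshly computed burstable one. A minimal sketch of that last step, assuming the standard kubepods slice layout; the clamp to zero and the writer helper are my additions, not necessarily the PR's code:

```python
# Illustrative continuation of the calculation under review.
# The slice paths are the standard kubepods cgroup v2 layout;
# the helper names are hypothetical.
import os.path

ROOT = "/sys/fs/cgroup/kubepods.slice"

def besteffort_limit(burstable_max, burstable_current):
    # kubepods-besteffort = burstable limit - burstable current usage,
    # clamped so an overshooting burstable slice never produces a
    # negative limit.
    return max(burstable_max - burstable_current, 0)

def write_limit(slice_name, limit_bytes):
    """Write memory.max for a kubepods child slice (illustrative)."""
    path = os.path.join(ROOT, slice_name, "memory.max")
    with open(path, "w") as f:
        f.write(str(limit_bytes))
```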
Contributor Author


I need to double check that memory reserved for workload partitioning is subtracted here as well.
