OCPBUGS-70201: ctrcfg: set increase ulimits when upgrading from 4.20 to 4.21 #5516

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:release-4.20 from haircommander:ulimits-4.20
Jan 15, 2026

Conversation

@haircommander
Member

In CRI-O 1.33, a change (cri-o/cri-o#8962) was made to the default limits set for CRI-O. The nofile ulimit is now set much lower, with room to set it higher. However, some workloads don't expect this change and fail (see https://issues.redhat.com/browse/OCPBUGS-62095).

This was worked around temporarily in #5308, but that workaround was not intended to be carried into 4.21.

Instead, we should drop in an Ignition file on upgrades from 4.20 to 4.21 so that existing clusters don't get this change, but new clusters started in 4.21 do.

This was entirely based on #4715.
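
As a rough illustration of the drop-in described above, the following sketch renders a CRI-O configuration fragment that pins the default nofile ulimit and wraps it in the base64 data URL form used for Ignition file sources. The function names `crioUlimitDropIn` and `dataURL` are hypothetical and are not taken from the PR; only the file contents and the resulting encoding match what the verification output on this page shows.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// crioUlimitDropIn renders a CRI-O drop-in that sets the default nofile ulimit.
// Hypothetical helper for illustration; not the MCO implementation.
func crioUlimitDropIn(nofile uint64) string {
	return fmt.Sprintf("[crio]\n  [crio.runtime]\n    default_ulimits = [\"nofile=%d\"]\n", nofile)
}

// dataURL wraps file contents in a base64 data URL, the form that appears
// in the Source field of an Ignition file entry.
func dataURL(contents string) string {
	return "data:text/plain;charset=utf-8;base64," +
		base64.StdEncoding.EncodeToString([]byte(contents))
}

func main() {
	// 1048576 is the value visible in the verified drop-in file.
	fmt.Println(dataURL(crioUlimitDropIn(1048576)))
}
```

The MachineConfig would then carry this string as the file source, with the path pointing into /etc/crio/crio.conf.d/.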


@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Dec 23, 2025
@openshift-ci-robot
Contributor

@haircommander: This pull request references Jira Issue OCPBUGS-70201, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.z) matches configured target version for branch (4.20.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note text is set and does not match the template
  • dependent bug Jira Issue OCPBUGS-62327 is in the state Verified, which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA))
  • dependent Jira Issue OCPBUGS-62327 targets the "4.21.0" version, which is one of the valid target versions: 4.21.0
  • bug has dependents

Requesting review from QA contact:
/cc @lyman9966

The bug has been updated to refer to the pull request using the external bug tracker.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@haircommander
Member Author

failures seem the same as #5459

/skip

@haircommander
Member Author

/test e2e-aws-ovn


ctrl := &Controller{
	templatesDir: templatesDir,
	namespace:    namespace,
Member

thought (non-blocking): Will this ConfigMap always be in the MCO namespace? If so, could we instead reference ctrlcommon.MCONamespace?

Member Author

works for me, updated

const (
	componentName      = "machine-config-controller"
	componentNamespace = "openshift-machine-config-operator"
Member

thought (non-blocking): We have a constant MCONamespace in github.com/openshift/machine-config-operator/pkg/controller/common aka ctrlcommon.MCONamespace that could be used instead.


// Create the crio-default-ulimits MC for all the available pools
for _, pool := range mcpPoolsAll {
	if pool.Name != ctrlcommon.MachineConfigPoolMaster && pool.Name != ctrlcommon.MachineConfigPoolWorker {
Member

question (non-blocking): Why exclude other MachineConfigPools?

I hope that's not a silly question!

Member Author

I basically copied @sohankunkerkar on this one, which seems to have been inspired by @yuqi-zhang https://github.com/openshift/machine-config-operator/pull/4635/files#r1851191355

Member

Ahhh, that's a great point that I completely forgot about. This makes sense now. Thanks for pointing me to that!

Member

@QiWang19 QiWang19 left a comment

LGTM. One nit to fix the CI test.
We can get QE verified before merging.

const (
	componentName      = "machine-config-controller"
	componentNamespace = "openshift-machine-config-operator"
Member

We can drop the const definition since ctrlcommon.MCONamespace will be used

Member Author

done!

@haircommander haircommander force-pushed the ulimits-4.20 branch 2 times, most recently from 9c0db36 to ec6c8e2 January 15, 2026 16:44
In CRI-O 1.33, a change (cri-o/cri-o#8962) was made to the default limits set
for CRI-O. The nofile ulimit is now set much lower, with room to set it higher. However, some workloads
don't expect this change and fail (see https://issues.redhat.com/browse/OCPBUGS-62095).

This was worked around temporarily in openshift#5308,
but that workaround was not intended to be carried into 4.21.

Instead, we should drop in an Ignition file on upgrades from 4.20 to 4.21 so that existing clusters
don't get this change, but new clusters started in 4.21 do.

This was entirely based on openshift#4715.

Signed-off-by: Peter Hunt <[email protected]>
@cheesesashimi
Member

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 15, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 15, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi, haircommander

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 15, 2026
@haircommander
Member Author

[pehunt@fedora ~]
 $ oc describe mc 00-override-master-generated-crio-default-ulimits
Name:         00-override-master-generated-crio-default-ulimits
Namespace:    
Labels:       machineconfiguration.openshift.io/role=master
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfig
Metadata:
  Creation Timestamp:  2026-01-15T20:48:15Z
  Generation:          1
  Resource Version:    11130
  UID:                 dbe59da6-22ac-40cd-bd3f-bd288e672d5a
Spec:
  Base OS Extensions Container Image:  
  Config:
    Ignition:
      Version:  3.5.0
    Storage:
      Files:
        Contents:
          Compression:  
          Source:       data:text/plain;charset=utf-8;base64,W2NyaW9dCiAgW2NyaW8ucnVudGltZV0KICAgIGRlZmF1bHRfdWxpbWl0cyA9IFsibm9maWxlPTEwNDg1NzYiXQo=
        Mode:           420
        Overwrite:      true
        Path:           /etc/crio/crio.conf.d/01-ctrcfg-defaultUlimits
  Fips:                 false
  Kernel Arguments:     <nil>
  Kernel Type:          
  Os Image URL:         
Events:                 <none>
[pehunt@fedora ~]
 $ oc get nodes
NAME                                       STATUS   ROLES                  AGE   VERSION
ci-ln-r9n3pfk-72292-8lg5h-master-0         Ready    control-plane,master   28m   v1.33.6
ci-ln-r9n3pfk-72292-8lg5h-master-1         Ready    control-plane,master   28m   v1.33.6
ci-ln-r9n3pfk-72292-8lg5h-master-2         Ready    control-plane,master   28m   v1.33.6
ci-ln-r9n3pfk-72292-8lg5h-worker-a-lr869   Ready    worker                 10m   v1.33.6
ci-ln-r9n3pfk-72292-8lg5h-worker-b-6jnw8   Ready    worker                 10m   v1.33.6
ci-ln-r9n3pfk-72292-8lg5h-worker-c-68fdj   Ready    worker                 10m   v1.33.6
[pehunt@fedora ~]
 $ oc debug node/ci-ln-r9n3pfk-72292-8lg5h-worker-a-lr869 -- cat /host/etc/crio/crio.conf.d/01-ctrcfg-defaultUlimits
Starting pod/ci-ln-r9n3pfk-72292-8lg5h-worker-a-lr869-debug-qpscx ...
To use host binaries, run `chroot /host`. Instead, if you need to access host namespaces, run `nsenter -a -t 1`.
[crio]
  [crio.runtime]
    default_ulimits = ["nofile=1048576"]

Verified in a cluster launched via cluster-bot with: launch registry.build08.ci.openshift.org/ci-op-4nzlnit6/release gcp
/verified by @haircommander

@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Jan 15, 2026
@openshift-ci-robot
Contributor

@haircommander: This PR has been marked as verified by @haircommander.


@mrunalp mrunalp added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Jan 15, 2026
@openshift-ci
Contributor

openshift-ci bot commented Jan 15, 2026

@haircommander: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/bootstrap-unit | bcb18a0 | link | false | /test bootstrap-unit |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@haircommander
Member Author

/skip

@openshift-merge-bot openshift-merge-bot bot merged commit 930bff1 into openshift:release-4.20 Jan 15, 2026
15 checks passed
@openshift-ci-robot
Contributor

@haircommander: Jira Issue Verification Checks: Jira Issue OCPBUGS-70201
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-70201 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓


@openshift-merge-robot
Contributor

Fix included in accepted release 4.20.0-0.nightly-2026-01-16-181948

openshift-merge-bot bot pushed a commit to openshift/cincinnati-graph-data that referenced this pull request Jan 27, 2026
haircommander added a commit to haircommander/machine-config-operator that referenced this pull request Feb 4, 2026
In CRI-O 1.33, a change (cri-o/cri-o#9401) was made to the short name mode for CRI-O.
CRI-O now enforces short names: when an image is unqualified (a short name) and has an ambiguous
pull path (multiple different names returned), CRI-O fails to pull the image.

This may break users, however, so we shouldn't upgrade with it.

We should drop in an Ignition file on upgrades from 4.20 to 4.21 so that existing clusters
don't get this change, but new clusters started in 4.21 do.

This was entirely based on openshift#5516, which has its own inspiration history.

Signed-off-by: Peter Hunt <[email protected]>
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
lgtm: Indicates that a PR is ready to be merged.
verified: Signifies that the PR passed pre-merge verification criteria.
