OCPBUGS-37982: Bug fix: Reduce Frequency of Update Requests for Copied CSVs #3497

Open · wants to merge 6 commits into master
5 changes: 3 additions & 2 deletions pkg/controller/operators/olm/operator.go
@@ -8,8 +8,6 @@ import (
"sync"
"time"

"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/labeller"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/olm/plugins"
"github.com/sirupsen/logrus"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
@@ -42,11 +40,14 @@ import (

operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
"github.com/operator-framework/api/pkg/operators/v1alpha1"

"github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/clientset/versioned"
"github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/informers/externalversions"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/certs"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/install"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/labeller"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/olm/overrides"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/operators/olm/plugins"
"github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver"
"github.com/operator-framework/operator-lifecycle-manager/pkg/lib/clients"
csvutility "github.com/operator-framework/operator-lifecycle-manager/pkg/lib/csv"
25 changes: 20 additions & 5 deletions pkg/controller/operators/olm/operator_test.go
@@ -20,7 +20,6 @@ import (

"github.com/google/go-cmp/cmp"
configfake "github.com/openshift/client-go/config/clientset/versioned/fake"
hashutil "github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubernetes/pkg/util/hash"
"github.com/sirupsen/logrus"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
@@ -44,6 +43,7 @@ import (
metadatafake "k8s.io/client-go/metadata/fake"
"k8s.io/client-go/pkg/version"
"k8s.io/client-go/rest"
clienttesting "k8s.io/client-go/testing"
"k8s.io/client-go/tools/cache"
"k8s.io/client-go/tools/record"
apiregistrationv1 "k8s.io/kube-aggregator/pkg/apis/apiregistration/v1"
@@ -54,7 +54,6 @@ import (
operatorsv1 "github.com/operator-framework/api/pkg/operators/v1"
"github.com/operator-framework/api/pkg/operators/v1alpha1"
opregistry "github.com/operator-framework/operator-registry/pkg/registry"
clienttesting "k8s.io/client-go/testing"

"github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/clientset/versioned"
"github.com/operator-framework/operator-lifecycle-manager/pkg/api/client/clientset/versioned/fake"
@@ -64,6 +63,7 @@ import (
resolvercache "github.com/operator-framework/operator-lifecycle-manager/pkg/controller/registry/resolver/cache"
"github.com/operator-framework/operator-lifecycle-manager/pkg/lib/clientfake"
csvutility "github.com/operator-framework/operator-lifecycle-manager/pkg/lib/csv"
hashutil "github.com/operator-framework/operator-lifecycle-manager/pkg/lib/kubernetes/pkg/util/hash"
"github.com/operator-framework/operator-lifecycle-manager/pkg/lib/labeler"
"github.com/operator-framework/operator-lifecycle-manager/pkg/lib/operatorclient"
"github.com/operator-framework/operator-lifecycle-manager/pkg/lib/operatorlister"
@@ -5050,7 +5050,12 @@ func TestSyncOperatorGroups(t *testing.T) {
},
targetNamespace: {
withLabels(
withAnnotations(targetCSV.DeepCopy(), map[string]string{operatorsv1.OperatorGroupAnnotationKey: "operator-group-1", operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace}),
withAnnotations(targetCSV.DeepCopy(), map[string]string{
operatorsv1.OperatorGroupAnnotationKey: "operator-group-1",
operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace,
"olm.operatorframework.io/nonStatusCopyHash": "9ZxHAHQTkxzAQd7Qkk4Qjz3VAkA8lXwuX9mDX6",
"olm.operatorframework.io/statusCopyHash": "bedtcmN999WBSJ1RHvM7JfN2NJITrUjJ0g0MoH",
}),
labels.Merge(targetCSV.GetLabels(), map[string]string{v1alpha1.CopiedLabelKey: operatorNamespace}),
),
&rbacv1.Role{
@@ -5155,7 +5160,12 @@ func TestSyncOperatorGroups(t *testing.T) {
},
targetNamespace: {
withLabels(
withAnnotations(targetCSV.DeepCopy(), map[string]string{operatorsv1.OperatorGroupAnnotationKey: "operator-group-1", operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace}),
withAnnotations(targetCSV.DeepCopy(), map[string]string{
operatorsv1.OperatorGroupAnnotationKey: "operator-group-1",
operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace,
"olm.operatorframework.io/nonStatusCopyHash": "9ZxHAHQTkxzAQd7Qkk4Qjz3VAkA8lXwuX9mDX6",
"olm.operatorframework.io/statusCopyHash": "bedtcmN999WBSJ1RHvM7JfN2NJITrUjJ0g0MoH",
}),
labels.Merge(targetCSV.GetLabels(), map[string]string{v1alpha1.CopiedLabelKey: operatorNamespace}),
),
&rbacv1.Role{
@@ -5312,7 +5322,12 @@ func TestSyncOperatorGroups(t *testing.T) {
},
targetNamespace: {
withLabels(
withAnnotations(targetCSV.DeepCopy(), map[string]string{operatorsv1.OperatorGroupAnnotationKey: "operator-group-1", operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace}),
withAnnotations(targetCSV.DeepCopy(), map[string]string{
operatorsv1.OperatorGroupAnnotationKey: "operator-group-1",
operatorsv1.OperatorGroupNamespaceAnnotationKey: operatorNamespace,
"olm.operatorframework.io/nonStatusCopyHash": "9ZxHAHQTkxzAQd7Qkk4Qjz3VAkA8lXwuX9mDX6",
"olm.operatorframework.io/statusCopyHash": "bedtcmN999WBSJ1RHvM7JfN2NJITrUjJ0g0MoH",
}),
labels.Merge(targetCSV.GetLabels(), map[string]string{v1alpha1.CopiedLabelKey: operatorNamespace}),
),
},
69 changes: 65 additions & 4 deletions pkg/controller/operators/olm/operatorgroup.go
@@ -360,7 +360,7 @@ func (a *Operator) pruneProvidedAPIs(group *operatorsv1.OperatorGroup, groupProv
}

// Prune providedAPIs annotation if the cluster has fewer providedAPIs (handles CSV deletion)
//if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
// if intersection := groupProvidedAPIs.Intersection(providedAPIsFromCSVs); len(intersection) < len(groupProvidedAPIs) {
if len(intersection) < len(groupProvidedAPIs) {
difference := groupProvidedAPIs.Difference(intersection)
logger := logger.WithFields(logrus.Fields{
@@ -790,6 +790,14 @@ func copyableCSVHash(original *v1alpha1.ClusterServiceVersion) (string, string,
return newHash, originalHash, nil
}

const (
nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
statusCopyHashAnnotation = "olm.operatorframework.io/statusCopyHash"
// annotations for metadata drift guard
observedGenerationAnnotation = "olm.operatorframework.io/observedGeneration"
observedResourceVersionAnnotation = "olm.operatorframework.io/observedResourceVersion"
)

// If returned error is not nil, the returned ClusterServiceVersion
// has only the Name, Namespace, and UID fields set.
func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, nsFrom, nsTo, nonstatus, status string) (*v1alpha1.ClusterServiceVersion, error) {
@@ -803,6 +811,7 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns

existing, err := a.copiedCSVLister.Namespace(nsTo).Get(prototype.GetName())
if apierrors.IsNotFound(err) {
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
Contributor:

Because copyToNamespace is called in a loop, prototype, being a pointer, is reused multiple times. Which means that these annotations may already be set. Is there any reason why these annotations simply aren't set in ensureCSVsInNamespaces(), where the hashes are calculated?

Contributor Author:

Good point, possibly. Checking...

Contributor Author:

So looking at it closer it seems like we shouldn't change it, here's my reasoning:

Keeping the annotation logic here, in copyToNamespace(), encapsulates the update semantics so each call handles its own CSV's state reliably.

We're reusing prototype and accounting for possibly-set annotations. If we move the logic to ensureCSVsInNamespaces(), we'll have to duplicate the annotation-checking logic, because the logic for handling those annotations is tightly coupled with the CSV’s create/update lifecycle.

In copyToNamespace() we need to:
• Distinguish between a new creation (where the annotations don’t exist yet) and an update (where the annotations might already be set but could be outdated).
• Apply the updates in a specific order (first updating the non-status hash, then the status hash, including a status update to avoid mismatches).
• Ensure that each target CSV reflects the current state as expected for that specific namespace.

Aside from the hash handling we'd still need to be doing the above work in copyToNamespace()
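The create-vs-update decision this thread debates hinges on comparing the hash annotations recorded on an existing copy against freshly computed ones. A minimal, hypothetical Go sketch of that comparison (the annotation keys match the diff; `needsUpdate` and everything else here is illustrative, not the real `copyToNamespace()` code):

```go
package main

import "fmt"

// Annotation keys as introduced in this PR's diff.
const (
	nonStatusCopyHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
	statusCopyHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// needsUpdate reports which parts of an existing copied CSV are stale,
// given the hashes freshly computed from the prototype. A missing
// annotation never matches, so it forces an update.
func needsUpdate(existing map[string]string, nonstatus, status string) (specStale, statusStale bool) {
	specStale = existing[nonStatusCopyHashAnnotation] != nonstatus
	statusStale = existing[statusCopyHashAnnotation] != status
	return specStale, statusStale
}

func main() {
	annotations := map[string]string{
		nonStatusCopyHashAnnotation: "abc",
		statusCopyHashAnnotation:    "old",
	}
	spec, stat := needsUpdate(annotations, "abc", "new")
	fmt.Println(spec, stat) // false true: spec hash matches, status hash is stale
}
```

When both values come back false, the sync loop can skip the copy entirely, which is where the reduction in update requests comes from.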

created, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Create(context.TODO(), prototype, metav1.CreateOptions{})
if err != nil {
return nil, fmt.Errorf("failed to create new CSV: %w", err)
Expand All @@ -811,6 +820,10 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns
if _, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).UpdateStatus(context.TODO(), created, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update status on new CSV: %w", err)
}
prototype.Annotations[statusCopyHashAnnotation] = status
if _, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Update(context.TODO(), prototype, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update annotations after updating status: %w", err)
}
return &v1alpha1.ClusterServiceVersion{
ObjectMeta: metav1.ObjectMeta{
Name: created.Name,
Expand All @@ -821,15 +834,53 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns
} else if err != nil {
return nil, err
}
// metadata drift guard: detect manual modifications to spec or status
if og, orv := existing.Annotations[observedGenerationAnnotation], existing.Annotations[observedResourceVersionAnnotation]; (og != "" && og != fmt.Sprint(existing.GetGeneration())) || (orv != "" && orv != existing.ResourceVersion) {
// full resync for metadata drift
// prepare prototype for update
prototype.Namespace = existing.Namespace
prototype.ResourceVersion = existing.ResourceVersion
prototype.UID = existing.UID
// sync hash annotations
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
prototype.Annotations[statusCopyHashAnnotation] = status
Collaborator:

should we update the status hash post status update, i.e. with the observed resource version and generation annotations (for the same reason described in line 897)?

Contributor Author:

Done in soon-to-be-pushed commit 4b21aa6e.
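The drift guard being discussed records the last-observed generation and resourceVersion in annotations; a mismatch on a later sync means the copied CSV was modified out-of-band, triggering a full resync. A hypothetical, simplified sketch of the detection condition (annotation keys are from the diff; `driftDetected` itself is illustrative):

```go
package main

import (
	"fmt"
	"strconv"
)

// Metadata drift guard annotation keys, as added in this PR's diff.
const (
	observedGenerationAnnotation      = "olm.operatorframework.io/observedGeneration"
	observedResourceVersionAnnotation = "olm.operatorframework.io/observedResourceVersion"
)

// driftDetected reports whether the copied CSV's current generation or
// resourceVersion no longer matches what was recorded after the last
// write. Empty annotations mean the guard has not recorded anything yet,
// so they do not trip it.
func driftDetected(annotations map[string]string, generation int64, resourceVersion string) bool {
	og := annotations[observedGenerationAnnotation]
	orv := annotations[observedResourceVersionAnnotation]
	return (og != "" && og != strconv.FormatInt(generation, 10)) ||
		(orv != "" && orv != resourceVersion)
}

func main() {
	recorded := map[string]string{
		observedGenerationAnnotation:      "3",
		observedResourceVersionAnnotation: "1001",
	}
	// A manual edit bumped the generation to 4: the guard trips.
	fmt.Println(driftDetected(recorded, 4, "1002")) // true
}
```

Recording the guard annotations only after the status update (as the reviewer asks) matters because each write bumps the resourceVersion; recording too early would make the guard trip on the operator's own writes.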

// update spec and annotations
updated, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Update(context.TODO(), prototype, metav1.UpdateOptions{})
if err != nil {
return nil, fmt.Errorf("failed to resync spec for metadata drift guard: %w", err)
}
// update status subresource
updated.Status = prototype.Status
if _, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).UpdateStatus(context.TODO(), updated, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to resync status for metadata drift guard: %w", err)
}
// record observed generation and resourceVersion
updated.Annotations[observedGenerationAnnotation] = fmt.Sprint(updated.GetGeneration())
updated.Annotations[observedResourceVersionAnnotation] = updated.ResourceVersion
if _, err := a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Update(context.TODO(), updated, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update metadata guard annotations: %w", err)
}
return &v1alpha1.ClusterServiceVersion{
ObjectMeta: metav1.ObjectMeta{
Name: updated.Name,
Namespace: updated.Namespace,
UID: updated.UID,
},
}, nil
}

prototype.Namespace = existing.Namespace
prototype.ResourceVersion = existing.ResourceVersion
prototype.UID = existing.UID
existingNonStatus := existing.Annotations["$copyhash-nonstatus"]
existingStatus := existing.Annotations["$copyhash-status"]
// Get the non-status and status hash of the existing copied CSV
existingNonStatus := existing.Annotations[nonStatusCopyHashAnnotation]
existingStatus := existing.Annotations[statusCopyHashAnnotation]

var updated *v1alpha1.ClusterServiceVersion
// Always set the in-memory prototype's nonstatus annotation:
prototype.Annotations[nonStatusCopyHashAnnotation] = nonstatus
if existingNonStatus != nonstatus {
// include updates to the non-status hash annotation if there is a mismatch
if updated, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Update(context.TODO(), prototype, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update: %w", err)
}
Expand All @@ -843,6 +894,17 @@ func (a *Operator) copyToNamespace(prototype *v1alpha1.ClusterServiceVersion, ns
if _, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).UpdateStatus(context.TODO(), updated, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update status: %w", err)
}
// Update the status first if the existing copied CSV status hash doesn't match what we expect
// to prevent a scenario where the hash annotations match but the contents do not.
// We also need to update the CSV itself in this case to ensure we set the status hash annotation.
prototype.Annotations[statusCopyHashAnnotation] = status
if updated, err = a.client.OperatorsV1alpha1().ClusterServiceVersions(nsTo).Update(context.TODO(), prototype, metav1.UpdateOptions{}); err != nil {
return nil, fmt.Errorf("failed to update: %w", err)
}
} else {
// Even if they're the same, ensure the returned prototype is annotated.
prototype.Annotations[statusCopyHashAnnotation] = status
updated = prototype
}
Contributor (@camilamacedo86, Feb 12, 2025):

From the code implemented in this PR to the current state, the main addition seems to be this else block (beyond tests).

I’m not entirely sure I fully understand—are we also looking to implement what’s outlined in the Proposed Fixes section of this document? How/where are we addressing the concerns raised in the: Why don’t we just merge the [fix PR](https://github.com/operator-framework/operator-lifecycle-manager/pull/3411) as-is? section?

Contributor Author:

This is a first pass: basically, just merge the old PR. With this PR we're taking path #4 in the scoping doc: merge the PR with some possible problems, which should affect only a minor use case (users changing the copied CSVs).

But the else block is not the only thing done here; the main addition is the tracking hashes, so we can tell what needs updating.

Contributor (@camilamacedo86, Mar 7, 2025):

this is first pass, basically, just merge the old PR.

We previously agreed that the old PR wasn't quite the right approach, correct? Given that, I’m not sure it makes sense to merge it as-is.

If we need to do a release before the proper solution is in place, we might ship a change we don’t want, which doesn’t seem ideal to me. This is a case where I would request changes, since it doesn’t provide the desired solution or fix the problem as defined in the doc. Note that the doc has a section about this: “Why don’t we just merge the PR as it is?”

It’s fine to add it as you did, but what do you think about creating a commit on top with the solution we intend to use? Could we focus on implementing the correct fix for the bug instead?
Is there any reason we need to merge the old PR without the correct fix? Can we not do that as proposed?

c/c @tmshort

Contributor Author:

Please see @tmshort comment on the doc. Idea being, merging this PR is a first step, it gives some relief, then we'll make another pass after this settles. Settling involves seeing much less API activity, especially on clusters with many namespaces for the CSV to be copied to.

Contributor Author:

and settling also involves seeing if the things mentioned in the doc, primarily whether OLM not correcting user-modified copied CSVs, will be a real-world problem.

Collaborator:

My understanding is we have two problems:

  1. spamming logs and api server requests
  2. changes to copied csvs won't be detected and therefore will linger

The proposed approach is to separate these two problems by resolving 1 (which has a material impact on the cluster and customer's bottom line) and later handling 2.

It's my understanding that changes to copied csvs don't carry any behavioral changes in the system anyway. They only exist to make it possible to discover which operators are available in a particular namespace with a kubectl command. Also, I'd assume that write access to CSVs will be restricted in most real world cases to the cluster admin and the namespace admin. If these two assumptions hold, I think the blast radius of modifying the copied CSV and not having it reconcile back to the intended spec should be pretty small. So, I tend to agree with the approach here. Let's address the big problem of api server/log spamming, then worry about the relatively small problem of inconsistent copied csvs.
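To make problem 1's payoff concrete, here is a hypothetical back-of-the-envelope model of API write requests per copied CSV per sync under the hash scheme. The counts are inferred from the diff above (one spec/annotation update for a non-status mismatch; a status update plus a follow-up annotation update for a status mismatch) and are illustrative, not actual OLM code:

```go
package main

import "fmt"

// writesPerSync models how many API write requests one copied CSV costs
// per sync, depending on which hash comparisons come back stale.
func writesPerSync(specStale, statusStale bool) int {
	switch {
	case statusStale:
		// UpdateStatus plus the follow-up Update that records the
		// status hash annotation...
		writes := 2
		if specStale {
			writes++ // ...preceded by an Update for the non-status hash
		}
		return writes
	case specStale:
		return 1 // single Update carrying the new non-status hash
	default:
		return 0 // steady state: no API requests at all
	}
}

func main() {
	// In steady state every sync is free, which is the whole point of
	// the PR on clusters with many target namespaces.
	fmt.Println(writesPerSync(false, false)) // 0
}
```

Before this change, every sync unconditionally issued writes for every copied CSV in every target namespace; with the hashes, steady-state syncs cost zero requests.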

return &v1alpha1.ClusterServiceVersion{
ObjectMeta: metav1.ObjectMeta{
@@ -939,7 +1001,6 @@ func namespacesChanged(clusterNamespaces []string, statusNamespaces []string) bo

func (a *Operator) getOperatorGroupTargets(op *operatorsv1.OperatorGroup) (map[string]struct{}, error) {
selector, err := metav1.LabelSelectorAsSelector(op.Spec.Selector)

if err != nil {
return nil, err
}