Skip to content

Conversation

mergify[bot]
Copy link
Contributor

@mergify mergify bot commented Feb 10, 2025

What does this PR do?

This PR refactors the implementation to eliminate logical race conditions of the kubernetes_secrets provider alongside with brand new unit-tests.

Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis, it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of different actors managing the secret lifecycle.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:

  • Using a Kubernetes reflector (watch-based mechanism) is not an option because it would require listing and watching all secrets, which is often a non-starter for users.
  • Instead, we must maintain our own caching mechanism that periodically refreshes only the referenced Kubernetes secrets.

With this in mind, the provider behaviour is now as follows:

Cache Disabled Mode:

  • When caching is disabled, the provider simply reads secrets directly from the Kubernetes API server.

Cache Enabled Mode:

  • When caching is enabled, the provider stores secrets in a cache where entries expire based on the configured TTL (time-to-live) and a lastAccess field of each cache entry.
  • The provider has two primary actors: cache actor and fetch actor, each with well-defined responsibilities.

Cache Actor Responsibilities:

  1. Signal expiration of items: When a secret expires, the cache actor signals that a fetch should occur to reinsert the key into the cache, ensuring continued refreshing.
  2. Detect secret updates and signal changes: When the cache actor detects a secret value change, it signals the ContextProviderComm.
  3. Conditionally update lastAccess:
    • If the secret has changed, update lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
    • In any other case, do not update lastAccess and let the entry "age" as it should.

Fetch Actor Responsibilities:

  1. Retrieve secrets from the cache:
    • If present, return the value.
    • If missing, fetch from the Kubernetes API.
  2. Insert fetched secrets into the cache if there isn't a more recent version of the secret already in it (can happen by the cache actor or a parallel fetch actor).
  3. Always update lastAccess when an entry is accessed to prevent unintended expiration.

Considerations:

  • No global locks: Store operations are the only part of the critical path, ensuring that neither cache nor fetch actors block each other.
  • Conditional updates: Since cache state can change between the time an actor reads and writes, all updates use conditional store operations that are part of the critical path.
  • Custom store implementation: The existing ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the aforementioned conditional insertion required to handle these interactions correctly.
  • Optimized memory management: The prior implementation copied the cache map on every update to prevent Golang map bucket retention. However, I believe this was a misunderstanding of Golang internals and a premature optimisation. If needed in the future, this can be revisited in a more controlled manner.

PS: as the main changes of this PR are captured by the commit a549728, I consider this PR to be aligned with the Pull Requests policy

Why is it important?

This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

  • Secrets do not expire prematurely due to logical race conditions.
  • Updates are properly signaled to consuming components.
  • Performance is optimised with minimal locking and unnecessary memory allocations.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the change-log tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally

  1. Run unit tests to validate the new caching behaviour:
    go test ./internal/pkg/composable/providers/kubernetessecrets/...

Related issues


This is an automatic backport of pull request #6623 done by [Mergify](https://mergify.com).

* fix: refactor kubernetes_secrets provider to eliminate race conditions

* fix: add changelog fragment and unit-tests for kubernetes_secrets provider

* fix: replace RWMutex with Mutex

* fix: rename newExpirationStore to newExpirationCache

* fix: introduce kubernetes_secrets provider name as a const

* fix: extend AddConditionally doc string to describe the case of condition is nil

* fix: gosec lint

(cherry picked from commit 6d4b91c)
@mergify mergify bot requested a review from a team as a code owner February 10, 2025 20:42
@mergify mergify bot added the backport label Feb 10, 2025
@mergify mergify bot requested review from michalpristas and pchila and removed request for a team February 10, 2025 20:42
@github-actions github-actions bot added bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Feb 10, 2025
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

Copy link

@pkoutsovasilis pkoutsovasilis merged commit d80d79d into 9.0 Feb 11, 2025
16 checks passed
@pkoutsovasilis pkoutsovasilis deleted the mergify/bp/9.0/pr-6623 branch February 11, 2025 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport bug Something isn't working Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants