[k8s] Fix logical race conditions in kubernetes_secrets provider #6623

Open

pkoutsovasilis wants to merge 2 commits into main from k8s/secret_provider_cache_tmp

Conversation

@pkoutsovasilis (Contributor) commented on Jan 29, 2025

What does this PR do?

This PR refactors the kubernetes_secrets provider to eliminate its logical race conditions and adds brand-new unit tests.

Initially, the issue seemed to stem from misuse or lack of synchronisation primitives, but after deeper analysis, it became evident that the "race" conditions were logical rather than concurrency-related. The existing implementation was structured in a way that led to inconsistencies due to overlapping responsibilities of different actors managing the secret lifecycle.

To address this, I restructured the logic while keeping in mind the constraints of the existing provider, specifically:

  • Using a Kubernetes reflector (watch-based mechanism) is not an option because it would require listing and watching all secrets, which is often a non-starter for users.
  • Instead, we must maintain our own caching mechanism that periodically refreshes only the referenced Kubernetes secrets.

With this in mind, the provider behaviour is now as follows:

Cache Disabled Mode:

  • When caching is disabled, the provider simply reads secrets directly from the Kubernetes API server, as in the sketch below.
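
For illustration, a minimal sketch of such a direct, cache-less read using client-go; the function name and parameters are assumptions for this example, not the provider's actual code:

```go
package example

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fetchSecretValue reads a single key of a single Secret straight from the
// API server, which is essentially the cache-disabled behaviour.
func fetchSecretValue(ctx context.Context, client kubernetes.Interface, namespace, name, key string) (string, error) {
	secret, err := client.CoreV1().Secrets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("fetching secret %s/%s: %w", namespace, name, err)
	}
	value, ok := secret.Data[key]
	if !ok {
		return "", fmt.Errorf("key %q not found in secret %s/%s", key, namespace, name)
	}
	return string(value), nil
}
```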

Cache Enabled Mode:

  • When caching is enabled, the provider stores secrets in a cache whose entries expire based on the configured TTL (time-to-live) and a lastAccess field on each cache entry (see the entry sketch after this list).
  • The provider has two primary actors: cache actor and fetch actor, each with well-defined responsibilities.
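
To make the mechanics concrete, here is a minimal sketch of what such an entry and its expiry check could look like; the field names and types are assumptions for illustration, not the provider's actual implementation:

```go
package example

import "time"

// cacheEntry is an illustrative shape for one cached secret.
type cacheEntry struct {
	value      string    // resolved secret value
	apiVersion string    // resourceVersion of the Secret the value came from
	lastAccess time.Time // bumped on reads (and on detected changes) to drive expiry
}

// expired reports whether an entry has outlived the configured TTL
// without being accessed.
func expired(e cacheEntry, ttl time.Duration, now time.Time) bool {
	return now.Sub(e.lastAccess) > ttl
}
```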

Cache Actor Responsibilities:

  1. Signal expiration of items: When a secret expires, the cache actor signals that a fetch should occur to reinsert the key into the cache, ensuring continued refreshing.
  2. Detect secret updates and signal changes: When the cache actor detects a secret value change, it signals the ContextProviderComm.
  3. Conditionally update lastAccess:
    • If the secret has changed, update lastAccess to prevent premature expiration and give the fetch actor time to pick up the new value.
    • In any other case, do not update lastAccess and let the entry "age" as it should.
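
Put together, the per-entry decision the cache actor makes on a refresh tick could be sketched as follows, reusing the hypothetical cacheEntry from the earlier sketch; removal from the store and the actual signalling are left to the caller:

```go
package example

import "time"

// onRefreshTick decides what the cache actor should do for one entry after
// re-reading its secret from the API server. It reports whether a change
// should be signalled to the ContextProviderComm and whether the entry has
// expired so that a re-fetch should be requested. Sketch only; the real
// provider wires this to its store and communication channel.
func onRefreshTick(entry *cacheEntry, fetchedValue string, ttl time.Duration, now time.Time) (signalChange, signalExpired bool) {
	if now.Sub(entry.lastAccess) > ttl {
		// Expired: the caller removes the entry and signals that a fetch
		// should occur so the key is re-inserted and keeps refreshing.
		return false, true
	}
	if fetchedValue != entry.value {
		entry.value = fetchedValue
		entry.lastAccess = now // keep the entry alive until the fetch actor picks up the change
		return true, false
	}
	// Unchanged value: leave lastAccess untouched so idle entries age out.
	return false, false
}
```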

Fetch Actor Responsibilities:

  1. Retrieve secrets from the cache:
    • If present, return the value.
    • If missing, fetch from the Kubernetes API.
  2. Insert fetched secrets into the cache only if there isn't a more recent version of the secret already in it (which can happen if the cache actor or a parallel fetch actor has written one in the meantime); see the combined sketch after this list.
  3. Always update lastAccess when an entry is accessed to prevent unintended expiration.
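
A combined sketch of that path, again reusing the hypothetical cacheEntry above and the conditionalStore sketched under Considerations below; the provider and secretKey types are illustrative scaffolding, not the actual code:

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// secretKey identifies one key inside one Kubernetes Secret.
type secretKey struct{ namespace, name, field string }

type provider struct {
	client kubernetes.Interface
	store  *conditionalStore // see the conditional-store sketch further down
}

// fetch serves a secret from the cache when present, otherwise reads it from
// the API server and inserts it, never overwriting a newer cached version.
func (p *provider) fetch(ctx context.Context, key secretKey) (string, error) {
	now := time.Now()
	if entry, ok := p.store.get(key); ok {
		entry.lastAccess = now // reads always refresh lastAccess
		p.store.setConditionally(key, entry, entry.apiVersion)
		return entry.value, nil
	}
	secret, err := p.client.CoreV1().Secrets(key.namespace).Get(ctx, key.name, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	entry := cacheEntry{
		value:      string(secret.Data[key.field]),
		apiVersion: secret.ResourceVersion,
		lastAccess: now,
	}
	// Insert only if no newer version landed while we were talking to the API.
	p.store.setConditionally(key, entry, entry.apiVersion)
	return entry.value, nil
}
```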

Considerations:

  • No global locks: Store operations are the only critical section, so the cache actor and fetch actors never block each other.
  • Conditional updates: Since the cache state can change between the moment an actor reads an entry and the moment it writes it back, every update goes through a conditional store operation performed inside that critical section.
  • Custom store implementation: The existing ExpirationCache from k8s.io/client-go/tools/cache does not suit our needs, as it lacks the conditional insertion required to handle these interactions correctly (a minimal sketch of the idea follows this list).
  • Optimized memory management: The prior implementation copied the cache map on every update to prevent Golang map bucket retention. However, I believe this was a misunderstanding of Golang internals and a premature optimisation. If needed in the future, this can be revisited in a more controlled manner.
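
To illustrate the conditional-store idea with a minimal sketch (assumed names, reusing the hypothetical secretKey and cacheEntry types from the earlier sketches, not the actual implementation): a write only lands if the caller's view of the entry is still the stored one, so an older value can never overwrite a newer one.

```go
package example

import "sync"

// conditionalStore is a minimal mutex-guarded map whose writes can be made
// conditional on the version the caller last observed.
type conditionalStore struct {
	mu    sync.Mutex
	items map[secretKey]cacheEntry
}

func newConditionalStore() *conditionalStore {
	return &conditionalStore{items: make(map[secretKey]cacheEntry)}
}

func (s *conditionalStore) get(key secretKey) (cacheEntry, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	entry, ok := s.items[key]
	return entry, ok
}

// setConditionally stores entry only if the key is absent or the stored
// entry's apiVersion still matches expectedVersion; it reports whether the
// write happened.
func (s *conditionalStore) setConditionally(key secretKey, entry cacheEntry, expectedVersion string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if current, ok := s.items[key]; ok && current.apiVersion != expectedVersion {
		return false // a newer version is already cached; keep it
	}
	s.items[key] = entry
	return true
}
```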

PS: as the main changes of this PR are captured in commit a549728, I consider this PR to be aligned with the Pull Requests policy

Why is it important?

This refactor significantly improves the correctness of the kubernetes_secrets provider by ensuring:

  • Secrets do not expire prematurely due to logical race conditions.
  • Updates are properly signaled to consuming components.
  • Performance is optimised through minimal locking and by avoiding unnecessary memory allocations.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the change-log tool
  • I have added an integration test or an E2E test

Disruptive User Impact

This change does not introduce breaking changes but ensures that the kubernetes_secrets provider operates correctly in cache-enabled mode. Users relying on cache behaviour may notice improved stability in secret retrieval.

How to test this PR locally

  1. Run unit tests to validate the new caching behaviour:
    go test ./internal/pkg/composable/providers/kubernetessecrets/...

Related issues

  • Deletion race condition when fetching secret from the Kubernetes secrets provider cache

@pkoutsovasilis added the bug (Something isn't working), Team:Elastic-Agent-Control-Plane (Label for the Agent Control Plane team), and backport-8.x (Automated backport to the 8.x branch with mergify) labels on Jan 29, 2025
@pkoutsovasilis self-assigned this on Jan 29, 2025
@pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch 3 times, most recently from e001b10 to 9093b52, on January 29, 2025 09:08
@pkoutsovasilis marked this pull request as ready for review on January 30, 2025 07:05
@pkoutsovasilis requested a review from a team as a code owner on January 30, 2025 07:05

@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pkoutsovasilis force-pushed the k8s/secret_provider_cache_tmp branch from 9093b52 to 3e3788e on January 30, 2025 17:48