feat: add etcd lock and single API orchestration#416

Closed
patriciareinoso wants to merge 47 commits into canonical:DPE-9349-rolling-ops-maintenance from patriciareinoso:DPE-9350-single-api

@patriciareinoso patriciareinoso commented Apr 10, 2026

Summary

This PR introduces synchronous and asynchronous etcd-based locking for rolling ops, together with a single API that falls back to the peer-relation solution if etcd fails or is unavailable.

Description

etcd async lock

  • A background process is spawned on each lock request.
  • It performs the lock acquisition and triggers a lock_granted Juju custom hook.
  • The charm observes this event, executes the corresponding operation, and signals completion through etcd keys.
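
The flow above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `FakeEtcd`, `LOCK_KEY`, `done_key`, `background_worker` and `on_lock_granted` are hypothetical names, the store is an in-memory dict rather than etcd, and the Juju custom hook is replaced by a direct function call.

```python
import threading
import time


class FakeEtcd:
    """In-memory stand-in for etcd (assumption: a real client replaces this)."""

    def __init__(self):
        self._data = {}
        self._mutex = threading.Lock()

    def put_if_absent(self, key, value):
        """Atomically create the key; return False if it already exists."""
        with self._mutex:
            if key in self._data:
                return False
            self._data[key] = value
            return True

    def put(self, key, value):
        with self._mutex:
            self._data[key] = value

    def get(self, key):
        with self._mutex:
            return self._data.get(key)

    def delete(self, key):
        with self._mutex:
            self._data.pop(key, None)


LOCK_KEY = "/demo/lock"  # hypothetical key, not the PR's layout


def done_key(unit):
    return f"/demo/done/{unit}"  # hypothetical key, not the PR's layout


def background_worker(etcd, unit, dispatch_hook, attempts=100, delay=0.01):
    """Acquire the lock outside the hook context, then notify the charm."""
    for _ in range(attempts):
        if etcd.put_if_absent(LOCK_KEY, unit):
            break
        time.sleep(delay)
    else:
        return False  # never got the lock
    dispatch_hook("lock_granted", unit)
    # Wait until the charm reports completion, then release the lock.
    for _ in range(attempts):
        if etcd.get(done_key(unit)) == "finished":
            etcd.delete(LOCK_KEY)
            return True
        time.sleep(delay)
    return False


def on_lock_granted(etcd, unit, operation):
    """Charm-side handler: run the operation and record completion."""
    operation()
    etcd.put(done_key(unit), "finished")


etcd = FakeEtcd()
log = []


def dispatch_hook(name, unit):
    # Stand-in for triggering the custom hook: call the handler directly.
    log.append(name)
    on_lock_granted(etcd, unit, lambda: log.append(f"restart {unit}"))


worker = threading.Thread(target=background_worker, args=(etcd, "app/0", dispatch_hook))
worker.start()
worker.join()
```

In this sketch the worker, not the charm, releases the lock once it sees the completion key. That split is an assumption made for the example.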

etcd sync lock

  • Implemented a distributed lock using etcd (lease + transactional lock key).
  • The sync lock uses :sync as the lock owner to distinguish it from the lock that the background process is trying to acquire. This means the sync lock may take priority over operations on the async lock.
  • Lock acquisition uses retries (via tenacity) on etcdctl commands.
  • The lock is tied to a lease and automatically released if the lease expires.
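
A minimal sketch of the lease plus transactional lock key idea, under stated assumptions: `FakeEtcd` is an in-memory model with a manually advanced clock, the `"<unit>:sync"` owner format is a guess at how the :sync marker is applied, and a plain retry loop stands in for the tenacity-wrapped etcdctl calls used in the PR.

```python
import itertools


class FakeEtcd:
    """In-memory model of leases and lease-bound keys (not a real client)."""

    def __init__(self):
        self.now = 0.0  # fake clock, advanced manually in the demo
        self._leases = {}  # lease id -> expiry time
        self._keys = {}  # key -> (value, lease id)
        self._ids = itertools.count(1)

    def lease_grant(self, ttl):
        lease_id = next(self._ids)
        self._leases[lease_id] = self.now + ttl
        return lease_id

    def lease_revoke(self, lease_id):
        self._leases.pop(lease_id, None)
        self._drop_orphans()

    def _drop_orphans(self):
        # Expire old leases, then delete every key whose lease is gone.
        self._leases = {i: exp for i, exp in self._leases.items() if exp > self.now}
        self._keys = {k: v for k, v in self._keys.items() if v[1] in self._leases}

    def txn_put_if_absent(self, key, value, lease_id):
        """Model of a transaction: put only if the key does not exist yet."""
        self._drop_orphans()
        if key in self._keys or lease_id not in self._leases:
            return False
        self._keys[key] = (value, lease_id)
        return True

    def get(self, key):
        self._drop_orphans()
        entry = self._keys.get(key)
        return entry[0] if entry else None


def acquire_sync_lock(etcd, key, unit, ttl, attempts=5, wait=lambda: None):
    """Try to take the lock; return the lease id on success, else None."""
    lease_id = etcd.lease_grant(ttl)
    owner = f"{unit}:sync"  # assumed owner format
    for _ in range(attempts):
        if etcd.txn_put_if_absent(key, owner, lease_id):
            return lease_id
        wait()  # a real implementation would back off here
    etcd.lease_revoke(lease_id)  # do not leak the unused lease
    return None


KEY = "/demo/lock"  # hypothetical key, not the PR's layout
etcd = FakeEtcd()

lease_a = acquire_sync_lock(etcd, KEY, "app/0", ttl=30)
holder_first = etcd.get(KEY)

# A second unit cannot take the lock while the first lease is alive.
blocked = acquire_sync_lock(etcd, KEY, "app/1", ttl=30, attempts=3)

# Once the first lease expires, the key disappears and the lock is free.
etcd.now += 31
lease_b = acquire_sync_lock(etcd, KEY, "app/1", ttl=30)
holder_second = etcd.get(KEY)
```

Binding the key to a lease means a crashed holder cannot block other units forever: when the lease expires, the lock key goes with it.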

common API

  • The RollingOpsManager is the public API for advanced rolling ops.
  • It decides which backend the operation runs on: etcd or the peer relation.
  • If etcd operations fail (e.g. etcdctl errors, lease issues), the system falls back to the peer-relation-based implementation.
  • This ensures operations can continue even when etcd is not usable.
  • If the etcd background process fails, a Juju custom hook is triggered so that the RollingOpsManager knows it needs to fall back to the peer relation.
  • Every operation request is:
    • written to etcd
    • also duplicated into the peer relation databag
  • After execution, the operation state is updated in both places.
  • This ensures:
    • no operations are lost during fallback
    • scheduling can continue seamlessly regardless of backend
  • The system only switches back to etcd when the operation queue is empty.
  • Before resuming etcd usage, any remaining etcd queue state is cleaned up. This guarantees a clean slate and avoids inconsistencies.
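
The fallback rules above can be illustrated with a small model. This is a sketch under assumptions, not the RollingOpsManager implementation: `ManagerSketch`, `FakeEtcdBackend`, `FakePeerBackend` and `EtcdUnavailable` are hypothetical names, and both backends are plain in-memory queues.

```python
class EtcdUnavailable(Exception):
    """Hypothetical error raised when the etcd backend cannot be used."""


class FakeEtcdBackend:
    def __init__(self):
        self.healthy = True
        self.queue = []

    def enqueue(self, op):
        if not self.healthy:
            raise EtcdUnavailable("etcd is not reachable")
        self.queue.append(op)

    def clear(self):
        self.queue.clear()


class FakePeerBackend:
    """Stand-in for the peer relation databag."""

    def __init__(self):
        self.queue = []

    def enqueue(self, op):
        self.queue.append(op)


class ManagerSketch:
    def __init__(self, etcd, peer):
        self.etcd = etcd
        self.peer = peer
        self.active = "etcd"

    def request(self, op):
        # Always mirror the request into the peer databag so that nothing
        # is lost if we have to fall back later.
        self.peer.enqueue(op)
        if self.active == "etcd":
            try:
                self.etcd.enqueue(op)
            except EtcdUnavailable:
                self.active = "peer"

    def complete(self, op):
        # Update the operation state in both places.
        self.peer.queue.remove(op)
        if self.active == "etcd":
            if op in self.etcd.queue:
                self.etcd.queue.remove(op)
        elif not self.peer.queue and self.etcd.healthy:
            # Queue drained: wipe stale etcd state, then switch back.
            self.etcd.clear()
            self.active = "etcd"


etcd = FakeEtcdBackend()
peer = FakePeerBackend()
mgr = ManagerSketch(etcd, peer)

mgr.request("restart-1")  # written to etcd and to the peer databag
etcd.healthy = False
mgr.request("restart-2")  # etcd write fails, manager falls back
etcd.healthy = True

mgr.complete("restart-1")
active_mid = mgr.active  # still "peer": the queue is not empty yet
mgr.complete("restart-2")  # queue empty: clean up and return to etcd
```

Note that etcd recovering is not enough to switch back in this model. The manager waits for the peer queue to drain first, so an operation never changes backend midway.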

Peer sync lock

The SyncLockBackend defines the interface that a charm author needs to implement in order to use sync locking with the peer-relation solution, because the peer-relation solution cannot guarantee mutual exclusion when tearing down.

Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/common/_exceptions.py
Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/common/_models.py
Comment thread rollingops/src/charmlibs/rollingops/_rollingops_manager.py
Comment thread rollingops/src/charmlibs/rollingops/_rollingops_manager.py Outdated
Comment thread rollingops/src/charmlibs/rollingops/_rollingops_manager.py
Comment thread rollingops/src/charmlibs/rollingops/_rollingops_manager.py
Comment thread rollingops/src/charmlibs/rollingops/_rollingops_manager.py
"""Collection of etcd key prefixes used for rolling operations.

Layout:
/rollingops/{lock_name}/{cluster_id}/granted-unit/
Author

I think this should be /rollingops/{cluster_id}/{lock_name}/...
