docs: document Pod disruption budget configuration #213
Adds a "Pod disruption budgets" section to the configuration guide between Pod configuration and Container configuration. The section covers spec.podDisruptionBudget on both ClickHouseCluster and KeeperCluster CRDs:

- Operator defaults (per-shard for ClickHouseCluster: maxUnavailable=1 for single-replica shards, minAvailable=1 for multi-replica shards; KeeperCluster: maxUnavailable=replicas/2 to preserve RAFT quorum)
- minAvailable vs maxUnavailable overrides, with a note that the webhook rejects setting both
- The Enabled/Disabled/Ignored policy and when each fits
- Cluster-wide ENABLE_PDB env var on the operator Deployment for environments that ship their own disruption policies

No code or chart changes; documentation only.
Vale flagged the PDB guide for 'autoscaler' as a spelling error. Add it together with related Kubernetes ecosystem terms that the new section uses (GitOps, Gatekeeper, Kyverno, NotReady) so the next round of the docs guide does not trip on them either.
| Resource | Replicas | Default PDB |
|---|---|---|
| `ClickHouseCluster` | `replicas: 1` (single-replica shard) | `maxUnavailable: 1` (disruption is allowed because there is nothing to preserve anyway) |
| `ClickHouseCluster` | `replicas: 2+` (multi-replica shard) | `minAvailable: 1` (at least one replica per shard must stay up) |
| `KeeperCluster` | any | `maxUnavailable: replicas/2` (preserves the RAFT quorum for a `2F+1` cluster: 3 replicas tolerate 1 down, 5 replicas tolerate 2 down) |
This was updated recently: a single-node Keeper now also allows disruption.
18f10ea
The defaults table now distinguishes `replicas: 1` (`maxUnavailable: 1`, from #208) from `replicas: 3+` (`maxUnavailable: replicas/2`).
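For illustration, the Keeper default for a three-replica cluster would correspond to a budget along the following lines. This is only a sketch: the object name and selector label are hypothetical placeholders, not the names the operator actually generates, and `replicas/2` is assumed to use integer division, consistent with the "3 replicas tolerate 1 down" example in the table.

```yaml
# Illustrative sketch only: metadata and labels are hypothetical.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-keeper          # hypothetical name
spec:
  # replicas/2 with integer division: 3/2 = 1,
  # so a quorum of 2 out of 3 survives a voluntary disruption.
  maxUnavailable: 1
  selector:
    matchLabels:
      app: example-keeper       # hypothetical label
```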
### Cluster-wide opt-out {#pdb-cluster-wide-disable}

PDB management can also be disabled cluster-wide via the operator's `ENABLE_PDB` environment variable. With `ENABLE_PDB=false`, the operator skips the PDB reconcile step for **every** ClickHouseCluster and KeeperCluster regardless of their `spec.podDisruptionBudget.policy`.
It also won't watch PDB resources, so the operator works correctly even if its ServiceAccount has no permissions on PDBs.
Good catch. Updated the "Cluster-wide opt-out" paragraph to make this explicit: with ENABLE_PDB=false the operator does not watch PodDisruptionBudget resources at all, and the SA does not need RBAC permissions on poddisruptionbudgets.policy/v1.
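A minimal sketch of the opt-out discussed in this thread, assuming the variable is set directly on the operator's container. Only `ENABLE_PDB` and its `false` value come from the docs under review; the container name is a hypothetical placeholder, and the real Deployment or chart values may expose the setting differently.

```yaml
# Fragment of the operator Deployment (other fields omitted).
spec:
  template:
    spec:
      containers:
        - name: manager           # hypothetical container name
          env:
            # Skips PDB reconciliation for every ClickHouseCluster and
            # KeeperCluster and stops the operator from watching PDBs.
            - name: ENABLE_PDB
              value: "false"      # env values must be strings, hence the quotes
```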
Why

The operator automatically creates a `PodDisruptionBudget` for every ClickHouseCluster shard and KeeperCluster, applying defaults that protect quorum during voluntary disruptions. Despite being a core operational feature, PDB behavior is currently undocumented.

This leaves users unaware of automatically managed PDBs until they encounter them in production, and it forces operators to inspect implementation details in `api/v1alpha1/common.go` to understand the available `Enabled`, `Disabled`, and `Ignored` policies, as well as the cluster-wide `ENABLE_PDB` toggle.

What

Adds a new `## Pod disruption budgets` section to `docs/guides/configuration.mdx`, placed between Pod configuration and Container configuration.

The new section covers:

- Defaults: explains the operator-managed PodDisruptionBudget defaults:
  - ClickHouseCluster: `maxUnavailable: 1` for single-replica shards and `minAvailable: 1` for multi-replica shards.
  - KeeperCluster: `maxUnavailable: replicas/2`, preserving RAFT quorum in a `2F+1` deployment.
- Overriding the defaults: documents `minAvailable` and `maxUnavailable` overrides, including a warning that specifying both fields is rejected by the validating webhook.
- Policies: explains the `Enabled`, `Disabled`, and `Ignored` policies, with YAML examples and guidance on when each option should be used.
- Cluster-wide opt-out: documents the `ENABLE_PDB` environment variable on the operator Deployment, which disables automatic PDB management for environments that provide their own disruption policies.

No code, API, or chart changes. Documentation only.
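For reviewers who want a quick picture of the fields being documented, an override might look roughly like the fragments below. These are sketches rather than excerpts from the new section: the field placement under `spec.podDisruptionBudget` follows the description above, the value `2` is an arbitrary example, and the rest of each manifest is omitted.

```yaml
# Fragment of a ClickHouseCluster or KeeperCluster manifest.
spec:
  podDisruptionBudget:
    # Set exactly one of minAvailable / maxUnavailable;
    # the validating webhook rejects a spec that sets both.
    minAvailable: 2
---
# Selecting a policy instead of the default behavior.
spec:
  podDisruptionBudget:
    policy: Disabled   # one of Enabled, Disabled, Ignored
```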