|
| 1 | +# Cluster Observer |
| 2 | +--- |
| 3 | + |
| 4 | +{NOTE: } |
| 5 | + |
| 6 | +* The primary goal of the **Cluster Observer** is to monitor the health of each database in the cluster |
| 7 | + and adjust its topology to maintain the desired [Replication Factor](../../../server/clustering/distribution/distributed-database#replication-factor). |
| 8 | + |
| 9 | +* The observer always runs on the [Leader](../../../server/clustering/rachis/cluster-topology#leader) node. |
| 10 | + |
| 11 | +* In this page: |
| 12 | + * [Operation flow](../../../server/clustering/distribution/cluster-observer#operation-flow) |
| 13 | + * [Interacting with the Cluster Observer](../../../server/clustering/distribution/cluster-observer#interacting-with-the-cluster-observer) |
| 14 | + |
| 15 | +{NOTE/} |
| 16 | + |
| 17 | +--- |
| 18 | + |
| 19 | +{PANEL: Operation flow} |
| 20 | + |
| 21 | +* To maintain the Replication Factor, every newly elected [Leader](../../../server/clustering/rachis/cluster-topology#leader) starts measuring the health of each node |
| 22 | + by creating dedicated maintenance TCP connections to all other nodes in the cluster. |
| 23 | + |
| 24 | +* Each node reports the current status of _all_ its databases at intervals of [500 milliseconds](../../../server/configuration/cluster-configuration#cluster.workersampleperiodinms) (by default). |
| 25 | + The `Cluster Observer` consumes those reports every [1000 milliseconds](../../../server/configuration/cluster-configuration#cluster.supervisorsampleperiodinms) (by default). |
| 26 | + |
| 27 | +* Upon a **node failure**, the [Dynamic Database Distribution](../../../server/clustering/distribution/distributed-database#dynamic-database-distribution) sequence |
| 28 | + takes place to keep the `Replication Factor` unchanged. |
| 29 | + |
| 30 | + {NOTE: } |
| 31 | + |
| 32 | + **For example**: |
| 33 | + |
| 34 | + * Let us assume a five-node cluster with servers A, B, C, D, E. |
| 35 | + We create a database with a replication factor of 3 and define an ETL task. |
| 36 | + |
| 37 | + * The newly created database will be distributed automatically to three of the cluster nodes. |
| 38 | + Let's assume it is distributed to B, C, and E (so the database group is [B,C,E]), |
| 39 | + and the cluster decides that node C is responsible for performing the ETL task. |
| 40 | + |
| 41 | + * If node C goes offline or becomes unreachable, the Cluster Observer detects the issue. |
| 42 | + Initially: |
| 43 | + * After the duration specified in the [Cluster.TimeBeforeMovingToRehabInSec](../../../server/configuration/cluster-configuration#cluster.timebeforemovingtorehabinsec) configuration, |
| 44 | + the observer moves node C to rehab mode, allowing time for recovery. |
| 45 | + * The ETL task fails over to another available node in the Database Group. |
| 46 | + |
| 47 | + * If node C remains offline beyond the period specified in the [Cluster.TimeBeforeAddingReplicaInSec](../../../server/configuration/cluster-configuration#cluster.timebeforeaddingreplicainsec) configuration, |
| 48 | + the observer, as a last resort, begins replicating the database to another cluster node that is not yet in the Database Group (A or D in this example), restoring the replication factor of 3. |
| 49 | + |
| 50 | + {NOTE/} |
| 51 | + |
| 52 | + {WARNING: } |
| 53 | + |
| 54 | + **Note**: |
| 55 | + |
| 56 | + * The _Cluster Observer_ stores its information **in memory**, so when the `Leader` loses leadership, |
| 57 | + the collected reports of the _Cluster Observer_ and its decision log are lost. |
| 58 | + |
| 59 | + {WARNING/} |
| 60 | + |
| 61 | +{PANEL/} |
| 62 | + |
| 63 | +{PANEL: Interacting with the Cluster Observer} |
| 64 | + |
| 65 | +You can interact with the `Cluster Observer` using the following REST API calls: |
| 66 | + |
| 67 | +| URL | Method | Query Params | Description | |
| 68 | +|-------------------------------------|---------|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------| |
| 69 | +| `/admin/cluster/observer/suspend` | POST | value=[`bool`] | Setting `true` suspends the _Cluster Observer_ operation for the current [Leader term](../../../studio/cluster/cluster-view#cluster-nodes-states-&-types-flow); setting `false` resumes it. | |
| 70 | +| `/admin/cluster/observer/decisions` | GET | | Fetch the log of the recent decisions made by the _Cluster Observer_. | |
| 71 | +| `/admin/cluster/maintenance-stats` | GET | | Fetch the latest reports collected by the _Cluster Observer_. | |
| 72 | + |
| 73 | +{PANEL/} |
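
The REST endpoints listed in the "Interacting with the Cluster Observer" table can be exercised from any HTTP client. The following is a minimal sketch in Python, not an official client: the `LEADER_URL` address and all helper function names are hypothetical, and only the endpoint paths, HTTP methods, and the `value` query parameter come from the table. It assumes the calls are sent to the current Leader node, and it omits authentication, which a secured server would require for admin endpoints.

```python
import urllib.parse
import urllib.request

# Hypothetical address of the current Leader node; replace with your own.
LEADER_URL = "http://localhost:8080"


def observer_request(path, method="GET", params=None, base_url=LEADER_URL):
    """Build (but do not send) a request for a Cluster Observer endpoint."""
    url = base_url.rstrip("/") + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, method=method)


def suspend_request(value, base_url=LEADER_URL):
    """POST /admin/cluster/observer/suspend?value=<bool>.

    The meaning of each boolean setting is described in the endpoint table.
    """
    params = {"value": "true" if value else "false"}
    return observer_request(
        "/admin/cluster/observer/suspend", "POST", params, base_url
    )


def decisions_request(base_url=LEADER_URL):
    """GET /admin/cluster/observer/decisions (recent decisions log)."""
    return observer_request(
        "/admin/cluster/observer/decisions", base_url=base_url
    )


def maintenance_stats_request(base_url=LEADER_URL):
    """GET /admin/cluster/maintenance-stats (latest collected reports)."""
    return observer_request(
        "/admin/cluster/maintenance-stats", base_url=base_url
    )


def send(request, timeout=5):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Building the request separately from sending it lets you inspect the URL and method before anything touches the cluster, for example `print(decisions_request().full_url)`. To actually query a reachable server, pass the prepared request to `send(...)`.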