Commit 8c3a1b8

[ECE] Clarify the steps of identifying the best ZK leader candidate (#1598)
## Description

Previously, the documentation told users/customers that they can find the best ZK leader candidate through logs. Recently, during a sync with @pfcoperez on an internal ticket (elastic/sdh-control-plane#9169 (comment) and elastic/support-tech-lead#1554 (comment)), we decided to rewrite this part. In detail, we now recommend that users/customers collect only the essential information; we (Support) first check that the essential information has been collected, and then engage the dev team to verify it together. The motivation is that, if handled the wrong way, either identifying the best ZK leader candidate or performing the recovery can permanently corrupt user data.

Specifically:

- We hide the steps to identify the ZK leader in a private section of the KB:
  - Private view: https://support.elastic.dev/knowledge/view/fa410d1f
  - Public view: https://support.elastic.co/knowledge/fa410d1f
- We only guide users/customers to collect the essential information, and let them reach out to Support.

## Before / After PR merge

**:: Before**

<img width="906" alt="image" src="https://github.com/user-attachments/assets/bb486efb-6ee6-49d3-b88c-3c0bfc9ebb13" />

**:: After**

In the public doc (the orange part will show up):

![image](https://github.com/user-attachments/assets/53c7b805-1fa8-4099-b96e-00832a426e13)

In the public KB, it will show almost the same thing: https://support.elastic.co/knowledge/fa410d1f

In the KB private section, it will show the details: https://support.elastic.dev/knowledge/view/fa410d1f

---

cc @mmahacek @pfcoperez

Co-authored-by: shainaraskas <[email protected]>
1 parent 27aaa4c commit 8c3a1b8

File tree

1 file changed: +38 −16

troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum.md

````diff
@@ -12,7 +12,7 @@ products:
 # Rebuilding a broken Zookeeper quorum [ece-troubleshooting-zookeeper-quorum]
 
 ::::{warning}
-This article covers an advanced recovery method involving directly modifying Zookeeper. This process can potentially corrupt your data. Elastic recommends only following this outline after receiving [confirmation by Elastic Support](/troubleshoot/index.md#contact-us).
+This article covers an advanced recovery method involving directly modifying Zookeeper. This process can potentially corrupt your data. Elastic strongly recommends only following this outline after receiving [confirmation by Elastic Support](/troubleshoot/index.md#contact-us).
 ::::
 
 
@@ -67,28 +67,50 @@ Perform the following steps on each host to back up the Zookeeper data directory
 
 ## Determine the Zookeeper leader [ece_determine_the_zookeeper_leader]
 
-If a Zookeeper quorum is broken, you must establish the best Zookeeper leader to use for recovery before you start the recovery proces.
+If a Zookeeper quorum is broken, you need to identify the best Zookeeper leader candidate to use for recovery before you start the recovery process.
 
-The simplest way to check is using the [Zookeeper sync status](verify-zookeeper-sync-status.md) command.
+Collect the following information from all ECE director hosts that have ZK containers running, including any recently created or decommissioned hosts. After you have gathered the information, reach out to [Elastic Support](/troubleshoot/index.md#contact-us) to identify the best ZK leader candidate.
 
-If this command is not reporting any leaders, then perform the following actions on each director host:
+* [Output of file list and sizes of Zookeeper directories](#zk-file-list-sizes)
+* [ECE diagnostics](#ece-diagnostics)
 
-1. SSH into the host.
-2. Enter the Docker `frc-zookeeper-servers-zookeeper` container and check its `/app/logs/zookeeper.log` logs for `LEADING`:
+### Collect the output of file list and sizes of Zookeeper directories [zk-file-list-sizes]
 
-```sh
-$ docker exec -it frc-zookeeper-servers-zookeeper bash
-root@XXXXX:/# cat /app/logs/zookeeper.log | grep 'LEADING'
-```
+```
+# collect disk usage
+find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec du -hs {} \;
+# collect file status
+find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec stat {} \;
+```
 
-This command will return results similar to the following:
+### Collect ECE diagnostics [ece-diagnostics]
+
+Follow [](run-ece-diagnostics-tool.md) to collect ECE diagnostics.
+
+Make sure to run the tool with the `--disableApiCalls` flag. Without this flag, ECE diagnostics might fail to run.
+
+**Command**
+```bash
+./ece-diagnostics run --disableApiCalls
+```
 
-```sh
-INFO [QuorumPeer[myid=10](plain=0.0.0.0:2191)(secure=disabled):o.a.z.s.q.QuorumPeer@1549] - LEADING
-INFO [QuorumPeer[myid=10](plain=0.0.0.0:2191)(secure=disabled):o.a.z.s.q.Leader@588] - LEADING - LEADER ELECTION TOOK - 225 MS
-```
 
-3. If multiple directors report this log, then determine the one with the latest timestamp, which will contain the latest Zookeeper state.
+**Sample response**
+
+```bash
+elastic@my-ece-director-host1:~$ ./ece-diagnostics run --disableApiCalls
+- Configuring ECE home folder
+✓ found /mnt/data/elastic for runner 172.16.15.204
+- Log file: /tmp/ecediag-172.16.15.204-20250404-080202.log
+++ Created tar output: /tmp/ecediag-172.16.15.204-20250404-080202.tar.gz
+⚠ skipping collection of ECE metricbeat data (took: 0s)
+⚠ skipping collection of API information for ECE and Elasticsearch (took: 0s)
+✓ collected information on certificates (took: 221ms)
+✓ collected information on client-forwarder connectivity (took: 368ms)
+✓ collected ZooKeeper stats (took: 8.391s)
+✓ collected system information (took: 14.263s)
+✓ collected Docker info and logs (took: 18.976s)
+```
 
 
 ## Recover Zookeeper nodes [ece_recover_zookeeper_nodes]
````
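A note for anyone following the new collection steps: the two `find` commands added in this commit can be captured into a single per-host file that is easy to attach to the support case. The following is a minimal sketch, not part of the commit; it assumes the default ECE data path (`/mnt/data/elastic`) used in the docs, and the output file name is purely illustrative.

```bash
#!/usr/bin/env bash
# Sketch: capture the ZooKeeper directory listing, disk usage, and file status
# into one timestamped file per host, ready to attach to the support case.
# Assumes the default ECE data path from the docs; the output name is illustrative.
set -euo pipefail

out="zk-dirinfo-$(hostname)-$(date +%Y%m%d-%H%M%S).txt"

{
  echo "== disk usage =="
  find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec du -hs {} \;
  echo "== file status =="
  find /mnt/data/elastic/*/services/zookeeper/data/ -print -exec stat {} \;
} | tee "$out"

echo "Saved ZooKeeper directory info to: $out"
```

Run it on every director host that has a ZK container running, as the new docs require, and attach the resulting files along with the ECE diagnostics bundle.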

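Similarly, a small wrapper around the diagnostics step can confirm that a bundle was actually produced before it is uploaded. Again a sketch, not part of the commit: it assumes `ece-diagnostics` sits in the current directory and writes its archive to `/tmp`, as the sample response in the diff shows.

```bash
#!/usr/bin/env bash
# Sketch: run ECE diagnostics with API calls disabled (per the docs, the tool
# might fail to run without --disableApiCalls) and surface the newest bundle.
# Assumes the binary is in the current directory and the tarball lands in /tmp,
# as in the sample response above.
set -euo pipefail

./ece-diagnostics run --disableApiCalls

# Show the most recent diagnostics archive so it can be attached to the case.
ls -t /tmp/ecediag-*.tar.gz | head -n 1
```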