[ECE] Clarify the steps of identifying the best ZK leader candidate #1598

Merged: 11 commits, Jun 10, 2025

@kunisen kunisen commented Jun 4, 2025

Description

Previously, the docs said that users / customers can identify the best ZK leader candidate from the logs.

Recently, during a sync with @pfcoperez in internal tickets (https://github.com/elastic/sdh-control-plane/issues/9169#issuecomment-2752931242 and https://github.com/elastic/support-tech-lead/issues/1554#issuecomment-2777954862), we decided to rewrite this part.

In detail, we recommend that users / customers collect only the essential information. Support then first checks that the essential information has been collected, and engages with the dev team to verify it together.

The motivation is that, if handled incorrectly (either when identifying the best ZK leader candidate or during recovery), this process can potentially corrupt users' data permanently.

Before / After PR merge

:: Before

image

:: After

In public doc: (orange part will show up)

image

In public KB: It will show almost the same thing:

https://support.elastic.co/knowledge/fa410d1f

In KB private section: It will show the details.

https://support.elastic.dev/knowledge/view/fa410d1f


cc @mmahacek @pfcoperez

In summary, this PR:
- Moves the steps for identifying the ZK leader into the private section of the KB:
  - Private view: https://support.elastic.dev/knowledge/view/fa410d1f
  - Public view: https://support.elastic.co/knowledge/fa410d1f
- Only guides users / customers to collect the essential information, then has them reach out to support
@kunisen kunisen requested a review from pfcoperez June 4, 2025 08:37
@kunisen kunisen self-assigned this Jun 4, 2025
@kunisen kunisen requested a review from a team as a code owner June 4, 2025 08:37
@kunisen kunisen added the documentation and supportability labels Jun 4, 2025
@kunisen kunisen changed the title Clarify the steps of identifying the best ZK leader candidate [ECE] Clarify the steps of identifying the best ZK leader candidate Jun 4, 2025

@shainaraskas shainaraskas left a comment


Some suggestions for you. I want to draw your attention to the idea of hinting at why this is a support-aided process, so people have clarity around the risks.

kunisen commented Jun 5, 2025

Thank you @shainaraskas
Regarding your comment in #1598 (comment),

I didn't add a note because we already make this point at the beginning of the page (my bad, I should have mentioned this more clearly in my initial description 🙏):
https://www.elastic.co/docs/troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum

This article covers an advanced recovery method involving directly modifying Zookeeper. This process can potentially corrupt your data. Elastic recommends only following this outline after receiving confirmation by Elastic Support.

image

The TL;DR is: if the wrong ZK leader candidate is chosen, the whole ECE installation may break and its structure may be permanently lost.

A bit more detail:

  • The ZK leader and followers (like ES master nodes) contain all the deployment and structure information. When the ZK leader is broken, i.e. ZK quorum is lost, ECE is down. To recover, we need to make a technically informed guess about which node is the best ZK leader candidate, and based on that information we recover the whole ECE state.
  • If the wrong leader is chosen, e.g. one that contains only part of the ECE deployment and structure information, then only that partial information will be recovered, which means the rest of the ECE structure is gone. This is the so-called "permanent data loss", or "potentially corrupt your data".

Previously, the page https://www.elastic.co/docs/troubleshoot/deployments/cloud-enterprise/rebuilding-broken-zookeeper-quorum was itself a KB article, but because it is a popular one, we promoted it to the public docs.
However, we thought it was a bit misleading about how to identify the ZK leader. Since this is the most important step in the whole flow of this doc page, we want to make extra sure of this part by hiding the identification method from the public entirely and keeping it internal only. That way, users / customers will reach out to check with support, and we can avoid the risk of users making mistakes here.

Hope it's clear.


That said, do you think we should add one more note for emphasis?
Or are we good with mentioning this only at the very top of the page?


@shainaraskas

That said, do you think we should add one more note for emphasis?
Or are we good with mentioning this only at the very top of the page?

Thanks @kunisen - I did see that and then promptly forgot about the note at the end of my review. That's good enough for me.


@shainaraskas shainaraskas left a comment


looks good to me from a docs POV, but we'll wait on eng review as well


kunisen commented Jun 6, 2025

Thank you so much @shainaraskas! Asked internally here 😄

@pfcoperez

LGTM (intentionally not approving because I want my team to take a peek too). I wonder, though, whether you'd like to document, either externally or internally, what to do with the acquired information too.

@kunisen kunisen requested a review from yang-wei June 10, 2025 02:50
@kunisen kunisen enabled auto-merge (squash) June 10, 2025 05:08
kunisen commented Jun 10, 2025

Thank you all! Will merge this.

@kunisen kunisen merged commit 8c3a1b8 into main Jun 10, 2025
5 of 6 checks passed
@kunisen kunisen deleted the kunisen-docpr-sdhcp-9169 branch June 10, 2025 05:09
Labels: documentation, supportability

4 participants