
DOC-11497 Docs for obs: Enabling troubleshooting hot spots externally (e.g., logs or metrics) #19577

Open: wants to merge 31 commits into base: main

Conversation

florence-crl
Contributor

@florence-crl florence-crl commented May 1, 2025

Fixes DOC-11497

Added detect-hotspots.md and associated images.

Rendered previews:


github-actions bot commented May 1, 2025

Files changed:


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit d48a24a
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/685596e327eb7d0008ec4ded


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit d48a24a
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/685596e35e37da0008236082


netlify bot commented May 1, 2025

Deploy Preview for cockroachdb-docs failed.

Name Link
🔨 Latest commit 0173eb8
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/6813b55b6c4a2d00084eadec


netlify bot commented May 1, 2025

Netlify Preview

Name Link
🔨 Latest commit d48a24a
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/685596e32095c900085407b3
😎 Deploy Preview https://deploy-preview-19577--cockroachdb-docs.netlify.app

@florence-crl florence-crl requested a review from kevin-v-ngo May 13, 2025 19:17
Contributor Author

@florence-crl florence-crl left a comment

Thanks for the first look. @angles-n-daemons, please review again.


@angles-n-daemons angles-n-daemons left a comment

Awesome, couple more quick comments here.

Contributor Author

@florence-crl florence-crl left a comment

TFTR


@kevin-v-ngo kevin-v-ngo left a comment

Awesome doc! A few questions and suggestions.

A few questions and suggestions:

  1. Can we simplify this and remove the second box ("Is there a node outlier in the metrics?")?
  2. Are we guaranteed to have a 'hot ranges log' when there is a popular key log for the latch contention workflow? CC @angles-n-daemons

Contributor Author

modified diagram

We aren't. I'll explain why in detail.

The hot ranges log shows up under two conditions when enabled:

  1. The logging interval duration has elapsed (e.g., once every four hours).
  2. A single replica has exceeded the CPU threshold we configured for logging.

Now, when there's a popular key, or rather a row hotspot, a single range may be receiving most of the traffic, but many of the incoming queries are waiting for a latch to be released rather than doing any work. Waiting for a latch has no effect on CPU utilization, so if lots of queries are waiting, there's not as much CPU activity as the traffic volume would suggest.

You can see this difference in the Anatomy of a Hotspot document. If you look at "Appendix B: Anatomy of a Row Hotspot", you'll see that the CPU utilization for the leaseholder, while elevated, doesn't exceed 25%.

It's certainly possible that this is enough to exceed the configured threshold, but it's not guaranteed.
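
To make the two conditions concrete, here is a minimal Python sketch of the decision described above. It is an illustration only, not CockroachDB's implementation: the function name, its parameters, and every numeric value are hypothetical, and CPU is expressed in arbitrary units that only need to match the threshold's units.

```python
from typing import Iterable

FOUR_HOURS = 4 * 60 * 60  # example logging interval in seconds (assumed value)


def should_log_hot_ranges(
    seconds_since_last_log: float,
    interval_seconds: float,
    replica_cpu_values: Iterable[float],
    cpu_threshold: float,
) -> bool:
    """Return True if either logging condition from the comment holds.

    Hypothetical model: all names and units here are assumptions.
    """
    # Condition 1: the logging interval duration has elapsed.
    interval_elapsed = seconds_since_last_log >= interval_seconds
    # Condition 2: a single replica has exceeded the configured CPU threshold.
    replica_over_threshold = any(cpu > cpu_threshold for cpu in replica_cpu_values)
    return interval_elapsed or replica_over_threshold


# Row hotspot: most requests wait on a latch, so replica CPU stays modest
# and neither condition is met.
row_hotspot = should_log_hot_ranges(600, FOUR_HOURS, [0.05, 0.20, 0.04], 0.5)

# CPU-bound hot range: one replica exceeds the threshold.
cpu_hotspot = should_log_hot_ranges(600, FOUR_HOURS, [0.05, 0.80, 0.04], 0.5)

# The interval has elapsed, so the log is emitted regardless of CPU.
interval_due = should_log_hot_ranges(FOUR_HOURS, FOUR_HOURS, [0.05, 0.20, 0.04], 0.5)
```

In this model the row hotspot case logs nothing until the interval elapses, which is why a popular key log can appear without a matching hot ranges log.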

Contributor Author

@florence-crl florence-crl left a comment

@kevin-v-ngo, thanks for your first review. Please take a second look.

Contributor Author

modified diagram

@florence-crl florence-crl requested a review from kevin-v-ngo June 17, 2025 18:37
3 participants