-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][test] BacklogQuotaManagerTest.backlogsAgeMetricsNoPreciseWithoutBacklogQuota handle empty /metrics scrape #24887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…tBacklogQuota handle empty /metrics scrape Signed-off-by: Vinkal Chudgar <[email protected]>
|
@Technoboy- Thanks for the review and approval. CI shows two unrelated test failures that match known issues. 1. CI - Unit - Brokers - Broker Group 1
2. CI - Unit - Brokers - Client Impl
This PR is test only and modifies |
|
/pulsarbot rerun-failure-checks |
2 similar comments
|
/pulsarbot rerun-failure-checks |
|
/pulsarbot rerun-failure-checks |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #24887 +/- ##
============================================
- Coverage 74.34% 74.25% -0.10%
+ Complexity 33496 33422 -74
============================================
Files 1913 1913
Lines 149315 149496 +181
Branches 17331 17391 +60
============================================
- Hits 111012 111008 -4
- Misses 29475 29646 +171
- Partials 8828 8842 +14
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
…tBacklogQuota handle empty /metrics scrape (apache#24887) Signed-off-by: Vinkal Chudgar <[email protected]> (cherry picked from commit ba8468e)
…tBacklogQuota handle empty /metrics scrape (#24887) Signed-off-by: Vinkal Chudgar <[email protected]> (cherry picked from commit ba8468e)
…tBacklogQuota handle empty /metrics scrape (#24887) Signed-off-by: Vinkal Chudgar <[email protected]> (cherry picked from commit ba8468e)
…tBacklogQuota handle empty /metrics scrape (apache#24887) Signed-off-by: Vinkal Chudgar <[email protected]> (cherry picked from commit ba8468e) (cherry picked from commit 42e02a8)
…tBacklogQuota handle empty /metrics scrape (apache#24887) Signed-off-by: Vinkal Chudgar <[email protected]> (cherry picked from commit ba8468e) (cherry picked from commit 42e02a8)
Fixes #24797
Motivation
In CI, the test
backlogsAgeMetricsNoPreciseWithoutBacklogQuotafails intermittently with:Source:
BacklogQuotaManagerTest.xml, failure section.What the test currently does
The test reads the histogram counter
_countfor the quota check duration:What actually happened in the failing run
During the first scrape performed by the test, the Prometheus servlet failed while generating
/metrics, which resulted in no metric families being emitted for that scrape.Evidence from
BacklogQuotaManagerTest.xml:/metricsDisabledMockExceptionduring servlet renderingThis explains why the client saw an empty match list for the histogram family during that scrape. The current test code then performs
.get(0)on an empty list which throws theArrayIndexOutOfBoundsExceptionshown in the failure section.Why a collector exception makes
/metricsempty for that scrapePulsar exposes metrics through a servlet that iterates the
CollectorRegistryto build the response. If a collector throws while the servlet is rendering, the servlet logs the error and the request produces HTTP 500. This behavior is directly visible in the cited log lines above.The failing collector in the stack is
ObserverGauge.collect.TheObserverGaugeimplementation calls each child supplier while collecting:pulsar.getConfig()insideBrokerService.getPendingLookupRequest():The stack shows
PulsarService.getConfigat the top of the cause chain. The exception message confirms the inline mock had been cleared at that time, which is consistent withDisabledMockException.Conclusion
The failure was caused by indexing the first element without verifying that the
_countsample was present.. In this CI run, the servlet returned HTTP 500 while the test scraped. The client received no samples for that scrape which made the list empty. The subsequent .get(0) caused ArrayIndexOutOfBoundsException.Modifications
Scope: test-only changes in
BacklogQuotaManagerTest. No broker or metrics code is changed.1. Make the metric read method tolerant of empty scrapes
getQuotaCheckCount()now returns OptionalLong. It returns OptionalLong.empty() when a scrape has no sample for the expected name and label, for example when the servlet returns 500 or when the series has not yet appeared. This removes the .get(0) crash.2. Capture the baseline only after a successful scrape, then require two checks
waitForQuotaCheckToRunTwice()continues to poll every 1 second and uses the same overall bound TIME_TO_CHECK_BACKLOG_QUOTA * 3. It still asserts two checks after a baseline.The change is when the baseline is taken: the method now waits until
/metricsreturns a sample for the series, records that value as the baseline, then requires the counter to increase by two.Justification for this change
In the failing CI run,
/metricsservlet returned HTTP 500 while the test was scraping. When /metrics servlet returns 500, the client returns no data for that scrape. If the test takes its baseline during such an outage, it can record a value like 0 even if the counter had already advanced. On the next successful scrape, the value might already be greater than baseline + 1, which can produce a false pass by counting increments that occurred before the baseline. The fix avoids taking a baseline during an outage by waiting for the first successful scrape that shows the sample, then requiring two increments after that point. This keeps the original intent intact and removes the flake.This change does not mask real failures. If the servlet continues to fail due to DisabledMockException and the sample never becomes visible, the await times out within the same overall bound, and the test fails deterministically rather than crashing or falsely passing. The failing CI case showed exactly this failure mode at scrape time.
Verifying this change
Reproduction snippet for reviewers
This section is optional and not part of the PR. It demonstrates the original crash locally. Do not commit any temporary edits from this section.
Prerequisite
You are on a commit before the fix, so the two helper methods already have their original implementations shown below. If you are on the PR branch that contains the fix, temporarily replace the fixed helpers with these originals before running this reproducer, then restore the fixed helpers after you finish.
1. Verify the original helpers in BacklogQuotaManagerTest.java
2. Add a temporary test that forces one failing scrape
It forces one failing /metrics scrape by registering a short-lived ObserverGauge whose supplier throws exactly once. The first scrape fails and returns no samples, which makes the original getQuotaCheckCount() index into an empty list and throw ArrayIndexOutOfBoundsException. The original getQuotaCheckCount() then indexes an empty list and causes ArrayIndexOutOfBoundsException. The gauge is labeled so it participates in metrics output and is unregistered in finally
3. Run only this test.
./mvnw -pl pulsar-broker -Dtest=BacklogQuotaManagerTest#backlogQuotaMetricsReproOriginalCrash -DforkCount=1 -DreuseForks=false -Dsurefire.printSummary=true -Dtestng.verbose=2 test4. Expected result
the test throws ArrayIndexOutOfBoundsException, proving the crash is reproducible when a scrape returns no samples.
5. Cleanup
Delete the temporary test method.
Personal CI Results
Tested in Personal CI fork: vinkal-chudgar#2
Status: All checks have passed (47 successful checks, 3 skipped)
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: vinkal-chudgar#2