Poor performance of DefaultLongTaskTimer#takeSnapshot #3993

Open
Hartigan opened this issue Jul 24, 2023 · 8 comments
Labels
module: micrometer-core (An issue that is related to our core module), performance (Issues related to general performance)

Comments

@Hartigan

Poor performance of MicrometerCollector.collect(), which uses DefaultLongTaskTimer: DefaultLongTaskTimer makes many calls to ConcurrentLinkedDeque.size(), which has O(n) complexity, during histogram creation.

Micrometer version: 1.10.3

It might be worth caching the size of the deque, or taking a full snapshot of it, at the beginning of DefaultLongTaskTimer.takeSnapshot(). The method may also contain data races between histogram creation and updates of activeTasks.
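
A minimal sketch of that caching idea (not Micrometer's actual implementation; the activeTasks field here is just a stand-in for the timer's internal storage): copy the deque once at the start of takeSnapshot() so the O(n) traversal cost of ConcurrentLinkedDeque is paid a single time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedDeque;

class SnapshotSketch {
    // Stand-in for DefaultLongTaskTimer's active-task storage.
    private final ConcurrentLinkedDeque<Long> activeTasks = new ConcurrentLinkedDeque<>();

    // Instead of calling activeTasks.size() (an O(n) traversal) repeatedly
    // while building the histogram, copy the deque once up front and work on
    // the copy, whose size() is O(1).
    List<Long> takeSnapshot() {
        List<Long> copy = new ArrayList<>(activeTasks); // single O(n) pass
        // ... build histogram / percentiles from `copy` here ...
        return copy;
    }
}
```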

@jonatan-ivanov
Member

Thank you for the issue!

Do you have benchmarks that show the bottleneck, or is this a theoretical issue?
I would assume that you would run into issues because of the huge number of parallel tasks before you saw performance degradation caused by size(). This assumption might be incorrect; benchmarks could demonstrate whether it holds.

The method may also contain data races between histogram creation and updates of activeTasks.

Could you please elaborate on this? Which method, and where could a race condition arise?

@jonatan-ivanov added the waiting for feedback label and removed the waiting-for-triage label on Jul 24, 2023
@Hartigan
Author

True, I have a service with a huge number of parallel requests, and that is the expected load (not an issue). ConcurrentLinkedDeque.size() consuming 8% of the flame graph is the actual issue.
[Flame graph screenshot, 2023-07-25]

I calculated the algorithmic complexity of this class:
DefaultLongTaskTimer stores tasks in a ConcurrentLinkedDeque and provides start(), which is O(1) because it appends a task to the deque; SampleImpl.stop(), which is O(n) because it removes an arbitrary element from the deque; and takeSnapshot(), which is O(n^2) because it iterates over the deque and calls size() on each iteration.
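
For illustration, a quick, unscientific single-run timing (not from the issue; a proper comparison would use JMH) of why calling size() on every iteration turns one linear pass into quadratic work:

```java
import java.util.concurrent.ConcurrentLinkedDeque;

public class SizePerIterationDemo {
    public static void main(String[] args) {
        ConcurrentLinkedDeque<Integer> deque = new ConcurrentLinkedDeque<>();
        for (int i = 0; i < 20_000; i++) {
            deque.add(i);
        }

        // One size() call: a single O(n) traversal of the linked nodes.
        long t0 = System.nanoTime();
        int size = deque.size();
        long oneCallNanos = System.nanoTime() - t0;

        // size() on every iteration, as a naive takeSnapshot() loop might do:
        // n traversals of an n-element deque, i.e. O(n^2) work overall.
        t0 = System.nanoTime();
        long checksum = 0;
        for (Integer ignored : deque) {
            checksum += deque.size();
        }
        long perIterationNanos = System.nanoTime() - t0;

        System.out.printf("n=%d, single size(): %d ns, size() per iteration: %d ns (checksum=%d)%n",
                size, oneCallNanos, perIterationNanos, checksum);
    }
}
```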

Could you please elaborate on this? Which method, and where could a race condition arise?

It's theoretical, but ConcurrentLinkedDeque is used, which implies that timers can be added and removed while the histogram is being calculated (calls to start()/SampleImpl.stop() during a call to takeSnapshot()).
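
To illustrate the concern (a standalone sketch, not Micrometer code): ConcurrentLinkedDeque iterators are weakly consistent, so a traversal running concurrently with start()/stop() never fails, but it is also not a point-in-time view of the active tasks.

```java
import java.util.Iterator;
import java.util.concurrent.ConcurrentLinkedDeque;

public class WeaklyConsistentIterationDemo {
    public static void main(String[] args) {
        ConcurrentLinkedDeque<String> activeTasks = new ConcurrentLinkedDeque<>();
        activeTasks.add("task-1");
        activeTasks.add("task-2");

        // Begin a traversal, as takeSnapshot() effectively does...
        Iterator<String> it = activeTasks.iterator();

        // ...then mutate the deque, as concurrent start()/stop() calls would.
        activeTasks.add("task-3");      // a task started mid-snapshot
        activeTasks.remove("task-2");   // a task stopped mid-snapshot

        // The iterator is weakly consistent: it never throws
        // ConcurrentModificationException, but it may or may not reflect the
        // changes above, so the resulting snapshot is not a point-in-time view.
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```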

@shakuzen
Member

Thank you for the analysis. The DefaultLongTaskTimer's takeSnapshot implementation currently does not scale well with many active tasks. Out of curiosity, what order of magnitude of active tasks are you expecting at a time? 1,000? 1,000,000?
I have some benchmarks locally and will try messing around with improving the efficiency, but if you have a suggestion, feel free to propose it in a pull request.

@shakuzen added the performance and module: micrometer-core labels and removed the waiting for feedback label on Jul 26, 2023
@shakuzen changed the title from "Poor performance of MicrometerCollector.collect()" to "Poor performance of DefaultLongTaskTimer#takeSnapshot" on Jul 26, 2023
@Hartigan
Author

Hartigan commented Jul 27, 2023

I created pull request #4001

@Hartigan
Author

If you are willing to sacrifice DefaultLongTaskTimer#max() performance, I can replace the Deque with a HashMap and improve the performance of the stop operation. For cases with a huge number of active tasks I see 3-5% of the profile in ConcurrentLinkedDeque#remove(Object), because the tasks being removed are not at the beginning of the deque. Example: with 100 active long requests while the timer starts and stops new short requests, start() just appends a new timer for each short request, but stop() must iterate past the 100 long tasks to delete the short task every time.
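
A hedged sketch of that trade-off (identifiers here are illustrative, not Micrometer's internals): keying active samples by an ID in a ConcurrentHashMap makes stop() O(1), while max() then has to scan all values instead of peeking at the oldest deque entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class MapBackedActiveTasks {
    private final ConcurrentHashMap<Long, Long> activeStartTimes = new ConcurrentHashMap<>();
    private final AtomicLong idGenerator = new AtomicLong();

    // start(): O(1) insertion; returns a handle used to stop the task later.
    long start(long startTimeNanos) {
        long id = idGenerator.incrementAndGet();
        activeStartTimes.put(id, startTimeNanos);
        return id;
    }

    // stop(): O(1) removal by key, instead of an O(n) deque scan.
    void stop(long id) {
        activeStartTimes.remove(id);
    }

    // max(): now requires a full scan over the values -- the sacrificed part.
    long maxDurationNanos(long nowNanos) {
        long max = 0;
        for (long start : activeStartTimes.values()) {
            max = Math.max(max, nowNanos - start);
        }
        return max;
    }
}
```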

@fdulger

fdulger commented Feb 22, 2024

Is this planned to be taken up anytime soon? We're also experiencing similar behavior when we let our application run for a long time. We're using Spring Boot and Micrometer with the default configuration, and we scrape the metrics with a Prometheus server once a minute. We notice that the CPU usage of the system goes up continuously; if we increase the scraping frequency, the CPU usage goes up faster. First the Micrometer endpoint starts to respond more slowly, and eventually the entire application is affected.

@Sipaha

Sipaha commented Aug 13, 2024

@fdulger We had the same issue in our system. The problem started after we updated Spring, which now registers its own DefaultMeterObservationHandler. We already had one handler registered, so this caused a conflict.
Each handler registers a sample in its onStart method, but only the last registered sample is properly stopped in onStop. The others stay in a 'started' state forever, leading to a memory leak and increased CPU usage, just like what you're seeing.
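
A hedged illustration of that misconfiguration (plain Micrometer API, assuming micrometer-core and micrometer-observation on the classpath; in practice one of the two handlers would come from Spring Boot's auto-configuration rather than being registered by hand):

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.observation.DefaultMeterObservationHandler;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;

public class DuplicateHandlerDemo {
    public static void main(String[] args) {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        ObservationRegistry observationRegistry = ObservationRegistry.create();

        // Problematic setup: two DefaultMeterObservationHandler instances on the
        // same registry. Both start a LongTaskTimer sample in onStart, but they
        // store it under the same context key, so only one sample is found and
        // stopped in onStop; the other stays active forever.
        observationRegistry.observationConfig()
                .observationHandler(new DefaultMeterObservationHandler(meterRegistry))
                .observationHandler(new DefaultMeterObservationHandler(meterRegistry));

        Observation.createNotStarted("demo.operation", observationRegistry)
                .observe(() -> { /* work being observed */ });
    }
}
```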

@jonatan-ivanov
Member

FYI, since 1.13.0 you can disable LongTaskTimer creation: #4604. If you are using Spring Boot, there is a property for that:
management.observations.long-task-timer.enabled.
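
For example, in application.properties (the exact property name and availability depend on your Spring Boot version):

```properties
# Disable LongTaskTimer creation for observations (see #4604).
management.observations.long-task-timer.enabled=false
```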

Also, I would really look into why some of your Observations are never stopped. Not stopping Observations or LongTaskTimers is in fact a memory leak, and it is a separate problem from this issue (it's likely a problem with the instrumentation, not Micrometer).
