Fix reporting number of threads per state #3664
base: main
Conversation
… to pass the single test.
@lenin-jaganathan @shakuzen @jonatan-ivanov Please review it again when you have time.
@asasas234 Thank you for the PR! I think not being real time is actually a benefit rather than a drawback, because the gauges should be consistent with each other. With snapshotting, this is ensured since you create one snapshot and then use that. If you query them all the time, they can be inconsistent. Let's say you have only one thread, which is transitioning from state X to Y. The Gauge that tracks state X queries the number of threads and gets 1; then the transition happens, and the Gauge that tracks state Y queries the number of threads and also gets 1, so you will see two threads if you sum your Gauges even though you only have one. With snapshotting, this should not be possible since one consistent snapshot is used per publication.
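For illustration, here is a minimal, hypothetical sketch of the live-query approach being contrasted above: each per-state gauge calls ThreadMXBean independently, so gauges sampled a moment apart can disagree about the same thread. The class name and tag formatting are assumptions for the sketch, not the actual JvmThreadMetrics code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;

// Hypothetical illustration only: every state gauge queries ThreadMXBean on its own,
// so gauges read at slightly different times can classify the same thread twice.
class LiveQueryThreadStateGauges {

    void bindTo(MeterRegistry registry) {
        ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();
        for (Thread.State state : Thread.State.values()) {
            Gauge.builder("jvm.threads.states", threadBean, bean -> countByState(bean, state))
                .tag("state", state.name().toLowerCase())
                .register(registry);
        }
    }

    // Counts threads currently in the given state; each call observes a different point in time.
    private static double countByState(ThreadMXBean bean, Thread.State state) {
        return Arrays.stream(bean.getThreadInfo(bean.getAllThreadIds()))
            .filter(info -> info != null && info.getThreadState() == state)
            .count();
    }
}
```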
I don't see anything wrong with it; it looks like it just elevates the threadBean to a property and changes the variable and method naming and comments.
@jonatan-ivanov I see that this PR has not been merged; I am not sure what the reason is. Is there anything else I need to do?
@asasas234 I would like @shakuzen to take a look at this too; I think I'm ok with merging it in.
@@ -47,18 +43,22 @@ public class JvmThreadMetrics implements MeterBinder {

    private final Iterable<Tag> tags;

    private final ThreadMXBean threadBean;

    private final ThreadLocal<Map<Thread.State, Long>> threadCountsByStateThreadLocal;
I don't understand why we're using a ThreadLocal to store the thread states. While we typically use a thread pool with a single thread to publish metrics, that is not guaranteed. If reading the thread state gauges is done on arbitrarily many threads, we end up keeping a copy of the thread states on each thread, and I think we could have potential memory leaks.
+1 on this. I don't think using ThreadLocal is a great way to ensure that we reduce the calls to ThreadMXBean.
@shakuzen @lenin-jaganathan I don't think this will create a memory leak. The number of threads inside the JVM is usually limited, so it won't waste much memory, and threads are usually multiplexed: when the same metrics are queried a second time, or each time the metrics for all thread states are fetched, the ThreadLocal's remove operation is triggered. So I don't think there is any need to worry about this issue.
I am not particularly worried about memory leaks for the same reason @shakuzen mentioned: "we typically use a thread pool with a single thread to publish metrics, that is not guaranteed". No real objection from me to this approach; after having a second look at the change ("when the same metrics are queried for the second time"), I believe this solution should work just fine.
> when the same metrics are queried a second time, or each time the metrics for all thread states are fetched, the ThreadLocal's remove operation is triggered.
This assumes that the same thread is reading all thread states which is not guaranteed. It seems like ThreadLocal was chosen for convenience because it has some API around setting initial values after removing, but semantically, what we want is a static map that gets refreshed after all keys are read once, which shouldn't depend on whether the same thread is being used or not.
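A rough sketch of that idea, not the PR's code: one snapshot map shared by all the state gauges, rebuilt whenever every key has been consumed, regardless of which threads do the reading. Class and method names here are invented for illustration.

```java
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch: a shared (non-ThreadLocal) snapshot consumed key-by-key and
// refreshed once it is empty, independent of which thread performs each read.
class SharedThreadStateSnapshot {

    private final ThreadMXBean threadBean;

    private final Map<Thread.State, Long> snapshot = new EnumMap<>(Thread.State.class);

    SharedThreadStateSnapshot(ThreadMXBean threadBean) {
        this.threadBean = threadBean;
    }

    // Called by the gauge for one state; synchronized so a refresh never interleaves with reads.
    synchronized long countFor(Thread.State state) {
        if (snapshot.isEmpty()) {
            refresh();
        }
        Long count = snapshot.remove(state);
        return count != null ? count : 0L;
    }

    private void refresh() {
        Map<Thread.State, Long> counts = Arrays.stream(threadBean.getThreadInfo(threadBean.getAllThreadIds()))
            .filter(info -> info != null)
            .collect(Collectors.groupingBy(info -> info.getThreadState(), Collectors.counting()));
        for (Thread.State state : Thread.State.values()) {
            snapshot.put(state, counts.getOrDefault(state, 0L));
        }
    }
}
```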
@shakuzen If the metrics are read by other threads, then the initialization logic of the ThreadLocal will be followed and there will be no problem. If you think there is still a problem, can you describe it in detail?
@shakuzen Unless you're talking about multiple threads reading metrics, each reading a different thread state? For example: request A reads the runnable state of a ThreadLocal, and then request B reads the blocked state of a ThreadLocal? It is true that this scenario does not result in an update of the ThreadLocal, but I think it is necessary to evaluate whether this scenario really exists.
@shakuzen If this is the case, there is a solution. My idea is to add a property that records the time of the last read of the metrics, stored in the same ThreadLocal as the stateCountGroup. Each time data is read from the stateCountGroup, check the time of the last read; if it exceeds a limit, for example 1 second, trigger a cache refresh. A 1-second cache is absolutely sufficient for ordinary scenarios, so I don't think it will be a problem if different states are read by different requests within that second, because the cached state data is still relatively real-time.
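A hedged sketch of what that proposal might look like; the class and field names (such as TimedCounts) are invented here, and the 1-second limit is just the value suggested above.

```java
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of the time-based variant: the snapshot and its creation time live
// together in the ThreadLocal, and any read older than the TTL rebuilds the snapshot.
class TimedThreadStateSnapshot {

    private static final long TTL_NANOS = 1_000_000_000L; // the 1 second suggested above

    private final ThreadMXBean threadBean;

    private final ThreadLocal<TimedCounts> snapshot = new ThreadLocal<>();

    TimedThreadStateSnapshot(ThreadMXBean threadBean) {
        this.threadBean = threadBean;
    }

    long countFor(Thread.State state) {
        TimedCounts current = snapshot.get();
        if (current == null || System.nanoTime() - current.takenAt > TTL_NANOS) {
            current = new TimedCounts(takeSnapshot(), System.nanoTime());
            snapshot.set(current);
        }
        return current.counts.getOrDefault(state, 0L);
    }

    private Map<Thread.State, Long> takeSnapshot() {
        return Arrays.stream(threadBean.getThreadInfo(threadBean.getAllThreadIds()))
            .filter(info -> info != null)
            .collect(Collectors.groupingBy(info -> info.getThreadState(), Collectors.counting()));
    }

    private static final class TimedCounts {

        private final Map<Thread.State, Long> counts;

        private final long takenAt;

        TimedCounts(Map<Thread.State, Long> counts, long takenAt) {
            this.counts = counts;
            this.takenAt = takenAt;
        }
    }
}
```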
@asasas234 I talked with @jonatan-ivanov about this today. I think we're all in agreement that the general behavior we want is to retrieve the thread states once for the gauges together per some period of time. We discussed that maybe this is a more general use case that we could update MultiGauge to support. Currently the way MultiGauge works is that you need to register rows, but another way it could work is that a function is registered that will produce the rows (gauges) when it is queried, perhaps with some interval before it refreshes. I'll need to spend some more time thinking about whether it is feasible to evolve the MultiGauge API for this purpose and how that affects publishing in the different registries.
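For reference, a rough sketch of how MultiGauge is registered today, with API names recalled from the current Micrometer 1.x API; the self-refreshing, function-based variant discussed above is only an idea at this point.

```java
import java.util.ArrayList;
import java.util.List;

import io.micrometer.core.instrument.MultiGauge;
import io.micrometer.core.instrument.Tags;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

// Today the caller computes the rows up front and re-registers them whenever the values
// should change; the proposal above would instead register a row-producing function that
// the registry re-evaluates on some interval.
class MultiGaugeToday {

    public static void main(String[] args) {
        SimpleMeterRegistry registry = new SimpleMeterRegistry();
        MultiGauge threadStates = MultiGauge.builder("jvm.threads.states").register(registry);

        List<MultiGauge.Row<?>> rows = new ArrayList<>();
        for (Thread.State state : Thread.State.values()) {
            rows.add(MultiGauge.Row.of(Tags.of("state", state.name().toLowerCase()), 0));
        }
        threadStates.register(rows, true); // overwrite = true replaces the previously registered rows
    }
}
```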
We're going to need more time to work on what I mentioned in my previous comment. I've opened #3698 to track it.
Long count = threadCountsByState.remove(state);

// This means that this state was already queried. Assuming that a Gauge will be
// queried once per publication, this should mean that we need a new snapshot.
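To put that line in context, here is a hypothetical reconstruction of the read path; field and method names are guesses, not necessarily the PR's actual code. Each gauge removes its own state from the thread-local snapshot, and a missing key is the signal to take a fresh snapshot.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical reconstruction of the approach around the snippet above: each gauge removes
// its own state from a thread-local snapshot; a missing key means the snapshot was already
// consumed, so a fresh one is created before reading again.
class SnapshotThreadStateReader {

    private final ThreadMXBean threadBean = ManagementFactory.getThreadMXBean();

    private final ThreadLocal<Map<Thread.State, Long>> threadCountsByStateThreadLocal =
            ThreadLocal.withInitial(this::takeSnapshot);

    double getThreadCount(Thread.State state) {
        Long count = threadCountsByStateThreadLocal.get().remove(state);
        if (count == null) {
            // This state was already queried from the current snapshot; assuming each gauge
            // is read once per publication, it is time for a new snapshot.
            threadCountsByStateThreadLocal.remove();
            count = threadCountsByStateThreadLocal.get().remove(state);
        }
        return count != null ? count : 0.0;
    }

    private Map<Thread.State, Long> takeSnapshot() {
        Map<Thread.State, Long> counts = Arrays.stream(threadBean.getThreadInfo(threadBean.getAllThreadIds()))
            .filter(info -> info != null)
            .collect(Collectors.groupingBy(info -> info.getThreadState(), Collectors.counting()));
        Map<Thread.State, Long> snapshot = new EnumMap<>(Thread.State.class);
        for (Thread.State state : Thread.State.values()) {
            snapshot.put(state, counts.getOrDefault(state, 0L));
        }
        return snapshot; // every state key is present, so remove() == null reliably means "already consumed"
    }
}
```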
There's no guarantee about how many times a gauge is queried per step interval. Many registry implementations would only need to query once per step, but, for example, a Prometheus setup with multiple Prometheus servers scraping the same instances for redundancy would read each gauge more than once per step.
Multiple reads will not be a problem: they will trigger the expiration of the snapshot and the re-creation of a new one, so performance returns to the same as before. However, I think that is the extreme case; in the vast majority of cases, this commit reduces repeated calls and optimizes performance.
@shakuzen Do you have any other comments on this?
Fixes #3658; the issue was introduced in #3651.
Resolved the bugs mentioned there, re-submitted the PR, and added one line to the unit test in order to pass it.
I think that within the same request to read the thread metrics, it is not necessary to be real-time if a thread's state has changed; we just need to make sure that the next read gets the latest data. This modification meets that goal.