Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reporting number of threads per state #3664

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

asasas234
Copy link
Contributor

@asasas234 asasas234 commented Feb 22, 2023

fixes #3658 the issue was introduced in #3651

Resolved the bugs mentioned here, and re-submitted 1 PR and added 1 line of single test, in order to pass the test.

I think during the same request to get ThreadMetrics, if the state of the thread has changed, it is not necessary to be real-time, just make sure that the next time you get it, you can get the latest one. So this modification meets that goal.

@asasas234 asasas234 changed the title Resolved the bugs and added 1 line of code for unit testing, in order… fixbug JvmThreadMetrics Feb 22, 2023
@asasas234
Copy link
Contributor Author

@lenin-jaganathan @shakuzen @jonatan-ivanov Please review it again when you have time.

@jonatan-ivanov
Copy link
Member

@asasas234 Thank you for the PR!
I think this should fix the issue and mean some improvements since fetching the gauges are usually single threaded. It was not very easy to understand so I moved a few things around (ccd8661) to improve readability. Hope it helps.

I think not being real time is actually a benefit than a drawback. It is because I think the gauges should be consistent to each other. With snapshotting, this is ensured since you create one snapshot and then use that. With querying them all the time they can be inconsistent. Let's say you have only one thread which is transitioning from state X to Y. The Gauge that tracks state X, queries the number of threads, got 1, then the transition happens, the Gauge that tracks state X, queries the number of threads, got 1 too so you will see two threads if you sum your Gauges even though you only have one. With snapshotting, this should not be possible since one consistent snapshot is used per publishing.

@jonatan-ivanov jonatan-ivanov changed the title fixbug JvmThreadMetrics Fix reporting number of threads per state Feb 23, 2023
Copy link
Contributor Author

@asasas234 asasas234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything wrong with it, it looks like it just elevates the threadBean to a property and changes the variable and method naming and comments

@asasas234
Copy link
Contributor Author

@jonatan-ivanov I see that this PR has not been merged, I am not sure what the reason is? Is there anything else I need to do?

@jonatan-ivanov
Copy link
Member

@asasas234 I would like @shakuzen to take a look at this too, I think I'm ok with merging it in.

@@ -47,18 +43,22 @@ public class JvmThreadMetrics implements MeterBinder {

private final Iterable<Tag> tags;

private final ThreadMXBean threadBean;

private final ThreadLocal<Map<Thread.State, Long>> threadCountsByStateThreadLocal;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we're using a ThreadLocal to store the thread states. While we typically use a thread pool with a single thread to publish metrics, that is not guaranteed. If reading of the thread state guages is done on arbitrarily many threads, we end up using a copy of the thread states on each thread and I think we could have potential memory leaks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. I don't think using ThreadLocal is a great way to ensure that we reduce the call to ThreadMxBean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen @lenin-jaganathan I don't think this will create a memory leak, the number of threads inside the JVM is usually limited so it won't cause much memory waste, and threads are usually multiplexed, when the same metrics are queried for the second time or each time the metrics of all thread states are fetched the ThreadLocal's remove operation is triggered. So I don't think there is any need to worry about this issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not particularly worried about memory leaks for the same reason @shakuzen mentioned: "we typically use a thread pool with a single thread to publish metrics, that is not guaranteed". No real objection from me this approach and after having a second look at the change, "when the same metrics are queried for the second time" - I believe this solution should work just fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the same metrics are queried for the second time or each time the metrics of all thread states are fetched the ThreadLocal's remove operation is triggered.

This assumes that the same thread is reading all thread states which is not guaranteed. It seems like ThreadLocal was chosen for convenience because it has some API around setting initial values after removing, but semantically, what we want is a static map that gets refreshed after all keys are read once, which shouldn't depend on whether the same thread is being used or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen If the metrics are read by other threads, then the initialization logic of threadlocal will be followed and there will be no problem. If you think there is still a problem, can you describe it in detail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen Unless you're talking about multiple threads reading metrics, and each time reading a different thread state? For example.
Request A to read the runnable state of aThreadLocal
and then request B reads the blocked state of aThreadLocal?
It is true that this scenario does not result in an update of the threadlocal, but I think it is necessary to evaluate whether this scenario really exists.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen If this is the case, there is a solution. My idea is to add a property that records the time of the last read of metrics, which is stored in the same threadlocal with the stateCountGroup, and each time the data is read from the stateCountGroup, determine the time of the last read, and if exceeds the limit, for example, 1 second, the cache refresh operation is triggered. If the cache is absolutely sufficient for ordinary scenarios within 1 second, I don't think it will be a problem if different states are read by different requests within 1 second, because the read state data is also relatively real-time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asasas234 I talked with @jonatan-ivanov about this today. I think we're all in agreement that the general behavior we want is to retrieve the thread states once for the gauges together per some period of time. We discussed that maybe this is a more general use case that we could update MultiGauge to support. Currently the way MultiGauge works is that you need to register rows, but another way it could work is that a function is registered that will produce the rows (gauges) when it is queried perhaps with some interval before it refreshes. I'll need to spend some more time thinking about if it is feasible to evolve the MultiGauge API for this purpose and how that affects publishing in the different registries.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need more time to work on what I mentioned in my previous comment. I've opened #3698 to track it.

Long count = threadCountsByState.remove(state);

// This means that this state was already queried. Assuming that a Gauge will be
// queried once per publication, this should mean that we need a new snapshot.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no guarantee about how many times a gauge is queried per step interval. Many registry implementations would only have need to query once per step, but for example, a Prometheus setup with multiple Prometheus servers scraping the same instances for redundancy would read each gauge more than once per step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple reads will not be a problem, will trigger the expiration of the snapshot, as well as re-create the snapshot, performance will return to the same as before, however, this is the extreme case I think, in the vast majority of cases, this commit is to reduce repeated calls and optimize performance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shakuzen Do you have any other comments on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

"jvm.threads.states" reports stale data
4 participants