-
Notifications
You must be signed in to change notification settings - Fork 1k
Manually measure "spike" metrics #4964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
4030d71 to
dbb6479
Compare
1c22cf6 to
7f8ebc5
Compare
6a79f8f to
d76bba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this change! Our timers haven't been in a good place for sometime, and this will be very helpful in getting us back to a more feature rich and performant metrics system.
I've left a few comments. At a high level, I think this looks good, but we should probably do some refactoring. In general, this library seems a little complicated from the perspective of the user. There is a templated system that requires non-trivial constructor syntax with tuples and argument forwarding, and a pretty subtle requirement that the caller needs to expose a sync function when using this version of the timer, where our normal medida Timer is ready out of the box. Sometimes features are inherently complex, but I think we can probably avoid this code complexity for this particular case.
Additionally, this is a pretty sensitive perf change, as the reason we removed timers in the first place was due high overhead. While I think this looks good, I'd like to see some analysis with tracy to verify that this change doesn't add additional performance overhead.
One other procedural note, the commit history is very long and difficult to follow. For a reviewer, it's often helpful to step through commit by commit to follow along with the process and better understand the feature. The commits are so granular that it's challenging for me to follow your process. We'll of course rebase and squash when we actually merge, but I like to have each of my commits reflect a single logical change, feature, or refactor, as apposed to a line by line change log for the sake of PR review.
src/bucket/BucketSnapshotManager.h
Outdated
| // Lock must be held when accessing any member variables holding snapshots | ||
| mutable SharedMutex mSnapshotMutex; | ||
|
|
||
| mutable Mutex mSimpleTimerRegistryMutex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it makes much sense for BucketSnapshotManager to manage the metrics registry. While this initial PR mostly deals with BucketList related loads, there are other time based metrics that also require simple timer, see "herder.pending[-soroban]-txs.sum". I think what makes the most sense to me is to create a lightweight wrapper that combines our SimpleTimer metric registry with medida::MetricsRegistry. Maybe we could call this something like stellar::MetricRegistry and have this be the entry point for both registries. This type would then be returned by Application::getMetrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable enough, I hadn't been sure if that was too intrusive of a change. Maybe worth noting that right now the registry is only for the snapshot metrics—the other SimpleTimers are owned by their respective classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya I think given the sync requirement and in order to maintain a unified interface with our other metrics, it would be helpful.
src/herder/TransactionQueue.h
Outdated
| medida::Counter& transactionsDelayCounter, | ||
| medida::Counter& transactionsSelfDelayAccumulator, | ||
| medida::Counter& transactionsSelfDelayCounter, | ||
| std::tuple<medida::MetricsRegistry&, std::string&&, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems overly complicated for an operation we do often (create a metric in a constructor). I think we can simplify this to a shared interface with app.getMetrics().getSimpleTimer() if we follow the wrapper suggestion from the comment in BucketSnapshotManager.h.
src/util/SimpleTimer.h
Outdated
| // while keeping track of the maximums. Names are based on the constructor with | ||
| // a suffix of `sum`, `count`, or `max`. Timers can be expensive, so this class | ||
| // replaces them with a subset of the functionality using Counters internally. | ||
| template <typename Duration> class SimpleTimer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have a templated duration here? In medida::Timer, it looks like all Timers have a duration of std::chrono::nanoseconds. We can always reduce resolution from ns and convert larger values, it seems like the duration is adding uncecessary caller complexity and also making our constructor/registration process more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Conceptually, the duration felt like a reasonable part of the type—I'll update, though.
src/util/SimpleTimer.h
Outdated
| // Specify `windowSize` to auto-call `syncMax` on every `Update` that takes | ||
| // place at least `windowSize` since the last `syncMax`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this comment means, given that there is no windowSize parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, that's left over from before the previous review.
src/util/SimpleTimer.h
Outdated
| void | ||
| SimpleTimer<Duration>::syncMax() | ||
| { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: extra whitespace
src/util/SimpleTimer.h
Outdated
| std::chrono::nanoseconds Stop(); | ||
| }; | ||
|
|
||
| template <typename Duration> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: it would be nice to separate into a .cpp file to keep in line with our general code style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, will do now that I'm removing the template.
src/util/SimpleTimer.h
Outdated
| isDuration<Duration>, | ||
| "SimpleTimer must be instantiated with a std::chrono::duration"); | ||
| medida::Counter& mSum; | ||
| medida::Counter& mCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term "counter" is overloaded a bit here between mMaxCounter and mCount. Maybe change the names to be a little more clear, mSampleCount and mMaxSampleValue or something like that.
src/bucket/BucketManager.h
Outdated
|
|
||
| // Ensure any metrics that are "current state" gauge-like counters reflect | ||
| // the current reality as best as possible. | ||
| void syncMetrics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current design, where callers of the metric have to wire in a sync functionality, is flawed. While we do have some syncMetrics functions already, these are special cases, where the metric has something weird going on that the caller want to maintain. This is a bit different. Here we have a timer that just won't work properly unless synced and always requires the caller to manage sync. It looks like syncMetrics is thread safe. Could we not just call this from Application on all SimpleTimers registered? This would be much simpler, and shouldn't be too hard if we refactor the metrics registration to Application (or something similar).
e9cb7e6 to
8eeec45
Compare
|
Changed to do an application-wide registry that wraps |
8eeec45 to
25933b6
Compare
25933b6 to
59f0670
Compare
SirTyson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your changes! It looks like there was a bit of a painful find-replace on the refactor, but I appreciate it, the new unified interface is a lot cleaner and more streamlined. Overall I think this looks good and almost done, I just have a few nitpics, and one question about the duration parameter.
| { | ||
| } | ||
| namespace medida | ||
| namespace stellar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be removed and just use the stellar namespace bellow
| // Get the registry of metrics owned by this application. Metrics are | ||
| // reported through the administrative HTTP interface, see CommandHandler. | ||
| virtual medida::MetricsRegistry& getMetrics() = 0; | ||
| virtual stellar::MetricsRegistry& getMetrics() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think stellar:: is necessary here
| , mSelfCheckTimer(*this) | ||
| , mMetrics( | ||
| std::make_unique<medida::MetricsRegistry>(cfg.HISTOGRAM_WINDOW_SIZE)) | ||
| std::make_unique<stellar::MetricsRegistry>(cfg.HISTOGRAM_WINDOW_SIZE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the stellar:: namespace is necessary in this file.
| .set_count(fs::getOpenHandleCount()); | ||
|
|
||
| // Update simple timer metrics | ||
| mMetrics->syncMaxes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a good place for a bit of a longer comment explaining why we need to periodically sync the SimpleTimer.
| virtual bool isStopping() const override; | ||
| virtual VirtualClock& getClock() override; | ||
| virtual medida::MetricsRegistry& getMetrics() override; | ||
| virtual stellar::MetricsRegistry& getMetrics() override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this and the rest of the file, I think we can drop the stellar when dealing with our new MetricsRegistry
|
|
||
| // Call syncMax() on each simple timer--i.e., set max metric to the tracked | ||
| // max, and reset max tracking. | ||
| void syncMaxes(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this a little more descriptive? Maybe syncSimpleTimerStats or syncSimpleTimerMaxes.
| class MetricsRegistry : public medida::MetricsRegistry | ||
| { | ||
| Mutex mLock; | ||
| // Note that we can use SimpleTimer instead of shared_ptr as medida does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I don't think I really understand this comment. I get that there's an implementation difference between this and the medida registry, but it seems like a very detailed explanation for something pretty straight forward, i.e., we just need to keep track of non-polymorphic timers so a map is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a map in medida too—I was initially confused why medida used shared_ptrs for the value type (polymorphism + returning the map). Off the top of my head, I also didn't know that map iterators were stable, but I'm happy to remove the comment.
|
|
||
| SimpleTimer::SimpleTimer(MetricsRegistry& registry, | ||
| medida::MetricName const& name, | ||
| std::chrono::nanoseconds durationUnit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this durationUnit parameter necessary? It looks like every other medidia::timer we use just uses the default duration unit. If this is correct, I'd rather just drop the param (or at least give it a default value) so SimpleTimer and medida::Timer maintain the same interface. If this isn't correct, we definitely need to look into it (outside the scope of this PR, but we'd need to file a follow up issue and at least investigate).
| {BucketT::METRIC_STRING, label, "count"}); | ||
| mPointCounters.emplace(static_cast<LedgerEntryType>(t), counter); | ||
| auto& metric = app.getMetrics().NewSimpleTimer( | ||
| {BucketT::METRIC_STRING, label, "sum"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this label is incorrect, as it will have the name sum-sum
| app.getMetrics().NewCounter( | ||
| {"herder", "pending-soroban-txs", "self-count"}), | ||
| app.getMetrics().NewSimpleTimer( | ||
| {"herder", "pending-soroban-txs", "sum"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue here, this will get called sum-sum. Is there a better way to pass in label information? The current param is a little confusing, since it looks the same a NewTimer but is subtly different. Not a huge deal and I can't think of a better way off the top of my head, but just an observation since we have the same param bug repeated in a few places, and passing "" as the name seems counterintuitive.
Resolves #4729. Implements a new class to use for timer replacement when timers are too expensive. For timers replaced in #4670, uses this new class. This class keeps track of the same sum and count metrics as introduced in #4670, but also keeps track of a max. Each max metric represents the max value since the previous
/metricscall.Other notes
BucketSnapshotManager(although the registry only producesSimpleTimer<std::chrono::microseconds>currently)./clearendpoint doesn't correctly reset the internalmaxtracker, but this just causes inaccuracies for the nextsyncMax()callSimpleTimer registry
Initially,
That caused errors because there wasn't a great way to clean up the registry on application destruction (putting it in e.g.,
~BucketManagercaused an error because~ApplicationImplwas already called by then).So, currently, it lives in
BucketSnapshotManager. The only issue I have with that is that theSearchableBucketListSnapshotBasehas aconstref to the manager, so there's a use ofmutableon themap, which seems at least slightly misleading to me. (On the other hand, we do usemutablefor a number of the metrics related variables inSearchableBucketListSnapshotBase.)