Skip to content

Conversation

@drebelsky
Copy link
Contributor

@drebelsky drebelsky commented Oct 9, 2025

Resolves #4729. Implements a new class to use for timer replacement when timers are too expensive. For timers replaced in #4670, uses this new class. This class keeps track of the same sum and count metrics as introduced in #4670, but also keeps track of a max. Each max metric represents the max value since the previous /metrics call.

Other notes

  • I like SimpleTimer having the base unit as part of the type, but I can see a case for it to be switched such that we don't have the template. This is perhaps more relevant now that there is a SimpleTimer registry in BucketSnapshotManager (although the registry only produces SimpleTimer<std::chrono::microseconds> currently).
  • The /clear endpoint doesn't correctly reset the internal max tracker, but this just causes inaccuracies for the next syncMax() call

SimpleTimer registry

Initially,

The bucket metric uses static maps in its creation of the timer because all the timers need to share the same max. I chose to put the pseudo-registry as a static map because it felt like the least invasive change, especially if we might be removing Medida (#4945). I also couldn't think of a particularly better place to have a simple timer registry. With that said, one alternative would be to add a registry to Application, which would remove the need for much of the plumbing in this PR. Another would be to create a SimpleTimer.cpp and do the same sort of medida::MetricsRegistry -> simple timer registry map. One issue with the current approach and this last approach is that the pseudo-registry doesn't currently clean up old entries (e.g., in testing, each new Application will create a new entry in the pseudo-registry).

That caused errors because there wasn't a great way to clean up the registry on application destruction (putting it in e.g., ~BucketManager caused an error because ~ApplicationImpl was already called by then).

So, currently, it lives in BucketSnapshotManager. The only issue I have with that is that the SearchableBucketListSnapshotBase has a const ref to the manager, so there's a use of mutable on the map, which seems at least slightly misleading to me. (On the other hand, we do use mutable for a number of the metrics related variables in SearchableBucketListSnapshotBase.)

@drebelsky drebelsky changed the title DRAFT: Unify timer replacement implementation and add max tracking DRAFT: Manually measure "spike" metrics Oct 9, 2025
@drebelsky drebelsky changed the title DRAFT: Manually measure "spike" metrics Manually measure "spike" metrics Oct 21, 2025
@drebelsky drebelsky force-pushed the measure-spikes branch 2 times, most recently from 1c22cf6 to 7f8ebc5 Compare October 22, 2025 21:22
@drebelsky drebelsky marked this pull request as ready for review November 6, 2025 18:07
@drebelsky drebelsky requested a review from bboston7 November 6, 2025 19:01
bboston7
bboston7 previously approved these changes Nov 7, 2025
@drebelsky drebelsky requested a review from SirTyson November 7, 2025 06:45
Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this change! Our timers haven't been in a good place for sometime, and this will be very helpful in getting us back to a more feature rich and performant metrics system.

I've left a few comments. At a high level, I think this looks good, but we should probably do some refactoring. In general, this library seems a little complicated from the perspective of the user. There is a templated system that requires non-trivial constructor syntax with tuples and argument forwarding, and a pretty subtle requirement that the caller needs to expose a sync function when using this version of the timer, where our normal medida Timer is ready out of the box. Sometimes features are inherently complex, but I think we can probably avoid this code complexity for this particular case.

Additionally, this is a pretty sensitive perf change, as the reason we removed timers in the first place was due high overhead. While I think this looks good, I'd like to see some analysis with tracy to verify that this change doesn't add additional performance overhead.

One other procedural note, the commit history is very long and difficult to follow. For a reviewer, it's often helpful to step through commit by commit to follow along with the process and better understand the feature. The commits are so granular that it's challenging for me to follow your process. We'll of course rebase and squash when we actually merge, but I like to have each of my commits reflect a single logical change, feature, or refactor, as apposed to a line by line change log for the sake of PR review.

// Lock must be held when accessing any member variables holding snapshots
mutable SharedMutex mSnapshotMutex;

mutable Mutex mSimpleTimerRegistryMutex;
Copy link
Contributor

@SirTyson SirTyson Nov 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it makes much sense for BucketSnapshotManager to manage the metrics registry. While this initial PR mostly deals with BucketList related loads, there are other time based metrics that also require simple timer, see "herder.pending[-soroban]-txs.sum". I think what makes the most sense to me is to create a lightweight wrapper that combines our SimpleTimer metric registry with medida::MetricsRegistry. Maybe we could call this something like stellar::MetricRegistry and have this be the entry point for both registries. This type would then be returned by Application::getMetrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable enough, I hadn't been sure if that was too intrusive of a change. Maybe worth noting that right now the registry is only for the snapshot metrics—the other SimpleTimers are owned by their respective classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya I think given the sync requirement and in order to maintain a unified interface with our other metrics, it would be helpful.

medida::Counter& transactionsDelayCounter,
medida::Counter& transactionsSelfDelayAccumulator,
medida::Counter& transactionsSelfDelayCounter,
std::tuple<medida::MetricsRegistry&, std::string&&,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems overly complicated for an operation we do often (create a metric in a constructor). I think we can simplify this to a shared interface with app.getMetrics().getSimpleTimer() if we follow the wrapper suggestion from the comment in BucketSnapshotManager.h.

// while keeping track of the maximums. Names are based on the constructor with
// a suffix of `sum`, `count`, or `max`. Timers can be expensive, so this class
// replaces them with a subset of the functionality using Counters internally.
template <typename Duration> class SimpleTimer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have a templated duration here? In medida::Timer, it looks like all Timers have a duration of std::chrono::nanoseconds. We can always reduce resolution from ns and convert larger values, it seems like the duration is adding uncecessary caller complexity and also making our constructor/registration process more complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conceptually, the duration felt like a reasonable part of the type—I'll update, though.

Comment on lines 37 to 38
// Specify `windowSize` to auto-call `syncMax` on every `Update` that takes
// place at least `windowSize` since the last `syncMax`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this comment means, given that there is no windowSize parameter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, that's left over from before the previous review.

void
SimpleTimer<Duration>::syncMax()
{

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extra whitespace

std::chrono::nanoseconds Stop();
};

template <typename Duration>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it would be nice to separate into a .cpp file to keep in line with our general code style.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do now that I'm removing the template.

isDuration<Duration>,
"SimpleTimer must be instantiated with a std::chrono::duration");
medida::Counter& mSum;
medida::Counter& mCount;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the term "counter" is overloaded a bit here between mMaxCounter and mCount. Maybe change the names to be a little more clear, mSampleCount and mMaxSampleValue or something like that.


// Ensure any metrics that are "current state" gauge-like counters reflect
// the current reality as best as possible.
void syncMetrics();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the current design, where callers of the metric have to wire in a sync functionality, is flawed. While we do have some syncMetrics functions already, these are special cases, where the metric has something weird going on that the caller want to maintain. This is a bit different. Here we have a timer that just won't work properly unless synced and always requires the caller to manage sync. It looks like syncMetrics is thread safe. Could we not just call this from Application on all SimpleTimers registered? This would be much simpler, and shouldn't be too hard if we refactor the metrics registration to Application (or something similar).

@drebelsky
Copy link
Contributor Author

Changed to do an application-wide registry that wraps medida::MetricsRegistry. This might be a little excessive for this PR (and I'm happy to separate so that we have, e.g., a getMetrics() and getSimpleTimerMetrics()), but it does also lead more nicely into potentially replacing medida later on.

Copy link
Contributor

@SirTyson SirTyson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your changes! It looks like there was a bit of a painful find-replace on the refactor, but I appreciate it, the new unified interface is a lot cleaner and more streamlined. Overall I think this looks good and almost done, I just have a few nitpics, and one question about the duration parameter.

{
}
namespace medida
namespace stellar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed and just use the stellar namespace bellow

// Get the registry of metrics owned by this application. Metrics are
// reported through the administrative HTTP interface, see CommandHandler.
virtual medida::MetricsRegistry& getMetrics() = 0;
virtual stellar::MetricsRegistry& getMetrics() = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think stellar:: is necessary here

, mSelfCheckTimer(*this)
, mMetrics(
std::make_unique<medida::MetricsRegistry>(cfg.HISTOGRAM_WINDOW_SIZE))
std::make_unique<stellar::MetricsRegistry>(cfg.HISTOGRAM_WINDOW_SIZE))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the stellar:: namespace is necessary in this file.

.set_count(fs::getOpenHandleCount());

// Update simple timer metrics
mMetrics->syncMaxes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be a good place for a bit of a longer comment explaining why we need to periodically sync the SimpleTimer.

virtual bool isStopping() const override;
virtual VirtualClock& getClock() override;
virtual medida::MetricsRegistry& getMetrics() override;
virtual stellar::MetricsRegistry& getMetrics() override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and the rest of the file, I think we can drop the stellar when dealing with our new MetricsRegistry


// Call syncMax() on each simple timer--i.e., set max metric to the tracked
// max, and reset max tracking.
void syncMaxes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a little more descriptive? Maybe syncSimpleTimerStats or syncSimpleTimerMaxes.

class MetricsRegistry : public medida::MetricsRegistry
{
Mutex mLock;
// Note that we can use SimpleTimer instead of shared_ptr as medida does
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I don't think I really understand this comment. I get that there's an implementation difference between this and the medida registry, but it seems like a very detailed explanation for something pretty straight forward, i.e., we just need to keep track of non-polymorphic timers so a map is fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a map in medida too—I was initially confused why medida used shared_ptrs for the value type (polymorphism + returning the map). Off the top of my head, I also didn't know that map iterators were stable, but I'm happy to remove the comment.


SimpleTimer::SimpleTimer(MetricsRegistry& registry,
medida::MetricName const& name,
std::chrono::nanoseconds durationUnit)
Copy link
Contributor

@SirTyson SirTyson Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this durationUnit parameter necessary? It looks like every other medidia::timer we use just uses the default duration unit. If this is correct, I'd rather just drop the param (or at least give it a default value) so SimpleTimer and medida::Timer maintain the same interface. If this isn't correct, we definitely need to look into it (outside the scope of this PR, but we'd need to file a follow up issue and at least investigate).

{BucketT::METRIC_STRING, label, "count"});
mPointCounters.emplace(static_cast<LedgerEntryType>(t), counter);
auto& metric = app.getMetrics().NewSimpleTimer(
{BucketT::METRIC_STRING, label, "sum"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this label is incorrect, as it will have the name sum-sum

app.getMetrics().NewCounter(
{"herder", "pending-soroban-txs", "self-count"}),
app.getMetrics().NewSimpleTimer(
{"herder", "pending-soroban-txs", "sum"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here, this will get called sum-sum. Is there a better way to pass in label information? The current param is a little confusing, since it looks the same a NewTimer but is subtly different. Not a huge deal and I can't think of a better way off the top of my head, but just an observation since we have the same param bug repeated in a few places, and passing "" as the name seems counterintuitive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Manually Measure "spike" metrics

3 participants