
Query Stats framework #2210

Open · wants to merge 40 commits into base: master
Conversation

@phoebusm (Collaborator) commented Feb 27, 2025

Reference Issues/PRs

https://man312219.monday.com/boards/7852509418/pulses/8297768017

What does this implement or fix?

  • Add a basic C++ framework for query stats
  • Add stats to list_symbols on S3 to demonstrate the framework

Any other comments?

For the Python layer, the change is minimal; it only serves to output something that can be verified in the test.
The missing functions will be handled in later PRs.

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@phoebusm phoebusm marked this pull request as draft February 27, 2025 16:34
github-actions bot commented Feb 27, 2025

Label error. Requires exactly 1 of: patch, minor, major. Found: enhancement

@phoebusm phoebusm added the enhancement New feature or request label Feb 28, 2025
def __sub__(self, other):
    return self._populate_stats(other._create_time)

def _populate_stats(self, other_time):
phoebusm (Collaborator, Author):

Boilerplate code to beautify the output for now. Changes and improvements are pending in later PRs.

@phoebusm phoebusm marked this pull request as ready for review February 28, 2025 14:41
@poodlewars (Collaborator) left a comment:

Nice job figuring out how to pass this stuff through Folly. Can I suggest merging a PR to start with that just introduces the custom Folly executors (with a suite of tests to show that the stats calculation works with both our task-based APIs and plain folly::via), and then we can figure out the other discussions after.

It would have been helpful if your PR description had explained your design.


// The first overload calls the second one inside folly. We have to override both because they are overloads.
// Called by the submitter when a task is submitted to an executor
void add(folly::Func func) override {
poodlewars (Collaborator):

I don't quite follow why we need this kind of no-op override?

phoebusm (Collaborator, Author):

It's C++ name hiding: when a parent function is overloaded, you need to override (or re-expose) all of its overloads, otherwise the remaining ones are hidden.
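For future readers, a minimal illustration of the name-hiding rule (generic names, not the PR's classes):

#include <chrono>

struct Base {
    virtual void add(int x) {}
    virtual void add(int x, std::chrono::milliseconds timeout) {}
    virtual ~Base() = default;
};

struct Derived : Base {
    void add(int x) override {}  // overrides only one overload
};

int main() {
    Derived d;
    d.add(1);
    // d.add(1, std::chrono::milliseconds{5});  // error: the two-argument
    // overload is hidden by Derived::add; fix by overriding it too,
    // or by declaring `using Base::add;` in Derived.
    return 0;
}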

@@ -174,11 +175,73 @@ inline auto get_default_num_cpus([[maybe_unused]] const std::string& cgroup_fold
* 3/ Priority: How to assign priorities to task in order to treat the most pressing first.
* 4/ Throttling: (similar to priority) how to absorb work spikes and apply memory backpressure
*/

class CustomIOThreadPoolExecutor : public folly::IOThreadPoolExecutor{
poodlewars (Collaborator):

This seems like a use-case for the CRTP rather than one copy of the code for IO and one for CPU.

The name is a bit weird, CustomIOThreadPoolExecutor could apply to any subclass of IOThreadPoolExecutor regardless of its purpose. StatsContextIOThreadPoolExecutor?

phoebusm (Collaborator, Author):

Good point; the 4 classes can be merged into 2. Updated.
Though CRTP may not be necessary to get the job done?
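A rough sketch of the deduplication being discussed: strictly a base-parameterised mixin template rather than classic CRTP, which is one way the job gets done without CRTP. Class names here are illustrative (StatsContextIOThreadPoolExecutor follows the reviewer's suggestion); the wrapping body is a placeholder:

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/executors/IOThreadPoolExecutor.h>

// One implementation of the wrapping logic, stamped out for both pool types.
template <typename Base>
class StatsContextExecutor : public Base {
public:
    using Base::Base;  // inherit the pool's constructors

    void add(folly::Func func) override {
        Base::add(wrap(std::move(func)));
    }
    // The timed add(Func, milliseconds, Func) overload would be wrapped the
    // same way, per the name-hiding discussion above.

private:
    static folly::Func wrap(folly::Func func) {
        return [func = std::move(func)]() mutable {
            // restore the query-stats context here, then run the task
            func();
        };
    }
};

using StatsContextIOThreadPoolExecutor = StatsContextExecutor<folly::IOThreadPoolExecutor>;
using StatsContextCPUThreadPoolExecutor = StatsContextExecutor<folly::CPUThreadPoolExecutor>;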

class IOSchedulerType : public folly::FutureExecutor<CustomIOThreadPoolExecutor> {
public:
    template<typename... Args>
    IOSchedulerType(Args&&... args) : folly::FutureExecutor<CustomIOThreadPoolExecutor>(std::forward<Args>(args)...) {}
poodlewars (Collaborator):

CRTP for this too I think

@@ -194,17 +257,27 @@ class TaskScheduler {
        auto task = std::forward<decltype(t)>(t);
        static_assert(std::is_base_of_v<BaseTask, std::decay_t<Task>>, "Only supports Task derived from BaseTask");
        ARCTICDB_DEBUG(log::schedule(), "{} Submitting CPU task {}: {} of {}", uintptr_t(this), typeid(task).name(), cpu_exec_.getTaskQueueSize(), cpu_exec_.kDefaultMaxQueueSize);
        // Executor::add will be called before the function below
poodlewars (Collaborator):

Why do we need this here too? Don't your custom executors handle this for us regardless of whether futures are scheduled with normal Folly APIs or our own task-based wrappers?

phoebusm (Collaborator, Author):

Here copy_instance is needed, as the instance has to be copied from the caller for each worker.
The one in the executor calls pass_instance, as the instance just needs to be passed along the pipeline.
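To spell out the distinction, a self-contained sketch with a stand-in StatsInstance; the copy_instance/pass_instance names come from this thread, but their signatures and bodies are guesses, not the PR's code:

#include <memory>
#include <utility>

struct StatsInstance {
    static thread_local std::shared_ptr<StatsInstance> current;
    static std::shared_ptr<StatsInstance>& instance() { return current; }

    // Submission path (copy_instance): each worker gets its own copy of the
    // caller's instance, so parallel tasks do not share mutable state.
    static void copy_instance(const std::shared_ptr<StatsInstance>& parent) {
        current = std::make_shared<StatsInstance>(*parent);
    }

    // Executor path (pass_instance): continuations on the same pipeline just
    // inherit the pointer; no copy is needed.
    static void pass_instance(std::shared_ptr<StatsInstance> inst) {
        current = std::move(inst);
    }
};
thread_local std::shared_ptr<StatsInstance> StatsInstance::current;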

        auto task_with_stat_query_wrap = [parent_instance = util::stats_query::StatsInstance::instance(), task = std::move(task)]() mutable {
poodlewars (Collaborator):

I don't understand what task_with_stat_query_wrap is supposed to mean


}

#define GROUPABLE_STAT_NAME(x) stats_query_info##x
poodlewars (Collaborator):

I don't understand how I'm meant to use these APIs. A C++ unit test suite would help. How does the grouping work? Am I able to specify a grouping on a composite, like: increment the counter for objects of this key type seen during this storage operation during this ArcticDB operation?

phoebusm (Collaborator, Author):

Added the explanation to the description of this PR.

poodlewars (Collaborator):

This must have a C++ unit test suite

stats = query_stats_tools_end - query_stats_tools_start
"""
Expected output; time values are not deterministic
arcticdb_call stage key_type storage_op parallelized count time_count_20 time_count_510
poodlewars (Collaborator):

What does time_count_{20,510} mean?

This can be done later, but let's have human-readable key types in the output (like TABLE_DATA).

phoebusm (Collaborator, Author):

The titles will be more intuitive in the JSON format.

"""
Expected output; time values are not deterministic
arcticdb_call stage key_type storage_op parallelized count time_count_20 time_count_510
0 list_streams None None None None None 0 1
poodlewars (Collaborator):

list_streams isn't a Python API method so won't mean anything to the user. How has it ended up in this output?

phoebusm (Collaborator, Author):

It's manually named. Will be updated

query_stats_tools_end = StatsQueryTool()
stats = query_stats_tools_end - query_stats_tools_start
"""
Expected output; time values are not deterministic
@poodlewars (Collaborator) commented Mar 3, 2025:

I think we might need to rethink the idea of the stats output being a dataframe. It seems hard to answer the most important questions like "how long did I spend running list_symbols in total", "how much of that was compaction" with the dataframe API. How would you get that information from the proposed dataframe output? We could always have utilities to transform some strongly typed stats output to a dataframe for the subset of measurements where that makes sense (eg these breakdowns of storage operations).

Also, the dataframe API forces all the operations to share the same histogram buckets, which probably isn't suitable.

phoebusm (Collaborator, Author):

Agreed. Will change to JSON.

@poodlewars (Collaborator) commented Mar 3, 2025:

I can see why it has this, but this design has a kind of "no schema" approach to the stats; the schema is generated dynamically based on the macro invocations. I think it may be better to have a defined schema for the stats, just like Prometheus metric names get defined up front. I think the APIs to maintain the stats should be more like the Prometheus APIs to modify metrics.

I think your design is more similar to the APIs used by tracing libraries where you can add a hook anywhere you like, but this is quite different because we have to aggregate the metrics together.

This would add an extra chore when adding a new stat, but I think would make the whole thing clearer to people who don't use these APIs all the time (people may add new stats a couple of times a year so won't be familiar with this framework).

poodlewars (Collaborator):

We spoke about this a lot and a stricter schema has its own big downsides, so OK sticking with this. The strict schema forces context to be passed between layers of the stack, which is painful

@poodlewars (Collaborator):

How are you calculating the histogram buckets?

@phoebusm phoebusm changed the title Stats query framework Query Stats framework Mar 4, 2025
@phoebusm (Collaborator, Author) commented Mar 4, 2025:

How are you calculating the histogram buckets?

They are hardcoded 10ms buckets
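So, reading the earlier column names on that basis (my interpretation, not confirmed in the PR): a suffix like time_count_20 would be the count of timings falling in the fixed 10 ms bucket whose lower bound is 20 ms. Roughly:

#include <cstdint>

// With fixed 10 ms buckets, a timing maps to the bucket labelled by its
// lower bound, e.g. 23 ms -> "time_count_20", 514 ms -> "time_count_510".
constexpr std::int64_t bucket_width_ms = 10;

constexpr std::int64_t bucket_label(std::int64_t duration_ms) {
    return (duration_ms / bucket_width_ms) * bucket_width_ms;
}

static_assert(bucket_label(23) == 20);
static_assert(bucket_label(514) == 510);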

"""
Sample output:
{
"list_symbols": {
poodlewars (Collaborator):

Still no way to see how long list_symbols took, or how many uncompacted keys it saw?

phoebusm (Collaborator, Author):

Oops, I missed it in the conversion from df to map. Let me add it back.

poodlewars (Collaborator):

Also discussed including how many symbols are in the list_symbols result set

}

void QueryStats::register_new_query_stat_tool() {
    auto new_stat_tool_count = query_stat_tool_count.fetch_add(1, std::memory_order_relaxed) + 1;
poodlewars (Collaborator):

I don't really understand all these counters, what are they for? And the count isn't correct, is it (adding one in this thread after the atomic fetch_add)?

phoebusm (Collaborator, Author):

I don't really understand all these counters, what are they for?

The counter keeps track of how many Python QueryStatsTool objects are in use:

QueryStats.register_new_query_stat_tool()

If the counter > 0, query stats is ON.
If the counter reaches 0, query stats is OFF and the stats are cleared.

And the count isn't correct, is it (adding one in this thread after the atomic fetch_add)?

fetch_add returns the value from before the addition, so without the + 1 at the end we would be looking at the old value.
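That matches the standard semantics of std::atomic::fetch_add, which returns the value held immediately before the increment. A minimal check:

#include <atomic>
#include <cassert>
#include <cstdint>

int main() {
    std::atomic<std::int32_t> count{0};
    auto old_value = count.fetch_add(1, std::memory_order_relaxed);
    assert(old_value == 0);                 // fetch_add returns the pre-increment value
    assert(old_value + 1 == count.load());  // hence the trailing "+ 1" to get the new count
    return 0;
}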

Test

def test_query_stats_tool_counter(s3_version_store_v1):
to demonstrate the idea

#include <fmt/format.h>

namespace arcticdb::util::query_stats {
using StatsGroups = std::vector<std::shared_ptr<std::pair<std::string, std::string>>>;
@poodlewars (Collaborator) commented Mar 5, 2025:

a type alias for the pair would help me read this (same anywhere you have complicated structures built out of std types like this, eg StatsOutputFormat)

std::atomic<int32_t> query_stat_tool_count = 0;
std::mutex stats_mutex_;
//TODO: Change to std::list<std::pair<StatsGroups, std::pair<std::string, std::variant<std::string, xxx>>>
std::list<std::pair<StatsGroups, std::pair<std::string, std::string>>> stats;
poodlewars (Collaborator):

pull out structs and type alias for these 🙏
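For instance, something along these lines; the names are illustrative suggestions, not the PR's:

#include <list>
#include <memory>
#include <string>
#include <vector>

// One grouping key/value, e.g. {"arcticdb_call", "list_symbols"}.
struct StatsGroupEntry {
    std::string name;
    std::string value;
};
using StatsGroups = std::vector<std::shared_ptr<StatsGroupEntry>>;

// A single recorded stat together with the group stack it was logged under.
struct StatsEntry {
    StatsGroups groups;
    std::string name;
    std::string value;  // later: std::variant, per the TODO above
};
std::list<StatsEntry> stats;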

@@ -141,6 +141,90 @@ TEST(Async, CollectWithThrow) {
ARCTICDB_DEBUG(log::version(), "Collect returned");
}

TEST(Async, StatsQueryDemo) {
phoebusm (Collaborator, Author):

Demo for how to add/use query stats

@@ -165,21 +169,53 @@ inline auto get_default_num_cpus([[maybe_unused]] const std::string& cgroup_fold
#endif
}

/*
poodlewars (Collaborator):

This comment is still valid, why have you removed it?

phoebusm (Collaborator, Author):

That's a mistake

void add(folly::Func func,
         std::chrono::milliseconds expiration,
         folly::Func expireCallback) override {
    if (arcticdb::util::query_stats::QueryStats::instance().is_enabled_) {
poodlewars (Collaborator):

Why not a function for whether it's enabled?
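E.g. a trivial accessor; this signature is hypothetical, not in the PR:

// Hypothetical accessor; the PR currently exposes the member directly.
class QueryStats {
public:
    static QueryStats& instance();
    bool is_enabled() const { return is_enabled_; }
private:
    bool is_enabled_ = false;  // may want std::atomic<bool> given the threading
};
// Call sites become: if (QueryStats::instance().is_enabled()) { ... }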

        auto func_with_stat_query_wrap = [layer = util::query_stats::QueryStats::instance().current_layer(), func = std::move(func)](auto&&... vars) mutable {
poodlewars (Collaborator):

Just call this wrapped_func

        std::lock_guard lock{cpu_mutex_};
        return cpu_exec_.addFuture(std::move(task));
        if (arcticdb::util::query_stats::QueryStats::instance().is_enabled_) {
            auto task_with_stat_query_instance = [&parent_thread_local_var = util::query_stats::QueryStats::instance().thread_local_var_, task = std::move(task)]() mutable {
poodlewars (Collaborator):

Call this wrapped_task



using namespace arcticdb::util::query_stats;
auto query_stats_module = tools.def_submodule("QueryStats", "Stats query functionality");
poodlewars (Collaborator):

"Query stats" not "stats query"



auto query_stats_module = tools.def_submodule("QueryStats", "Stats query functionality");
poodlewars (Collaborator):

QueryStats isn't a conventional name for a Python module, I would have expected query_stats?


py::enum_<StatsGroupName>(query_stats_module, "StatsGroupName")
poodlewars (Collaborator):

Some ideas about naming, given that these are all in a module called QueryStats anyway:

StatsGroupName -> GroupName
StatsName -> StatisticName
StatsGroupLayer -> GroupingLevel
current_layer -> current_level
root_layers -> root_levels (not sure why it isn't just root_level())

query_stats_module.def("current_layer", []() {
    return QueryStats::instance().current_layer();
});
query_stats_module.def("root_layers", []() {
poodlewars (Collaborator):

I'm slightly surprised you're exposing these levels to the Python layer, but not a big deal

phoebusm (Collaborator, Author):

Yeah, I want to move the formatting to the Python level as much as I can.


def test_query_stats(s3_version_store_v1, clear_query_stats):
    s3_version_store_v1.write("a", 1)
    QueryStatsTool.enable()
poodlewars (Collaborator):

Can the API just be free functions? I'm not sure why we need a QueryStatsTool object:

enable_query_stats()
disable_query_stats()
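On the binding side, that could be two plain defs on the existing submodule. A sketch only: the enable()/disable() methods on QueryStats are assumed here, not in the PR:

// Sketch: free functions instead of a QueryStatsTool class.
query_stats_module.def("enable", []() { QueryStats::instance().enable(); });
query_stats_module.def("disable", []() { QueryStats::instance().disable(); });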


next_layer_map = next_layer_maps[group_idx]

# top level
poodlewars (Collaborator):

I think you should have a check in this function that the arcticdb_call is indeed at the top level of the stats object we're processing

phoebusm (Collaborator, Author):

Good point, but I'd rather add the check at the C++ layer.

* so the stats logged in folly threads will be aggregated to the master map
* (Checking will be added after all log entries are added)
* 2. All folly tasks must be submitted through the TaskScheduler::submit_cpu_task/submit_io_task
* 3. All folly tasks must complete ("collected") before the last StatsGroup object is destroyed in the call stack
poodlewars (Collaborator):

What happens if a task fails? Should add testing in your C++ test for that

@phoebusm (Collaborator, Author) commented Mar 12, 2025:

It will still work. All I need is for the task to complete, failed or not, to avoid a race condition.
Will update the C++ test for that.
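A plain-folly sketch of that test shape (not the PR's TaskScheduler; it assumes only standard folly APIs): a task that throws still completes, so waiting on its future is enough to avoid racing the stats aggregation.

#include <folly/executors/CPUThreadPoolExecutor.h>
#include <folly/futures/Future.h>
#include <cassert>
#include <stdexcept>

int main() {
    folly::CPUThreadPoolExecutor exec{1};
    auto fut = folly::via(&exec, [] { throw std::runtime_error("task failed"); })
                   .thenTry([](folly::Try<folly::Unit> t) {
                       return t.hasException();  // true: the failure was still collected
                   });
    assert(std::move(fut).get());
    return 0;
}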

* When created, it adds a new layer and when destroyed, it restores the previous layer state
*
* Note:
* To make the query stats model works, there are two requirements:
poodlewars (Collaborator):

work not works
three


}

#define STATS_GROUP_VAR_NAME(x) query_stats_info##x
poodlewars (Collaborator):

I still don't think there's any need for these to be macros rather than normal functions? I get that you'd have to hold the StatsGroup alive for the RAII to work, but that should be fine.

phoebusm (Collaborator, Author):

Yeah, I am referencing the existing implementation, e.g. ARCTICDB_SAMPLE.
Declaring a variable is a bit weird for logging, and folding the ON/OFF into the StatsGroup class's constructor/destructor is also weird IMO.
And I want to make adding groupable and non-groupable stats uniform as well, from the user's perspective.
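For context, the ARCTICDB_SAMPLE-style pattern boils down to a macro declaring a uniquely named RAII variable. An illustrative expansion: STATS_GROUP_VAR_NAME is the PR's macro, but the STATS_GROUP wrapper and the StatsGroup constructor arguments are assumptions:

// STATS_GROUP_VAR_NAME is from the PR; the rest is an illustrative sketch.
#define STATS_GROUP_VAR_NAME(x) query_stats_info##x
#define STATS_GROUP(name, value) \
    arcticdb::util::query_stats::StatsGroup STATS_GROUP_VAR_NAME(name){#name, (value)}

void do_list_symbols() {
    STATS_GROUP(arcticdb_call, "list_symbols");  // constructor pushes a grouping layer
    // ... the instrumented work runs here ...
}  // destructor pops the layer, so one macro line gives scoped begin/end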

poodlewars (Collaborator):

Does list_symbols release the GIL? As soon as you instrument a function that does, should add tests to check how this all behaves with Python multi-threading

phoebusm (Collaborator, Author):

It doesn't release the GIL.
I simulate Python multi-threading in the C++ test Async.StatsQueryDemo.

Labels
enhancement New feature or request

2 participants