perf: Add memory profiling #1702


Merged: 9 commits merged from memory-profiling into apache:main on May 5, 2025

Conversation

@andygrove (Member) commented May 1, 2025

Which issue does this PR close?

Closes #1701

Rationale for this change

We need a way to profile memory.

What changes are included in this PR?

Logging of JVM and native memory utilization on each call to execute_plan. After running TPC-H locally with 8GB executor memory and 8GB off-heap memory, I see these stats:

JVM_MEMORY: { heapUsed: 2696, heapCommitted: 4394, nonHeapUsed: 142, nonHeapCommitted: 160 }
NATIVE_MEMORY: { resident: 11531.53 }

When running with multiple iterations, I see overall memory exceeding 16GB. If this were running in a container environment, I would expect the executor to be killed due to OOM.

JVM_MEMORY: { heapUsed: 4217, heapCommitted: 7440, nonHeapUsed: 145, nonHeapCommitted: 165 }
NATIVE_MEMORY: { resident: 18406.37 }
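
The NATIVE_MEMORY resident figure comes from /proc/self/statm: the second field is resident pages, multiplied by the page size and converted to MB. A minimal stdlib-only sketch of that conversion (the `parse_statm_resident_mb` helper and the 4096-byte page size are assumptions for illustration, not the PR's actual code):

```rust
/// Hypothetical helper: given the contents of /proc/self/statm and the page
/// size in bytes, return resident memory in MB. Field 2 of statm is the
/// number of resident pages.
fn parse_statm_resident_mb(statm: &str, page_size: u64) -> Option<f64> {
    let resident_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some((resident_pages * page_size) as f64 / (1024.0 * 1024.0))
}

fn main() {
    // On Linux this string would come from
    // std::fs::read_to_string("/proc/self/statm"); a sample line is used here
    // so the sketch runs anywhere.
    let sample = "100000 25600 1200 500 0 30000 0";
    let mb = parse_statm_resident_mb(sample, 4096).unwrap();
    println!("NATIVE_MEMORY: {{ resident: {:.0} }}", mb);
}
```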

How are these changes tested?

@andygrove (Member, Author):

@mbutrovich I'd like to get your feedback on this PR.

@codecov-commenter commented May 1, 2025

Codecov Report

Attention: Patch coverage is 37.50% with 10 lines in your changes missing coverage. Please review.

Project coverage is 58.69%. Comparing base (f09f8af) to head (077f5af).
Report is 177 commits behind head on main.

Files with missing lines Patch % Lines
...ain/scala/org/apache/comet/CometExecIterator.scala 23.07% 9 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1702      +/-   ##
============================================
+ Coverage     56.12%   58.69%   +2.56%     
- Complexity      976     1143     +167     
============================================
  Files           119      129      +10     
  Lines         11743    12615     +872     
  Branches       2251     2359     +108     
============================================
+ Hits           6591     7404     +813     
- Misses         4012     4038      +26     
- Partials       1140     1173      +33     

☔ View full report in Codecov by Sentry.

@@ -359,6 +365,21 @@ pub unsafe extern "system" fn Java_org_apache_comet_Native_executePlan(
// Retrieve the query
let exec_context = get_execution_context(exec_context);

// memory profiling is only available on linux
if exec_context.memory_profiling_enabled {
#[cfg(target_os = "linux")]
Contributor:

should the if statement be inside cfg?

@andygrove (Member, Author):

I did that originally, but the compiler complained about the unused variable memory_profiling_enabled on non-Linux platforms.
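
The pattern under discussion can be sketched as follows (a minimal, hypothetical `maybe_profile` function, not the PR's actual code): keeping the `if` outside the `#[cfg]` block means the flag is read on every target, so the compiler never flags `memory_profiling_enabled` as unused on non-Linux builds.

```rust
// Hypothetical sketch of the cfg-inside-if pattern from this diff.
fn maybe_profile(memory_profiling_enabled: bool) -> bool {
    // allow: `mut` is only exercised on Linux, where the cfg block survives.
    #[allow(unused_mut)]
    let mut profiled = false;
    // The flag is read on all targets, so no unused-variable warning fires.
    if memory_profiling_enabled {
        // Only this body is compiled out on non-Linux targets.
        #[cfg(target_os = "linux")]
        {
            // Linux-only profiling work (e.g. reading /proc/self/statm) goes here.
            profiled = true;
        }
    }
    profiled
}

fn main() {
    // true only when the flag is set AND the target is Linux.
    assert_eq!(maybe_profile(true), cfg!(target_os = "linux"));
    assert!(!maybe_profile(false));
}
```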

Comment on lines +144 to +149
// scalastyle:off println
println(
"JVM_MEMORY: { " +
s"heapUsed: ${mb(heap.getUsed)}, heapCommitted: ${mb(heap.getCommitted)}, " +
s"nonHeapUsed: ${mb(nonHeap.getUsed)}, nonHeapCommitted: ${mb(nonHeap.getCommitted)} " +
"}")
Contributor:

Perhaps logInfo ?
If println is preferred, we may have to add back // scalastyle:on println

@comphead (Contributor) left a comment:

lgtm thanks @andygrove

@andygrove andygrove changed the title feat: Add memory profiling perf: Add memory profiling May 2, 2025
@kazuyukitanimura (Contributor) left a comment:

pending ci

@andygrove andygrove marked this pull request as draft May 2, 2025 18:51
@andygrove (Member, Author):

Moving to draft while I experiment with jemalloc-specific logging

@andygrove (Member, Author):

I added some jemalloc specific logging:

NATIVE_MEMORY_JEMALLOC: { allocated=561605760, resident=3749654528 }
NATIVE_MEMORY_JEMALLOC: { allocated=360603312, resident=3691925504 }
NATIVE_MEMORY_JEMALLOC: { allocated=161601808, resident=3181555712 }
NATIVE_MEMORY_JEMALLOC: { allocated=14660072, resident=3086413824 }
NATIVE_MEMORY_JEMALLOC: { allocated=14271320, resident=3075936256 }

@mbutrovich does this seem useful?

@andygrove (Member, Author) commented May 2, 2025:

I'm not sure how to reconcile the different resident numbers:

JVM_MEMORY: { heapUsed: 2027, heapCommitted: 3778, nonHeapUsed: 121, nonHeapCommitted: 131 }
NATIVE_MEMORY: { resident: 4702 }
NATIVE_MEMORY_JEMALLOC: { allocated: 13, resident: 2816 }

edit: updated to use consistent MB formatting

edit: resident memory is close to heapUsed + jemalloc resident, but is this the correct interpretation?
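
The consistent-MB formatting mentioned in the edit can be sketched with a small stdlib-only helper. The `format_jemalloc_stats` function and the hard-coded byte counts are illustrative assumptions; in the PR the counts come from jemalloc's allocated/resident statistics.

```rust
/// Hypothetical helper: format jemalloc byte counts the same way as the
/// JVM_MEMORY / NATIVE_MEMORY lines, i.e. whole MB.
fn format_jemalloc_stats(allocated_bytes: u64, resident_bytes: u64) -> String {
    // Convert bytes to MB, rounding to the nearest whole number.
    let mb = |b: u64| (b as f64 / (1024.0 * 1024.0)).round() as u64;
    format!(
        "NATIVE_MEMORY_JEMALLOC: {{ allocated: {}, resident: {} }}",
        mb(allocated_bytes),
        mb(resident_bytes)
    )
}

fn main() {
    // Raw counts from the first NATIVE_MEMORY_JEMALLOC log line above,
    // reformatted as MB.
    println!("{}", format_jemalloc_stats(561_605_760, 3_749_654_528));
}
```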

@andygrove andygrove marked this pull request as ready for review May 2, 2025 19:14
Comment on lines +378 to +381
println!(
"NATIVE_MEMORY: {{ resident: {:.0} }}",
(statm.resident * page_size) as f64 / (1024.0 * 1024.0)
);
Contributor:

Suggested change:
-    println!(
-        "NATIVE_MEMORY: {{ resident: {:.0} }}",
-        (statm.resident * page_size) as f64 / (1024.0 * 1024.0)
-    );
+    // Read statistics using MIB key:
+    let allocated = allocated.read().unwrap() as f64 / (1024.0 * 1024.0);
+    let resident = resident.read().unwrap() as f64 / (1024.0 * 1024.0);
+    println!(
Contributor:

wondering should we feed the metrics to log or to plan metrics

@andygrove (Member, Author):

These metrics are for the overall process, so not specific to any particular plan. This feature just allows us to watch the executor logs and see when memory starts to approach the limit. I hope this will help us debug the cause of OOMs, although it will still primarily be a manual process.

@andygrove (Member, Author):

Thanks for the reviews @comphead @kazuyukitanimura and @mbutrovich (offline review). I'm going to go ahead and merge this and then iterate on improving it once I have used it to debug a real issue.

@andygrove andygrove merged commit af02182 into apache:main May 5, 2025
78 checks passed
@andygrove andygrove deleted the memory-profiling branch May 5, 2025 15:59
@mbutrovich (Contributor):

Not the end of the world, but this introduces warnings on macOS when using jemalloc:

| use tikv_jemalloc_ctl::{epoch, stats};
|                         ^^^^^  ^^^^^
|
= note: `#[warn(unused_imports)]` on by default
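
One way to silence that warning is to gate the import on the same `cfg` condition as the code that uses it. A hedged sketch of the pattern, using a Linux-only stdlib trait as a stand-in for the `tikv_jemalloc_ctl` import (the `inode_of_cwd` function is hypothetical, chosen only because it needs a Linux-only import):

```rust
// Stand-in for `#[cfg(target_os = "linux")] use tikv_jemalloc_ctl::{epoch, stats};`
// Gated on the same cfg as the code that needs it, so macOS builds never see
// the import and emit no unused_imports warning.
#[cfg(target_os = "linux")]
use std::os::linux::fs::MetadataExt;

/// Returns the inode of the current directory on Linux, None elsewhere.
/// The point is only that neither the import nor its use exists off-Linux.
fn inode_of_cwd() -> Option<u64> {
    #[cfg(target_os = "linux")]
    fn inner() -> Option<u64> {
        std::fs::metadata(".").ok().map(|m| m.st_ino())
    }
    #[cfg(not(target_os = "linux"))]
    fn inner() -> Option<u64> {
        None
    }
    inner()
}

fn main() {
    assert_eq!(inode_of_cwd().is_some(), cfg!(target_os = "linux"));
}
```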


Successfully merging this pull request may close these issues.

Add memory profiling / logging
5 participants