[ML] Report the "actual" memory usage of the autodetect process #2846

Open
wants to merge 20 commits into
base: main
from

Conversation

Contributor

@edsavage edsavage commented Apr 4, 2025

Determine the actual memory usage of the autodetect process as reported by the OS, e.g. on Linux this would be the value of the maximum resident set size returned by a call to getrusage.

Add this value to the model size stats record returned to the ES Java process so it can be included in the job counts tab for anomaly detection jobs.
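
A minimal sketch of the `getrusage` call mentioned above, assuming a POSIX platform. The helper name `maxResidentSetSizeBytes` is hypothetical and is not taken from this PR. Note that Linux reports `ru_maxrss` in kilobytes while macOS reports it in bytes, so the value is normalized to bytes here.

```cpp
#include <sys/resource.h> // getrusage, struct rusage (POSIX)

#include <cstddef>

//! Hypothetical helper (not from this PR): peak resident set size of the
//! calling process in bytes, or 0 if getrusage fails.
std::size_t maxResidentSetSizeBytes() {
    struct rusage usage {};
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
        return 0;
    }
    // Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
#ifdef __APPLE__
    return static_cast<std::size_t>(usage.ru_maxrss);
#else
    return static_cast<std::size_t>(usage.ru_maxrss) * 1024;
#endif
}
```

Because this is a high-water mark, repeated calls within one process should never return a smaller value than an earlier call.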

Relates elastic/elasticsearch#126256

Determine the actual memory usage of the autodetect process as reported by the OS, e.g. on Linux this would be the value of the maximum resident set size returned by a call to `getrusage`.

Add this value to the model size stats record returned to the ES Java process so it can be included in the `job counts` tab for anomaly detection jobs.
Contributor

@valeriy42 valeriy42 left a comment

Hi Ed. I did the first pass.

We should discuss the naming of the new field. While "actual" conveys the intention of the value, it is confusing to the user.

Also, does maximum resident set size actually correspond to the actual current memory usage or is it the historical peak process memory usage?

@@ -180,6 +181,8 @@ class MODEL_EXPORT CResourceMonitor {
//! Returns the sum of used memory plus any extra memory
std::size_t totalMemory() const;

std::size_t actualMemoryUsage() const;
Contributor

I think we can come up with something better than actualMemoryUsage. Maybe: systemMemoryUsage?

@edsavage
Contributor Author

edsavage commented Apr 8, 2025

> Also, does maximum resident set size actually correspond to the actual current memory usage or is it the historical peak process memory usage?

The resident set size (RSS) represents the process's current RAM usage (so not counting pages that have been swapped out etc.), and the max RSS is the high water mark of that value. I think that reporting both would be useful for our purposes.
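
For illustration, one common way to read the current RSS on Linux is the second field of `/proc/self/statm`, which is expressed in pages. This is only a sketch under that assumption; the helper name `currentResidentSetSizeBytes` is hypothetical, and the PR may obtain the value differently.

```cpp
#include <unistd.h> // sysconf, _SC_PAGESIZE

#include <cstddef>
#include <fstream>

//! Hypothetical helper (not from this PR): current resident set size in
//! bytes, or 0 if /proc/self/statm cannot be read (e.g. not on Linux).
std::size_t currentResidentSetSizeBytes() {
    std::ifstream statm("/proc/self/statm");
    std::size_t totalPages = 0;    // first field: total program size in pages
    std::size_t residentPages = 0; // second field: resident set size in pages
    if (!(statm >> totalPages >> residentPages)) {
        return 0;
    }
    long pageSize = sysconf(_SC_PAGESIZE);
    if (pageSize <= 0) {
        return 0;
    }
    return residentPages * static_cast<std::size_t>(pageSize);
}
```

Unlike the max RSS, this value can go down again when pages are released or swapped out, which is why reporting both gives a fuller picture.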

* ActualMemory -> SystemMemory
* Report current resident set size as well as max
Comment on lines 723 to 724
E_AssignmentBasisSystemMemoryBytes = 4, //!< Use the current system memory size
E_AssignmentBasisMaxSystemMemoryBytes = 5 //!< Use the highest ever system memory size
Contributor

I don't think we need this assignment basis reasons (at least so far)

Comment on lines 88 to 89
ml::counter_t::E_TSADResidentSetSize,
ml::counter_t::E_TSADMaxResidentSetSize};
Contributor

Should we call these counters SystemMemoryUsage for consistency?

Comment on lines 1736 to 1739
case E_AssignmentBasisSystemMemoryBytes:
return "system_memory_bytes";
case E_AssignmentBasisMaxSystemMemoryBytes:
return "max_system_memory_bytes";
Contributor

Are those necessary?

edsavage and others added 6 commits April 11, 2025 11:08
Co-authored-by: Valeriy Khakhutskyy <[email protected]>
Co-authored-by: Valeriy Khakhutskyy <[email protected]>
Co-authored-by: Valeriy Khakhutskyy <[email protected]>
…mem_usage

# Conflicts:
#	bin/autodetect/Main.cc
#	include/model/CResourceMonitor.h
Quality Gate failed

Failed conditions
274 New issues
23.0% Duplication on New Code (required ≤ 3%)
164 New Major Issues (required ≤ 0)
10 New Critical Issues (required ≤ 0)

See analysis details on SonarQube

@prodsecmachine

prodsecmachine commented May 25, 2025

🎉 Snyk checks have passed. No issues have been found so far.

security/snyk check is complete. No issues have been found. (View Details)

license/snyk check is complete. No issues have been found. (View Details)

edsavage added 2 commits May 27, 2025 15:50
* Address failing unit tests
* More accurate, meaningful description of new program counters
Contributor

@valeriy42 valeriy42 left a comment

Looks good. I think the last missing piece is that we should use the system memory usage in CResourceMonitor when calculating whether allocations are allowed, and report it back to Java as "model memory usage" and "peak model memory usage" instead of the estimated values on Linux.

edsavage added 3 commits June 4, 2025 16:24
… set size) for the "model memory usage" and "peak model memory usage" fields reported to Java.
@edsavage edsavage added the ci:run-qa-tests Run a subset of the QA tests label Jun 5, 2025
@edsavage
Contributor Author

edsavage commented Jun 5, 2025

buildkite run_qa_tests

Contributor

I think this change is missing an update to CResourceMonitor::totalMemory(). Please correct me if I'm wrong, but the way I understand this code, totalMemory() on Linux should now simply return systemMemoryUsage().

  res.s_Usage = this->totalMemory();
- res.s_AdjustedUsage = this->adjustedUsage(res.s_Usage);
+ res.s_AdjustedUsage = systemMemoryUsage(this->adjustedUsage(res.s_Usage));
Contributor

I don't think we need this complexity here. We can keep reporting adjusted usage the way we did before. But we need to update the value for s_Usage. If you change the logic in totalMemory(), you don't need to change anything here.

Contributor Author

Thanks for the comments Valeriy.

What I was thinking was since s_AdjustedUsage and s_AdjustedPeakUsage are used as the values for the fields MODEL_BYTES and PEAK_MODEL_BYTES respectively -

writer.onKey(MODEL_BYTES);
writer.onUint64(results.s_AdjustedUsage);
writer.onKey(PEAK_MODEL_BYTES);
writer.onUint64(results.s_AdjustedPeakUsage);

that on Linux these should simply both be set to systemMemoryUsage() and left as-is elsewhere. In doing so I was trying to avoid "adjusting" the system memory usage as it should not be necessary to do so.

That said, I'll adjust the code as you suggest and see how the results look.

edsavage added 2 commits June 9, 2025 16:06
On Linux, use systemMemoryUsage for the value of totalMem. Do not "adjust" this value as is done for the estimated usage, as it is unnecessary.
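
The behaviour described in this commit message could be sketched roughly as follows. This is a simplified stand-in and not the real `CResourceMonitor`: the struct `SSketchMonitor`, its members, and in particular the halving inside `adjustedUsage` are hypothetical placeholders chosen only to make the two code paths visible.

```cpp
#include <cstddef>

// Assumption: select the OS-reported value only when building for Linux.
#ifdef __linux__
constexpr bool USE_SYSTEM_MEMORY = true;
#else
constexpr bool USE_SYSTEM_MEMORY = false;
#endif

//! Simplified, hypothetical stand-in for the resource monitor.
struct SSketchMonitor {
    std::size_t s_EstimatedBytes = 0; //!< Estimated model memory
    std::size_t s_ExtraBytes = 0;     //!< Extra memory added to the estimate
    std::size_t s_SystemBytes = 0;    //!< Memory usage as reported by the OS
    bool s_UseSystemMemory = USE_SYSTEM_MEMORY;

    //! OS-reported usage when enabled, otherwise estimate plus extra.
    std::size_t totalMemory() const {
        return s_UseSystemMemory ? s_SystemBytes : s_EstimatedBytes + s_ExtraBytes;
    }

    //! Placeholder adjustment; the real adjustment logic is not shown
    //! in this conversation, so halving is purely illustrative.
    std::size_t adjustedUsage(std::size_t usage) const { return usage / 2; }

    //! Value to report: the OS-reported figure is passed through
    //! unadjusted, the estimate is adjusted as before.
    std::size_t reportedUsage() const {
        std::size_t total = this->totalMemory();
        return s_UseSystemMemory ? total : this->adjustedUsage(total);
    }
};
```

Keeping the switch as a data member rather than only an `#ifdef` makes both branches easy to exercise in a unit test on any platform.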
Quality Gate failed

Failed conditions
1 New issue
1 New Critical Issues (required ≤ 0)

See analysis details on SonarQube
