feat(eviction): Add eviction based on RSS memory. DRAFT #4772

Draft
wants to merge 1 commit into
base: main

Contributor

@BagritsevichStepan BagritsevichStepan commented Mar 15, 2025

fixes #4011

THIS IS A DRAFT.

This PR investigates how RSS memory behaves and whether eviction can be triggered based on it.

Current problem: mimalloc does not release RSS memory back to the OS (as was mentioned here and here). Alternatively, the OS might not be reclaiming this freed memory due to a lack of memory pressure (as noted here). This is a common issue with mimalloc. I tried various options, but only one of them helped. Issues I consulted during the investigation: 1, 2, 3, 4, 5

The steps I took:

  1. Added a simple pytest test: test_cache_eviction_with_rss_deny_oom_simple_case. It simply generates data and checks that we are not evicting too many items (since RSS is not decreasing). Thus, the used_memory after eviction should be around 70% of max_memory (as rss_deny_oom_ratio is set to 80%).
  2. Uncommented the previous code and added memory decommit after deleting more than force_decommit_threshold items.
    • This didn't help: the OS is not reclaiming the freed memory due to a lack of memory pressure, so RSS usage didn't change even after the decommit.
  3. Set mi_option_purge_delay to 0, and it helped. I don't know why it works, as it only reduces the memory purging delay from 10ms to 0ms. However, when it was set to the default value, RSS memory didn't change at all, but after setting it to 0, it started decreasing with a small delay. Still, we are evicting a lot due to the delay.
  4. Improved the logic a bit by adding the calculation of the number of evicted bytes → Now the simple test works fine:
[2025-03-16 10:13:04.779 INFO] Current used memory: 416031760, current used rss: 507600896, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:04.781 INFO] Current evicted: 8800. Total keys: 89128.
[2025-03-16 10:13:05.785 INFO] Current used memory: 371487760, current used rss: 507600896, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:05.787 INFO] Current evicted: 17400. Total keys: 89128.
[2025-03-16 10:13:06.791 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:06.793 INFO] Current evicted: 23644. Total keys: 89128.
[2025-03-16 10:13:07.797 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:07.799 INFO] Current evicted: 23644. Total keys: 89128.
[2025-03-16 10:13:08.803 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:08.805 INFO] Current evicted: 23644. Total keys: 89128.

From the output, we can see that we are not evicting too much data: the current used memory stopped decreasing and is slightly below the RSS eviction threshold. I also checked the mimalloc stats, and it frees around 150MiB, which is the expected behavior.
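The observation above can be checked mechanically against the log. The following is a minimal sketch, not code from this PR: it parses the five memory lines of the excerpt and confirms that used memory plateaus below the threshold while RSS barely moves. The 512 MiB `ASSUMED_MAX_MEMORY` is an assumption inferred from the threshold value (the description does not state it), and the 0.7 factor comes from the "around 70% of max_memory" remark.

```python
import math
import re

# Memory lines copied from the log excerpt above.
LOG = """\
[2025-03-16 10:13:04.779 INFO] Current used memory: 416031760, current used rss: 507600896, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:05.785 INFO] Current used memory: 371487760, current used rss: 507600896, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:06.791 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:07.797 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
[2025-03-16 10:13:08.803 INFO] Current used memory: 339518480, current used rss: 507428864, rss eviction threshold: 375809638.40000004.
"""

LINE_RE = re.compile(
    r"Current used memory: (\d+), current used rss: (\d+), "
    r"rss eviction threshold: (\d+(?:\.\d+)?)"
)


def parse_samples(log):
    """Return a list of (used_memory, rss, threshold) tuples."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(log)
    ]


samples = parse_samples(LOG)
first_used, first_rss, threshold = samples[0]
last_used, last_rss, _ = samples[-1]

# Assumption: max_memory was 512 MiB in this run (inferred, not stated).
ASSUMED_MAX_MEMORY = 512 * 1024 * 1024
threshold_is_70_percent = math.isclose(
    threshold, 0.7 * ASSUMED_MAX_MEMORY, rel_tol=1e-9
)

used_drop = first_used - last_used  # bytes of used memory released
rss_drop = first_rss - last_rss     # bytes of RSS released over the same period

print(f"used memory dropped by {used_drop} bytes, RSS by {rss_drop} bytes")
print(f"final used memory below threshold: {last_used < threshold}")
print(f"final RSS still above threshold: {last_rss > threshold}")
```

Under these assumptions, the used-memory drop is several hundred times larger than the RSS drop, which is the discrepancy the rest of this description is about.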

  5. Added test_cache_eviction_with_rss_deny_oom_two_waves, which does almost the same but with two waves of data population. The first wave remains the same. Then, we wait until eviction stops and populate around 20% of max_memory. The problems are:
    1. Since RSS used memory does not decrease after the first wave, we get an OOM error for the second wave from the ShouldDenyOnOOM method
    2. I know how to fix it, but for testing purposes, I temporarily removed this logic to check how eviction works for the second wave.
    3. As a result, we are not evicting much data for the second wave due to how the eviction algorithm works.
    4. I also know how to fix this, but on the other hand, after fixing it, a potential issue could arise where, in some cases, we might evict a large amount of data (though still much less than in our initial naive approach).
    So, the current approach is not aggressive, meaning that in all cases, it will either evict a sufficient amount of data or slightly less than needed, leading to an OOM. I think this is better than evicting more than necessary.
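One way to picture the "not aggressive" behaviour described above is a budget that counts bytes already evicted, so that a stale RSS reading does not trigger the same eviction twice. The sketch below is purely illustrative and is not the logic in this PR: the class `RssEvictionBudget` and its methods are hypothetical names, and the real implementation may account for evicted bytes differently.

```python
from dataclasses import dataclass


@dataclass
class RssEvictionBudget:
    """Hypothetical helper: how many more bytes to evict for a given RSS reading."""

    threshold: float
    evicted_since_breach: int = 0  # bytes evicted while RSS stayed above threshold

    def bytes_to_evict(self, rss: int) -> int:
        if rss <= self.threshold:
            # RSS is back under the threshold: start a fresh budget next time.
            self.evicted_since_breach = 0
            return 0
        overshoot = int(rss - self.threshold)
        # Subtract what was already evicted but is not yet visible in RSS.
        return max(0, overshoot - self.evicted_since_breach)

    def record_eviction(self, freed_bytes: int) -> None:
        self.evicted_since_breach += freed_bytes


# Values taken from the simple-case log: RSS stays at 507600896 throughout.
budget = RssEvictionBudget(threshold=375809638.4)
initial_goal = budget.bytes_to_evict(507600896)

budget.record_eviction(100_000_000)
remaining_goal = budget.bytes_to_evict(507600896)

budget.record_eviction(40_000_000)
final_goal = budget.bytes_to_evict(507600896)

print(initial_goal, remaining_goal, final_goal)
```

With a budget like this, eviction stops once the evicted bytes cover the overshoot, even though RSS has not moved. The trade-off matches the one described above: it can under-evict when RSS never catches up, but it does not over-evict.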

Conclusion:
We need to find a way to accurately calculate real RSS usage. The possible steps are:

  1. Test OS commands to check their effectiveness:
    • echo 1 | sudo tee /proc/sys/vm/drop_caches
    • echo 1 | sudo tee /proc/sys/vm/overcommit_memory
      If these commands work, we can invoke them from the control plane side when we detect a significant difference between used_memory and rss_used_memory.
  2. Configure mimalloc to properly clear the data or wait for its behavior to be fixed.
  3. Improve the calculation of evicted bytes by using mimalloc stats and VMRssHeap memory usage.

Since we will improve our RSS memory calculation, even if it updates with a delay, the current approach should work fine in most cases. In some scenarios, we may still encounter OOM, but I don't think this will be a common case.
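As a starting point for a more accurate RSS calculation, the kernel already splits RSS into anonymous, file-backed, and shared parts in `/proc/<pid>/status` on Linux. The sketch below is a hedged illustration, not code from this PR: it uses the standard `VmRSS`, `RssAnon`, `RssFile`, and `RssShmem` fields rather than the `VMRssHeap` value mentioned above, and the numbers in `SAMPLE_STATUS` are made up for the example.

```python
RSS_FIELDS = ("VmRSS", "RssAnon", "RssFile", "RssShmem")

# Illustrative excerpt in the /proc/<pid>/status format; values are invented.
SAMPLE_STATUS = """\
Name:\tdragonfly
VmRSS:\t    1000 kB
RssAnon:\t     900 kB
RssFile:\t      90 kB
RssShmem:\t      10 kB
Threads:\t4
"""


def parse_status_bytes(text):
    """Parse 'Key:  N kB' lines into a {key: bytes} dict, skipping other lines."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key] = int(parts[0]) * 1024
    return fields


def rss_breakdown(text):
    """Return only the RSS-related fields (None when a field is missing)."""
    fields = parse_status_bytes(text)
    return {name: fields.get(name) for name in RSS_FIELDS}


def read_own_rss_breakdown(path="/proc/self/status"):
    """Linux only: read the breakdown for the current process."""
    with open(path) as f:
        return rss_breakdown(f.read())


breakdown = rss_breakdown(SAMPLE_STATUS)
print(breakdown)
```

`RssAnon` is the part that heap allocations contribute to, so comparing it with the allocator's committed bytes is likely more telling than comparing total RSS.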

Note: I also tried changing the mem_defrag_threshold, but it didn’t help. So, I believe the issue is not related to fragmentation.

Collaborator

romange commented Mar 18, 2025

I have not looked at the code; I am just commenting on the PR description. Thank you for making the effort to describe the background behind the PR.

Collaborator

romange commented Mar 18, 2025

  1. /proc/sys/vm/drop_caches is unrelated: it is for OS file caches. /proc/sys/vm/overcommit_memory is also unrelated: it is about VM reservations, not about RSS.
  2. It is possible for a userland app to control its RSS via explicit madvise(MADV_DONTNEED) calls, but we lack this ability since we work through a memory allocator.

In the end, the goal is to understand how much RSS is being used by the app (and we have this ability via memory arena calls) vs. the RSS being monitored by the OS, and to see whether memory decommit helps.
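To make the madvise(MADV_DONTNEED) point concrete, here is a small, hedged sketch of what explicit control looks like when an application owns its mapping directly. It is not something Dragonfly can do for allocator-managed memory, as noted above. The function name is hypothetical; the sketch assumes Linux and Python 3.8 or newer (for `mmap.madvise`), and returns `None` elsewhere.

```python
import mmap
import sys


def madvise_dontneed_demo(size=4 * mmap.PAGESIZE):
    """Touch a private anonymous mapping, then drop its pages with MADV_DONTNEED.

    Returns (first_byte_before, first_byte_after), or None when unsupported.
    """
    supported = (
        sys.platform.startswith("linux")
        and hasattr(mmap, "MADV_DONTNEED")
        and hasattr(mmap.mmap, "madvise")
    )
    if not supported:
        return None

    # A private anonymous mapping, similar to what an allocator obtains from the OS.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        buf[:] = b"\xff" * size  # touching the pages makes them resident
        before = buf[0:1]
        # Tell the kernel the pages are no longer needed; they leave the RSS.
        buf.madvise(mmap.MADV_DONTNEED)
        # On Linux, the next access to a private anonymous page is zero-filled.
        after = buf[0:1]
        return before, after
    finally:
        buf.close()


result = madvise_dontneed_demo()
print(result)
```

The zero-filled read after the call shows that the old pages really were discarded, which is why an allocator can only do this for pages that hold no live objects.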

We can also explore testing later versions of mimalloc (such as 2.2.2).

Contributor Author

BagritsevichStepan commented Mar 18, 2025

@adiholden
This is what I get for the test test_cache_eviction_with_rss_deny_oom_simple_case in a single-threaded datastore with maxmemory=6GB. After eviction, I call MEMORY ARENA, then perform MEMORY DECOMMIT, and finally call MEMORY ARENA again.
Results:

[2025-03-18 15:43:26.674 INFO] Current used memory: 4382055440, current used rss: 5799755776, rss eviction threshold: 4509715660.8.                                                                         
[2025-03-18 15:43:26.675 INFO] Current evicted: 226796. Total keys: 1069547.                                                                                                                                
[2025-03-18 15:43:27.893 INFO] Memory arena before decommit:                                                                                                                                                
Arena statistics from thread:0                                                                                                                                                                              
Count BlockSize Reserved Committed Used                                                                                                                                                                     
27 5120 61440 61440 10240                                                                                                                                                                                   
1 16384 65536 65536 16384                                                                                                                                                                                   
10047 5120 61440 61440 61440                                                                                                                                                                                
3744 5120 61440 61440 30720                                                                                                                                                                                 
551 5120 61440 61440 20480                                                                                                                                                                                  
19882 5120 61440 61440 51200                                                                                                                                                                                
5 5120 61440 61440 5120                                                                                                                                                                                     
11333 5120 61440 61440 40960                                                                                                                                                                                
7028 5120 61440 61440 35840                                                                                                                                                                                 
1670 5120 61440 61440 25600                                                                                                                                                                                 
116 5120 61440 61440 15360                                                                                                                                                                                  
1 20480 512000 81920 20480                                                                                                                                                                                  
128 32768 524288 524288 524288                                                                                                                                                                              
1 1280 65280 5120 1280                                                                                                                                                                                      
15952 5120 61440 61440 46080
18774 5120 61440 61440 56320
1 24576 516096 98304 24576
1 8 65512 4096 16
total reserved: 5544419048, comitted: 5543449600, used: 4382056720 fragmentation waste: 20.9507%
--- End mimalloc statistics, took 216639us ---                                                                                                                                                            
                                                                                                                                                                                                            
[2025-03-18 15:43:29.080 INFO] Memory arena after decommit:                                                                                                                                                 
Arena statistics from thread:0                                                                                                                                                                              
Count BlockSize Reserved Committed Used                                                                                                                                                                     
1 16384 65536 65536 16384                                                                                                                                                                                   
3744 5120 61440 61440 30720                                                                                                                                                                                 
116 5120 61440 61440 15360                                                                                                                                                                                  
128 32768 524288 524288 524288                                                                                                                                                                              
10047 5120 61440 61440 61440                                                                                                                                                                                
18774 5120 61440 61440 56320                                                                                                                                                                                
27 5120 61440 61440 10240                                                                                                                                                                                   
1 8 65512 4096 16                                                                                                                                                                                           
1 24576 516096 98304 24576                                                                                                                                                                                  
5 5120 61440 61440 5120                                                                                                                                                                                     
11333 5120 61440 61440 40960                                                                                                                                                                                
551 5120 61440 61440 20480                                                                                                                                                                                  
7028 5120 61440 61440 35840                                                                                                                                                                                 
1670 5120 61440 61440 25600
1 20480 512000 81920 20480
1 1280 65280 5120 1280
15952 5120 61440 61440 46080
19882 5120 61440 61440 51200
total reserved: 5544419048, comitted: 5543449600, used: 4382056720 fragmentation waste: 20.9507%
--- End mimalloc statistics, took 181623us --

Collaborator

romange commented Mar 18, 2025

So, based on total reserved: 5544419048, comitted: 5543449600, used: 4382056720 fragmentation waste: 20.9507%
and current used rss: 5799755776, I do not see any problem with mimalloc: its RSS is close to the committed memory usage, and the gap is due to fragmentation. So it's only natural that if we evict a bunch of random entries, our committed usage stays the same, because these entries create holes in memory pages without freeing the pages entirely. Decommit won't help here because the allocator library does not have empty pages to decommit. But we have another command, MEMORY DEFRAGMENT, that kicks off the defragmentation process. I wonder how it affects the arena stats.
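The numbers quoted above are consistent with the reported waste being the unused share of committed memory. The sketch below reproduces that arithmetic; the formula is inferred from the printed totals rather than taken from the Dragonfly source, so treat it as an assumption.

```python
def fragmentation_waste_percent(committed: int, used: int) -> float:
    """Share of committed memory that is not used, in percent (inferred formula)."""
    return (committed - used) / committed * 100.0


# Totals from the MEMORY ARENA output and the RSS from the test log above.
committed = 5_543_449_600
used = 4_382_056_720
rss = 5_799_755_776

waste = fragmentation_waste_percent(committed, used)
rss_over_committed = rss - committed    # RSS not explained by the data heap
committed_over_used = committed - used  # bytes held in partially filled pages

print(f"fragmentation waste: {waste:.4f}%")
print(f"RSS above committed: {rss_over_committed} bytes")
print(f"committed above used: {committed_over_used} bytes")
```

The gap between committed and used (about 1.08 GiB) is several times larger than the gap between RSS and committed (about 244 MiB), which supports the reading that partially filled pages, not the allocator withholding memory, keep RSS high.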

Contributor Author

BagritsevichStepan commented Mar 18, 2025

MEMORY DEFRAGMENT helps. It is strange that setting mem_defrag_threshold to 0.2 didn't help:

[2025-03-18 16:48:03.384 INFO] Current evicted: 226826. Total keys: 1069547.                                                                                                                                
[2025-03-18 16:48:04.564 INFO] Memory arena before defrag:                                                                                                                                                  
Arena statistics from thread:0                                                                                                                                                                              
Count BlockSize Reserved Committed Used                                                                                                                                                                     
116 5120 61440 61440 15360                                                                                                                                                                                  
1669 5120 61440 61440 25600                                                                                                                                                                                 
18765 5120 61440 61440 56320                                                                                                                                                                                
3745 5120 61440 61440 30720                                                                                                                                                                                 
1 8 65512 4096 16                                                                                                                                                                                           
19887 5120 61440 61440 51200                                                                                                                                                                                
1 1280 64000 5120 1280                                                                                                                                                                                      
552 5120 61440 61440 20480                                                                                                                                                                                  
1 16384 65536 65536 16384                                                                                                                                                                                   
27 5120 61440 61440 10240                                                                                                                                                                                   
1 24576 516096 98304 24576                                                                                                                                                                                  
5 5120 61440 61440 5120                                                                                                                                                                                     
11335 5120 61440 61440 40960                                                                                                                                                                                
7032 5120 61440 61440 35840                                                                                                                                                                                 
10046 5120 61440 61440 61440                                                                                                                                                                                
15950 5120 61440 61440 46080                                                                                                                                                                                
128 32768 524288 524288 524288                                                                                                                                                                              
1 20480 491520 81920 20480                                                                                                                                                                                  
total reserved: 5544397288, comitted: 5543449600, used: 4381903120 fragmentation waste: 20.9535%                                                                                                            
--- End mimalloc statistics, took 177844us ---                                                                                                                                                              

[2025-03-18 16:48:07.677 INFO] Memory arena after defrag and before decommit:                                                                                                                               
Arena statistics from thread:0                                                                                                                                                                              
Count BlockSize Reserved Committed Used                                                                                                                                                                     
1 20480 491520 81920 20480                                                                                                                                                                                  
11301 5120 61440 61440 51200                                                                                                                                                                                
13797 5120 61440 61440 56320                                                                                                                                                                                
1 24576 516096 98304 24576                                                                                                                                                                                  
1 1280 64000 5120 1280                                                                                                                                                                                      
1 16384 65536 65536 16384                                                                                                                                                                                   
48162 5120 61440 61440 61440                                                                                                                                                                                
1 8 65512 4096 16                                                                                                                                                                                           
128 32768 524288 524288 524288                                                                                                                                                                              
total reserved: 4569405928, comitted: 4568458240, used: 4381903120 fragmentation waste: 4.08355%                                                                                                            
--- End mimalloc statistics, took 108767us ---                                                                                                                                                              
                                                                                                                                                                                                            
[2025-03-18 16:48:10.792 INFO] Memory arena after decommit:                                                                                                                                                 
Arena statistics from thread:0                                                                                                                                                                              
Count BlockSize Reserved Committed Used                                                                                                                                                                     
1 20480 491520 81920 20480                                                                                                                                                                                  
11301 5120 61440 61440 51200                                                                                                                                                                                
13797 5120 61440 61440 56320                                                                                                                                                                                
1 24576 516096 98304 24576                                                                                                                                                                                  
1 1280 64000 5120 1280                                                                                                                                                                                      
1 16384 65536 65536 16384                                                                                                                                                                                   
48162 5120 61440 61440 61440                                                                                                                                                                                
1 8 65512 4096 16                                                                                                                                                                                           
128 32768 524288 524288 524288                                                                                                                                                                              
total reserved: 4569405928, comitted: 4568458240, used: 4381903120 fragmentation waste: 4.08355%                                                                                                            
--- End mimalloc statistics, took 108448us --- 
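The totals line can be rebuilt from the per-row statistics, which also makes the effect of defragmentation easy to quantify. The sketch below is illustrative only: it assumes each row means "Count pages with these per-page Reserved, Committed, and Used byte values", an interpretation that reproduces the printed totals exactly.

```python
# Rows copied from the "after defrag" arena output above.
ARENA_AFTER_DEFRAG = """\
Count BlockSize Reserved Committed Used
1 20480 491520 81920 20480
11301 5120 61440 61440 51200
13797 5120 61440 61440 56320
1 24576 516096 98304 24576
1 1280 64000 5120 1280
1 16384 65536 65536 16384
48162 5120 61440 61440 61440
1 8 65512 4096 16
128 32768 524288 524288 524288
"""


def arena_totals(table):
    """Sum count * per-page bytes for the Reserved, Committed and Used columns."""
    reserved = committed = used = 0
    for line in table.splitlines()[1:]:  # skip the header row
        count, _block_size, res, com, use = map(int, line.split())
        reserved += count * res
        committed += count * com
        used += count * use
    return reserved, committed, used


reserved, committed, used = arena_totals(ARENA_AFTER_DEFRAG)
waste = (committed - used) / committed * 100.0

# Committed total before defragmentation, from the first arena dump above.
COMMITTED_BEFORE_DEFRAG = 5_543_449_600
released_by_defrag = COMMITTED_BEFORE_DEFRAG - committed

print(f"reserved={reserved} committed={committed} used={used}")
print(f"fragmentation waste: {waste:.5f}%")
print(f"committed memory released by defrag: {released_by_defrag} bytes")
```

Defragmentation released roughly 930 MiB of committed memory while used memory stayed the same, and the decommit that followed changed nothing further in these stats.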

Collaborator

romange commented Mar 19, 2025

In that case, I would like to suggest changing the problem statement. We wanted to trigger eviction based on RSS to cover use cases where the default heap uses a lot of memory but the data heap shows relatively low usage, so eviction is never woken up. This is the use case we wanted to fix, but the RSS direction seems unreliable.

I think we should write a design document stating why eviction based on RSS does not work reliably, mentioning how defragmentation and allocator caching interfere, and suggesting another solution: we still use "used memory" from the data heap like today, but we also account for the used memory of the default heap. We should also fix the defragmentation heuristic to be more aggressive in low-memory use cases, but that is a separate task.

@adiholden - thoughts?
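The alternative proposed above can be sketched in a few lines. This is a hypothetical illustration of the idea, not a design: the function name, the parameters, and the 0.7 default ratio are all assumptions made for the example.

```python
MIB = 1024 * 1024


def should_start_eviction(data_heap_used: int,
                          default_heap_used: int,
                          max_memory: int,
                          eviction_ratio: float = 0.7) -> bool:
    """Hypothetical check: count both heaps against the eviction threshold."""
    return data_heap_used + default_heap_used > max_memory * eviction_ratio


# With 512 MiB max_memory the threshold is 358.4 MiB.
# Data heap alone (300 MiB) stays under it, so eviction would not start today.
data_only = should_start_eviction(300 * MIB, 0, 512 * MIB)

# Adding 100 MiB used by the default heap crosses the threshold.
both_heaps = should_start_eviction(300 * MIB, 100 * MIB, 512 * MIB)

print(data_only, both_heaps)
```

Because both inputs come from allocator bookkeeping rather than from the OS, a check like this would not depend on when, or whether, freed pages leave the RSS.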


Successfully merging this pull request may close these issues.

Tune eviction threshold in cache mode
2 participants