Commit d4e7ffb

Improve active defrag in jemalloc 5.2 (redis#9778)
Background: Following the upgrade to jemalloc 5.2, a test that used to be flaky started failing consistently (on 32bit), so we disabled it (see redis#9645). This is a test I introduced in redis#7289 when I attempted to solve a rare stagnation problem, and it later turned out I had failed to solve it; what's more, I had added a test that made it not so rare, and, as mentioned, with jemalloc 5.2 it became consistent on 32bit. Stagnation can happen when all the slabs of the bin are equally utilized, so the decision to move an allocation from a relatively empty slab to a relatively full one will never happen; in that test all the slabs are at 50% utilization, so the defragger could just keep scanning the keyspace and never move anything.

What this PR changes:
* First, jemalloc 5.2 finally exposes the count of non-full slabs, so when we examine the utilization of the current slab we can compare it to the average utilization of the non-full slabs in our bin, instead of the total average of the bin. This takes the full slabs out of the game, since they are not candidates for migration (neither source nor target).
* Secondly, we add some 12.5% (100/8) to the decision to defrag an allocation. This is the part that aims to avoid stagnation, and it is especially important since the change above can get us closer to stagnation.
* Thirdly, since jemalloc 5.2 adds sharded bins, we take all shards into account (something that was missing from the original PR that merged it). This isn't expected to make any difference, since there should be just one shard anyway.

How this was benchmarked: I ran the memefficiency test unit with `--verbose` and compared the defragger hits and misses reported by the tests. At first, when I took only the non-full slabs into consideration, things got a lot worse (I either got into stagnation, or got a lot of misses along with a lot of hits), but once I added the 12.5% margin I got back to results slightly better than those of the jemalloc 5.1 branch, i.e. full defragmentation was achieved with fewer hits (relocations) and fewer misses (keyspace scans).
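To make the decision rule concrete, here is a minimal standalone sketch (not jemalloc code; the numbers are made up purely for illustration, the real values come from bin_infos[] and the per-shard bin stats in iget_defrag_hint): a bin with 128 regions per slab and 4 non-full slabs that together hold 256 live regions, where the scanned slab has 70 live regions, slightly above the non-full average of 64. Without the extra curregs / 8 margin that slab would not qualify for defrag; with the 12.5% margin it does, which is the anti-stagnation behaviour described above.

#include <stdbool.h>
#include <stdio.h>

int main(void) {
    unsigned nregs = 128;         /* regions per slab for this bin (hypothetical) */
    unsigned long curslabs = 4;   /* non-full slabs, slabcur already excluded (hypothetical) */
    size_t curregs = 256;         /* live regions across those non-full slabs (hypothetical) */
    int free_in_slab = 58;        /* free regions in the slab being scanned (hypothetical) */

    size_t used = nregs - free_in_slab;                            /* 70 */
    bool without_margin = used * curslabs <= curregs;              /* 280 <= 256 -> false */
    bool with_margin = used * curslabs <= curregs + curregs / 8;   /* 280 <= 288 -> true  */

    printf("defrag without margin: %d, with 12.5%% margin: %d\n",
           without_margin, with_margin);
    return 0;
}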
1 parent 362b3b0 commit d4e7ffb

2 files changed: +23 −16 lines changed

deps/jemalloc/include/jemalloc/internal/jemalloc_internal_inlines_c.h

+23 −12
@@ -235,23 +235,34 @@ iget_defrag_hint(tsdn_t *tsdn, void* ptr) {
     unsigned binshard = extent_binshard_get(slab);
     bin_t *bin = &arena->bins[binind].bin_shards[binshard];
     malloc_mutex_lock(tsdn, &bin->lock);
-    /* don't bother moving allocations from the slab currently used for new allocations */
+    /* Don't bother moving allocations from the slab currently used for new allocations */
     if (slab != bin->slabcur) {
         int free_in_slab = extent_nfree_get(slab);
         if (free_in_slab) {
             const bin_info_t *bin_info = &bin_infos[binind];
-            unsigned long curslabs = bin->stats.curslabs;
-            size_t curregs = bin->stats.curregs;
-            if (bin->slabcur) {
-                /* remove slabcur from the overall utilization */
-                curregs -= bin_info->nregs - extent_nfree_get(bin->slabcur);
-                curslabs -= 1;
+            /* Find number of non-full slabs and the number of regs in them */
+            unsigned long curslabs = 0;
+            size_t curregs = 0;
+            /* Run on all bin shards (usually just one) */
+            for (uint32_t i=0; i< bin_info->n_shards; i++) {
+                bin_t *bb = &arena->bins[binind].bin_shards[i];
+                curslabs += bb->stats.nonfull_slabs;
+                /* Deduct the regs in full slabs (they're not part of the game) */
+                unsigned long full_slabs = bb->stats.curslabs - bb->stats.nonfull_slabs;
+                curregs += bb->stats.curregs - full_slabs * bin_info->nregs;
+                if (bb->slabcur) {
+                    /* Remove slabcur from the overall utilization (not a candidate to nove from) */
+                    curregs -= bin_info->nregs - extent_nfree_get(bb->slabcur);
+                    curslabs -= 1;
+                }
             }
-            /* Compare the utilization ratio of the slab in question to the total average,
-             * to avoid precision lost and division, we do that by extrapolating the usage
-             * of the slab as if all slabs have the same usage. If this slab is less used
-             * than the average, we'll prefer to evict the data to hopefully more used ones */
-            defrag = (bin_info->nregs - free_in_slab) * curslabs <= curregs;
+            /* Compare the utilization ratio of the slab in question to the total average
+             * among non-full slabs. To avoid precision loss in division, we do that by
+             * extrapolating the usage of the slab as if all slabs have the same usage.
+             * If this slab is less used than the average, we'll prefer to move the data
+             * to hopefully more used ones. To avoid stagnation when all slabs have the same
+             * utilization, we give additional 12.5% weight to the decision to defrag. */
+            defrag = (bin_info->nregs - free_in_slab) * curslabs <= curregs + curregs / 8;
         }
     }
     malloc_mutex_unlock(tsdn, &bin->lock);

tests/unit/memefficiency.tcl

-4
@@ -389,9 +389,6 @@ start_server {tags {"defrag external:skip"} overrides {appendonly yes auto-aof-r
 r del biglist1 ;# coverage for quicklistBookmarksClear
 } {1}
 
-# Temporarily skip the active defrag edge case since it constantly fails on 32bit bit builds
-# since upgrading to jemalloc 5.2.1 (#9623). We need to resolve this and re-enabled.
-if {false} {
 test "Active defrag edge case" {
 # there was an edge case in defrag where all the slabs of a certain bin are exact the same
 # % utilization, with the exception of the current slab from which new allocations are made
@@ -494,7 +491,6 @@ start_server {tags {"defrag external:skip"} overrides {appendonly yes auto-aof-r
 r save ;# saving an rdb iterates over all the data / pointers
 }
 }
-}
 }
 }
 } ;# run_solo
