Make Set/Map balance calls faster

I've been looking at #980 which improves insertion times by ~22% in the no-rebalance case. This made me wonder: _why_ is rebalancing slow?

Let's look at `Set` because it is simpler. `Map` is similar.
From the code, we can see that `balanceL` and `balanceR` are marked as
`NOINLINE`.

https://github.com/haskell/containers/blob/9b1d9d44653640a15eecbaeb87e46906b772c420/containers/src/Data/Set/Internal.hs#L1885-L1906

The `NOINLINE`s were added in b96ffebb56d8275326c88c937b7f0dfba3bf681c citing code bloat issues. I think that's reasonable, `balanceL` and `balanceR` are moderately large functions. But let's try inlining them to see what happens.

We'll use the existing `insert` benchmark, which is pretty good for this purpose. It repeatedly inserts a maximum element into the set. This is close enough to typical usage (repeated insertion) but also a bad case (because it always makes one branch heavier). Benchmarks use GHC 9.6.6.

```
Current
  insert: OK
    684  μs ±  33 μs, 2.6 MB allocated,  55 KB copied, 7.0 MB peak memory

INLINE balanceR
  insert: OK
    463  μs ±  20 μs, 2.6 MB allocated,  56 KB copied, 7.0 MB peak memory, 32% less than baseline
```

That's a significant improvement, but it might come at the cost of code bloat.

But, there is a compromise:  
Most calls to `balanceR` will be on already balanced trees. Performing rebalancing is the rare case. Using some unsafe IO, I gathered some counts to show this:

```hs
L.foldl' (flip insert) empty [1..100]
```
```hs
-- 1 to 50
-- rebalances
0 0 1 0 1 1 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1
0 1 2 2 3 3 3 3 4 4 4 4 5 5 4 4 5 5 5 5 6 6 5 5 6 6 6 6 7 7 5 5 6 6 6 6 7 7 6 6 7 7 7 7 8 8 6 6 7 7
-- calls to balanceR

-- 51 to 100
-- rebalances
1 0 1 2 1 0 1 1 1 0 1 4 1 0 1 1 1 0 1 2 1 0 1 1 1 0 1 3 1 0 1 1 1 0 1 2 1 0 1 1 1 0  1  4 1 0 1 1 1 0
7 7 8 8 7 7 8 8 8 8 9 9 6 6 7 7 7 7 8 8 7 7 8 8 8 8 9 9 7 7 8 8 8 8 9 9 8 8 9 9 9 9 10 10 7 7 8 8 8 8
-- calls to balanceR
```

Putting it another way, the number of calls to `balanceR` is O(n log n), but the number of rebalancings is O(n) (amortized O(1) per insert).

Which brings me to the suggestion: _only inline the fast no-rebalance branch of `balanceR`_. And to make that smaller, only the no-rebalance Bin-Bin branch.

```diff
--- a/containers/src/Data/Set/Internal.hs
+++ b/containers/src/Data/Set/Internal.hs
@@ -1881,7 +1881,14 @@ balanceL x l r = case r of
 -- balanceR is called when right subtree might have been inserted to or when
 -- left subtree might have been deleted from.
 balanceR :: a -> Set a -> Set a -> Set a
-balanceR x l r = case l of
+balanceR x l r = case (l, r) of
+  (Bin ls _ _ _, Bin rs _ _ _)
+    | rs <= delta*ls -> Bin (1+ls+rs) x l r
+  _ -> balanceR' x l r
+{-# INLINE balanceR #-}
+
+balanceR' :: a -> Set a -> Set a -> Set a
+balanceR' x l r = case l of
   Tip -> case r of
            Tip -> Bin 1 x Tip Tip
            (Bin _ _ Tip Tip) -> Bin 2 x Tip r
@@ -1894,14 +1901,12 @@ balanceR x l r = case l of
   (Bin ls _ _ _) -> case r of
            Tip -> Bin (1+ls) x l Tip

-           (Bin rs rx rl rr)
-              | rs > delta*ls  -> case (rl, rr) of
+           (Bin rs rx rl rr) -> case (rl, rr) of
                    (Bin rls rlx rll rlr, Bin rrs _ _ _)
                      | rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr
                      | otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)
                    (_, _) -> error "Failure in Data.Set.balanceR"
-              | otherwise -> Bin (1+ls+rs) x l r
-{-# NOINLINE balanceR #-}
+{-# NOINLINE balanceR' #-}
```

With this, we get

```
INLINE Bin-Bin no-rebalance
  insert: OK
    500  μs ±  30 μs, 2.6 MB allocated,  55 KB copied, 7.0 MB peak memory, 26% less than baseline
```

TL;DR:

To make balanceL/R calls faster, we can

1. Inline them. This might cause some code bloat. How bad would this be, I don't know. This improves insert by 31%.
2. Or, inline the "fast path" of Bin-Bin no-rebalance. A small increase in code size for a large share of the improvement above (26%).

Edit: I should mention that I expect many Set/Map functions will benefit from this, but at the moment I have tested out the idea only on `insert`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make Set/Map balance calls faster #1053

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

	balanceR :: a -> Set a -> Set a -> Set a
	balanceR x l r = case l of
	Tip -> case r of
	Tip -> Bin 1 x Tip Tip
	(Bin _ _ Tip Tip) -> Bin 2 x Tip r
	(Bin _ rx Tip rr@(Bin _ _ _ _)) -> Bin 3 rx (Bin 1 x Tip Tip) rr
	(Bin _ rx (Bin _ rlx _ _) Tip) -> Bin 3 rlx (Bin 1 x Tip Tip) (Bin 1 rx Tip Tip)
	(Bin rs rx rl@(Bin rls rlx rll rlr) rr@(Bin rrs _ _ _))
	\| rls < ratio*rrs -> Bin (1+rs) rx (Bin (1+rls) x Tip rl) rr
	\| otherwise -> Bin (1+rs) rlx (Bin (1+size rll) x Tip rll) (Bin (1+rrs+size rlr) rx rlr rr)

	(Bin ls _ _ _) -> case r of
	Tip -> Bin (1+ls) x l Tip

	(Bin rs rx rl rr)
	\| rs > delta*ls -> case (rl, rr) of
	(Bin rls rlx rll rlr, Bin rrs _ _ _)
	\| rls < ratio*rrs -> Bin (1+ls+rs) rx (Bin (1+ls+rls) x l rl) rr
	\| otherwise -> Bin (1+ls+rs) rlx (Bin (1+ls+size rll) x l rll) (Bin (1+rrs+size rlr) rx rlr rr)
	(_, _) -> error "Failure in Data.Set.balanceR"
	\| otherwise -> Bin (1+ls+rs) x l r
	{-# NOINLINE balanceR #-}

Make Set/Map balance calls faster #1053

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions