Improve SequenceSet memory use and performance #484

@nevans

SequenceSet is very useful for managing mailbox state. But I've known since first implementing it that, even though it uses binary searches for O(lg n) lookups in many places, the current naive implementation of SequenceSet has some serious performance issues, especially at larger scales (e.g., when a set holds millions of members, which is very common for "All Mail" type mailboxes). For many scenarios, it performs worse than simply keeping a sorted array of integers (using Array#bsearch, etc.).
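
For a concrete sense of what "a sorted array of integers" means here, a minimal sketch (the helper names are made up, not part of the library) comparing membership lookups against the current nested-tuple representation and against a flat sorted Array via Array#bsearch:

```ruby
# Illustrative comparison only: membership lookup against the current
# nested [min, max] tuples vs. a plain sorted Array of every member,
# both using Array#bsearch in find-minimum mode.

runs = [[1, 5], [10, 20], [100, 120]]            # current-style [min, max] tuples

def run_include?(runs, num)
  run = runs.bsearch { |_min, max| max >= num }  # first run ending at/after num
  !run.nil? && num >= run.first
end

flat = runs.flat_map { |min, max| (min..max).to_a }  # every member, sorted

def flat_include?(flat, num)
  flat.bsearch { |n| n >= num } == num           # first member >= num, then compare
end

run_include?(runs, 15)   # => true
flat_include?(flat, 15)  # => true
```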

Most notably: by storing each element as a [min, max] range/run tuple, we create an 8-byte pointer to a 40-byte Array for each uint32x2 run. We could easily use 1/3 as much memory, either by storing a flat array (runs represented by flat.each_slice(2)) or two "columnar" arrays (runs represented by mins.zip(lens)). And that could be cut in half by storing the uint32 values in a packed string, or by merging each uint32x2 run into a single uint64 Integer; in practice that value will almost always be below 2**62 and therefore stored as a Fixnum.

So we're basically using 6x more memory than needed, and that memory isn't even in a contiguous block.
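
To make the arithmetic concrete, here is a rough sketch of the encodings described above (variable names are illustrative; none of this is the library's current internals):

```ruby
# The runs 1..5, 10..20, 25..25 in each encoding (illustrative only).

# 1. Current: one 8-byte pointer to a 40-byte Array per run.
nested    = [[1, 5], [10, 20], [25, 25]]

# 2. Flat array: runs are consecutive (min, max) pairs.
flat      = [1, 5, 10, 20, 25, 25]
flat_runs = flat.each_slice(2).to_a        # => [[1, 5], [10, 20], [25, 25]]

# 3. Columnar arrays: parallel mins/lens (len = max - min + 1).
mins      = [1, 10, 25]
lens      = [5, 11, 1]
col_runs  = mins.zip(lens)                 # => [[1, 5], [10, 11], [25, 1]]

# 4a. Packed string: each run packed as two 32-bit unsigned ints.
packed    = flat.pack("N*")                # 8 contiguous bytes per run
unpacked  = packed.unpack("N*").each_slice(2).to_a

# 4b. Merged uint64: (min << 32) | max stays a Fixnum while min < 2**30.
merged    = nested.map { |min, max| (min << 32) | max }
decoded   = merged.map { |n| [n >> 32, n & 0xFFFF_FFFF] }
```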

And that's all before we get into more advanced tricks, for example: splitting large sets into uint16 chunks, allowing runs to be stored as uint16x2. And once that's done, we can dynamically encode each chunk as either run-length-encoded runs, a bitset, or a simple array of numbers. I've tested the encoding of real-life mailbox UID sets, and this last step provides enormous memory savings. (While working on this, I accidentally re-invented a naive version of Roaring Bitmaps... and when I started searching for prior art, I discovered Roaring Bitmaps!)
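
Roughly sketched (the helper names and the 4096 cutoff here are illustrative guesses loosely modeled on Roaring's heuristics, not the planned implementation), the chunk-and-choose-a-container idea could look like this:

```ruby
# Split each uint32 into a 16-bit "chunk key" and a 16-bit offset, then pick
# the cheapest container per chunk: run-length encoded runs, a bitset, or a
# plain sorted array of offsets.

def chunked(uids)
  uids.group_by { |uid| uid >> 16 }.transform_values do |members|
    offsets = members.map { |uid| uid & 0xFFFF }.sort
    runs    = to_runs(offsets)
    if runs.size * 2 < offsets.size && runs.size * 2 < 4096
      [:runs,   runs]                       # few long runs: uint16x2 pairs
    elsif offsets.size > 4096
      [:bitset, to_bitset(offsets)]         # dense: fixed 8 KiB bitmap wins
    else
      [:array,  offsets]                    # sparse: plain sorted uint16 array
    end
  end
end

def to_runs(sorted)
  # split wherever consecutive members are not adjacent
  sorted.slice_when { |a, b| b > a + 1 }.map { |run| [run.first, run.last] }
end

def to_bitset(offsets)
  offsets.each_with_object(Array.new(1024, 0)) do |off, words|
    words[off >> 6] |= 1 << (off & 63)      # 1024 x 64-bit words = 65536 bits
  end
end
```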

Additionally, I'd like to add less naive versions of specific set operations. Most set operations are currently implemented as simply dup.mutating_version!(other). In my benchmarks, some fairly straightforward tricks make AND and XOR operations 20% to 300% faster for real-world scenarios (i.e., my own production code).
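
As one illustration of the kind of specialization meant here (intersect_runs is a made-up helper, not a proposed API): instead of duplicating and then mutating, an intersection can walk both sorted run lists once and emit the overlaps directly.

```ruby
# Merge-style intersection over two sorted, disjoint run lists.
def intersect_runs(a, b)
  result = []
  i = j = 0
  while i < a.size && j < b.size
    lo = [a[i][0], b[j][0]].max
    hi = [a[i][1], b[j][1]].min
    result << [lo, hi] if lo <= hi          # runs overlap: keep the overlap
    if a[i][1] < b[j][1]                    # advance whichever run ends first
      i += 1
    else
      j += 1
    end
  end
  result
end

intersect_runs([[1, 10], [20, 30]], [[5, 25]])  # => [[5, 10], [20, 25]]
```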

And, of course, none of this should be done without good benchmarks to justify any significant jump in complexity.
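
One possible shape for those benchmarks, assuming the benchmark-ips gem and placeholder set contents rather than real mailbox data:

```ruby
require "benchmark/ips"   # benchmark-ips gem
require "net/imap"

dense  = Net::IMAP::SequenceSet[1..1_000_000]
sparse = Net::IMAP::SequenceSet[(1..1_000_000).step(3).to_a]

Benchmark.ips do |x|
  x.report("intersection") { dense & sparse }
  x.report("membership")   { sparse.include?(999_999) }
  x.compare!
end
```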
