Optimize _byte_pair_merge to O(m log n) using heap-based candidate selection #442
Summary
-> Replaced the `O(m·n)` sequential merge scan with a heap-driven algorithm that maintains candidate merges in a max-heap keyed by rank, updating only local neighbors on each merge.
-> This yields `O(m·log n)` behavior, where `m` is the number of merges and `n` is the number of initial symbols.
Key changes:
- `_byte_pair_merge`, maintaining a linked list of live nodes and per-position versions to avoid stale heap entries.
- `compute_rank_at` and updates only affected neighbors after each merge.
- `_byte_pair_merge` boundaries.
Complexity:
- Before: repeated linear scans → approximately `O(m·n)` in worst-case merges.
- After: heap operations per merge → `O(m·log n)`, with `O(n)` initialization.
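For reviewers who want to see the idea in isolation, here is a minimal Python sketch of a heap-driven merge with a linked list of live nodes and per-position versions. It is an illustration under stated assumptions, not the code in this PR: the function name `byte_pair_merge_sketch`, the `ranks` dictionary shape, and the tuple layout of heap entries are all hypothetical. It also assumes the lowest rank merges first, so it uses Python's `heapq` (a min-heap); a max-heap with reversed ordering would behave the same way.

```python
import heapq


def byte_pair_merge_sketch(piece: bytes, ranks: dict[bytes, int]) -> list[bytes]:
    """Illustrative heap-based BPE merge (hypothetical helper, not the PR's code)."""
    n = len(piece)
    # Linked list over start offsets: node i covers piece[i:end[i]].
    prev = list(range(-1, n - 1))   # -1 marks "no left neighbor"
    nxt = list(range(1, n + 1))     # n marks "no right neighbor"
    end = list(range(1, n + 1))
    alive = [True] * n
    version = [0] * n               # bumped whenever the pair starting at i changes

    def candidate(i: int):
        """Return a heap entry for the pair (i, nxt[i]), or None if it has no rank."""
        j = nxt[i]
        if j >= n:
            return None
        rank = ranks.get(piece[i:end[j]])
        return None if rank is None else (rank, i, version[i])

    # O(n) initialization: collect every ranked adjacent pair, then heapify once.
    heap = [c for c in (candidate(i) for i in range(n)) if c is not None]
    heapq.heapify(heap)

    while heap:
        _rank, i, ver = heapq.heappop(heap)
        if not alive[i] or ver != version[i]:
            continue                # stale entry: the pair it described is gone
        j = nxt[i]
        # Merge node j into node i and unlink j from the list.
        end[i] = end[j]
        nxt[i] = nxt[j]
        if nxt[j] < n:
            prev[nxt[j]] = i
        alive[j] = False
        # Only the pairs touching the merged node changed: (i, next) and (prev, i).
        version[i] += 1
        affected = [i]
        if prev[i] >= 0:
            version[prev[i]] += 1
            affected.append(prev[i])
        for k in affected:
            entry = candidate(k)
            if entry is not None:
                heapq.heappush(heap, entry)

    # Walk the surviving nodes to collect the final tokens.
    out = []
    i = 0
    while i < n:
        out.append(piece[i:end[i]])
        i = nxt[i]
    return out
```

Each merge pushes at most two new entries and never searches the heap for old ones; stale entries are discarded when popped by comparing the stored version with the current one. Ties on rank resolve to the leftmost position because the position is the second tuple field.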