LittDB: Keymap threading improvements#3645
Conversation
PR SummaryHigh Risk Overview Keymap work is asynchronous. A new GC is split across two roles. A Iterators no longer suspend GC. Opening an iterator reserves snapshot segments so files survive until Other changes: rebase CLI migrates Reviewed by Cursor Bugbot for commit 9490600. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3645 +/- ##
==========================================
- Coverage 59.12% 58.14% -0.99%
==========================================
Files 2259 2176 -83
Lines 186489 176729 -9760
==========================================
- Hits 110255 102751 -7504
+ Misses 66353 64900 -1453
+ Partials 9881 9078 -803
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
A well-engineered, thoroughly-tested rework of LittDB keymap threading (async keymap manager, two-phase GC gated by a deletion watermark, crash-time keymap repair). No blocking correctness issues found; a couple of low-severity edge-case observations remain.
Findings: 0 blocking | 4 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor second-opinion pass produced no output (cursor-review.md is empty), so only the Codex pass and this review contributed external findings.
- Startup now runs repairKeymap on every non-reload open. On a clean shutdown this is cheap (the first key checked is present, so it stops immediately), but it's worth confirming via the existing TestCleanShutdownLeavesKeymapConsistent that the zero-rescue fast path holds for large tables in CI, since the walk does a keymap.Get per missing key.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
With litt's async keymap flush (#3645), flushing ~once per block is cheap, so: - littFlushInterval 100ms -> 5ms (~one block at Giga throughput), bounding crash loss to roughly a single block (near-WAL durability) - ShardingFactor 16 -> 1 (flushing a single segment file is cheaper; sharding mainly helps across multiple disks) DO NOT MERGE until #3645 (litt async keymap) lands: without it, flushing this often regresses write throughput ~48%. Validated via cryptosim (main littidx + #3645 + parallel getLogs): - reads-off write ceiling 167k (vs 168k prior) - reads-on ~153-157k writes / 58k point-reads/s / 520 getLogs/s @ 7ms - 3h GC soak flat (old -27% control-loop step-down gone), 0 panics Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Solid, well-tested rework of LittDB keymap threading: async keymap manager, off-control-loop GC, reservation-based iterators, and a durable gc-watermark that correctly prevents post-crash resurrection of garbage-collected keys. No blocking correctness issues found; the two Codex findings are real but low-severity disk-reclamation lags that self-heal on restart.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Watermark-drop reclamation lag (Codex #1):
publishDeletionWatermarkdrops updates whendeletionWatermarkChan(cap 1024) is full. Because a full FIFO channel retains the older, smaller values and drops the newest (largest) watermark, if a collection burst publishes >1024 segment watermarks between two control-loop drains, the control loop reads a stale max and leaves the higher-indexed collected segments' files on disk until a later publish or restart. Unlikely in practice (deletes are fsync-bounded, control loop drains every GCPeriod), and it self-heals at startup via watermark seeding. Consider a latest-value (cap-1, drain-then-send / overwrite) channel likesetTTLuses, so the freshest watermark is never lost. - Second-opinion passes:
cursor-review.mdis empty (Cursor produced no output) andREVIEW_GUIDELINES.mdis empty/missing on the base branch, so no repo-specific standards were applied beyond AGENTS.md. - Positive: the crash-consistency regression test (
TestGCWatermarkPreventsResurrectionAfterCrash) and the single-atomic-batch repair test (TestRepairKeymapSingleBatch) directly exercise the trickiest invariants; the new async paths are well covered. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
There was a problem hiding this comment.
A large but well-structured and well-tested refactor that moves keymap puts/deletes and GC collection off the control loop, adds a durable gc-watermark to prevent post-crash key resurrection, and switches iterators to reservations. No blocking correctness issues found; the one external (Codex) finding is unreachable under LittDB's documented no-overwrite contract.
Findings: 0 blocking | 6 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Codex flagged a possible stale-read: an older in-flight put-batch's writeBatch deletes a key from unflushedDataCache, potentially evicting a newer Put(K) cache entry and surfacing the older keymap value until the newer batch applies. This is technically real but unreachable in correct usage — LittDB's Put contract (table.go:21) forbids overwriting an existing value, which is the only thing that puts two put-batches for the same key in flight concurrently. For non-overwrite usage the keymap-Put-then-cache-Delete ordering is correct. Worth a code comment noting writeBatch's correctness depends on the no-overwrite invariant, but not a blocker.
- Cursor's second-opinion pass produced no output (cursor-review.md is empty); REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.
- gcManager.collectExpiredSegments fsyncs the gc-watermark file once per expired segment (advanceWatermark before each scheduleDelete). With GCPeriod lowered to 10s and large collection bursts this is many sequential fsyncs; it's off the control loop so it won't block writes, but coalescing the watermark advance per pass (single fsync to the highest collected index) would reduce syncs. Minor.
- The deletionWatermark channel is fire-and-forget and can drop updates if a single pass collects more segments than KeymapManagerWatermarkChannelSize (default 1024) before the control loop drains it, deferring some file reclamation to a later pass. This is monotonic-safe and documented, but explicit RunGC over a very large backlog may not reclaim everything in one call — acceptable given the documentation.
- repairKeymap's all-or-nothing recovery now depends on keymap.Put being crash-atomic as a batch; the interface doc was updated to require this and a test asserts a single Put call, but the guarantee is only as strong as each Keymap implementation actually provides (PebbleDB batch). Worth confirming the unsafe/non-default keymap implementations honor it.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
There was a problem hiding this comment.
A large but carefully designed refactor of LittDB keymap threading: keymap puts/deletes now run asynchronously off the control loop, GC is split into an off-loop manager with a durable gc-watermark, and crash recovery (repair + purge) is hardened against resurrecting collected keys. The design is sound and very well tested; the only issues are a documented-but-real watermark-drop edge case and a stale doc comment, neither of which blocks merge.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Stale doc comment (Codex finding):
sei-db/db_engine/litt/table.go~line 152 still states "iteration disables GC temporarily" (and has a typo, "Additinoally"). This PR changes iterators to reserve their snapshot segments while GC continues, so the comment is now misleading about iterator/GC/disk-retention semantics. Not flaggable inline because that line is not in the PR diff, but it should be updated alongside the other iterator doc updates in this PR. - GCPeriod default was lowered from 5m to 10s (littdb_config.go). This is a meaningful behavioral change for all LittDB consumers (much more frequent collection passes). It is justified in the comment (collection is now cheap on the control loop), but worth calling out explicitly in the PR description / release notes since it affects every deployment, not just the keymap-threading path.
- REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific review standards could be applied. The Cursor second-opinion pass (cursor-review.md) produced no output (empty file); only the Codex pass contributed findings, both of which are addressed here.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
…ei-protocol#3660) ## Description - NOTE: Only Merge post [3645](sei-protocol#3645): That makes per-block flush affordable, reclaiming near-per-block crash durability for free. - `littFlushInterval` 100ms -> **5ms** (~one block at Giga throughput) — bounds crash loss to ~a single block (near-WAL durability for receipt bodies). - `ShardingFactor` 16 -> **1**. flushing one segment file is cheaper; sharding mainly helps across multiple disks. ## Testing - reads-off write ceiling **167k** (vs 168k prior) - reads-on **~153–157k** writes / 58k point-reads/s / 520 getLogs/s @ 7ms - 3h GC soak **flat** (old −27% control-loop step-down gone), 0 panics Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
A large, well-engineered refactor moving keymap puts/deletes and GC collection off the control loop, backed by a durable gc-watermark that prevents post-crash resurrection of collected keys, with strong test coverage. No blocking defects found; a few bounded file-reclamation/lag edge cases (all self-healing on restart) are worth noting.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Tooling note: the Cursor second-opinion review file (cursor-review.md) was empty, and REVIEW_GUIDELINES.md was empty/missing, so no repo-specific standards or Cursor findings could be incorporated. Codex's review was present and its three findings are merged below.
- Overall the durability story is sound: the gc-watermark file is fsynced before the keymap deletes it guards, reads/repair/reload all floor at it, and it is re-seeded into the in-memory deletion watermark on restart — so the reclamation-lag edge cases below cannot cause permanent data resurrection or loss, only transient/at-rest disk-space lag that a later collection or a restart resolves. Worth a brief note in the PR description that these are accepted, self-healing limitations.
- (Codex 3) gcManager.setTTL drains-then-sends on a size-1 channel without synchronization, so two concurrent SetTTL calls can each see the slot empty and one update is silently dropped, leaving a stale TTL. The Table interface advertises all methods as thread-safe; the code comment instead asserts SetTTL 'is not expected to be invoked concurrently'. Low impact, but consider either guarding with a mutex or documenting the relaxed contract on the interface.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
|
|
||
| // Schedule the segment's keymap entries for deletion. The manager applies it asynchronously and then | ||
| // advances the deletion watermark, which lets the control loop later delete the segment's files. | ||
| if err := g.keymapManager.scheduleDelete(g.gcCursorKeys, int64(index)); err != nil { |
There was a problem hiding this comment.
[suggestion] When a collected segment has zero keys, scheduleDelete returns early (keymap_manager.go:180 treats an empty batch as a no-op), so the keymap manager never enqueues a delete for it and therefore never calls publishDeletionWatermark for this segment's index — even though advanceWatermark(index+1) and deletionScheduledIndex have already moved past it. The control loop's in-memory deletion watermark thus won't advance to cover an empty collected segment until a later non-empty segment publishes a higher watermark; if the empty segment is the last one collected, its files linger unreclaimed until the next non-empty collection (or a restart, which re-seeds from the durable gc-watermark). Empty sealed segments are reachable (e.g. an empty mutable segment sealed on clean close and reopened). Consider publishing the deletion watermark for the segment even when its key batch is empty (or have GC schedule a watermark advance directly) so reclamation isn't deferred.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
There are 4 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9490600. Configure here.
| case *keymapManagerShutdownRequest: | ||
| m.drainAll() | ||
| req.shutdownCompleteChan <- struct{}{} | ||
| return true |
There was a problem hiding this comment.
Shutdown ignores failed keymap drain
High Severity
The keymap manager shutdown handler always signals completion after drainAll, even when drainAll returns false because a put or delete failed. The sync handler correctly gates its ack on a successful drain. A failed shutdown drain can still let Close seal the mutable segment and stop the keymap while scheduled async puts never became durable.
Reviewed by Cursor Bugbot for commit 9490600. Configure here.
| } | ||
| if gcWatermarkFile == nil { | ||
| gcWatermarkFile = f | ||
| } |
There was a problem hiding this comment.
GC watermark wrong table root
Medium Severity
When no gc-watermark exists yet, NewDiskTable pins the watermark file to the first configured root’s table directory, not the table root that actually hosts the keymap (available as keymapPath). After a rebase or path reorder, GC can persist the barrier on the wrong volume while the keymap lives elsewhere, so restart may miss the durable watermark and repair/reload can resurrect collected keys.
Reviewed by Cursor Bugbot for commit 9490600. Configure here.
There was a problem hiding this comment.
A large, carefully-engineered refactor that moves keymap writes and GC off the control loop and adds a durable gc-watermark to prevent resurrection of garbage-collected data; the design and test coverage are strong, but the lossy deletion-watermark channel can permanently strand the control loop's watermark below the true durable-delete frontier, which both leaks segment files and resurrects GC-deleted keys via iterator/boundary-key reads.
Findings: 1 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
- 1 blocking issue(s) flagged inline on specific lines.
Non-blocking
- REVIEW_GUIDELINES.md (from the base branch) is empty/missing, so no repo-specific review standards could be applied; this review uses general engineering criteria only.
- cursor-review.md is empty — the Cursor second-opinion pass produced no output. The Codex pass produced one finding (the deletion-watermark drop), which is captured inline below and which I independently confirmed.
- The KeymapManagerWatermarkChannelSize config doc frames a dropped watermark as merely deferring file reclamation 'until a later collection'. That understates the impact in two ways: (1) a later collection only re-publishes if more segments expire, so on an idle table the gap is permanent, not deferred; (2) it omits that readableFloor() is derived from the same stuck watermark, so reads — not just file cleanup — are affected. Consider updating the doc once the underlying issue is fixed.
- Nice regression coverage: TestGCWatermarkPreventsResurrectionAfterCrash, TestReadsSkipCollectedSegments{BeforeReclaim,AfterCrash}, TestRepairKeymapSingleBatch, and TestCleanShutdownLeavesKeymapConsistent directly exercise the crash/ordering invariants this PR introduces. Worth adding one more test that overflows the watermark channel (publish more than KeymapManagerWatermarkChannelSize segments between control-loop drains, then go idle) to lock in the fix for the blocker below.
| // durably deleted, so the control loop may reclaim their files. | ||
| func (c *controlLoop) publishDeletionWatermark(watermark int64) { | ||
| select { | ||
| case c.deletionWatermarkChan <- watermark: |
There was a problem hiding this comment.
[blocker] Blocker (correctness): this non-blocking send drops the watermark when deletionWatermarkChan is full, and draining via refreshDeletionWatermark is the only thing that advances keymapDeletionWatermark. Because watermarks are monotonically increasing segment indices, an overflow drops the highest (latest) values and leaves only older ones in the channel. If a GC burst overflows the channel within one control-loop drain interval and GC then goes idle (no further publishDeletionWatermark, since collectExpiredSegments only resumes above deletionScheduledIndex), keymapDeletionWatermark is stuck below the true durable-delete frontier permanently.
That strands readableFloor() low too: segments whose keymap entries are durably deleted stay inside [readableFloor, highestSegmentIndex) and are read directly by iterators and computeOldest/NewestPrimaryKey, resurrecting GC-deleted keys (and their files are never reclaimed). This is exactly the resurrection class this PR exists to prevent.
Fix: make this a latest-value channel like gcManager.setTTL — drain the stale value, then send — so the freshest monotonic watermark is always present. It's safe: single producer (keymap manager goroutine), monotonic value, control loop takes the max on drain.


Describe your changes and provide context
Several changes to the threading of the keymap, with potentially large impacts for some workloads.