Skip to content

LittDB: Keymap threading improvements#3645

Open
cody-littley wants to merge 12 commits into
mainfrom
cjl/litt-fast-flush
Open

LittDB: Keymap threading improvements#3645
cody-littley wants to merge 12 commits into
mainfrom
cjl/litt-fast-flush

Conversation

@cody-littley

Copy link
Copy Markdown
Contributor

Describe your changes and provide context

Several changes to the threading of the keymap, with potentially large impacts for some workloads.

  • Writing to the keymap is now asyncronous and off the main control loop thread
    • this makes it possible for a crash to leave data in a segment but not in the keymap, we now detect and fix this at startup time
    • flush should have lower latency, especially for workloads with many keys being flushed
    • flushing LittDB should now be roughly equivalent to flushing a WAL, since that's basically all a littDB segment is
  • keymap GC now happens off the main control loop thread

@cursor

cursor Bot commented Jun 25, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes core LittDB persistence, GC ordering, and crash recovery in the storage engine; incorrect watermark or async keymap behavior could cause data loss or resurrecting deleted keys.

Overview
This PR restructures LittDB disk-table threading and garbage-collection correctness so flushes and GC no longer block the control loop on keymap I/O, and crash/restart cannot resurrect logically deleted data.

Keymap work is asynchronous. A new keymapManager goroutine applies puts (after segment data is durable) and GC-driven deletes in FIFO order, with batching and backpressure. The flush loop schedules keymap writes instead of blocking on keymap.Put. Clean shutdown drains the manager so the keymap matches segments without relying on repair.

GC is split across two roles. A gcManager goroutine TTL-expires sealed segments, fsyncs a table-root gc-watermark file before scheduling keymap deletes, and hands deletes to the keymap manager. The control loop only reclaims segment files after the deletion watermark covers those segments. Startup loads the watermark (across roots), runs repairKeymap / purgeCollectedKeymapEntries, and floors reads and keymap reload at the lowest readable segment.

Iterators no longer suspend GC. Opening an iterator reserves snapshot segments so files survive until Close; GC and file deletion can proceed while iterators still read reserved segments.

Other changes: rebase CLI migrates gc-watermark with the keymap; Exists checks that the backing segment is still live; CacheAwareGet is removed from the table API and cache layer; new config knobs for manager/GC channels and default GC period 10s (was 5 minutes).

Reviewed by Cursor Bugbot for commit 9490600. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 30, 2026, 2:14 PM

Comment thread sei-db/db_engine/litt/disktable/control_loop.go Outdated
Comment thread sei-db/db_engine/litt/disktable/control_loop.go
@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.14%. Comparing base (c528303) to head (9490600).
⚠️ Report is 21 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3645      +/-   ##
==========================================
- Coverage   59.12%   58.14%   -0.99%     
==========================================
  Files        2259     2176      -83     
  Lines      186489   176729    -9760     
==========================================
- Hits       110255   102751    -7504     
+ Misses      66353    64900    -1453     
+ Partials     9881     9078     -803     
Flag Coverage Δ
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-db/db_engine/litt/cli/rebase.go 59.04% <ø> (ø)
sei-db/db_engine/litt/dbcache/cached_table.go 57.50% <ø> (-5.27%) ⬇️
sei-db/db_engine/litt/disktable/control_loop.go 68.47% <ø> (-0.88%) ⬇️
sei-db/db_engine/litt/disktable/disk_table.go 64.14% <ø> (-0.37%) ⬇️
sei-db/db_engine/litt/disktable/flush_loop.go 44.89% <ø> (-0.39%) ⬇️
...ei-db/db_engine/litt/disktable/forward_iterator.go 69.23% <ø> (ø)
...ei-db/db_engine/litt/disktable/reverse_iterator.go 71.73% <ø> (ø)
sei-db/db_engine/litt/littdb_config.go 42.10% <ø> (-14.76%) ⬇️
sei-db/db_engine/litt/table.go 100.00% <ø> (ø)

... and 83 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread sei-db/db_engine/litt/disktable/keymap_manager.go Outdated
Comment thread sei-db/db_engine/litt/disktable/disk_table.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-engineered, thoroughly-tested rework of LittDB keymap threading (async keymap manager, two-phase GC gated by a deletion watermark, crash-time keymap repair). No blocking correctness issues found; a couple of low-severity edge-case observations remain.

Findings: 0 blocking | 4 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Cursor second-opinion pass produced no output (cursor-review.md is empty), so only the Codex pass and this review contributed external findings.
  • Startup now runs repairKeymap on every non-reload open. On a clean shutdown this is cheap (the first key checked is present, so it stops immediately), but it's worth confirming via the existing TestCleanShutdownLeavesKeymapConsistent that the zero-rescue fast path holds for large tables in CI, since the walk does a keymap.Get per missing key.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/db_engine/litt/disktable/keymap_manager.go Outdated
Comment thread sei-db/db_engine/litt/disktable/flush_loop.go
Kbhat1 added a commit that referenced this pull request Jun 29, 2026
With litt's async keymap flush (#3645), flushing ~once per block is cheap, so:
- littFlushInterval 100ms -> 5ms (~one block at Giga throughput), bounding
  crash loss to roughly a single block (near-WAL durability)
- ShardingFactor 16 -> 1 (flushing a single segment file is cheaper; sharding
  mainly helps across multiple disks)

DO NOT MERGE until #3645 (litt async keymap) lands: without it, flushing this
often regresses write throughput ~48%.

Validated via cryptosim (main littidx + #3645 + parallel getLogs):
- reads-off write ceiling 167k (vs 168k prior)
- reads-on ~153-157k writes / 58k point-reads/s / 520 getLogs/s @ 7ms
- 3h GC soak flat (old -27% control-loop step-down gone), 0 panics

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread sei-db/db_engine/litt/disktable/disk_table.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-tested rework of LittDB keymap threading: async keymap manager, off-control-loop GC, reservation-based iterators, and a durable gc-watermark that correctly prevents post-crash resurrection of garbage-collected keys. No blocking correctness issues found; the two Codex findings are real but low-severity disk-reclamation lags that self-heal on restart.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Watermark-drop reclamation lag (Codex #1): publishDeletionWatermark drops updates when deletionWatermarkChan (cap 1024) is full. Because a full FIFO channel retains the older, smaller values and drops the newest (largest) watermark, if a collection burst publishes >1024 segment watermarks between two control-loop drains, the control loop reads a stale max and leaves the higher-indexed collected segments' files on disk until a later publish or restart. Unlikely in practice (deletes are fsync-bounded, control loop drains every GCPeriod), and it self-heals at startup via watermark seeding. Consider a latest-value (cap-1, drain-then-send / overwrite) channel like setTTL uses, so the freshest watermark is never lost.
  • Second-opinion passes: cursor-review.md is empty (Cursor produced no output) and REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific standards were applied beyond AGENTS.md.
  • Positive: the crash-consistency regression test (TestGCWatermarkPreventsResurrectionAfterCrash) and the single-atomic-batch repair test (TestRepairKeymapSingleBatch) directly exercise the trickiest invariants; the new async paths are well covered.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/db_engine/litt/disktable/gc_manager.go
Comment thread sei-db/db_engine/litt/docs/filesystem_layout.md
Comment thread sei-db/db_engine/litt/disktable/disk_table.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large but well-structured and well-tested refactor that moves keymap puts/deletes and GC collection off the control loop, adds a durable gc-watermark to prevent post-crash key resurrection, and switches iterators to reservations. No blocking correctness issues found; the one external (Codex) finding is unreachable under LittDB's documented no-overwrite contract.

Findings: 0 blocking | 6 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Codex flagged a possible stale-read: an older in-flight put-batch's writeBatch deletes a key from unflushedDataCache, potentially evicting a newer Put(K) cache entry and surfacing the older keymap value until the newer batch applies. This is technically real but unreachable in correct usage — LittDB's Put contract (table.go:21) forbids overwriting an existing value, which is the only thing that puts two put-batches for the same key in flight concurrently. For non-overwrite usage the keymap-Put-then-cache-Delete ordering is correct. Worth a code comment noting writeBatch's correctness depends on the no-overwrite invariant, but not a blocker.
  • Cursor's second-opinion pass produced no output (cursor-review.md is empty); REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.
  • gcManager.collectExpiredSegments fsyncs the gc-watermark file once per expired segment (advanceWatermark before each scheduleDelete). With GCPeriod lowered to 10s and large collection bursts this is many sequential fsyncs; it's off the control loop so it won't block writes, but coalescing the watermark advance per pass (single fsync to the highest collected index) would reduce syncs. Minor.
  • The deletionWatermark channel is fire-and-forget and can drop updates if a single pass collects more segments than KeymapManagerWatermarkChannelSize (default 1024) before the control loop drains it, deferring some file reclamation to a later pass. This is monotonic-safe and documented, but explicit RunGC over a very large backlog may not reclaim everything in one call — acceptable given the documentation.
  • repairKeymap's all-or-nothing recovery now depends on keymap.Put being crash-atomic as a batch; the interface doc was updated to require this and a test asserts a single Put call, but the guarantee is only as strong as each Keymap implementation actually provides (PebbleDB batch). Worth confirming the unsafe/non-default keymap implementations honor it.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/db_engine/litt/disktable/keymap_manager.go
Comment thread sei-db/db_engine/litt/disktable/disk_table.go
Comment thread sei-db/db_engine/litt/disktable/disk_table.go
Comment thread sei-db/db_engine/litt/disktable/gc_watermark_file.go
Comment thread sei-db/db_engine/litt/disktable/gc_manager.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large but carefully designed refactor of LittDB keymap threading: keymap puts/deletes now run asynchronously off the control loop, GC is split into an off-loop manager with a durable gc-watermark, and crash recovery (repair + purge) is hardened against resurrecting collected keys. The design is sound and very well tested; the only issues are a documented-but-real watermark-drop edge case and a stale doc comment, neither of which blocks merge.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Stale doc comment (Codex finding): sei-db/db_engine/litt/table.go ~line 152 still states "iteration disables GC temporarily" (and has a typo, "Additinoally"). This PR changes iterators to reserve their snapshot segments while GC continues, so the comment is now misleading about iterator/GC/disk-retention semantics. Not flaggable inline because that line is not in the PR diff, but it should be updated alongside the other iterator doc updates in this PR.
  • GCPeriod default was lowered from 5m to 10s (littdb_config.go). This is a meaningful behavioral change for all LittDB consumers (much more frequent collection passes). It is justified in the comment (collection is now cheap on the control loop), but worth calling out explicitly in the PR description / release notes since it affects every deployment, not just the keymap-threading path.
  • REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific review standards could be applied. The Cursor second-opinion pass (cursor-review.md) produced no output (empty file); only the Codex pass contributed findings, both of which are addressed here.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/db_engine/litt/disktable/keymap_manager.go Outdated
Sahil-4555 pushed a commit to Sahil-4555/sei-chain that referenced this pull request Jun 30, 2026
…ei-protocol#3660)

## Description
- NOTE: Only Merge post
[3645](sei-protocol#3645): That makes
per-block flush affordable, reclaiming near-per-block crash durability
for free.
- `littFlushInterval` 100ms -> **5ms** (~one block at Giga throughput) —
bounds crash loss to ~a single block (near-WAL durability for receipt
bodies).
- `ShardingFactor` 16 -> **1**. flushing one segment file is cheaper;
sharding mainly helps across multiple disks.

## Testing
- reads-off write ceiling **167k** (vs 168k prior)
- reads-on **~153–157k** writes / 58k point-reads/s / 520 getLogs/s @
7ms
- 3h GC soak **flat** (old −27% control-loop step-down gone), 0 panics

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread sei-db/db_engine/litt/disktable/disk_table.go
Comment thread sei-db/db_engine/litt/disktable/control_loop.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large, well-engineered refactor moving keymap puts/deletes and GC collection off the control loop, backed by a durable gc-watermark that prevents post-crash resurrection of collected keys, with strong test coverage. No blocking defects found; a few bounded file-reclamation/lag edge cases (all self-healing on restart) are worth noting.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Tooling note: the Cursor second-opinion review file (cursor-review.md) was empty, and REVIEW_GUIDELINES.md was empty/missing, so no repo-specific standards or Cursor findings could be incorporated. Codex's review was present and its three findings are merged below.
  • Overall the durability story is sound: the gc-watermark file is fsynced before the keymap deletes it guards, reads/repair/reload all floor at it, and it is re-seeded into the in-memory deletion watermark on restart — so the reclamation-lag edge cases below cannot cause permanent data resurrection or loss, only transient/at-rest disk-space lag that a later collection or a restart resolves. Worth a brief note in the PR description that these are accepted, self-healing limitations.
  • (Codex 3) gcManager.setTTL drains-then-sends on a size-1 channel without synchronization, so two concurrent SetTTL calls can each see the slot empty and one update is silently dropped, leaving a stale TTL. The Table interface advertises all methods as thread-safe; the code comment instead asserts SetTTL 'is not expected to be invoked concurrently'. Low impact, but consider either guarding with a mutex or documenting the relaxed contract on the interface.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

Comment thread sei-db/db_engine/litt/disktable/control_loop.go

// Schedule the segment's keymap entries for deletion. The manager applies it asynchronously and then
// advances the deletion watermark, which lets the control loop later delete the segment's files.
if err := g.keymapManager.scheduleDelete(g.gcCursorKeys, int64(index)); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] When a collected segment has zero keys, scheduleDelete returns early (keymap_manager.go:180 treats an empty batch as a no-op), so the keymap manager never enqueues a delete for it and therefore never calls publishDeletionWatermark for this segment's index — even though advanceWatermark(index+1) and deletionScheduledIndex have already moved past it. The control loop's in-memory deletion watermark thus won't advance to cover an empty collected segment until a later non-empty segment publishes a higher watermark; if the empty segment is the last one collected, its files linger unreclaimed until the next non-empty collection (or a restart, which re-seeds from the durable gc-watermark). Empty sealed segments are reachable (e.g. an empty mutable segment sealed on clean close and reopened). Consider publishing the deletion watermark for the segment even when its key batch is empty (or have GC schedule a watermark advance directly) so reclamation isn't deferred.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9490600. Configure here.

case *keymapManagerShutdownRequest:
m.drainAll()
req.shutdownCompleteChan <- struct{}{}
return true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shutdown ignores failed keymap drain

High Severity

The keymap manager shutdown handler always signals completion after drainAll, even when drainAll returns false because a put or delete failed. The sync handler correctly gates its ack on a successful drain. A failed shutdown drain can still let Close seal the mutable segment and stop the keymap while scheduled async puts never became durable.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9490600. Configure here.

}
if gcWatermarkFile == nil {
gcWatermarkFile = f
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GC watermark wrong table root

Medium Severity

When no gc-watermark exists yet, NewDiskTable pins the watermark file to the first configured root’s table directory, not the table root that actually hosts the keymap (available as keymapPath). After a rebase or path reorder, GC can persist the barrier on the wrong volume while the keymap lives elsewhere, so restart may miss the durable watermark and repair/reload can resurrect collected keys.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 9490600. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large, carefully-engineered refactor that moves keymap writes and GC off the control loop and adds a durable gc-watermark to prevent resurrection of garbage-collected data; the design and test coverage are strong, but the lossy deletion-watermark channel can permanently strand the control loop's watermark below the true durable-delete frontier, which both leaks segment files and resurrects GC-deleted keys via iterator/boundary-key reads.

Findings: 1 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.
  • 1 blocking issue(s) flagged inline on specific lines.

Non-blocking

  • REVIEW_GUIDELINES.md (from the base branch) is empty/missing, so no repo-specific review standards could be applied; this review uses general engineering criteria only.
  • cursor-review.md is empty — the Cursor second-opinion pass produced no output. The Codex pass produced one finding (the deletion-watermark drop), which is captured inline below and which I independently confirmed.
  • The KeymapManagerWatermarkChannelSize config doc frames a dropped watermark as merely deferring file reclamation 'until a later collection'. That understates the impact in two ways: (1) a later collection only re-publishes if more segments expire, so on an idle table the gap is permanent, not deferred; (2) it omits that readableFloor() is derived from the same stuck watermark, so reads — not just file cleanup — are affected. Consider updating the doc once the underlying issue is fixed.
  • Nice regression coverage: TestGCWatermarkPreventsResurrectionAfterCrash, TestReadsSkipCollectedSegments{BeforeReclaim,AfterCrash}, TestRepairKeymapSingleBatch, and TestCleanShutdownLeavesKeymapConsistent directly exercise the crash/ordering invariants this PR introduces. Worth adding one more test that overflows the watermark channel (publish more than KeymapManagerWatermarkChannelSize segments between control-loop drains, then go idle) to lock in the fix for the blocker below.

// durably deleted, so the control loop may reclaim their files.
func (c *controlLoop) publishDeletionWatermark(watermark int64) {
select {
case c.deletionWatermarkChan <- watermark:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[blocker] Blocker (correctness): this non-blocking send drops the watermark when deletionWatermarkChan is full, and draining via refreshDeletionWatermark is the only thing that advances keymapDeletionWatermark. Because watermarks are monotonically increasing segment indices, an overflow drops the highest (latest) values and leaves only older ones in the channel. If a GC burst overflows the channel within one control-loop drain interval and GC then goes idle (no further publishDeletionWatermark, since collectExpiredSegments only resumes above deletionScheduledIndex), keymapDeletionWatermark is stuck below the true durable-delete frontier permanently.

That strands readableFloor() low too: segments whose keymap entries are durably deleted stay inside [readableFloor, highestSegmentIndex) and are read directly by iterators and computeOldest/NewestPrimaryKey, resurrecting GC-deleted keys (and their files are never reclaimed). This is exactly the resurrection class this PR exists to prevent.

Fix: make this a latest-value channel like gcManager.setTTL — drain the stale value, then send — so the freshest monotonic watermark is always present. It's safe: single producer (keymap manager goroutine), monotonic value, control loop takes the max on drain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants