persist: replace file-per-entry with WAL and refactor into generic indexedWAL#3044
persist: replace file-per-entry with WAL and refactor into generic indexedWAL#3044wen-coding wants to merge 37 commits intomainfrom
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3044 +/- ##
==========================================
+ Coverage 58.27% 58.36% +0.08%
==========================================
Files 2076 2080 +4
Lines 171480 170889 -591
==========================================
- Hits 99933 99735 -198
+ Misses 62619 62232 -387
+ Partials 8928 8922 -6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| for lane, first := range laneFirsts { | ||
| lw, ok := bp.lanes[lane] | ||
| if !ok { | ||
| continue // no WAL yet; PersistBlock will create one lazily | ||
| } | ||
| lane, fileN, err := parseBlockFilename(entry.Name()) | ||
| if err != nil { | ||
| firstBN, ok := lw.firstBlockNum().Get() | ||
| if !ok || first <= firstBN { | ||
| continue | ||
| } | ||
| first, ok := laneFirsts[lane] | ||
| if ok && fileN >= first { | ||
| continue | ||
| walIdx := lw.firstIdx + uint64(first-firstBN) | ||
| if err := lw.TruncateBefore(walIdx); err != nil { | ||
| return fmt.Errorf("truncate lane %s WAL before block %d: %w", lane, first, err) | ||
| } | ||
| path := filepath.Join(bp.dir, entry.Name()) | ||
| if err := os.Remove(path); err != nil && !os.IsNotExist(err) { | ||
| logger.Warn("failed to delete block file", "path", path, "err", err) | ||
| } |
Check warning
Code scanning / CodeQL
Iteration over map
| dbwal.Config{ | ||
| WriteBufferSize: 0, // synchronous writes | ||
| WriteBatchSize: 1, // no batching | ||
| FsyncEnabled: true, |
There was a problem hiding this comment.
Be aware of the latency impact here, if putting write in critical path, it would introduce some noticeable latency. Would recommend synchronous write + nofsync for perf reason, fsync does provide stronger guarantees, but the chance of all validators hitting power off at the same time is pretty rare
There was a problem hiding this comment.
That's reasonable, changed
There was a problem hiding this comment.
@yzang2019 Talked with @pompon0 about this, since lane block QC is only f+1, theoretically one OS crash can screw us. We are changing all lanes and commitqc writes to happen concurrently, so latency is less of a concern. Is it okay if I change the Fsync back to true here?
| // Used when all entries are stale (e.g. the prune anchor advanced past | ||
| // everything persisted). | ||
| // | ||
| // TODO: sei-db/wal doesn't expose tidwall/wal's AllowEmpty option, so there's |
There was a problem hiding this comment.
Is the plan to expose that in a separate PR? It should be pretty simple to add though?
There was a problem hiding this comment.
PR 3049 merged and switched to TruncateAll() here.
| if err := w.wal.Write(entry); err != nil { | ||
| return err | ||
| } | ||
| if w.firstIdx == 0 { |
There was a problem hiding this comment.
Recommend using Count() == 0 instead of relying on firstIdx == 0 as a sentinel for "WAL is empty" incase the assumption of wal starting index from 0 is not valid in the future if we switch the wal library
| return nil, nil | ||
| } | ||
| entries := make([]T, 0, w.Count()) | ||
| err := w.wal.Replay(w.firstIdx, w.nextIdx-1, func(_ uint64, entry T) error { |
There was a problem hiding this comment.
There's no validation that len(entries) == w.Count() after a successful replay. If Replay succeeds but returns fewer entries than expected (e.g., the underlying WAL silently truncated its tail on open due to corruption), ReadAll would return a short slice with no error.
| return nil | ||
| } | ||
| for _, lw := range bp.lanes { | ||
| if err := lw.Close(); err != nil { |
There was a problem hiding this comment.
If one lane's WAL fails to close, the remaining lanes are never closed. Use errors.Join to accumulate errors and close all lanes
Replace the file-per-block and file-per-commitqc persistence with sei-db/wal. Blocks use one WAL per lane so that truncation is independent (no stale-lane problem). CommitQCs use a single WAL with a linear RoadIndex-to-WAL-index mapping. Key changes: - BlockPersister: per-lane WAL in blocks/<hex_lane_id>/ subdirs, lazy lane creation, independent per-lane TruncateBefore. - CommitQCPersister: single WAL in commitqcs/, tracks firstWALIdx and nextWALIdx locally for correct truncation mapping. - Remove all file-per-entry code: filename construction/parsing, directory scanning, individual file read/write/delete, corrupt file skipping. - Rewrite tests for WAL semantics (append-only, truncation, replay). Made-with: Cursor
Extract common WAL mechanics (index tracking, typed write/replay, truncation) into a generic indexedWAL[T] backed by sei-db/wal, replacing the duplicated raw-bytes WAL setup in both blocks.go and commitqcs.go. Key changes: - Add indexedWAL[T] with codec[T] interface for typed serialization - laneWAL embeds indexedWAL; firstBlockNum() returns Option for safety - DeleteBefore now removes stale lane WALs (validators no longer in committee) and their directories - Add empty-WAL guard to CommitQCPersister.DeleteBefore - Add direct unit tests for indexedWAL (wal_test.go) - Add TODO for dynamic committee membership support Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Instead of extracting the lane ID from the first replayed entry (and closing/skipping empty WALs), decode it from the hex directory name. This keeps the WAL open so the lane can receive blocks without reopening it. Made-with: Cursor
Made-with: Cursor
Validate the directory name (hex decode + PublicKeyFromBytes) before opening the WAL, avoiding a redundant hex.DecodeString call. Add tests for both skip paths: non-hex directory name and valid hex but invalid public key length. Made-with: Cursor
Replace callback-based Replay on indexedWAL with ReadAll that returns a slice. Remove the defensive sort in blocks loadAll since WAL entries are already in append order. Fix stale Replay reference in godoc. Made-with: Cursor
TruncateBefore now reads and verifies the entry at the target WAL index before truncating, catching index-mapping corruption before data loss. PersistCommitQC and PersistBlock enforce strict sequential order to prevent gaps that would break the linear domain-to-WAL-index mapping. Made-with: Cursor
The non-contiguous commitQC test now expects the gap to be caught
at PersistCommitQC time ("out of sequence") rather than at NewState
load time, matching the defense-in-depth guard added earlier.
Made-with: Cursor
Made-with: Cursor
Add contiguity checks during WAL replay to catch on-disk corruption that bypasses write-time guards. Includes tests that write directly to the WAL to simulate corrupted data. Made-with: Cursor
Reduces log noise on restart with many validators and blocks. Made-with: Cursor
Made-with: Cursor
BlockPersister uses Option[string] for dir, CommitQCPersister uses Option[*indexedWAL] for iw. None = no-op mode, Some = real persistence. The noop behavior is now structurally implied rather than a separate flag. Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
When a node restarts after being offline for a long time, the prune anchor may have advanced past all locally persisted WAL entries. - Add indexedWAL.Reset() to close, remove, and reopen a fresh WAL. - DeleteBefore in both blocks.go and commitqcs.go now calls Reset() when the prune point is at or past the last persisted entry. - CommitQCPersister.DeleteBefore advances the write cursor (cp.next) in the reset branch, making ResetNext unnecessary. - PersistCommitQC now silently ignores duplicates (idx < next) so startup can idempotently re-persist in-memory entries after a reset. - Remove ResetNext; replace call sites with a re-persist loop at startup and rely on DeleteBefore's cursor management at runtime. - Reorder runPersist: prune before writes (WAL needs contiguous indices). - Update runPersist godoc to match new step ordering. - Add tests for Reset, DeleteBefore-past-all, duplicate no-op, and an integration test for NewState with anchor past all persisted QCs. Made-with: Cursor
Made-with: Cursor
When the WAL was reset (anchor past all entries) and the process crashed before writing new entries, restart would find an empty WAL with cp.next=0. The old guard order (count==0 early return before cursor check) prevented DeleteBefore from advancing cp.next, causing the subsequent PersistCommitQC to fail with "out of sequence". Fix: check idx >= cp.next before the count==0 guard so the cursor is always advanced, even on an already-empty WAL. Add TestCommitQCDeleteBeforePastAllCrashRecovery. Made-with: Cursor
The loop over all in-memory CommitQCs was unnecessary — entries loaded from the WAL are guaranteed to survive DeleteBefore (it only removes entries below the anchor). Only the anchor's CommitQC could be missing after a WAL reset, so persist just that one entry. Made-with: Cursor
- Disable WAL fsync; the prune anchor (A/B files with fsync) already provides the crash-recovery watermark. - Use Count() == 0 instead of firstIdx == 0 for emptiness checks in Write and ReadAll for robustness. - Add post-replay count check in ReadAll to detect silent data loss. - Use errors.Join in BlockPersister.Close to close all lane WALs. - Rename laneDir → lanePath to avoid shadowing the laneDir() function. - Add TestIndexedWAL_ReadAllDetectsStaleNextIdx. Made-with: Cursor
c0727e6 to
422aa8f
Compare
Now that sei-db/wal exposes AllowEmpty and TruncateAll (#3049), use them to clear a WAL in-place instead of the heavier close → remove directory → reopen pattern. - Enable AllowEmpty in WAL config. - Replace Reset() with TruncateAll() — single call, no dir removal. - Remove dir/codec fields from indexedWAL (only needed for reopen). - Eliminate firstIdx == 0 sentinel: Count() is now just nextIdx - firstIdx, empty when equal. Write() no longer needs the first-write bookkeeping branch. - Update openIndexedWAL to handle AllowEmpty's empty-log reporting (first > last) uniformly with the non-empty case. Made-with: Cursor
…runcates" Made-with: Cursor
sei-db/wal.NewWAL no longer accepts a *slog.Logger parameter. Made-with: Cursor
Instead of immediately deleting lane WALs not in the current committee, retain them for 30 minutes (defaultStaleRetention) after the last write. This gives catching-up peers time to fetch blocks from lanes that have left the committee. - Add lastWriteTime to indexedWAL, initialized to time.Now() on open and updated on every Write() - Add staleRetention field to BlockPersister (default 30m) - DeleteBefore skips stale lane deletion when lastWriteTime is recent - Tests use staleRetention=0 for immediate deletion behavior Made-with: Cursor
Move commitQC DeleteBefore inside the anchor-persist block and derive the truncation index directly from anchor.CommitQC rather than from the in-memory queue's first index. This makes the safety invariant explicit: we only truncate WAL entries that the on-disk anchor covers. Remove the now-unused commitQCFirst field from persistBatch. Made-with: Cursor
Annotate all guard sites where persistence is disabled (dir/iw is None) with inline comments so the no-op behavior is immediately obvious. Made-with: Cursor
| // defaultStaleRetention is how long a lane WAL is kept after its last write | ||
| // before being deleted when the lane is no longer in the committee. This gives | ||
| // catching-up peers time to fetch blocks from the stale lane. | ||
| const defaultStaleRetention = 30 * time.Minute |
There was a problem hiding this comment.
that should not be an issue, given that we persist data until it is executed anyway.
There was a problem hiding this comment.
Let's talk about lane retention in tomorrow's syncup.
| return &BlockPersister{lanes: map[types.LaneID]*laneWAL{}, staleRetention: defaultStaleRetention}, nil, nil | ||
| } | ||
| dir := filepath.Join(sd, "blocks") | ||
| dir := filepath.Join(sd, blocksDir) |
There was a problem hiding this comment.
nit: "Dir" suffix seems redundant as a part of a filepath, no?
There was a problem hiding this comment.
It is, but naming the constant blocks sounds a bit weird, since it seems to imply these are the blocks when it's only a path. Any better suggestions?
| entries, err := os.ReadDir(bp.dir) | ||
| if err != nil { | ||
| return fmt.Errorf("list blocks dir for cleanup: %w", err) | ||
| if len(laneFirsts) == 0 { |
There was a problem hiding this comment.
nit: is this the right layer to check such invariants?
There was a problem hiding this comment.
That's fair, removed.
| if ok && fileN >= first { | ||
| if first >= lw.nextBlockNum { | ||
| // Anchor advanced past all persisted blocks for this lane. | ||
| if err := lw.TruncateAll(); err != nil { |
There was a problem hiding this comment.
is truncation synchronous? How expensive it is?
There was a problem hiding this comment.
TruncateAll is synchronous, it removes the segment files and then change some internal pointers, not too expensive. I don't expect this to happen very often though. In practice if every validator keeps emitting blocks, there should always be 1 or 2 lane blocks which are generated but not in AppQC yet. Unless they do "generate a block then wait for a while, generate another block then wait for a while", are you imagining that as an attack?
| "filenameLane", lane, | ||
| slog.Uint64("filenameNum", uint64(n)), | ||
| ) | ||
| if time.Since(lw.LastWriteTime()) < bp.staleRetention { |
There was a problem hiding this comment.
why have you switched from the strict availability guarantees to time-based ones?
There was a problem hiding this comment.
This time-based approach is only for how long we retain a lane not in current committee any more. I originally delete the stale lane immediately, but then freaked out thinking some block in flight might still include lane blocks in the stale lanes, so I changed this to wait for 30 minutes before deletion. We can delete this for now and wait for the Epoch implementation.
For things other than stale lanes, truncation is not time-based.
| @@ -673,32 +704,23 @@ func (s *State) runPersist(ctx context.Context, pers persisters) error { | |||
| s.markCommitQCsPersisted(batch.commitQCs[len(batch.commitQCs)-1]) | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
I think it is time to actually do those writes concurrently - i.e. persistence of commitQCs and each lane are independent and will be done faster in parallel. We need to minimize the critical path of sending the votes.
Make Close() internal (close()) since it's only called within the persist package — by tests and constructors for error cleanup. Add no-op comments, TODO for metrics, fix truncation comment, and correct close() godoc. Made-with: Cursor
aeeddbf to
8de2a83
Compare
Group blocks by lane in runPersist step 4 and write each lane's blocks in a separate errgroup goroutine. Each goroutine calls markBlockPersisted when its lane finishes so voting unblocks per-lane without waiting for all lanes. MaybeCreateLane pre-creates lane WALs sequentially before launching goroutines so bp.lanes is read-only during the concurrent phase — no mutex needed. PersistBlock now requires the lane to exist (returns an error otherwise) to prevent silent data races from lazy map writes. Made-with: Cursor
Move CommitQC writes into the same errgroup as per-lane block writes so they run in parallel. CommitQCs go in one goroutine; blocks fan out one goroutine per lane. Each goroutine publishes its result as soon as it finishes. No ordering dependency between the two — they write to independent WALs and update independent inner fields. Made-with: Cursor
Made-with: Cursor
Summary
Replace file-per-block and file-per-commitqc persistence in
blocks.goandcommitqcs.gowithsei-db/wal, and extract common WAL mechanics into a genericindexedWAL[T].wal.go(new): genericindexedWAL[T]withcodec[T]interface, providingWrite,ReadAll,TruncateBefore(with verify callback),TruncateAll,Count,LastWriteTime, andClose. Handles monotonic index tracking, typed serialization, and full WAL lifecycle. Opens the underlying WAL withAllowEmpty: trueso an empty log is valid; emptiness is defined byfirstIdx == nextIdx(no sentinel).TruncateAlldelegates towal.TruncateAll()(a synchronous reset that removes all segment files) and advancesfirstIdxtonextIdxsoCount() == 0while preserving the index counter. Fsync is enabled for additional durability; the prune anchor (persisted via A/B files with fsync) is the crash-recovery watermark.ReadAllincludes a post-replay count check to detect silent data loss.lastWriteTimeis tracked per WAL (initialized totime.Now()on open, updated on everyWrite()), used byBlockPersisterfor stale lane retention.blocks.go: one WAL per lane inblocks/<hex_lane_id>/subdirectories, with independent per-lane truncation, andTruncateAll()when the prune anchor advances past all persisted blocks.MaybeCreateLanepre-creates lane WALs sequentially before concurrent writes;PersistBlockrequires the lane to exist (returns error otherwise) to prevent silent data races from lazy map writes.PersistBlockenforces strict contiguous block numbers;DeleteBeforeverifies block numbers before truncating via a defense-in-depth callback. Stale lanes (not in current committee) are retained fordefaultStaleRetention(30 minutes) after their last write before deletion, giving catching-up peers time to fetch blocks.loadAllchecks for gaps at replay time.close()(unexported — only used by tests and constructor error cleanup) useserrors.Jointo ensure all lane WALs are closed even if one fails.commitqcs.go: single WAL incommitqcs/, linear RoadIndex-to-WAL-index mapping.PersistCommitQCsilently ignores duplicates (idx < next) for idempotent startup, rejects gaps (idx > next).DeleteBeforeadvances the write cursor and truncates the WAL viaTruncateAllwhen the anchor advances past all entries — cursor advancement happens before the count-zero check so it works correctly even after a crash betweenTruncateAlland the first new write.loadAllchecks for gaps at replay time.Close()is exported (used byavail/state_test.go).state.go: startup prunes stale WAL entries viaDeleteBefore, then re-persists in-memory CommitQCs (no-op in normal case; writes anchor's QC after a WALTruncateAll). RuntimerunPersistreordered: anchor + commitQC prune → block prune → concurrent writes viaerrgroup. CommitQCs and blocks are persisted concurrently: one goroutine for CommitQCs, one goroutine per lane for blocks (sequential within each lane). Each goroutine publishes its result (markCommitQCsPersisted/markBlockPersisted) as soon as it finishes. CommitQC WAL truncation is co-located with the anchor persist step so the truncation point is derived directly from the on-disk anchor, making the safety invariant explicit: we only truncate entries the anchor covers. Block pruning runs every cycle for stale lane retention timeout evaluation.ResetNextmethod.Concurrency design (no mutex)
runPersistgroups blocks by lane, callsMaybeCreateLanesequentially to pre-create any missing lane WALs, then launches anerrgroupwith one goroutine for CommitQCs and one goroutine per block lane. During the concurrent phasebp.lanesis read-only (no map writes), so nosync.Mutexis needed. Each*laneWALis exclusively owned by one goroutine.g.Wait()acts as a barrier before the nextcollectPersistBatchiteration, ensuring no two batches write to the same lane concurrently.Test plan
-race(avail, consensus, data, types)wal_test.go: empty start, write+read, reopen with data, reopen after truncate, truncate all but last, verify callback (accept + reject), write after truncate, TruncateAll, stale nextIdx detection, LastWriteTime (set on open, advances on write, non-zero on reopen with data)blocks_test.go: empty dir, persist+load, multiple lanes, delete-before (single/multi/empty lane/restart), noop, delete-then-persist, delete-past-all (TruncateAll), stale lane removal (with retention=0), stale lane retention (recent write keeps lane alive), empty stale lane retained at startup, empty WAL survives reopen, MaybeCreateLane (creates on first call, idempotent, PersistBlock errors without it), skip non-hex/invalid-key dirs, out-of-sequence rejection, gap detection at load timecommitqcs_test.go: empty dir, persist+load, delete-before, duplicate is no-op, gap rejected, noop, delete-then-persist, delete-past-all (TruncateAll + cursor advance), crash-recovery (TruncateAll then crash before write, restart with empty WAL), gap detection at load timestate_test.go: anchor past all persisted commitQCs truncates WAL and re-persists anchor's QC (integration test for "long offline" scenario), loads persisted blocks, loads persisted AppQC and blocks together