
persist: replace file-per-entry with WAL and refactor into generic indexedWAL #3044

Open
wen-coding wants to merge 37 commits into main from wen/use_wal_for_persistence

Conversation

wen-coding (Contributor) commented Mar 9, 2026

Summary

Replace file-per-block and file-per-commitqc persistence in blocks.go and commitqcs.go with sei-db/wal, and extract common WAL mechanics into a generic indexedWAL[T].

  • wal.go (new): generic indexedWAL[T] with codec[T] interface, providing Write, ReadAll, TruncateBefore (with verify callback), TruncateAll, Count, LastWriteTime, and Close. Handles monotonic index tracking, typed serialization, and full WAL lifecycle. Opens the underlying WAL with AllowEmpty: true so an empty log is valid; emptiness is defined by firstIdx == nextIdx (no sentinel). TruncateAll delegates to wal.TruncateAll() (a synchronous reset that removes all segment files) and advances firstIdx to nextIdx so Count() == 0 while preserving the index counter. Fsync is enabled for additional durability; the prune anchor (persisted via A/B files with fsync) is the crash-recovery watermark. ReadAll includes a post-replay count check to detect silent data loss. lastWriteTime is tracked per WAL (initialized to time.Now() on open, updated on every Write()), used by BlockPersister for stale lane retention.
  • blocks.go: one WAL per lane in blocks/<hex_lane_id>/ subdirectories, with independent per-lane truncation, and TruncateAll() when the prune anchor advances past all persisted blocks. MaybeCreateLane pre-creates lane WALs sequentially before concurrent writes; PersistBlock requires the lane to exist (returns error otherwise) to prevent silent data races from lazy map writes. PersistBlock enforces strict contiguous block numbers; DeleteBefore verifies block numbers before truncating via a defense-in-depth callback. Stale lanes (not in current committee) are retained for defaultStaleRetention (30 minutes) after their last write before deletion, giving catching-up peers time to fetch blocks. loadAll checks for gaps at replay time. close() (unexported — only used by tests and constructor error cleanup) uses errors.Join to ensure all lane WALs are closed even if one fails.
  • commitqcs.go: single WAL in commitqcs/, linear RoadIndex-to-WAL-index mapping. PersistCommitQC silently ignores duplicates (idx < next) for idempotent startup, rejects gaps (idx > next). DeleteBefore advances the write cursor and truncates the WAL via TruncateAll when the anchor advances past all entries — cursor advancement happens before the count-zero check so it works correctly even after a crash between TruncateAll and the first new write. loadAll checks for gaps at replay time. Close() is exported (used by avail/state_test.go).
  • state.go: startup prunes stale WAL entries via DeleteBefore, then re-persists in-memory CommitQCs (no-op in normal case; writes anchor's QC after a WAL TruncateAll). Runtime runPersist reordered: anchor + commitQC prune → block prune → concurrent writes via errgroup. CommitQCs and blocks are persisted concurrently: one goroutine for CommitQCs, one goroutine per lane for blocks (sequential within each lane). Each goroutine publishes its result (markCommitQCsPersisted / markBlockPersisted) as soon as it finishes. CommitQC WAL truncation is co-located with the anchor persist step so the truncation point is derived directly from the on-disk anchor, making the safety invariant explicit: we only truncate entries the anchor covers. Block pruning runs every cycle for stale lane retention timeout evaluation.
  • Removed all file-per-entry code and the ResetNext method.
  • No-op persister guards annotated with inline comments throughout.
  • TODO: add Prometheus metrics for blocks/commitQCs written, truncated, and stale lanes removed.
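The index bookkeeping described above (emptiness defined as firstIdx == nextIdx, TruncateAll preserving the index counter) can be sketched with a toy in-memory stand-in. The names and the map-backed storage here are illustrative only, not the PR's actual sei-db/wal-backed implementation:

```go
package main

import (
	"errors"
	"fmt"
)

// toyWAL mimics the index arithmetic of the PR's indexedWAL[T]:
// entries live at indices [firstIdx, nextIdx); empty when equal.
type toyWAL[T any] struct {
	firstIdx, nextIdx uint64
	entries           map[uint64]T
}

func newToyWAL[T any]() *toyWAL[T] {
	return &toyWAL[T]{entries: map[uint64]T{}}
}

func (w *toyWAL[T]) Count() uint64 { return w.nextIdx - w.firstIdx }

func (w *toyWAL[T]) Write(e T) {
	w.entries[w.nextIdx] = e
	w.nextIdx++
}

// TruncateBefore drops entries below idx, keeping [idx, nextIdx).
func (w *toyWAL[T]) TruncateBefore(idx uint64) error {
	if idx < w.firstIdx || idx > w.nextIdx {
		return errors.New("truncate index out of range")
	}
	for i := w.firstIdx; i < idx; i++ {
		delete(w.entries, i)
	}
	w.firstIdx = idx
	return nil
}

// TruncateAll empties the log but preserves the index counter, so
// Count() == 0 while the next Write still gets a monotonic index.
func (w *toyWAL[T]) TruncateAll() {
	for i := w.firstIdx; i < w.nextIdx; i++ {
		delete(w.entries, i)
	}
	w.firstIdx = w.nextIdx
}

func main() {
	w := newToyWAL[string]()
	w.Write("a")
	w.Write("b")
	w.Write("c")
	_ = w.TruncateBefore(1) // drop "a"
	fmt.Println(w.Count())  // 2
	w.TruncateAll()
	fmt.Println(w.Count(), w.nextIdx) // 0 3
}
```

Note how TruncateAll keeps nextIdx at 3, so a later write lands at index 3 rather than restarting at 0; this is the "preserving the index counter" behavior the summary describes.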

Concurrency design (no mutex)

runPersist groups blocks by lane, calls MaybeCreateLane sequentially to pre-create any missing lane WALs, then launches an errgroup with one goroutine for CommitQCs and one goroutine per block lane. During the concurrent phase bp.lanes is read-only (no map writes), so no sync.Mutex is needed. Each *laneWAL is exclusively owned by one goroutine. g.Wait() acts as a barrier before the next collectPersistBatch iteration, ensuring no two batches write to the same lane concurrently.
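The fan-out can be sketched as follows. A dependency-free sync.WaitGroup stands in for the errgroup the PR uses, and the lane names and payload types are illustrative:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// persistBatch fans out one goroutine for commitQCs plus one per lane,
// mirroring runPersist's concurrent phase. The lanes map is read-only
// here (lanes were pre-created sequentially), so no mutex guards it;
// each lane's blocks are written sequentially by exactly one goroutine.
func persistBatch(lanes map[string][]int, commitQCs []int) []string {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var done []string // records which unit finished (completion order is nondeterministic)

	wg.Add(1)
	go func() { // one goroutine for commitQCs
		defer wg.Done()
		_ = commitQCs // write to the commitqc WAL here
		mu.Lock()
		done = append(done, "commitqcs")
		mu.Unlock()
	}()
	for lane, blocks := range lanes {
		wg.Add(1)
		go func(lane string, blocks []int) { // one goroutine per lane
			defer wg.Done()
			_ = blocks // write this lane's blocks in order here
			mu.Lock()
			done = append(done, lane)
			mu.Unlock()
		}(lane, blocks)
	}
	wg.Wait() // barrier: no two batches ever write the same lane concurrently
	sort.Strings(done)
	return done
}

func main() {
	lanes := map[string][]int{"laneA": {1, 2}, "laneB": {3}}
	fmt.Println(persistBatch(lanes, []int{7})) // [commitqcs laneA laneB]
}
```

The Wait call plays the role of g.Wait() in the PR: it is the barrier before the next collectPersistBatch iteration.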

Test plan

  • All autobahn tests pass with -race (avail, consensus, data, types)
  • golangci-lint: 0 issues
  • wal_test.go: empty start, write+read, reopen with data, reopen after truncate, truncate all but last, verify callback (accept + reject), write after truncate, TruncateAll, stale nextIdx detection, LastWriteTime (set on open, advances on write, non-zero on reopen with data)
  • blocks_test.go: empty dir, persist+load, multiple lanes, delete-before (single/multi/empty lane/restart), noop, delete-then-persist, delete-past-all (TruncateAll), stale lane removal (with retention=0), stale lane retention (recent write keeps lane alive), empty stale lane retained at startup, empty WAL survives reopen, MaybeCreateLane (creates on first call, idempotent, PersistBlock errors without it), skip non-hex/invalid-key dirs, out-of-sequence rejection, gap detection at load time
  • commitqcs_test.go: empty dir, persist+load, delete-before, duplicate is no-op, gap rejected, noop, delete-then-persist, delete-past-all (TruncateAll + cursor advance), crash-recovery (TruncateAll then crash before write, restart with empty WAL), gap detection at load time
  • state_test.go: anchor past all persisted commitQCs truncates WAL and re-persists anchor's QC (integration test for "long offline" scenario), loads persisted blocks, loads persisted AppQC and blocks together


github-actions bot commented Mar 9, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build: ✅ passed | Format: ✅ passed | Lint: ✅ passed | Breaking: ✅ passed | Updated (UTC): Mar 16, 2026, 7:25 PM

@wen-coding wen-coding changed the title persist: refactor block/commitqc WAL into generic indexedWAL persist: replace file-per-entry with WAL and refactor into generic indexedWAL Mar 9, 2026

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 74.80315% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.36%. Comparing base (904de28) to head (3a145a7).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...dermint/internal/autobahn/consensus/persist/wal.go 67.21% 11 Missing and 9 partials ⚠️
sei-tendermint/internal/autobahn/avail/state.go 50.00% 10 Missing and 7 partials ⚠️
...mint/internal/autobahn/consensus/persist/blocks.go 84.95% 11 Missing and 6 partials ⚠️
...t/internal/autobahn/consensus/persist/commitqcs.go 78.26% 5 Missing and 5 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3044      +/-   ##
==========================================
+ Coverage   58.27%   58.36%   +0.08%     
==========================================
  Files        2076     2080       +4     
  Lines      171480   170889     -591     
==========================================
- Hits        99933    99735     -198     
+ Misses      62619    62232     -387     
+ Partials     8928     8922       -6     
Flag Coverage Δ
sei-chain-pr 78.83% <74.80%> (?)
sei-db 70.41% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/internal/autobahn/avail/inner.go 97.36% <ø> (ø)
...t/internal/autobahn/consensus/persist/commitqcs.go 79.41% <78.26%> (+0.68%) ⬆️
sei-tendermint/internal/autobahn/avail/state.go 75.00% <50.00%> (-0.69%) ⬇️
...mint/internal/autobahn/consensus/persist/blocks.go 82.17% <84.95%> (+5.02%) ⬆️
...dermint/internal/autobahn/consensus/persist/wal.go 67.21% <67.21%> (ø)

... and 257 files with indirect coverage changes


Comment on lines +155 to +168
for lane, first := range laneFirsts {
	lw, ok := bp.lanes[lane]
	if !ok {
		continue // no WAL yet; PersistBlock will create one lazily
	}
	firstBN, ok := lw.firstBlockNum().Get()
	if !ok || first <= firstBN {
		continue
	}
	walIdx := lw.firstIdx + uint64(first-firstBN)
	if err := lw.TruncateBefore(walIdx); err != nil {
		return fmt.Errorf("truncate lane %s WAL before block %d: %w", lane, first, err)
	}
}


Check warning (Code scanning / CodeQL): Iteration over map

Iteration over map may be a possible source of non-determinism.
@wen-coding wen-coding requested a review from pompon0 March 10, 2026 23:52
dbwal.Config{
WriteBufferSize: 0, // synchronous writes
WriteBatchSize: 1, // no batching
FsyncEnabled: true,
yzang2019 (Contributor) commented Mar 11, 2026:

Be aware of the latency impact here: if the write is in the critical path, it will introduce some noticeable latency. I would recommend synchronous write + no fsync for perf reasons; fsync does provide stronger guarantees, but the chance of all validators hitting a power-off at the same time is pretty rare.

wen-coding (author):

That's reasonable, changed

wen-coding (author):

@yzang2019 Talked with @pompon0 about this: since the lane block QC is only f+1, theoretically one OS crash can screw us. We are changing all lane and commitQC writes to happen concurrently, so latency is less of a concern. Is it okay if I change Fsync back to true here?

// Used when all entries are stale (e.g. the prune anchor advanced past
// everything persisted).
//
// TODO: sei-db/wal doesn't expose tidwall/wal's AllowEmpty option, so there's
Reviewer (Contributor):

Is the plan to expose that in a separate PR? It should be pretty simple to add though?

wen-coding (author):

PR 3049 merged and switched to TruncateAll() here.

if err := w.wal.Write(entry); err != nil {
return err
}
if w.firstIdx == 0 {
Reviewer (Contributor):

Recommend using Count() == 0 instead of relying on firstIdx == 0 as a sentinel for "WAL is empty", in case the assumption that the WAL index starts from 0 no longer holds if we switch WAL libraries in the future.

wen-coding (author):

done

return nil, nil
}
entries := make([]T, 0, w.Count())
err := w.wal.Replay(w.firstIdx, w.nextIdx-1, func(_ uint64, entry T) error {
Reviewer (Contributor):

There's no validation that len(entries) == w.Count() after a successful replay. If Replay succeeds but returns fewer entries than expected (e.g., the underlying WAL silently truncated its tail on open due to corruption), ReadAll would return a short slice with no error.

wen-coding (author):

done
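The post-replay check the reviewer asked for can be sketched like this. The readAll helper and its callback signature are hypothetical simplifications, not the real sei-db/wal replay API:

```go
package main

import "fmt"

// readAll replays entries and then cross-checks the slice length against
// the index-derived count, so a silently truncated tail surfaces as an
// error instead of a short slice.
func readAll[T any](count uint64, replay func(func(T) error) error) ([]T, error) {
	if count == 0 {
		return nil, nil
	}
	entries := make([]T, 0, count)
	if err := replay(func(e T) error {
		entries = append(entries, e)
		return nil
	}); err != nil {
		return nil, err
	}
	if uint64(len(entries)) != count {
		return nil, fmt.Errorf("wal replay returned %d entries, expected %d", len(entries), count)
	}
	return entries, nil
}

func main() {
	// Simulate a WAL that claims 3 entries but replays only 2.
	short := func(cb func(int) error) error {
		for _, v := range []int{1, 2} {
			if err := cb(v); err != nil {
				return err
			}
		}
		return nil
	}
	_, err := readAll[int](3, short)
	fmt.Println(err) // wal replay returned 2 entries, expected 3
}
```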

return nil
}
for _, lw := range bp.lanes {
if err := lw.Close(); err != nil {
Reviewer (Contributor):

If one lane's WAL fails to close, the remaining lanes are never closed. Use errors.Join to accumulate errors and close all lanes.

wen-coding (author):

changed

Replace the file-per-block and file-per-commitqc persistence with
sei-db/wal. Blocks use one WAL per lane so that truncation is
independent (no stale-lane problem). CommitQCs use a single WAL
with a linear RoadIndex-to-WAL-index mapping.

Key changes:
- BlockPersister: per-lane WAL in blocks/<hex_lane_id>/ subdirs,
  lazy lane creation, independent per-lane TruncateBefore.
- CommitQCPersister: single WAL in commitqcs/, tracks firstWALIdx
  and nextWALIdx locally for correct truncation mapping.
- Remove all file-per-entry code: filename construction/parsing,
  directory scanning, individual file read/write/delete, corrupt
  file skipping.
- Rewrite tests for WAL semantics (append-only, truncation, replay).

Made-with: Cursor
Extract common WAL mechanics (index tracking, typed write/replay,
truncation) into a generic indexedWAL[T] backed by sei-db/wal,
replacing the duplicated raw-bytes WAL setup in both blocks.go and
commitqcs.go.

Key changes:
- Add indexedWAL[T] with codec[T] interface for typed serialization
- laneWAL embeds indexedWAL; firstBlockNum() returns Option for safety
- DeleteBefore now removes stale lane WALs (validators no longer in
  committee) and their directories
- Add empty-WAL guard to CommitQCPersister.DeleteBefore
- Add direct unit tests for indexedWAL (wal_test.go)
- Add TODO for dynamic committee membership support

Made-with: Cursor
Instead of extracting the lane ID from the first replayed entry
(and closing/skipping empty WALs), decode it from the hex directory
name. This keeps the WAL open so the lane can receive blocks without
reopening it.

Made-with: Cursor
Validate the directory name (hex decode + PublicKeyFromBytes) before
opening the WAL, avoiding a redundant hex.DecodeString call. Add tests
for both skip paths: non-hex directory name and valid hex but invalid
public key length.

Made-with: Cursor
Replace callback-based Replay on indexedWAL with ReadAll that returns
a slice. Remove the defensive sort in blocks loadAll since WAL entries
are already in append order. Fix stale Replay reference in godoc.

Made-with: Cursor
TruncateBefore now reads and verifies the entry at the target WAL index
before truncating, catching index-mapping corruption before data loss.
PersistCommitQC and PersistBlock enforce strict sequential order to
prevent gaps that would break the linear domain-to-WAL-index mapping.

Made-with: Cursor
The non-contiguous commitQC test now expects the gap to be caught
at PersistCommitQC time ("out of sequence") rather than at NewState
load time, matching the defense-in-depth guard added earlier.

Made-with: Cursor
Add contiguity checks during WAL replay to catch on-disk corruption
that bypasses write-time guards. Includes tests that write directly
to the WAL to simulate corrupted data.

Made-with: Cursor
Reduces log noise on restart with many validators and blocks.

Made-with: Cursor
BlockPersister uses Option[string] for dir, CommitQCPersister uses
Option[*indexedWAL] for iw. None = no-op mode, Some = real persistence.
The noop behavior is now structurally implied rather than a separate flag.

Made-with: Cursor
When a node restarts after being offline for a long time, the prune
anchor may have advanced past all locally persisted WAL entries.

- Add indexedWAL.Reset() to close, remove, and reopen a fresh WAL.
- DeleteBefore in both blocks.go and commitqcs.go now calls Reset()
  when the prune point is at or past the last persisted entry.
- CommitQCPersister.DeleteBefore advances the write cursor (cp.next)
  in the reset branch, making ResetNext unnecessary.
- PersistCommitQC now silently ignores duplicates (idx < next) so
  startup can idempotently re-persist in-memory entries after a reset.
- Remove ResetNext; replace call sites with a re-persist loop at
  startup and rely on DeleteBefore's cursor management at runtime.
- Reorder runPersist: prune before writes (WAL needs contiguous indices).
- Update runPersist godoc to match new step ordering.
- Add tests for Reset, DeleteBefore-past-all, duplicate no-op, and an
  integration test for NewState with anchor past all persisted QCs.

Made-with: Cursor
When the WAL was reset (anchor past all entries) and the process crashed
before writing new entries, restart would find an empty WAL with cp.next=0.
The old guard order (count==0 early return before cursor check) prevented
DeleteBefore from advancing cp.next, causing the subsequent PersistCommitQC
to fail with "out of sequence".

Fix: check idx >= cp.next before the count==0 guard so the cursor is always
advanced, even on an already-empty WAL. Add TestCommitQCDeleteBeforePastAllCrashRecovery.

Made-with: Cursor
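The guard ordering that fixes this crash-recovery case can be sketched with a toy cursorState. Names are hypothetical and the actual truncation is elided; only the ordering matters:

```go
package main

import "fmt"

// cursorState mirrors the fixed guard order in CommitQCPersister.DeleteBefore:
// advance the write cursor BEFORE the count==0 early return, so a crash between
// TruncateAll and the first new write still leaves the cursor correct on restart.
type cursorState struct {
	next  uint64 // next domain index to persist
	count uint64 // entries currently in the WAL
}

func (c *cursorState) deleteBefore(idx uint64) {
	if idx >= c.next {
		c.next = idx // cursor advances even when the WAL is already empty
	}
	if c.count == 0 {
		return // nothing to truncate
	}
	// ... TruncateBefore / TruncateAll would run here; this sketch just
	// models "anchor past all entries" by emptying the log.
	c.count = 0
}

func main() {
	// Restart after the crash: WAL empty, cursor at zero, anchor at 42.
	c := &cursorState{next: 0, count: 0}
	c.deleteBefore(42)
	fmt.Println(c.next) // 42: the subsequent PersistCommitQC(42) is in sequence
}
```

With the old ordering (count check first), next would stay 0 here and the first re-persist would be rejected as "out of sequence".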
The loop over all in-memory CommitQCs was unnecessary — entries loaded
from the WAL are guaranteed to survive DeleteBefore (it only removes
entries below the anchor). Only the anchor's CommitQC could be missing
after a WAL reset, so persist just that one entry.

Made-with: Cursor
- Disable WAL fsync; the prune anchor (A/B files with fsync) already
  provides the crash-recovery watermark.
- Use Count() == 0 instead of firstIdx == 0 for emptiness checks in
  Write and ReadAll for robustness.
- Add post-replay count check in ReadAll to detect silent data loss.
- Use errors.Join in BlockPersister.Close to close all lane WALs.
- Rename laneDir → lanePath to avoid shadowing the laneDir() function.
- Add TestIndexedWAL_ReadAllDetectsStaleNextIdx.

Made-with: Cursor
@wen-coding wen-coding force-pushed the wen/use_wal_for_persistence branch from c0727e6 to 422aa8f Compare March 11, 2026 04:29
wen-coding and others added 9 commits March 11, 2026 08:01
Now that sei-db/wal exposes AllowEmpty and TruncateAll (#3049), use
them to clear a WAL in-place instead of the heavier close → remove
directory → reopen pattern.

- Enable AllowEmpty in WAL config.
- Replace Reset() with TruncateAll() — single call, no dir removal.
- Remove dir/codec fields from indexedWAL (only needed for reopen).
- Eliminate firstIdx == 0 sentinel: Count() is now just nextIdx -
  firstIdx, empty when equal. Write() no longer needs the first-write
  bookkeeping branch.
- Update openIndexedWAL to handle AllowEmpty's empty-log reporting
  (first > last) uniformly with the non-empty case.

Made-with: Cursor
sei-db/wal.NewWAL no longer accepts a *slog.Logger parameter.

Made-with: Cursor
Instead of immediately deleting lane WALs not in the current committee,
retain them for 30 minutes (defaultStaleRetention) after the last write.
This gives catching-up peers time to fetch blocks from lanes that have
left the committee.

- Add lastWriteTime to indexedWAL, initialized to time.Now() on open
  and updated on every Write()
- Add staleRetention field to BlockPersister (default 30m)
- DeleteBefore skips stale lane deletion when lastWriteTime is recent
- Tests use staleRetention=0 for immediate deletion behavior

Made-with: Cursor
Move commitQC DeleteBefore inside the anchor-persist block and derive
the truncation index directly from anchor.CommitQC rather than from the
in-memory queue's first index. This makes the safety invariant explicit:
we only truncate WAL entries that the on-disk anchor covers.

Remove the now-unused commitQCFirst field from persistBatch.

Made-with: Cursor
Annotate all guard sites where persistence is disabled (dir/iw is None)
with inline comments so the no-op behavior is immediately obvious.

Made-with: Cursor
// defaultStaleRetention is how long a lane WAL is kept after its last write
// before being deleted when the lane is no longer in the committee. This gives
// catching-up peers time to fetch blocks from the stale lane.
const defaultStaleRetention = 30 * time.Minute
Reviewer (Contributor):

That should not be an issue, given that we persist data until it is executed anyway.

wen-coding (author):

Let's talk about lane retention in tomorrow's syncup.

return &BlockPersister{lanes: map[types.LaneID]*laneWAL{}, staleRetention: defaultStaleRetention}, nil, nil
}
-dir := filepath.Join(sd, "blocks")
+dir := filepath.Join(sd, blocksDir)
Reviewer (Contributor):

nit: "Dir" suffix seems redundant as a part of a filepath, no?

wen-coding (author):

It is, but naming the constant blocks sounds a bit weird, since it seems to imply these are the blocks when it's only a path. Any better suggestions?

-entries, err := os.ReadDir(bp.dir)
-if err != nil {
-	return fmt.Errorf("list blocks dir for cleanup: %w", err)
+if len(laneFirsts) == 0 {
Reviewer (Contributor):

nit: is this the right layer to check such invariants?

wen-coding (author):

That's fair, removed.

-if ok && fileN >= first {
+if first >= lw.nextBlockNum {
+	// Anchor advanced past all persisted blocks for this lane.
+	if err := lw.TruncateAll(); err != nil {
Reviewer (Contributor):

is truncation synchronous? How expensive it is?

wen-coding (author):

TruncateAll is synchronous: it removes the segment files and then changes some internal pointers, so it's not too expensive. I don't expect this to happen very often, though. In practice, if every validator keeps emitting blocks, there should always be 1 or 2 lane blocks that are generated but not in an AppQC yet. Unless they do "generate a block then wait for a while, generate another block then wait for a while", are you imagining that as an attack?

-	"filenameLane", lane,
-	slog.Uint64("filenameNum", uint64(n)),
-)
+if time.Since(lw.LastWriteTime()) < bp.staleRetention {
Reviewer (Contributor):

why have you switched from the strict availability guarantees to time-based ones?

wen-coding (author):

This time-based approach only governs how long we retain a lane that is no longer in the current committee. I originally deleted stale lanes immediately, but then freaked out thinking some block in flight might still include lane blocks from the stale lanes, so I changed this to wait 30 minutes before deletion. We can delete this for now and wait for the Epoch implementation.

For things other than stale lanes, truncation is not time-based.

@@ -673,32 +704,23 @@ func (s *State) runPersist(ctx context.Context, pers persisters) error {
s.markCommitQCsPersisted(batch.commitQCs[len(batch.commitQCs)-1])
}

Reviewer (Contributor):

I think it is time to actually do those writes concurrently - i.e. persistence of commitQCs and each lane are independent and will be done faster in parallel. We need to minimize the critical path of sending the votes.

Make Close() internal (close()) since it's only called within the persist
package — by tests and constructors for error cleanup. Add no-op comments,
TODO for metrics, fix truncation comment, and correct close() godoc.

Made-with: Cursor
@wen-coding wen-coding force-pushed the wen/use_wal_for_persistence branch from aeeddbf to 8de2a83 Compare March 16, 2026 17:04
Group blocks by lane in runPersist step 4 and write each lane's
blocks in a separate errgroup goroutine. Each goroutine calls
markBlockPersisted when its lane finishes so voting unblocks per-lane
without waiting for all lanes.

MaybeCreateLane pre-creates lane WALs sequentially before launching
goroutines so bp.lanes is read-only during the concurrent phase — no
mutex needed. PersistBlock now requires the lane to exist (returns an
error otherwise) to prevent silent data races from lazy map writes.

Made-with: Cursor
Move CommitQC writes into the same errgroup as per-lane block writes
so they run in parallel. CommitQCs go in one goroutine; blocks fan out
one goroutine per lane. Each goroutine publishes its result as soon as
it finishes. No ordering dependency between the two — they write to
independent WALs and update independent inner fields.

Made-with: Cursor