Skip to content

[codex] Harden multiversion iterator validation#3656

Merged
masih merged 2 commits into
mainfrom
codex/harden-multiversion-iterator-validation
Jun 30, 2026
Merged

[codex] Harden multiversion iterator validation#3656
masih merged 2 commits into
mainfrom
codex/harden-multiversion-iterator-validation

Conversation

@codchen

@codchen codchen commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add a regression test for stale validation iterator keys after a lower-index transaction is re-executed with a smaller writeset.
  • Make validation iterators ignore removed multiversion entries without treating real deletes as removed keys.
  • Harden iterator validation so recovered panics become validation failures with diagnostic logging, and abort notification cannot leave an ownerless or blocked validation goroutine.

Root Cause

validationIterator.Value() assumed every captured key still had a multiversion value. When a lower-index transaction was re-executed and stopped writing one of those keys, the stale key could remain in the validation iterator's key list while GetLatestBeforeIndex returned nil, leading to a nil pointer dereference.

Validation

  • go test ./sei-cosmos/store/multiversion ./sei-cosmos/tasks ./giga/deps/tasks -count=1
  • git diff --check

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 30, 2026, 9:42 AM

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.50575% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.22%. Comparing base (509eb34) to head (f05b87e).

Files with missing lines Patch % Lines
sei-cosmos/store/multiversion/store.go 86.66% 4 Missing and 4 partials ⚠️
sei-cosmos/store/multiversion/memiterator.go 92.59% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3656      +/-   ##
==========================================
- Coverage   59.17%   58.22%   -0.96%     
==========================================
  Files        2262     2178      -84     
  Lines      187055   177492    -9563     
==========================================
- Hits       110688   103343    -7345     
+ Misses      66429    64998    -1431     
+ Partials     9938     9151     -787     
Flag Coverage Δ
sei-chain-pr 90.73% <88.50%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-cosmos/store/multiversion/memiterator.go 86.48% <92.59%> (+2.81%) ⬆️
sei-cosmos/store/multiversion/store.go 89.96% <86.66%> (-2.01%) ⬇️

... and 84 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codchen codchen marked this pull request as ready for review June 29, 2026 04:21
@cursor

cursor Bot commented Jun 29, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Touches OCC multiversion iterator validation on the transaction commit path; incorrect skipping or abort handling could mask real conflicts or change validation outcomes.

Overview
Fixes multiversion iterator validation when a lower-index transaction is re-executed with a smaller writeset: validation iterators now skip keys that no longer have a multiversion entry (while still honoring real deletes), and Value() guards against nil lookups instead of dereferencing.

Iterator validation is refactored so replay runs in validateIteratorReplay, aborts are polled non-blockingly during the merge walk (including WriteAbort with a non-blocking send for estimates), and the validation goroutine recovers panics, logs diagnostics, and reports failure instead of crashing the process.

Adds regression tests for stale captured keys mid-validation, panic recovery, and multiple estimate aborts completing without hanging.

Reviewed by Cursor Bugbot for commit c1cdcd7. Bugbot is set up for automated code reviews on this repo. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A conservative, well-tested fix that prevents a nil-pointer dereference in validation-iterator value resolution and contains iterator panics as validation failures. Logic is correct and safe (skipping vanished keys only ever pushes toward re-execution, never a false pass); only minor non-blocking simplification notes.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • The per-iterator validation goroutine no longer serves its original purpose. It previously ran concurrently so the main goroutine could listen for aborts on a channel; now aborts are polled inline via iteratorValidationAborted and the main goroutine just blocks on <-validChannel. Since recover() works within the same goroutine, panic containment does not require the goroutine either. Consider running validateIteratorReplay inline with a deferred recover to avoid a goroutine spawn per iteration tracker on this hot validation path (validateIterator is called once per tracker in checkIteratorAtIndex).
  • Codex and Cursor second-opinion passes both produced no material findings (codex-review.md notes it could not run the package tests due to blocked network access for the Go 1.25.6 toolchain; cursor-review.md was empty).
  • Confirm the new tests run under CI's race detector — they spin up goroutines coordinated via channels and a 5s timeout; the blocking/panic mock stores share state behind mutexes which looks correct, but worth ensuring go test -race passes for this package in CI.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

return false
}
// return whether we found the exact number of expected keys
return foundKeys >= len(expectedKeys)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] iteratorValidationAborted(abortChan) is polled five times per loop iteration (lines 329, 335, 343, 361) plus once after the loop. This is harmless since it's a non-blocking select, but the repetition is heavy — a single check at the top of each iteration plus the post-loop check would be equivalent given nothing inside the loop body produces an abort (aborts originate from validationIterator.Value() invoked indirectly via mergeIterator.Valid()/Key()). Minor cleanup only.

}

// try to get value from the writeset, otherwise try to get from multiversion store, otherwise try to get from parent iterator
func (vi *validationIterator) Value() []byte {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Value() deliberately reads vi.Iterator.Key() (the raw underlying key) rather than calling the overridden Key() which would run skipRemovedKeys. This is correct in practice because the merge iterator always positions via Valid()/Key() before calling Value(), and when it isn't positioned, GetLatestBeforeIndex returning nil now yields nil safely. A one-line comment noting this invariant (Value assumes the iterator is already positioned on a present key) would help future readers, since the asymmetry between Value() and the other overrides is subtle.

@masih masih added the backport release/v6.6 Backport to release v6.6 label Jun 29, 2026
@masih masih requested a review from sei-will June 29, 2026 10:37
@codchen codchen force-pushed the codex/harden-multiversion-iterator-validation branch from 04a43a7 to c1cdcd7 Compare June 30, 2026 03:01

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR hardens multiversion iterator validation: it fixes a real nil-pointer dereference in validationIterator.Value() when a re-executed lower-index tx drops a captured key, converts recovered panics into validation failures, and prevents an ownerless/blocking abort send. The change is a clear net improvement with good test coverage; one residual TOCTOU noted by Codex is worth acknowledging but is inherent to concurrent validation and not blocking.

Findings: 0 blocking | 6 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Cursor's second-opinion review (cursor-review.md) was empty — no Cursor findings were available to incorporate.
  • Codex's residual concern (see inline comment) is a genuine TOCTOU but is inherent to validating against a concurrently-mutated MV store, and the readCache only provides consistency after a value is materialized — skipRemovedKeys/hasCurrentValue do not populate it. Consider either populating readCache from hasCurrentValue so a skip decision and the subsequent Value() agree, or documenting that the merge-iterator's nil-as-delete behavior is acceptable here because it errs toward validation failure (conservative).
  • validateIteratorReplay calls iteratorValidationAborted at four points but the abort channel can only be written from within mergeIterator.Valid()/Key() (via vi.Value()); the redundant pre/post checks are harmless but add noise — a brief comment explaining that aborts are produced as a side effect of advancing the merge iterator would aid future readers.
  • The panic-recovery path logs the full stack and raw start/end keys at Error level; if these key ranges can contain sensitive/high-cardinality data, consider hex-encoding or truncating to keep logs bounded.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

@@ -110,10 +149,13 @@ func (vi *validationIterator) Value() []byte {

// get the value from the multiversion store
val := vi.mvStore.GetLatestBeforeIndex(vi.index, key)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] TOCTOU noted by Codex: skipRemovedKeys()/hasCurrentValue() (called from the overridden Valid()/Key()) decide this key still has a current value via GetLatestBeforeIndex, but Value() re-reads GetLatestBeforeIndex here. If a lower-index transaction removes the MV entry between those two reads, this returns nil, which mvsMergeIterator.skipUntilExistsOrInvalid treats as a real delete and skips the key. Because hasCurrentValue does not populate readCache, there is no consistency guarantee for the first Value() of a key. In practice this generally biases toward fewer foundKeys → validation failure (conservative), and the underlying MV store is concurrently mutable regardless, so this is not blocking — but consider caching the hasCurrentValue determination (or the materialized value) so the skip decision and Value() cannot disagree.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codchen this is worth fixing in a follow up PR you think?


func (vi *validationIterator) WriteAbort(abort occtypes.Abort) {
select {
case vi.abortChannel <- abort:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Good fix: the non-blocking WriteAbort() prevents the validation goroutine from blocking when multiple estimates are encountered against the size-1 buffered abort channel (previously a direct vi.abortChannel <- ... could block once a prior abort was buffered). The added regression test TestMVSIteratorValidationMultipleEstimatesDoNotBlock covers this.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-scoped, well-tested hardening of the multiversion iterator validation: it fixes a nil-pointer dereference when a lower-index tx rewrites with a smaller writeset, prevents estimate aborts from blocking the validation goroutine, and contains iterator panics as validation failures. No correctness or security issues found; only minor non-blocking notes.

Findings: 0 blocking | 4 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • The repository's REVIEW_GUIDELINES.md (from the base branch) is empty, so no repo-specific standards could be applied.
  • Second-opinion passes produced no actionable findings: codex-review.md reports "No material findings" but notes it could not run the test suite because the Go 1.25.6 toolchain download was blocked by the sandbox; cursor-review.md is empty.
  • Minor: validateIteratorReplay polls iteratorValidationAborted four times per loop iteration. This is correct (aborts are written synchronously within the same goroutine via mergeIterator.Key()/Valid() → cache.Value()), but the repetition is somewhat verbose; a single check after each merge-iterator access would be equivalent.
  • Minor: validationIterator.Value() now reads vi.Iterator.Key() directly without re-running skipRemovedKeys. This is safe in practice because the merge iterator always calls Valid()/Key() (which skip) before Value(), and the new nil-guard plus goroutine panic-recovery would absorb any edge case — but it relies on that call ordering rather than being self-guarding.

@masih

masih commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator

Proceeding to merge this. We can address any comments retroactively since none are blockers 🙌

@masih masih added this pull request to the merge queue Jun 30, 2026
Merged via the queue into main with commit cbb11cc Jun 30, 2026
68 checks passed
@masih masih deleted the codex/harden-multiversion-iterator-validation branch June 30, 2026 16:18
@seidroid

seidroid Bot commented Jun 30, 2026

Copy link
Copy Markdown

Successfully created backport PR for release/v6.6:

yzang2019 added a commit that referenced this pull request Jun 30, 2026
* main:
  feat(seid): ConfigManager selection seam (PLT-775 PR1) (#3671)
  fix(evmrpc): limit listener max open connections, configurable via max_open_connections (PLT-704) (#3637)
  LittDB: Keymap threading improvements (#3645)
  integrate hashlogger (#3647)
  fix(metrics): Prometheus metrics output (#3640)
  [codex] Harden multiversion iterator validation (#3656)
  feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes (#3663)
  chore: replace OLD red SeiLogo banner in README with new 2026 Sei lockup (#3670)
  Require absolute path for evmone lib (#3668)
  fix(evmrpc): apply getLogs maxLog cap during merge instead of after (PLT-687) (#3666)
  feat(evmrpc): pre-decode request size admission control (PLT-295) (#3648)
  Make autobahn block production check wait for progress (#3667)
  fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707) (#3664)
  Per-block littidx flush + single shard (gated on #3645) (#3660)
  fix(evmrpc): bound debug_traceStateAccess memory and add trace admission control (PLT-360) (#3653)
  [codex] bump go-ethereum to v1.15.7-sei-17 (#3657)
  Upodate checkout GHA step across all workflows (#3659)
  Add GoReleaser release pipeline for static seid binaries (#3425)
  Parallelize littidx eth_getLogs across blocks (#3652)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants