[codex] Harden multiversion iterator validation#3656
Conversation
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3656 +/- ##
==========================================
- Coverage 59.17% 58.22% -0.96%
==========================================
Files 2262 2178 -84
Lines 187055 177492 -9563
==========================================
- Hits 110688 103343 -7345
+ Misses 66429 64998 -1431
+ Partials 9938 9151 -787
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
PR SummaryHigh Risk Overview Iterator validation is refactored so replay runs in Adds regression tests for stale captured keys mid-validation, panic recovery, and multiple estimate aborts completing without hanging. Reviewed by Cursor Bugbot for commit c1cdcd7. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
A conservative, well-tested fix that prevents a nil-pointer dereference in validation-iterator value resolution and contains iterator panics as validation failures. Logic is correct and safe (skipping vanished keys only ever pushes toward re-execution, never a false pass); only minor non-blocking simplification notes.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- The per-iterator validation goroutine no longer serves its original purpose. It previously ran concurrently so the main goroutine could listen for aborts on a channel; now aborts are polled inline via iteratorValidationAborted and the main goroutine just blocks on
<-validChannel. Sincerecover()works within the same goroutine, panic containment does not require the goroutine either. Consider running validateIteratorReplay inline with a deferred recover to avoid a goroutine spawn per iteration tracker on this hot validation path (validateIterator is called once per tracker in checkIteratorAtIndex). - Codex and Cursor second-opinion passes both produced no material findings (codex-review.md notes it could not run the package tests due to blocked network access for the Go 1.25.6 toolchain; cursor-review.md was empty).
- Confirm the new tests run under CI's race detector — they spin up goroutines coordinated via channels and a 5s timeout; the blocking/panic mock stores share state behind mutexes which looks correct, but worth ensuring
go test -racepasses for this package in CI. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
| return false | ||
| } | ||
| // return whether we found the exact number of expected keys | ||
| return foundKeys >= len(expectedKeys) |
There was a problem hiding this comment.
[nit] iteratorValidationAborted(abortChan) is polled five times per loop iteration (lines 329, 335, 343, 361) plus once after the loop. This is harmless since it's a non-blocking select, but the repetition is heavy — a single check at the top of each iteration plus the post-loop check would be equivalent given nothing inside the loop body produces an abort (aborts originate from validationIterator.Value() invoked indirectly via mergeIterator.Valid()/Key()). Minor cleanup only.
| } | ||
|
|
||
| // try to get value from the writeset, otherwise try to get from multiversion store, otherwise try to get from parent iterator | ||
| func (vi *validationIterator) Value() []byte { |
There was a problem hiding this comment.
[nit] Value() deliberately reads vi.Iterator.Key() (the raw underlying key) rather than calling the overridden Key() which would run skipRemovedKeys. This is correct in practice because the merge iterator always positions via Valid()/Key() before calling Value(), and when it isn't positioned, GetLatestBeforeIndex returning nil now yields nil safely. A one-line comment noting this invariant (Value assumes the iterator is already positioned on a present key) would help future readers, since the asymmetry between Value() and the other overrides is subtle.
04a43a7 to
c1cdcd7
Compare
There was a problem hiding this comment.
This PR hardens multiversion iterator validation: it fixes a real nil-pointer dereference in validationIterator.Value() when a re-executed lower-index tx drops a captured key, converts recovered panics into validation failures, and prevents an ownerless/blocking abort send. The change is a clear net improvement with good test coverage; one residual TOCTOU noted by Codex is worth acknowledging but is inherent to concurrent validation and not blocking.
Findings: 0 blocking | 6 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor's second-opinion review (cursor-review.md) was empty — no Cursor findings were available to incorporate.
- Codex's residual concern (see inline comment) is a genuine TOCTOU but is inherent to validating against a concurrently-mutated MV store, and the readCache only provides consistency after a value is materialized — skipRemovedKeys/hasCurrentValue do not populate it. Consider either populating readCache from hasCurrentValue so a skip decision and the subsequent Value() agree, or documenting that the merge-iterator's nil-as-delete behavior is acceptable here because it errs toward validation failure (conservative).
- validateIteratorReplay calls iteratorValidationAborted at four points but the abort channel can only be written from within mergeIterator.Valid()/Key() (via vi.Value()); the redundant pre/post checks are harmless but add noise — a brief comment explaining that aborts are produced as a side effect of advancing the merge iterator would aid future readers.
- The panic-recovery path logs the full stack and raw start/end keys at Error level; if these key ranges can contain sensitive/high-cardinality data, consider hex-encoding or truncating to keep logs bounded.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
| @@ -110,10 +149,13 @@ func (vi *validationIterator) Value() []byte { | |||
|
|
|||
| // get the value from the multiversion store | |||
| val := vi.mvStore.GetLatestBeforeIndex(vi.index, key) | |||
There was a problem hiding this comment.
[suggestion] TOCTOU noted by Codex: skipRemovedKeys()/hasCurrentValue() (called from the overridden Valid()/Key()) decide this key still has a current value via GetLatestBeforeIndex, but Value() re-reads GetLatestBeforeIndex here. If a lower-index transaction removes the MV entry between those two reads, this returns nil, which mvsMergeIterator.skipUntilExistsOrInvalid treats as a real delete and skips the key. Because hasCurrentValue does not populate readCache, there is no consistency guarantee for the first Value() of a key. In practice this generally biases toward fewer foundKeys → validation failure (conservative), and the underlying MV store is concurrently mutable regardless, so this is not blocking — but consider caching the hasCurrentValue determination (or the materialized value) so the skip decision and Value() cannot disagree.
There was a problem hiding this comment.
@codchen this is worth fixing in a follow up PR you think?
|
|
||
| func (vi *validationIterator) WriteAbort(abort occtypes.Abort) { | ||
| select { | ||
| case vi.abortChannel <- abort: |
There was a problem hiding this comment.
[nit] Good fix: the non-blocking WriteAbort() prevents the validation goroutine from blocking when multiple estimates are encountered against the size-1 buffered abort channel (previously a direct vi.abortChannel <- ... could block once a prior abort was buffered). The added regression test TestMVSIteratorValidationMultipleEstimatesDoNotBlock covers this.
There was a problem hiding this comment.
Well-scoped, well-tested hardening of the multiversion iterator validation: it fixes a nil-pointer dereference when a lower-index tx rewrites with a smaller writeset, prevents estimate aborts from blocking the validation goroutine, and contains iterator panics as validation failures. No correctness or security issues found; only minor non-blocking notes.
Findings: 0 blocking | 4 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- The repository's REVIEW_GUIDELINES.md (from the base branch) is empty, so no repo-specific standards could be applied.
- Second-opinion passes produced no actionable findings: codex-review.md reports "No material findings" but notes it could not run the test suite because the Go 1.25.6 toolchain download was blocked by the sandbox; cursor-review.md is empty.
- Minor: validateIteratorReplay polls iteratorValidationAborted four times per loop iteration. This is correct (aborts are written synchronously within the same goroutine via mergeIterator.Key()/Valid() → cache.Value()), but the repetition is somewhat verbose; a single check after each merge-iterator access would be equivalent.
- Minor: validationIterator.Value() now reads vi.Iterator.Key() directly without re-running skipRemovedKeys. This is safe in practice because the merge iterator always calls Valid()/Key() (which skip) before Value(), and the new nil-guard plus goroutine panic-recovery would absorb any edge case — but it relies on that call ordering rather than being self-guarding.
|
Proceeding to merge this. We can address any comments retroactively since none are blockers 🙌 |
|
Successfully created backport PR for |
* main: feat(seid): ConfigManager selection seam (PLT-775 PR1) (#3671) fix(evmrpc): limit listener max open connections, configurable via max_open_connections (PLT-704) (#3637) LittDB: Keymap threading improvements (#3645) integrate hashlogger (#3647) fix(metrics): Prometheus metrics output (#3640) [codex] Harden multiversion iterator validation (#3656) feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes (#3663) chore: replace OLD red SeiLogo banner in README with new 2026 Sei lockup (#3670) Require absolute path for evmone lib (#3668) fix(evmrpc): apply getLogs maxLog cap during merge instead of after (PLT-687) (#3666) feat(evmrpc): pre-decode request size admission control (PLT-295) (#3648) Make autobahn block production check wait for progress (#3667) fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707) (#3664) Per-block littidx flush + single shard (gated on #3645) (#3660) fix(evmrpc): bound debug_traceStateAccess memory and add trace admission control (PLT-360) (#3653) [codex] bump go-ethereum to v1.15.7-sei-17 (#3657) Upodate checkout GHA step across all workflows (#3659) Add GoReleaser release pipeline for static seid binaries (#3425) Parallelize littidx eth_getLogs across blocks (#3652)
Summary
Root Cause
validationIterator.Value()assumed every captured key still had a multiversion value. When a lower-index transaction was re-executed and stopped writing one of those keys, the stale key could remain in the validation iterator's key list whileGetLatestBeforeIndexreturned nil, leading to a nil pointer dereference.Validation
go test ./sei-cosmos/store/multiversion ./sei-cosmos/tasks ./giga/deps/tasks -count=1git diff --check