Skip to content

feat(repair): reduce qm_cids HeadObject calls with bucket list index#199

Merged
raymondjacobson merged 3 commits intomainfrom
feat/qm-cids-list-index-simple
Apr 4, 2026
Merged

feat(repair): reduce qm_cids HeadObject calls with bucket list index#199
raymondjacobson merged 3 commits intomainfrom
feat/qm-cids-list-index-simple

Conversation

@raymondjacobson
Copy link
Copy Markdown
Contributor

Summary

Simplified alternative to #198. Same core optimization — replace per-key bucket.Attributes (HeadObject) calls during qm_cids cleanup with a single bucket.List that builds an in-memory presence map — but without the shadow comparison machinery, cadence controls, evidence tracker, or extra env vars.

What changed:

  • New repairPresenceIndex type: ~47 lines, just a map and a build function
  • repairCid accepts an optional presence index; when present, does a map lookup instead of HeadObject
  • Index is built once before the qm_cids loop during cleanup cycles
  • One env var: OPENAUDIO_REPAIR_QM_CIDS_USE_LIST_INDEX=true (off by default)
  • Uses existing tracker.Counters for observability (qm_cids_list_index_hit, qm_cids_list_index_miss, qm_cids_list_index_entries, qm_cids_list_index_build_fail)
  • Falls back to per-key attrs if the index build fails

Also fixes (from review of #198 and #175):

  • ValidateCID dead-code bug: if err != nil -> if errVal != nil
  • SeenKeys invalidation after invalid CID deletion
  • dropFromMyBucket error variable scoping (err = -> err :=)

What was intentionally left out vs #198:

  • Shadow comparison system (sampling, mismatch detection, auto-disable)
  • OPENAUDIO_REPAIR_CLEANUP_EVERY / OPENAUDIO_REPAIR_QM_CIDS_CLEANUP_EVERY cadence knobs
  • repairSourceEvidenceTracker (119 lines, 10 methods)
  • Whitespace realignment of MediorumConfig struct

~107 lines changed vs ~872 in #198, same HeadObject reduction.

Test plan

  • Verify build: go build ./pkg/mediorum/server/
  • Verify vet: go vet ./pkg/mediorum/server/
  • Run existing repair tests: go test ./pkg/mediorum/server/ -run TestRepair
  • Deploy to canary with OPENAUDIO_REPAIR_QM_CIDS_USE_LIST_INDEX=true, observe qm_cids_list_index_hit / qm_cids_list_index_miss counters in repair tracker
  • Deploy without env var set and verify no behavior change

🤖 Generated with Claude Code

Replace per-key bucket.Attributes (HeadObject) calls during qm_cids
cleanup with a single bucket.List that builds an in-memory presence
index. For nodes with millions of qm_cids entries, this collapses
millions of HeadObject calls into one paginated ListObjects call.

Gated behind OPENAUDIO_REPAIR_QM_CIDS_USE_LIST_INDEX=true (off by
default). Falls back gracefully to per-key attrs if the index build
fails. Uses existing tracker.Counters for observability.

Also fixes the ValidateCID dead-code bug (checked wrong error variable)
and the dropFromMyBucket error variable scoping issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@RolfAris
Copy link
Copy Markdown
Contributor

RolfAris commented Apr 4, 2026

Adding the operator-side evidence we gathered, since the simpler shape here is the right upstream path.

Our live canaries on the heavier proof branch were validating the same core mechanism this PR keeps: replacing per-key qm_cids attr checks with a bucket.List -> map presence index during cleanup.

Representative live result from the clean cohort window:

  • 1,995,657 total qm_cids calls
  • 29,937 effective attr-backed lookups
  • 66.66x direct reduction
  • 98.4999% reduction
  • zero normal attr calls on val003, val004, and val006 in the clean proof windows

So from our side the core list-index mechanism is real. The thing that did not survive review was the extra upstream surface area around it.

Happy to help validate or supply more rollout evidence if useful.

@RolfAris
Copy link
Copy Markdown
Contributor

RolfAris commented Apr 4, 2026

One thing I still thought was worth tightening here was direct coverage for the new list-index path itself.

I put together a tiny support branch on my fork with just three focused tests and no behavior changes:

  • TestBuildRepairPresenceIndexIncludesLocalBlob
  • TestRepairCidWithPresenceIndexUsesListedState
  • TestRepairCidUsesKnownPresentOutsideCleanup

Branch: RolfAris:support/pr-199-targeted-tests-20260404
Commit: 46b32bd

Focused run on that support branch:

go test ./pkg/mediorum/server -run 'TestBuildRepairPresenceIndexIncludesLocalBlob|TestRepairCidWithPresenceIndexUsesListedState|TestRepairCidUsesKnownPresentOutsideCleanup' -count=1\n```\n\nI’m not trying to reopen the broader `#198` surface here. This is just the smallest direct coverage I thought might make the simpler path easier to trust.

@raymondjacobson
Copy link
Copy Markdown
Contributor Author

raymondjacobson commented Apr 4, 2026

@RolfAris Thanks for the canary data and the tests — cherry-picked them onto the branch. The 66x reduction numbers are great to have as baseline evidence for this approach.

@raymondjacobson raymondjacobson merged commit ab8be7c into main Apr 4, 2026
6 checks passed
@raymondjacobson raymondjacobson deleted the feat/qm-cids-list-index-simple branch April 4, 2026 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants