Motivation
Currently all indexed data is written directly to PostgreSQL and served from there. Adding S3 as a storage backend gives us:
- Redundancy — blocks, transactions, and address data persisted to durable object storage independent of the database
- Cost efficiency — cold data in S3 is significantly cheaper than keeping everything in PG at scale
- Disaster recovery — can rebuild the database from S3 if needed
Proposal
Write path
The indexer writes all indexed data (blocks, transactions, address state) to both PostgreSQL (as today) and S3 (new). S3 writes can be async/best-effort with retries so they don't block the indexing pipeline.
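The dual write with best-effort S3 retries could be sketched roughly as follows. This is an assumption about shape, not the actual indexer code: `DualWriter`, `PendingWrite`, and the `put` closure (standing in for a real S3 `PutObject` call) are all hypothetical names.

```rust
use std::collections::VecDeque;

// Hypothetical sketch: the PG write happens synchronously as today; the S3
// copy goes onto a retry queue drained by a background worker, so S3
// failures never block the indexing pipeline.
struct PendingWrite {
    key: String,
    body: Vec<u8>,
    attempts: u32,
}

struct DualWriter {
    s3_queue: VecDeque<PendingWrite>,
    max_attempts: u32,
}

impl DualWriter {
    fn new(max_attempts: u32) -> Self {
        DualWriter { s3_queue: VecDeque::new(), max_attempts }
    }

    // Called from the indexing pipeline after the PG write commits.
    fn enqueue_s3(&mut self, key: String, body: Vec<u8>) {
        self.s3_queue.push_back(PendingWrite { key, body, attempts: 0 });
    }

    // Drain loop run by the background worker. `put` stands in for the real
    // S3 call; failed writes are re-queued until max_attempts, then dropped
    // and left to a reconciliation job. Returns the number flushed.
    fn drain<F>(&mut self, mut put: F) -> usize
    where
        F: FnMut(&str, &[u8]) -> bool,
    {
        let mut flushed = 0;
        let mut still_pending = VecDeque::new();
        while let Some(mut w) = self.s3_queue.pop_front() {
            if put(&w.key, &w.body) {
                flushed += 1;
            } else {
                w.attempts += 1;
                if w.attempts < self.max_attempts {
                    still_pending.push_back(w); // retry on the next drain pass
                }
            }
        }
        self.s3_queue = still_pending;
        flushed
    }
}
```

Keeping the queue outside the hot path answers the "can be async/best-effort" requirement: the pipeline only pays for a cheap enqueue.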
Read path — in-memory cache for recent data
Hold the last N blocks (and their transactions) in an in-memory cache. Serve recent queries directly from memory, bypassing both PG and S3. This covers the hot path — most explorer traffic hits recent data.
Cache eviction: LRU or ring buffer keyed by block number.
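Since blocks arrive in order, the ring-buffer variant is particularly simple: contiguous block numbers make lookup O(1) with no hashing. A minimal sketch, assuming a placeholder `Block` type for whatever the indexer actually stores:

```rust
use std::collections::VecDeque;

// Count-bounded ring buffer holding the last N blocks. `Block` is a
// stand-in for the indexer's real block type.
#[derive(Clone)]
struct Block {
    number: u64,
    raw: Vec<u8>, // serialized block plus its transactions
}

struct RecentBlocks {
    buf: VecDeque<Block>,
    capacity: usize,
}

impl RecentBlocks {
    fn new(capacity: usize) -> Self {
        RecentBlocks { buf: VecDeque::with_capacity(capacity), capacity }
    }

    // Blocks arrive in order; evict the oldest when full.
    fn push(&mut self, block: Block) {
        if self.buf.len() == self.capacity {
            self.buf.pop_front();
        }
        self.buf.push_back(block);
    }

    // O(1) lookup: block numbers are contiguous, so the offset from the
    // oldest cached block is the index into the buffer.
    fn get(&self, number: u64) -> Option<&Block> {
        let first = self.buf.front()?.number;
        let idx = number.checked_sub(first)? as usize;
        self.buf.get(idx)
    }
}
```

An LRU would only be needed if blocks could be inserted or queried out of order in a way that should affect retention.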
Read path — fallback
- Check in-memory cache
- Miss → query PostgreSQL (current behavior)
- Optionally: if PG is down/slow, fall back to S3
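The three-tier fallback above can be expressed as a short lookup chain. The closures here are placeholders for the real cache/PG/S3 lookups (names hypothetical); only the PG tier returns a `Result`, to model the "PG is down/slow" case:

```rust
// Hypothetical read chain: memory first, then PostgreSQL, then S3 as a
// last resort. Each store returns None on a miss; PG can also fail
// outright, which triggers the optional S3 fallback.
fn read_block(
    number: u64,
    cache: impl Fn(u64) -> Option<Vec<u8>>,
    pg: impl Fn(u64) -> Result<Option<Vec<u8>>, String>,
    s3: impl Fn(u64) -> Option<Vec<u8>>,
) -> Option<Vec<u8>> {
    if let Some(b) = cache(number) {
        return Some(b); // hot path: recent block served from memory
    }
    match pg(number) {
        Ok(found) => found,          // current behavior
        Err(_pg_error) => s3(number), // optional fallback when PG is down/slow
    }
}
```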
Open questions
- S3 object layout — one object per block (including its txs)? Separate prefixes for blocks/txs/addresses? Batched ranges?
- Serialization format — JSON, bincode, protobuf?
- Consistency — how to handle S3 write failures? Retry queue? Reconciliation job?
- Cache size — what's a reasonable default for N? Should it be memory-bounded (bytes) or count-bounded (blocks)?
- Existing data backfill — do we need a one-time migration to populate S3 from PG?
- Address data granularity — store snapshots per block or cumulative state?
- Configuration — `S3_BUCKET`, `S3_REGION`, `S3_PREFIX`, `BLOCK_CACHE_SIZE` env vars
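For the object-layout question, one candidate (an assumption, not a decision) is one object per block under a `blocks/` prefix, with zero-padded numbers so lexicographic key order matches block order, and addresses under a separate prefix. The key builders below are hypothetical:

```rust
// One possible S3 layout: {prefix}/blocks/{number}.bin, zero-padded so
// ListObjects returns keys in block order, plus a separate address prefix.
fn block_key(prefix: &str, number: u64) -> String {
    // 12 digits covers ~10^12 blocks while keeping lexicographic == numeric order
    format!("{}/blocks/{:012}.bin", prefix.trim_end_matches('/'), number)
}

fn address_key(prefix: &str, address: &str) -> String {
    format!("{}/addresses/{}.bin", prefix.trim_end_matches('/'), address)
}
```

Batched ranges (e.g. one object per 1000 blocks) would cut request counts and per-object overhead at the cost of read amplification for single-block lookups.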