Glossary

Jackson Owens edited this page May 9, 2025 · 45 revisions
  • Batch - A Batch is a sequence of mutations that the user of Pebble has constructed. Batch commits are atomic; all their mutations are made visible or none.
  • Blob file - A blob file is a file encoding values (the value portion of a KV pair), stored separately from their corresponding keys. Storing values separately reduces write amplification by allowing some compactions to avoid rewriting values. (This feature is still in-progress as of Mar 2025.)
  • Block - A block is a chunk of data written to a file. A block may be compressed when on disk. An sstable contains many blocks. A block is typically referenced through a block handle.
  • Block cache - The block cache is a cache of blocks from sstables, held in-memory and uncompressed. When a block is required by an iterator, it first looks in the block cache to see if the block is already in-memory. Otherwise, it reads from the underlying sstable file and inserts the uncompressed block into the block cache.
  • Block handle - A block handle describes a block's physical location within a file. A block handle is a (offset, length) tuple, with the offset indicating the offset within the file at which the block begins, and the length indicating the length of the block.
  • Block property - A block property is a small set of data collected by a user-configured block-property collector and stored alongside a block handle. A user may configure an iterator with a block-property filter to skip whole blocks without reading them based on the value of a block property. This is used by CockroachDB to perform "time-bound iteration," where an iterator can skip over data with MVCC timestamps outside a configured range. (This powers CockroachDB's incremental backups.)
  • Block trailer - A block trailer is a small set of data that is encoded within a sstable after each block. It contains a checksum over the block's data and a byte indicating what compression algorithm was used to compress the block.
  • Columnar blocks - A new (in 25.1+) sstable block format that decomposes the KVs within a block into a handful of columns, improving the efficiency of seeks and providing a more extensible format. See the sstable/colblk package for details.
  • Compaction - A compaction is a background job that takes as input a set of sstables and outputs a new set of sstables, establishing a new version of the LSM. Most compactions are default compactions which effectively merge sstables from two levels Li and Li+1, outputting the new, merged sstables into Li+1.
  • Compaction heuristics - Compaction heuristics decide when to pursue a compaction and what tables to compact. These decisions can greatly impact the overall performance of the database and are a frequent area of development.
  • Comparer - A comparer defines the ordering of keys within a database. The user of Pebble implements a Comparer and passes it to Pebble at Open. In addition to facilities for comparing keys, a comparer provides facilities for constructing keys like index separators.
  • Checkpoint - A checkpoint is a user-initiated operation (see DB.Checkpoint) that duplicates a Pebble database (or a subset) on the filesystem at a consistent point-in-time. SSTables are hard-linked, and other files are copied. CockroachDB only creates checkpoints to serve as root cause analysis artifacts when a replica divergence is detected.
  • Data block - A data block is a block of data containing key-value pairs representing user data. It contains KV pairs that were committed to the engine (as opposed to index handles, like an index block).
  • Disk stall - A disk stall is a term we use to describe when I/O latency to the underlying disk spikes. This typically manifests as WAL fsync latency, which results in latency to all batch commits. This is common on cloud infrastructure where the "disks" (AWS EBS, GCP PD) are actually distributed systems. Pebble's WAL Failover helps mitigate the impact of disk stalls by writing to a WAL file on a secondary disk. Note that this is distinct from a write stall.
  • Excise - An excise is an operation that deletes a region of the keyspace atomically, producing a new version of the LSM. The user may perform an excise independently, or atomically alongside an ingestion (effectively replacing the region of the keyspace with the contents of the ingestion). Excise is semantically similar to a delete range operation, but it's more performant for very large deletions. It directly drops entire sstables that are contained within the deleted keyspan and virtualizes sstables that partially overlap it, producing virtual sstables.
  • External sstable - An external sstable is a remote sstable that was ingested. In CockroachDB, "online" restore ingests remote sstables containing backed up data. An external sstable can be downloaded by a download compaction, replacing the external sstable with a local sstable on the local filesystem.
  • Filter block - A filter block is a block within a sstable that encodes a filter that can be used to determine that a key does not exist within a sstable without reading the sstable's index blocks or data blocks. Pebble implements filter blocks using bloom filters.
  • Flushable ingest - A flushable ingest is an ingestion optimization used when an ingestion's data overlaps data in the memtable. A flushable ingest enqueues the ingested sstables into the flushable queue and creates a new mutable memtable.
  • Flushable queue - The flushable queue is a queue of in-memory structures that need to be flushed. Most typically, the flushable queue consists of memtables, with the mutable memtable at the end. However, very large Batches may also be inserted to the queue (skipping memtable application). Also, ingestions may sometimes enqueue sstables within the flushable queue as a part of flushable ingests.
  • Footer - A footer of a sstable is a fixed-length structure encoded at the very end of a file. It encodes the table format of the file, the block handle of the metaindex block and other metadata important for interpreting the sstable correctly.
  • Format major version - A format major version provides versioning of Pebble and its physical file formats over time. As a CockroachDB node's cluster version increases, the format major version is increased allowing use of new features that are incompatible with previous versions of Pebble.
  • Index block - An index block is a block of data that describes the location of other blocks of data. It's structured as a series of key-value pairs itself. The value encodes a block handle. The key is an index separator indicating that all the keys within the referenced block are ≤ the index separator.
  • Ingestion - An ingestion is an operation that takes sstables already constructed by the user of Pebble and links them into the LSM, producing a new version. Ingestion is used for bulk loading data into the database quickly, because data ingested is not written to the write-ahead log (WAL), nor does it go through a memtable flush. CockroachDB uses ingestion when performing snapshot reception, restore from backup, schema changes and imports.
  • Internal iterator - An InternalIterator is a common interface for iterators over a sequence of internal KV pairs within Pebble. Generally all data structures holding KVs within Pebble (blocks, sstables, the memtable skiplist, etc) implement this interface and these internal iterators are merged together to form the pebble.Iterator over the merged database state used by the user of Pebble.
  • Internal key - An internal key is an internal representation of a key that consists of a user key and a trailer (encoding a sequence number and key kind). Most of Pebble handles internal keys.
  • Internal key kind - The internal key kind is an enum denoting the kind of key represented (eg, a set, a tombstone, etc.).
  • Key schema - A KeySchema is provided by the user of Pebble to define how a columnar block should store a user key as columns within the block. CockroachDB's implementation is in the cockroachkvs package.
  • Level - A level of the LSM forms a set of sstables in the hierarchy of the log-structured merge tree. There are 7 levels, numbered L0 to L6. All levels except L0 contain sstables that are all sorted and non-overlapping. We describe L0 as being the "highest" level and L6 as the "bottommost." So keys enter the top of the tree in L0 and are compacted downwards to L6.
  • L0 (Level 0) - L0 (Level 0) is the first level of the LSM. When a memtable flush completes, it outputs its sstables into L0. L0 is the only level where sstables are allowed to overlap one another. However Pebble takes these overlapping sstables and organizes them into a sublevel structure called L0 sublevels. Within each sublevel, sstables do not overlap one another.
  • LBase (Base level) - LBase is a term used for the first non-empty level below L0. A compaction moving data out of L0 compacts into LBase. In a LSM where all the levels have sstables, LBase is L1. The term LBase exists because compactions from L0 into LBase are important. Most of the read amplification in a database originates from L0, so it's important that Pebble is able to compact sstables out of L0 into LBase quickly.
  • Level multiplier - By default Pebble uses a 10x level multiplier. This means that each level of the LSM has a target size that is 10x bigger than the level above. For example, if your database has 100GB of data, you could expect a steady state to have ~90GB in L6, ~9GB in L5, ~900MB in L4, ~90MB in L3, etc.
  • Manifest - The manifest is a file on the filesystem (with a filename like MANIFEST-025922) recording the structure of the LSM, which files should exist, etc. It's structured as a log, recording a sequence of version edits, each recording a diff. Every compaction, memtable flush and ingestion append a version edit to the manifest.
  • Manual compaction - A manual compaction is a compaction initiated explicitly by the user of Pebble through DB.Compact (as opposed to an automatic compaction that Pebble's heuristics schedule). In CockroachDB, manual compactions only occur when an operator uses a special crdb_internal builtin to schedule one, used as an escape hatch if Pebble's heuristics are deficient in some way.
  • Memtable - A memtable is an in-memory structure holding recently committed data. It is implemented using a skiplist. When a memtable fills up, it triggers a memtable flush. Every memtable has a corresponding write-ahead log (WAL) containing the same data on disk.
  • Memtable flush - A memtable flush is a background job that takes 1 or more flushables (eg, immutable memtables) as input, merges them and outputs the merged result to sstables in L0.
  • MERGE key - A MERGE key is a key of a special key kind indicating that its value should be 'merged' with the previous value of the key. The user of Pebble provides an implementation of a pebble.Merger to dictate how to perform the merge. During iteration or compactions, Pebble invokes the user-defined Merger to produce a new value. CockroachDB uses MERGE keys in exactly 1 place: the implementation of the embedded timeseries database. Cockroach writes new recorded timeseries points as MERGE keys, and the Merger handles combining them with previously recorded points during iteration and compaction.
  • Merging iterator - The merging iterator (mergingIter in the code) is a special iterator that uses a heap to merge iterators across all the levels of the LSM, providing a single, merged, ordered view of the contents of the LSM.
  • Metaindex block - The metaindex block is a special block within a sstable that encodes the location of the top-level index block, properties block, filter block, etc. When Pebble opens a sstable, it must read the metaindex block to discover the location of these other blocks.
  • Mutable memtable - The mutable memtable is the most recent memtable. As batches are committed, their keys are inserted into the mutable memtable until the mutable memtable is full. When the mutable memtable is full, it becomes an immutable memtable and a new mutable memtable is created. If one is not already ongoing, a memtable flush may be triggered to flush the now immutable memtable.
  • MVCC - MVCC (Multiversion Concurrency Control) is a scheme for building database transactions that preserves multiple versions of keys within the database. CockroachDB uses MVCC, and Pebble facilitates some of CockroachDB's implementation. See the definitions of Prefix and Suffix for more details.
  • Object - An object is a term for a file that may be stored locally on the filesystem or remotely on an object store service (eg, s3, GCS). SSTables and blob files are the only object types today. All other files are stored locally.
  • Object provider - An object provider provides APIs to create and read objects (sstables and blob files). An object provider may wrap a vfs.FS to store objects on the local filesystem, or it may use a remote object storage service.
  • Obsolete key - An obsolete key is a key that is no longer considered live, because it's shadowed by another key with a higher sequence number. For example, the internal key k#5,SET would be considered obsolete if there exists another internal key k#9,SET. Iterators at recent sequence numbers would observe k#9,SET's value, not k#5,SET's value. Obsolete keys may be retained due to open Snapshots preventing the obsolete key from being dropped by a compaction.
  • Point key - A point key is a singular key that exists at exactly 1 user key. This is in contrast to range deletions and range keys which may be defined over a span of the keyspace.
  • Prefix - In Pebble, a key prefix represents a MVCC user key. Pebble dictates that users implementing MVCC should split their keys into two parts: a prefix (representing the MVCC user key) and a suffix representing the MVCC timestamp. Users provide a Split implementation on their Comparer that allows Pebble to split a key into its prefix and suffix. When building bloom filters, Pebble uses only the prefix of a key.
  • Properties block - A properties block is a special block within a sstable that describes and summarizes the sstable. Some heuristics within Pebble use these properties to inform compaction decisions.
  • Range deletion - A range deletion is a tombstone internal key that indicates an entire span of the keyspace [start, end) is deleted. A range deletion has internal key kind RANGEDEL.
  • Range key - A range key is a key that's defined over an entire key span. The internal key kinds RANGEKEYSET, RANGEKEYUNSET and RANGEKEYDEL are range keys.
  • Read amplification - A measure of the number of disk reads that a user read (ie, a lookup of 1 key) needs to perform. We usually measure it as the number of sstables that need to be consulted. In a full Pebble LSM, this measurement is 6 (1 for each level L1-L6) + n where n is the number of sublevels in L0.
  • Sequence number - A sequence number is an unsigned integer indicating the order in which keys were committed. When there exist multiple versions of the same user key, the sequence number determines which key should be visible. Sequence numbers are also used to implement iterators' consistent snapshot of the LSM.
  • Sequence number invariant - The sequence number invariant is a central invariant of the LSM. It ensures that if there are two keys with the same user key, k#si in LSM level Li and k#sj in LSM level Lj, such that si < sj, then i ≥ j. In other words, when two internal keys have the same user key, the key with the lower sequence number must not sit above the key with the higher sequence number (lower numbered levels are described as "higher" in the LSM).
  • Single delete (SINGLEDEL) - A single delete is a special tombstone key kind. Unlike a DEL tombstone, when a single delete tombstone meets a key that it deletes, both the tombstone and the deleted key are dropped. This is more efficient if the caller knows for certain that they only wrote the deleted key once, so there's only one internal key in the LSM that needs to be deleted. Ordinary DEL tombstones need to be compacted to the bottom of the LSM (L6) before they'll be elided.
  • Snappy - Snappy is a compression algorithm produced by Google. It's the default compression algorithm used by Pebble to compress blocks. It's fast but it does not result in as much compression as other algorithms.
  • Snapshot - A snapshot creates a consistent, point-in-time snapshot of the database without pinning the memtables. It's used by Pebble's users when a user requires a point-in-time snapshot for a longer duration. A normal iterator needs to pin memtables that exist at the time of its creation so that it can continue to read from them for the lifetime of the iterator. A snapshot avoids pinning memtables (avoiding excessive memory usage) through coordinating with memtable flushes to retain data necessary for its point-in-time snapshot.
  • Space amplification - A measure of the volume of disk space Pebble consumes, relative to the volume of live, logical data. Space amplification can result from duplicate keys in the LSM. For example, there may be a tombstone in L1 that deletes a value in L6. Both the tombstone and the deleted value consume disk space, but neither are live data. Compactions can reduce space amplification by deleting obsolete keys.
  • SSTable - A SSTable (sorted string table) is a file (with a filename like 923523.sst) containing key-value pairs in sorted order. Every sstable is organized into blocks.
  • Suffix - In Pebble, a key suffix represents a MVCC timestamp of a key. Pebble dictates that users implementing MVCC should split their keys into two parts: a prefix (representing the MVCC user key) and a suffix representing the MVCC timestamp. Users provide a Split implementation on their Comparer that allows Pebble to split a key into its prefix and suffix. Range keys also support taking a suffix so that a range key that is defined over a key span can have an associated MVCC timestamp.
  • Table metadata - Every sstable in the database has associated metadata held in a manifest.TableMetadata struct. This metadata is held in-memory for every sstable still referenced by a Version.
  • Tombstone - A tombstone is an internal key with a key kind that indicates it's a delete. A tombstone exists to ensure that a key appears deleted, even if Pebble hasn't yet physically deleted the underlying keys. When a tombstone meets a key that it tombstones in a compaction, the tombstoned key is dropped. When a tombstone is compacted into the bottom of the LSM, it is itself dropped.
  • Trailer - An internal key trailer is a uint64 value encoding an internal key's sequence number and its internal key kind.
  • User key - A user key is the []byte slice representing a key, as provided by the user of Pebble. The term user key is used to disambiguate from an internal key which contains a user key but also a trailer.
  • Value block - A value block is a block within a sstable that contains only values. The corresponding keys encode a handle describing the location of the value in the value block. Value blocks are an optimization used to move values that are unlikely to be needed out-of-band, improving the efficiency of the block cache and improving cache locality.
  • Version - A version (specifically a manifest.Version or LSM version) describes the structure of a LSM at a moment in time. The current Version is the current structure of the LSM that new iterators should use. Older versions may continue to exist if there are open iterators that depend on them.
  • Version edit - A version edit describes a diff between two Versions. Every compaction, flush or ingestion will append a version edit to the manifest file and produce a new Version.
  • Virtual sstable - A virtual sstable is a table within a LSM Version that does not directly represent a physical, on-disk sstable. A virtual sstable may define bounds narrower than its underlying physical sstable, effectively hiding data that exists on-disk but outside the virtual sstable's bounds. Virtual sstables may be created by excise operations or during ingestion.
  • Write-ahead log (WAL) - A write-ahead log is a file (with a filename like 002421.log) that records the contents of batches committed to the storage engine to provide durability. Every WAL has a 1:1 mapping with a memtable. When a WAL's corresponding memtable transitions from being mutable to immutable, its corresponding WAL is closed and a new WAL is opened. When a WAL's corresponding immutable memtable has been flushed to sstables, the WAL may be deleted (or recycled) because its contents now exist in sorted sstables.
  • Write amplification - A measure of the volume of disk writes that Pebble performs, relative to the volume of user writes. We usually measure it in terms of bandwidth. If the layers above Pebble have committed 100MB to Pebble but Pebble has written 500MB to disk so far, the write amplification is 5x. Most of the write amplification originates from compactions rewriting data in the background.
  • Write stall - A write stall occurs when Pebble deliberately blocks incoming batch commits in order to protect the health of the engine. There are two possible reasons for a write stall: (1) If more than Options.MemtableStopWritesThreshold memtables have accumulated because flushes are not progressing fast enough, Pebble will stall to avoid OOMs. (2) If more than Options.L0StopWritesThreshold sublevels in L0 have accumulated because compactions from L0 to LBase are not proceeding fast enough, Pebble will stall to avoid excessive read amplification (that can result in very slow reads). Note this is distinct from a disk stall.
  • Zombie sstables - A zombie sstable is a sstable that is no longer referenced by the most recent Version of the LSM, but is still referenced by a Version in use by an iterator. If an iterator is created using Version vi containing sstable tj and a compaction compacts tj creating a new Version vi+1, the new version vi+1 will not contain tj. However, the file backing tj cannot be deleted from the filesystem until the iterator using vi is closed. In this case tj is described as a 'zombie sstable.'
  • ZSTD (ZStandard) - ZSTD is a compression algorithm produced by Meta. ZSTD typically compresses more than Snappy, but is more CPU expensive. Pebble supports ZSTD.
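
The arithmetic in the Level multiplier entry can be reproduced with a short Go sketch. It assumes an idealized steady state in which all data lives in L1 through L6 and each level is exactly 10x the level above it. The function name `levelSizes` is hypothetical, and this is not how Pebble computes its actual per-level target sizes.

```go
package main

import "fmt"

// levelSizes splits totalBytes across levels L1..L6 so that each level is
// `multiplier` times larger than the level above it. This is an idealized
// estimate for illustration only, not Pebble's real target-size logic.
func levelSizes(totalBytes, multiplier float64) [7]float64 {
	var sizes [7]float64 // index 0 (L0) is intentionally left at zero

	// The weight of Li is multiplier^(i-1), so L6 carries the largest weight.
	var weightSum float64
	w := 1.0
	for i := 1; i <= 6; i++ {
		weightSum += w
		w *= multiplier
	}

	// Each level receives its weight's share of the total.
	w = 1.0
	for i := 1; i <= 6; i++ {
		sizes[i] = totalBytes * w / weightSum
		w *= multiplier
	}
	return sizes
}

func main() {
	sizes := levelSizes(100e9, 10) // 100GB of data, 10x multiplier
	for i := 6; i >= 1; i-- {
		fmt.Printf("L%d: %.3g bytes\n", i, sizes[i])
	}
}
```

With these inputs L6 holds roughly nine tenths of the data and each level above it holds a tenth of the one below, which matches the ~90GB / ~9GB / ~900MB progression described in the entry.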
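
The Trailer, Sequence number and Obsolete key entries fit together, and a minimal Go sketch may help. It assumes the trailer stores the sequence number in the upper 56 bits and the key kind in the low 8 bits; the glossary only says both are encoded in a uint64, so treat that layout as an assumption. The `Kind` type, its numeric values and every function name here are placeholders for illustration, not Pebble's API.

```go
package main

import "fmt"

// Kind is an illustrative stand-in for Pebble's internal key kind enum.
// The numeric values are placeholders chosen for this sketch.
type Kind uint8

const (
	KindDelete Kind = 0
	KindSet    Kind = 1
)

// makeTrailer packs a sequence number and a kind into one uint64, assuming
// the sequence number occupies the upper 56 bits and the kind the low 8 bits.
func makeTrailer(seqNum uint64, kind Kind) uint64 {
	return seqNum<<8 | uint64(kind)
}

// seqNum extracts the sequence number from a trailer.
func seqNum(trailer uint64) uint64 { return trailer >> 8 }

// kind extracts the key kind from a trailer.
func kind(trailer uint64) Kind { return Kind(trailer & 0xff) }

// shadows reports whether, for two internal keys with the same user key,
// the key with trailer a makes the key with trailer b obsolete.
func shadows(a, b uint64) bool { return seqNum(a) > seqNum(b) }

func main() {
	newer := makeTrailer(9, KindSet) // k#9,SET
	older := makeTrailer(5, KindSet) // k#5,SET
	fmt.Println(seqNum(newer), kind(newer), shadows(newer, older))
}
```

Here `shadows(newer, older)` captures the example from the Obsolete key entry: k#9,SET shadows k#5,SET because it has the higher sequence number.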
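
The read path described in the Block cache and Block handle entries (check the cache first, otherwise read the block from the file and insert it) can be sketched as a toy read-through cache in Go. All types and names below are hypothetical. Keying the cache on file number plus offset is an assumption of this sketch, and eviction, compression and checksums are omitted entirely.

```go
package main

import "fmt"

// blockHandle locates a block within a file as an (offset, length) pair.
type blockHandle struct {
	offset, length uint64
}

// cacheKey identifies a cached block. Keying on file number and offset is
// an assumption made for this sketch.
type cacheKey struct {
	fileNum uint64
	offset  uint64
}

// blockCache is a toy, unbounded read-through cache of uncompressed blocks.
type blockCache struct {
	blocks map[cacheKey][]byte
	misses int
}

// get returns the cached block if present. Otherwise it calls load to read
// the block from the file, stores the result in the cache and returns it.
func (c *blockCache) get(fileNum uint64, h blockHandle, load func(blockHandle) []byte) []byte {
	k := cacheKey{fileNum: fileNum, offset: h.offset}
	if b, ok := c.blocks[k]; ok {
		return b
	}
	c.misses++
	b := load(h)
	c.blocks[k] = b
	return b
}

func main() {
	file := []byte("aaaabbbbbbcc") // stand-in for an sstable's bytes
	load := func(h blockHandle) []byte { return file[h.offset : h.offset+h.length] }

	c := &blockCache{blocks: map[cacheKey][]byte{}}
	h := blockHandle{offset: 4, length: 6}
	fmt.Printf("%s\n", c.get(1, h, load)) // first read goes to the file
	fmt.Printf("%s\n", c.get(1, h, load)) // second read is served from the cache
	fmt.Println("misses:", c.misses)
}
```

Only the first lookup invokes `load`; later lookups for the same handle are served from memory, which is the behaviour the Block cache entry describes for iterators.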