Commit 6c10d8b (parent afeef45)
Author: Dan Lambright
Hot shard design documentation

1 file changed (+313, -0)
design/hotshard.md

# Hot Shard Management in FoundationDB

## Table of Contents

- [Overview](#overview)
- [What is a Hot Shard?](#what-is-a-hot-shard)
- [Normal Internal Shard Splitting Process](#normal-internal-shard-splitting-process)
- [Hot Shard Special Case](#hot-shard-special-case)
- [When a Hot Shard Occurs That Cannot Be Split](#when-a-hot-shard-occurs-that-cannot-be-split)
  - [Storage Server Impact](#storage-server-impact)
  - [Cluster-Wide Rate Limiting](#cluster-wide-rate-limiting)
- [Mitigation](#mitigation)
  - [Avoiding Hot Shards](#avoiding-hot-shards)
  - [Avoiding Read Hotspots](#avoiding-read-hotspots)
  - [Schema & Keyspace Design Patterns](#schema--keyspace-design-patterns)
  - [Optional Hot Shard Throttling (Targeted Mitigation)](#optional-hot-shard-throttling-targeted-mitigation)
  - [Key Insight](#key-insight)
- [Server Knobs Related to Hot Shards](#server-knobs-related-to-hot-shards)
  - [Hot Shard Throttling Knobs](#hot-shard-throttling-knobs)
  - [Shard Bandwidth Splitting Knobs](#shard-bandwidth-splitting-knobs)
  - [Read Hot Shard Detection Knobs](#read-hot-shard-detection-knobs)
  - [Storage Queue Rebalancing Knobs](#storage-queue-rebalancing-knobs)
  - [Read Rebalancing Knobs](#read-rebalancing-knobs)
  - [Shard Split/Merge Priority Knobs](#shard-splitmerge-priority-knobs)
  - [Other Hot Shard Related Knobs](#other-hot-shard-related-knobs)
  - [Key Relationships](#key-relationships)
- [References](#references)

## Overview

FoundationDB partitions the keyspace into shards (contiguous key ranges) and assigns each shard to a storage-team replica set.

Shards are just key ranges; their size adapts continuously as Data Distribution monitors load and splits or merges ranges to stay within configured bounds [1][2]. Splits are triggered when metrics like bytes, read/write bandwidth, or hot-shard detection exceed thresholds (e.g., `splitMetrics.bytes = shardBounds.max.bytes / 2`, `splitMetrics.bytesWrittenPerKSecond`). In other words, shard size is workload driven rather than fixed [1][3]. Storage servers and commit proxies provide ongoing measurements so DD can resize ranges dynamically to relieve load hotspots and maintain balance [4][5].

## What is a Hot Shard?

A shard is "hot" when it absorbs a disproportionate share of the cluster's read or write workload, driving CPU or bandwidth saturation on its replica servers [6].

Storage servers continually sample bytes-read, ops-read, and shard sizes; a shard whose read bandwidth density exceeds configured thresholds is tagged as read-hot so the distributor can identify the offending range [4].

## Normal Internal Shard Splitting Process

- The data distributor decides a shard should be split and calls `splitStorageMetrics`, which in turn asks each storage server in the shard for split points via `SplitMetricsRequest`; the tracker sets targets like "half of the max shard size" and minimum bytes/throughput [1].

- The storage server handles that request in `StorageServerMetrics::splitMetrics`, pulling the live samples it already maintains for bytes, write bandwidth, and IOs (`byteSample`, `bytesWriteSample`, `iopsSample`) to compute balanced cut points [7][8].

- Inside `splitMetrics`, the helper `getSplitKey` converts the desired metric offset into an actual key using the sampled histogram, adds jitter so repeated splits don't choose exactly the same boundary, and enforces `MIN_SHARD_BYTES` plus optional write-traffic thresholds before accepting the split [8][9].

- It loops until the remaining range falls under all limits, emitting each chosen key into the reply; the tracker then uses those keys to call `executeShardSplit`, which updates the shard map and kicks off the relocation [10][11].
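
To make the loop concrete, here is a simplified, illustrative sketch of the process described above. It is not the actual C++ implementation: `MIN_SHARD_BYTES` and the split target are stand-in constants, and `byte_sample` stands in for the storage server's sampled histogram.

```python
# Illustrative sketch of the split loop described above (not the real C++ code).
import random

MIN_SHARD_BYTES = 200_000      # stand-in for the MIN_SHARD_BYTES knob
SPLIT_TARGET_BYTES = 500_000   # stand-in for "half of the max shard size"

def get_split_key(byte_sample, start_idx, target_bytes, jitter_frac=0.05):
    """Walk the sampled histogram until roughly target_bytes (plus jitter) is covered."""
    target = target_bytes * (1 + random.uniform(-jitter_frac, jitter_frac))
    acc = 0
    for i in range(start_idx, len(byte_sample)):
        acc += byte_sample[i][1]
        if acc >= target:
            return i, acc              # interior split point found
    return len(byte_sample), acc       # no interior key: caller treats this as "give up"

def split_metrics(byte_sample):
    """Emit split keys until the remaining range falls under the limits."""
    split_keys = []
    idx = 0
    remaining = sum(size for _key, size in byte_sample)
    while remaining >= 2 * MIN_SHARD_BYTES:        # mirrors the loop-exit check
        next_idx, used = get_split_key(byte_sample, idx, SPLIT_TARGET_BYTES)
        if next_idx >= len(byte_sample):           # samples too coarse to split further
            break
        split_keys.append(byte_sample[next_idx][0])
        remaining -= used
        idx = next_idx + 1
    return split_keys

# Example: a tiny shard dominated by one hot key is already below 2 * MIN_SHARD_BYTES,
# so the loop exits immediately and no split point is produced.
tiny_hot_shard = [(b"user/42", 10_000), (b"user/42/counter", 250_000), (b"user/43", 10_000)]
print(split_metrics(tiny_hot_shard))   # []
```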

## Hot Shard Special Case

`StorageServerMetrics::splitMetrics` only emits a split point when it can carve the shard into two chunks that both satisfy the size/traffic limits. Specifically, the loop exits if `remaining.bytes < 2 * MIN_SHARD_BYTES` or the write bandwidth is below `SHARD_SPLIT_BYTES_PER_KSEC`, and `getSplitKey()` will return the shard end if the byte/IO samples can't find an interior key that meets the target [8][9].

Hot shards are often tiny (say, a single hot key), so they're already near the minimum size or lack sample resolution; splitting them would just violate those constraints and leave the hot key on the same team anyway.

## When a Hot Shard Occurs That Cannot Be Split

When a hot shard cannot be split (due to minimum size constraints or lack of sample resolution), the cluster experiences cascading effects:

### Storage Server Impact

- The storage servers hosting the hot shard become saturated with write traffic, causing their write queues to grow [13][14].

- Storage server metrics report high `bytesInput` and increasing queue depths, which are monitored by Ratekeeper [13].

- If the storage server's write queue size exceeds thresholds relative to available space and target ratios, it becomes the bottleneck for the entire cluster [14].

### Cluster-Wide Rate Limiting

- Ratekeeper continuously monitors storage server queues and bandwidth metrics to determine cluster-wide transaction rate limits [13][14].

- When a storage server's write queue becomes excessive, Ratekeeper sets the limit reason to `storage_server_write_queue_size` and reduces the cluster-wide transaction rate proportionally [14][15].

- This causes **all transactions across the cluster** to be throttled, even those not touching the hot shard, because the cluster cannot commit faster than its slowest storage server can durably persist data.

- The rate limit is calculated as: `limitTps = min(actualTps * maxBytesPerSecond / inputRate, maxBytesPerSecond * MAX_TRANSACTIONS_PER_BYTE)`, ensuring the cluster doesn't overwhelm the saturated storage server [13].
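
To make the formula concrete, here is a small illustrative calculation. Every number below is invented, and `max_bytes_per_second` / `MAX_TRANSACTIONS_PER_BYTE` stand in for values Ratekeeper derives from its own knobs and measurements.

```python
# Illustrative plug-in of the Ratekeeper formula above; all numbers are invented.
actual_tps = 50_000                   # transactions/sec the cluster currently sustains
input_rate = 400_000_000              # bytes/sec arriving at the saturated storage server
max_bytes_per_second = 100_000_000    # what that server can durably absorb right now
MAX_TRANSACTIONS_PER_BYTE = 0.001     # stand-in for the knob of the same name

limit_tps = min(actual_tps * max_bytes_per_second / input_rate,
                max_bytes_per_second * MAX_TRANSACTIONS_PER_BYTE)
print(limit_tps)  # 12500.0 -- the whole cluster is throttled to a quarter of its throughput
```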

## Mitigation

### Avoiding Hot Shards

- Randomize key prefixes so consecutive writes land on different shard ranges; for example, hash user IDs or add a short random salt before the natural key. This way inserts scatter instead of piling onto one shard.
- If you need to store a counter, consider sharding it across N disjoint keys (e.g., `counter/<bucket>/…`) and aggregating in clients or background jobs; this keeps the per-key mutation rate below the commit proxy’s hot-shard throttle (see the sketch after this list).
- If you are storing append-only logs, split them into multiple partitions (such as `log/<partition>/<ts>`), rotating partitions over time rather than funneling through a single key path.
- Avoid “read-modify-write” cycles. Use FDB's atomic operations (like `ADD`) when possible, and throttle/queue work in clients so they don’t stampede on a single hot key.
- For multi-tenant schemas, assign each tenant a disjoint key subspace.
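
A sketch of the sharded-counter pattern above, assuming the standard Python bindings (`fdb` package); the bucket count, subspace layout, and API version are illustrative choices, not FoundationDB defaults:

```python
# Sharded counter sketch: writes use atomic ADD against a random bucket, reads
# aggregate the buckets. Layout and bucket count are illustrative choices.
import random
import struct
import fdb

fdb.api_version(710)   # assumes a 7.1+ client is installed
db = fdb.open()

counter = fdb.Subspace(('counter', 'page_views'))
N_BUCKETS = 16         # more buckets => lower per-key mutation rate

@fdb.transactional
def increment(tr, amount=1):
    bucket = random.randrange(N_BUCKETS)                        # spread writes across keys
    tr.add(counter.pack((bucket,)), struct.pack('<q', amount))  # atomic, no read needed

@fdb.transactional
def read_total(tr):
    return sum(struct.unpack('<q', v)[0] for _k, v in tr[counter.range()])

increment(db)
print(read_total(db))
```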

### Avoiding Read Hotspots

- Cache hot objects (in the application or via a memory service) instead of repeatedly issuing a `get` for the same key in every transaction; read-hot shards often stem from the same value being fetched continuously.
- For range scans, fetch only the window you truly need—store derived indexes or “chunked” materializations so transactions avoid reading a large contiguous range every time.
- If one small document is hammered by many readers, replicate it into multiple keys (e.g., per region or per hash bucket) and route readers to different replicas based on a deterministic hash; background jobs can keep the replicas in sync (see the sketch after this list).
- Stagger periodic polling across clients—add random jitter so metrics or heartbeat jobs don’t all read the same shard simultaneously.
- Watch `readBandwidth` metrics and proactively split large ranges (via `splitStorageMetrics`) if access skew is predictable (e.g., time-series keys); pre-splitting keeps each shard small enough that DD can relocate it quickly.
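
A sketch of the replica fan-out idea, again assuming the Python bindings; the subspace name, replica count, and routing hash are illustrative:

```python
# Fan a read-hot value out into N replica keys; readers hash to a fixed replica
# so the load spreads across shards. All names and counts are illustrative.
import hashlib
import fdb

fdb.api_version(710)
db = fdb.open()

hot_doc = fdb.Subspace(('config', 'feature_flags'))
N_REPLICAS = 8

@fdb.transactional
def write_all_replicas(tr, value):
    # The writer (or a background job) keeps every replica identical.
    for r in range(N_REPLICAS):
        tr[hot_doc.pack((r,))] = value

@fdb.transactional
def read_one_replica(tr, client_id):
    r = int(hashlib.sha1(client_id.encode()).hexdigest(), 16) % N_REPLICAS
    return tr[hot_doc.pack((r,))]

write_all_replicas(db, b'{"dark_mode": true}')
print(read_one_replica(db, "client-42"))
```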

### Schema & Keyspace Design Patterns

- Choose the subspace layout so high-cardinality data (per user, per session, per order) groups together while low-cardinality shared data fans out. A common pattern is `subspace.pack([prefix, hash(id), id, ...])`, where the hash spreads load and the trailing id preserves locality for range scans within a user (see the sketch after this list).
- Reserve dedicated key ranges for “global” data that’s touched by admins or background coordinators; keep those ranges tiny and infrequently written so they never become hot.
- For time-ordered data, interleave a bucket prefix (hour, minute) or a modulo partition to avoid writing the latest timestamp into a single ever-growing shard; readers can consult a manifest that lists active buckets.
- If you must maintain a monotonic index (e.g., a queue), combine it with a random shard ID so each enqueue writes to `queue/<shard>/<increasing counter>`; consumers merge shards client-side.
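
A minimal sketch of the hash-prefixed layout from the first bullet, assuming the Python bindings; subspace names and the bucket count are illustrative:

```python
# Hash-prefixed keys: the bucket spreads users across shards, while the trailing
# (user_id, order_id) keeps one user's rows contiguous for cheap range scans.
import hashlib
import fdb

fdb.api_version(710)
db = fdb.open()

orders = fdb.Subspace(('orders',))

def bucket_of(user_id):
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 256

@fdb.transactional
def put_order(tr, user_id, order_id, payload):
    tr[orders.pack((bucket_of(user_id), user_id, order_id))] = payload

@fdb.transactional
def orders_for_user(tr, user_id):
    rng = orders[bucket_of(user_id)][user_id].range()   # scan stays within one user's range
    return [(orders.unpack(k)[2], v) for k, v in tr[rng]]

put_order(db, "alice", 1, b"...")
print(orders_for_user(db, "alice"))
```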

### Optional Hot Shard Throttling (Targeted Mitigation)

**Note**: This feature is **experimental** and disabled by default, guarded by the `HOT_SHARD_THROTTLING_ENABLED` server knob [16].

- When enabled, write-hot shards are tracked by the commit proxies, which maintain a hot-shard table and reject incoming mutations against those ranges with the `transaction_throttled_hot_shard` error to keep them from overwhelming a single team [5].

- This targeted throttling attempts to throttle only transactions writing to the hot shard, rather than penalizing the entire cluster with global rate limits [5].

- Once the data distributor flags a hot shard, it can split the range into smaller pieces and/or relocate portions to other teams to spread the traffic while respecting health constraints and load targets [12].
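
From the client's point of view, a rejected commit surfaces as an error. The stock manual retry loop (below, Python bindings) already handles that shape of failure, assuming the throttling error is surfaced as a retryable `FDBError`; nothing here is specific to hot-shard throttling.

```python
# Generic manual retry loop: tr.on_error() applies the binding's standard backoff
# before the loop retries. This is the stock pattern, not a hot-shard-specific API.
import fdb

fdb.api_version(710)
db = fdb.open()

def write_with_retry(key, value):
    tr = db.create_transaction()
    while True:
        try:
            tr[key] = value
            tr.commit().wait()
            return
        except fdb.FDBError as e:
            tr.on_error(e).wait()   # backs off, then the same transaction is retried
```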

### Key Insight

The fundamental problem with unsplittable hot shards is that they create a **single point of contention** that forces cluster-wide rate limiting. Even with targeted hot shard throttling enabled, the storage server write queues will still grow if the hot key receives sustained traffic, eventually triggering global rate limits that affect all clients.

---

## Server Knobs Related to Hot Shards

FoundationDB uses several server knobs to control hot shard detection, splitting, and throttling behavior.

### Hot Shard Throttling Knobs

These knobs control the experimental hot shard throttling feature (PR #10970):

**`HOT_SHARD_THROTTLING_ENABLED`** (bool, default: `false`)
- Master switch to enable/disable hot shard throttling at commit proxies
- When enabled, commit proxies track hot shards and reject transactions writing to them
- Disabled by default as the feature is experimental
- Location: fdbclient/ServerKnobs.cpp:999

**`HOT_SHARD_THROTTLING_EXPIRE_AFTER`** (double, default: `3.0` seconds)
- Duration after which a throttled hot shard expires and is removed from the throttle list
- Prevents indefinite throttling if load decreases
- Location: fdbclient/ServerKnobs.cpp:1000

**`HOT_SHARD_THROTTLING_TRACKED`** (int64_t, default: `1`)
- Maximum number of hot shards to track and throttle per storage server
- Limits the size of the hot shard list to prevent excessive memory usage
- Location: fdbclient/ServerKnobs.cpp:1001

**`HOT_SHARD_MONITOR_FREQUENCY`** (double, default: `5.0` seconds)
- How often Ratekeeper queries storage servers for hot shard information
- Lower values provide faster hot shard detection but increase RPC overhead
- Location: fdbclient/ServerKnobs.cpp:1002

### Shard Bandwidth Splitting Knobs

These knobs control when Data Distribution splits shards based on write bandwidth (a ksec is a kilosecond, i.e., 1000 seconds):

**`SHARD_MAX_BYTES_PER_KSEC`** (int64_t, default: `1,000,000,000` bytes/ksec = 1 MB/sec)
- Shards with write bandwidth exceeding this threshold are split immediately
- A large shard (e.g., 100 MB) that exceeds this threshold will be split into multiple pieces
- If set too low, causes excessive data movement for small bandwidth spikes
- If set too high, workload can remain concentrated on a single team indefinitely
- Location: fdbclient/ServerKnobs.cpp:244

**`SHARD_MIN_BYTES_PER_KSEC`** (int64_t, default: `100,000,000` bytes/ksec = 100 KB/sec)
- Shards with write bandwidth above this threshold will not be merged
- Must be significantly less than `SHARD_MAX_BYTES_PER_KSEC` to avoid merge/split cycles
- Controls the number of extra shards: max extra shards ≈ (total cluster bandwidth) / `SHARD_MIN_BYTES_PER_KSEC`
- Location: fdbclient/ServerKnobs.cpp:255

**`SHARD_SPLIT_BYTES_PER_KSEC`** (int64_t, default: `250,000,000` bytes/ksec = 250 KB/sec)
- When splitting a hot shard, each resulting piece will have less than this bandwidth
- Must be less than half of `SHARD_MAX_BYTES_PER_KSEC`
- Smaller values split hot shards into more pieces, distributing load faster across more servers
- If too small relative to `SHARD_MIN_BYTES_PER_KSEC`, generates immediate re-merging work
- Location: fdbclient/ServerKnobs.cpp:269
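
A quick illustrative calculation with the defaults above; the shard's observed bandwidth is an invented number:

```python
# How many pieces a hot shard gets cut into, using the default knob values above.
SHARD_MAX_BYTES_PER_KSEC = 1_000_000_000     # 1 MB/sec
SHARD_SPLIT_BYTES_PER_KSEC = 250_000_000     # 250 KB/sec

observed = 4_800_000 * 1000   # an (invented) shard sustaining 4.8 MB/sec, in bytes/ksec

if observed > SHARD_MAX_BYTES_PER_KSEC:
    # Each piece must end up below SHARD_SPLIT_BYTES_PER_KSEC, so DD needs at
    # least ceil(observed / SHARD_SPLIT_BYTES_PER_KSEC) pieces.
    pieces = -(-observed // SHARD_SPLIT_BYTES_PER_KSEC)   # ceiling division
    print(pieces)   # 20 -- roughly 240 KB/sec per piece if traffic were spread evenly
```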

### Read Hot Shard Detection Knobs

These knobs control read-hot shard detection (primarily for read load balancing):

**`SHARD_MAX_READ_OPS_PER_KSEC`** (int64_t, default: `45,000,000` ops/ksec = 45k ops/sec)
- Read operation rate above which a shard is considered read-hot
- Assumption: at 45k reads/sec, a storage server becomes CPU-saturated
- Location: fdbclient/ServerKnobs.cpp:224

**`SHARD_READ_OPS_CHANGE_THRESHOLD`** (int64_t, default: `SHARD_MAX_READ_OPS_PER_KSEC / 4`)
- When sampled read ops change by more than this amount, shard metrics update immediately
- Enables faster response to sudden changes in read workload
- Location: fdbclient/ServerKnobs.cpp:225

**`READ_SAMPLING_ENABLED`** (bool, default: `false`)
- Master switch to enable/disable read sampling on storage servers
- When enabled, storage servers sample read operations to detect hot ranges
- Location: fdbclient/ServerKnobs.cpp:1018

**`ENABLE_WRITE_BASED_SHARD_SPLIT`** (bool)
- Controls whether write traffic (in addition to read traffic) can trigger shard splits
- When enabled, high write bandwidth marks shards as hot for splitting
- Location: fdbclient/include/fdbclient/ServerKnobs.h:260

**`SHARD_MAX_READ_DENSITY_RATIO`** (double)
- Maximum read density ratio before a shard is considered hot
- Used to detect hotspots based on read concentration rather than absolute rate
- Location: fdbclient/include/fdbclient/ServerKnobs.h:263

**`SHARD_READ_HOT_BANDWIDTH_MIN_PER_KSECONDS`** (int64_t)
- Minimum read bandwidth threshold (in bytes/ksec) for a shard to be marked read-hot
- Prevents low-traffic shards from being flagged as hot due to ratio calculations
- Location: fdbclient/include/fdbclient/ServerKnobs.h:264

**`SHARD_MAX_BYTES_READ_PER_KSEC_JITTER`** (int64_t)
- Jitter applied to read-hot shard detection to avoid synchronization
- Adds randomness to prevent all shards from being evaluated simultaneously
- Location: fdbclient/include/fdbclient/ServerKnobs.h:265

### Storage Queue Rebalancing Knobs

These knobs control automatic shard movement when storage queues become unbalanced:

**`REBALANCE_STORAGE_QUEUE_SHARD_PER_KSEC_MIN`** (int64_t, default: `SHARD_MIN_BYTES_PER_KSEC`)
- Minimum write bandwidth for a shard to be considered for storage queue rebalancing
- Prevents moving tiny/idle shards during rebalancing
- Location: fdbclient/ServerKnobs.cpp:368

**`DD_ENABLE_REBALANCE_STORAGE_QUEUE_WITH_LIGHT_WRITE_SHARD`** (bool, default: `true`)
- Allows moving light-traffic shards out of overloaded storage servers
- Helps reduce queue buildup by redistributing low-write shards
- Location: fdbclient/ServerKnobs.cpp:369

### Read Rebalancing Knobs

These knobs control Data Distribution's ability to move shards based on read hotspots:

**`READ_REBALANCE_SHARD_TOPK`** (int)
- Number of top read-hot shards to consider for rebalancing moves
- Limits how many hot shards DD tracks simultaneously for read-aware relocations
- Location: fdbclient/include/fdbclient/ServerKnobs.h:242

**`READ_REBALANCE_MAX_SHARD_FRAC`** (double)
- Maximum fraction of the read imbalance gap that a moved shard can represent
- Prevents moving excessively large shards during read rebalancing
- Location: fdbclient/include/fdbclient/ServerKnobs.h:245

### Shard Split/Merge Priority Knobs

These knobs control the priority of different Data Distribution operations:

**`PRIORITY_SPLIT_SHARD`** (int)
- Priority level for shard split operations in DD's work queue
- Higher priority means splits happen before lower-priority operations
- Location: fdbclient/include/fdbclient/ServerKnobs.h:193

**`PRIORITY_MERGE_SHARD`** (int)
- Priority level for shard merge operations in DD's work queue
- Typically lower than split priority since merges are less urgent
- Location: fdbclient/include/fdbclient/ServerKnobs.h:177

### Other Hot Shard Related Knobs

**`WAIT_METRICS_WRONG_SHARD_CHANCE`** (double)
- Probability of injecting a delay when a storage server reports a wrong_shard error
- Prevents rapid retry loops when shard location information is stale
- Location: fdbclient/include/fdbclient/ServerKnobs.h:1124

**`WRONG_SHARD_SERVER_DELAY`** (double, client-side)
- Backoff delay on the client when receiving a wrong_shard_server error
- Prevents tight retry loops when the client's location cache is stale
- Location: fdbclient/include/fdbclient/ClientKnobs.h:57

### Key Relationships

```
SHARD_MIN_BYTES_PER_KSEC (100 KB/s)
    ↓ (must be significantly less than)
SHARD_SPLIT_BYTES_PER_KSEC (250 KB/s)
    ↓ (must be less than half of)
SHARD_MAX_BYTES_PER_KSEC (1 MB/s)
```

When a shard's write bandwidth exceeds `SHARD_MAX_BYTES_PER_KSEC`, it is split into pieces each with bandwidth < `SHARD_SPLIT_BYTES_PER_KSEC`. Shards with bandwidth > `SHARD_MIN_BYTES_PER_KSEC` will not be merged back together.
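
A toy decision function, under the assumption that the bandwidth-driven split/merge decision reduces to the three thresholds above (the real DD logic considers more signals):

```python
# Toy illustration of the hysteresis above; values are the defaults, in bytes/ksec.
SHARD_MIN_BYTES_PER_KSEC = 100_000_000       # 100 KB/s: above this, never merge
SHARD_SPLIT_BYTES_PER_KSEC = 250_000_000     # 250 KB/s: target size of split pieces
SHARD_MAX_BYTES_PER_KSEC = 1_000_000_000     # 1 MB/s: above this, split immediately

def dd_action(write_bytes_per_ksec):
    if write_bytes_per_ksec > SHARD_MAX_BYTES_PER_KSEC:
        return "split into pieces each < SHARD_SPLIT_BYTES_PER_KSEC"
    if write_bytes_per_ksec > SHARD_MIN_BYTES_PER_KSEC:
        return "leave alone: too busy to merge"
    return "eligible to merge with neighbors"

print(dd_action(5_000_000 * 1000))   # 5 MB/s   -> split
print(dd_action(150_000 * 1000))     # 150 KB/s -> leave alone
print(dd_action(10_000 * 1000))      # 10 KB/s  -> eligible to merge
```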

**Read Hot Shard Detection Flow:**

1. If `READ_SAMPLING_ENABLED`, storage servers sample read operations
2. When reads exceed `SHARD_MAX_READ_OPS_PER_KSEC` or read density exceeds `SHARD_MAX_READ_DENSITY_RATIO`
3. And read bandwidth is above `SHARD_READ_HOT_BANDWIDTH_MIN_PER_KSECONDS`
4. The shard is marked as read-hot
5. DD considers top-K hot shards (`READ_REBALANCE_SHARD_TOPK`) for rebalancing
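
One way to read steps 2–4 as a predicate. This is a sketch only: the density-ratio threshold and bandwidth floor defaults are not listed here, so the numbers below are invented, and the real check lives in the storage server's metrics code.

```python
# Invented thresholds; only the structure of the check mirrors the flow above.
SHARD_MAX_READ_OPS_PER_KSEC = 45_000_000
SHARD_MAX_READ_DENSITY_RATIO = 8.0                           # invented stand-in
SHARD_READ_HOT_BANDWIDTH_MIN_PER_KSECONDS = 1_000_000_000    # invented stand-in

def is_read_hot(read_ops_per_ksec, read_density_ratio, read_bytes_per_ksec):
    if read_bytes_per_ksec < SHARD_READ_HOT_BANDWIDTH_MIN_PER_KSECONDS:
        return False   # too little traffic for the ratio to be meaningful
    return (read_ops_per_ksec > SHARD_MAX_READ_OPS_PER_KSEC
            or read_density_ratio > SHARD_MAX_READ_DENSITY_RATIO)

print(is_read_hot(50_000_000, 1.5, 2_000_000_000))   # True: op rate exceeds the cap
print(is_read_hot(1_000_000, 12.0, 500_000_000))     # False: below the bandwidth floor
```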

---

## References

[1] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/DDShardTracker.actor.cpp#L875
[2] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/DDShardTracker.actor.cpp#L884
[3] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/DDShardTracker.actor.cpp#L879
[4] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/StorageMetrics.actor.cpp#L472
[5] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/CommitProxyServer.actor.cpp#L742
[6] https://github.com/apple/foundationdb/blob/7.3.0/design/data-distributor-internals.md#L129
[7] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/storageserver.actor.cpp#L1973
[8] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/StorageMetrics.actor.cpp#L286
[9] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/StorageMetrics.actor.cpp#L302
[10] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/StorageMetrics.actor.cpp#L332
[11] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/DDShardTracker.actor.cpp#L890
[12] https://github.com/apple/foundationdb/blob/7.3.0/design/data-distributor-internals.md#L146
[13] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/Ratekeeper.actor.cpp#L933
[14] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/Ratekeeper.actor.cpp#L973
[15] https://github.com/apple/foundationdb/blob/7.3.0/fdbserver/Ratekeeper.actor.cpp#L1440
[16] https://github.com/apple/foundationdb/blob/7.3.0/fdbclient/ServerKnobs.cpp#L872
