Hot shard design documentation #12413

dlambrig · 2025-10-02T19:51:03Z

Hot shard documentation

[X] - explains what a hot shard is
[X] - explains how to avoid a hot shard (e.g. structure data and/or access patterns in a certain way)
[ ] - explains the server management side options for fixing a hot shard, should it arise. Please emphasize that the options are rather bad / near non-existent and hot shards should be avoided by design for this reason

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

foundationdb-ci · 2025-10-02T20:27:54Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 6c10d8b
Duration 0:36:41
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:31:40Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: 7094c6e
Duration 0:34:29
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:39:30Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: 6c10d8b
Duration 0:48:15
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:39:35Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 6c10d8b
Duration 0:48:20
Result: ❌ FAILED
Error: Error while executing command: ctest -j ${NPROC} --no-compress-output -T test --output-on-failure. Reason: exit status 8
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:43:18Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 6c10d8b
Duration 0:52:03
Result: ❌ FAILED
Error: Error while executing command: docker build --label "org.foundationdb.version=${FDB_VERSION}" --label "org.foundationdb.build_date=${BUILD_DATE}" --label "org.foundationdb.commit=${COMMIT_SHA}" --progress plain --build-arg FDB_VERSION="${FDB_VERSION}" --build-arg FDB_LIBRARY_VERSIONS="${FDB_VERSION}" --build-arg FDB_WEBSITE="${FDB_WEBSITE}" --tag foundationdb/foundationdb-kubernetes-sidecar:${FDB_VERSION}-${COMMIT_SHA}-1 --file Dockerfile --target foundationdb-kubernetes-sidecar .. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2025-10-02T20:44:45Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: 7094c6e
Duration 0:47:32
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:47:45Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: 7094c6e
Duration 0:50:29
Result: ❌ FAILED
Error: Error while executing command: docker build --label "org.foundationdb.version=${FDB_VERSION}" --label "org.foundationdb.build_date=${BUILD_DATE}" --label "org.foundationdb.commit=${COMMIT_SHA}" --progress plain --build-arg FDB_VERSION="${FDB_VERSION}" --build-arg FDB_LIBRARY_VERSIONS="${FDB_VERSION}" --build-arg FDB_WEBSITE="${FDB_WEBSITE}" --tag foundationdb/ycsb:${FDB_VERSION}-${COMMIT_SHA} --file Dockerfile --target ycsb .. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2025-10-02T20:57:02Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 6c10d8b
Duration 1:05:47
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T20:57:24Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: 7094c6e
Duration 1:00:10
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2025-10-02T21:00:10Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: 7094c6e
Duration 1:02:59
Result: ✅ SUCCEEDED
Error: N/A
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

jzhou77

This looks great!

I think it's also worth documenting that the max shard size is dynamically calculated in getMaxShardSize:

int64_t getMaxShardSize(double dbSizeEstimate) {
    int64_t size = std::min((SERVER_KNOBS->MIN_SHARD_BYTES + (int64_t)std::sqrt(std::max<double>(dbSizeEstimate, 0)) *
                                                                 SERVER_KNOBS->SHARD_BYTES_PER_SQRT_BYTES) *
                                SERVER_KNOBS->SHARD_BYTES_RATIO,
                            (int64_t)SERVER_KNOBS->MAX_SHARD_BYTES);

i.e., a formula of (MIN_SHARD_BYTES + sqrt(DB_size) * SHARD_BYTES_PER_SQRT_BYTES) * SHARD_BYTES_RATIO, and then capped at MAX_SHARD_BYTES (the code takes the std::min of the two).
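
To make the formula concrete, here is a minimal standalone sketch of the same arithmetic. The `ShardSizeKnobs` struct and the name `maxShardSizeSketch` are hypothetical stand-ins for `SERVER_KNOBS` and `getMaxShardSize`, and any knob values passed in are illustrative, not the real defaults.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the four SERVER_KNOBS fields the formula reads.
// Values supplied by callers are illustrative, not FoundationDB defaults.
struct ShardSizeKnobs {
    int64_t minShardBytes;          // MIN_SHARD_BYTES
    int64_t shardBytesPerSqrtBytes; // SHARD_BYTES_PER_SQRT_BYTES
    int64_t shardBytesRatio;        // SHARD_BYTES_RATIO
    int64_t maxShardBytes;          // MAX_SHARD_BYTES
};

// (MIN_SHARD_BYTES + sqrt(dbSize) * SHARD_BYTES_PER_SQRT_BYTES) * SHARD_BYTES_RATIO,
// capped at MAX_SHARD_BYTES.
int64_t maxShardSizeSketch(double dbSizeEstimate, const ShardSizeKnobs& knobs) {
    assert(knobs.shardBytesRatio > 0 && knobs.maxShardBytes > 0);
    // A negative estimate is clamped to zero before taking the square root.
    int64_t sqrtBytes = static_cast<int64_t>(std::sqrt(std::max<double>(dbSizeEstimate, 0)));
    int64_t uncapped =
        (knobs.minShardBytes + sqrtBytes * knobs.shardBytesPerSqrtBytes) * knobs.shardBytesRatio;
    return std::min(uncapped, knobs.maxShardBytes);
}
```

With the illustrative knobs `{100, 10, 2, 10000}`, a database estimate of 10000 bytes gives (100 + 100 * 10) * 2 = 2200, while a much larger estimate hits the 10000-byte cap. The point for the doc is that the split threshold grows with the square root of the database size until the cap takes over.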

jzhou77 · 2025-10-03T19:25:40Z

design/hotshard.md

+
+A shard is "hot" when it absorbs a disproportionate share of the cluster's read or write workload, driving CPU or bandwidth saturation on its replica servers [6].
+
+Storage servers continually sample bytes-read, ops-read, and shard sizes; a shard whose read bandwidth density exceeds configured thresholds is tagged as read-hot so the distributor can identify the offending range [4].


There is also write-hot. Right now, only a write-hot shard triggers a split.

jzhou77 · 2025-10-03T19:29:49Z

design/hotshard.md

+- Randomize key prefixes so consecutive writes land on different shard ranges; for example, hash user IDs or add a short random salt before the natural key. This way inserts will scatter instead of piling onto one shard.
+- If you need to store a counter, consider sharding it across N disjoint keys (e.g., counter/<bucket>/…) and aggregating in clients or background jobs; this keeps the per-key mutation rate below the commit proxy’s hot-shard throttle.
+- If you are storing append-only logs, split them into multiple partitions (such as log/<partition>/<ts>), rotating partitions over time rather than funneling through a single key path.
+- Avoid “read-modify-write” cycles. Use FDB's atomic operations (like ADD) when possible, and throttle/queue work in clients so they don’t stampede on that hot key.


s/ADD/ATOMIC_ADD/
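
The key-layout bullets in the hunk above could be illustrated with a small sketch like the following. It only builds key strings and makes no FoundationDB calls; the helper names `fnv1a`, `saltedKey`, and `counterShardKey`, and the choice of FNV-1a as the hash, are assumptions for illustration rather than anything the design doc prescribes.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a (32-bit): a small deterministic hash, used here only to pick a bucket.
uint32_t fnv1a(const std::string& s) {
    uint32_t h = 2166136261u;
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;
    }
    return h;
}

// Prefix the natural key with a hash-derived bucket so that adjacent natural
// keys (user1, user2, ...) are spread over `numBuckets` key ranges. Because the
// bucket is derived from the key itself, readers can recompute it for lookups.
std::string saltedKey(const std::string& naturalKey, uint32_t numBuckets) {
    assert(numBuckets > 0);
    return std::to_string(fnv1a(naturalKey) % numBuckets) + "/" + naturalKey;
}

// Key for one slice of a counter split across disjoint keys, following the
// counter/<bucket>/... layout from the bullet above.
std::string counterShardKey(const std::string& counterName, uint32_t bucket) {
    return "counter/" + std::to_string(bucket) + "/" + counterName;
}
```

In this sketch a writer would pick a bucket at random for each increment and apply an atomic add to `counterShardKey(name, bucket)`, and a reader would sum all buckets. The trade-off is that reads touch N keys instead of one, so N should stay small.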

jzhou77 · 2025-10-03T19:33:27Z

design/hotshard.md

+- Master switch to enable/disable hot shard throttling at commit proxies
+- When enabled, commit proxies track hot shards and reject transactions writing to them
+- Disabled by default as the feature is experimental
+- Location: fdbclient/ServerKnobs.cpp:999


We can delete this, since the actual line number will likely change. Or use a permanent link instead.

jzhou77 · 2025-10-03T19:33:36Z

design/hotshard.md

+**`HOT_SHARD_THROTTLING_EXPIRE_AFTER`** (double, default: `3.0` seconds)
+- Duration after which a throttled hot shard expires and is removed from the throttle list
+- Prevents indefinite throttling if load decreases
+- Location: fdbclient/ServerKnobs.cpp:1000


jzhou77 · 2025-10-03T19:33:51Z

design/hotshard.md

+**`HOT_SHARD_THROTTLING_TRACKED`** (int64_t, default: `1`)
+- Maximum number of hot shards to track and throttle per storage server
+- Limits the size of the hot shard list to prevent excessive memory usage
+- Location: fdbclient/ServerKnobs.cpp:1001


jzhou77 · 2025-10-03T19:33:58Z

design/hotshard.md

+**`HOT_SHARD_MONITOR_FREQUENCY`** (double, default: `5.0` seconds)
+- How often Ratekeeper queries storage servers for hot shard information
+- Lower values provide faster hot shard detection but increase RPC overhead
+- Location: fdbclient/ServerKnobs.cpp:1002
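
How the expiry and tracking-limit knobs quoted above interact may be easier to see in a toy model. The class below is a hypothetical sketch, not FoundationDB's implementation: `HotShardThrottleList`, its methods, and the use of plain `double` timestamps are all assumptions. `expireAfterSeconds` plays the role of `HOT_SHARD_THROTTLING_EXPIRE_AFTER` and `maxTracked` the role of `HOT_SHARD_THROTTLING_TRACKED`.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Toy model only: a throttle list whose entries expire after a fixed duration
// and whose size is bounded. Not FoundationDB's implementation.
class HotShardThrottleList {
public:
    HotShardThrottleList(double expireAfterSeconds, std::size_t maxTracked)
      : expireAfter_(expireAfterSeconds), maxTracked_(maxTracked) {}

    // Record (or refresh) a hot shard at time `now`.
    // Returns false when the shard is new and the list is already full.
    bool track(const std::string& shardBegin, double now) {
        expire(now);
        if (expiry_.count(shardBegin) == 0 && expiry_.size() >= maxTracked_) {
            return false;
        }
        expiry_[shardBegin] = now + expireAfter_;
        return true;
    }

    // True while the shard has an entry that has not yet expired.
    bool isThrottled(const std::string& shardBegin, double now) {
        expire(now);
        return expiry_.count(shardBegin) > 0;
    }

private:
    // Drop every entry whose expiry time has been reached.
    void expire(double now) {
        for (auto it = expiry_.begin(); it != expiry_.end();) {
            if (it->second <= now) {
                it = expiry_.erase(it);
            } else {
                ++it;
            }
        }
    }

    double expireAfter_;
    std::size_t maxTracked_;
    std::map<std::string, double> expiry_; // shard begin key -> expiry time
};
```

With `expireAfterSeconds = 3.0` and `maxTracked = 1`, a second hot shard cannot be tracked until the first entry expires. That shows why a small tracking limit combined with a short expiry means only the hottest range is throttled at any moment.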


dlambrig requested a review from jzhou77 October 2, 2025 19:52

Hot shard design documentation

7094c6e

dlambrig force-pushed the hotshard-doc branch from 6c10d8b to 7094c6e Compare October 2, 2025 19:57

jzhou77 requested changes Oct 3, 2025

dlambrig requested a review from nicmorales9 October 4, 2025 20:43

