Save to Disk in Bio Thread - draft #1784

Draft: nitaicaro wants to merge 10 commits into unstable from replica-save-to-disk-in-bio-thread

Conversation

@nitaicaro (Contributor) commented Feb 26, 2025

Introduction

This PR introduces a new feature that enables replicas to perform disk-based synchronization on a dedicated background thread (Bio thread). Benchmarking results demonstrate significant improvements in synchronization duration. In extreme cases, this optimization allows syncs that would have previously failed to succeed.

This is an early draft pull request, as requested by the maintainers, to allow for review of the overall structure and approach before the full implementation is completed.

Problem Statement

Some administrators prefer the disk-based full synchronization mode for replicas. In this mode, the replica keeps serving clients from its existing dataset while it downloads the RDB file to disk.

Valkey's predominantly single-threaded nature creates a challenge: serving client read requests and saving data from the socket to disk are not truly concurrent operations. In practice, the replica alternates between processing client requests and replication data, leading to inefficient behavior and prolonged sync durations, especially under high load.

Proposed Solution

To address this, the solution offloads the task of downloading the RDB file from the socket to a background thread. This allows the main thread to focus exclusively on handling client read requests while the background thread handles communication with the primary.
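As a rough illustration of the hand-off pattern (this is a hypothetical, self-contained sketch, not the actual implementation; the PR itself queues a BIO_SAVE_TO_DISK job on the existing Bio machinery, and all names below are illustrative):

/* Illustrative sketch only: hypothetical names, not Valkey's bio.c API.
 * The main thread stops watching the primary's socket and hands the
 * blocking "socket -> RDB file" copy to a dedicated background thread,
 * leaving the event loop free to serve client reads. */
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

typedef struct {
    int sock_fd;      /* connection to the primary, already past the handshake */
    int rdb_fd;       /* destination RDB temp file */
    atomic_int done;  /* set to 1 by the worker when the transfer finishes */
} rdb_transfer_job;

static void *rdb_transfer_thread(void *arg) {
    rdb_transfer_job *job = arg;
    char buf[16 * 1024];
    ssize_t n;
    /* Blocking reads are fine here: only this thread stalls, not the event
     * loop. A real implementation would also set a receive timeout and
     * perform abort checks. */
    while ((n = read(job->sock_fd, buf, sizeof(buf))) > 0) {
        if (write(job->rdb_fd, buf, n) != n) break;
    }
    atomic_store(&job->done, 1);
    return NULL;
}

/* Called from the main thread once the full-sync transfer begins. After this,
 * the event loop must not register a read handler for job->sock_fd; the
 * background thread owns that socket until job->done becomes 1. */
static pthread_t start_rdb_transfer(rdb_transfer_job *job) {
    pthread_t tid;
    pthread_create(&tid, NULL, rdb_transfer_thread, job);
    return tid;
}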

Benchmarking Results

Potential for Improvement

In theory, this optimization can lead to unbounded improvement in sync duration. By eliminating the competition between client read events and the socket events used for downloading the RDB from the primary, sync time becomes independent of load: the main thread handles only client reads, while the background thread focuses on the RDB download, so the system performs consistently even under high load.

The full valkey-benchmark commands can be found in the appendix below.

Sync Duration with Feature Disabled (times in seconds)

16 threads, 64 clients: 172 seconds
32 threads, 128 clients: 436 seconds
48 threads, 192 clients: 710 seconds

Sync Duration with Feature Enabled (times in seconds)

16 threads, 64 clients: 33 seconds (80.8% improvement)
32 threads, 128 clients: 33 seconds (92.4% improvement)
48 threads, 192 clients: 33 seconds (95.3% improvement)
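
As a rough sanity check on the "independent of load" claim (assuming the transferred RDB is on the order of the 6 GB dataset), a flat 33-second sync corresponds to roughly 6 GB / 33 s ≈ 180 MB/s, suggesting the transfer is now bound by network/disk throughput rather than by client load. The actual RDB file is compressed and somewhat smaller, so treat this only as an order-of-magnitude estimate.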

[Figure: sync duration with the feature disabled vs. enabled under increasing read load]

Alternative Solutions Considered

IO Threads
IO threads do not have an advantage over Bio threads in this case: the save-to-disk job is rare (most likely no more than a few executions in a replica's lifetime), and there is never more than one execution at a time. A Bio thread makes more sense for a single, slow, long-running operation.

io_uring
For a single connection, io_uring doesn't provide as much of a performance boost because the primary advantage comes from batching many I/O operations together to reduce syscall overhead. With just one connection, we won't have enough operations to benefit significantly from these optimizations.

Prioritizing the primary's socket in the event loop
This approach would help, but less effectively than using a Bio thread: the main thread would still need to spend time handling client read requests, which limits the benefit. It could be more useful on smaller instance types with limited CPU cores.

Appendix:

Benchmarking Setup

  • Client machine: AWS c5a.16xlarge
  • Server machines: AWS c5a.2xlarge
# Step 1: Fill the primary and replica DBs with 6GB of data:

./valkey-benchmark -h <host> -p <port> -l -d 128 -t set -r 30000000 --threads 16 -c 64

# Step 2: Initiate heavy read load on the replica:

./valkey-benchmark -h <host> -p <port> -t get -r 30000000 --threads <t> -c <c> -n 1000000000 -P <P>

# Step 3: Enable/disable the config controlling the new feature:

./valkey-cli -h <host> -p <port> config set replica-save-to-disk-in-bio-thread <yes/no>

# Step 4: Initiate sync:

./valkey-cli -h <replica host> -p <replica port> replicaof <primary host> <primary port>

@xbasel xbasel self-requested a review February 26, 2025 10:34
@nitaicaro nitaicaro changed the title from "save to disk in bio thread - draft" to "Save to Disk in Bio Thread - draft" on Feb 26, 2025

codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 87.05882% with 22 lines in your changes missing coverage. Please review.

Project coverage is 71.02%. Comparing base (aa88453) to head (c0129a9).

Files with missing lines   Patch %   Lines
src/bio.c                  82.10%    17 Missing ⚠️
src/replication.c          93.24%    5 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1784      +/-   ##
============================================
- Coverage     71.09%   71.02%   -0.08%     
============================================
  Files           123      123              
  Lines         65671    65816     +145     
============================================
+ Hits          46692    46748      +56     
- Misses        18979    19068      +89     
Files with missing lines   Coverage Δ
src/config.c               78.39% <ø> (ø)
src/server.c               87.54% <100.00%> (ø)
src/server.h               100.00% <ø> (ø)
src/replication.c          86.35% <93.24%> (-0.92%) ⬇️
src/bio.c                  83.04% <82.10%> (-1.41%) ⬇️

... and 11 files with indirect coverage changes


@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch 5 times, most recently from 2d9c776 to 466a0ca on March 4, 2025 14:20
@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch 5 times, most recently from 6bcc6be to c67c618 on March 11, 2025 11:09
@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch from c67c618 to 0a14eff on March 11, 2025 11:20
@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch from 2b6e9a5 to ad1b8fc on March 11, 2025 12:09
@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch from ad1b8fc to f1418b1 on March 11, 2025 12:15
@xbasel (Member) left a comment

Initial comments.

@@ -2649,6 +2678,12 @@ void freePendingReplDataBuf(void) {
server.pending_repl_data.len = 0;
}

void receiveRDBinBioThread(connection *conn) {
serverLog(LL_NOTICE, "Replica main thread creating Bio thread to save RDB to disk");
connSetReadHandler(conn, NULL);
Member:

What happens to the write handler? Is the main thread supposed to do any writes in the meantime?

Contributor Author:

Good question: the write handler was never set to anything on the replica's side, see:
https://github.com/valkey-io/valkey/blob/unstable/src/replication.c#L3517

That line is executed during the sync handshake. After the sync is done and we enter steady state, we initialize it: https://github.com/valkey-io/valkey/blob/unstable/src/replication.c#L4380

@@ -3918,6 +3963,10 @@ void replicationAbortSyncTransfer(void) {
cleanupTransferResources();
}

void waitForDiskSaveBioThreadComplete(void) {
while (bioPendingJobsOfType(BIO_SAVE_TO_DISK));
Member:

This is a blocking operation and should be avoided in the main thread. Although I think something similar is already done in the main thread, via bioDrainWorker.

Contributor Author:

We have to do this to prevent a race condition where a new sync starts before the previous Bio thread finishes.
Busy-waiting is also how it's done for io-threads, see waitForClient()

Member:

(1) Can bioDrainWorker be used?
(2) Did you consider timing out the operation and shutting down?

Contributor Author (@nitaicaro, Mar 18, 2025):

Good point - since we switched to blocking mode on the bio thread we can have a deadlock here. I'll remove this, and then the timeouts of read() will guarantee that we eventually reach shouldAbortSave() which will free the main thread.

Edit:
Actually we set a timeout so we cannot block forever:
connRecvTimeout(conn, server.repl_syncio_timeout * 1000);

So we are guaranteed to eventually reach shouldAbortSave() which will free the main thread from busy-waiting.

Member:

Blocking the main thread forever is bad practice, even if the child thread is logically guaranteed to finish.
Maybe timing it out and crashing is the right approach.

Contributor Author:

Your suggestion to switch to bioDrain seems to be the right solution. Updated.
About blocking forever, this seems to be a risk in other places in the code too, whether they busy-wait with bioDrain or with bioPendingJobsOfType. I suggest we open an issue to address this more comprehensively.
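
For context, here is a generic sketch of the difference between the two waiting strategies discussed in this thread, a busy-wait on a pending-jobs counter versus a condition-variable drain (illustrative names only; this is not the actual bio.c implementation):

/* Generic sketch of the two waiting strategies (illustrative names). */
#include <pthread.h>

static pthread_mutex_t job_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t job_done_cond = PTHREAD_COND_INITIALIZER;
static int pending_jobs = 0; /* protected by job_mutex */

/* Strategy 1: busy-wait on the pending-jobs counter (what the first draft
 * did); the main thread keeps re-checking and burns a CPU core. */
static void wait_spin(void) {
    for (;;) {
        pthread_mutex_lock(&job_mutex);
        int n = pending_jobs;
        pthread_mutex_unlock(&job_mutex);
        if (n == 0) return;
    }
}

/* Strategy 2: drain-style wait; blocks on a condition variable that the
 * worker signals as its queue empties. */
static void wait_drain(void) {
    pthread_mutex_lock(&job_mutex);
    while (pending_jobs > 0) pthread_cond_wait(&job_done_cond, &job_mutex);
    pthread_mutex_unlock(&job_mutex);
}

/* The worker calls this after completing each job. */
static void job_finished(void) {
    pthread_mutex_lock(&job_mutex);
    pending_jobs--;
    pthread_cond_broadcast(&job_done_cond);
    pthread_mutex_unlock(&job_mutex);
}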

Comment on lines +374 to +375
error = 1;
goto done;
Member:

Why not create an error label instead of using a variable?

Contributor Author (@nitaicaro, Mar 18, 2025):

I think we cannot have more than one label inside the else if (job_type == BIO_RDB_SAVE) block.
If we have

done:
...
error:
...

then error would always get executed after done (we cannot return after done since we need to reach the end of the iteration:

zfree(job);

/* Lock again before reiterating the loop, if there are no longer
 * jobs to process we'll block again in pthread_cond_wait(). */
pthread_mutex_lock(&bio_mutex[worker]);
listDelNode(bio_jobs[worker], ln);
bio_jobs_counter[job_type]--;
pthread_cond_signal(&bio_newjob_cond[worker]);

)

I guess we can do

done:
...
goto really_done;
error:
...
really_done:

But I'm not sure it makes things better

goto done;
} else if (ret == INSPECT_BULK_PAYLOAD_PRIMARY_PING) {
atomic_store_explicit(&server.repl_transfer_lastio, atomic_load_explicit(&server.unixtime, memory_order_relaxed), memory_order_relaxed);
memset(buf, 0, PROTO_IOBUF_LEN);
Member:

Why is this being done?

Contributor Author:

This is the purpose of a ping from the primary: to refresh the last_io field in order to avoid a timeout.

* We'll restore it when the RDB is received. */
connBlock(conn);
connRecvTimeout(conn, server.repl_syncio_timeout * 1000);
do {
Member:

Why do we need two loops in the code?

Contributor Author:

We are basically imitating what readSyncBulkPayload does for normal replication. We try to read the bulk payload length from the primary. Ideally one pass would be enough (no need for a loop), but the primary is sometimes not fast enough in sending the length, so it periodically sends pings (newlines) until it's ready. We have to keep looping while we receive these pings.

The second loop goes on until the primary finishes sending all the data (we see an EOF or the amount we read is equal to the previously passed payload length).

So the first loop is conditioned on receiving pings, the second one on sync completion.
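
A simplified, self-contained sketch of that two-loop structure (illustrative only: the function and helper names here are hypothetical, and the real code goes through Valkey's connection layer, handles the EOF-marked payload format, and periodically checks shouldAbortSave()):

/* Sketch: loop 1 skips keep-alive newlines until the bulk length arrives,
 * loop 2 copies the payload from the socket to the RDB file. */
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: read one '\n'-terminated line, byte by byte. */
static ssize_t read_line(int fd, char *buf, size_t len) {
    size_t i = 0;
    while (i + 1 < len) {
        ssize_t n = read(fd, buf + i, 1);
        if (n <= 0) return n;
        if (buf[i++] == '\n') break;
    }
    buf[i] = '\0';
    return (ssize_t)i;
}

static int receive_rdb(int sock_fd, int rdb_fd) {
    char line[128];
    long long payload_len = -1;

    /* Loop 1: wait for the bulk length. The primary may send bare newlines
     * as keep-alive pings until the payload is ready; keep looping on them
     * (the real code also refreshes the last-io timestamp here). */
    while (payload_len < 0) {
        if (read_line(sock_fd, line, sizeof(line)) <= 0) return -1;
        if (line[0] == '\n' || line[0] == '\r') continue; /* ping, retry */
        if (line[0] != '$') return -1;                    /* protocol error */
        payload_len = strtoll(line + 1, NULL, 10);
    }

    /* Loop 2: read until the whole payload has been received. */
    char buf[16 * 1024];
    long long received = 0;
    while (received < payload_len) {
        size_t want = sizeof(buf);
        if ((long long)want > payload_len - received) want = (size_t)(payload_len - received);
        ssize_t n = read(sock_fd, buf, want);
        if (n <= 0) return -1;                            /* EOF or error */
        if (write(rdb_fd, buf, n) != n) return -1;
        received += n;
    }
    return 0;
}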

@nitaicaro nitaicaro force-pushed the replica-save-to-disk-in-bio-thread branch from 298e925 to ca16a76 on March 18, 2025 13:30