Conversation

@kim kim commented Jun 26, 2025

Split the websocket stream into send and receive halves, and spawn a
new tokio task to handle the sending. Also move message serialization +
compression to a blocking task if the message appears to be large. A rough
sketch of the resulting send path follows the list below.

This addresses two issues:

  1. The `select!` loop is not blocked on sending messages, and can thus
    react to auxiliary events. Namely, when a module exits, we want to
    terminate the connection as soon as possible in order to release any
    database handles.

  2. Large outgoing messages should not occupy tokio worker threads, in
    particular when there are a large number of clients receiving large
    initial updates.
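
A minimal sketch of the resulting send path, assuming a tokio-tungstenite sink and a hypothetical `OutgoingMessage` type (the threshold and the helper methods are made up, not the actual SpacetimeDB types):

```rust
use futures::{Sink, SinkExt};
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

const LARGE_MSG_THRESHOLD: usize = 64 * 1024; // assumed cutoff, not the real value

/// Placeholder for the real outgoing message type.
struct OutgoingMessage(Vec<u8>);

impl OutgoingMessage {
    fn estimated_len(&self) -> usize {
        self.0.len()
    }
    fn serialize_and_compress(self) -> Vec<u8> {
        self.0 // stand-in for the real serialization + compression
    }
}

/// Drain the outgoing queue on its own task so the main `select!` loop
/// never blocks on a slow client.
async fn run_sender<S>(mut sink: S, mut outgoing: mpsc::Receiver<OutgoingMessage>)
where
    S: Sink<Message> + Unpin,
{
    while let Some(msg) = outgoing.recv().await {
        // Serialize + compress large messages on a blocking thread so they
        // don't occupy a tokio worker.
        let bytes = if msg.estimated_len() > LARGE_MSG_THRESHOLD {
            tokio::task::spawn_blocking(move || msg.serialize_and_compress())
                .await
                .expect("serialization task panicked")
        } else {
            msg.serialize_and_compress()
        };
        if sink.send(Message::Binary(bytes.into())).await.is_err() {
            break; // peer gone: drop the connection
        }
    }
}
```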


EDIT: This patch started out to address the above issues, but evolved into a rewrite as more issues were discovered (a skeleton of the resulting loop is sketched after the list below). Namely:

  • logic is split into multiple functions that can be tested in isolation
  • remove unbounded ingress buffer in favor of no buffer (i.e. applying backpressure to clients in case of slow processing)
  • use an idle timer on any received packet, instead of waiting for pongs that are potentially stuck at the end of the line
  • time-bound waiting for the close handshake to complete (server-initiated close)
  • crucially, do not evaluate outstanding reducer call messages when a close handshake has already been initiated
  • ensure the connection is dropped when there is an error sending or receiving
  • ensure all tasks exit when the loop exits
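
To make the resulting control flow concrete, here is a skeleton of the loop under the assumptions above; `Frame`, the closures, and the timeout values are illustrative placeholders, not the actual SpacetimeDB API:

```rust
use std::time::Duration;

use futures::{Stream, StreamExt};
use tokio::time::{sleep, Instant};

const IDLE_TIMEOUT: Duration = Duration::from_secs(30); // assumed value
const CLOSE_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(5); // assumed value

/// Skeleton of the actor loop: no inbound buffer (backpressure), an idle timer
/// reset by *any* received frame, a bounded wait for the close handshake, and
/// a hard stop on any receive error. Frame handling is synchronous here for
/// brevity.
async fn actor_loop<In, Frame, Err>(
    mut inbound: In,
    mut module_exited: tokio::sync::oneshot::Receiver<()>,
    mut handle_frame: impl FnMut(Frame),
    is_close_ack: impl Fn(&Frame) -> bool,
    mut start_close: impl FnMut(),
) where
    In: Stream<Item = Result<Frame, Err>> + Unpin,
{
    let idle = sleep(IDLE_TIMEOUT);
    tokio::pin!(idle);
    let mut closing = false;
    loop {
        tokio::select! {
            // The next frame is only polled once the previous one was handled,
            // so slow processing pushes back on the client (no inbound buffer).
            frame = inbound.next() => match frame {
                Some(Ok(frame)) => {
                    if closing {
                        // Close handshake underway: stop evaluating reducer
                        // calls, just wait for the acknowledgement.
                        if is_close_ack(&frame) {
                            break;
                        }
                    } else {
                        // Any traffic (data, ping, pong) counts as liveness.
                        idle.as_mut().reset(Instant::now() + IDLE_TIMEOUT);
                        handle_frame(frame);
                    }
                }
                // Receive error or EOF: drop the connection.
                _ => break,
            },
            // Module exited: initiate a close to release database handles,
            // but only wait a bounded time for the handshake to complete.
            _ = &mut module_exited, if !closing => {
                closing = true;
                start_close();
                idle.as_mut().reset(Instant::now() + CLOSE_HANDSHAKE_TIMEOUT);
            }
            // Either the idle window or the close-handshake deadline elapsed.
            _ = &mut idle => break,
        }
    }
    // Channels and handles dropped here make the spawned send task exit too.
}
```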

API and ABI breaking changes

Because there is no inbound buffering, the queue length metric will not be updated.

Expected complexity level and risk

4 - The state transitions remain hard to follow.

Testing

  • Ran a stress test with many clients and large initial updates,
    and observed no hangs / delays (which I did observe before this patch).
    In reconnection scenarios, all clients were disconnected promptly, but
    could reconnect almost immediately.
  • Added unit-level tests

@kim kim requested review from Centril, gefjon and jsdt June 26, 2025 18:02
@gefjon gefjon left a comment


I'd like to figure out what's going on with the SerializeBuffer and fix it before merging, but otherwise this looks good to me.

kim added 2 commits June 27, 2025 10:07
Also close the messages queue after the close went through.
Accordingly, closed and exited are the same -- we can just drop incoming
messages when closed.
kim commented Jun 27, 2025

Updated to:

  • Reclaim the serialize buffer
  • Not send any more data after sending a Close frame (as mandated by the RFC)

I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
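
For reference, a minimal sketch of the "no data after Close" rule in the sender task (per RFC 6455, section 5.5.1); the message type is tungstenite's, the rest is illustrative:

```rust
use futures::{Sink, SinkExt};
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

/// Forward queued frames to the socket, but stop sending once a Close frame
/// has gone out, as the RFC requires.
async fn forward<S>(mut sink: S, mut outgoing: mpsc::Receiver<Message>)
where
    S: Sink<Message> + Unpin,
{
    let mut close_sent = false;
    while let Some(msg) = outgoing.recv().await {
        if close_sent {
            // Anything still queued after our Close frame is dropped.
            continue;
        }
        close_sent = matches!(msg, Message::Close(_));
        if sink.send(msg).await.is_err() {
            break; // send error: tear down the connection
        }
    }
}
```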

@jsdt jsdt left a comment


I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, and I'd also like to try writing some tests so we can trigger some of these tricky cases.

kim added 2 commits June 29, 2025 11:21
Also fixes the actual resource hog, which is that the ws_actor never
terminated because all receive errors were ignored.
kim commented Jun 30, 2025

Updated to:

  • split into smaller functions that mainly transform Streams, for readability and testability (a sketch of that style follows below)
  • actually terminate the actor loop when recv from the socket returns an error
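
A rough illustration of that style, with made-up stage names: each stage is a plain Stream-to-Stream function, so it can be exercised in a unit test with `futures::stream::iter`:

```rust
use futures::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

/// One pipeline stage: keep only data payloads, dropping control frames
/// (ping/pong/close are dealt with by other stages).
fn data_frames<S>(frames: S) -> impl Stream<Item = Vec<u8>>
where
    S: Stream<Item = Message>,
{
    frames.filter_map(|msg| async move {
        match msg {
            Message::Binary(b) => Some(b.to_vec()),
            Message::Text(t) => Some(t.as_bytes().to_vec()),
            _ => None,
        }
    })
}
```

In a test, something like `data_frames(futures::stream::iter(input)).collect::<Vec<_>>().await` exercises the stage without a real socket.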

@kim kim changed the title client-api: Move websocket sender to its own tokio task client-api: Rewrite websocket loop Jun 30, 2025
kim commented Jun 30, 2025

Updated to:

  • consider that buffer reclamation can fail if the socket is already closed

  • re-introduce spawning the send loop

    This seems to be necessary in order to guarantee timely release of the database.
    I'm considering spawning the receive end, too, so that we can get rid of the unbounded buffer and apply backpressure to clients instead.

kim commented Jun 30, 2025

Updated to:

  • spawn the receive end, too

@bfops bfops added the release-any To be landed in any release window label Jun 30, 2025
kim added 8 commits July 1, 2025 11:56
Pong frames sit in line with previously sent messages, and so may not be
received in time if the server is backlogging.

We also want to time out the connection in "cable unplugged" scenarios,
where the kernel doesn't consider the connection gone until
`tcp_keepalive_time`.
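
A minimal sketch of that idea, assuming the receive half is exposed as a plain futures Stream (the timeout value is made up):

```rust
use std::time::Duration;

use futures::{Stream, StreamExt};

const IDLE_TIMEOUT: Duration = Duration::from_secs(30); // assumed value

/// Wait for the next frame, treating *any* inbound traffic as liveness.
/// `None` means the idle window elapsed or the stream ended/errored, in
/// which case the caller drops the connection.
async fn next_or_idle<S, T, E>(ws: &mut S) -> Option<T>
where
    S: Stream<Item = Result<T, E>> + Unpin,
{
    match tokio::time::timeout(IDLE_TIMEOUT, ws.next()).await {
        Ok(Some(Ok(frame))) => Some(frame), // data, ping or pong all count
        _ => None,                          // idle timeout, EOF, or receive error
    }
}
```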
@gefjon gefjon left a comment


This is a huge improvement, many thanks. The tests are very encouraging. I've left a number of requests for new or expanded comments, as I'm wary that in a few months or years we'll end up with a tangle of logic as opaque as the previous version if multiple collaborators don't fully understand what's happening.

Because there is no inbound buffering, the queue length metric will not be updated.

Should we remove this metric? Or is that going to break our dashboards in some frustrating way?

kim commented Jul 8, 2025

Should we remove this metric?

I'd like to put this up for discussion. The effect of not buffering is that we have reduced or no visibility into situations where the database is lagging behind. We could alternatively insert a bounded queue, in which case the metric would likely report the queue as either full or empty. But maybe that's good enough?
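
If we went the bounded-queue route, it could look roughly like this (the capacity is made up, and `report_len` stands in for whatever updates the existing queue-length gauge):

```rust
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;

const INGRESS_CAPACITY: usize = 16; // assumed capacity

/// Pump frames into a bounded queue created with `mpsc::channel(INGRESS_CAPACITY)`.
/// `send` waits while the queue is full, so clients see backpressure instead of
/// the server growing an unbounded buffer.
async fn pump_ingress<S, M>(mut frames: S, queue: mpsc::Sender<M>, report_len: impl Fn(usize))
where
    S: Stream<Item = M> + Unpin,
{
    while let Some(msg) = frames.next().await {
        // With a small bound this will mostly read as either "empty" or "full".
        report_len(INGRESS_CAPACITY - queue.capacity());
        if queue.send(msg).await.is_err() {
            break; // processing side went away
        }
    }
}
```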

@gefjon gefjon left a comment


Beautiful, thanks!

@kim kim enabled auto-merge July 10, 2025 10:25
@kim kim added this pull request to the merge queue Jul 10, 2025
Merged via the queue into master with commit b63216a Jul 10, 2025
18 of 19 checks passed
@kim kim deleted the kim/ws/unblock branch July 10, 2025 11:17
mamcx pushed a commit that referenced this pull request Aug 26, 2025
Signed-off-by: Kim Altintop <[email protected]>
Co-authored-by: Phoebe Goldman <[email protected]>