Conversation

@kim kim commented Jun 26, 2025

Split the websocket stream into send and receive halves, and spawn a
new tokio task to handle the sending. Also move message serialization +
compression to a blocking task if the message appears to be large. A rough
sketch of the resulting send path follows the list below.

This addresses two issues:

  1. The `select!` loop is not blocked on sending messages, and can thus
    react to auxiliary events. Namely, when a module exits, we want to
    terminate the connection as soon as possible in order to release any
    database handles.

  2. Large outgoing messages should not occupy tokio worker threads, in
    particular when there are a large number of clients receiving large
    initial updates.
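
A minimal sketch of the resulting send path, assuming a tokio-tungstenite sink and a hypothetical `OutgoingMessage` type (the threshold and the helper methods are made up, not the actual SpacetimeDB types):

```rust
use futures::{Sink, SinkExt};
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

const LARGE_MSG_THRESHOLD: usize = 64 * 1024; // assumed cutoff, not the real value

/// Placeholder for the real outgoing message type.
struct OutgoingMessage(Vec<u8>);

impl OutgoingMessage {
    fn estimated_len(&self) -> usize {
        self.0.len()
    }
    fn serialize_and_compress(self) -> Vec<u8> {
        self.0 // stand-in for the real serialization + compression
    }
}

/// Drain the outgoing queue on its own task so the main `select!` loop
/// never blocks on a slow client.
async fn run_sender<S>(mut sink: S, mut outgoing: mpsc::Receiver<OutgoingMessage>)
where
    S: Sink<Message> + Unpin,
{
    while let Some(msg) = outgoing.recv().await {
        // Serialize + compress large messages on a blocking thread so they
        // don't occupy a tokio worker.
        let bytes = if msg.estimated_len() > LARGE_MSG_THRESHOLD {
            tokio::task::spawn_blocking(move || msg.serialize_and_compress())
                .await
                .expect("serialization task panicked")
        } else {
            msg.serialize_and_compress()
        };
        if sink.send(Message::Binary(bytes.into())).await.is_err() {
            break; // peer gone: drop the connection
        }
    }
}
```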


EDIT: This patch started out to address the above issues, but evolved into a rewrite as more issues were discovered (a skeleton of the resulting loop is sketched after the list below). Namely:

  • logic is split into multiple functions that can be tested in isolation
  • remove unbounded ingress buffer in favor of no buffer (i.e. applying backpressure to clients in case of slow processing)
  • use an idle timer on any received packet, instead of waiting for pongs that are potentially stuck at the end of the line
  • time-bound waiting for the close handshake to complete (server-initiated close)
  • crucially, do not evaluate outstanding reducer call messages when a close handshake has already been initiated
  • ensure the connection is dropped when there is an error sending or receiving
  • ensure all tasks exit when the loop exits
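
To make the resulting control flow concrete, here is a skeleton of the loop under the assumptions above; `Frame`, the closures, and the timeout values are illustrative placeholders, not the actual SpacetimeDB API:

```rust
use std::time::Duration;

use futures::{Stream, StreamExt};
use tokio::time::{sleep, Instant};

const IDLE_TIMEOUT: Duration = Duration::from_secs(30); // assumed value
const CLOSE_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(5); // assumed value

/// Skeleton of the actor loop: no inbound buffer (backpressure), an idle timer
/// reset by *any* received frame, a bounded wait for the close handshake, and
/// a hard stop on any receive error. Frame handling is synchronous here for
/// brevity.
async fn actor_loop<In, Frame, Err>(
    mut inbound: In,
    mut module_exited: tokio::sync::oneshot::Receiver<()>,
    mut handle_frame: impl FnMut(Frame),
    is_close_ack: impl Fn(&Frame) -> bool,
    mut start_close: impl FnMut(),
) where
    In: Stream<Item = Result<Frame, Err>> + Unpin,
{
    let idle = sleep(IDLE_TIMEOUT);
    tokio::pin!(idle);
    let mut closing = false;
    loop {
        tokio::select! {
            // The next frame is only polled once the previous one was handled,
            // so slow processing pushes back on the client (no inbound buffer).
            frame = inbound.next() => match frame {
                Some(Ok(frame)) => {
                    if closing {
                        // Close handshake underway: stop evaluating reducer
                        // calls, just wait for the acknowledgement.
                        if is_close_ack(&frame) {
                            break;
                        }
                    } else {
                        // Any traffic (data, ping, pong) counts as liveness.
                        idle.as_mut().reset(Instant::now() + IDLE_TIMEOUT);
                        handle_frame(frame);
                    }
                }
                // Receive error or EOF: drop the connection.
                _ => break,
            },
            // Module exited: initiate a close to release database handles,
            // but only wait a bounded time for the handshake to complete.
            _ = &mut module_exited, if !closing => {
                closing = true;
                start_close();
                idle.as_mut().reset(Instant::now() + CLOSE_HANDSHAKE_TIMEOUT);
            }
            // Either the idle window or the close-handshake deadline elapsed.
            _ = &mut idle => break,
        }
    }
    // Channels and handles dropped here make the spawned send task exit too.
}
```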

API and ABI breaking changes

Because there is no inbound buffering, the queue length metric will not be updated.

Expected complexity level and risk

4 - The state transitions remain hard to follow.

Testing

  • Ran a stress test with many clients and large initial updates,
    and observed no hangs / delays (which I did observe before this patch).
    In reconnection scenarios, all clients were disconnected promptly, but
    could reconnect almost immediately.
  • Added unit-level tests

@kim kim requested review from Centril, gefjon and jsdt June 26, 2025 18:02
@gefjon gefjon left a comment


I'd like to figure out what's going on with the SerializeBuffer and fix it before merging, but otherwise this looks good to me.

kim added 2 commits June 27, 2025 10:07
Also close the messages queue after the close went through.
Accordingly, closed and exited are the same -- we can just drop incoming
messages when closed.
kim commented Jun 27, 2025

Updated to:

  • Reclaim the serialize buffer
  • Not send any more data after sending a Close frame (as mandated by the RFC)

I think that we should also clear the message queue and cancel outstanding execution futures in the latter case, but that can be left to a future change.
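
For reference, a minimal sketch of the "no data after Close" rule in the sender task (per RFC 6455, section 5.5.1); the message type is tungstenite's, the rest is illustrative:

```rust
use futures::{Sink, SinkExt};
use tokio::sync::mpsc;
use tokio_tungstenite::tungstenite::Message;

/// Forward queued frames to the socket, but stop sending once a Close frame
/// has gone out, as the RFC requires.
async fn forward<S>(mut sink: S, mut outgoing: mpsc::Receiver<Message>)
where
    S: Sink<Message> + Unpin,
{
    let mut close_sent = false;
    while let Some(msg) = outgoing.recv().await {
        if close_sent {
            // Anything still queued after our Close frame is dropped.
            continue;
        }
        close_sent = matches!(msg, Message::Close(_));
        if sink.send(msg).await.is_err() {
            break; // send error: tear down the connection
        }
    }
}
```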

@jsdt jsdt left a comment


I looked through this for a while, and I'm still not very confident that I understand the error cases. I think we should do some bot testing with this to see what effect it has, and I'd also like to try writing some tests so we can trigger some of these tricky cases.

kim added 2 commits June 29, 2025 11:21
Also fixes the actual resource hog, which is that the ws_actor never
terminated because all receive errors were ignored.
kim commented Jun 30, 2025

Updated to:

  • split into smaller functions that mainly transform Streams, for readability and testability (a sketch of that style follows below)
  • actually terminate the actor loop when recv from the socket returns an error
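
A rough illustration of that style, with made-up stage names: each stage is a plain Stream-to-Stream function, so it can be exercised in a unit test with `futures::stream::iter`:

```rust
use futures::{Stream, StreamExt};
use tokio_tungstenite::tungstenite::Message;

/// One pipeline stage: keep only data payloads, dropping control frames
/// (ping/pong/close are dealt with by other stages).
fn data_frames<S>(frames: S) -> impl Stream<Item = Vec<u8>>
where
    S: Stream<Item = Message>,
{
    frames.filter_map(|msg| async move {
        match msg {
            Message::Binary(b) => Some(b.to_vec()),
            Message::Text(t) => Some(t.as_bytes().to_vec()),
            _ => None,
        }
    })
}
```

In a test, something like `data_frames(futures::stream::iter(input)).collect::<Vec<_>>().await` exercises the stage without a real socket.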

@kim kim changed the title client-api: Move websocket sender to its own tokio task client-api: Rewrite websocket loop Jun 30, 2025
kim commented Jun 30, 2025

Updated to:

  • consider that buffer reclamation can fail if the socket is already closed

  • re-introduce spawning the send loop

    This seems to be necessary in order to guarantee timely release of the database.
    I'm considering spawning the receive end, too, so that we can get rid of the unbounded buffer and apply backpressure to clients instead.

kim commented Jun 30, 2025

Updated to:

  • spawn the receive end, too

@bfops bfops added the release-any To be landed in any release window label Jun 30, 2025
kim added 8 commits July 1, 2025 11:56
Pong frames sit in line with previously sent messages, and so may not be
received in time if the server is backlogging.

We also want to time out the connection in "cable unplugged" scenarios,
where the kernel doesn't consider the connection gone until
`tcp_keepalive_time`.
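
A minimal sketch of that idea, assuming the receive half is exposed as a plain futures Stream (the timeout value is made up):

```rust
use std::time::Duration;

use futures::{Stream, StreamExt};

const IDLE_TIMEOUT: Duration = Duration::from_secs(30); // assumed value

/// Wait for the next frame, treating *any* inbound traffic as liveness.
/// `None` means the idle window elapsed or the stream ended/errored, in
/// which case the caller drops the connection.
async fn next_or_idle<S, T, E>(ws: &mut S) -> Option<T>
where
    S: Stream<Item = Result<T, E>> + Unpin,
{
    match tokio::time::timeout(IDLE_TIMEOUT, ws.next()).await {
        Ok(Some(Ok(frame))) => Some(frame), // data, ping or pong all count
        _ => None,                          // idle timeout, EOF, or receive error
    }
}
```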
@gefjon gefjon left a comment


This is a huge improvement, many thanks. The tests are very encouraging. I've left a number of requests for new or expanded comments, as I'm wary that in a few months or years we'll end up with a tangle of logic as opaque as the previous version if multiple collaborators don't fully understand what's happening.

Because there is no inbound buffering, the queue length metric will not be updated.

Should we remove this metric? Or is that going to break our dashboards in some frustrating way?

kim commented Jul 8, 2025

Should we remove this metric?

I'd like to put this up for discussion. The effect of not buffering is that we have reduced or no visibility into situations where the database is lagging behind. We could alternatively insert a bounded queue, in which case the metric would likely report the queue as either full or empty. But maybe that's good enough?
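
If we went the bounded-queue route, it could look roughly like this (the capacity is made up, and `report_len` stands in for whatever updates the existing queue-length gauge):

```rust
use futures::{Stream, StreamExt};
use tokio::sync::mpsc;

const INGRESS_CAPACITY: usize = 16; // assumed capacity

/// Pump frames into a bounded queue created with `mpsc::channel(INGRESS_CAPACITY)`.
/// `send` waits while the queue is full, so clients see backpressure instead of
/// the server growing an unbounded buffer.
async fn pump_ingress<S, M>(mut frames: S, queue: mpsc::Sender<M>, report_len: impl Fn(usize))
where
    S: Stream<Item = M> + Unpin,
{
    while let Some(msg) = frames.next().await {
        // With a small bound this will mostly read as either "empty" or "full".
        report_len(INGRESS_CAPACITY - queue.capacity());
        if queue.send(msg).await.is_err() {
            break; // processing side went away
        }
    }
}
```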

@gefjon gefjon left a comment


Beautiful, thanks!

@kim kim enabled auto-merge July 10, 2025 10:25
@kim kim added this pull request to the merge queue Jul 10, 2025
Merged via the queue into master with commit b63216a Jul 10, 2025
18 of 19 checks passed
@kim kim deleted the kim/ws/unblock branch July 10, 2025 11:17
mamcx pushed a commit that referenced this pull request Aug 26, 2025
Signed-off-by: Kim Altintop <[email protected]>
Co-authored-by: Phoebe Goldman <[email protected]>