
[Enhancement]: Migrate away from handling each packet in a dedicated goroutine #71

Open
jonbarrow opened this issue Nov 9, 2024 · 3 comments
Labels
awaiting-approval (Topic has not been approved or denied), enhancement (An update to an existing part of the codebase)

Comments

@jonbarrow
Member

Checked Existing

  • I have checked the repository for duplicate issues.

What enhancement would you like to see?

Currently we handle each request in its own goroutine. This was fine originally when there were fewer requests, but now that we're handling thousands upon thousands of requests it can create an ungodly number of goroutines. Under heavy load this drives up system resource usage considerably. Goroutines are not "free": even though they're small, they still take up memory and can cause slowdowns during context switching. Under the right circumstances this can leave clients with connection issues or the server using large amounts of memory.

To account for this, it may be worth moving to a worker pool-based approach with a channel queue. This would mean creating a pool of premade goroutines, say 100 of them, to handle incoming packets. A channel acts as the queue, handing each packet to a worker as soon as one becomes free. This would keep resource usage at a more consistent, predefined level.
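A minimal sketch of what that could look like, assuming placeholder names (`packet`, `handlePacket`, and the pool/queue sizes are illustrative, not the real server types):

```go
package main

// packet and handlePacket are placeholders for the real PRUDP packet type and
// handler; the pool/queue sizes are examples, not tuned values.
type packet struct {
	data []byte
}

func handlePacket(p packet) {
	// Decode, decrypt, and dispatch the packet here.
	_ = p
}

// startWorkerPool pre-spawns a fixed number of worker goroutines that all pull
// from one shared queue, and returns the send side of that queue.
func startWorkerPool(numWorkers, queueSize int) chan<- packet {
	queue := make(chan packet, queueSize)

	for i := 0; i < numWorkers; i++ {
		go func() {
			// Each worker picks up the next packet as soon as it's free,
			// instead of the server spawning a goroutine per packet.
			for p := range queue {
				handlePacket(p)
			}
		}()
	}

	return queue
}

func main() {
	queue := startWorkerPool(100, 1000)

	// The socket read loop would push into the queue instead of calling
	// `go handlePacket(...)`, e.g. `queue <- packet{data: buf}`.
	_ = queue
}
```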

Any other details to share? (OPTIONAL)

Considerations

Some more accurate testing needs to be done to determine how much of an issue the current design is. We've fixed a number of memory-related issues; however, we are still abusing goroutines pretty heavily, which goes against recommendations, and I do sometimes see the friends server hit high-ish memory usage.

There are also a number of edge cases and caveats to consider when using this sort of design. For instance, whether or not to buffer the channel.

Buffering the channel gives it a predefined capacity; paired with a non-blocking send, any new data is dropped once the channel has reached that capacity ("load shedding"). This has the benefit of keeping memory usage under control, but it means we will begin to drop packets if the channel becomes overloaded. We already handle cases where packets may be dropped, since that can happen no matter what, so maybe that's not a huge deal, but it does mean that if we are consistently under load then the servers will consistently drop packets.

The queue can also be left effectively unbounded. (An unbuffered channel in Go doesn't actually grow; sends just block until a worker is ready, so "unbounded" in practice means letting pending work pile up or backing the channel with a growable intermediate queue.) Under heavy load this means more memory usage as data backs up. It also means clients may actually have a higher risk of timing out. In a system where the channel is buffered, if a packet is dropped then the next time the client sends it, it may have a chance of being handled sooner, since there may be fewer packets in the queue at that time. With an unbounded queue, however, the amount of time before a packet is processed is variable. The queue may hold many instances of the same packet if it's resent multiple times, and if the queue is very long it may take a while before any given packet gets processed.
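For illustration, the two enqueue strategies look roughly like this (still just a sketch, reusing the placeholder `packet` type and worker queue from the sketch above):

```go
// enqueueOrDrop is the load-shedding approach: a non-blocking send that drops
// the packet when the buffered queue is full and reports whether it was accepted.
func enqueueOrDrop(queue chan<- packet, p packet) bool {
	select {
	case queue <- p:
		return true
	default:
		// The queue is at capacity: shed the packet and rely on the client
		// retransmitting it later.
		return false
	}
}

// enqueueBlocking never drops, but the caller stalls (and the backlog builds up
// somewhere) whenever every worker is busy and the queue is full.
func enqueueBlocking(queue chan<- packet, p packet) {
	queue <- p
}
```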

Another thing to consider is the number of workers. Unlike in most languages, we can safely run more workers than we have CPU cores, since the Go runtime is very good at multiplexing them and context switching. However, since some operations may take a while to complete, we could run into a situation where all workers are busy. Say we have 100 workers and we get 101 requests, all of which are operations that take a second or more (a real possibility, as we had some friends server requests taking multiple seconds in the past). Suddenly that 101st request must wait an extra second to be processed because there are no workers available.

Alternatives

If we wanted to get really low level, we could do things like what Tailscale does. Tailscale uses lower-level Linux facilities, such as recvmmsg and netpoll, to handle incoming packets and find free goroutines to process them. This can be incredibly powerful and would likely help us a lot here, but it's also fairly complex and would likely break on platforms other than Linux.

jonbarrow added the enhancement and awaiting-approval labels on Nov 9, 2024
@ashquarky
Member

I usually don't think of adding buffering as a good thing for performance - you can look to the world of network routing and Bufferbloat to see how hard they've been working to eliminate standard FIFO buffers due to the latency they add - but you've already mentioned some of these effects so it's good that we're considering it. I think in the networking world the latest strategy is to throttle high-bandwidth connections to keep the latency low for everyone else; but I'm not sure how that maps to our problem space. Reserve some workers for "easy" things like ping and SecureConnection, and queue those separately?

Could we have a goroutine (or even a real process using fork) per user? Then if you're waiting 3 seconds for some db-heavy Friends call it at least won't affect other users. Then we can know when a goroutine is definitely not needed (the user it belongs to has their connection time out or disconnect). I can see this still ballooning out of control if we have a lot of simultaneous users, though.

Dropping packets might honestly be fine, UDP is designed for it and it gives some backpressure to the client when we're under heavy load (potentially buying us more processing time too).

I'd be curious how other programs that handle lots of connections with a worker pool model (e.g. nginx) handle this. I just worry that we might end up actually adding more latency under load if more requests come in than 100 goroutines can handle.

@jonbarrow
Member Author

jonbarrow commented Nov 30, 2024

To be clear here the main purpose of this issue is about system performance, not necessarily network performance. It's about managing server resources more effectively to prevent issues like what you saw with Splatoon on Discord

I usually don't think of adding buffering as a good thing for performance

To be clear, here I am talking about buffered channels in the context of Go, which are a core concept in Go's concurrency model. Channels can be either buffered or unbuffered. A buffered channel is better for preserving system resources, since it caps the amount of data that can sit in the channel, at the cost of possibly losing additional data when the channel is full. An unbounded queue (which is what the "unbuffered" approach amounts to here) is worse at preserving system resources, since the backlog keeps growing as more data comes in, but it has the benefit of ensuring no data is actively lost just by adding it to the queue.

Each has its pros and cons and is better or worse in different situations, which is why I mentioned both.

I think in the networking world the latest strategy is to throttle high-bandwidth connections to keep the latency low for everyone else

This is, effectively, what a buffered channel allows you to do. When more bandwidth comes in than we can handle, the channel simply sheds the data that is attempting to be added (if using select), which is effectively a throttle at the application layer. Though it's a simple one, depending on the implementation, as you've mentioned.

Reserve some workers for "easy" things like ping and SecureConnection, and queue those separately?

This can be done, but it comes at the cost of processing packet decoding synchronously. We don't know anything about the packet at the time the data comes in from the socket; it has to be decoded to know its type (such as PING), and then data about the client must be obtained and used to decrypt/rebuild the payload to know the RMC protocol/method (such as SecureConnection). If we want to reserve goroutines for specific types of packets, then we must do ALL of that packet processing (besides the actual message handling) outside of the goroutines, which may take a performance hit.
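To make that trade-off concrete, here's a rough sketch of what the routing would require (all names are hypothetical; the point is that the decode work has to happen on the read loop before a queue can even be chosen):

```go
// readLoop routes packets to separate "fast" and "slow" worker pools. Doing so
// forces the decode/classification work to run synchronously in this loop.
func readLoop(recv func() []byte, fastQueue, slowQueue chan<- packet) {
	for {
		buf := recv()

		// This decoding now happens on the read loop itself rather than
		// inside a worker, which is the cost described above.
		p := packet{data: buf}
		if isLightweight(p) { // e.g. PING or other cheap packet types
			fastQueue <- p
		} else {
			slowQueue <- p
		}
	}
}

// isLightweight stands in for the real packet type / RMC protocol inspection.
func isLightweight(p packet) bool {
	return len(p.data) > 0 && p.data[0] == 0x00 // placeholder check
}
```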

Could we have a goroutine (or even a real process using fork) per user? Then if you're waiting 3 seconds for some db-heavy Friends call it at least won't affect other users. Then we can know when a goroutine is definitely not needed (the user it belongs to has their connection time out or disconnect). I can see this still ballooning out of control if we have a lot of simultaneous users, though

This also works but, as you mentioned, has some issues. Having a real dedicated process for each user would definitely start to balloon. Having a goroutine per user sort of works, but that only helps once we know the user (unless you mean "user" to mean "physical socket connection"), which runs into the same synchronous packet processing issue mentioned before. Additionally, a connection can send multiple packets at once, so limiting it to just one goroutine per connection (I'm ignoring the issues this would cause for virtual connections, btw) would mean that each packet the client sent would be processed (effectively) synchronously (a PING packet being delayed by a long-running DB query could become an issue)

Dropping packets might honestly be fine, UDP is designed for it and it gives some backpressure to the client when we're under heavy load (potentially buying us more processing time too)

I agree this is fine; I've just also noticed that Nintendo seems to not really have an issue with dropped packets? Dropped packets seem pretty rare in the dumps I've looked at, so they must have a better system in place than "just shed whatever we can't handle", such as what I mentioned on Discord about splitting the PRUDP and RMC handling systems into their own processes connected by a message queue

I'd be curious how other programs that handle lots of connections with a worker pool model (e.g. nginx) handle this

In most trivial cases it's a worker pool + channels (buffered or unbuffered depending on the needs of the application) for balancing data. It's why I brought this issue up in the first place: we're kinda just chucking goroutines at everything, which is typically frowned upon and a sign of less-than-optimal design

For non-trivial cases, this largely depends on the context and requirements of the application. I linked to a couple of sources for such cases in the original issue, in the Alternatives section

I just worry that we might end up actually adding more latency under load if more requests come in than 100 goroutines can handle

To reiterate, this issue is not necessarily about latency. The main concern was about system resources, with the goal being to find a balance in the implementation that produces acceptable levels of latency and resource usage

@jonbarrow
Member Author

jonbarrow commented Nov 30, 2024

I believe the idea I mentioned on Discord could also be a good alternative to this proposal and honestly I'm voting we should move to this design regardless of what happens in this issue, as I think it's just overall a better design for these servers. For context, here's a copy-paste of my Discord message:

The proposal also has to account for heavy-workload requests, for instance requests which hit the database and may take more time. If we have a pool of 100 goroutines and all 100 are making heavy calls which take multiple seconds, the server can no longer process requests

However this, too, can be worked around with some refactoring. Though it’s not trivial refactoring. It involves splitting the PRUDP server and the RMC message handling from each other. I have a hunch that this is what Quazal did as well? It’s well known at this point that Rendez-Vous made use of Python during message handling, but it seems to not have been used in the PRUDP connection based on the known error codes and when they trigger. This tells me that the real servers split this functionality, having the main PRUDP endpoint accept messages which then get handed to some Python script for processing, and the result handed back when it’s ready to be sent to the client

We don't use Python scripts, but this type of architecture does make sense and we can emulate a version of it. By splitting the servers into 3 separate processes/real threads (as opposed to using goroutines) we can do the following (a rough sketch of the flow is included after the list):

  • Have a message queue/pool system (such as RabbitMQ? Or some other pub-sub system? I believe there’s gRPC based ones but idk) which will store data to be passed between the other 2 systems
  • Have the PRUDP server push a pending RMC message (and all relevant context) to the queue
  • Have the protocol RMC handling process look for messages in the queue and handle them
  • Have the protocol RMC handling process push response messages to a queue
  • Have the PRUDP server look for responses (or requests, when working with notifications, really just any data) to send to the client
  • Profit?
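Very roughly, and without committing to a specific broker, the shape of that flow might look like the sketch below (the `MessageQueue` interface and all function names are made up for illustration; a real implementation could sit on top of RabbitMQ, NATS, or similar):

```go
// MessageQueue is a stand-in for whatever pub-sub/message broker we'd pick;
// nothing here is tied to a real client library's API.
type MessageQueue interface {
	Publish(topic string, payload []byte) error
	Subscribe(topic string, handler func(payload []byte)) error
}

// runPRUDPSide sketches the PRUDP process: elsewhere it would publish pending
// RMC requests to the queue (mq.Publish("rmc.requests", data)); here it
// subscribes for finished responses to send back out to clients.
func runPRUDPSide(mq MessageQueue, sendToClient func([]byte)) error {
	return mq.Subscribe("rmc.responses", func(payload []byte) {
		sendToClient(payload)
	})
}

// runRMCSide sketches the RMC handling process: it pulls pending requests off
// the queue, does the protocol/database work, and publishes the responses.
func runRMCSide(mq MessageQueue) error {
	return mq.Subscribe("rmc.requests", func(payload []byte) {
		response := handleRMCMessage(payload) // placeholder for the real handler
		_ = mq.Publish("rmc.responses", response)
	})
}

func handleRMCMessage(payload []byte) []byte {
	// Placeholder: decode the RMC request, do the work, build a response.
	return payload
}
```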
