POST body processing is 10x slower in Cowboy 2.10.0 compared to 1.1.2 #1611

Open · EzoeRyou opened this issue Jun 28, 2023 · 9 comments

@EzoeRyou

We're porting Erlang software that depends on the now-deprecated Cowboy 1.1.2 to the recent Cowboy 2.10.0.

During the porting process, we found that when processing POST bodies, Cowboy 2.10.0 is 10x slower in terms of bandwidth than Cowboy 1.1.2 without the JIT enabled. It is still 8.4x slower with the JIT enabled.

This performance regression prevents us from updating Cowboy in our software.

Here is minimal benchmark code to reproduce the issue, along with a summary of the benchmark results:

https://github.com/AoiMoe/cowboy_post_bench

@essen (Member) commented Jun 28, 2023

You may want to tweak the read_body options or the HTTP/1.1 option active_n, and possibly others.
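
For illustration, here is a sketch of both knobs, assuming a hypothetical post_handler module; the option values are examples, not recommendations. The read_body options are passed per call:

    %% Hypothetical handler reading the full body with tweaked
    %% read_body options: length is the number of bytes to read per
    %% call, period is how long to wait for them (in milliseconds).
    -module(post_handler).
    -export([init/2]).

    init(Req0, State) ->
        {ok, Body, Req1} = read_all(Req0, <<>>),
        Req2 = cowboy_req:reply(200, #{}, integer_to_binary(byte_size(Body)), Req1),
        {ok, Req2, State}.

    read_all(Req0, Acc) ->
        case cowboy_req:read_body(Req0, #{length => 1000000, period => 15000}) of
            {ok, Data, Req} -> {ok, <<Acc/binary, Data/binary>>, Req};
            {more, Data, Req} -> read_all(Req, <<Acc/binary, Data/binary>>)
        end.

active_n is a protocol option set when starting the HTTP/1.1 listener:

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}],
        #{env => #{dispatch => Dispatch},
          active_n => 100}).  %% number of packets requested at once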

@EzoeRyou (Author)

Thanks for the suggestions.

We tweaked various options; changing active_n and length did not solve the performance regression.

We found that changing the socket buffer size via setopts was effective, but it's still 10-20% slower than Cowboy 1. buffer needs to be set to a very large value to compensate for the regression introduced in Cowboy 2.
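
For reference, this kind of buffer tuning can be done through the listener's transport options, which accepted sockets inherit. A sketch, with illustrative names and one of the sizes tested below:

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}, {buffer, 262144}],
        #{env => #{dispatch => Dispatch}}).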

The detailed microbenchmark code and results are noted here; see Test 3:

https://github.com/AoiMoe/cowboy_post_bench

To summarize the buffer size tuning: Cowboy 2 with the default buffer size of 1460 is 10x slower than Cowboy 1, and performance improves as we increase the buffer size. We saw dramatic improvement (or rather compensation) up to a buffer size of 32768. After that there are diminishing returns, though we still see some improvement up to 262144; a buffer size of 524288 was worse than 262144. It never reaches the performance of Cowboy 1.

While the performance regression in Cowboy 2 was somewhat mitigated by increasing the buffer size, the microbenchmark was performed on a loopback device rather than over a real Internet route, so it's not a real-world scenario. We still think a 10-20% performance regression is too much to risk the upgrade, and we also think the default behaviour should be sane.

Is there anything we can do to completely fix the performance regression introduced in Cowboy 2?

@essen (Member) commented Jul 14, 2023

The changes that result in a performance drop are related to the support for HTTP/2, which performs better than HTTP/1.1 in real use cases. In the future Cowboy will also support HTTP/3, which performs even better (the http3 branch is a work in progress).

There's likely still room for improvement for HTTP/1.1; I'll take a look when time allows. But right now my priority is HTTP/3.

For what it's worth, there's not much point measuring performance over loopback, although I'm sure the code performs worse in Cowboy 2 due to how it is structured. One thing you can do with Cowboy 2, however, is write your own stream handler to handle these requests: stream handlers execute in the connection process and have the same performance properties Cowboy 1 had.
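
A minimal sketch of such a stream handler, assuming a hypothetical my_stream_h module and omitting error handling; it accumulates the body and replies from the connection process itself:

    -module(my_stream_h).
    -behaviour(cowboy_stream).
    -export([init/3, data/4, info/3, terminate/3, early_error/5]).

    %% No commands yet; the state accumulates body data.
    init(_StreamID, _Req, _Opts) ->
        {[], <<>>}.

    %% Body data arrives here, in the connection process.
    data(_StreamID, nofin, Data, Acc) ->
        {[], <<Acc/binary, Data/binary>>};
    data(_StreamID, fin, Data, Acc) ->
        Body = <<Acc/binary, Data/binary>>,
        {[{response, 200, #{}, integer_to_binary(byte_size(Body))}], Body}.

    info(_StreamID, _Info, State) ->
        {[], State}.

    terminate(_StreamID, _Reason, _State) ->
        ok.

    early_error(_StreamID, _Reason, _PartialReq, Resp, _Opts) ->
        Resp.

It would be enabled through the stream_handlers protocol option, e.g. #{stream_handlers => [my_stream_h]}, replacing the default cowboy_stream_h.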

@essen added this to the 2.13.0 milestone Jan 15, 2025
@essen (Member) commented Jan 24, 2025

I've started looking into this in detail. One interesting bit is that when moving to the new approach I had to move from a sync recv to an async recv. At the time, {active,N} with a large enough N value proved to be pretty good. That doesn't appear to be the case now, at least not when reading larger bodies. This is because Cowboy receives a lot of small data packets: it can process packets literally as they arrive, rather than Erlang buffering them more, as it did previously. This is great for low-latency work, not so much for high throughput. To restore performance today would require letting the VM buffer input data more than it currently does. I will have to figure out how to do it.
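
In plain gen_tcp terms, the two receive styles look roughly like this (a sketch; Socket and handle/1 are assumed to exist):

    %% Synchronous, passive recv (the Cowboy 1 style): the VM can
    %% buffer incoming data until the process asks for it.
    {ok, Data} = gen_tcp:recv(Socket, 0, 5000),
    handle(Data).

    %% Asynchronous {active, N} (the Cowboy 2 style): each packet is
    %% delivered as a message as soon as it arrives.
    ok = inet:setopts(Socket, [{active, 100}]),
    receive
        {tcp, Socket, Packet} ->
            handle(Packet);
        {tcp_passive, Socket} ->
            %% N packets consumed; re-arm active mode.
            ok = inet:setopts(Socket, [{active, 100}])
    end.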

@essen (Member) commented Jan 24, 2025

OK, I verified that configuring buffer does help a lot with performance. It allows active mode to work in a way that is closer to the previous recv. So it seems that {active, once} and a higher buffer would work better, at least in the body-reading scenario.

@essen (Member) commented Jan 24, 2025

With a large buffer and active_n set to 1, the difference is pretty clear:

          http.plain_h_1M_post_1: ********µs     92.6reqs/s
          http.plain_h_1M_post_1:  7023122µs   1423.9reqs/s

The numbers are above the 10x difference reported in the ticket. These requests are all 1MB in size and there are 10000 of them, so 10GB is transferred in total, in about 7s, or around 1.4GB per second. On localhost, of course.

It remains to be seen whether body reading is the only thing requiring this change or whether it's good to have for requests too.
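
The combination measured above corresponds to listener options along these lines (a sketch; the exact buffer value used is not stated, 131072 is taken from the discussion below):

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}, {buffer, 131072}],
        #{env => #{dispatch => Dispatch},
          active_n => 1}).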

@essen (Member) commented Jan 28, 2025

I opened erlang/otp#9355 to question the default in OTP. I don't think we can set an appropriate default in Cowboy, because Cowboy can't know what environment it will run in (constrained or not). But we can definitely provide guidance in the documentation on what should be configured for high performance, as well as ship a better default like the one I recommended OTP change to. If OTP doesn't change its default, Cowboy can set its own default to that value and recommend a higher value in the documentation.

@essen (Member) commented Jan 29, 2025

A large buffer can be harmful. It depends on what the protocol is doing.

For HTTP/1, a large buffer (131072) makes requests without bodies a little slower but not significantly, and requests with large bodies a lot faster.

For HTTP/2, a large buffer (131072) is just as bad as the default; a better value is around 32768. But that's only true with the default HTTP/2 protocol options: if you tweak those, then 131072 becomes better and 32768 worse. This is because the HTTP/2 frames are smaller by default, so increasing their size makes the larger buffer more appropriate.

For Websocket, a large buffer (131072) is a clear improvement, at least when a lot of data goes through (the larger the frames, the bigger the improvement; the smaller the frames, the worse it gets).

In other words, there's no single buffer value that is appropriate in all cases.

Instead, I am looking at doing two things:

  • Increasing the default to 8192 if it is not set. This value has no negative impact that I could see.
  • When we know the size of the data we are reading (e.g. a large request body) and that size is over a certain threshold, dynamically increasing the buffer value. Cowboy would get a new option that it can use to increase or decrease the buffer size depending on what is going on. How this option is used would differ per protocol: HTTP/1 can use the request body as a hint, HTTP/2 the stream or connection window, and Websocket perhaps an average of frame sizes (see the sketch after this list).
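
A hypothetical sketch of the hint-based sizing for the HTTP/1 case; the function name, bounds, and policy are assumptions, not Cowboy code:

    %% Pick a buffer size from the announced body length, clamped
    %% between the default and a configured maximum.
    buffer_from_hint(undefined, Default, _Max) ->
        Default;
    buffer_from_hint(BodyLength, Default, Max) ->
        max(Default, min(Max, BodyLength)).

    %% Applied when body reading starts, e.g.:
    %%   ok = inet:setopts(Socket, [{buffer, buffer_from_hint(Len, 8192, 131072)}]).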

@essen (Member) commented Feb 3, 2025

I believe the branch at #1666 solves the issue.

The solution ended up a little different from what was described above: we only consider incoming packet sizes when resizing the buffer.
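
Roughly, the packet-size-based idea looks like this (an illustration, not the actual #1666 code; the doubling policy and cap are assumptions):

    %% Grow the buffer when incoming packets fill it, up to a cap.
    maybe_grow_buffer(Socket, Packet, Buffer, Max)
            when byte_size(Packet) >= Buffer, Buffer * 2 =< Max ->
        NewBuffer = Buffer * 2,
        ok = inet:setopts(Socket, [{buffer, NewBuffer}]),
        NewBuffer;
    maybe_grow_buffer(_Socket, _Packet, Buffer, _Max) ->
        Buffer.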
