POST body processing is 10x slower in Cowboy 2.10.0 compared to 1.1.2 #1611

Open · EzoeRyou opened this issue Jun 28, 2023 · 9 comments

@EzoeRyou

We're porting Erlang software that depends on the now-deprecated Cowboy 1.1.2 to the recent Cowboy 2.10.0.

During the porting process, we found that when processing POST bodies, Cowboy 2.10.0 is 10x slower in terms of bandwidth than Cowboy 1.1.2 without the JIT enabled. It is still 8.4x slower with the JIT enabled.

This performance regression prevents us from updating Cowboy in our software.

Here is minimal benchmark code to reproduce the issue, along with a summary of the benchmark results:

https://github.com/AoiMoe/cowboy_post_bench

@essen (Member) commented Jun 28, 2023

You may want to tweak the read_body options or the HTTP/1.1 option active_n, and possibly others.
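
For illustration, here is a sketch of both knobs, assuming a hypothetical post_handler module; the option values are examples, not recommendations. The read_body options are passed per call:

    %% Hypothetical handler reading the full body with tweaked
    %% read_body options: length is the number of bytes to read per
    %% call, period is how long to wait for them (in milliseconds).
    -module(post_handler).
    -export([init/2]).

    init(Req0, State) ->
        {ok, Body, Req1} = read_all(Req0, <<>>),
        Req2 = cowboy_req:reply(200, #{}, integer_to_binary(byte_size(Body)), Req1),
        {ok, Req2, State}.

    read_all(Req0, Acc) ->
        case cowboy_req:read_body(Req0, #{length => 1000000, period => 15000}) of
            {ok, Data, Req} -> {ok, <<Acc/binary, Data/binary>>, Req};
            {more, Data, Req} -> read_all(Req, <<Acc/binary, Data/binary>>)
        end.

active_n is a protocol option set when starting the HTTP/1.1 listener:

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}],
        #{env => #{dispatch => Dispatch},
          active_n => 100}).  %% number of packets requested at once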

@EzoeRyou (Author)

Thanks for the suggestions.

We tweaked various options; changing active_n and length did not solve the performance regression.

We found that changing the socket buffer size via setopts was effective, but it's still 10-20% slower than Cowboy 1. buffer needs to be set to a very large value to compensate for the regression introduced in Cowboy 2.
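
For reference, this kind of buffer tuning can be done through the listener's transport options, which accepted sockets inherit. A sketch, with illustrative names and one of the sizes tested below:

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}, {buffer, 262144}],
        #{env => #{dispatch => Dispatch}}).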

The detailed microbenchmark code and results are noted here; see Test 3:

https://github.com/AoiMoe/cowboy_post_bench

To summarize the buffer size tuning: Cowboy 2 with the default buffer size of 1460 is 10x slower than Cowboy 1, and performance improves as we increase the buffer size. We saw dramatic improvement (or rather compensation) up to a buffer size of 32768. After that there are diminishing returns, though we still see some improvement up to 262144; a buffer size of 524288 was worse than 262144. It never reaches the performance of Cowboy 1.

While the performance regression in Cowboy 2 was somewhat mitigated by increasing the buffer size, the microbenchmark was performed on a loopback device rather than over a real Internet route, so it's not a real-world scenario. We still think a 10-20% performance regression is too much to risk the upgrade, and we also think the default behaviour should be sane.

Is there anything we can do to completely fix the performance regression introduced in Cowboy 2?

@essen (Member) commented Jul 14, 2023

The changes that result in a performance drop are related to the support for HTTP/2, which performs better than HTTP/1.1 in real use cases. In the future Cowboy will also support HTTP/3, which performs even better (the http3 branch is a work in progress).

There's likely still room for improvement for HTTP/1.1; I'll take a look when time allows. But right now my priority is HTTP/3.

For what it's worth, there's not much point measuring performance over loopback, although I'm sure the code performs worse in Cowboy 2 due to how it is structured. One thing you can do with Cowboy 2, however, is write your own stream handler to handle these requests: stream handlers execute in the connection process and have the same performance properties Cowboy 1 had.
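
A minimal sketch of such a stream handler, assuming a hypothetical my_stream_h module and omitting error handling; it accumulates the body and replies from the connection process itself:

    -module(my_stream_h).
    -behaviour(cowboy_stream).
    -export([init/3, data/4, info/3, terminate/3, early_error/5]).

    %% No commands yet; the state accumulates body data.
    init(_StreamID, _Req, _Opts) ->
        {[], <<>>}.

    %% Body data arrives here, in the connection process.
    data(_StreamID, nofin, Data, Acc) ->
        {[], <<Acc/binary, Data/binary>>};
    data(_StreamID, fin, Data, Acc) ->
        Body = <<Acc/binary, Data/binary>>,
        {[{response, 200, #{}, integer_to_binary(byte_size(Body))}], Body}.

    info(_StreamID, _Info, State) ->
        {[], State}.

    terminate(_StreamID, _Reason, _State) ->
        ok.

    early_error(_StreamID, _Reason, _PartialReq, Resp, _Opts) ->
        Resp.

It would be enabled through the stream_handlers protocol option, e.g. #{stream_handlers => [my_stream_h]}, replacing the default cowboy_stream_h.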

@essen added this to the 2.13.0 milestone Jan 15, 2025
@essen (Member) commented Jan 24, 2025

I've started looking into this in detail. One interesting bit is that when moving to the new approach I had to move from a sync recv to an async recv. At the time, {active,N} with a large enough N value proved to be pretty good. That doesn't appear to be the case now, at least not when reading larger bodies. This is because Cowboy receives a lot of small data packets: it can process packets literally as they arrive, rather than Erlang buffering them more, as it did previously. This is great for low-latency work, not so much for high throughput. To restore performance today would require letting the VM buffer input data more than it currently does. I will have to figure out how to do it.
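
In plain gen_tcp terms, the two receive styles look roughly like this (a sketch; Socket and handle/1 are assumed to exist):

    %% Synchronous, passive recv (the Cowboy 1 style): the VM can
    %% buffer incoming data until the process asks for it.
    {ok, Data} = gen_tcp:recv(Socket, 0, 5000),
    handle(Data).

    %% Asynchronous {active, N} (the Cowboy 2 style): each packet is
    %% delivered as a message as soon as it arrives.
    ok = inet:setopts(Socket, [{active, 100}]),
    receive
        {tcp, Socket, Packet} ->
            handle(Packet);
        {tcp_passive, Socket} ->
            %% N packets consumed; re-arm active mode.
            ok = inet:setopts(Socket, [{active, 100}])
    end.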

@essen (Member) commented Jan 24, 2025

OK, I verified that configuring buffer does help a lot with performance. It allows active mode to work in a way that is closer to the previous recv. So it seems that {active, once} and a higher buffer would work better, at least in the body-reading scenario.

@essen (Member) commented Jan 24, 2025

With a large buffer and active_n set to 1, the difference is pretty clear:

          http.plain_h_1M_post_1: ********µs     92.6reqs/s
          http.plain_h_1M_post_1:  7023122µs   1423.9reqs/s

The numbers are above the 10x difference reported in the ticket. These requests are all 1MB in size and there are 10000 of them, so 10GB is transferred in total, in about 7s, or around 1.4GB per second. On localhost, of course.

It remains to be seen whether body reading is the only thing requiring this change or whether it's good to have for requests too.
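
The combination measured above corresponds to listener options along these lines (a sketch; the exact buffer value used is not stated, 131072 is taken from the discussion below):

    Dispatch = cowboy_router:compile([{'_', [{"/", post_handler, []}]}]),
    {ok, _} = cowboy:start_clear(bench_listener,
        [{port, 8080}, {buffer, 131072}],
        #{env => #{dispatch => Dispatch},
          active_n => 1}).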

@essen (Member) commented Jan 28, 2025

I opened erlang/otp#9355 to question the default in OTP. I don't think we can set an appropriate default in Cowboy, because Cowboy can't know what environment it will run in (constrained or not). But we can definitely provide guidance in the documentation on what should be configured for high performance, as well as ship a better default like the one I recommended OTP change to. If OTP doesn't change its default, Cowboy can set its own default to that value and recommend a higher value in the documentation.

@essen (Member) commented Jan 29, 2025

A large buffer can be harmful. It depends on what the protocol is doing.

For HTTP/1, a large buffer (131072) makes requests without bodies a little slower but not significantly, and requests with large bodies a lot faster.

For HTTP/2, a large buffer (131072) is just as bad as the default; a better value is around 32768. But that's only true with the default HTTP/2 protocol options: if you tweak those, then 131072 becomes better and 32768 worse. This is because the HTTP/2 frames are smaller by default, so increasing their size makes the larger buffer more appropriate.

For Websocket, a large buffer (131072) is a clear improvement, at least when a lot of data goes through (the larger the frames, the bigger the improvement; the smaller the frames, the worse it gets).

In other words, there's no single buffer value that is appropriate in all cases.

Instead, I am looking at doing two things:

  • Increasing the default to 8192 if it is not set. This value has no negative impact that I could see.
  • When we know the size of the data we are reading (e.g. a large request body) and that size is over a certain threshold, dynamically increasing the buffer value. Cowboy would get a new option that it can use to increase or decrease the buffer size depending on what is going on. How this option is used would differ per protocol: HTTP/1 can use the request body as a hint, HTTP/2 the stream or connection window, and Websocket perhaps an average of frame sizes (see the sketch after this list).
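
A hypothetical sketch of the hint-based sizing for the HTTP/1 case; the function name, bounds, and policy are assumptions, not Cowboy code:

    %% Pick a buffer size from the announced body length, clamped
    %% between the default and a configured maximum.
    buffer_from_hint(undefined, Default, _Max) ->
        Default;
    buffer_from_hint(BodyLength, Default, Max) ->
        max(Default, min(Max, BodyLength)).

    %% Applied when body reading starts, e.g.:
    %%   ok = inet:setopts(Socket, [{buffer, buffer_from_hint(Len, 8192, 131072)}]).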

@essen (Member) commented Feb 3, 2025

I believe the branch at #1666 solves the issue.

The solution ended up a little different from what was described above: we only consider incoming packet sizes when resizing the buffer.
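
Roughly, the packet-size-based idea looks like this (an illustration, not the actual #1666 code; the doubling policy and cap are assumptions):

    %% Grow the buffer when incoming packets fill it, up to a cap.
    maybe_grow_buffer(Socket, Packet, Buffer, Max)
            when byte_size(Packet) >= Buffer, Buffer * 2 =< Max ->
        NewBuffer = Buffer * 2,
        ok = inet:setopts(Socket, [{buffer, NewBuffer}]),
        NewBuffer;
    maybe_grow_buffer(_Socket, _Packet, Buffer, _Max) ->
        Buffer.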
