consider dropping wsaccel dependency #127


Closed · belm0 opened this issue Jun 15, 2019 · 20 comments


belm0 commented Jun 15, 2019

wsaccel doesn't appear to be maintained, and its build is broken under PyPy.

This article describes a pure-Python mask implementation that comes close to the performance of wsaccel:

https://www.willmcgugan.com/blog/tech/post/speeding-up-websockets-60x/


Kriechi commented Jun 15, 2019

Amazing results! wsproto is already Python-3-only, so I would suggest using "Solution 3" with this snippet:

import sys

native_byteorder = sys.byteorder

def mask3(mask, data):
    # Treat the whole payload as one big integer, XOR it with the 4-byte
    # mask repeated out to the payload length, then convert back to bytes.
    datalen = len(data)
    data = int.from_bytes(data, native_byteorder)
    mask = int.from_bytes(mask * (datalen // 4) + mask[: datalen % 4], native_byteorder)
    return (data ^ mask).to_bytes(datalen, native_byteorder)
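
For example (illustrative values):

>>> mask3(b'\x01\x02\x03\x04', b'hello')
b'igohn'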

PRs are welcome!


belm0 commented Jun 15, 2019

It's a little confusing from the article, but solution 4 (using bytearray.translate()) is actually the fastest solution in Python. The author was merely looking for a Python 2 solution; the twist was that he didn't expect it to be faster than Python 3's int.from_bytes().
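
For reference, a minimal sketch of the translate()-based approach (roughly what aiohttp ships as its pure-Python fallback; the names here are illustrative):

# Precomputed once at import: for each possible mask byte b, a 256-byte
# translation table mapping every input byte a to a ^ b.
_XOR_TABLE = [bytes(a ^ b for a in range(256)) for b in range(256)]

def mask4(mask, data):
    # data must be a bytearray; each of the four positions in the
    # repeating mask gets its own translate() pass over a strided slice.
    a, b, c, d = (_XOR_TABLE[n] for n in mask)
    data[::4] = data[::4].translate(a)
    data[1::4] = data[1::4].translate(b)
    data[2::4] = data[2::4].translate(c)
    data[3::4] = data[3::4].translate(d)
    return data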


belm0 commented Jun 15, 2019

I measured the bytearray.translate() implementation to actually be about 10% faster than wsaccel.

https://nbviewer.jupyter.org/gist/belm0/2b610cc405dd3dac977f34650cb187cc


njsmith commented Jun 15, 2019

The translate-based solution does have some tradeoffs, though: it adds something like 100 KB of extra RAM for the lookup table (256 tables of 256 bytes is 64 KiB of raw data, plus per-object overhead), and several milliseconds of startup time to compute that table (on a fast desktop CPU; more on a Raspberry Pi or an underpowered ARM router or whatever).

So I'm also leaning towards "Solution 3". It's still something like 50x faster than what we have right now, and doesn't risk creating pathological behavior in constrained environments.


belm0 commented Jun 15, 2019

Fair point about memory and startup time. Though "what we have now" is wsaccel, which is 2x faster than solution 3, so perhaps we should stay with wsaccel then.


njsmith commented Jun 16, 2019

Note that this discussion started over here: python-trio/trio-websocket#119

See in particular python-trio/trio-websocket#119 (comment) and below, which discuss the up- and down-sides of implementing our own accelerator. (Basically: upside is that it's pretty easy to get 10x faster than wsaccel or the pure python stuff above; downside is then you have a C accelerator module to maintain.)


njsmith commented Jun 16, 2019

...though, ugh, I guess wsaccel doesn't even ship wheels, so right now, by depending on it, trio-websocket forces everyone to have a compiler installed. (This is particularly problematic on Windows.)


bluetech commented Jul 5, 2019

The int.from_bytes() version has the disadvantage over the others of using roughly 4x peak memory. Given 10 MiB of data, the peak traced memory is:

import tracemalloc

# mask3 as defined in the snippet above
mask = b'\0\1\2\3'
n = 10 * 2**20          # 10 MiB payload
data = b'\1' * n

tracemalloc.start()
mask3(mask, data)
print(tracemalloc.get_traced_memory()[1] // 2**20, 'MiB')

  • mask1: 11 MiB
  • mask2: 10 MiB
  • mask3: 42 MiB
  • mask4: 20 MiB

The optimal result is 0 MiB, i.e. doing the masking in place, which should be possible given that payload in payload = self.masker.process(payload) is a bytearray (contrary to what the declared type of Buffer.consume_at_most() says).
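
For illustration, a sketch of fully in-place masking on a bytearray (slow in pure Python, but it allocates nothing beyond the input):

def mask_inplace(mask, data):
    # XOR every byte of the bytearray with the repeating 4-byte mask,
    # mutating it in place, so peak extra memory stays near zero.
    for i in range(len(data)):
        data[i] ^= mask[i & 3]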

BTW, the blog post uses data = b'\0' * data_size to generate the test data, but somewhere in mask3 there seems to be a special case for all zeros (presumably because an all-zero payload converts to the integer 0, which makes the big-int XOR cheap). Changing it to b'\1' makes it 25% slower for me (Python 3.7.3, Linux).


bluetech commented Jul 5, 2019

BTW2, the entire masking business is unfortunate. It is meant to protect against bad proxies, but probably 99% of WebSocket traffic in the open internet is over TLS, which makes it unnecessary. If I had control over all WebSocket clients in the world, I would just make them use \x00\x00\x00\x00 as the mask key over wss, and the servers would detect that and skip the pointless exercise.


pgjones commented Jul 18, 2019

I don't think wsaccel is (or ever has been) an explicit dependency of wsproto. I've added it as an extra requirement in #133, and implemented solution 4 as that is what aiohttp uses. Does #133 provide a sufficient solution for this?
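
For context, an optional extra in setup.py looks roughly like this (a hypothetical sketch; the names are illustrative, not the actual #133 diff):

from setuptools import setup

setup(
    name="example-package",   # illustrative
    version="0.0.0",
    extras_require={
        # opt-in acceleration: pip install example-package[wsaccel]
        "wsaccel": ["wsaccel"],
    },
)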

willmcgugan commented:

You could save the startup time by pregenerating that lookup table, and inserting a literal...
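
A sketch of that idea, assuming the _XOR_TABLE layout from the snippet above: generate the 64 KiB of table data once, offline, and paste the resulting literal into the module so import time pays no computation cost.

# One-off generator script: emit the concatenated tables as a bytes literal.
tables = b"".join(bytes(a ^ b for a in range(256)) for b in range(256))
print("_XOR_TABLES = " + repr(tables))

# In the module, after pasting the literal in, slice it back into 256 tables:
# _XOR_TABLE = [_XOR_TABLES[i * 256:(i + 1) * 256] for i in range(256)]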


belm0 commented Jul 19, 2019

implemented solution 4 as that is what aiohttp uses

Doesn't the article say that aiohttp uses solution 3?


pgjones commented Jul 19, 2019

It was solution 3 at the time the article was written, but now it is 4 thanks to the author of the article.

willmcgugan commented:

Something to consider with the translate method is that it has to convert to and from a bytearray if the input and output are bytes.

The XOR part will still take the majority of the time, but if the payload data were a bytearray to begin with, it can avoid some extra copying. It does become more significant with smaller payload sizes.
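
Concretely, wrapping the mask4 sketch above for a bytes payload costs two copies (hypothetical wrapper):

def mask4_bytes(mask, data):
    # Round-trip through a bytearray when the caller holds immutable bytes.
    buf = bytearray(data)   # copy in
    mask4(mask, buf)        # XOR in place
    return bytes(buf)       # copy out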


njsmith commented Jul 19, 2019

It does become more significant with smaller payload sizes.

Copying small payloads is very cheap though :-).

willmcgugan commented:

True. I was thinking the copying to XORing ratio would be more significant with small payloads. But yeah, small payloads are always going to be fast.


pgjones commented Aug 10, 2019

wsaccel has been dropped, see 89f442b

pgjones closed this as completed Aug 10, 2019

belm0 commented Nov 25, 2020

For what it's worth, I did have a measurable regression in my app upon upgrading wsproto to >= 0.15.0.

It was a 1% increase in CPU cycles, which is significant given that little of the app's time is spent on websockets. I'm speculating that an app using wsproto heavily might see a 5-10% regression.

A simple benchmark was added recently (#150), so it's perhaps worth a comparison against a backport to 0.14.1.


belm0 commented Nov 25, 2020

I couldn't reproduce the regression with whatever bench/connection.py is measuring; it shows 0.14.1 as significantly slower.

using 500 iterations and a few hacks for 0.14 compatibility:

$ pip install wsproto==0.14.1 wsaccel
$ python bench/connection.py
0.4089s

$ pip uninstall wsaccel
$ python bench/connection.py
18.4648s

$ pip install wsproto==1.0.0
$ python bench/connection.py
0.3506s


belm0 commented Nov 25, 2020

I couldn't reproduce the regression with whatever bench/connection.py is measuring

well, it's because my app has small message sizes

filed #152
