Zero-copy I/O #161
Oh, yes indeed, since we'd be storing to a single output instead of one per Protobuf message. As for the second part, basically I'd like to offer users the option to append the binary encoding to a buffer they already own (which could be a few things), instead of generating a new one to copy from. With …
For nested types, wouldn't that mean traversing the values several times? And performing the varint encoding several times too, to measure how many bytes are required? This can still be a performance win in practice; I'm just pointing out the potential accidentally-quadratic behavior :)
For nested types, that's the caveat I noted (3rd paragraph of "encoding" section): I'll want to calculate the size of every message exactly once. Some kind of temporary memoization at that level, I suppose. For varints specifically, note that determining the size doesn't involve actually encoding, but just predicting how many multiples of 7 bits you need. That should be quite fast. I'll definitely take hints from …
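The varint size prediction described above can be sketched in a few lines of OCaml (a hypothetical helper, not ocaml-protoc's actual API): count how many 7-bit groups the value occupies, without producing any bytes.

```ocaml
(* Hypothetical sketch: number of bytes an unsigned varint occupies on
   the wire, computed without encoding anything. Each wire byte carries
   7 bits of payload, so we count 7-bit shifts until nothing remains. *)
let varint_size (n : int64) : int =
  let rec loop n acc =
    let rest = Int64.shift_right_logical n 7 in
    if Int64.equal rest 0L then acc else loop rest (acc + 1)
  in
  loop n 1
```

For example, values up to 127 take one byte, 128 takes two, and so on; no buffer is touched.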
You can guarantee a single length calculation for nested objects by serializing them back to front and keeping track of the written bytes (roughly a reverse in-order traversal of the tree, returning the total written bytes up the stack, s.t. you know the total length of the just-written content before writing the tag).
That's not possible, since the length of a block is written before its data and is itself encoded as a varint. This is why the current implementation encodes recursively and copies buffers at each nesting step. My goal here is to remove this copying by calculating sizes (which is fairly cheap).
You are correct iff you write the tag before the content. The idea is that instead of writing from the front of your buffer starting with the tag, you start with the deepest-nested object, writing it to the end of the buffer, then prepend its tag, passing the written size up the recursion. So you'd 1. write content, 2. write tag, 3. return written bytes incl. the tag. This is definitely possible (and works, I've tried it in production) but it requires either a method for approximating the buffer size you need, or a way to copy into a larger buffer if the message turns out to be too large. Also, the complexity may not be warranted here, as this is probably not used for low-latency/high-throughput situations.
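A minimal sketch of this back-to-front scheme, with assumed helper names (none of this is ocaml-protoc's API): the buffer fills from the end towards index 0, and a length-delimited field is written payload first, then its length, then its tag, so the length is known before it must be written.

```ocaml
(* Hypothetical helpers. [pos] is the index of the first used byte;
   everything at [pos..] is already written. Each writer returns the
   new (smaller) position. *)

let varint_size n =
  let rec loop n acc = if n lsr 7 = 0 then acc else loop (n lsr 7) (acc + 1) in
  loop n 1

(* Reserve the exact span for the varint, then fill it left-to-right in
   wire order (least-significant 7-bit group first, continuation bits on
   all bytes but the last). *)
let write_varint buf pos n =
  let start = pos - varint_size n in
  let rec fill i n =
    let g = n land 0x7f and rest = n lsr 7 in
    Bytes.set buf i (Char.chr (if rest = 0 then g else g lor 0x80));
    if rest <> 0 then fill (i + 1) rest
  in
  fill start n;
  start

(* A length-delimited field, written back to front: payload, then its
   length, then the tag (field number + wire type 2). *)
let write_len_delim buf pos ~field ~payload =
  let len = String.length payload in
  let pos = pos - len in
  Bytes.blit_string payload 0 buf pos len;
  let pos = write_varint buf pos len in
  write_varint buf pos ((field lsl 3) lor 2)
```

For nested messages the same pattern recurses: the child is fully written before the parent's length prefix, so no size is ever computed twice.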
I like this idea, although I think here it would not be ideal:
*If the internal storage used was a Bigstring, a "sub" could be returned for free, which would solve that problem.
Is this really worth the complexity? Allocating a bigstring might fall short (if it's too small), causing a lot of trouble to re-encode again with a larger buffer. Computing sizes means doing a lot of the work twice, or even more (quadratically in the depth of nested messages, I believe). I think 0-copy is not worth it, especially not in OCaml. Just use a small pool of buffers for sub-messages, so as to re-use them (GC pressure is important), and move on :)
With deep structures of messages (as I will deal with), I believe the difference can be significant. Again, this idea presupposes computing sizes exactly once, not multiple times. There's no quadratic work at all here. Calculating the size of a message is cheap if done only once. I will back this up with proper benchmarking before formulating a PR, of course. Also, FWIW, a fixed-size buffer having to be reallocated larger in @RNabel's scenario would not imply restarting the encoding from scratch. That would be wasteful. For example, …
Where do you store the precomputed sizes? The types to be encoded are external; they don't have hidden fields to store their length, as far as I can tell. If you can't store the length, how do you compute it once? That's what evades me.
I haven't landed on a solution for that yet. This issue is a back-burner brainstorm for me, not something I'm implementing this week. (Things like Yojson support for Protobuf …)
@RNabel I tried the write-from-backward method in #169 and it's not a big improvement on my (synthetic) benchmark. Reusing sub-buffers is simpler and better.

However, I have an idea on how to do the size-computation pass efficiently. The main issue is that, for a message at (say) depth 3, its size will be computed twice: once when it's time to output it, and before that, once to compute the size of its parent (which, being at depth 1, is nested and must be length-prefixed). At depth 4, the size will be computed thrice, I think, and so on. The total amount of work is quadratic in the depth of the values, which can be pathological if one has recursive messages.

Anyway, that's why we need to cache sizes if we go this way. To do so, we can standardize on the traversal order: it must basically be a post-order (parent messages being traversed after their children). This is somewhat similar to how imgui and the like use the traversal order (of the tree of widgets) to assign each item a unique index, which can then be used to carry state from one iteration of the main loop to the next. Here we do exactly 2 iterations, and the state we carry is the size of sub-messages.
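A toy model of that two-pass idea (hypothetical types, not ocaml-protoc's generated code): pass 1 visits the tree in a fixed order and records each nested message's size in post-order, so each size is computed exactly once and total work is linear; pass 2 (not shown) would replay the same deterministic traversal and read the cached sizes by index instead of recomputing them.

```ocaml
(* [Leaf] stands in for a scalar field of known size; [Node] stands in
   for a nested message that needs a length prefix. *)
type msg = Leaf of string | Node of msg list

let varint_size n =
  let rec loop n acc = if n lsr 7 = 0 then acc else loop (n lsr 7) (acc + 1) in
  loop n 1

(* Pass 1: compute each Node's encoded size once, pushing sizes in
   post-order (a node is pushed only after all its children). *)
let compute_sizes (m : msg) : int array =
  let sizes = ref [] in
  let rec go = function
    | Leaf s -> String.length s
    | Node children ->
      let sz =
        List.fold_left
          (fun acc c ->
            let s = go c in
            (* each nested child costs its length prefix plus its body *)
            acc + varint_size s + s)
          0 children
      in
      sizes := sz :: !sizes;  (* post-order push *)
      sz
  in
  ignore (go m : int);
  Array.of_list (List.rev !sizes)
```

The encoding pass would then walk messages in the same standardized order, consuming the array by index, which is the "carry state between two iterations" trick described above.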
@c-cube I really like that stack which relies on a deterministic traversal order. A kind of …
Yes, that's an idea. It takes 20 lines to do a little custom one for protoc. The issue, though, is that I think this would need more code generation (for a size function) *or* a mode where the encode function doesn't actually write data but just pushes sizes... so that's more complicated.

It's also not guaranteed it'll be faster than reusing sub-buffers :)
I don't understand the buffer recycling idea. If there was a pool of buffers, sure, there'd be a few fewer allocations (fewer calls to …)
There's still copying, but it doesn't mean it's slower than traversing twice and storing sizes in a side-buffer (which is also more complex, I think, because of the need for 2 different kinds of traversals). Benchmarks can be used to assess which is faster in practice.
About the size function: gogo/protobuf#559. They seem to have settled on writing backwards, which is definitely the cleanest solution conceptually, I think.
Doesn't writing backwards require having a big enough buffer to begin with? Or do they resize periodically like …?
Exactly, that's what the benchmark example already does: resizing like any other dynamic vector, but once you resize you copy to the end, not the beginning. So really it just requires a custom buffer that writes from the end.

The benefits of this approach, I think:

- the best in terms of cache (exactly one traversal of the data)
- the best in terms of buffer storage (only one buffer; no memoization of sizes)
- still somewhat simple conceptually

However, it's fiddly, and requires a few modifications of the generated code (not too big), so as to use combinators for lists and for fields (since we want to control that field tags/list items are emitted in the reverse order). I think it's still worth it if we want a _robust at scale_ solution.
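That custom from-the-end buffer can be sketched as follows (hypothetical code, with assumed names like `rbuf`): on overflow it doubles the storage and copies the existing bytes to the *end* of the new storage, so the already-written suffix keeps its position relative to the end.

```ocaml
(* A buffer written from the end. [pos] is the index of the first used
   byte; the free space is [0, pos). *)
type rbuf = { mutable b : Bytes.t; mutable pos : int }

let create n = { b = Bytes.create n; pos = n }

(* Make room for [n] more bytes. On resize, copy the used suffix to the
   end of the new storage, not the beginning. *)
let ensure rb n =
  if rb.pos < n then begin
    let old_len = Bytes.length rb.b in
    let used = old_len - rb.pos in
    let new_len = max (2 * old_len) (used + n) in
    let b = Bytes.create new_len in
    Bytes.blit rb.b rb.pos b (new_len - used) used;
    rb.b <- b;
    rb.pos <- new_len - used
  end

let add_char rb c =
  ensure rb 1;
  rb.pos <- rb.pos - 1;
  Bytes.set rb.b rb.pos c

(* The final result is the used suffix, in normal front-to-back order. *)
let contents rb = Bytes.sub_string rb.b rb.pos (Bytes.length rb.b - rb.pos)
```

Because offsets are measured from the end, a resize never invalidates what has been written, which is what makes the single back-to-front pass robust.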
Yes, I really like that idea. I can't think of other uses for a sizing function besides actually encoding, so if we can avoid it entirely we should. It seems obvious to me, even without a formal benchmark, that writing in reverse once would be faster than calculating the size (properly memoized) and then doing the same writing but forwards. My only concern is that without knowing sizes, we end up with an area of memory which is larger than what we need to return. That wouldn't be a problem with a …
That's always been the case, even with the current simple …
It might be trivial to even eliminate that final alloc+blit for users who don't want it. For example, if the new "from-the-end" internal buffer used a …
(Using a …) That being said, writing in reverse to a single area will already be a massive improvement!
Or, avoid bigstrings entirely, and return …
Yes, returning the data storage and the offset instead of blitting would be nice to have as an option. I'll still need to blit that into a …

It'd be fun to abstract away the storage entirely, but since we want to write in reverse that'd be very non-standard.
If you do non-blocking I/O, then yes. If you do blocking I/O, you generally can directly use the byte buffer + offset :)
I found this very interesting document: https://perfetto.dev/docs/design-docs/protozero. There's, in particular, mention of "scattered IO", which certainly must help with nested messages:
With #223 we now write back-to-front, in one go. Benchmarks show that, if one reuses the encoder, close to 0 allocations are done in the common case. For example, on my i7-12700, here are a few benchmarks ("current" is the new master branch, "nested-bufs" is what we had until recently, with cached nested buffers):

292-byte value:

3050-byte value:

The new encoder can write at > 700 MB/s (on a very good desktop machine, granted).
I have a spare-time idea to improve encoding performance: encoding to a single buffer. This low-hanging fruit would significantly reduce allocations when dealing with trees of messages, by writing sequentially to a single `Buffer.t`.

In Jane Street's `Bin_prot`, each type has a function which predicts the binary encoding size for its current state. I think it would be quite doable to implement a kind of `…_payload_pb.size` function which `_payload_pb.encode_…` could then use to know its size and its nested messages' sizes, to avoid having to use temporary buffers. OCaml's memory representation of `string` and `bytes` already includes the size, which is thus very efficient to get, and it's trivial to predict the size of a varint for field IDs, enums, ints and message sizes themselves. The main issue to watch out for there would be to take note of sizes collected during the process, to avoid calculating them more than once.

A nice-to-have addition to this could be to abstract reading and writing into a module signature instead of using `Buffer` directly. This would allow users to use other types of buffers or channels.
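One possible shape for that abstraction (a hypothetical signature, not part of ocaml-protoc): generated encoders would be parameterized over any sink that provides a few primitive writes, and the stdlib `Buffer` would satisfy it directly.

```ocaml
(* Hypothetical writer signature; names are illustrative only. *)
module type WRITER = sig
  type t
  val add_char : t -> char -> unit
  (* [add_bytes t src off len] appends [len] bytes of [src] from [off]. *)
  val add_bytes : t -> bytes -> int -> int -> unit
end

(* The stdlib Buffer implements it with no wrapper cost. *)
module Buffer_writer : WRITER with type t = Buffer.t = struct
  type t = Buffer.t
  let add_char = Buffer.add_char
  let add_bytes b src off len = Buffer.add_subbytes b src off len
end
```

Other instances could target bigstrings, channels, or a pooled buffer, without touching the generated encoding logic.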