This repository was archived by the owner on Aug 18, 2023. It is now read-only.
Manually initialize GcBox contents post-allocation to reduce memory copying #2
Ideally, when calling
You'd assume `data` to be constructed in place or moved into newly allocated memory (the former isn't really common, as stable Rust lacks placement-new-like features). And with the struct being relatively big, you'd expect the compiler to generate a `memcpy` call to simply move the structure's bytes into place.

The issue currently is that, due to either rustc not being smart enough or the gc-arena code not being optimizer-friendly (or both), the compiler can `memcpy` your Data object several times before actually moving it into its final place.

For example here:
The generated code will first do a `memcpy` to move `t` into the `gc_box` object on the stack, then allocate memory, and then do a second `memcpy` to move the `gc_box` object onto heap memory. For some reason, on the wasm target the compiler is even worse at optimizing this; in the worst case, I've seen four `memcpy` calls for a single GC allocation. This can obviously cause unnecessary overhead.

My patch helps the compiler by simplifying the initialization: first we allocate the uninitialized memory, then we manually build the `GcBox` by moving its fields into place. This way the object `t` is moved straight into its final place without being moved into the intermediate stack variable `gc_box`.

I was trying to show a comparison on godbolt, but as soon as I drop some layers of abstraction, rustc catches on and generates better code. This is my best attempt: https://godbolt.org/z/aaK75W. You can see that in
`old()` there is one `memcpy` before allocation and one after, but in `new()` there is only one `memcpy`.

Here's a comparison on "production" code, with a decompiled wasm build of https://github.com/ruffle-rs/ruffle/ . In practice, I've seen this cause up to 15-20% speedups in some edge cases.
Before, 4x `memcpy`:

After, just two:
And when rust-lang/rust#82806 gets merged into rustc, with my patch it'll become just one, which is how it's supposed to work :)
I made sure the patch passes tests with miri.
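
For readers who want the shape of the change without opening the diff, here is a minimal sketch of the two initialization strategies described above. It is not the actual gc-arena code: the `GcBox` below is a simplified stand-in with a hypothetical `flags` field, and `alloc_old`, `alloc_new` and `free` are names made up for illustration. Whether the compiler really emits fewer `memcpy` calls for the second form depends on the rustc version, optimization level and target.

```rust
use std::alloc::{alloc, handle_alloc_error, Layout};
use std::ptr::{self, NonNull};

// Hypothetical, simplified stand-in for gc-arena's `GcBox`;
// the real type has different fields.
pub struct GcBox<T> {
    flags: u8,
    value: T,
}

// Old shape: build the whole box as a stack value, then move it to the heap.
// The compiler may copy `t` into `gc_box` and then copy `gc_box` again.
pub fn alloc_old<T>(t: T) -> NonNull<GcBox<T>> {
    let gc_box = GcBox { flags: 0, value: t };
    NonNull::from(Box::leak(Box::new(gc_box)))
}

// New shape: allocate uninitialized memory first, then write each field
// directly into its final location.
pub fn alloc_new<T>(t: T) -> NonNull<GcBox<T>> {
    // `GcBox<T>` is never zero-sized (it contains a `u8`), so calling
    // `alloc` with this layout is allowed.
    let layout = Layout::new::<GcBox<T>>();
    unsafe {
        let raw = alloc(layout) as *mut GcBox<T>;
        let ptr = match NonNull::new(raw) {
            Some(ptr) => ptr,
            None => handle_alloc_error(layout),
        };
        // `addr_of_mut!` yields field pointers without creating a reference
        // to the still-uninitialized struct.
        ptr::addr_of_mut!((*raw).flags).write(0);
        ptr::addr_of_mut!((*raw).value).write(t);
        ptr
    }
}

// Helper for this sketch only: reclaims the allocation and returns the value.
// Safety: `ptr` must come from `alloc_old` or `alloc_new` and must not be
// used again afterwards.
pub unsafe fn free<T>(ptr: NonNull<GcBox<T>>) -> T {
    let boxed = unsafe { Box::from_raw(ptr.as_ptr()) };
    let GcBox { value, .. } = *boxed;
    value
}
```

Both functions return a pointer to the same kind of heap allocation, so a caller could swap one for the other; the only intended difference is how many intermediate copies the optimizer has to see through.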