Possible addition: string.new_ascii / string.new_ascii_array #53

Open
jakobkummerow opened this issue Nov 8, 2022 · 0 comments

@jakobkummerow
Collaborator

Sufficiently advanced engines tend to have a specialized internal string representation for ASCII-only strings (V8 certainly does; I'm pretty sure other existing engines do too) [1]. There are also some common use cases where a string is being created that's known to be in ASCII range, such as number-to-string conversions. Of course it is possible to use string.new_utf8[_array] or string.new_wtf16[_array] in these situations, but they both require an engine to scan the memory/array for non-ASCII elements before deciding which kind of string to allocate and copy the characters into, which has significant overhead [2]. We could avoid this by adding instructions to create strings from ASCII data.
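To make the overhead concrete, here's a rough C sketch of the engine-side work (purely illustrative; the struct layouts and allocator names are made up, and this is not any engine's actual code). Today's array-based constructors conceptually need an extra pass over the code units before they can pick a representation; a hypothetical `string.new_ascii_array` could skip that pass (taking, say, an i8 array — whether non-ASCII input would trap or be masked is left open here):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical engine-internal string objects and allocators;
 * names and layouts are made up for illustration only. */
typedef struct { size_t len; uint8_t  chars[]; } OneByteString;
typedef struct { size_t len; uint16_t chars[]; } TwoByteString;
extern OneByteString* alloc_one_byte_string(size_t len);
extern TwoByteString* alloc_two_byte_string(size_t len);

/* Roughly what string.new_wtf16_array obliges an engine to do today:
 * scan the code units, then pick a representation, then copy. */
void* new_from_wtf16_array(const uint16_t* src, size_t len) {
  bool ascii_only = true;
  for (size_t i = 0; i < len; i++) {          /* extra pass over the data */
    if (src[i] > 0x7F) { ascii_only = false; break; }
  }
  if (ascii_only) {
    OneByteString* s = alloc_one_byte_string(len);
    for (size_t i = 0; i < len; i++) s->chars[i] = (uint8_t)src[i];
    return s;
  }
  TwoByteString* s = alloc_two_byte_string(len);
  for (size_t i = 0; i < len; i++) s->chars[i] = src[i];
  return s;
}

/* What a hypothetical string.new_ascii_array could do instead: the producer
 * guarantees ASCII, so the compact representation can be chosen up front and
 * the data copied in a single pass. (Whether non-ASCII input would trap or be
 * masked is an open design question, not specified here.) */
void* new_from_ascii_array(const uint8_t* src, size_t len) {
  OneByteString* s = alloc_one_byte_string(len);
  for (size_t i = 0; i < len; i++) s->chars[i] = src[i];
  return s;
}
```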

There's partial, but only partial, overlap between this suggestion and #51, insofar as number-to-string conversions are a use case that could benefit from either instruction set addition but is unlikely to benefit from both. That said, if we e.g. decide that integer-to-string is sufficiently common and standard (in the sense that everyone does it the same way) to warrant its own instruction, whereas float-to-string is sufficiently uncommon and/or language-specific that we'll leave it up to languages to ship their own implementation for it, then the latter would still benefit from a string.new_ascii_array instruction. Also, there might well be common use cases aside from number conversion that know on the producer side that they're creating ASCII strings.

I wouldn't mind adding such instructions to the MVP; I'm also fine with postponing them to a post-MVP follow-up.

[1] Strictly speaking, any form of "one-byte" string representation is relevant here, e.g. "Latin1"; ASCII is the lowest common denominator of these. In fact, in V8, our "one-byte" strings actually support the Latin1 range, yet I'm suggesting ASCII (i.e. character codes 0 through 127) for standardization here, because I believe that's the subset that maximizes the intersection of usefulness to applications and freedom of implementation choice to engines.

[2] To illustrate with specific numbers: on a particular microbenchmark I'm looking at, which converts 32-bit integers to strings, the score is 20 when I check for ASCII-only characters, and 27 (+35%) when I blindly copy i16 array elements to 2-byte string characters, which wastes memory. There may be potential for (minor?) improvements using SIMD instructions or similar for faster checking, but why bet on engine magic/heroics when it's so trivial to add a Wasm-level primitive that makes it easy and reliable to get high performance?
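For what it's worth, the "SIMD instructions or similar" check alluded to above can be written as a simple reduction that compilers tend to auto-vectorize — a sketch, not V8's actual code, and even then it remains an extra pass over the data:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* OR all code units together and test the high bits once; any bit at or
 * above 0x80 means at least one element is outside the ASCII range. */
bool all_ascii_wtf16(const uint16_t* src, size_t len) {
  uint16_t acc = 0;
  for (size_t i = 0; i < len; i++) acc |= src[i];
  return (acc & 0xFF80) == 0;
}
```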
