Proposal: HW-Specialized WebAssembly #1528
Comments
Thanks for filing this issue! Adding some notes from offline discussion that aren't fully reflected here:
I would argue that this isn't strictly true on the CPU. While older ISA extensions are more fragmented (which is what relaxed-simd was targeting), there's definitely a movement towards convergence on the newer extensions. FP16, the extension used in this context, is also one where hardware is moving towards convergence soon, as seen in the instruction lowerings for FP16.
This is the use case that we are currently focused on - though arguably to experiment with new features a simple opcode prefix for experimentation could also suffice, with a potential path towards standardization.
In theory, I like this approach, but there is a practical usability concern that this sidesteps. In the web context, a big value add for Wasm is the access to lower level primitives which complement existing Web APIs. For example, providing optionality for codecs or storage APIs (SQLite on the web through Wasm) that benefit from lower level primitives, or ease of use for existing native libraries that can then use Wasm intrinsics as a drop in replacement instead of higher level primitives/functions that applications need to rewrite their program for. I'm not an expert in the cryptographic domain, but usually the requests for AES instructions are motivated by applications wanting to use a Wasm intrinsics header as a drop-in replacement for native intrinsic header files without actually having to rewrite much of their program.
I suspect engine authors may not want to sign up for maintaining or updating kernels, especially at the rate they're changing. The other side of this is also that engines may be left with supporting multiple versions of the kernels because it is quite hard to deprecate something once there are actual users. My argument is that the complexity of handling this belongs in the library/application space and not in the Wasm space.
This raises an interesting point, should tensor manipulation be squeezed into current WebAssembly? CPUs (and by extension Wasm) are great for general purpose compute, but probably aren't very well suited for special purpose compute. There should be a graceful fallback path to the CPU when special purpose compute isn't available, but that seems more like an implementation consideration for runtimes. Again, far from an expert, but please correct me if I'm misunderstanding the intent of this use case.
I think the point that is tripping me up a bit is that the fallback path is the default path - i.e the default
In the web environment, standardization has been critical for adoption and for consistent behavior across engines. While it is true that different engines adopt features at different points in time, they have largely conformed to the standard, which I see as a strength of the Wasm process. From the perspective of an application developer, the user experience of having significant differences in performance profiles is quite challenging to navigate. That said, I'm sure the web ecosystem is different in this regard, so I understand that some guardrails should be there to make sure that non-web engines don't bear the implementation burden for features that aren't relevant to them.
Thanks for writing this up. It is a good time to explore directions for a generic extension mechanism. My first reaction after reading this proposal is that it is adding a lot of complex and cross-cutting machinery to the language, some of which (conditional compilation) we tried before. What I am missing is an argument why the much simpler approach taken for JS string builtins, plus the addition of type imports to allow for stricter static typing, is not sufficient for other use cases.
@rossberg I think the main argument here is that a builtin comes with its own specification via a standard lowering (i.e. the fallback) which is guaranteed to be semantically equivalent to the desired HW instruction(s). While this new mechanism is more spec burden than, e.g. the string built-ins (which AFAICT can be spec'd in an embedding) it seems like a more general mechanism that could enable many use cases with a single mechanism.
@titzer, the built-ins approach would also specify a precise semantics for those built-ins, so what's the practical difference? And it is not just more spec burden; it is multiple difficult new features with non-obvious design.
A couple thoughts here, thanks for writing this up!
I think that true semantic equivalence here will be very difficult. It's already the case that engines could pattern match whole functions and replace them with equivalent native instructions. We don't do that for various reasons, but one major one is that it's really hard to prove that any non-trivial wasm function is equivalent to a single machine instruction for all inputs. And also maintaining a deterministic trapping order across mutations to the store is hard. That leads me to wonder, does anyone actually care about running the fallback code? It seems like the reason for having an 'is_available' predicate was so that users could avoid calling the fallback code altogether (as it's likely much slower). If that's the case, we could avoid the whole semantic equivalence issue by not specifying a fallback and instead specifying a deterministic trap if a builtin is called but not available. Another question here, why not use the approach from js-string-builtins of importing the builtin function? This would avoid the need to change core wasm. If the platform supports the builtin, you let it satisfy the import. If you want to instead use a fallback, you provide the import yourself. For example with the 'half-precision' proposal, nearly every instruction there could be an imported function. The only two that would need some care would be the load/store instructions. I do support the idea of using 'builtins' to express things that aren't a fit for all platforms. This can lower the bar for new instructions from e.g. 'good fit for all platforms' to 'good fit for the web platform'. However, when it comes to the web platform we do still have constraints that will make adoption of 'hardware experimental features' (as one of the use-cases) very difficult. The biggest issue is that we can't really unship things from the web. 
If we ship a builtin for some experimental hardware feature and it gets adopted in a major ML framework as a critical optimization, it's very difficult to remove it. Firefox still supports asm.js even though its usage is very low. The kinds of builtins that we'd support for the Web would need to meet a certain amount of stability that may be lower than the core spec, but still pretty high. There are also concerns about fingerprinting that I'm not an expert in, but I know they can cause issues for shipping things.
I explored some aspects of this proposal for a final project in @titzer's Virtual Machines class, so I thought I'd provide an experience report and share the full write-up.
Summary
Proof of concept demonstrates that SHA-1 with dedicated AArch64 C intrinsics can be executed via Wasm intrinsics in Wasmtime at 1.3x native performance. Potential issues revealed by this experiment:
Application
The experiment provides a proof-of-concept for a representative use case, namely the SHA-1 hash algorithm using the Cryptographic Extension on AArch64. The prototype demonstrates how C code written against ARM's C intrinsics API can be executed both natively and via Wasm. Wasm execution is achieved with a Wasm AArch64 intrinsics C API layer that serves as a "drop-in replacement for native intrinsic header files", as @dtig mentioned earlier. In addition, I have a fork of Wasmtime with support for intrinsic calls for a select group of AArch64 instructions. The end result is SHA-1 execution via Wasm with intrinsics at 1.3x native AArch64 performance. To give a feel for the implementation, four rounds of the SHA-1 compression function in C with AArch64 intrinsics are:
// Rounds 28-31
e0 = vsha1h_u32(vgetq_lane_u32(abcd, 0));
abcd = vsha1pq_u32(abcd, e1, t1);
t1 = vaddq_u32(m1, vdupq_n_u32(K1));
m2 = vsha1su1q_u32(m2, m1);
m3 = vsha1su0q_u32(m3, m0, m1);
These intrinsics are defined in
Lessons
Some lessons from this proof-of-concept, with the caveat that they may not generalize to other intrinsics domains.
Challenge of Semantics Mismatches. Compilation via intrinsics passes through many layers: C intrinsics API, engine intrinsics API, Wasm operators, CLIF IR and machine code representation. Each of these has their own semantics and value representations. Earlier stages of this project showed that if not handled correctly, semantics mismatches can eliminate any performance you might hope to gain from the intrinsics calls. Specifically, in this case some of the special SHA-1 instructions have the oddity that they accept
Significance of the Intrinsics API. The design of the C API layer was critical in achieving near native performance. Specifically, it should be designed to limit the number of intrinsics required in the engine, and intrinsics offered by the engine should be as close as possible to the machine instructions. Therefore:
Importance of Accompanying Optimizations. The first version of SHA-1 via Wasm intrinsics had poor performance (3.2x native), showing that merely mapping to the right machine instructions is not enough. Supporting optimization passes are critical. In the SHA-1 case, it was crucial to eliminate redundant moves between register classes, but it is reasonable to expect instances of this problem for other classes of intrinsics. Optimizing JITs are designed for compile speed and therefore have a much more limited set of optimizations than a full AOT compiler. In this case we were able to work around missing Cranelift JIT optimizations by moving the problem to the AOT compilation layer, however it is not clear that would always be possible. Indeed, the remaining approximately 30% overhead over native execution may be a difficult gap to close, given the lack of optimizations such as instruction scheduling in JIT compilers. Overall, we might expect that Wasm intrinsics performance would be limited by JIT compiler optimization capabilities.
Fallback Performance. When the intrinsics implementation is executed under Wasm with the fallback implementations, the performance is very poor (over 9x native intrinsics). In fact, it's even worse than a generic version of SHA-1 compiled to Wasm. The function call overhead is likely a major problem, so inlining of fallbacks would likely be necessary for tolerable performance. Alternatively we could accept that fallback performance is not a goal, to @eqrion's point, and the
Engineering Aspects. The fork of Wasmtime for this project was modified with this proof-of-concept in mind. While the engineering was reasonable, the approach taken is not one that would scale to adding hundreds or thousands of intrinsic calls. At the time of writing, the ARM intrinsics database contains 12,855 function calls, with 4,344 in the Neon instruction set extension. A full production-grade version of the Wasm intrinsic header library and accompanying engine support would be a substantial undertaking. You would almost certainly want automation and code-generation involved, but also certain parts of the engine integration would not scale well. The current hand-written assembler would need to support many more instructions. You also probably would not want to actually extend the Engine's IR to support every intrinsic either, but instead perhaps support an explicit passthrough or intrinsic IR node that would effectively perform a trivial lowering to a wrapped machine instruction. None of these engineering challenges are intractable, but they would need careful thought.
Reference
This proposal describes a mechanism for extending WebAssembly with access to hardware primitives for
improving WebAssembly’s performance. Its structure begins with the motivation and relevant
background (“why?”) and proceeds to discuss options for implementing this mechanism (“how?”). Some
issues are still up for debate and are collected in the open questions section at the end.
Why?
WebAssembly “aims to execute at native speed by taking advantage of common hardware capabilities”
(https://webassembly.org). The “common hardware capabilities” referred to
have, in many cases, been sufficiently “common” for this to be true. But WebAssembly is now used in
domains where native primitives are no longer “common” hardware: e.g., native execution of machine
learning code enjoys support from matrix multiplication primitives like AMX, wider vector sizes like
AVX-512, uncommon types like FP16, and even non-CPU accelerators, like GPUs and NPUs.
Problem
The WebAssembly specification cannot extend itself to support all hardware primitives:
was initially built on relied on half a century of CPU development. More recent HW features can be
quite distinct between HW vendors; it becomes difficult or impossible to paper over the semantic
differences (e.g., see relaxed SIMD).
WebAssembly specification cannot change at the same rate. And, it could be argued, it should
not: some new HW features here today may not be here tomorrow and WebAssembly would,
unfortunately, be stuck with them.
embedded) in which additional performance may not always be worth the additional implementation
effort (e.g., some engines refuse to support SIMD, GC, etc.). Each new feature complicates the
WebAssembly specification itself, raising the burden to maintain it.
Use Cases
An extension to allow HW-specialized WebAssembly could:
specification; this would allow for soliciting user feedback and realistic performance numbers
without the overhead of the CG process
libssl with AESNI support).
compilers (Triton, IREE, ONNX, etc.) that target WebAssembly lose critical information about
tensor manipulation when squeezed into current WebAssembly.
Objections
The primary objection is fragmentation. If WebAssembly can be extended in any conceivable way,
what is “standard” WebAssembly then? A HW-specialized module relying on a specific ARM instruction,
e.g., cannot be executed on any engine — it is no longer portable.
This proposal suggests a mechanism to solve this — built-in functions with WebAssembly
fallbacks. We argue that fragmentation is both inevitable and a present reality, but that it can
be addressed in a portable way that doesn’t burden the core specification with the union of all
possible instruction sets. We see hardware ISA fragmentation as inevitable in the sense that
businesses relying on WebAssembly will want to use the HW they purchased — efficiency is a
steady force towards environment-specific code.
Secondly, we see software ecosystem fragmentation as a different but related problem. Presently, a
WebAssembly module compiled to run on Amazon’s embedded engine will surely not instantiate in
Emscripten’s browser environment, Fastly’s Compute@Edge, DFINITY, etc. — the imports simply do
not line up! Software fragmentation already exists. We, however, claim that solutions for SW
fragmentation do not solve the HW fragmentation problem. E.g., WASI requires a canonical ABI for
guest-host boundary calls; what we propose here could even allow inlining single HW instructions
in JIT-compiled machine code. And though V8’s builtins are a step in this direction, they do not go
far enough, as we will see.
How?
Our proposal is an annotation mechanism for marking WebAssembly functions and types as built-ins
which engines use to emit semantically equivalent HW instructions. It mainly relies on conventions
(e.g., tool-conventions) but may require slight changes to the WebAssembly specification. Beyond
these tweaks, it requires no spec involvement when creating a new built-in.
Built-in functions
A cryptographic example:
We propose that toolchains use the custom annotations
proposal to emit @builtin functions; programmers express this intent with C attributes, e.g. Engines
may use this annotation to replace the WebAssembly function body with a HW-specialized body,
optimizing this in a semantics-preserving way (more on this later). For portability, engines that
choose not to implement built-ins simply execute the fallback function body as before.
We expect each built-in function group (e.g., libssl above) to provide an is_available function
returning 0 or 1 to query whether the engine will optimize the built-in (i.e., inline the
HW-specialized machine code) or not (i.e., run the slower fallback code). The fallback code for this
must always return 0.
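As a minimal sketch of this convention in C: the names `clmul32`, `clmul_is_available`, and `gf2_square` are hypothetical and not part of the proposal, and the annotation is shown only in comments because no toolchain attribute is specified here. The idea is that a built-in pairs a portable fallback body with an availability query whose fallback returns 0, and callers may branch on that query to pick a different algorithm.

```c
#include <stdint.h>

/* Hypothetical built-in: 32x32 -> 64-bit carry-less multiplication.
 * An engine that recognizes the (assumed) @builtin annotation could replace
 * this body with one HW instruction; every other engine runs this loop. */
uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t acc = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1u) {
            acc ^= (uint64_t)a << i; /* XOR instead of ADD: no carries */
        }
    }
    return acc;
}

/* Availability query for the group. The fallback body always returns 0;
 * an optimizing engine would substitute a body that returns 1. */
int clmul_is_available(void) {
    return 0;
}

/* Caller-side dispatch: use the built-in when the engine accelerates it,
 * otherwise use an algorithm that is cheaper than the generic fallback.
 * Squaring a polynomial over GF(2) just spreads bit i to bit 2*i. */
uint64_t gf2_square(uint32_t x) {
    if (clmul_is_available()) {
        return clmul32(x, x);
    }
    uint64_t r = 0;
    for (int i = 0; i < 32; i++) {
        r |= (uint64_t)((x >> i) & 1u) << (2 * i);
    }
    return r;
}
```

Both paths compute the same result, so a module written this way stays portable whether or not the engine implements the built-in.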
Built-in types
We additionally consider HW representations that have no equivalent Wasm representation today. The
problem with many HW features is that they require types unknown to WebAssembly. Examples include
smaller-precision floats (fp16), different-sized integers (i1, i8, i128), wider vectors (v256,
v512), tensors (!), masks, flags, etc. We propose a mechanism to introduce built-in types. E.g.,
certain operations may need a wider SIMD type:
As with built-in functions, built-in types are annotated with a @builtin annotation and include a
fallback mechanism for portability. Because (a) fallback WebAssembly code must construct a value of
the type in WebAssembly and (b) to avoid misusing struct in a non-GC sense, we propose a new way to
organize types: a tuple. It would collect several known WebAssembly types under one type index,
including a tuple.new for creation and tuple.get|set for field access. It would have no GC
semantics (each tuple slot occupies a Wasm stack slot) and could even be considered an alias (e.g.,
returning a $zmm could be considered sugar for a multi-value result). But it has an advantage: JIT
compilers can now reason about the built-in type and emit specialized code; e.g., a tuple.get of
$zmm could be lowered to a lane extraction instruction.
This is quite similar to the type-imports proposal in that both enforce a representation, but that
proposal only allows existing heap types which won’t work here.
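To make the tuple idea concrete, here is a rough C model of the fallback representation. It assumes, purely for illustration, that `$zmm` is a 512-bit value laid out as four v128 slots; the type and function names (`v128`, `zmm_tuple`, `zmm_new`, `zmm_get`, `zmm_set`) are hypothetical stand-ins for `tuple.new` and `tuple.get|set`.

```c
#include <stdint.h>

/* Stand-in for Wasm's 128-bit v128 value type. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} v128;

/* Assumed fallback layout for a 512-bit $zmm built-in type: four v128
 * slots under one type. It is a plain value with no GC semantics. */
typedef struct {
    v128 slot[4];
} zmm_tuple;

/* Models tuple.new: build the value from its four slots. */
zmm_tuple zmm_new(v128 a, v128 b, v128 c, v128 d) {
    zmm_tuple t = { { a, b, c, d } };
    return t;
}

/* Models tuple.get. In Wasm the slot index would be a static immediate;
 * the mask here only keeps this C model in bounds. */
v128 zmm_get(zmm_tuple t, unsigned i) {
    return t.slot[i & 3u];
}

/* Models tuple.set as a pure update that returns the new value. */
zmm_tuple zmm_set(zmm_tuple t, unsigned i, v128 v) {
    t.slot[i & 3u] = v;
    return t;
}
```

Because the tuple is passed and returned by value, it behaves like sugar for a multi-value result, while an engine that knows the type could keep it in one wide register.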
Semantic equivalence
For an engine to correctly replace built-ins with HW-specialized code, it must preserve the
semantics of the fallback code (i.e., the WebAssembly function body and tuple structure). We expect
engines to verify this, but do not mandate a mechanism since this engine-specific decision is open
for experimentation.
One way engines could check that built-in functions and types are equivalent is by “bit by bit”
equivalence. The engine would maintain a map of the fallback code to its HW-specialized equivalent
and, for each @builtin function, would check that the guest’s fallback code matches the engine’s
exactly. An optimization of this idea is to use collision-resistant hashes, though this entails some
discussion of risk.
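A "bit by bit" check of this kind could look roughly like the following C sketch. The `builtin_entry` structure, the `match_builtin` function, and the use of an integer id for the HW lowering are all assumptions made for illustration, not something the proposal defines.

```c
#include <stddef.h>
#include <string.h>

/* One engine-side entry: the exact fallback bytes the engine trusts and an
 * identifier for the HW-specialized lowering it would emit instead. */
typedef struct {
    const char *name;               /* built-in name, e.g. from the annotation */
    const unsigned char *fallback;  /* expected fallback function body */
    size_t fallback_len;
    int hw_lowering_id;             /* hypothetical handle to the lowering */
} builtin_entry;

/* Returns the lowering id when the guest's fallback body matches the
 * engine's copy byte for byte; returns -1 otherwise, meaning the engine
 * should simply compile and run the guest's fallback code. */
int match_builtin(const builtin_entry *db, size_t n, const char *name,
                  const unsigned char *guest, size_t guest_len) {
    for (size_t i = 0; i < n; i++) {
        if (strcmp(db[i].name, name) == 0 &&
            db[i].fallback_len == guest_len &&
            memcmp(db[i].fallback, guest, guest_len) == 0) {
            return db[i].hw_lowering_id;
        }
    }
    return -1;
}
```

A mismatch is never an error in this model: it only means the optimization is skipped, which is what keeps the module portable. Replacing `memcmp` with a comparison of collision-resistant hashes would be the optimization mentioned above.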
We expect “bit by bit” equivalence to be a good starting point, but encourage research along a
different path: checking semantic equivalence. Changes to the toolchain and optimization
settings will break “bit by bit” equivalence, but analyzing the data flow trees into each
WebAssembly effect (global, table, and memory access, along with function calls) should provide a
more robust check.
Yet another solution here is to mix “bit by bit” equivalence with golden fallback libraries. If
the fallback code is kept separate (i.e., in a library), used only via import, and linked together
at instantiation in the engine, it is more likely that the fallback bits are never changed.
Built-in databases
The addition of builtin functions and types could be problematic for engines with JIT compilers: not
all engines have V8-like infrastructure for translating specific function imports to an internal IR.
We envision creating a “database” of HW-specialized templates that engines could integrate into
their JIT-compilation pipeline, e.g., distributed as a library engines could depend on. Each entry
would contain: the fallback WebAssembly code, a HW-specialized machine code template, the
HW-specific features required (e.g., CPUID checks), and a description of how to emit each HW
instruction. The HW-specialized template might look like:
The template is not true assembly code: it has (a) meta-instructions for any WebAssembly operations
that only the engine knows how to emit and (b) holes for engine-specific register allocation. During
compilation, a JIT compiler would “fill out” and emit the template for a built-in function rather
than JIT compiling the fallback code. (This approach is inspired by the “Copy and
Patch” paper.)
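The "fill out" step can be pictured with a small C sketch. Everything here is hypothetical: the `hw_template` layout, the `emit_template` function, and the simplification that each hole is a whole byte receiving a register number (real encodings pack registers into bit fields, and meta-instructions are not modeled at all).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_HOLES 4

/* A database entry's machine-code template with holes left for the
 * engine's register allocator. The bytes are placeholders, not a real
 * instruction encoding. */
typedef struct {
    const uint8_t *bytes;
    size_t len;
    size_t hole_offset[MAX_HOLES]; /* byte offsets to patch */
    size_t num_holes;
} hw_template;

/* Copies the template into `out` and patches each hole with the register
 * number chosen by the engine. Returns the number of bytes emitted, or 0
 * if the template cannot be emitted (the engine would then fall back to
 * JIT-compiling the WebAssembly fallback body). */
size_t emit_template(const hw_template *t, const uint8_t *regs,
                     size_t num_regs, uint8_t *out, size_t out_cap) {
    if (t->num_holes > MAX_HOLES || num_regs != t->num_holes ||
        t->len > out_cap) {
        return 0;
    }
    for (size_t i = 0; i < t->num_holes; i++) {
        if (t->hole_offset[i] >= t->len) {
            return 0; /* malformed entry: hole outside the template */
        }
    }
    memcpy(out, t->bytes, t->len);
    for (size_t i = 0; i < t->num_holes; i++) {
        out[t->hole_offset[i]] = regs[i];
    }
    return t->len;
}
```

Validating every hole before copying means a malformed database entry leaves the output buffer untouched, which fits the cautious stance on accepting templates.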
Of course, accepting and emitting machine code from a template database is risky. We encourage
research in developing an automated equivalence checker between the WebAssembly fallback code and
these HW-specialized templates. This would necessarily build on the formal semantics of WebAssembly
and any machine languages (e.g., ARM, x86). This research direction does not block the general idea,
though: engines can (and may even prefer) to manually vet which database entries they choose to
emit.
Versioning
In certain cases, checking the built-in fallback code for semantic equivalence is not enough; in
these cases applications need different code versions. For example, relaxed SIMD added a small,
fixed set of semantic choices; these could have been expressed by versions. Another motivating
example: if WebAssembly’s popcount instruction did not exist, a natural way to express it would be
via different versions, each version representing the different HW semantics. This section proposes
two alternate ways to express different semantics; note that neither necessarily depends on the
built-in infrastructure described above, though we use built-ins as easy examples.
Function-based Versioning
One way to conditionally define semantics is at the function level:
Because the function ID remains the same, call sites (e.g., call $foo) are unchanged by the decision
of which version to use. For WebAssembly modules containing versions, applications may specify which
version to use:
If no version is chosen, the engine must choose the first one listed in the module.
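The selection rule can be modeled in a few lines of C. The `func_version` structure and `select_version` function are hypothetical, and returning -1 for an unknown version name is an assumption of this sketch; the proposal only states what happens when no version is chosen.

```c
#include <stddef.h>
#include <string.h>

/* One version of a function as listed in the module, in module order. */
typedef struct {
    const char *version; /* version label */
    int body_id;         /* stand-in for the function body to compile */
} func_version;

/* Returns the index of the version the engine should compile.
 * - chosen == NULL models "no version chosen": the first listed wins.
 * - an unknown name returns -1 (an assumption of this sketch). */
int select_version(const func_version *versions, size_t n,
                   const char *chosen) {
    if (n == 0) {
        return -1;
    }
    if (chosen == NULL) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (strcmp(versions[i].version, chosen) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```

Since the decision only picks which body to compile for the same function ID, call sites need no changes regardless of the result.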
Block-based Versioning
Another way to conditionally specify semantics is at the block level:
Toolchains emit choice-block much like they would a regular WebAssembly if-else-end construct.
The difference is that, at compile time, the engine decides to use the first block that matches a
set of version strings. Users pass version strings as boolean flags:
As before, if no version is selected (e.g., []), the engine must choose the first block.
Example: half-precision
To understand the concepts above, let’s consider a recent WebAssembly use case that could benefit
from this proposal. The half-precision proposal
adding FP16 SIMD operations to WebAssembly has met resistance in the WebAssembly CG due to the
problem of HW availability: not all users have CPUs that support the optimal lowering of the new
instructions. Nevertheless, FP16 is clearly a valuable feature in certain domains, such as ML
kernels, and would be useful now. This proposal could resolve that tension:
Define a new f16x8 type: the performance benefit of this proposal comes from understanding
that certain HW has vector registers supporting f16. Since the proposal only allows accessing
lanes as f32, we could define:
Define f16x8 instructions as built-ins: e.g.,
Compile code to use built-ins: e.g., port kernels from XNNPACK’s fp16_gemm to use the f16x8
built-in functions and types. This presupposes that the toolchain now supports C/C++ attributes
that emit the WebAssembly @builtin annotations.
Implement engine support: in a WebAssembly engine, add the HW lowerings
of f16x8 built-ins. In V8/Chrome’s case, these new built-ins would naturally be placed behind an
origin trial flag. At this point, users could experiment with the new built-ins and compare
against the fallback path in other engines.
Upstream built-in definitions: several commenters have requested a centralized process to
reach community consensus. While not the main thrust of this proposal, one such process might be
to add them to the tool-conventions repository, maintained by a subgroup of the WebAssembly CG
(alternately: a separate builtins repository). Remember that built-in optimizations are optional
for engines, so engine maintainers would still be free to choose which built-ins to implement and
when — the fallback code ensures compatibility in the meantime. Proposers would submit an
entry much like the lowerings FP16 already describes, e.g.:
Feature detection: during the period where support for f16x8.mul is not available in all
engines, developers can provide an alternate code path. While the fallback code guarantees
semantic compatibility, it may be slow and an alternate algorithm may be preferable. To choose a
different algorithm, a developer could write:
By checking the f16x8.is_available built-in, developers could select which version to compile.
Adopt into WebAssembly specification (or not): if the f16x8 built-ins are successfully
adopted by multiple engines and applications, this is a strong indication for addition to the
WebAssembly specification. If any HW concerns are alleviated and the proposal is adopted, the
original fallback code can eventually be replaced by the new f16x8 instructions. If it is not
adopted, the built-ins continue to work in engines that continue to support them. If usage of a
built-in declines, engines can safely drop support, relying on the fallback code.
Clearly the process side of this proposal is up for discussion; this example explains one concrete
elaboration of the built-in idea. Similar examples could be crafted for cryptographic primitives, JS
strings, and 128-bit arithmetic — with their various motivation and details.
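As an illustration of the first steps in this example, the following C sketch models a fallback for an f16x8 type whose lanes are read as f32. The struct layout (eight raw 16-bit lanes) and the names `f16x8`, `f16_bits_to_f32`, and `f16x8_extract_lane` are assumptions for this sketch; the conversion itself follows the IEEE 754 binary16 format.

```c
#include <stdint.h>
#include <string.h>

/* Assumed fallback layout: eight lanes holding raw IEEE 754 binary16 bits. */
typedef struct {
    uint16_t lane[8];
} f16x8;

/* Widens binary16 bits to a float: 1 sign, 5 exponent, 10 mantissa bits. */
static float f16_bits_to_f32(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t man = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0) {
        if (man == 0) {
            bits = sign; /* signed zero */
        } else {
            /* Subnormal: shift until the implicit leading 1 appears. */
            int shift = 0;
            while ((man & 0x400u) == 0) {
                man <<= 1;
                shift++;
            }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13); /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* rebias 15 -> 127 */
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Fallback for a hypothetical lane-extraction built-in returning f32.
 * In Wasm the lane index would be an immediate; the mask keeps this C
 * model in bounds. */
float f16x8_extract_lane(f16x8 v, unsigned i) {
    return f16_bits_to_f32(v.lane[i & 7u]);
}
```

An engine with native FP16 support could lower the same built-in to a single convert-and-extract sequence, while engines without it would run this code unchanged.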
Open Questions
for built-ins that directly corresponded to HW instructions, others considered a higher-level
approach more interesting (replace multi3, XNNPACK kernels, entire libssl functions, etc.). It is
not clear which approach is best so we have left this open. One approach might be to limit
functions to well-defined instruction sets, such as x86, ARM, or domain-specific IRs. Examples of
such domains are machine learning, security, etc.
wide adoption in disparate environments would hint that this is already happening; our hope is
that this brings those extensions under a common umbrella. We admit, however, that this would
result in different performance between engines based on what built-ins they support.
the motivations behind this is MLIR’s popularity as an intermediate format for various
high-performance compiler tools, including machine language model compilers (Triton, IREE,
etc.). What MLIR has done well by allowing custom instructions,
types, attributes, and regions is to retain the original program intent through various compiler
transformations. When MLIR is emitted as WebAssembly, though, it meets an impedance mismatch:
high-level tensor operations are squeezed into non-performant i32.load|store operations, e.g. The
hope is that this proposal could bridge MLIR to WebAssembly in a different way, via built-ins.
Then, one might imagine WebAssembly code that interleaves MLIR-based GPU invocations with
tightly-compiled vector kernels with run-of-the-mill WebAssembly, etc.
built-ins proposal? The JS string proposal, now at phase 4, is a proof point that engine-provided
built-ins are in fact necessary for performance. One difference between that proposal and this one
is in scope: this proposal would allow the use of engine built-ins far beyond JS strings. One can
imagine implementing the "wasm:js_string" imports from that proposal in terms of this one:
(@builtin "wasm:js_string" "..."). If this were the case, this would result in improved
compatibility: using this proposal’s built-in fallback code, the JS string built-ins would “come
with” their sub-optimal WebAssembly-only implementation, ensuring modules are at least executable
on any engine — not just browsers — albeit less optimally.
imports proposal? The type imports proposal, now at phase 1, is similar in spirit to this one: both
intend to extend a WebAssembly module with additional type information. But, whereas type imports
are concerned with types coming from outside (e.g., a browser reference), this proposal has to
“lay out” (i.e., provide a representation for) new value types for HW-specialized built-ins. We
expect the particular layout of these new types to be critical for performance but, at the same
time, the type must be transparent to WebAssembly to be useful. This led to the built-in tuple
syntax, but we are open to better syntax suggestions. One possible future is for the type import
proposal to be extended with this or another aliasing syntax which this proposal could then depend
on when marking new types as built-in ones.
Written collaboratively by @abrown, @titzer, @woodsmc, @ppenzin. Thanks to all those who provided
feedback, including @dtig, @tlively, @ajklein, @mingqiusun, @alexcrichton, @dicej, @cfallin.