Floating-point rounding mode control prototyping #1456
Since the actual proposal is to add more control of rounding modes to Wasm, it would probably make sense to broaden the scope from interval arithmetic to any use of different rounding modes. Casting a wider net will help find more motivating use cases.
I'm not as concerned with trying to do more performance tests at this point, since you convincingly explained how slow emulating different rounding modes would be. I'm more concerned right now about having concrete motivating use cases. Ideally we would find someone with something (an application, a library, a language, etc.) they would like to port to Wasm for whom the lack of rounding mode control is a significant hurdle. Even with a perfect design and convincing benchmarks, it would be hard to motivate engine and tool implementers to work on this (or anything else) without such a concrete use case.
I hear that. For now, some history of the Boost interval package going back to 1999 or so (beware, not HTTPS): http://www.mscs.mu.edu/~georgec/IFAQ/maurer1.html "Originally, I only intended the package to assess rounding errors in lengthy computations. Since then, I've learned that there are other application domains where interval arithmetic is useful" <- And with this quaint understatement, the analog security era was born! The point being simply that this is hardly a new innovation (not that anyone here seems to think that). Note that G. Bill Walster was deeply involved. More to come by way of current examples.
To address your specific comment about broadening the scope, I certainly don't oppose that. I just don't personally know of any practical uses for RM tweaking apart from interval arithmetic. The one sort of exception would be integer rounding. On x86, for one: "The FRNDINT instruction returns a floating-point value that is the integral value closest to the source value in the direction of the rounding mode specified in the RC field". I'm not sure how float-to-int is implemented in WASM, but in theory it would be amenable to RC flavors just like any other floating-point instruction, though it would have this very different effect (nudging by up to 1.0, as opposed to one unit in the last place (ULP)).
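To make the "up to 1.0 vs. one ULP" contrast concrete, here's a small Python sketch. Python's built-in `floor`/`ceil`/`trunc` stand in for the directed float-to-int rounding directions; this is illustrative only, not WASM semantics:

```python
import math

# Directed rounding to an integer can move a value by up to 1.0:
x = 2.5
assert math.floor(x) == 2      # round toward negative infinity
assert math.ceil(x) == 3       # round toward positive infinity
assert math.trunc(-2.5) == -2  # chop toward zero

# ...whereas directed rounding of a float arithmetic result moves it
# by at most one unit in the last place, which near 2.5 is tiny:
assert math.ulp(2.5) < 1e-15
```

(`math.ulp` requires Python 3.9 or later.)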
TL;DR: Some algorithms are polynomially slower without interval arithmetic; others simply don't work. Alright, I promised to list some practical examples. On the one hand, you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM because, like, what purpose would that serve without going through the entire process that we're literally going through here and now? From the public discourse I've read on various forums, it's clear that people just assume they need a specialized library to run intervals in a performant manner on a particular piece of hardware. Pending any change to WASM, that seems to be just how it is. Therefore, in the absence of any published rant about this, I'll attempt to provide the next best thing, which is a list of compelling applications that I've gathered from public sources. Here goes:
Is that helpful?
It's a list of situations where interval arithmetic may be useful. Nobody argued that interval arithmetic is never useful. (I think some of the specific entries on this list are quite debatable, but that's beside the point.) What folks were asking for is a concrete example of an application/library/etc that would like to adopt Wasm but currently can't because of lack of control over rounding modes. To state it very clearly, there are two issues: To reiterate: nobody has said that we should never add support for control over rounding modes to Wasm; on the contrary, people were generally supportive of this feature. But to actually start working on it, there needs to be evidence of concrete usefulness. And to decide what the details should look like, there needs to be a collaboration with actual users of the feature. If your statement:
is another way of saying "yeah, admittedly, nobody is asking for this", then realistically that means that nobody is going to work on it either, unless and until someone does ask for it.
I have a few progress updates on this. Basically I've been trying to contact the team leaders or authors of various interval applications across a variety of languages. Based on the feedback I've received, the common themes are as follows:
For my own part, there are a few interval-dependent projects that I've wanted to deploy in WASM for a long time, but of course I realize that the point here is to surface third parties with a similar problem. (I haven't complained about this on any forum, so my own problems in this regard are no more searchable than anyone else's.) It's a tough slog but I haven't given up.
I'd like to chime in to support this proposal. I'm a collaborator on inari, a Rust interval arithmetic library. We would definitely like to have this library work on WASM. I use inari to do computer-assisted proofs, and it would be great to be able to make some of them available to colleagues and students as web apps. I am not the only one in need of this: someone ported inari to WASM, but due to the lack of rounding primitives, the author of that port chose to forgo the inclusion guarantees that are essential to any valid interval arithmetic library. 😬 The author of inari agrees to port the library to WASM once the required primitives are in. I hope that directed rounding primitives will materialize in WASM. That will be good for the web, but also for sending executables to colleagues who don't program, so they can play with them safely, whatever their operating system (executing them with Wasmtime).
Thanks for that info @Chris00. I support the addition of a rounding immediate to all floating point bytecodes that do rounding. It turns out to also be useful in situations short of full-blown interval arithmetic; e.g. comparing a floating point number to an integer that cannot be represented exactly. E.g. if we compare Unfortunately because of the binary encoding, that means encoding rounding modes in immediates requires new, prefixed bytecodes. (in hindsight, I wish we had reserved a |
@Chris00 "That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives." It's possible to port inari to WASM even without Rust support (although Rust would be the next obvious target for the inclusion of rounding primitives). You could do this by manually coding a few lines of WASM instructions in order to obtain the desired behavior (once primitives are supported), then calling the resulting WASM module from Rust as needed with some sort of wrapper that works on the library or LLVM level. Then at some point, if Rust agrees to nativize support, you can remove the wrapper. Granted, this approach is a bit awkward, but you could have your library available on the Web before the Rust team even decides what they want to do. Right?

@titzer Yes indeed. Totally forgot about that. It happens if, say, your 64-bit integer won't fit into a 64-bit float, because the significand is only 53 bits. (This didn't happen on the 8087 math coprocessor from 1982, because it converted everything internally from 64-bit to 80-bit precision, unless you manually downshifted to 64-bit precision through the PC bits. But in SSE2, which I assume is how WASM implements its floating point, it's really just 64 bits, so you have the problem. Ditto on any other modern architecture I'm aware of.) At that point, the comparison result is sort of random because you can't control the rounding policy. Speaking of which, if this feature request gets implemented, then we should look at the existing integer-to-float and float-to-integer functionality in WASM and ensure that it, too, ends up with the same 4 modes as all the other floating-point instructions.
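The 53-bit significand problem can be demonstrated in a few lines of Python (a sketch of the pitfall itself, not of any proposed WASM behavior):

```python
# A 64-bit float has a 53-bit significand, so large 64-bit integers
# cannot all be represented exactly; conversion silently rounds.
big = 2**63 - 1
assert float(big) == float(2**63)  # rounded up to the nearest representable double

# Python compares int to float exactly, so it can see the loss:
assert big != float(big)

# At the 2**53 boundary, adjacent integers already collapse to one double:
assert float(2**53) == float(2**53 + 1)
```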
Anyone still not satisfied with the use cases on the table? @conrad-watt Can we get something off the ground here so that the Inari team can prototype their library, probably along the lines of the switch statement macro you proposed? Let me know what you need. |
I think we have enough motivation here to spin up a phase 1 proposal for per-instruction rounding mode control. That's just a matter of writing down the proposed instructions and putting a presentation and phase 1 poll on a CG meeting agenda. The tricky part will be after that when you have to find folks (possibly including yourself) with the time and motivation to work on implementations. |
@tlively Sorry for the slow reply. If you have any link to a previous phase 1 proposal that you think is well written, please share it. I'll try to make mine conform to the same format so as to avoid confusion. Once it's done I'll schedule a review and poll at a future CG meeting. I'm willing to help on the implementation (with the caveat that I'm hardly a WASM expert). |
There's no specific format you need to match, and the exact content will depend on the proposal. Here's an example of a phase 1 proposal presentation for a change to validation: https://docs.google.com/presentation/d/1-ajjGZpjAiGYOJlwswij9Mq6YGltmuELg3tbhu30VrI. That presentation was given at this meeting: https://github.com/WebAssembly/meetings/blob/main/main/2020/CG-09-29.md. Basically you want to cover 1) the motivation(s) for the proposal, 2) an initial design for what would be implemented, and 3) an explanation of how it would be used and by whom. |
@Chris00 @tlively (and all concerned). Working on this. I'm trying to get the required instructions down to the minimum possible complexity. The 4 standard RMs are a must, but there also needs to be sign inspection (mainly for multiplication and division of intervals).

I notice that WASM already has f32/64.copysign, but when I click on the link on this page, it redirects to a page about float instructions which doesn't discuss copysign itself. (The whole WASM site seems to have this weird CSS setting where you can't search for text within a page because the hits are not highlighted, making it very difficult to search for specific definitions.) In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

The other issue concerns -0.0 vs. +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world they're identical, because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often, because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.
One small piece of good news is that I now suspect we can avoid (1) any sort of "Get ULP" instruction and (2) the need for outward rounding (antichop). This is because you need to inspect the signs of the bounds of each interval that you're multiplying before doing so, so you already know how to round each of the resultant bounds in light of their signs. Any thoughts on this are welcome. Barring that, I'll just partition my proposal into "necessary" vs. "nice-to-have".
Another option is to
An efficient way to unify zeros is to add 0.0 to a number, since That said, I'm not very familiar with interval arithmetic, but this approach to interpreting negative zero surprises me. In the reciprocal function |
@sunfishcode You raised some important considerations. First of all, at least in AMD64, you can't just do comparisons on floats. There needs to be a signal propagation to the RFLAGS register (in the integer unit) in order to conduct a subsequent branch. There are lots of ways to do this, but the point is that it takes a number of instructions and, unfortunately, some heavy pipeline synchronization. (This is hidden to some extent by branch prediction, but it's still costly.) That's why I think we're stuck with setting an integer based on the result of the float comparison. Unless I'm mistaken, WASM doesn't have any equivalent of RFLAGS, so producing an integer result is the next most efficient approach:

"Get Logical Sign" means: "Copy the sign bit from the float into an integer, and clear all but the LSB of the integer."

"Get Arithmetic Sign" means: "If the float is -0.0 or +0.0, then set the integer to 0. Otherwise, if the float is negative (anything, even NaN), then set the integer to -1. Otherwise set the integer to 1."

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE 754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows, then I'll have to dig in.)

As to the 1/x limits thing, yeah, it's a serious problem. (Namely, math doesn't have a notion of -0.0, but IEEE 754 does, because it's used idiomatically as (-epsilon).) The fundamental question is whether or not you want to retain sign information. In other words, should the reciprocals of (-0.0, -0.0), (-0.0, +0.0), and (+0.0, +0.0) all be one and the same, or unique? We could just say "It's whatever the RM says it is under IEEE 754". And we'll have to. But does that then leave the library providers with cases that they can't efficiently handle? If so, then we may need another instruction. (I'm leaning towards that being unnecessary, but I'm not super confident that I haven't overlooked something.)
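For reference, here is a Python sketch of the two extractors as defined above (the names and semantics are the proposal's; the bit-twiddling realization via `struct` is just my illustration):

```python
import struct

def get_logical_sign(x: float) -> int:
    """Copy the sign bit of a 64-bit float into the LSB of an integer,
    clearing all other bits."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return bits >> 63

def get_arithmetic_sign(x: float) -> int:
    """0 for -0.0 or +0.0; otherwise -1 if the sign bit is set, else 1."""
    if x == 0.0:  # true for both +0.0 and -0.0
        return 0
    return -1 if get_logical_sign(x) else 1

assert get_logical_sign(-0.0) == 1
assert get_logical_sign(1.5) == 0
assert get_arithmetic_sign(-0.0) == 0
assert get_arithmetic_sign(-3.5) == -1
assert get_arithmetic_sign(2.0) == 1
```

Note that for NaN inputs, `x == 0.0` is false, so the sign bit alone decides, matching the "anything, even NaN" wording.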
I guess one way to resolve this impasse is to say that if you do want those intervals to be handled as one and the same, then you're going to have to first zero-unify the limits by adding +0.0 to both of them. What do you think? Ping @Chris00 |
Ah, ok. This looks a little more involved, but it probably also comes down to how much a dedicated opcode would help performance in realistic use cases.
It is standard in IEEE 754, and it's in the testsuite. Also, I figured out the error in my question about negative zeros. If there's a non-zero negative value which rounds up to -0, interval arithmetic would already have a non-zero lower bound that is less than it. So whenever you'd have a -0 lower bound, it really is the same as +0. There still are some tricky situations with approaching zero from the negative direction, but I expect these are just part of the trickiness that comes with treating "infinity" like a number. |
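A quick Python check of both facts discussed here: the two zeros compare equal but differ bitwise, and adding +0.0 unifies them under round-to-nearest:

```python
import math
import struct

# -0.0 and +0.0 compare equal but have different bit patterns:
assert -0.0 == 0.0
assert struct.pack('<d', -0.0) != struct.pack('<d', 0.0)

# Per IEEE 754, (-0.0) + (+0.0) is +0.0 under every rounding
# direction except round-toward-negative, so adding +0.0 unifies zeros:
z = -0.0 + 0.0
assert math.copysign(1.0, z) == 1.0  # sign bit is clear: it's +0.0
```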
First of all, I guess you've really found a way to avoid having "Unify Zeroes". Kudos!

Secondly, what about your f64.copysign(1.0, x)? According to the list of instructions, wouldn't that only work if x were a float? I see only one definition for f32/64.copysign, which is "[f32/64 f32/64] -> [f32/64]" (which really doesn't make sense because it has 2 inputs). But when I put "copysign" in the search dialog, I get a whole bunch of nonspecific links to nothing in particular about copysign. Someone will hopefully make the website easier to navigate at some point, but for now I guess I'm stuck unless I can dig out a more navigable spec somewhere (or even a lousy PDF). Anyways, let's sync this issue first so we understand what value a new opcode would or would not offer.

Third, back to -0.0. Consider this:

(-1.0, -0.9)/(+infinity) = (-0.0, -0.0)

But as you effectively pointed out, both of these intervals are just points, namely zeroes. This is literally correct interval arithmetic even after rounding (and IEEE 754 behaves as such, I believe). The problem is that now your code thinks the outputs are different because, hey look, the bits don't match (even if you f32/64.reinterpret)! So you branch off the wrong way and it's all downhill from there. However, I now think the solution is simply to use your +0.0 addition trick whenever and wherever it matters; we don't need any special instruction to fix this, even if the code has to worry about it all the time. Unless maybe @unageek (or anyone else lurking) has some problem with it.
The semantics of
It's also worthwhile noting that |
@tlively @titzer Thanks. Definitely copysign isn't going to fit the bill, unfortunately. It's necessary to branch based on the state of the signs prior to interval multiplication, for instance. Because this is going to be done as often as interval multiplication itself, it supports the creation of the aforementioned pair of "get sign"-type instructions. I'll be creating a proposal for CG review soon. Meanwhile, anyone with further concerns is invited to speak up.
I've been reading through the WASM instruction set reference (thanks @tlively) so as to try and not mess this up. Here's what I'm thinking:

There are a few ways to do this, but ultimately the simplest is probably with "metaops", i.e. instructions that tell the compiler how to compile. (This could be used for lots of other stuff in the future, like branch hints, cache proximity hints, reentrant compilation, atomic transactions, secure enclave implementation, etc.) We might as well get started with a generic approach that could be repurposed for all those other potential uses down the road. Rather than have a single opcode for all metaops, which would just result in another level of indirection, we can assign one for each specific purpose. By definition, metaops don't do anything visible in and of themselves; they have side effects which may manifest on subsequent instruction boundaries, or only in the analog world, e.g. power consumption or latency effects.

In this case, we need 4 opcodes: round nearest-or-even (implied in effect for backward compatibility), round negativeward, round positiveward, and chop. In the interest of simplicity, it doesn't matter if you use these as prefixes (i.e. immediately prior to a float instruction) or not. You could even be silly and put lots of them in a long chain after a return instruction, if you like. (Why tax the validator with such innocuous concerns?) In any case, their effect should persist until either a different mode is selected, or one of the following instructions is encountered: br, br_if, br_table, return, call, or call_indirect. Those instructions are guaranteed to reset back to nearest-or-even, because we don't want to deal with weird cases where a branch target ends up with a superposition of RMs, or a call instruction ends up bouncing off the OS and suffers some resulting reset of the float pipes (can that even happen?).
Ultimately, as I've said before, the optimizer should rearrange instructions so as to minimize the number of RM control register touches (but this is not required). And BTW, the RMs affect not only arithmetic instructions, but also the sort of conversions mentioned above, wherein the target lacks sufficient precision to fully represent the source. It's even possible that one RM will produce a finite approximation while another RM will generate an infinity in response to the same input. There are also the aforementioned sign extraction instructions, but those are straightforward. Thoughts?
@KloudKoder, the direction sounds good! As far as terminology goes, I would describe these new instructions as normal instructions that modify new rounding mode state made explicit in the store rather than as "metaops." We want to avoid the impression that the behavior of these instructions is meant to be outside the formally specified semantics. For features that actually are outside the specified semantics, there is a separate effort being led in the branch hinting proposal to develop a framework for code annotations. I want to emphasize that rounding mode control should not be using that framework because it does affect semantics. I believe @titzer previously expressed a preference for adding individual new arithmetic and conversion instructions with explicit rounding behavior rather than introducing new rounding mode state and instructions to modify it, and I share that preference. At least for optimizing engine tiers, I wouldn't expect this to have a performance difference, especially since you envision the rounding mode being reset at the end of blocks anyway. It would be great if you could comment on the trade-offs you see between these two designs as part of your presentation.
@tlively It sounds like you're advocating for 2 different schemes. (Sorry if I'm just misinterpreting your comment.) To clarify, do you actually want many "new arithmetic and conversion instructions with explicit rounding behavior" or 4 new instructions with "the rounding mode being reset at the end of blocks"? (One of the foregoing plus the 2 sign extractors, actually.) Yep, I got your point about metaops, which is fine. |
It's an unfortunate encoding mistake, in retrospect, that we didn't reserve a My preference against the implicit rounding mode state option is that it makes otherwise pure instructions dependent on that global state. I get it that most hardware works this way, but my impression is that hardware architects don't like that either. RISC-V actually has both: the rounding mode field is large enough to encode 4 rounding modes plus a dynamic rounding mode. AFAICT they did this because of the software dependency on managing the rounding mode state inherited from previous architectures. |
Strongly agree with what @titzer and others have said: an ambient mode state should be avoided, not least because it has terrible composability, i.e., gets in the way of transformations like inlining. |
@titzer @rossberg Fine by me, with the understanding that your validator now has to worry about the following:
Implicitly, prefixes have no lifetime beyond the prefixed instruction itself; it all reverts to nearest-or-even thereafter. OK? |
Alright, I think I've identified every single instruction whose behavior would be altered by a change in RM. They're as follows:

91 f32.sqrt

I couldn't find any documentation on the last 2, so I need to confirm that they're actually impacted by RM. (Are they taking 2 products of 2 i32s? Or concatenating them, or what?) If I've neglected or wrongly included something, please speak up.

In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for round-positive, round-negative, and chop, respectively. This would match the opcode order for the existing "ceil", "floor", "trunc", and "nearest" instructions provided in the spec (but FC would be skipped). This would, in turn, result in minimal complexity growth for upstream compilers. It would also minimize the pain of adding additional float or SIMD instructions in the future. The rules would be simple:
In the interest of simplicity, I'll post about sign extraction separately. |
Now for the sign extractors. While such instructions would be useful for a wide variety of applications other than interval arithmetic, I'm just going to focus on the latter. To summarize, there are 8 cases from which to choose upon each multiply:

(+a, +b) * (+c, +d) = (+ac, +bd)

(There are 16 cases if you include anti-intervals (+a, -b), which are sometimes necessary.) As shown, selecting a case requires extracting sign bits. (Doing an arithmetic comparison against +0.0 would be overkill, and actually doesn't always behave like sign bit extraction.)

Alternatively, you could simply compute ac, ad, bc, and bd, then apply both min and max to obtain the resulting interval. That's really expensive, though, not least because there's no "min/max of all lanes" SIMD instruction, so you need to serialize 2 min/max instructions. (Granted, you need to do so anyway in the last case, but it's rare.) To make matters worse, the 8 cases above all feature each variable with both signs, so it might be most efficient to use indirect branches from a sparsely populated table of 16 labels. Which brings us back to "really expensive". Even if the programmer knows that some of the cases are impossible, the compiler might not know that, so the best it can do is optimize for common cases such as the first one. Which is why, in practice, it's probably going to be fastest to just nest 4 predicates. Which, finally, is why sign extractors matter.

I should emphasize, again, that division is even worse, because dividing by any interval including +/-0.0 will give rise to an anti-interval, hence the need for an arithmetic sign extractor to complement the bitwise one. (Insert debate about conflicting limits from left and right here.) And this time, we can't even avoid the problem by using min/max. It's just ugly. I would just add that all of the above explains why I think actual interval instructions would be advantageous.
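For comparison, the "expensive" case-free fallback described above looks like this in Python (a sketch only: plain Python floats can't express the directed rounding a real interval library needs, so the bounds here are not outward-rounded):

```python
def interval_mul(x, y):
    """Multiply intervals [a, b] * [c, d] by taking the min and max of
    all four endpoint products -- avoiding the 8-way sign dispatch at
    the cost of four multiplies plus min/max reductions."""
    a, b = x
    c, d = y
    products = (a * c, a * d, b * c, b * d)
    return (min(products), max(products))

assert interval_mul((2.0, 3.0), (4.0, 5.0)) == (8.0, 15.0)      # all-positive case
assert interval_mul((-2.0, 3.0), (-4.0, 5.0)) == (-12.0, 15.0)  # mixed signs
```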
I know that's not a popular opinion, because doing so would not expose anything which is currently atomic on the CPU level (but it might be in the future). The problem is that there's no good idiom for "hey, I'm going to multiply 2 intervals, so do what I mean, not what I say". So the compiler has to take everything quite literally and cause loads of superfluous communication among floats, integers, flags, and the branch predictor. I thought I should make this clear, even if such instructions are categorically off the table for the moment, as I don't want anyone to be surprised when native assembly language is still multiples faster than even the RM-aware WASM equivalent.

OK, so here are the sign extractors that I'd like to propose, absent full-blown native interval support:

C5 f32.sgn (32-Bit Get Sign): This is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32". This instruction just copies the sign bit of f32 into i32 without regard to any other bits in f32.

C6 i32.asgn_f32 (32-Bit Get Arithmetic Sign): This instruction sets i32 to zero if and only if f32 is +/-0. Otherwise it's equivalent to (1-2*f32.sgn).

C7 f64.sgn (64-Bit Get Sign)

And the following instructions would help immensely, and not just for interval arithmetic, because lots of code branches depending on signs:

CD i32.psgn_f32x4 (Packed 4x32-Bit Get Sign): This indirect branch preparation instruction applies f32.sgn to lane N and deposits the result into bit N of i32.

CE i32.pasgn_f32x4 (Packed 4x32-Bit Get Arithmetic Sign): This indirect branch preparation instruction applies i32.asgn_f32 to lane N and compresses the results into the 2 corresponding successive bits of i32.

CF i32.psgn_f64x2 (Packed 2x64-Bit Get Sign)
We generally try to be conservative about using new prefix opcodes because we expect WebAssembly to be around for a long time and because they come out of the same space we could use for new single-byte opcodes. It would be better to put all the new instructions behind prefix FC, as @rossberg suggested. It might also be a good idea to leave out new SIMD instructions for now, unless you can show that there are efficient lowerings for the alternative semantics on common hardware and libraries that would use these instructions. |
@tlively Understood. I'll rehash everything under FC. And when you say to leave out the SIMDs, are you referring to only the sign extractors? I assume so because there are definitely efficient hardware instructions for ordinary arithmetic SIMDs of all RMs, at least on AMD64; these are also helpful for combining parallel interval arithmetic instructions at the source code level into a single hardware operation (to great effect, since 32 bits (x4) goes a long way when you have error containment). But please clarify. |
In general there is a high bar to add SIMD instructions. You’ll need to show that there are efficient lowerings for all the new SIMD instructions on Arm and x86, and ideally risc-v as well, with details about which ISA extensions would be necessary. If you can show that, then including the SIMD instructions should be fine. |
@tlively Well it turns out that the SIMD arithmetic instructions don't have efficient ARM8 equivalents. So I'll pull those out, but what about the sign extractors? Those are all integer operations and definitely have efficient (but not necessarily single-instruction) equivalents on all modern architectures. EDIT: On further thought, the SIMD sign extractors aren't useful enough to justify, absent custom RMs, so I'll kill them too. I'll come back later with an updated proposal. |
OK, how about this: The following RM-aware float instructions would all be prefixed with FC.

20 f32.sqrt_ceil

And likewise with these sign extractors, which have no need for an RM:

1C f32.sgn (32-Bit Get Sign (to i32))
Nice, looks good to me. One more note is that typically the prefixes of instruction names correspond to the result type of the instruction, so I would expect
It's worth considering making the rounding mode an immediate field rather than being folded into the opcode. Something like this:

FC 20 00 f32.sqrt_ceil

and so on.

Upsides: This makes it cleaner to extend the feature with new rounding modes in the future, if we ever want to do that. There are algorithms that use round-to-odd, round-away-from-zero, and other things. We don't have popular hardware for them today, but who knows what the future may hold. And this leaves more of the one-byte space under the FC prefix available for future features.

Downside: This makes the instructions one byte longer. However, floating-point operations don't tend to be the most frequent opcodes even in floating-point-heavy code, so it should have a low overall impact on code size.
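As a sketch of this layout, an encoder for the three-byte form might look like the following. The numbering is purely illustrative, and the sketch simplifies the real binary format, where opcodes after a prefix are LEB128-encoded rather than fixed single bytes:

```python
def encode_fp_op(opcode: int, rm: int) -> bytes:
    """Encode a hypothetical FC-prefixed float instruction with a
    trailing rounding-mode immediate byte, e.g. FC 20 00."""
    if not (0 <= opcode <= 0xFF and 0 <= rm <= 0xFF):
        raise ValueError("opcode and rm must fit in one byte")
    return bytes([0xFC, opcode, rm])

# "FC 20 00" -> the f32.sqrt_ceil example from the comment above
assert encode_fp_op(0x20, 0x00) == b'\xfc\x20\x00'
```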
@tlively Maybe I'm confused. So that's why I stated above that "[f32.sgn] is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32"." Are you saying that f32.sgn and friends should not be consistent with, say, f32.eq? Fine either way but I want to make sure. @sunfishcode Yeah, I thought of this but I figured that we would never see all the major silicon vendors supporting these other modes. And the opcode map is already sort of chaotic, so if antichop suddenly appears tomorrow, then we could just add the same instructions elsewhere in their own contiguous region of FC or whatever map. If there's a new float instruction (without a new RM), then we can just inject it after the last triplet. But if y'all still want a broken-out RM byte, then fine by me. And would you want to wrap the precision bit into the RM byte as well, so we don't need separate 32/64? |
Oh yeah, you’re right that |
You're not wrong :-), but my instinct is that the immediate field is worth doing.
Splitting out 32/64 isn't necessarily a bad idea, but since we didn't do that in the earlier float opcodes, my inclination would be to keep them separate here too. |
@sunfishcode OK we can do that. I don't see anyone complaining as of yet. Let me get a final presentation together and schedule for a CG meeting date. |
I've submitted a request to debate this proposal at the December 20 CG meeting. I also have an updated synopsis of the proposed changes based on the above discussion (but I'm not presently sure which document format to post where). |
Summarizing the current proposal for those in the CG meeting who want something in writing:

This proposal is for the compiler prototyping of floating-point instructions which mirror those already in the spec, but with distinct rounding modes (RMs) which differ from the universally assumed nearest-or-even policy. The RMs involved include the 3 others which all major CPUs seem to support, namely "ceil" (round toward positive infinity), "floor" (round toward negative infinity), and "trunc" (chop toward zero without changing the sign bit). SIMD instructions aren't included because they don't yet have succinct equivalents on ARM.

All of these instructions will use the FC opcode prefix so as to conserve space in the main map. This prefix is followed by the secondary opcode, and finally an RM byte indicating one of the aforementioned RMs (or perhaps yet another one in the future). The opcode map ordering is identical to that of their corresponding progenitors so as to minimize compiler complexity. As previously, a suffix of "s" or "u" implies "signed" or "unsigned" operation, respectively. The proposed map is thus as follows:

20 00 f32.sqrt_ceil

Considering that all of these instructions are directly relevant to interval arithmetic, the following primitives should be included as well, which would be frequently used in that context, e.g. upon every multiply operation. They are also prefixed with FC but don't contain (and don't need) an RM byte. In the tradition of branch preparation instructions, f32.sgn and f64.sgn have abbreviated names rather than "i32.sgn_f32" and "i32.sgn_f64", respectively. (The arithmetic sign instructions technically aren't branch preparation instructions because they return ternary outputs, so they conform to the usual notation.)

1C f32.sgn (32-Bit Get Sign Bit (to i32)) |
Sorry this has taken so long. Anyhow, we now have a sandbox repo (big thanks to @dschuff ). https://github.com/WebAssembly/rounding-mode-control I've opened an issue over there for discussion about how to hack up a prototype. All, feel free to dive in and post your thoughts. I assure you that I have absolutely no idea what I'm doing with respect to the WASM codebase. At least, I can help with formal revisions to the PDF specification, as well as testing. |
In case you are still looking for potential users, CGAL (computational geometry, meshing) is a C++ library that heavily relies on interval arithmetic (and exceptions) for performance: the fallback is bignum rationals. We have some slow emulation so the code can at least run, but that's really not good. We have had several users ask for wasm support for demos or web applications, and several told us they had to switch to a different (inferior) solution because of this. We do not mind if it is necessary to write a bit of code specifically for wasm, we already have plenty of platform-specific code in that area. On x86_64, we even use rounding for SIMD operations, but for wasm, rounding of scalar operations would already be great. (sorry, I won't have time to help much, all I can offer is to try porting CGAL's Interval_nt once there is support for both compiling and evaluating, although I have no experience with wasm) |
@mglisse Cool! Sounds like CGAL would make a great test case and eventual use case. Ping us here again when we have a prototype working. I got my work cut out for me, in the meantime. |
@mglisse @Chris00 Guys, do you have any preference with respect to rounding toward positive or negative infinity? I ask because @whirlicote is going to be writing some optimized implementations for common use cases, and it all works faster if he sticks to a single rounding mode. (When rounding toward positive infinity, rounding toward negative infinity can sometimes be implemented by just negating before and after the operation, so you can do that cheaply instead of switching the RM expensively. And similarly in the opposite case.) This will, at least initially, be done via optimization hints in a JavaScript context.

We don't want to offer optimization for both cases because that would invite a divergence of coding practice which serves no constructive purpose. So it's either favoring ceil or favoring floor. My personal preference is ceil because it's slightly more useful (on account of positive-only quantities which require bounding), but I don't have a strong opinion either way. To be clear, functional support for both will be provided, but we'll only offer optimization for one of them in addition to nearest-or-even. |
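For readers unfamiliar with the negation trick mentioned above, here is a sketch using Python's decimal module as a stand-in for hardware rounding modes (low precision so the rounding is visible): a round-toward-negative-infinity result is obtained while the context stays pinned to round-toward-positive-infinity.

```python
from decimal import Decimal, getcontext, ROUND_CEILING, ROUND_FLOOR

# floor_round(a + b) == -ceil_round((-a) + (-b)):
# the mode never has to leave "ceil" to get a "floor" result.
getcontext().prec = 3  # 3 significant digits so rounding is observable

a, b = Decimal("1.00"), Decimal("0.001")  # exact sum is 1.001

getcontext().rounding = ROUND_FLOOR
down_direct = a + b                     # 1.00, rounded toward -inf

getcontext().rounding = ROUND_CEILING   # stick to a single rounding mode
up = a + b                              # 1.01, rounded toward +inf
down_via_ceil = -((-a) + (-b))          # 1.00: negate, round up, negate back

assert down_direct == down_via_ceil == Decimal("1.00")
assert up == Decimal("1.01")
```

The same identity holds for binary floating point, which is why favoring one direction costs only a pair of negations rather than an RM switch.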
I can work with either. Currently in CGAL for intervals we use only rounding towards +inf (and the negation trick you mention), so it would be slightly easier to stick to that. The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one... |
In inari, rounding towards +∞ is most often used, so the library will be simpler to adapt if this is available. CC: @unageek |
@mglisse @Chris00 Thanks for your feedback. So it looks like ceil should be the favored way. "The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one." Indeed. whirlicote was concerned about that as well. He also plans to include some reference code in public notes to the spec, in order to ensure that everyone is optimizing for the same expected compiler behavior, which I think will further serve to avoid your feared outcome. For updates on the proposal's progress, see: |
I'm not sure I fully understand the implications of the last handful of messages, but if you're saying that you're adding (among others) Taking a step back, this would be an example of a very important effect of prototyping: if it turns out that a design that looked good on paper is either not feasible (e.g., because changing the control mode very frequently is too expensive on actual hardware) or not necessary in practice, then this feedback should definitely be used to make changes to the original design. (If I'm misunderstanding what you're saying, then never mind and carry on.) |
@jakobkummerow It's my fault for not explaining better. What we're proposing is that add, subtract, multiply, and divide will not be optimizable for floor because their floor flavors will be implemented as negate+ceil+negate. In practice, this means that interval arithmetic will be fast because we avoid RM switching with these common operations. sqrt (and any future transcendental functions) will not be affected because there is no economical trick to pull; we will have to pay the full cost of RM switching if we want to change from ceil to floor and back. In the long term, there could be another proposal to reorder the code so as to elide as much of that switching as possible. Or, perhaps more realistically, the CPU marketshare pie will evolve in favor of those chips which feature per-instruction RM decorations (such as nVidia or RISC-V) rather than this global RM foolishness. |
I was just using |
@jakobkummerow OK, I see your point. While there won't be any significant performance difference with sqrt (or future transcendentals), there will in fact be superior performance with basic arithmetic using ceil as opposed to floor (when you use appropriate hints to optimize for ceil). There will be no equivalent hint available for optimizing for floor. However, and I know this sounds counterintuitive, floor will be as fast as it can possibly be if you optimize for ceil. This is because, if we got rid of basic arithmetic with floor, then each developer would need to implement negate+ceil+negate at the source code level, resulting in performance that's at least as poor, combined with much more verbosity and thus opportunity for error; or else drop the optimization entirely and let us change and restore the global RM on every RM-sensitive instruction, which would be far worse still.

You might rightly wonder why floor and ceil should not have symmetric performance. The reason is that the only way to enforce that symmetry is to change the RM to ceil, then back to floor, then back to ceil, and so on, all over the place; whereas it's actually faster just to choose one of them and stick with it. In other words, the negates end up being far cheaper than all the would-be RM switching. That's likely to be the case even if we use heuristics and instruction reordering to minimize the number of switches required, simply because the most common use case (interval arithmetic) needs both directions for each and every basic operation. |
Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work. IIUC you're suggesting that engines should implement |
No. The difference in performance is small: something like two negate instructions, which is negligible compared to a pipeline flush.
No. I expect _floor to still be faster than negate-ceil-negate on hardware with AVX-512 support.
No. The performance penalty is only about two negations (on certain hardware).
Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work.
It is not far slower. You are expected to use _floor and _ceil. This is to make sure that the same code runs fast on different hardware.
This optimization is only relevant on certain hardware for certain use cases; it only helps interval-arithmetic-heavy code, for example. It is an easy but effective optimization for production engines. With respect to this proposal, it only merits a NOTE in the spec.
I believe this problem can be solved with documentation, or by simply trying out both variants and measuring which is faster in one's code. |
This is the issue for testing out approaches to floating-point rounding mode control, which is required for performant interval arithmetic. The original informal discussion is found here.
For an understanding of the practical ramifications of interval arithmetic including some use cases, I suggest this YouTube presentation by G. Bill Walster.
For starters, Thomas Lively identified the Boost library, boost::interval. An example of its rounding control approach is here.
Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test.

Conrad Watt has suggested a switch-based approach (for testing, not actual implementation) for every floating-point instruction, wherein the subject of the switch is a rounding control (RC) global variable (nearest-or-even, round toward negative infinity, round toward positive infinity, or chop toward zero). His hypothesis is that the WASM compiler will eliminate many of the switch statements because the rounding mode (RM) at any given point is usually known at compile time. This might at least allow us to do some crude benchmarking.
A more painful but performant approach would be to have the same virtualized RC register, but to reorder instructions where possible in order to minimize the frequency of touching the actual hardware RC. This would also compile seamlessly in architectures in which the RM is embedded in the floating-point instruction opcode itself (in which case the reordering would be redundant).
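A toy illustration of why that reordering pays off: if the hardware RC register is written only when the mode actually changes, grouping operations that share a mode cuts the number of writes. (Any real reordering must of course respect data dependencies; instruction names here are made up.)

```python
def mode_switches(ops):
    """Count hardware RC writes when consecutive ops sharing a mode reuse it."""
    switches, current = 0, None
    for rm, _op in ops:
        if rm != current:
            switches += 1
            current = rm
    return switches

# Alternating modes force a switch per op; batching halves them here.
naive = [("ceil", "add"), ("floor", "mul"), ("ceil", "sub"), ("floor", "div")]
batched = [("ceil", "add"), ("ceil", "sub"), ("floor", "mul"), ("floor", "div")]
print(mode_switches(naive), mode_switches(batched))  # 4 2
```

On architectures where the RM is an instruction field rather than global state, this count is always effectively zero, which is why the reordering would be redundant there.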
At this point, we're only concerned with RISC, but as Deepti Gandluri pointed out, any mature implementation would need to address the same issues with SIMD, wherein each parallel operand adheres to a common but custom RM.
@conrad-watt @tlively @jakobkummerow @dschuff @dtig