Floating-point rounding mode control prototyping #1456

Open · KloudKoder opened this issue Aug 16, 2022 · 57 comments

@KloudKoder
This is the issue for testing out approaches to floating-point rounding mode control, which is required for performant interval arithmetic. The original informal discussion is found here.

For an understanding of the practical ramifications of interval arithmetic including some use cases, I suggest this YouTube presentation by G. Bill Walster.

For starters, Thomas Lively identified the Boost library, boost::interval. An example of its rounding control approach is here.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test. Conrad Watt has suggested a switch-based approach (for testing, not actual implementation) for every floating-point instruction, wherein the subject of the switch is a rounding control (RC) global variable (nearest-or-even, round toward negative infinity, round toward positive infinity, or chop toward zero). His hypothesis is that the WASM compiler will eliminate many of the switch statements because the rounding mode (RM) at any given point is usually known at compile time. This might at least allow us to do some crude benchmarking.

A more painful but performant approach would be to have the same virtualized RC register, but to reorder instructions where possible in order to minimize the frequency of touching the actual hardware RC. This would also compile seamlessly in architectures in which the RM is embedded in the floating-point instruction opcode itself (in which case the reordering would be redundant).

At this point, we're only concerned with RISC, but as Deepti Gandluri pointed out, any mature implementation would need to address the same issues with SIMD, wherein each parallel operand adheres to a common but custom RM.

@conrad-watt @tlively @jakobkummerow @dschuff @dtig

@tlively
Member

tlively commented Aug 16, 2022

Since the actual proposal is to add more control of rounding modes to Wasm, it would probably make sense to broaden the scope from interval arithmetic to any use of different rounding modes. Casting a wider net will help find more motivating use cases.

Feel free to add references to practical use cases or libraries related to intervals. The ultimate output of this issue, hopefully, will be a performance test.

I'm not as concerned with trying to do more performance tests at this point, since you convincingly explained how slow emulating different rounding modes would be. I'm more concerned right now about having concrete motivating use cases. Ideally we would find someone with something (an application, a library, a language, etc) they would like to port to Wasm for whom the lack of rounding mode control is a significant hurdle. Even with a perfect design and convincing benchmarks, it would be hard to motivate engine and tool implementers to work on this (or anything else) without such a concrete use case.

@KloudKoder
Author

I hear that.

For now, some history of the Boost interval package going back to 1999 or so (beware, not HTTPS):

http://www.mscs.mu.edu/~georgec/IFAQ/maurer1.html

"Originally, I only intended the package to assess rounding errors in lengthy computations. Since then, I've learned that there are other application domains where interval arithmetic is useful" <- And with this quaint understatement, the analog security era was born!

The point being simply that this is hardly a new innovation (not that anyone here seems to think that). Note that G. Bill Walster was deeply involved. More to come by way of current examples.

@KloudKoder
Author

To address your specific comment about broadening the scope, I certainly don't oppose that. I just don't personally know of any practical uses for RM tweaking apart from interval arithmetic. The one sort of exception would be integer rounding. On X86, for one: "The FRNDINT instruction returns a floating-point value that is the integral value closest to the source value in the direction of the rounding mode specified in the RC field". Not sure how float-to-int is implemented in WASM, but in theory it would be amenable to RC flavors just like any other floating-point instruction, though with a very different effect (nudging by up to 1.0, as opposed to one unit in the last place (ULP)).
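For illustration, the four flavors of float-to-int conversion can be sketched in plain C without touching the hardware RC at all: a cast truncates toward zero, and the directed flavors derive from it. Function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical float->int conversions under the four rounding flavors.
   A plain C cast truncates toward zero; the others are derived from it. */
static long long f2i_trunc(double x) {          /* chop toward zero */
    return (long long)x;
}
static long long f2i_floor(double x) {          /* toward -infinity */
    long long t = (long long)x;
    return ((double)t > x) ? t - 1 : t;
}
static long long f2i_ceil(double x) {           /* toward +infinity */
    long long t = (long long)x;
    return ((double)t < x) ? t + 1 : t;
}
static long long f2i_nearest(double x) {        /* nearest, ties away from
                                                   zero (NOT nearest-or-even;
                                                   shown only for contrast) */
    return (x < 0.0) ? f2i_trunc(x - 0.5) : f2i_trunc(x + 0.5);
}
```

This only covers values that fit in the integer range; the point is simply that the choice of flavor moves the result by up to a whole unit.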

@KloudKoder
Author

TL;DR: Some algorithms are polynomially slower without interval arithmetic; others simply don't work.

Alright, I promised to list some practical examples. On the one hand, you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM because, like, what purpose would that serve without going through the entire process that we're literally going through here and now? From the public discourse I've read on various forums, it's clear that people just assume they need a specialized library to run intervals in a performant manner on a particular piece of hardware. Pending any change to WASM, that seems to be just how it is.

Therefore, in the absence of any published rant about this, I'll attempt to provide the next best thing, which is a list of compelling applications that I've gathered from public sources. Here goes:

  • Adaptive exploration. I've touched on this previously, but its significance can hardly be overstated for the sake of performance. We're not talking hundreds of percent to be gained here (as would potentially be the case with an efficient implementation of rounding flavors). We're talking about changes to the computational complexity of an algorithm. And why? Because adaptive exploration means that you can say "Search around this space until you know that the global optimum is at least X and at most Y" instead of "Search this space exhaustively and return whatever optimum that you find". Think about what that implies: you can quit early if you learn that the minimum "around here" is too big to be the global minimum (or maximum). This can result in actual polynomial acceleration of a variety of exploratory algorithms (and especially recursive ones). For example, you can write a raytracer that says "Trace this light ray back from this pixel to all its reflections in the scene until you know the color to within the limits of the color encoding used on the display". That's revolutionary and extremely fast compared to the traditional and rather arbitrary "Trace this ray back through the first N reflections".

  • Cognitive safety. Imagine you have a neural network which is classifying objects in front of a moving car. Suddenly, an object appears which seems to be a man crossing the street. Do you slam on the brakes, risking injury to the driver? Or do you ignore the classification as a false positive, and risk injuring the man? For safety's sake, you need to know the interval on which the classification rests. At minimum, a wide confidence interval should be fed back to the manufacturer in order to alert it to the poor performance of the classifier in this situation.

  • Physical performance guarantees. You're designing an engine that needs to be at least 21% energy efficient. You run millions of engine simulations in parallel, using genetic optimization to discover the best combination of hyperparameters. Finally, you discover one that's actually 23% efficient. But how do you know that's not just numerical error? Unless you're using intervals, you don't.

  • Nonlinear equation solvers. Suppose you have a set of differential equations, such as Navier-Stokes for fluid dynamics. Most of these systems, such as flames dancing on a candle, end up being subject to the butterfly effect. However, chaos doesn't evolve instantly, so there's a finite time window during which one can actually predict the evolution of the system with an acceptable degree of uncertainty. Without interval arithmetic, you won't have any idea about the extent of that window. But with its help, the future state of the system will become increasingly blurry, such that you can literally see the butterfly effect taking hold at each subsequent timestep.

  • Matrix operations. Complicated operations on large matrices can cause numerical errors to compound significantly, for example, when determining an eigenvalue or a determinant. Without interval arithmetic, you don't know what the error is, which could be huge in cases where large matrix elements cancel except for some tiny residue. Also, even ambiguity in the sign of a determinant which is close to zero can create havoc in predicate-based exploration ("If negative, move to this halfspace, else move to that halfspace...").

  • Path planning. This comes up in video games, among other applications. You have an object, say a spaceship, that needs to fly through a 3D maze. But it needs to do so without overlapping the solid walls of the maze. The spaceship is a complicated object but it can be bounded by a 3D interval for the sake of expeditious collision detection. Take away interval arithmetic, and you don't know if an overlap actually occurs or not, which could cause the code to take the wrong path relative to what the user expects to occur, for example causing the spaceship to explode even though it wasn't moving fast enough to hit the wall.

  • Root searches. One famous example is the Riemann Hypothesis, which states that the nontrivial zeroes of a particular complex function all have real part 0.5. While this can't be proven numerically (because there are an infinite number of them), searches for counterexamples are possible (for example, Zeta Grid). One can start anywhere in the relevant domain, converge on a root, and in theory demonstrate that the real part is something other than 0.5. (There is literally a $1M prize on the line.) But ordinary floats can't give you this. You need intervals to prove that the real part of your root is, say, between 0.6 and 0.7.

  • Deterministic convex optimization. Convex optimization is the process of finding the "bottom of the bowl", so to speak, in N dimensions. Conceptually, it's straightforward because convex functions have, by definition, just one global minimum (with some trivial exception cases). So you can chase the derivative using Newton's method and end up there pretty quickly. But in reality, because of accumulating numerical error and various quantization effects, it could happen that your estimate of the global minimum (or maximum) suddenly starts getting less and less accurate as you continue attempts to refine it. But you don't know that unless you're using interval arithmetic. With its help, you can watch as the interval on which the minimum lives shrinks with each iteration, until you hit a timestep at which it starts to expand, so you need to stop the algorithm. Otherwise it's just "iterate N times and hope for the best".

Is that helpful?

@jakobkummerow

Is that helpful?

It's a list of situations where interval arithmetic may be useful. Nobody argued that interval arithmetic is never useful. (I think some of the specific entries on this list are quite debatable, but that's beside the point.)

What folks were asking for is a concrete example of an application/library/etc that would like to adopt Wasm but currently can't because of lack of control over rounding modes.

To state it very clearly, there are two issues:
(1) People across the ecosystem are busy. There are many possible things to work on (just look at the list of in-flight proposals!). In order to convince anyone to work on this, some evidence is needed why this is impactful. "I think it could be useful for X, Y, Z" does not qualify as sufficient evidence, neither does "I think this will be useful once Wasm runs on GPUs".
(2) Assuming someone (maybe yourself?) starts working on this, to make a proposal advance through the phases, feedback will be required: is it working correctly? Is it as helpful as expected? Is it performing well enough? If all you have is a proposed specification and a prototype implementation, you can't really answer these questions. There'll have to be a project that says "we previously couldn't adopt Wasm, but now we can thanks to this" or "our previous Wasm port was slow, but adopting this feature yielded great improvements", or similar. (Of course, it could also happen that the feedback is "we thought this would unblock us, but then we experimented with the prototype and it turned out that it doesn't actually solve our problem", which would be very useful feedback as well, and could cause either redesign or abandoning of the proposal.)

To reiterate: nobody has said that we should never add support for control over rounding modes to Wasm; on the contrary, people were generally supportive of this feature. But to actually start working on it, there needs to be evidence of concrete usefulness. And to decide what the details should look like, there needs to be a collaboration with actual users of the feature.

If your statement:

you're obviously not going to find people ranting that they can't get intervals to be performant (or even correct, due to the lack of compliant libraries) in WASM

is another way of saying "yeah, admittedly, nobody is asking for this", then realistically that means that nobody is going to work on it either, unless and until someone does ask for it.

@KloudKoder
Author

I have a few progress updates on this. Basically I've been trying to contact the team leaders or authors of various interval applications across a variety of languages. Based on the feedback I've received, the common themes are as follows:

  • Some have made it clear that they won't even discuss this with the WASM CG because they see it as hopeless, based on similar discussions in the past. For example, there was apparently a proposal long ago to add rounding mode support to C++, but it got into the same chicken-and-egg problem of no developers because of no compiler support and no compiler support because of no developers.

  • However, this position isn't universal. I've had a number of papers pushed my way with recommendations of "just show them this and they'll see why this support is so important". When I've explained that it's actually examples of uncompilable/nonperformant code that the team wants, the developers I contacted don't seem to want to discuss their designs. I guess I shouldn't be surprised at that.

  • Two teams in particular ended up having to abandon floating-point entirely due to various limitations. One of them ended up writing critical analog logic in custom fixed-point code, and the other ended up just using integer intervals instead (which only works efficiently for a subset of interval applications). Needless to say, neither deployed a WASM application using floating-point interval support.

For my own part, there are a few interval-dependent projects that I've wanted to deploy in WASM for a long time, but of course I realize that the point here is to surface third parties with a similar problem. (I haven't complained about this on any forum, so my own problems in this regard are no more searchable than anyone else's.)

It's a tough slog but I haven't given up.

@Chris00

Chris00 commented Oct 7, 2022

I'd like to chime in to support this proposal. I'm a collaborator on inari, a Rust interval arithmetic library. We would definitely like to have this library work on WASM. I use inari to do computer-assisted proofs, and it would be great to be able to make some of them available to colleagues and students as web apps. I am not the only one in need of this: someone ported inari to WASM but, due to the lack of rounding primitives, the author of that port chose to forgo the inclusion guarantees that are essential to any valid interval arithmetic library. 😬

The author of inari agrees to port the library to WASM once the required primitives are in core::arch::wasm32, and I agree to perform some testing. That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives.

I hope that directed rounding primitives will materialize in WASM. That will be good for the web, but also for sending executables to colleagues who don't program, so they can play with them safely, whatever their operating system (executing them with Wasmtime).

@titzer

titzer commented Oct 7, 2022

Thanks for that info @Chris00.

I support the addition of a rounding immediate to all floating point bytecodes that do rounding. It turns out to also be useful in situations short of full-blown interval arithmetic; e.g. comparing a floating point number to an integer that cannot be represented exactly. E.g. if we compare f < i, then we want to round i towards positive infinity before doing a float comparison against f, and if we have i < f, we want to round i towards negative infinity before comparison.

Unfortunately, because of the binary encoding, encoding rounding modes in immediates requires new, prefixed bytecodes. (In hindsight, I wish we had reserved a 0 byte for floating point opcodes.)

@KloudKoder
Author

@Chris00 "That implies that you will have to collaborate with the developers of Rust so we can make use of the rounding primitives." It's possible to port Inari to WASM even without Rust support (although Rust would be the next obvious target for the inclusion of rounding primitives). You could do this by manually coding a few lines of WASM instructions in order to obtain the desired behavior (once primitives are supported), then calling the resulting WASM module from Rust as needed with some sort of wrapper that works on the library or LLVM level. Then at some point if Rust agrees to nativize support, you can remove the wrapper. Granted, this approach is a bit awkward, but you could have your library available on the Web before the Rust team even decides what they want to do. Right?

@titzer Yes indeed. Totally forgot about that. It happens if, say, your 64-bit integer won't fit exactly into a 64-bit float because the mantissa is only 53 bits. (This didn't happen on the 8087 math coprocessor from 1982 because it converted everything internally from 64-bit to 80-bit precision, unless you manually downshifted to 64-bit precision through the precision control (PC) bits. But in SSE2, which I assume is how WASM implements its floating-point, it really is just 64 bits, so you have the problem. Ditto on any other modern architecture I'm aware of.) At that point, the comparison result is sort of random because you can't control the rounding policy. Speaking of which, if this feature request gets implemented, then we should look at the existing integer-to-float and float-to-integer functionality in WASM and ensure that it, too, ends up with the same 4 modes as all the other floating-point instructions.

@KloudKoder
Author

Anyone still not satisfied with the use cases on the table?

@conrad-watt Can we get something off the ground here so that the Inari team can prototype their library, probably along the lines of the switch statement macro you proposed? Let me know what you need.

@tlively
Member

tlively commented Oct 12, 2022

I think we have enough motivation here to spin up a phase 1 proposal for per-instruction rounding mode control. That's just a matter of writing down the proposed instructions and putting a presentation and phase 1 poll on a CG meeting agenda. The tricky part will be after that when you have to find folks (possibly including yourself) with the time and motivation to work on implementations.

@KloudKoder
Author

@tlively Sorry for the slow reply. If you have any link to a previous phase 1 proposal that you think is well written, please share it. I'll try to make mine conform to the same format so as to avoid confusion. Once it's done I'll schedule a review and poll at a future CG meeting.

I'm willing to help on the implementation (with the caveat that I'm hardly a WASM expert).

@tlively
Member

tlively commented Oct 17, 2022

There's no specific format you need to match, and the exact content will depend on the proposal. Here's an example of a phase 1 proposal presentation for a change to validation: https://docs.google.com/presentation/d/1-ajjGZpjAiGYOJlwswij9Mq6YGltmuELg3tbhu30VrI. That presentation was given at this meeting: https://github.com/WebAssembly/meetings/blob/main/main/2020/CG-09-29.md.

Basically you want to cover 1) the motivation(s) for the proposal, 2) an initial design for what would be implemented, and 3) an explanation of how it would be used and by whom.

@KloudKoder
Author

KloudKoder commented Oct 30, 2022

@Chris00 @tlively (and all concerned). Working on this. I'm trying to get the required instructions down to as little complexity as possible. The 4 standard RMs are a must, but there also needs to be sign inspection (mainly for multiplication and division of intervals). I notice that WASM already has f32/64.copysign, but when I click the link on this page, it redirects to a page about float instructions which doesn't discuss copysign itself. (The whole WASM site seems to have this weird CSS setting where you can't search for text within a page because the hits are not highlighted, making it very difficult to search for specific definitions.)

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

One small piece of good news is that I now suspect we can avoid (1) any sort of "Get ULP" instruction and (2) the need for outward rounding (antichop). This is because you need to inspect the signs of the bounds of each interval that you're multiplying before doing so, so you already know how to round each of the resultant bounds in light of their signs.

Any thoughts on this are welcome. Barring that, I'll just partition my proposal into "necessary" vs. "nice-to-have".

@sunfishcode
Member

In any event, f32/f64.copysign isn't sufficient because the output is just another float. What's needed is a way to efficiently read the high bit of a float. I guess this can be done via f32/64.reinterpret_i32/64, but that sort of idiomatic casting makes it hard for the compiler to do its job efficiently, so "Get Logical Sign" and "Get Arithmetic Sign", both from float to integer, would be useful to have.

Another option is to copysign the sign bit onto a 1.0 value, and then compare that. That said, I wouldn't necessarily object to a "read the sign bit into an integer" instruction. I'd guess that comparing as less-than-zero using the lt instruction suffices for the "arithmetic sign".

The other issue concerns -0.0 vs +0.0. There are ways of redefining the floats so that this distinction is meaningful, but in the interval world, they're identical because they amount to the same boundary. Allowing the distinction to persist throughout a given computation can create the appearance of unequal bounds, which can result in improper branching. (For example, the reciprocal of (-0.0, +0.0) is the reals, whereas the reciprocal of (+0.0, +0.0) is +infinity.) Therefore, we need a way to flush them both to +0.0. This would likely occur often because multiplication (and division) have radically different results depending on the particular signs of the respective upper and lower bounds. Therefore I'm leaning toward the need for a "Unify Zeroes" instruction, but again, it could be done through either (1) more branching on every single multiply or (2) hacking -0.0 to +0.0 by passing through the integer world.

An efficient way to unify zeros is to add 0.0 to a number, since -0.0 + 0.0 evaluates to 0.0.
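This is easy to check in C. One caveat worth flagging: IEEE 754 gives the sum of two opposite-signed zeros a positive sign in every rounding mode except round-toward-negative-infinity, where it is -0.0, so the trick assumes the addition itself doesn't run under that mode.

```c
#include <assert.h>
#include <math.h>    /* signbit, INFINITY are macros; no libm link needed */

/* Unify -0.0 and +0.0 by adding +0.0. Under IEEE 754 the exact-zero sum
   of opposite-signed operands is +0.0 in all rounding modes except
   round-toward-negative-infinity (where it is -0.0), so this must not
   run with the mode set downward. */
static double unify_zero(double x) {
    return x + 0.0;
}
```

After unification, the reciprocal hazard disappears: 1.0 / unify_zero(-0.0) is +infinity rather than -infinity.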

That said, I'm not very familiar with interval arithmetic, but this approach to interpreting negative zero surprises me. In the reciprocal function f(x) = 1/x, the limit as x approaches zero isn't just positive infinity. It's positive infinity or negative infinity, depending on which direction you approach zero from. I would have guessed that interval arithmetic would want to represent a conservative bound, which would mean you wouldn't want to ignore the possibility that a value with an interval at zero might represent a zero approached from the negative side, or even just a finite negative value which got rounded to zero.

@KloudKoder
Author

@sunfishcode You raised some important considerations.

First of all, at least in AMD64, you can't just do comparisons on floats. There needs to be a signal propagation to the RFLAGS register (in the integer unit) in order to conduct a subsequent branch. There are lots of ways to do this, but the point is that it takes a number of instructions and, unfortunately, some heavy pipeline synchronization. (This is hidden to some extent by branch prediction, but it's still costly.) That's why I think we're stuck with setting an integer based on the result of the float comparison. Unless I'm mistaken, WASM doesn't have any equivalent of RFLAGS, so producing an integer result is the next most efficient approach:

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."
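Taking these two definitions at face value, a C sketch of both proposed instructions looks like this (the bit-level view uses memcpy, the portable equivalent of reinterpret; names mirror the proposal, not any existing opcode):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>  /* memcpy for the bit-level view of the double */

/* "Get Logical Sign": the raw sign bit, as an integer 0 or 1. */
static int logical_sign(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (int)(bits >> 63);
}

/* "Get Arithmetic Sign": 0 for either zero, -1 if the sign bit is set
   (including a negative NaN), else 1 -- per the definition above. */
static int arithmetic_sign(double x) {
    if (x == 0.0) return 0;          /* true for both -0.0 and +0.0 */
    return logical_sign(x) ? -1 : 1;
}
```

The point of dedicated opcodes would be to let an engine lower these to one or two integer instructions instead of a float compare plus branch.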

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

As to the 1/x limits thing, yeah, it's a serious problem. (Namely, math doesn't have a notion of -0.0, but IEEE754 does because it's used idiomatically as (-epsilon).) The fundamental question is whether or not you want to retain sign information. In other words, should the reciprocals of (-0.0, -0.0), (-0.0, +0.0), and (+0.0, +0.0) all be one and the same, or unique? We could just say "It's whatever the RM says it is under IEEE754". And we'll have to. But does that then leave the library providers with cases that they can't efficiently handle? If so, then we may need another instruction. (I'm leaning towards that being unnecessary, but I'm not super confident that I haven't overlooked something.)

I guess one way to resolve this impasse is to say that if you do want those intervals to be handled as one and the same, then you're going to have to first zero-unify the limits by adding +0.0 to both of them.

What do you think?

Ping @Chris00

@sunfishcode
Member

"Get Logical Sign" means: "Copy sign bit from float into integer, and clear all but the LSB of the integer".

f64.copysign(1.0, x) < 0 implements this. But I wouldn't necessarily be opposed to adding a dedicated opcode for this if there's a performance case for it.

"Get Arithmetic Sign" means: "If float is -0.0 or +0.0, then set integer to 0. Otherwise, if float is negative (anything, even NaN), then set integer to -1. Otherwise set integer to 1."

Ah, ok. This looks a little more involved, but it probably also comes down to how much a dedicated opcode would help performance in realistic use cases.

I actually tested (-0.0)+(+0.0) and got +0.0. I'm not in a position to expediently verify whether this is standard in IEEE754, but if it is, then you deserve major credit for eliminating the need for "Unify Zeroes"! (If nobody else knows then I'll have to dig in.)

It is standard in IEEE 754, and it's in the testsuite.

Also, I figured out the error in my question about negative zeros. If there's a non-zero negative value which rounds up to -0, interval arithmetic would already have a non-zero lower bound that is less than it. So whenever you'd have a -0 lower bound, it really is the same as +0. There still are some tricky situations with approaching zero from the negative direction, but I expect these are just part of the trickiness that comes with treating "infinity" like a number.

@KloudKoder
Author

First of all, I guess you've really found a way to avoid having "Unify Zeroes". Kudos!

Secondly, what about your f64.copysign(1.0, x)? According to the list of instructions, wouldn't that only work if x were a float? I see only one definition for f32/64.copysign, which is "[f32/64 f32/64] -> [f32/64]" (which really doesn't make sense because it has 2 inputs). But when I put "copysign" in the search dialog I get a whole bunch of nonspecific links to nothing in particular about copysign. Someone will hopefully make the website easier to navigate at some point, but for now I guess I'm stuck unless I can dig out a more navigable spec somewhere (or even a lousy PDF). Anyways, let's sync this issue first so we understand what value a new opcode would or would not offer.

Third, back to -0.0. Consider this:

(-1.0, -0.9)/(+infinity) = (-0.0, -0.0)
(-1.0, +0.9)/(+infinity) = (-0.0, +0.0)

But as you effectively pointed out, both of these intervals are just points, namely zeroes. This is literally correct interval arithmetic even after rounding (and IEEE754 behaves as such, I believe). The problem is that now your code thinks the outputs are different because, hey look, the bits don't match (even if you f32/64.reinterpret)! So you branch off the wrong way and it's all downhill from there. However, I now think the solution is simply to use your +0.0 addition trick whenever and wherever it matters; we don't need any special instruction to fix this, even if the code has to worry about it all the time.
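The hazard is easy to reproduce in C: the two zeros compare equal as floats but differ bit-for-bit, and the +0.0 addition repairs the mismatch (helper name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <math.h>    /* INFINITY, signbit macros */

/* Compare two doubles bit-for-bit -- the reinterpret-style comparison
   that goes wrong when -0.0 and +0.0 denote the same interval bound. */
static int same_bits(double a, double b) {
    uint64_t ba, bb;
    memcpy(&ba, &a, sizeof ba);
    memcpy(&bb, &b, sizeof bb);
    return ba == bb;
}
```

So a bitwise equality check on interval bounds must either unify zeros first or accept spurious mismatches.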

Unless maybe @unageek (or anyone else lurking) has some problem with it.

@tlively
Member

tlively commented Nov 1, 2022

The semantics of copysign are given here: https://webassembly.github.io/spec/core/exec/numerics.html#op-fcopysign

copysign_n(z1, z2)

  • if z1 and z2 have the same sign, then return z1
  • else return z1 with negated sign.

@titzer

titzer commented Nov 2, 2022

It's also worthwhile noting that fN.copysign is not typically a hardware instruction; at least not on any of the ISAs that V8 supports. It's done with a couple of instructions of bitwise logic.
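For reference, here is the couple-of-instructions bitwise version an engine might emit, sketched in C (function name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIGN_BIT UINT64_C(0x8000000000000000)

/* fN.copysign as bitwise logic: clear the magnitude's sign bit, then
   OR in the sign source's sign bit. */
static double copysign_bits(double mag, double sgn) {
    uint64_t m, s;
    memcpy(&m, &mag, sizeof m);
    memcpy(&s, &sgn, sizeof s);
    m = (m & ~SIGN_BIT) | (s & SIGN_BIT);
    memcpy(&mag, &m, sizeof mag);
    return mag;
}
```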

@KloudKoder
Author

KloudKoder commented Nov 3, 2022

@tlively @titzer Thanks. Definitely copysign isn't going to fit the bill, unfortunately. It's necessary to branch based on the state of signs prior to interval multiplication, for instance. Because this is going to be done as often as interval multiplication itself, this would support the creation of the aforementioned pair of "get sign" type of instructions.

I'll be creating a proposal for CG review soon. Meanwhile, anyone with further concerns is invited to speak up.

@KloudKoder
Author

KloudKoder commented Nov 7, 2022

I've been reading through the WASM instruction set reference (thanks @tlively) so as to try and not mess this up. Here's what I'm thinking:

There are a few ways to do this, but ultimately the simplest is probably with "metaops", i.e. instructions that tell the compiler how to compile. (This could be used for lots of other stuff in the future like branch hints, cache proximity hints, reentrant compilation, atomic transactions, secure enclave implementation, etc.) We might as well get started with a generic approach that could be repurposed for all those other potential uses down the road.

Rather than have a single opcode for all metaops, which would just result in another level of indirection, we can assign one for each specific purpose. By definition, metaops don't do anything visible in and of themselves; they have side effects which may manifest on subsequent instruction boundaries, or only in the analog world, e.g. power consumption or latency effects.

In this case, we need 4 opcodes: round nearest-or-even (implied in effect for backward compatibility), round negativeward, round positiveward, and chop. In the interest of simplicity, it doesn't matter if you use these as prefixes (i.e. immediately prior to a float instruction) or not. You could even be silly and put lots of them in a long chain after a return instruction, if you like. (Why tax the validator with such innocuous concerns?) In any case, their effect should persist until either a different mode is selected, or one of the following instructions is encountered: br, br_if, br_table, return, call, or call_indirect. Those instructions are guaranteed to reset back to nearest-or-even because we don't want to deal with weird cases where a branch target ends up with a superposition of RMs, or a call instruction ends up bouncing off the OS and suffering some resulting reset of the float pipes (can that even happen?). Ultimately, as I've said before, the optimizer should rearrange instructions so as to minimize the number of RM control register touches (but this is not required).

And BTW the RMs affect not only arithmetic instructions, but the sort of conversions mentioned above, wherein the target lacks sufficient precision to fully represent the source. It's even possible that one RM will produce a finite approximation while another RM will generate an infinity in response to the same input.

There are also the aforementioned sign extraction instructions, but those are straightforward.

Thoughts?

@tlively
Member

tlively commented Nov 7, 2022

@KloudKoder, the direction sounds good! As far as terminology goes, I would describe these new instructions as normal instructions that modify new rounding mode state made explicit in the store rather than as "metaops." We want to avoid the impression that the behavior of these instructions is meant to be outside the formally specified semantics.

For features that actually are outside the specified semantics, there is a separate effort being led in the branch hinting proposal to develop a framework for code annotations. I want to emphasize that rounding mode control should not be using that framework because it does affect semantics.

I believe @titzer previously expressed a preference for adding individual new arithmetic and conversion instructions with explicit rounding behavior rather than introducing new rounding mode state and instructions to modify it and I share that preference. At least for optimizing engine tiers, I wouldn't expect this to have a performance difference, especially since you envision the rounding mode being reset at the end of blocks anyway. It would be great if you could comment on the trade offs you see between these two designs as part of your presentation.

@KloudKoder
Author

@tlively It sounds like you're advocating for 2 different schemes. (Sorry if I'm just misinterpreting your comment.) To clarify, do you actually want many "new arithmetic and conversion instructions with explicit rounding behavior" or 4 new instructions with "the rounding mode being reset at the end of blocks"? (One of the foregoing plus the 2 sign extractors, actually.)

Yep, I got your point about metaops, which is fine.

@titzer

titzer commented Nov 8, 2022

It's an unfortunate encoding mistake, in retrospect, that we didn't reserve a 0 byte for floating point ops. But as it is, I think we could carve out a prefix page for floating point ops that include an explicit rounding mode, and then put (actually, repeat) all ops that use a rounding mode under that prefix, with the rounding mode explicit in them.

My preference against the implicit rounding mode state option is that it makes otherwise pure instructions dependent on that global state. I get it that most hardware works this way, but my impression is that hardware architects don't like that either. RISC-V actually has both: the rounding mode field is large enough to encode its five static rounding modes plus a dynamic rounding mode. AFAICT they did this because of the software dependency on managing the rounding mode state inherited from previous architectures.

@rossberg
Member

rossberg commented Nov 9, 2022

Strongly agree with what @titzer and others have said: an ambient mode state should be avoided, not least because it has terrible composability, i.e., gets in the way of transformations like inlining.

@KloudKoder
Author

KloudKoder commented Nov 10, 2022

@titzer @rossberg Fine by me, with the understanding that your validator now has to worry about the following:

  1. No control transfers allowed to the byte after the RM prefix. So you can't jump into the middle of "chop multiply" in order to just "multiply" for instance.
  2. No redundant prefixes, e.g. "nearest-or-even chop divide".
  3. No prefixes to nonfloat instructions, e.g. "round-positive return".
  4. No prefixes to float instructions with no RM sensitivity, e.g. "round-negative f32.min".
  5. Functions (and even just scopes) cannot terminate with prefixes.

Implicitly, prefixes have no lifetime beyond the prefixed instruction itself; it all reverts to nearest-or-even thereafter.

OK?

@KloudKoder
Author

Alright, I think I've identified every single instruction whose behavior would be altered by a change in RM. They're as follows:

91 f32.sqrt
92 f32.add
93 f32.sub
94 f32.mul
95 f32.div
9F f64.sqrt
A0 f64.add
A1 f64.sub
A2 f64.mul
A3 f64.div
B2 f32.convert_i32_s
B3 f32.convert_i32_u
B4 f32.convert_i64_s
B5 f32.convert_i64_u
B6 f32.demote_f64
B7 f64.convert_i32_s
B8 f64.convert_i32_u
B9 f64.convert_i64_s
BA f64.convert_i64_u
FD 5E f32x4.demote_f64x2_zero
FD E3 f32x4.sqrt
FD E4 f32x4.add
FD E5 f32x4.sub
FD E6 f32x4.mul
FD E7 f32x4.div
FD EF f64x2.sqrt
FD F0 f64x2.add
FD F1 f64x2.sub
FD F2 f64x2.mul
FD F3 f64x2.div
FD FA f32x4.convert_i32x4_s
FD FB f32x4.convert_i32x4_u
FD FE f64x2.convert_low_i32x4_s
FD FF f64x2.convert_low_i32x4_u

I couldn't find any documentation on the last 2, so I need to confirm that they're actually impacted by RM. (Are they taking 2 products of 2 i32s? Or concatenating them or what?) If I've neglected or wrongly included something, please speak up.

In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for (round-positive, round-negative, and chop, respectively). This would match the existing opcode order for existing "ceil", "floor", "trunc", and "nearest" instructions provided in the spec (but FC would be skipped). This would, in turn, result in minimal complexity growth for upstream compilers. It would also minimize the pain of adding additional float or SIMD instructions in the future. The rules would be simple:

  1. If you're using nearest-or-even with any instruction in the list, then nothing changes.

  2. If you're using round-positive, round-negative, or chop; then use prefix opcodes F9, FA, or FB, respectively. Follow this by the equivalent nearest-or-even encoding, but skip the leading FD if it's present.

In the interest of simplicity, I'll post about sign extraction separately.

@KloudKoder
Author

Now for the sign extractors. While such instructions would be useful for a wide variety of applications other than interval arithmetic, I'm just going to focus on the latter. To summarize, there are 8 cases from which to choose upon each multiply:

(+a, +b) * (+c, +d) = (+ac, +bd)
(-a, +b) * (+c, +d) = (-ad, +bd)
(-a, -b) * (+c, +d) = (-ad, -bc)
(-a, -b) * (-c, +d) = (-ad, +ac)
(-a, +b) * (-c, -d) = (-bc, +ac)
(+a, +b) * (-c, -d) = (-bc, -ad)
(-a, -b) * (-c, -d) = (+bd, +ac)
(-a, +b) * (-c, +d) = (min(-ad, -bc), max(+ac, +bd))

(There are 16 cases if you include antiintervals (+a, -b), which is sometimes necessary.) As shown, selecting a case requires extracting sign bits. (Doing an arithmetic comparison against +0.0 would be overkill, and doesn't always behave like sign bit extraction anyway.) Alternatively, you could simply compute ac, ad, bc, and bd, then apply both min and max to obtain the resulting interval. That's really expensive, though, not least because there's no "min/max of all lanes" SIMD instruction, so you need to serialize 2 min/max instructions. (Granted, you need to do so anyway in the last case, but it's rare.) To make matters worse, the 8 cases above all feature each variable with both signs, so it might be most efficient to use indirect branches from a sparsely populated table of 16 labels. Which brings us back to "really expensive". Even if the programmer knows that some of the cases are impossible, the compiler might not know that, so the best it can do is optimize for common cases such as the first one. Which is why, in practice, it's probably going to be fastest to just nest 4 predicates. Which, finally, is why sign extractors matter.

I should emphasize, again, that division is even worse because dividing by any interval including +/-0.0 will give rise to an antiinterval, hence the need for an arithmetic sign extractor to complement the bitwise one. (Insert debate about conflicting limits from left and right here.) And this time, we can't even avoid the problem by using min/max. It's just ugly.

I would just add that all of the above explains why I think actual interval instructions would be advantageous. I know that's not a popular opinion because doing so would not expose anything which is currently atomic on the CPU level (but it might be in the future). The problem is that there's no good idiom for "hey I'm going to multiply 2 intervals, so do what I mean, not what I say". So the compiler has to take everything quite literally and cause loads of superfluous communication among floats, integers, flags, and the branch predictor. I thought I should make this clear, even if such instructions are categorically off the table for the moment, as I don't want anyone to be surprised when native assembly language is still multiples faster than even the RM-aware WASM equivalent.

OK so here are the sign extractors that I'd like to propose, absent full-blown native interval support:

C5 f32.sgn (32-Bit Get Sign)

This is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32". This instruction just copies the sign bit of f32 into i32 without regard to any other bits in f32.

C6 i32.asgn_f32 (32-Bit Get Arithmetic Sign)

This instruction sets i32 to zero if and only if f32 is +/-0. Otherwise it's equivalent to (1-2*f32.sgn).

C7 f64.sgn (64-Bit Get Sign)
C8 i32.asgn_f64 (64-Bit Get Arithmetic Sign)
C9 i32x4.sgn_f32x4 (4x32-Bit Get Sign)
CA i32x4.asgn_f32x4 (4x32-Bit Get Arithmetic Sign)
CB i64x2.sgn_f64x2 (2x64-Bit Get Sign)
CC i64x2.asgn_f64x2 (2x64-Bit Get Arithmetic Sign)

And the following instructions would help immensely, and not just for interval arithmetic, because lots of code branches depending on signs:

CD i32.psgn_f32x4 (Packed 4x32-Bit Get Sign)

This indirect branch preparation instruction applies f32.sgn to lane N and deposits the result into bit N of i32.

CE i32.pasgn_f32x4 (Packed 4x32-Bit Get Arithmetic Sign)

This indirect branch preparation instruction applies i32.asgn_f32 to lane N and compresses the results into the 2 successive corresponding bits of i32.

CF i32.psgn_f64x2 (Packed 2x64-Bit Get Sign)
D0 i32.pasgn_f64x2 (Packed 2x64-Bit Get Arithmetic Sign)

@tlively
Member

tlively commented Nov 21, 2022

In keeping with the KISS principle, and considering that we're in no desperate shortage of root-level opcodes, I'd like to propose the creation of 3 additional prefixes (tentatively F9, FA, and FB), which serve as RM indicators for (round-positive, round-negative, and chop, respectively)

We generally try to be conservative about using new prefix opcodes because we expect WebAssembly to be around for a long time and because they come out of the same space we could use for new single-byte opcodes. It would be better to put all the new instructions behind prefix FC, as @rossberg suggested.

It might also be a good idea to leave out new SIMD instructions for now, unless you can show that there are efficient lowerings for the alternative semantics on common hardware and libraries that would use these instructions.

@KloudKoder
Author

@tlively Understood. I'll rehash everything under FC. And when you say to leave out the SIMDs, are you referring to only the sign extractors? I assume so because there are definitely efficient hardware instructions for ordinary arithmetic SIMDs of all RMs, at least on AMD64; these are also helpful for combining parallel interval arithmetic instructions at the source code level into a single hardware operation (to great effect, since 32 bits (x4) goes a long way when you have error containment). But please clarify.

@tlively
Member

tlively commented Nov 22, 2022

In general there is a high bar to add SIMD instructions. You’ll need to show that there are efficient lowerings for all the new SIMD instructions on Arm and x86, and ideally RISC-V as well, with details about which ISA extensions would be necessary. If you can show that, then including the SIMD instructions should be fine.

@KloudKoder
Author

KloudKoder commented Nov 22, 2022

@tlively Well it turns out that the SIMD arithmetic instructions don't have efficient ARMv8 equivalents. So I'll pull those out, but what about the sign extractors? Those are all integer operations and definitely have efficient (but not necessarily single-instruction) equivalents on all modern architectures.

EDIT: On further thought, the SIMD sign extractors aren't useful enough to justify, absent custom RMs, so I'll kill them too. I'll come back later with an updated proposal.

@KloudKoder
Author

OK how about this: The following RM-aware float instructions would all be prefixed with FC.

20 f32.sqrt_ceil
21 f32.sqrt_floor
22 f32.sqrt_trunc
23 f32.add_ceil
24 f32.add_floor
25 f32.add_trunc
26 f32.sub_ceil
27 f32.sub_floor
28 f32.sub_trunc
29 f32.mul_ceil
2A f32.mul_floor
2B f32.mul_trunc
2C f32.div_ceil
2D f32.div_floor
2E f32.div_trunc
2F f64.sqrt_ceil
30 f64.sqrt_floor
31 f64.sqrt_trunc
32 f64.add_ceil
33 f64.add_floor
34 f64.add_trunc
35 f64.sub_ceil
36 f64.sub_floor
37 f64.sub_trunc
38 f64.mul_ceil
39 f64.mul_floor
3A f64.mul_trunc
3B f64.div_ceil
3C f64.div_floor
3D f64.div_trunc
3E f32.convert_i32_ceil_s
3F f32.convert_i32_floor_s
40 f32.convert_i32_trunc_s
41 f32.convert_i32_ceil_u
42 f32.convert_i32_floor_u
43 f32.convert_i32_trunc_u
44 f32.convert_i64_ceil_s
45 f32.convert_i64_floor_s
46 f32.convert_i64_trunc_s
47 f32.convert_i64_ceil_u
48 f32.convert_i64_floor_u
49 f32.convert_i64_trunc_u
4A f32.demote_f64_ceil
4B f32.demote_f64_floor
4C f32.demote_f64_trunc
4D f64.convert_i32_ceil_s
4E f64.convert_i32_floor_s
4F f64.convert_i32_trunc_s
50 f64.convert_i32_ceil_u
51 f64.convert_i32_floor_u
52 f64.convert_i32_trunc_u
53 f64.convert_i64_ceil_s
54 f64.convert_i64_floor_s
55 f64.convert_i64_trunc_s
56 f64.convert_i64_ceil_u
57 f64.convert_i64_floor_u
58 f64.convert_i64_trunc_u

And likewise with these sign extractors with no need for RM:

1C f32.sgn (32-Bit Get Sign (to i32))
1D i32.asgn_f32 (32-Bit Get Arithmetic Sign)
1E f64.sgn (64-Bit Get Sign (to i32))
1F i32.asgn_f64 (64-Bit Get Arithmetic Sign)

@tlively
Member

tlively commented Nov 23, 2022

Nice, looks good to me.

One more note: typically the prefix of an instruction name corresponds to the result type of the instruction, so I would expect f32.sgn to start with i32. I would also be prepared for further bikeshedding on names once this is presented to the full group.

@sunfishcode
Member

It's worth considering making the rounding mode an immediate field rather than being folded into the opcode. Something like this:

FC 20 00 f32.sqrt_ceil
FC 20 01 f32.sqrt_floor
FC 20 02 f32.sqrt_trunc
FC 21 00 f32.add_ceil
FC 21 01 f32.add_floor
FC 21 02 f32.add_trunc

and so on.

Upsides: This makes it cleaner to extend the feature with new rounding modes in the future, if we ever want to do that. There are algorithms that use round-to-odd, round-away-from-zero, and other things. We don't have popular hardware for them today, but who knows what the future may hold. And this leaves more of the one-byte space under the FC prefix available for future features.

Downside: This makes the instructions one byte longer. However, floating-point operations don't tend to be the most frequent opcodes even in floating-point-heavy code, so it should have a low overall impact on code size.

@KloudKoder
Author

@tlively Maybe I'm confused. So that's why I stated above that "[f32.sgn] is a conventional branch preparation instruction, so to be consistent, it has this name instead of the more verbose "i32.sgn_f32"." Are you saying that f32.sgn and friends should not be consistent with, say, f32.eq? Fine either way but I want to make sure.

@sunfishcode Yeah, I thought of this but I figured that we would never see all the major silicon vendors supporting these other modes. And the opcode map is already sort of chaotic, so if antichop suddenly appears tomorrow, then we could just add the same instructions elsewhere in their own contiguous region of FC or whatever map. If there's a new float instruction (without a new RM), then we can just inject it after the last triplet.

But if y'all still want a broken-out RM byte, then fine by me. And would you want to wrap the precision bit into the RM byte as well, so we don't need separate 32/64?

@tlively
Member

tlively commented Nov 28, 2022

Oh yeah, you’re right that f32.eq and the other comparison instructions are already exceptions to the rule I described. The names you have seem fine for now, and we can discuss with the full group whether we should make any changes.

@sunfishcode
Member

@sunfishcode Yeah, I thought of this but I figured that we would never see all the major silicon vendors supporting these other modes. And the opcode map is already sort of chaotic, so if antichop suddenly appears tomorrow, then we could just add the same instructions elsewhere in their own contiguous region of FC or whatever map. If there's a new float instruction (without a new RM), then we can just inject it after the last triplet.

You're not wrong :-), but my instinct is that the immediate field is worth doing.

But if y'all still want a broken-out RM byte, then fine by me. And would you want to wrap the precision bit into the RM byte as well, so we don't need separate 32/64?

Splitting out 32/64 isn't necessarily a bad idea, but since we didn't do that in the earlier float opcodes, my inclination would be to keep them separate here too.

@KloudKoder
Author

@sunfishcode OK we can do that. I don't see anyone complaining as of yet. Let me get a final presentation together and schedule for a CG meeting date.

@KloudKoder
Author

I've submitted a request to debate this proposal at the December 20 CG meeting. I also have an updated synopsis of the proposed changes based on the above discussion (but I'm not presently sure which document format to post where).

@KloudKoder
Author

Summarizing the current proposal for those in the CG meeting who want something in writing:

This proposal is for the compiler prototyping of floating-point instructions which mirror those already in the spec, but with distinct rounding modes (RMs) which differ from the universally assumed nearest-or-even policy. The RMs involved include the 3 others which all major CPUs seem to support, namely "ceil" (round toward positive infinity), "floor" (round toward negative infinity), and "trunc" (chop toward zero without changing the sign bit). SIMD instructions aren't included because they don't yet have succinct equivalents on ARM.

All of these instructions will use the FC opcode prefix so as to conserve space in the main map. This prefix is followed by the secondary opcode, and finally an RM byte indicating one of the aforementioned RMs (or perhaps yet another one in the future). The opcode map ordering is identical to that of their corresponding progenitors so as to minimize compiler complexity. As previously, a suffix of "s" or "u" implies "signed" or "unsigned" operation, respectively.

The proposed map is thus as follows:

20 00 f32.sqrt_ceil
20 01 f32.sqrt_floor
20 02 f32.sqrt_trunc
21 00 f32.add_ceil
21 01 f32.add_floor
21 02 f32.add_trunc
22 00 f32.sub_ceil
22 01 f32.sub_floor
22 02 f32.sub_trunc
23 00 f32.mul_ceil
23 01 f32.mul_floor
23 02 f32.mul_trunc
24 00 f32.div_ceil
24 01 f32.div_floor
24 02 f32.div_trunc
25 00 f64.sqrt_ceil
25 01 f64.sqrt_floor
25 02 f64.sqrt_trunc
26 00 f64.add_ceil
26 01 f64.add_floor
26 02 f64.add_trunc
27 00 f64.sub_ceil
27 01 f64.sub_floor
27 02 f64.sub_trunc
28 00 f64.mul_ceil
28 01 f64.mul_floor
28 02 f64.mul_trunc
29 00 f64.div_ceil
29 01 f64.div_floor
29 02 f64.div_trunc
2A 00 f32.convert_i32_ceil_s
2A 01 f32.convert_i32_floor_s
2A 02 f32.convert_i32_trunc_s
2B 00 f32.convert_i32_ceil_u
2B 01 f32.convert_i32_floor_u
2B 02 f32.convert_i32_trunc_u
2C 00 f32.convert_i64_ceil_s
2C 01 f32.convert_i64_floor_s
2C 02 f32.convert_i64_trunc_s
2D 00 f32.convert_i64_ceil_u
2D 01 f32.convert_i64_floor_u
2D 02 f32.convert_i64_trunc_u
2E 00 f32.demote_f64_ceil
2E 01 f32.demote_f64_floor
2E 02 f32.demote_f64_trunc
2F 00 f64.convert_i32_ceil_s
2F 01 f64.convert_i32_floor_s
2F 02 f64.convert_i32_trunc_s
30 00 f64.convert_i32_ceil_u
30 01 f64.convert_i32_floor_u
30 02 f64.convert_i32_trunc_u
31 00 f64.convert_i64_ceil_s
31 01 f64.convert_i64_floor_s
31 02 f64.convert_i64_trunc_s
32 00 f64.convert_i64_ceil_u
32 01 f64.convert_i64_floor_u
32 02 f64.convert_i64_trunc_u

Considering that all of these instructions are directly relevant to interval arithmetic, the following primitives should be included as well, which would be frequently used in that context, e.g. upon every multiply operation. They are also prefixed with FC but don't contain (and don't need) an RM byte.

In the tradition of branch preparation instructions, f32.sgn and f64.sgn have abbreviated names rather than "i32.sgn_f32" and "i32.sgn_f64", respectively. (The arithmetic sign instructions technically aren't branch preparation instructions because they return ternary outputs, so they conform to the usual notation.)

1C f32.sgn (32-Bit Get Sign Bit (to i32))
1D i32.asgn_f32 (32-Bit Get Arithmetic Sign (-1/0/1))
1E f64.sgn (64-Bit Get Sign Bit (to i32))
1F i32.asgn_f64 (64-Bit Get Arithmetic Sign (-1/0/1))

@KloudKoder
Author

Sorry this has taken so long. Anyhow, we now have a sandbox repo (big thanks to @dschuff ).

https://github.com/WebAssembly/rounding-mode-control

I've opened an issue over there for discussion about how to hack up a prototype. All, feel free to dive in and post your thoughts. I assure you that I have absolutely no idea what I'm doing with respect to the WASM codebase. At least, I can help with formal revisions to the PDF specification, as well as testing.

WebAssembly/rounding-mode-control#2

@mglisse

mglisse commented Mar 9, 2023

In case you are still looking for potential users, CGAL (computational geometry, meshing) is a C++ library that heavily relies on interval arithmetic (and exceptions) for performance: the fallback is bignum rationals. We have some slow emulation so the code can at least run, but that's really not good. We have had several users ask for wasm support for demos or web applications, and several told us they had to switch to a different (inferior) solution because of this. We do not mind if it is necessary to write a bit of code specifically for wasm, we already have plenty of platform-specific code in that area. On x86_64, we even use rounding for SIMD operations, but for wasm, rounding of scalar operations would already be great.

(sorry, I won't have time to help much, all I can offer is to try porting CGAL's Interval_nt once there is support for both compiling and evaluating, although I have no experience with wasm)

@KloudKoder
Author

@mglisse Cool! Sounds like CGAL would make a great test case and eventual use case. Ping us here again when we have a prototype working. I got my work cut out for me, in the meantime.

@KloudKoder
Author

@mglisse @Chris00 Guys, do you have any preference with respect to rounding toward positive or negative infinity? I ask because @whirlicote is going to be writing some optimized implementations for common use cases, and it all works faster if he sticks to a single rounding mode. (When rounding positive, rounding negative is sometimes implementable by just negating before and after the operation. So you can do that cheaply instead of switching the RM (expensively). And similarly in the opposite case.) This will, at least initially, be done via optimization hints in a Javascript context. We don't want to offer optimization for both cases because that's begging to create a divergence of coding practice which serves no constructive purpose. So it's either favoring ceil, or favoring floor. My personal preference is ceil because it's slightly more useful (on account of positive-only quantities which require bounding) but I don't have a strong opinion either way. To be clear, functional support for both will be provided, but we'll only offer optimization for one of them in addition to nearest-or-even.

@mglisse

mglisse commented Oct 14, 2023

I can work with either. Currently in CGAL for intervals we use only rounding towards +inf (and the negation trick you mention), so it would be slightly easier to stick to that. The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one...

@Chris00

Chris00 commented Oct 14, 2023

In inari, rounding towards +∞ is most often used, so the library will be simpler to adapt if this is available.

CC: @unageek

@KloudKoder
Author

KloudKoder commented Oct 14, 2023

@mglisse @Chris00 Thanks for your feedback. So it looks like ceil should be the favored way.

"The one thing that would suck would be for 2 compilers or browsers to optimize only one direction but not the same one."

Indeed. whirlicote was concerned about that as well. He also plans to include some reference code in public notes to the spec, in order to ensure that everyone is optimizing for the same expected compiler behavior, which I think will further serve to avoid your feared outcome.

For updates on the proposal's progress, see:

WebAssembly/rounding-mode-control#2

@jakobkummerow

I'm not sure I fully understand the implications of the last handful of messages, but if you're saying that you're adding (among others) f32.sqrt_ceil and f32.sqrt_floor such that only the _ceil version will be fast and everyone will be expected to use that (with user-space tricks to map any desired rounding behavior onto this one instruction), and the _floor version only exists for completeness but nobody should use it because it will be slow, then that seems like a very suboptimal outcome, because it would be a very surprising performance pitfall. So in that case, I'd postulate that the _floor version shouldn't exist at all, to prevent folks from accidentally using it.

Taking a step back, this would be an example of a very important effect of prototyping: if it turns out that a design that looked good on paper is either not feasible (e.g., because changing the control mode very frequently is too expensive on actual hardware) or not necessary in practice, then this feedback should definitely be used to make changes to the original design.

(If I'm misunderstanding what you're saying, then never mind and carry on.)

@KloudKoder
Author

KloudKoder commented Oct 16, 2023

@jakobkummerow It's my fault for not explaining better. What we're proposing is that add, subtract, multiply, and divide will not be optimizable for floor because their floor flavors will be implemented as negate+ceil+negate. In practice, this means that interval arithmetic will be fast because we avoid RM switching with these common operations. sqrt (and any future transcendental functions) will not be affected because there is no economical trick to pull; we will have to pay the full cost of RM switching if we want to change from ceil to floor and back. In the long term, there could be another proposal to reorder the code so as to elide as much of that switching as possible. Or, perhaps more realistically, the CPU marketshare pie will evolve in favor of those chips which feature per-instruction RM decorations (such as nVidia or RISC-V) rather than this global RM foolishness.

@jakobkummerow

I was just using sqrt as an example. My larger point still stands: if f32.add_floor will be far slower than f32.add_ceil and nobody is supposed/expected to use it, then it probably shouldn't exist in the first place.

@KloudKoder
Author

@jakobkummerow OK I see your point. So while there won't be any significant performance difference with sqrt (or future transcendentals), there will in fact be superior performance with basic arithmetic using ceil as opposed to floor (when you use appropriate hints to optimize for ceil). There will be no equivalent hint available for optimizing for floor. However, and I know this sounds counterintuitive, floor will be as fast as it can possibly be if you optimize for ceil. This is because, if we got rid of basic arithmetic with floor, then each developer would need to implement negate+ceil+negate at the source code level, resulting in performance that's at least as poor, combined with lots more verbosity and thus opportunity for error; or else drop the optimization entirely and let us change and restore the global RM on every RM-sensitive instruction, which would be way worse still.

You might rightly wonder why floor and ceil should not have symmetric performance. The reason is that the only way to enforce that symmetry is to change the RM to ceil, then back to floor, then back to ceil, etc. all over the place; whereas, it's actually faster just to choose one of them and stick with it. In other words, the negates end up being way cheaper than all the would-be RM switching. That's likely to be the case even if we use heuristics and instruction reordering in order to minimize the number of switches required, simply because the most common use case (interval arithmetic) needs both of them for each and every basic operation.

@jakobkummerow

each developer would need to implement negate+ceil+negate at the source code level

Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work.

IIUC you're suggesting that engines should implement add_floor as negate+add_ceil+negate. Which creates the problems mentioned above: all engines must do this the same way, and toolchain authors must be aware that that's what engines do -- it would be really unfortunate if some toolchain decided to express add_ceil as negate + add_floor + negate, in a well-meaning attempt to avoid costly RM switches on naive engines.
This reinforces my thinking that a more robust way to ensure consistency across the ecosystem is to not have add_floor, thereby forcing toolchains to emit negate+ceil+negate when the original source code requests flooring, and hence making sure everyone is on the same fast path. As an added benefit, AOT optimizers in toolchains can afford to do more costly analysis (to e.g. reduce the number of required negations) than JIT compilers in engines.

@whirlicote

such that only the _ceil version will be fast and everyone will be expected to use that

No. The difference in performance is only small — something like two extra negate instructions, which is negligible compared to a pipeline flush.

and the _floor version only exists for completeness but nobody should use it because it will be slow, then that seems like a very sub-optimal outcome

No. I expect _floor to still be faster than negate-ceil-negate on hardware with AVX-512 support.

because it would be a very surprising performance pitfall.

No. The performance penalty is only about two negations (on certain hardware).

So in that case, I'd postulate that the _floor version shouldn't exist at all, to prevent folks from accidentally using it.

Keep in mind that Wasm modules are toolchain-generated, not handwritten. It is perfectly acceptable that toolchains have to do some work.

I was just using sqrt as an example. My larger point still stands: if f32.add_floor will be far slower than f32.add_ceil and nobody is supposed/expected to use it, then it probably shouldn't exist in the first place.

It is not far slower. You are expected to use both _floor and _ceil; this is to make sure that the same code runs fast on different hardware.

This reinforces my thinking that a more robust way to ensure consistency across the ecosystem is to not have add_floor, thereby forcing toolchains to emit negate+ceil+negate when the original source code requests flooring, and hence making sure everyone is on the same fast path.

This optimization is only relevant on certain hardware and for certain use cases — it pays off for interval-arithmetic-heavy code, for example. It is an easy but effective optimization for production engines. As far as this proposal is concerned, it only merits a NOTE in the spec.

and hence making sure everyone is on the same fast path.

I believe this problem can be solved with documentation, or by simply trying out both variants and measuring which is faster in one's code.

9 participants