Rounding mode hacking discussion #2
Comments
Ping @rossberg @sunfishcode @titzer @tlively <- You're all more experienced with WASM than I am. I'd like to hear your recommendations for how to hack this up. Can we maybe decide on some tasks to distribute? For starters, I'm willing to edit the instruction descriptions in the WASM manual if someone can point me to which doc I need to edit. Invite whoever else you see fit. |
The first step would be to create a new proposals/rounding-mode-control/Overview.md document and link to it from the top-level README.md. It would be good to check out the overview docs on some of the other proposals to get an idea for the kind of content that usually goes in there, but that would be the best place to list the details for the newly proposed instructions at this stage. |
Here are pointers to relevant documentation:
Just to set expectations: by default, the champion of a proposal is the main person responsible for most of the work that is described in the above docs. Of course, they can try to get other interested people on board to distribute that work. But everybody is very busy, so don't expect too much. Often the only help champions get is that other folks do code reviews. Either way, better plan to spend substantial amounts of time on it. ;) The only work for which you usually have to depend on others is the implementation in production engines required in a later phase. For that, you basically need engine vendors prioritising the proposal high enough that they start working on incorporating it. That can take considerable time, but if the proposal is in a good shape then it will happen. |
@tlively Thanks, got it. I'll post an update here when I have an overview markdown to share. I'll try to follow the format of the existing proposals you linked. @rossberg Thanks for the links, so at least I have a "map" now. I've literally never compiled anything to WASM so yeah, this could be a long learning curve. So long, in fact, that I think it would be preferable to take up Conrad Watt's offer to hack up a bunch of switch statements for the sake of quickly turning every float arithmetic instruction into an RM-sensitive one. You there, @conrad-watt ? If so, what do you suggest? Have I misunderstood what you were proposing? |
To be explicit, I was imagining supporting a source/IR-level "ambient mode + op" set of intrinsics by lowering to a Wasm-level
In this example (with apologies for the suspicious pseudo-code), source-level
This was meant to be a relatively low-effort suggestion to get programs with such intrinsics compiling to Wasm if we went with instruction-level rounding modes as currently proposed. However, this isn't something I have the bandwidth or expertise to engineer myself. |
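A minimal sketch of the lowering idea, written in C++ rather than Wasm text. Every name here is hypothetical: `g_rounding_mode` stands in for a Wasm global such as `$rounding_mode`, `set_rounding_mode` and `ambient_f64_add` stand in for the source-level intrinsics, and the `f64_add_*` functions stand in for the proposed per-instruction variants. The stand-ins are emulated with `<cfenv>` purely so that the sketch runs; that emulation is an assumption, not part of the suggestion.

```cpp
#include <cfenv>

// Hypothetical ambient mode; plays the role of a Wasm global such as $rounding_mode.
enum class RoundingMode { Nearest, Ceil, Floor, Trunc };
static RoundingMode g_rounding_mode = RoundingMode::Nearest;

// Stand-in for a per-instruction op: add under one fixed rounding mode.
// The volatile locals keep the compiler from folding or moving the addition.
static double add_with(int fe_mode, double a, double b) {
    const int saved = std::fegetround();
    std::fesetround(fe_mode);
    volatile double x = a, y = b;
    volatile double r = x + y;
    std::fesetround(saved);
    return r;
}
double f64_add_nearest(double a, double b) { return add_with(FE_TONEAREST, a, b); }
double f64_add_ceil(double a, double b)    { return add_with(FE_UPWARD, a, b); }
double f64_add_floor(double a, double b)   { return add_with(FE_DOWNWARD, a, b); }
double f64_add_trunc(double a, double b)   { return add_with(FE_TOWARDZERO, a, b); }

// Source-level "set ambient mode" lowers to a plain store to the global.
void set_rounding_mode(RoundingMode rm) { g_rounding_mode = rm; }

// Source-level "add in the ambient mode" lowers to a switch over the global,
// with one per-instruction variant in each arm.
double ambient_f64_add(double a, double b) {
    switch (g_rounding_mode) {
        case RoundingMode::Ceil:    return f64_add_ceil(a, b);
        case RoundingMode::Floor:   return f64_add_floor(a, b);
        case RoundingMode::Trunc:   return f64_add_trunc(a, b);
        case RoundingMode::Nearest: break;
    }
    return f64_add_nearest(a, b);
}
```

Every other rounding-sensitive operation would get the same switch, which is why this is cheap to generate mechanically but not free at run time.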
@conrad-watt Thanks, I understand. I think it's a good hack for a proof-of-concept. I could just copy the rounding mode from the opcode byte directly into your $rounding_mode. But in order to do that... is there any particular assembler which any of you would recommend for use with WASM, mainly for this sort of low-level test hacking? It would be easier if I could just bypass the usual runtime bloat of C++, Rust, etc. Specifically, I'm looking for the ability to write individual WASM instructions (and, ultimately, modify the assembler in order to encode the proposed new ones so that a hacked version of WASM itself could execute them). An internet search basically just pointed me back to WASM. |
A couple options would be to add the new instructions to WABT to use its wat2wasm tool or Binaryen to use its wasm-as tool. You could also implement the instructions in the reference interpreter itself, right here in this repo. Here's a PR adding instructions to Binaryen that you could look at to get a feel for what changes would be necessary in that code base: WebAssembly/binaryen#5214. |
@tlively Thanks! Seems like I ought to be able to cobble something together from all that. |
Just one more thought - to reduce your exposure to the Wasm-specific parts of the compilation it may make sense to directly expose the hypothetical "per-instruction rounding" Wasm instructions as intrinsics, and then do the strategy sketched above (#2 (comment)) as more of a source-to-source or generic IR-to-IR translation on code written in the |
@conrad-watt Thanks, I think I see what you mean. So basically connect the source code to the compiled code with macro middleware that implements a "set rounding mode" instruction. That makes sense. |
Of course, if you know of any code that is already directly written in terms of "per-instruction rounding mode" intrinsics, that would be even better! |
Progress update... Finally got started on this in earnest. I've decided to go the WABT route, hoping to implement RMs in the interpreter. At present I'm still trying to figure out where the relevant instructions are actually implemented. For example, if I search for "f64.sqrt", I see some tests for it along with some SIMD implementation stuff, but I don't see the basic floating-point implementation either as an interpretation function (which literally executes it) or a compiler function (which injects machine code into an output stream). Surely it's here somewhere but it's probably obscured by renaming. Same goes for other instructions impacted by RM. If I can't make progress on this soon, I guess I'll give up and try my luck with Binaryen. |
For adding new instructions to the interpreter, there are a few layers to go through. For instructions like unary operators, the dispatching is done in https://github.com/WebAssembly/wabt/blob/main/src/interp/interp.cc#L1141 with some of the implementations bottoming out in the corresponding header (https://github.com/WebAssembly/wabt/blob/b335131b5983c055bae8e7d0e6724ea556aa843d/include/wabt/interp/interp-math.h#L250). From the top end, you'd also need to add opcodes (https://github.com/WebAssembly/wabt/blob/main/include/wabt/opcode.h) and fill in some layers in between. I'd suggest posting an issue in the WABT repo, where you can get more specific help. |
@dschuff Thanks! Based on those filenames you provided, I was able to track down the problem: my search script was silently truncating the result list so they never showed up. Problem between keyboard and chair. So, yeah, I think I have enough to chew on now, and will inquire with WABT folks as needed. |
I have a first attempt at implementing a WebAssembly module that realizes the proposed instructions in WebAssembly/design#1456 (comment). https://gitlab.com/pauldennis/rounding-fiasco/-/blob/main/RoundingFiasco.wasm?ref_type=heads
|
@whirlicote All I see is a 13 MB WASM file. Care to elaborate? If you've really implemented all those instructions, even in a mockup form, that would be a very big deal, so I'm probably misunderstanding. |
Sure, @KloudKoder. WebAssembly has a binary format and a text format for WebAssembly modules. The file extension for binary files is
The exports have the same signatures as the proposed intrinsics. The module exports all proposed rounding functions. I tried out some exported functions by using a tool called
The module implements the hard part of the instructions. That means if you call the exports with real numbers (e.g.
The whole thing is "first try" because:
|
@whirlicote Well cool! Unlike some other folks lurking around here, I'm distinctly unqualified to evaluate your code. However, I'm willing and able to help test it (and especially assist with weird cases like square roots, which in the interval world emerge from series truncation). But this would require extensive communication that's probably too much for this thread. If that's of interest to you, then reply with some garbage email or chat ID where I might contact you, as GitHub doesn't seem to allow me to do that on platform. If not, feel free to continue your discussion here. |
Several questions arise: How should the rounding instruction be implemented in the reference implementation? There are several possible approaches. One approach is to only allow the reference implementation to compile for CPUs that have the corresponding instructions. For example in C++ adding two doubles with rounding down is as "easy" as something like this:
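A minimal sketch of what such a function could look like, using the standard `<cfenv>` interface. The names `f64_add_floor` and `f32_add_floor` are illustrative, and the sketch assumes a target whose hardware float arithmetic honours `fesetround()`; the `volatile` locals are a blunt way to stop the compiler from evaluating the sum at compile time or moving it across the mode switch.

```cpp
#include <cfenv>

// Adds two doubles, rounding the result toward negative infinity.
double f64_add_floor(double a, double b) {
    const int saved = std::fegetround();  // remember the caller's mode
    std::fesetround(FE_DOWNWARD);         // toward -inf (not FE_TOWARDZERO)
    volatile double x = a, y = b;         // force a run-time addition
    volatile double sum = x + y;
    std::fesetround(saved);               // restore the caller's mode
    return sum;
}

// Same operation for single precision.
float f32_add_floor(float a, float b) {
    const int saved = std::fegetround();
    std::fesetround(FE_DOWNWARD);
    volatile float x = a, y = b;
    volatile float sum = x + y;
    std::fesetround(saved);
    return sum;
}
```

Negative inputs are the quickest way to tell rounding down from rounding toward zero: `f64_add_floor(-1.0, -1e-30)` lands strictly below `-1.0`, whereas truncation would return exactly `-1.0`.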
Another option might be to find an expert who is familiar with a suitable softfloat implementation that supports rounding, and use that. Also, certain compilers have flags that emulate the floating-point instructions in software. These flags are usually used for hardware that does not have any floating-point arithmetic at all.
Then there are questions concerning the behavior of NaNs. According to Wikipedia, a possible anatomy of a single-precision NaN is as follows:
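A minimal sketch of that layout in code, assuming the IEEE 754 binary32 format and the common convention that the top fraction bit marks a quiet NaN. The struct and function names are illustrative only, and the functions work on raw bit patterns so that no float register can alter a signaling NaN on the way in.

```cpp
#include <cstdint>

// Fields of an IEEE 754 binary32 value, split out of its 32-bit pattern.
struct F32Fields {
    std::uint32_t sign;      // bit 31
    std::uint32_t exponent;  // bits 30..23; all ones (0xFF) for NaN and infinity
    std::uint32_t quiet;     // bit 22; 1 marks a quiet NaN under the common convention
    std::uint32_t payload;   // bits 21..0
};

F32Fields split_f32_bits(std::uint32_t bits) {
    return F32Fields{bits >> 31, (bits >> 23) & 0xFFu, (bits >> 22) & 0x1u,
                     bits & 0x3FFFFFu};
}

// NaN: exponent all ones and a nonzero fraction (quiet bit or payload).
bool is_nan(const F32Fields& f) {
    return f.exponent == 0xFFu && (f.quiet != 0 || f.payload != 0);
}

// Signaling NaN: a NaN whose quiet bit is clear, so the payload must be nonzero.
bool is_signaling_nan(const F32Fields& f) {
    return f.exponent == 0xFFu && f.quiet == 0 && f.payload != 0;
}
```

For instance, the pattern `0x7FC00001` splits into sign 0, exponent `0xFF`, quiet bit 1 and payload 1, while `0x7F800000` is infinity and therefore not a NaN.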
The inputs of a binary operation could both be NaN. Should the left or the right payload be chosen for the payload of the resulting NaN? How should signaling NaNs be handled? To answer these questions there are several strategies:
As for the circle of blame:
I will try to fix any testcase you provide.
The repository for the development of |
@whirlicote So I can provide a few hints with respect to your questions:
Regarding your f32_add_floor(), seems like you meant FE_DOWNWARD rather than FE_TOWARDZERO.
The NaN corner cases are currently the subject of debate. Therefore I don't think this needs to be a blocking issue for the first iteration of your code.
As to performance optimization, first of all, any silicon such as Intel that doesn't integrate the rounding mode (RM) into the instruction itself is just badly bottlenecked, so there's only so much that can be done by way of mitigation. One option is to reorder instructions so as to minimize the total number of RM switches. But this implies potentially inefficient reordering of memory accesses in a manner which might confuse stride detection logic, resulting in performance losses that exceed the gains, especially when most of the data ends up uncached. So not reordering might be superior overall, and even in such cases, we could still elide all redundant RM switching (like setting FE_DOWNWARD over and over again). And in the long term, once threading is commonplace and the frontside bus is often saturated, we'll have plenty more time to fiddle with the RM control register. Fortunately, the vast majority of functions will never change RM at all, in which case we won't see any negative impact from these new instructions. All considered, this is going to end up being a complex optimization problem requiring extensive profiling of what's actually being run in the wild (which itself is a moving target influenced by our own optimization). We don't necessarily need to hit maximum performance on the first pass.
As to the proposal of only implementing those instructions which are actually supported in hardware, I don't think you have to worry. I'm pretty sure all silicon which can implement WASM code is capable of handling the 4 proposed RMs without having to resort to full-blown software emulation. (Some SIMD FP operations had to be removed in order to ensure this.)
As to testing, I'm thinking of something like this: There's a test mode (even a lousy command line app or the equivalent in the browser) which allows the user to input a pair of f32s or f64s, and apply an instruction to them in the context of a given RM. So for example I could take the ratio of a pair of f64s and round the result toward positive infinity. Hex in, hex out. Then one could write a rather repetitive bash script to do the hundreds of required corner case tests and check every single corresponding result down to the last bit. Vastly better than pingponging case by case here in the discussion. But it would, obviously, require some crude UI to be constructed upfront. GitLab tickets would be a viable but last resort. |
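The "hex in, hex out" idea can be sketched as a single function that a command line wrapper could call. This is only an illustration under stated assumptions: the name `apply_f64`, the one-character operator codes and the use of `<cfenv>` to select the rounding mode are all invented here, and only f64 is covered.

```cpp
#include <cfenv>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Reinterprets a 64-bit hex pattern (no "0x" prefix needed) as a double.
static double f64_from_hex(const std::string& hex) {
    const std::uint64_t bits = std::strtoull(hex.c_str(), nullptr, 16);
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Formats the bit pattern of a double as 16 lowercase hex digits.
static std::string f64_to_hex(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    char buf[17];
    std::snprintf(buf, sizeof buf, "%016llx", static_cast<unsigned long long>(bits));
    return buf;
}

// Applies op ('+', '-', '*', '/') to two hex-encoded doubles under fe_mode
// (one of the FE_* constants) and returns the hex-encoded result.
// Returns an empty string for an unknown operator.
std::string apply_f64(char op, int fe_mode, const std::string& lhs_hex,
                      const std::string& rhs_hex) {
    volatile double a = f64_from_hex(lhs_hex);  // volatile: compute at run time
    volatile double b = f64_from_hex(rhs_hex);
    volatile double r = 0.0;
    bool known = true;
    const int saved = std::fegetround();
    std::fesetround(fe_mode);
    switch (op) {
        case '+': r = a + b; break;
        case '-': r = a - b; break;
        case '*': r = a * b; break;
        case '/': r = a / b; break;
        default:  known = false; break;
    }
    std::fesetround(saved);
    return known ? f64_to_hex(r) : std::string();
}
```

Taking the ratio 1.0 / 3.0 as the example, `apply_f64('/', FE_UPWARD, "3ff0000000000000", "4008000000000000")` yields `3fd5555555555556`, while `FE_DOWNWARD` yields `3fd5555555555555`, so a script can compare results down to the last bit.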
I corrected the code example in the edit.
@KloudKoder I exported a test case function to be used with
A test case is invoked as follows
All numbers are encoded as
For example taking the squareRoot of
For example taking the squareRoot of
For example taking the ratio of
The result of the testcase is indicated by the exit code.
A simple test script could then be written like this:
hope that helps |
@whirlicote Really impressive! And yes I get the representation as decimal from i32 or i64, as well as your packed function type representation. That'll do. I'm thinking the best way to find out what test results you should return is to go back to ancient 8087 assembly language and implement all of the relevant WASM instructions using some obvious corner cases (denormals, signed zeroes, infinities, etc.). Then manually set the precision control (PC) and rounding control (RC) of the control register before each instruction. Then finally print out all the results using your same decimal format, and decorate them into commands of the above format. So probably a C wrapper on 8087 assembly that dumps tons of test cases in your particular format. Then literally just post them here so you can run them on your end and see if anything fails. Unfortunately I'm slammed for the next couple weeks but I will work on this as I can, unless you have an easier approach. |
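A minimal sketch of the encoding step for such a test dump, in C++ rather than 8087 assembly. It assumes the decimal format is simply the f64 bit pattern printed as an unsigned 64-bit integer; whether the actual tool expects signed or unsigned decimals is not stated in this thread, and both function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>
#include <vector>

// Prints the bit pattern of a double as an unsigned decimal integer
// (assumed here to match the "decimal from i64" representation).
std::string f64_as_decimal_i64(double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return std::to_string(bits);
}

// A starter set of corner cases: signed zeroes, denormals, the smallest
// and largest normal numbers, and the infinities.
std::vector<double> f64_corner_cases() {
    using Limits = std::numeric_limits<double>;
    return {0.0, -0.0,
            Limits::denorm_min(), -Limits::denorm_min(),
            Limits::min(), Limits::max(),
            Limits::infinity(), -Limits::infinity()};
}
```

Under that assumption, 1.0 encodes as `4607182418800017408`, negative zero as `9223372036854775808`, and the smallest positive denormal as `1`.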
I wrote a C++ implementation for the proposed instructions using |
For the record I'm working offline on this with whirlicote. |
I added rounded instructions into the reference interpreter. The repository is here https://github.com/whirlicote/rounding-mode-control The diff can be seen here: |
I'm working on automated test cases for whirlicote's code. In the process I realized that, for clarity, I should consolidate the proposed opcode list here, imported from the original issue linked above. I removed opcodes 0x31 and 0x32 because they were accidental duplicates of 0x2C and 0x2D. 1C f32.sgn (32-Bit Get Sign Bit (to i32)) |
Sorry let's try that again. whirlicote realized that the real problem was that opcodes 2C and 2D were showing an f64 output when in fact it should have been f32. Given that, then the corresponding f64 forms are in fact distinct instructions. However, if you think about it, opcodes 2F and 30 in the foregoing comment are redundant because i32, whether signed or unsigned, will always convert to f64 with no loss of information (on account of 53-bit precision), so rounding is irrelevant. Putting it all together, the corrected map would look like this: 1C f32.sgn (32-Bit Get Sign Bit (to i32)) |
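The claim that rounding is irrelevant for i32 to f64 conversion can be checked directly: every 32-bit integer fits in the 53-bit significand of an f64, whereas an f32 has only 24 bits. The sketch below converts under two opposite rounding modes and compares; the function names are illustrative, and `<cfenv>` is assumed to control the hardware conversion instruction.

```cpp
#include <cfenv>
#include <cstdint>

// Converts v to double with the given FE_* rounding mode in effect.
static double i32_to_f64_in_mode(std::int32_t v, int fe_mode) {
    const int saved = std::fegetround();
    std::fesetround(fe_mode);
    volatile std::int32_t src = v;  // volatile: convert at run time
    volatile double dst = static_cast<double>(src);
    std::fesetround(saved);
    return dst;
}

// Converts v to float with the given FE_* rounding mode in effect.
static float i32_to_f32_in_mode(std::int32_t v, int fe_mode) {
    const int saved = std::fegetround();
    std::fesetround(fe_mode);
    volatile std::int32_t src = v;
    volatile float dst = static_cast<float>(src);
    std::fesetround(saved);
    return dst;
}

// True when rounding down and rounding up give the same f64.
bool i32_to_f64_is_mode_independent(std::int32_t v) {
    return i32_to_f64_in_mode(v, FE_DOWNWARD) == i32_to_f64_in_mode(v, FE_UPWARD);
}

// True when rounding down and rounding up give the same f32.
bool i32_to_f32_is_mode_independent(std::int32_t v) {
    return i32_to_f32_in_mode(v, FE_DOWNWARD) == i32_to_f32_in_mode(v, FE_UPWARD);
}
```

The value 16777217 (2^24 + 1) is the smallest positive integer that shows the difference: it converts to the same f64 in every mode, but to 16777216 or 16777218 as an f32 depending on the rounding direction.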
Thanks to the test cases provided by KloudKoder, the RoundingFiasco.wasm WebAssembly module now considers edge cases for rounded instructions such as:
|
The fork of the reference implementation that implements the proposal passes the test cases provided by KloudKoder: |
From the 11/21/2023 meeting, we have 2 immediate tasks per Deepti and Conrad, respectively:
|
@dtig In the last meeting regarding rounding variants there was a request to get appraisal from the V8 project regarding the rounding variants proposal. Where/how should we get in touch with the V8 developers? |
I did some digging. This is what I have found out so far:
For Firefox:
For Chrome:
|
Communication with v8: https://groups.google.com/g/v8-dev/c/J5pHNIKBsGk/m/4m4hx9DyCAAJ |
Here are the interesting line diffs of a prototype implementation of the rounding variants proposal in V8. (commit
diff --git a/src/codegen/x64/assembler-x64.cc b/src/codegen/x64/assembler-x64.cc
index a9f9c2dd447..835f691a015 100644
--- a/src/codegen/x64/assembler-x64.cc
+++ b/src/codegen/x64/assembler-x64.cc
@@ -2267,6 +2267,38 @@ void Assembler::pushq_imm32(int32_t imm32) {
emitl(imm32);
}
+
+
+void Assembler::prolog_ceil() {
+ EnsureSpace ensure_space(this);
+ emit(0x0F); emit(0xAE); emit(0x15); emit(0x01); emit(0x00); emit(0x00); emit(0x00);
+ emit(0xA9); emit(0xbf); emit(0x5f); emit(0x00); emit(0x00);
+}
+void Assembler::prolog_trunc() {
+ EnsureSpace ensure_space(this);
+ emit(0x0F); emit(0xAE); emit(0x15); emit(0x01); emit(0x00); emit(0x00); emit(0x00);
+ emit(0xA9); emit(0xbf); emit(0x7f); emit(0x00); emit(0x00);
+}
+void Assembler::prolog_floor() {
+ EnsureSpace ensure_space(this);
+ emit(0x0F); emit(0xAE); emit(0x15); emit(0x01); emit(0x00); emit(0x00); emit(0x00);
+ emit(0xA9); emit(0xbf); emit(0x3f); emit(0x00); emit(0x00);
+}
+
+void Assembler::epilog_ceil() {
+ EnsureSpace ensure_space(this);
+ emit(0x0F); emit(0xAE); emit(0x15); emit(0x01); emit(0x00); emit(0x00); emit(0x00);
+ emit(0xA9); emit(0xbf); emit(0x1f); emit(0x00); emit(0x00);
+}
+void Assembler::epilog_floor() {
+ epilog_ceil();
+}
+void Assembler::epilog_trunc() {
+ epilog_ceil();
+}
+
+
+
void Assembler::pushfq() {
EnsureSpace ensure_space(this);
emit(0x9C);
diff --git a/src/codegen/x64/assembler-x64.h b/src/codegen/x64/assembler-x64.h
index 49f03cb3d3a..08fb0405080 100644
--- a/src/codegen/x64/assembler-x64.h
+++ b/src/codegen/x64/assembler-x64.h
@@ -649,6 +649,14 @@ class V8_EXPORT_PRIVATE Assembler : public AssemblerBase {
void CodeTargetAlign();
void LoopHeaderAlign();
+ // rounding mode
+ void prolog_ceil();
+ void epilog_ceil();
+ void prolog_floor();
+ void epilog_floor();
+ void prolog_trunc();
+ void epilog_trunc();
+
// Stack
void pushfq();
void popfq();
@@ -1839,6 +1847,28 @@ class V8_EXPORT_PRIVATE Assembler : public AssemblerBase {
#undef AVX_3
+ void vaddss_floor(XMMRegister dst, XMMRegister src1, XMMRegister src2) {
+ prolog_floor();
+ vaddss(dst, src1, src2);
+ epilog_floor();
+ }
+ void vaddss_floor(XMMRegister dst, XMMRegister src1, Operand src2) {
+ prolog_floor();
+ vaddss(dst, src1, src2);
+ epilog_floor();
+ }
+
+ void vaddsd_floor(XMMRegister dst, XMMRegister src1, XMMRegister src2) {
+ prolog_floor();
+ vaddsd(dst, src1, src2);
+ epilog_floor();
+ }
+ void vaddsd_floor(XMMRegister dst, XMMRegister src1, Operand src2) {
+ prolog_floor();
+ vaddsd(dst, src1, src2);
+ epilog_floor();
+ }
+
#define AVX_SSE2_SHIFT_IMM(instr, prefix, escape, opcode, extension) \
void v##instr(XMMRegister dst, XMMRegister src, uint8_t imm8) { \
XMMRegister ext_reg = XMMRegister::from_code(extension); \ |
This is a placeholder issue for discussion of tasks required for phase 1 testing of new instructions with embedded rounding mode (RM).