Switch from simd to packed_simd #23
Thanks. I hadn't noticed that To my surprise, there are no longer distinct boolean vector types in Yet, I don't see analogs for the I guess I should find out the rationale for the change and what the design intent is for replacements for |
@hsivonen The current plan is to be as conservative as possible. I don't think the OP, "stdsimd is the replacement for simd," is accurate at present. My suspicion is that the next phase is probably going to be something like, "stable, but inconvenient portability story." |
Portable SIMD got merged into |
This is happening on a branch for now. Swapping out the dependency was pretty easy for the portable part. (I didn't switch the intrinsics over yet.) I added a On x86_64, performance regressed for everything that depends heavily on |
Yesterday, I failed to notice that in |
|
any news here? |
No news yet. I was away for a bit and will need to catch up on the situation regarding |
@hsivonen packed_simd fixed all of those issues a while ago, and provides many many new features (shuffles with run-time indices, portable vector gather/scatters, ...). Let me know if that isn't the case, I think the simd crate can pretty much be removed. All its features are now available in |
@gnzlbg, can I now get fast boolean reductions on ARMv7+NEON without having the standard library compiled with NEON enabled? I.e. do I need to revive that PR and debug the Android Docker issue? |
You can if you enable the Note, however, that this uses a particular released version of |
I just released version
Yeah, you should (it's almost there, and if you have access to a Linux workstation, debugging locally shouldn't be too hard). This is a short-term workaround that will never work on stable. |
I've ported encoding_rs to packed_simd on a branch. At least on x86_64 (Haswell desktop i7), there are repeatable performance differences. Here's the output of per-benchmark best results from 4
Many of the changes are with buffer sizes that are so short that SIMD code isn't used, which suggests flips in branch prediction heuristics or something like that. The actually worrying cases are these:
|
Oh, and those are at |
Which Could you extract the code of those three benchmarks into its own crate? |
Also the worst benchmark appears to be |
I've left a couple of comments in your branch. One suspicious function there might be causing the regression; we should take a closer look at it by comparing the output of the simd and packed_simd crates for the versions with and without select (without select, both crates should generate exactly the same code). |
The above numbers were with the
I'll try to minimize the test case.
Indeed. That benchmark shouldn't even run SIMD code, because 3 < 16, which is why I speculated about prediction heuristics flipping or something.
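To make the "3 < 16" point concrete, here is a minimal scalar sketch of the control flow that the `copy_ascii_to_ascii` disassembly later in this thread suggests. The name `copy_ascii_to_ascii_model` and the constant `SIMD_STRIDE` are hypothetical, and the per-chunk check stands in for the real SIMD load, reduction, and store; this is a model under those assumptions, not the crate's source.

```rust
/// Bytes per 128-bit vector; inputs shorter than this never enter the strided loop.
const SIMD_STRIDE: usize = 16;

/// Hypothetical scalar model: copies the leading ASCII bytes of `src` into `dst`
/// and returns how many bytes were copied.
fn copy_ascii_to_ascii_model(src: &[u8], dst: &mut [u8]) -> usize {
    assert!(dst.len() >= src.len(), "destination must not be shorter than source");
    let len = src.len();
    let mut i = 0;
    if len >= SIMD_STRIDE {
        // Strided loop: in the real code this is one vector load, a boolean
        // reduction ("is any byte non-ASCII?"), and one vector store.
        let last_full = len - SIMD_STRIDE;
        while i <= last_full {
            let chunk = &src[i..i + SIMD_STRIDE];
            if chunk.iter().any(|&b| b >= 0x80) {
                break; // let the scalar tail find the exact position
            }
            dst[i..i + SIMD_STRIDE].copy_from_slice(chunk);
            i += SIMD_STRIDE;
        }
    }
    // Scalar tail: also the only path taken when len < SIMD_STRIDE.
    while i < len {
        let b = src[i];
        if b >= 0x80 {
            return i;
        }
        dst[i] = b;
        i += 1;
    }
    len
}
```

With a 3-byte input the `len >= SIMD_STRIDE` branch is skipped entirely, so any timing difference for such inputs cannot come from the vector instructions themselves.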
Thanks!
The suspicious function affects only |
Thank you! Feel free to open an issue in |
On aarch64 (ThunderX), the fluctuation of the numbers is a bit larger, so there are more changes in both directions with the 2-percentage-point threshold. I'm not too worried about that. Instead of (These are with the horizontal max intrinsics instead of the portable horizontal max operations.) |
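For readers who want to reproduce this kind of comparison, a rough sketch of the filtering follows. It assumes each benchmark was run several times, that samples are times in ns/iter (so the best result is the lowest), and that a change is reported only when it exceeds a percentage threshold. All function names here are hypothetical; none of this is taken from the actual scripts used in the thread.

```rust
/// Best (lowest) time across repeated runs of one benchmark, in ns/iter.
fn best(samples: &[f64]) -> f64 {
    samples.iter().copied().fold(f64::INFINITY, f64::min)
}

/// Relative change from `before` to `after`, in percent.
/// Positive means `after` is slower when the samples are times.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// True when the best-of-N results differ by more than `threshold_pct` percent
/// in either direction.
fn exceeds_threshold(before: &[f64], after: &[f64], threshold_pct: f64) -> bool {
    percent_change(best(before), best(after)).abs() > threshold_pct
}
```

Taking the best of several runs reduces noise from scheduling and frequency scaling, but on machines with larger run-to-run fluctuation a fixed threshold will still flag changes in both directions.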
@hsivonen could you try again with |
Maybe there is some inlining issue / target-feature issue popping up, since the |
On |
It seems that |
Sorry about the delay. After updating Still more to investigate. Meanwhile: It appears that |
I thought that was It may be that |
It is |
Specifically, the code using
In practice, What's the practical breakage situation with |
The library cannot build on stable Rust, only on nightly. It is still being developed, with new features, bug fixes, and performance fixes being added every now and then. Sometimes these require patching rustc, and we then tend to start using the features as soon as the next nightly is released. Until now, the only stakeholders that complain are the If a bigger stakeholder (e.g. Servo / Firefox) with stricter "unstable stability requirements" starts depending on For example, we could add a CI build bot to test the nightly version used by stakeholders, and add any new features that would break that behind a feature gate. Once all stakeholders upgrade, we could remove those feature gates. |
FWIW the error message could have been a lot clearer; you might want to file an issue in rust-lang/rust upstream about it. |
OK. With this problem fixed, it compiles for Is this a previously known problem? memory allocation of 1073741824 bytes failed Caused by: |
I haven't seen that one before. Which target are you using? aarch64-unknown-linux-android? It does not surprise me much; rustc consumes a lot of memory when compiling |
Self-hosted |
Hmm. Trying to compile on an x86_64 host both for aarch64 and x86_64 shows that rustc's memory usage climbs to a bit over 2 GB when compiling I'm pretty sure that 0.3.0 with earlier rustc didn't have this problem, but given the earlier comments today, I no longer trust myself having actually built this code on aarch64 previously. I'll see if I can buy more RAM for my aarch64 VM, since it's annoying to run |
Sadly it is not easy to only build a "subset" of I've been thinking about making |
Maybe I should try commenting out code to get some results.
The provider represents their aarch64 capacity as "out of stock", so can't get more RAM today. |
You can try commenting out certain APIs, e.g., in |
The build-time RAM requirement can be addressed with swap space plus waiting for the resulting IO, so it seems that building |
|
Filed an issue about rustc memory usage when compiling |
In case ripgrep users find their way here: This Firefox bug has "Depends on" and "See also" fields that point to relevant bugs/issues that are blocking migration at present. |
So far, I don't see a simple characterization of the regression other than it seems to generally affect things that I expect to correlate with horizontal boolean reduction and does not seem to really affect workloads that are mostly about shuffles. |
From which types (integers, floats, etc.) are the masks created? Unrelated update: some stdsimd refactorings have landed in nightly, and packed_simd should start to build properly again soon. |
Discovered so far: The presence of thumb trampolines is the very first thing that stands out in the assembly. I need to go back and run the
From
It seems to be in the present nightly already. Thanks! |
Once we get past the trampolines on the crate boundary, inlining from |
Thumb-to-Thumb comparison still shows a regression. |
With the |
|
simd, inline(never)
simd, inline(always)
packed_simd, inline(never)
packed_simd, inline(always)
For the never cases, here's the assembly from
0006c160 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE>:
6c160: b580 push {r7, lr}
6c162: 428b cmp r3, r1
6c164: d36a bcc.n 6c23c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xdc>
6c166: 468e mov lr, r1
6c168: 2910 cmp r1, #16
6c16a: d31b bcc.n 6c1a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x44>
6c16c: f002 030f and.w r3, r2, #15
6c170: f1ae 0c10 sub.w ip, lr, #16
6c174: 0701 lsls r1, r0, #28
6c176: d01a beq.n 6c1ae <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4e>
6c178: b373 cbz r3, 6c1d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x78>
6c17a: 2300 movs r3, #0
6c17c: 18c1 adds r1, r0, r3
6c17e: f921 0a0f vld1.8 {d0-d1}, [r1]
6c182: ef89 2050 vshr.s8 q1, q0, #7
6c186: ff02 2a03 vpmax.u8 d2, d2, d3
6c18a: ff02 2a00 vpmax.u8 d2, d2, d0
6c18e: ee12 1b10 vmov.32 r1, d2[0]
6c192: 2900 cmp r1, #0
6c194: d14a bne.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c196: 18d1 adds r1, r2, r3
6c198: 3310 adds r3, #16
6c19a: 4563 cmp r3, ip
6c19c: f901 0a0f vst1.8 {d0-d1}, [r1]
6c1a0: d9ec bls.n 6c17c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x1c>
6c1a2: e043 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1a4: 2300 movs r3, #0
6c1a6: 4573 cmp r3, lr
6c1a8: d342 bcc.n 6c230 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xd0>
6c1aa: 4670 mov r0, lr
6c1ac: bd80 pop {r7, pc}
6c1ae: b33b cbz r3, 6c200 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa0>
6c1b0: 2300 movs r3, #0
6c1b2: 18c1 adds r1, r0, r3
6c1b4: f921 0acf vld1.64 {d0-d1}, [r1]
6c1b8: ef89 2050 vshr.s8 q1, q0, #7
6c1bc: ff02 2a03 vpmax.u8 d2, d2, d3
6c1c0: ff02 2a00 vpmax.u8 d2, d2, d0
6c1c4: ee12 1b10 vmov.32 r1, d2[0]
6c1c8: bb81 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1ca: 18d1 adds r1, r2, r3
6c1cc: 3310 adds r3, #16
6c1ce: 4563 cmp r3, ip
6c1d0: f901 0a0f vst1.8 {d0-d1}, [r1]
6c1d4: d9ed bls.n 6c1b2 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x52>
6c1d6: e029 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1d8: 2300 movs r3, #0
6c1da: 18c1 adds r1, r0, r3
6c1dc: f921 0a0f vld1.8 {d0-d1}, [r1]
6c1e0: ef89 2050 vshr.s8 q1, q0, #7
6c1e4: ff02 2a03 vpmax.u8 d2, d2, d3
6c1e8: ff02 2a00 vpmax.u8 d2, d2, d0
6c1ec: ee12 1b10 vmov.32 r1, d2[0]
6c1f0: b9e1 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c1f2: 18d1 adds r1, r2, r3
6c1f4: 3310 adds r3, #16
6c1f6: 4563 cmp r3, ip
6c1f8: f901 0acf vst1.64 {d0-d1}, [r1]
6c1fc: d9ed bls.n 6c1da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x7a>
6c1fe: e015 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c200: 2300 movs r3, #0
6c202: 18c1 adds r1, r0, r3
6c204: f921 0acf vld1.64 {d0-d1}, [r1]
6c208: ef89 2050 vshr.s8 q1, q0, #7
6c20c: ff02 2a03 vpmax.u8 d2, d2, d3
6c210: ff02 2a00 vpmax.u8 d2, d2, d0
6c214: ee12 1b10 vmov.32 r1, d2[0]
6c218: b941 cbnz r1, 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c21a: 18d1 adds r1, r2, r3
6c21c: 3310 adds r3, #16
6c21e: 4563 cmp r3, ip
6c220: f901 0acf vst1.64 {d0-d1}, [r1]
6c224: d9ed bls.n 6c202 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xa2>
6c226: e001 b.n 6c22c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xcc>
6c228: 54d1 strb r1, [r2, r3]
6c22a: 3301 adds r3, #1
6c22c: 4573 cmp r3, lr
6c22e: d2bc bcs.n 6c1aa <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0x4a>
6c230: 56c1 ldrsb r1, [r0, r3]
6c232: 2900 cmp r1, #0
6c234: daf8 bge.n 6c228 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xc8>
6c236: 469e mov lr, r3
6c238: 4670 mov r0, lr
6c23a: bd80 pop {r7, pc}
6c23c: 4803 ldr r0, [pc, #12] ; (6c24c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xec>)
6c23e: 2130 movs r1, #48 ; 0x30
6c240: 4a03 ldr r2, [pc, #12] ; (6c250 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17he774493b8d7dcf7bE+0xf0>)
6c242: 4478 add r0, pc
6c244: 447a add r2, pc
6c246: f7ff fefb bl 6c040 <_ZN3std9panicking11begin_panic17hb6db914fa10d35c1E>
6c24a: defe udf #254 ; 0xfe
6c24c: 009c918c .word 0x009c918c
6c250: 009f3988 .word 0x009f3988
00056314 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E>:
56314: b570 push {r4, r5, r6, lr}
56316: 428b cmp r3, r1
56318: f0c0 8082 bcc.w 56420 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x10c>
5631c: 468e mov lr, r1
5631e: 2910 cmp r1, #16
56320: d320 bcc.n 56364 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x50>
56322: f002 030f and.w r3, r2, #15
56326: f1ae 0c10 sub.w ip, lr, #16
5632a: 0701 lsls r1, r0, #28
5632c: d01f beq.n 5636e <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5a>
5632e: b3cb cbz r3, 563a4 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x90>
56330: 2300 movs r3, #0
56332: 18c1 adds r1, r0, r3
56334: f961 0a0f vld1.8 {d16-d17}, [r1]
56338: efc9 2070 vshr.s8 q9, q8, #7
5633c: ee33 1b90 vmov.32 r1, d19[1]
56340: ee32 4b90 vmov.32 r4, d18[1]
56344: ee13 5b90 vmov.32 r5, d19[0]
56348: ee12 6b90 vmov.32 r6, d18[0]
5634c: 4321 orrs r1, r4
5634e: ea46 0405 orr.w r4, r6, r5
56352: 4321 orrs r1, r4
56354: d15c bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56356: 18d1 adds r1, r2, r3
56358: 3310 adds r3, #16
5635a: 4563 cmp r3, ip
5635c: f941 0a0f vst1.8 {d16-d17}, [r1]
56360: d9e7 bls.n 56332 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x1e>
56362: e055 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56364: 2300 movs r3, #0
56366: 4573 cmp r3, lr
56368: d354 bcc.n 56414 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x100>
5636a: 4670 mov r0, lr
5636c: bd70 pop {r4, r5, r6, pc}
5636e: b39b cbz r3, 563d8 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc4>
56370: 2300 movs r3, #0
56372: 18c1 adds r1, r0, r3
56374: f961 0acf vld1.64 {d16-d17}, [r1]
56378: efc9 2070 vshr.s8 q9, q8, #7
5637c: ee33 1b90 vmov.32 r1, d19[1]
56380: ee32 4b90 vmov.32 r4, d18[1]
56384: ee13 5b90 vmov.32 r5, d19[0]
56388: ee12 6b90 vmov.32 r6, d18[0]
5638c: 4321 orrs r1, r4
5638e: ea46 0405 orr.w r4, r6, r5
56392: 4321 orrs r1, r4
56394: d13c bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
56396: 18d1 adds r1, r2, r3
56398: 3310 adds r3, #16
5639a: 4563 cmp r3, ip
5639c: f941 0a0f vst1.8 {d16-d17}, [r1]
563a0: d9e7 bls.n 56372 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x5e>
563a2: e035 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563a4: 2300 movs r3, #0
563a6: 18c1 adds r1, r0, r3
563a8: f961 0a0f vld1.8 {d16-d17}, [r1]
563ac: efc9 2070 vshr.s8 q9, q8, #7
563b0: ee33 1b90 vmov.32 r1, d19[1]
563b4: ee32 4b90 vmov.32 r4, d18[1]
563b8: ee13 5b90 vmov.32 r5, d19[0]
563bc: ee12 6b90 vmov.32 r6, d18[0]
563c0: 4321 orrs r1, r4
563c2: ea46 0405 orr.w r4, r6, r5
563c6: 4321 orrs r1, r4
563c8: d122 bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563ca: 18d1 adds r1, r2, r3
563cc: 3310 adds r3, #16
563ce: 4563 cmp r3, ip
563d0: f941 0acf vst1.64 {d16-d17}, [r1]
563d4: d9e7 bls.n 563a6 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x92>
563d6: e01b b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563d8: 2300 movs r3, #0
563da: 18c1 adds r1, r0, r3
563dc: f961 0acf vld1.64 {d16-d17}, [r1]
563e0: efc9 2070 vshr.s8 q9, q8, #7
563e4: ee33 1b90 vmov.32 r1, d19[1]
563e8: ee32 4b90 vmov.32 r4, d18[1]
563ec: ee13 5b90 vmov.32 r5, d19[0]
563f0: ee12 6b90 vmov.32 r6, d18[0]
563f4: 4321 orrs r1, r4
563f6: ea46 0405 orr.w r4, r6, r5
563fa: 4321 orrs r1, r4
563fc: d108 bne.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
563fe: 18d1 adds r1, r2, r3
56400: 3310 adds r3, #16
56402: 4563 cmp r3, ip
56404: f941 0acf vst1.64 {d16-d17}, [r1]
56408: d9e7 bls.n 563da <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xc6>
5640a: e001 b.n 56410 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xfc>
5640c: 54d1 strb r1, [r2, r3]
5640e: 3301 adds r3, #1
56410: 4573 cmp r3, lr
56412: d2aa bcs.n 5636a <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x56>
56414: 56c1 ldrsb r1, [r0, r3]
56416: 2900 cmp r1, #0
56418: daf8 bge.n 5640c <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0xf8>
5641a: 469e mov lr, r3
5641c: 4670 mov r0, lr
5641e: bd70 pop {r4, r5, r6, pc}
56420: 4803 ldr r0, [pc, #12] ; (56430 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x11c>)
56422: 2130 movs r1, #48 ; 0x30
56424: 4a03 ldr r2, [pc, #12] ; (56434 <_ZN11encoding_rs3mem19copy_ascii_to_ascii17h490246fe632404a1E+0x120>)
56426: 4478 add r0, pc
56428: 447a add r2, pc
5642a: f000 f9bb bl 567a4 <_ZN3std9panicking11begin_panic17hd61fceca69156f6cE>
5642e: defe udf #254 ; 0xfe
56430: 009a3b71 .word 0x009a3b71
56434: 009ebd14 .word 0x009ebd14
The very first observation: |
OK, so the horizontal reductions generate worse code under
6c17e: f921 0a0f vld1.8 {d0-d1}, [r1]
6c182: ef89 2050 vshr.s8 q1, q0, #7
6c186: ff02 2a03 vpmax.u8 d2, d2, d3
6c18a: ff02 2a00 vpmax.u8 d2, d2, d0
6c18e: ee12 1b10 vmov.32 r1, d2[0]
6c192: 2900 cmp r1, #0
56334: f961 0a0f vld1.8 {d16-d17}, [r1]
56338: efc9 2070 vshr.s8 q9, q8, #7
5633c: ee33 1b90 vmov.32 r1, d19[1]
56340: ee32 4b90 vmov.32 r4, d18[1]
56344: ee13 5b90 vmov.32 r5, d19[0]
56348: ee12 6b90 vmov.32 r6, d18[0]
5634c: 4321 orrs r1, r4
5634e: ea46 0405 orr.w r4, r6, r5
56352: 4321 orrs r1, r4 |
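The two instruction sequences above compute the same boolean ("does any lane have its high bit set?") in different ways. The following scalar model is an illustrative reading of the disassembly, not code from either crate; the function names are hypothetical. The first reduction mirrors the two `vpmax.u8` steps followed by a single `vmov.32`, and the second mirrors the four `vmov.32` transfers followed by three ORs in general-purpose registers.

```rust
/// Models `vshr.s8 q, q, #7`: each lane becomes 0xFF if its high bit was set, else 0x00.
fn high_bit_mask(v: [u8; 16]) -> [u8; 16] {
    v.map(|b| ((b as i8) >> 7) as u8)
}

/// Models the first listing: two pairwise-max steps, then one 32-bit move and compare.
fn any_pairwise_max(mask: [u8; 16]) -> bool {
    // vpmax.u8 d2, d2, d3: 16 lanes -> 8 lanes.
    let mut eight = [0u8; 8];
    for i in 0..8 {
        eight[i] = mask[2 * i].max(mask[2 * i + 1]);
    }
    // vpmax.u8 d2, d2, d0: only the low 4 result lanes matter (8 lanes -> 4 lanes).
    let mut four = [0u8; 4];
    for i in 0..4 {
        four[i] = eight[2 * i].max(eight[2 * i + 1]);
    }
    // vmov.32 r1, d2[0]; cmp r1, #0
    u32::from_le_bytes(four) != 0
}

/// Models the second listing: four 32-bit moves to core registers, then three ORs.
fn any_scalar_or(mask: [u8; 16]) -> bool {
    let lane = |i: usize| {
        u32::from_le_bytes([mask[4 * i], mask[4 * i + 1], mask[4 * i + 2], mask[4 * i + 3]])
    };
    // orrs r1, r4; orr.w r4, r6, r5; orrs r1, r4
    ((lane(3) | lane(1)) | (lane(0) | lane(2))) != 0
}
```

Both functions return the same answer for every input. The difference is cost: the second form needs four NEON-to-core register transfers per 16-byte chunk instead of one, which is consistent with the regression being concentrated in workloads dominated by horizontal boolean reductions.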
Filed as a |
Are you using the exact same rustc version for the comparisons? |
No, because there isn't a single Rust version that both 1) compiles |
This is now fixed. Thank you for your help and patience. |
stdsimd is the replacement for simd