-
Notifications
You must be signed in to change notification settings - Fork 88
Implement fallback to smaller vector size for swizzle_dyn #433
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
169c541
to
041e43b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello, sorry for taking so long to get around to this. I like the idea but I would prefer it was written to avoid unnecessary allowances of lints.
@@ -15,7 +15,7 @@ where | |||
/// A planned compiler improvement will enable using `#[target_feature]` instead. | |||
#[inline] | |||
pub fn swizzle_dyn(self, idxs: Simd<u8, N>) -> Self { | |||
#![allow(unused_imports, unused_unsafe)] | |||
#![allow(unused_imports, unused_unsafe, unreachable_patterns)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh...? 🤔
all(target_arch = "arm", target_feature = "v7") | ||
), | ||
target_feature = "neon", | ||
target_endian = "little" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
conflicts hereabouts
_ => dispatch_compat(self, idxs), | ||
_ => swizzle_dyn_scalar(self, idxs), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not like this new structure in the match. Some allowances are required in this code without making it totally illegible, but this requires allowing a lint that should not be allowed. It can be avoided by making the first _
guarded by an if cfg!(..)
so that swizzle_dyn_scalar
only catches the true base case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did consider that approach, however unless I'm missing something, it would have to effectively duplicate all of the previous if-cfgs in a negated form which seems to be a lot uglier than just having a silently unreachable pattern in some build configurations. I don't have an issue with changing it if you prefer it like that.
#![allow( | ||
dead_code, | ||
unused_unsafe, | ||
unreachable_patterns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should not need this
unreachable_patterns, |
This PR adds a fallback implementation so that e.g. u8x64::swizzle_dyn can be reasonably efficient even when only compiled with 128-bit SSSE.
A "downgraded" swizzle_dyn op on N lanes emits 4 swizzle_dyn ops on N/2 lanes. If the optimizer can deduce that index values are bounded to N/2 or less, then it will generally be more efficient.
For example,
u8x64::swizzle_dyn
will only emit 4pshufb
instructions on SSSE, instead of 16 in the general case, if the optimizer can prove index values are always <16 (this is generally achieved by preceding the swizzle with a 0xf mask).Additionaly, for non-power-of-two N values, this PR adds a fallback implementation which zero-extends to the next power of two size.
Benchmarks
Below are benchmark results for the following code, on 5 target-cpu levels:
x86-64
- baseline, no vectorized shufflesx86-64-v2
- ssse, adds pshufb on u8x16x86-64-v3
- avx2, adds vpshufb on u8x32 (vpshufb is not a true extension of pshufb to 256-bit, instead it's more like 2 pshufb ops ran in parallel, with 4-bit indices in the corresponding lane)x86-64-v4
- avx512, adds vpshufb on u8x64 (again really just 4x pshufb in parallel)icelake-server
- avx512vbmi, adds vpermb (this is a true 256/512-bit shuffle with 5/6-bit indices)N.B. the code used to benchmark includes #431, and removes the
src/masks/bitmask.rs
avx512f mask implementation to work around the problem discussed here.The main thing to look for is the performance of
swizzle_dyn_64*
on non-vbmi targets (v2, v3, v4), and the performance ofswizzle_dyn_32*
on x86-64-v2. With the previous implementation, sizes without native vector instructions fall back to the scalar implementation, while in the PR version they fall back to a lower vector size for ~4-10x performance over the scalar version.Benchmark data for old version
-Ctarget-cpu=x86-64
-Ctarget-cpu=x86-64-v2
-Ctarget-cpu=x86-64-v3
-Ctarget-cpu=x86-64-v4
-Ctarget-cpu=icelake-server
(avx512vbmi)Benchmark data for new version
-Ctarget-cpu=x86-64
-Ctarget-cpu=x86-64-v2
-Ctarget-cpu=x86-64-v3
-Ctarget-cpu=x86-64-v4
-Ctarget-cpu=icelake-server
(avx512vbmi)