Possible missed vectorization in unrolled_dot #3218
Conversation
Benchmarks for Intel Mac main

- dynamic feature detection
- using compile time feature detection

nice!

Leads to a regression. This reverts commit e9d4bd3.
The dot implementation possible cases:

- `--target-feature=+avx,+fma` on x86/x86_64: compiles AVX versions for `dot_1` and `dot_2` [compile time]
- `--target-feature=+neon` on little-endian aarch64: compiles NEON versions for `dot_1` and `dot_2` [compile time]
- none of the above features enabled, `no_std`: defaults to the unrolled dot versions, as runtime feature detection requires `std` [compile time]
- `std` enabled: the fastest implementation is assigned to `DOT_1_PTR`/`DOT_2_PTR` during initialization, depending on the features detected, defaulting to the unrolled versions. We incur the penalty of accessing a `once_cell::sync::Lazy` each time dot is called.
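The `std` dispatch path described above can be sketched as follows. This is an illustrative reconstruction, not the PR's code: it uses `std::sync::OnceLock` in place of `once_cell::sync::Lazy`, and a placeholder scalar fallback standing in for the unrolled versions:

```rust
use std::sync::OnceLock;

type DotFn = fn(&[f32], &[f32]) -> f32;

// Portable fallback, standing in for the unrolled dot versions.
fn unrolled_dot(xs: &[f32], ys: &[f32]) -> f32 {
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}

// The function pointer is selected once, on first call; every later call
// pays one synchronized read, analogous to the Lazy access described above.
static DOT_1_PTR: OnceLock<DotFn> = OnceLock::new();

fn dot_1(xs: &[f32], ys: &[f32]) -> f32 {
    let f = DOT_1_PTR.get_or_init(|| {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
                // In the real code this branch would return the AVX+FMA version.
            }
        }
        unrolled_dot
    });
    f(xs, ys)
}
```

The point of the single `static` is that feature detection runs once per process rather than once per call; the remaining per-call cost is the initialized-check on the cell, which is what the description above flags as the penalty.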
The dot routine used in `math_helper` (before `ZeroSlice`) does not vectorize: https://godbolt.org/z/v6bdroEPr. There are a bunch of `vmulss` (multiply scalar single-precision) instructions in the asm. LLVM complains that the loop cannot be vectorized because floating-point operations are not associative, so the additions cannot be reordered.

A similar thing occurs with a naive dot product impl, https://godbolt.org/z/5G9hMvP63, which also fails to vectorize.
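For reference, a naive dot product along these lines (a hypothetical stand-in, not the exact `math_helper` source) is a strict left-to-right `f32` reduction; under default FP semantics the compiler may not reassociate the additions, so it emits scalar multiply/add instructions instead of packed SIMD:

```rust
// Naive dot product: a strict sequential f32 reduction. Vectorizing it
// would require reassociating the additions, which changes rounding, so
// LLVM keeps it scalar (vmulss/vaddss) under default FP semantics.
fn naive_dot(xs: &[f32], ys: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (x, y) in xs.iter().zip(ys.iter()) {
        acc += x * y;
    }
    acc
}
```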
This PR adds an AVX dot product using `fmadd` (fused multiply-add) instructions, which leads to a performance improvement on my Mac Pro (x86-64), comparing against c7567d46b (HEAD -> main, origin/main) "Bump webpack in /ffi/diplomat/js/examples/wasm-demo (#3199)".

The test suite, run with `RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo test --all-features`, also passes under `experimental/segmenter`.
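An AVX+FMA dot product in this spirit might look like the following. This is an illustrative sketch, not the PR's actual implementation: it accumulates 8 `f32` lanes per iteration with `_mm256_fmadd_ps` and handles the remainder with a scalar tail. Note that fused multiply-add plus lane-wise accumulation can round differently from the strict scalar loop:

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
#[target_feature(enable = "fma")]
unsafe fn dot_avx_fma(xs: &[f32], ys: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let n = xs.len().min(ys.len());
    let chunks = n / 8;
    // SAFETY: caller guarantees AVX+FMA are available; all loads below
    // stay within the first `chunks * 8` elements of both slices.
    let mut acc = unsafe { _mm256_setzero_ps() };
    for i in 0..chunks {
        unsafe {
            let a = _mm256_loadu_ps(xs.as_ptr().add(i * 8));
            let b = _mm256_loadu_ps(ys.as_ptr().add(i * 8));
            acc = _mm256_fmadd_ps(a, b, acc); // acc += a * b, fused
        }
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    unsafe { _mm256_storeu_ps(lanes.as_mut_ptr(), acc) };
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for the remaining < 8 elements.
    for i in chunks * 8..n {
        sum += xs[i] * ys[i];
    }
    sum
}

// Safe wrapper: dispatch to the SIMD version only when the CPU supports it.
fn dot(xs: &[f32], ys: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") && is_x86_feature_detected!("fma") {
            return unsafe { dot_avx_fma(xs, ys) };
        }
    }
    xs.iter().zip(ys).map(|(x, y)| x * y).sum()
}
```

Writing the kernel behind `#[target_feature]` is what lets the compiler emit `vfmadd` without requiring `-C target-feature=+avx,+fma` for the whole crate; the runtime check in the wrapper keeps the binary safe on older CPUs.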
Edit: Reran the benchmarks.

HEAD: using command `cargo bench --all-features -- "lstm"`. (It seems compiling at HEAD with `-C target-cpu=native` leads to a performance regression?)

PR: using command `RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo bench --all-features -- "lstm"`.