
Possible missed vectorization in unrolled_dot #3218


Draft · wants to merge 15 commits into main

Conversation

@pdogr (Contributor) commented Mar 23, 2023

The dot routine used in math_helper (before ZeroSlice) does not vectorize: https://godbolt.org/z/v6bdroEPr. There are a bunch of vmulss (multiply scalar single-precision) instructions in the asm.
LLVM complains that the loop cannot be vectorized because floating-point operations are not commutative.

A similar thing occurs with a naive dot product impl https://godbolt.org/z/5G9hMvP63, which also fails to vectorize.
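
For reference, here is a minimal sketch (illustrative only, not the actual math_helper routine) of the kind of scalar loop in question; the running f32 sum is a sequential reduction, and reordering it would change the result, which is why LLVM keeps it scalar:

// Naive dot product: LLVM will not auto-vectorize the f32 reduction
// because reassociating the additions could change the result.
fn dot_naive(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        sum += x * y;
    }
    sum
}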

This PR adds an AVX dot product using fmadd (fused multiply-add) instructions, which leads to a performance improvement on my Mac Pro (x86-64), compared with c7567d46b (HEAD -> main, origin/main) Bump webpack in /ffi/diplomat/js/examples/wasm-demo (#3199):

Line Break/UTF8/Th/lstm time:   [451.82 µs 454.34 µs 457.04 µs]
                        change: [-4.9073% -3.7435% -2.4420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [452.50 µs 456.03 µs 459.75 µs]
                        change: [-4.8695% -3.6713% -2.5002%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

The test suite, run with RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo test --all-features, also passes under experimental/segmenter.
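
For illustration, a sketch of what an AVX+FMA dot product can look like with core::arch intrinsics. This is not the PR's exact code; dot_avx is a placeholder name, and the caller is assumed to have verified avx and fma support before calling it:

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::*;
    #[cfg(target_arch = "x86")]
    use core::arch::x86::*;

    debug_assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // One fused multiply-add per 8-lane chunk: acc += a_chunk * b_chunk.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths that are not a multiple of 8.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

Note that FMA accumulates with a single rounding step and a different summation order than the scalar loop, so tiny numeric differences from the unrolled version are expected.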

Edit:
Reran the benchmarks

HEAD: using command cargo bench --all-features -- "lstm" (it seems compiling at HEAD with -C target-cpu=native led to a performance regression?)

PR: using command RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo bench --all-features -- "lstm"

@robertbastian added the C-segmentation (Component: Segmentation) label Apr 13, 2023
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • components/segmenter/src/complex/lstm/matrix.rs is now changed in the branch
  • experimental/segmenter/src/lib.rs is no longer changed in the branch
  • experimental/segmenter/src/line.rs is no longer changed in the branch
  • experimental/segmenter/src/math_helper.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@pdogr (Contributor, Author) commented May 23, 2023

Benchmarks for Intel Mac

main: cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [400.98 µs 404.45 µs 408.42 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [396.37 µs 400.42 µs 405.52 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

dynamic feature detection using OnceCell: cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [367.98 µs 370.66 µs 373.47 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Line Break/UTF16/Th/lstm
                        time:   [367.89 µs 370.54 µs 373.36 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

compile-time feature detection: RUSTFLAGS="-C opt-level=3 -C target-feature=+avx,+fma" cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [350.15 µs 352.51 µs 355.14 µs]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [351.04 µs 354.48 µs 358.58 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

@sffc (Member) commented May 23, 2023

nice!

pdogr added 9 commits May 24, 2023 10:03
Leads to a regression

This reverts commit e9d4bd3.
Possible cases for the dot implementation:
- "-C target-feature=+avx,+fma" + (x86, x86_64): compiles AVX versions of
  dot_1 and dot_2 [compile time]
- "-C target-feature=+neon" + aarch64 + little endian: compiles NEON
  versions of dot_1 and dot_2 [compile time]
- None of the above features enabled:
 - no_std defaults to the unrolled dot versions, as runtime feature
   detection requires "std" [compile time]
 - if std is enabled, the fastest implementation is assigned to
   DOT_1_PTR and DOT_2_PTR during initialization, depending on the features
   detected, defaulting to the unrolled versions. We incur the penalty of
   accessing once_cell::sync::Lazy each time dot is called (see the sketch
   after this list).
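
To make the std path concrete, here is an illustrative sketch of that dispatch (assumptions: x86_64 with std, a dot_avx like the sketch earlier in this thread, and a scalar fallback; this is not the PR's exact code):

use once_cell::sync::Lazy;

type DotFn = fn(&[f32], &[f32]) -> f32;

// Scalar fallback standing in for the existing unrolled implementation.
fn dot_fallback(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Safe wrapper; only installed after AVX+FMA were detected at runtime.
fn dot_avx_wrapper(a: &[f32], b: &[f32]) -> f32 {
    unsafe { dot_avx(a, b) }
}

// Resolved once on first use, then reused for every subsequent call.
static DOT_1_PTR: Lazy<DotFn> = Lazy::new(|| {
    if std::arch::is_x86_feature_detected!("avx")
        && std::arch::is_x86_feature_detected!("fma")
    {
        dot_avx_wrapper
    } else {
        dot_fallback
    }
});

fn dot_1(a: &[f32], b: &[f32]) -> f32 {
    // Every call pays the Lazy's initialized check plus an indirect call.
    (*DOT_1_PTR)(a, b)
}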