
Possible missed vectorization in unrolled_dot #3218


Draft · wants to merge 15 commits into main

Conversation

@pdogr (Contributor) commented Mar 23, 2023

The dot routine used in math_helper (before ZeroSlice) does not vectorize: https://godbolt.org/z/v6bdroEPr. There are a bunch of vmulss (multiply scalar single-precision) instructions in the asm.
LLVM complains that the loop cannot be vectorized because floating-point operations are not commutative.

A similar thing occurs with a naive dot product impl https://godbolt.org/z/5G9hMvP63, which also fails to vectorize.
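
For reference, here is a minimal sketch (illustrative only, not the actual math_helper routine) of the kind of scalar loop in question; the running f32 sum is a sequential reduction, and reordering it would change the result, which is why LLVM keeps it scalar:

// Naive dot product: LLVM will not auto-vectorize the f32 reduction
// because reassociating the additions could change the result.
fn dot_naive(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        sum += x * y;
    }
    sum
}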

This PR adds an AVX dot product using fmadd (fused multiply-add) instructions, which leads to a performance improvement on my Mac Pro (x86-64), compared with c7567d46b (HEAD -> main, origin/main) Bump webpack in /ffi/diplomat/js/examples/wasm-demo (#3199):

Line Break/UTF8/Th/lstm time:   [451.82 µs 454.34 µs 457.04 µs]
                        change: [-4.9073% -3.7435% -2.4420%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [452.50 µs 456.03 µs 459.75 µs]
                        change: [-4.8695% -3.6713% -2.5002%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

The test suite, run with RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo test --all-features, also passes under experimental/segmenter.
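
For illustration, a sketch of what an AVX+FMA dot product can look like with core::arch intrinsics. This is not the PR's exact code; dot_avx is a placeholder name, and the caller is assumed to have verified avx and fma support before calling it:

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx,fma")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    use core::arch::x86_64::*;
    #[cfg(target_arch = "x86")]
    use core::arch::x86::*;

    debug_assert_eq!(a.len(), b.len());
    let chunks = a.len() / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // One fused multiply-add per 8-lane chunk: acc += a_chunk * b_chunk.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_fmadd_ps(va, vb, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Scalar tail for lengths that are not a multiple of 8.
    for i in chunks * 8..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

Note that FMA accumulates with a single rounding step and a different summation order than the scalar loop, so tiny numeric differences from the unrolled version are expected.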

Edit:
Reran the benchmarks

HEAD: using command cargo bench --all-features -- "lstm" (it seems compiling at HEAD with -C target-cpu=native led to a performance regression?)

PR: using command RUSTFLAGS="-C opt-level=2 -C target-cpu=native" cargo bench --all-features -- "lstm"

@robertbastian added the C-segmentation (Component: Segmentation) label Apr 13, 2023
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • components/segmenter/src/complex/lstm/matrix.rs is now changed in the branch
  • experimental/segmenter/src/lib.rs is no longer changed in the branch
  • experimental/segmenter/src/line.rs is no longer changed in the branch
  • experimental/segmenter/src/math_helper.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@pdogr (Contributor, Author) commented May 23, 2023

Benchmarks for Intel Mac

main: cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [400.98 µs 404.45 µs 408.42 µs]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [396.37 µs 400.42 µs 405.52 µs]
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

dynamic feature detection using OnceCell: cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [367.98 µs 370.66 µs 373.47 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Line Break/UTF16/Th/lstm
                        time:   [367.89 µs 370.54 µs 373.36 µs]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

compile-time feature detection: RUSTFLAGS="-C opt-level=3 -C target-feature=+avx,+fma" cargo bench --all-features -p icu_segmenter -- "lstm"

Line Break/UTF8/Th/lstm time:   [350.15 µs 352.51 µs 355.14 µs]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Line Break/UTF16/Th/lstm
                        time:   [351.04 µs 354.48 µs 358.58 µs]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

@sffc (Member) commented May 23, 2023

nice!

pdogr added 9 commits May 24, 2023 10:03
Leads to a regression

This reverts commit e9d4bd3.
Possible cases for the dot implementation:
- "-C target-feature=+avx,+fma" + (x86, x86_64): compiles AVX versions of
  dot_1 and dot_2 [compile time]
- "-C target-feature=+neon" + aarch64 + little endian: compiles NEON
  versions of dot_1 and dot_2 [compile time]
- None of the above features enabled:
 - no_std defaults to the unrolled dot versions, as runtime feature
   detection requires "std" [compile time]
 - if std is enabled, the fastest implementation is assigned to
   DOT_1_PTR and DOT_2_PTR during initialization, depending on the features
   detected, defaulting to the unrolled versions. We incur the penalty of
   accessing once_cell::sync::Lazy each time dot is called (see the sketch
   after this list).
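
To make the std path concrete, here is an illustrative sketch of that dispatch (assumptions: x86_64 with std, a dot_avx like the sketch earlier in this thread, and a scalar fallback; this is not the PR's exact code):

use once_cell::sync::Lazy;

type DotFn = fn(&[f32], &[f32]) -> f32;

// Scalar fallback standing in for the existing unrolled implementation.
fn dot_fallback(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

// Safe wrapper; only installed after AVX+FMA were detected at runtime.
fn dot_avx_wrapper(a: &[f32], b: &[f32]) -> f32 {
    unsafe { dot_avx(a, b) }
}

// Resolved once on first use, then reused for every subsequent call.
static DOT_1_PTR: Lazy<DotFn> = Lazy::new(|| {
    if std::arch::is_x86_feature_detected!("avx")
        && std::arch::is_x86_feature_detected!("fma")
    {
        dot_avx_wrapper
    } else {
        dot_fallback
    }
});

fn dot_1(a: &[f32], b: &[f32]) -> f32 {
    // Every call pays the Lazy's initialized check plus an indirect call.
    (*DOT_1_PTR)(a, b)
}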