Add Neon implementation of find_first_of #6094
hazzlim wants to merge 10 commits into microsoft:main
stl/src/vector_algorithms.cpp

    // Heuristic of Haystack smaller than Neon width, or Needle larger than 2 x Neon width.
    if (_Count1 * sizeof(_Ty) < 16 || _Count2 * sizeof(_Ty) >= 32) {
        return _Pos_from_ptr<_Ty>(
            _Fallback_find_not_2(_First1, _Last1, _First2, _Last2), _First1, _Last1);
    }

These cases, which show a speedup (SU) below 1, seem to indicate a weakness of this choice.
It might be better to go with scalar if _Shuffle_impl_dispatch isn't good here.
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54
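To make the size check quoted in the diff easier to reason about, here is a minimal standalone sketch of it. The name `use_fallback` and the `neon_width_bytes` constant are hypothetical and not part of the PR; only the two thresholds (16 and 32 bytes) come from the quoted code. Note that the condition as written also triggers when the needle is exactly 32 bytes, although the comment says "larger than".

```cpp
#include <cstddef>

// Hypothetical standalone version of the quoted heuristic.
// T stands in for _Ty; the counts stand in for _Count1 and _Count2.
template <class T>
constexpr bool use_fallback(std::size_t haystack_count, std::size_t needle_count) {
    constexpr std::size_t neon_width_bytes = 16; // one 128-bit Neon register
    return haystack_count * sizeof(T) < neon_width_bytes       // haystack under one vector
        || needle_count * sizeof(T) >= 2 * neon_width_bytes;   // needle of two vectors or more
}
```

With 2-byte elements (as wchar_t is on Windows), this means a haystack of fewer than 8 elements or a needle of 16 or more elements takes the fallback. If the trailing benchmark argument is the needle length (an assumption, since the benchmark source is not shown here), values such as 50, 46, 23, and 54 would all satisfy the second condition.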
Hmm, I'll take a look and see if the scalar is better. The results I got for these were a larger regression with the Shuffle implementation (~0.5x). When I looked at what the baseline was doing, I saw that it does this:
STL/stl/inc/__msvc_string_view.hpp
Lines 1085 to 1092 in 8e0c6ff
Hence doing the same here. I'm not entirely sure why it's a regression, as they should be doing essentially the same thing (?)
I think _Traits::find is an element-wise loop here, not a vectorized one. For the "not" variants we expect it to exit early, so scalar may be better because it spends no time on setup.
But if they are essentially the same, and the regression is due to spending some time on branching to here, then probably we can't do much about it.
So it is wmemchr:
STL/stl/inc/__msvc_string_view.hpp
Lines 416 to 430 in 8e0c6ff
Ah, I wonder if the call gets inlined in the header but not in my implementation, because I'm calling the C function rather than the template...
Maybe using wmemchr in _Fallback_find_not_2 would be the way. I remember you optimized find for long runs more than is done on x64/x86, and more than wmemchr does.
And maybe the benchmark is also not entirely fair, in the sense that it is biased toward something unusual.
Because the new Neon implementation of Find is definitely faster than wmemchr.
It's a fair point that wmemchr may have better early-return characteristics. It does look like we weren't inlining; I'll re-spin the benchmarks.
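For readers following the thread, here is a minimal sketch of the idea being discussed: look up each haystack element in the needle with wmemchr and return at the first miss. The function name and signature are hypothetical, and this is not the PR's _Fallback_find_not_2; it only illustrates why an early mismatch makes the per-element search cheap.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical "not found" sentinel, mirroring the usual npos convention.
constexpr std::size_t sketch_not_found = static_cast<std::size_t>(-1);

// Returns the index of the first haystack element that does not occur
// in the needle, or sketch_not_found if every element occurs in it.
inline std::size_t find_first_not_of_sketch(const wchar_t* hay, std::size_t hay_len,
                                            const wchar_t* needle, std::size_t needle_len) {
    for (std::size_t i = 0; i < hay_len; ++i) {
        // One needle scan per haystack element; the loop stops at the first miss.
        if (std::wmemchr(needle, hay[i], needle_len) == nullptr) {
            return i;
        }
    }
    return sketch_not_found;
}
```

For example, searching the haystack L"aabca" against the needle L"ab" stops at index 3, after only four short needle scans. Whether the inner search is inlined into this loop is the concern raised in the comments above.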
Now that it gets inlined, rather than having a call in the loop, the numbers look like this. (I included the speedup vs. the previous commit and the speedup vs. the baseline of main.)
| Benchmark | MSVC SU (prev) | MSVC SU (baseline) | Clang SU (prev) | Clang SU (baseline) |
|---|---|---|---|---|
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 | 1 | 0.769 | 1 | 0.73 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 | 1.067 | 0.957 | 1.152 | 0.866 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 | 0.967 | 1.516 | 1 | 1.17 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 | 1 | 1.702 | 1.003 | 0.767 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 | 0.944 | 2.619 | 1.01 | 2.464 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 | 1 | 5 | 0.952 | 1.792 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 | 1.182 | 1.013 | 1.195 | 1.022 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 | 0.923 | 5.111 | 1.023 | 5.341 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 | 1.145 | 0.96 | 1.214 | 1.023 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 | 1.079 | 6.635 | 0.915 | 2.108 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 | 1.163 | 0.967 | 1.2 | 1.024 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 | 1 | 2.5 | 0.978 | 2.63 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 | 1.148 | 0.978 | 1.173 | 1.019 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 | 1.175 | 1.15 | 1.239 | 1.043 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 | 1.154 | 1.022 | 1.227 | 1.044 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 | 1.006 | 4.219 | 0.99 | 4.33 |
This PR adds a Neon implementation of find_first_of using the Shuffle approach. For _Pos variants, we select between this and the scalar bitmap approach based on thresholds.
For __std_find_first_not_of_trivial_pos_2, in the case where elements do not fit into the scalar bitmap, it can be better to use the approach of finding the haystack elements in the needle, so we do this.
Benchmark results:
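For context on the "scalar bitmap approach" named in the description, here is a rough sketch of the general technique, under the assumption that the bitmap covers element values 0 to 255. All names here are hypothetical and the STL's actual implementation may differ. It also shows the case the description calls out: when a needle element does not fit into the bitmap, the function reports that and the caller has to use another strategy.

```cpp
#include <bitset>
#include <cstddef>
#include <type_traits>

constexpr std::size_t bitmap_not_found = static_cast<std::size_t>(-1);

// Returns false if some needle element does not fit into the 256-entry bitmap.
// Otherwise returns true and stores in `result` the index of the first haystack
// element that is not in the needle (or bitmap_not_found if there is none).
template <class T>
bool try_bitmap_find_first_not_of(const T* hay, std::size_t hay_len,
                                  const T* needle, std::size_t needle_len,
                                  std::size_t& result) {
    using U = std::make_unsigned_t<T>;
    std::bitset<256> table;
    for (std::size_t i = 0; i < needle_len; ++i) {
        const unsigned long long v = static_cast<U>(needle[i]);
        if (v >= 256) {
            return false; // element does not fit; bitmap cannot be used
        }
        table.set(static_cast<std::size_t>(v));
    }
    for (std::size_t i = 0; i < hay_len; ++i) {
        const unsigned long long v = static_cast<U>(hay[i]);
        // A haystack value outside the bitmap range cannot be in the needle.
        if (v >= 256 || !table.test(static_cast<std::size_t>(v))) {
            result = i;
            return true;
        }
    }
    result = bitmap_not_found;
    return true;
}
```

With a needle such as u"\u03B1" the function returns false, which corresponds to the situation where the PR falls back to finding the haystack elements in the needle.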