Add Neon implementation of find_first_of #6094

Open
hazzlim wants to merge 10 commits into microsoft:main from hazzlim:find-meow-of-neon-pr-1

Contributor

@hazzlim hazzlim commented Feb 20, 2026

This PR adds a Neon implementation of find_first_of using the Shuffle approach. For _Pos variants, we select between this and the scalar bitmap approach based on thresholds.

For __std_find_first_not_of_trivial_pos_2, when the elements do not fit into the scalar bitmap, it can be better to search for each haystack element in the needle, so we do this.
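To make the "scalar bitmap" idea concrete, here is a minimal sketch for 8-bit elements. It is not the STL's internal code: the function name and the 256-entry table are assumptions chosen for illustration. The sketch also suggests why wider elements may "not fit" (a value such as L'\x03B1' has no slot in a 256-bit table).

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an STL internal): returns the index of the first
// haystack byte that also occurs in the needle, or size_t(-1) if there is none.
std::size_t bitmap_find_first_of(const std::uint8_t* hay, std::size_t hay_size,
                                 const std::uint8_t* needle, std::size_t needle_size) {
    std::bitset<256> table; // one bit per possible byte value
    for (std::size_t i = 0; i < needle_size; ++i) {
        table.set(needle[i]); // mark every value present in the needle
    }
    for (std::size_t i = 0; i < hay_size; ++i) {
        if (table.test(hay[i])) { // constant-time membership test per element
            return i;
        }
    }
    return static_cast<std::size_t>(-1); // no match
}
```

Building the table costs one pass over the needle, after which each haystack element needs only a single lookup.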

Benchmark results (SU = speedup relative to the baseline):

| Benchmark | MSVC SU | Clang SU |
|---|---|---|
| bm<AlgType::std_func, uint8_t>/2/3 | 0.979 | 0.934 |
| bm<AlgType::std_func, uint8_t>/6/81 | 1 | 1 |
| bm<AlgType::std_func, uint8_t>/7/4 | 0.949 | 1.048 |
| bm<AlgType::std_func, uint8_t>/9/3 | 2.804 | 3.409 |
| bm<AlgType::std_func, uint8_t>/22/5 | 6.136 | 6.5 |
| bm<AlgType::std_func, uint8_t>/58/2 | 10 | 11.951 |
| bm<AlgType::std_func, uint8_t>/75/85 | 9.65 | 10 |
| bm<AlgType::std_func, uint8_t>/102/4 | 11.956 | 13.677 |
| bm<AlgType::std_func, uint8_t>/200/46 | 9.268 | 9.5 |
| bm<AlgType::std_func, uint8_t>/325/1 | 1.022 | 1.017 |
| bm<AlgType::std_func, uint8_t>/400/50 | 9.556 | 9.783 |
| bm<AlgType::std_func, uint8_t>/1011/11 | 11.201 | 11.5 |
| bm<AlgType::std_func, uint8_t>/1280/46 | 9.37 | 9.778 |
| bm<AlgType::std_func, uint8_t>/1502/23 | 9.778 | 10 |
| bm<AlgType::std_func, uint8_t>/2203/54 | 9.773 | 10.252 |
| bm<AlgType::std_func, uint8_t>/3056/7 | 13.25 | 14.261 |
| bm<AlgType::std_func, uint16_t>/2/3 | 0.954 | 0.978 |
| bm<AlgType::std_func, uint16_t>/6/81 | 1.005 | 1 |
| bm<AlgType::std_func, uint16_t>/7/4 | 0.979 | 1.024 |
| bm<AlgType::std_func, uint16_t>/9/3 | 2.02 | 2.463 |
| bm<AlgType::std_func, uint16_t>/22/5 | 5.286 | 5.932 |
| bm<AlgType::std_func, uint16_t>/58/2 | 5.462 | 5.581 |
| bm<AlgType::std_func, uint16_t>/75/85 | 5.669 | 5.465 |
| bm<AlgType::std_func, uint16_t>/102/4 | 9.333 | 7.475 |
| bm<AlgType::std_func, uint16_t>/200/46 | 5.281 | 5.281 |
| bm<AlgType::std_func, uint16_t>/325/1 | 1.098 | 0.968 |
| bm<AlgType::std_func, uint16_t>/400/50 | 6.603 | 5.342 |
| bm<AlgType::std_func, uint16_t>/1011/11 | 5.864 | 6 |
| bm<AlgType::std_func, uint16_t>/1280/46 | 5.745 | 5.489 |
| bm<AlgType::std_func, uint16_t>/1502/23 | 5.265 | 5.385 |
| bm<AlgType::std_func, uint16_t>/2203/54 | 5.505 | 5.5 |
| bm<AlgType::std_func, uint16_t>/3056/7 | 6.667 | 6.667 |
| bm<AlgType::std_func, uint32_t>/2/3 | 1 | 0.957 |
| bm<AlgType::std_func, uint32_t>/6/81 | 0.957 | 1.002 |
| bm<AlgType::std_func, uint32_t>/7/4 | 1.019 | 0.977 |
| bm<AlgType::std_func, uint32_t>/9/3 | 1.861 | 2.199 |
| bm<AlgType::std_func, uint32_t>/22/5 | 2.826 | 2.889 |
| bm<AlgType::std_func, uint32_t>/58/2 | 3.566 | 3.223 |
| bm<AlgType::std_func, uint32_t>/75/85 | 2.938 | 2.938 |
| bm<AlgType::std_func, uint32_t>/102/4 | 3.27 | 3.568 |
| bm<AlgType::std_func, uint32_t>/200/46 | 3 | 2.897 |
| bm<AlgType::std_func, uint32_t>/325/1 | 1.023 | 1 |
| bm<AlgType::std_func, uint32_t>/400/50 | 3.009 | 2.926 |
| bm<AlgType::std_func, uint32_t>/1011/11 | 3.226 | 3.283 |
| bm<AlgType::std_func, uint32_t>/1280/46 | 3.07 | 3 |
| bm<AlgType::std_func, uint32_t>/1502/23 | 3.043 | 3.113 |
| bm<AlgType::std_func, uint32_t>/2203/54 | 2.891 | 3.017 |
| bm<AlgType::std_func, uint32_t>/3056/7 | 3.302 | 3.659 |
| bm<AlgType::std_func, uint64_t>/2/3 | 0.904 | 0.933 |
| bm<AlgType::std_func, uint64_t>/6/81 | 1.062 | 1 |
| bm<AlgType::std_func, uint64_t>/7/4 | 0.937 | 0.854 |
| bm<AlgType::std_func, uint64_t>/9/3 | 1.5 | 1.563 |
| bm<AlgType::std_func, uint64_t>/22/5 | 1.706 | 1.773 |
| bm<AlgType::std_func, uint64_t>/58/2 | 2.667 | 1.792 |
| bm<AlgType::std_func, uint64_t>/75/85 | 1.736 | 1.634 |
| bm<AlgType::std_func, uint64_t>/102/4 | 1.643 | 1.5 |
| bm<AlgType::std_func, uint64_t>/200/46 | 1.69 | 1.636 |
| bm<AlgType::std_func, uint64_t>/325/1 | 1.005 | 0.978 |
| bm<AlgType::std_func, uint64_t>/400/50 | 1.6 | 1.637 |
| bm<AlgType::std_func, uint64_t>/1011/11 | 1.714 | 1.643 |
| bm<AlgType::std_func, uint64_t>/1280/46 | 1.606 | 1.817 |
| bm<AlgType::std_func, uint64_t>/1502/23 | 1.638 | 1.601 |
| bm<AlgType::std_func, uint64_t>/2203/54 | 1.643 | 1.669 |
| bm<AlgType::std_func, uint64_t>/3056/7 | 1.627 | 1.627 |
| bm<AlgType::str_member_first, char>/2/3 | 1.042 | 0.988 |
| bm<AlgType::str_member_first, char>/6/81 | 0.978 | 0.794 |
| bm<AlgType::str_member_first, char>/7/4 | 0.977 | 0.92 |
| bm<AlgType::str_member_first, char>/9/3 | 1.174 | 1.325 |
| bm<AlgType::str_member_first, char>/22/5 | 1.244 | 1.677 |
| bm<AlgType::str_member_first, char>/58/2 | 2.58 | 4.377 |
| bm<AlgType::str_member_first, char>/75/85 | 1 | 1.132 |
| bm<AlgType::str_member_first, char>/102/4 | 2.24 | 3.688 |
| bm<AlgType::str_member_first, char>/200/46 | 1 | 1.333 |
| bm<AlgType::str_member_first, char>/325/1 | 5.319 | 9.287 |
| bm<AlgType::str_member_first, char>/400/50 | 1 | 1.402 |
| bm<AlgType::str_member_first, char>/1011/11 | 0.983 | 1.66 |
| bm<AlgType::str_member_first, char>/1280/46 | 1.007 | 1.533 |
| bm<AlgType::str_member_first, char>/1502/23 | 1.012 | 1.588 |
| bm<AlgType::str_member_first, char>/2203/54 | 1 | 1.564 |
| bm<AlgType::str_member_first, char>/3056/7 | 1.674 | 2.914 |
| bm<AlgType::str_member_first, wchar_t>/2/3 | 0.872 | 0.792 |
| bm<AlgType::str_member_first, wchar_t>/6/81 | 0.96 | 0.765 |
| bm<AlgType::str_member_first, wchar_t>/7/4 | 0.975 | 0.714 |
| bm<AlgType::str_member_first, wchar_t>/9/3 | 0.966 | 0.798 |
| bm<AlgType::str_member_first, wchar_t>/22/5 | 1.308 | 1.173 |
| bm<AlgType::str_member_first, wchar_t>/58/2 | 2.15 | 2.1 |
| bm<AlgType::str_member_first, wchar_t>/75/85 | 0.991 | 0.893 |
| bm<AlgType::str_member_first, wchar_t>/102/4 | 1.979 | 2.086 |
| bm<AlgType::str_member_first, wchar_t>/200/46 | 1.028 | 1.072 |
| bm<AlgType::str_member_first, wchar_t>/325/1 | 3.932 | 3.546 |
| bm<AlgType::str_member_first, wchar_t>/400/50 | 1.029 | 1.077 |
| bm<AlgType::str_member_first, wchar_t>/1011/11 | 1.086 | 1.117 |
| bm<AlgType::str_member_first, wchar_t>/1280/46 | 0.957 | 1.108 |
| bm<AlgType::str_member_first, wchar_t>/1502/23 | 0.976 | 1.15 |
| bm<AlgType::str_member_first, wchar_t>/2203/54 | 1.023 | 1.116 |
| bm<AlgType::str_member_first, wchar_t>/3056/7 | 1.314 | 1.366 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2/3 | 0.958 | 0.709 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/6/81 | 1.223 | 1.127 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/7/4 | 2.008 | 1.484 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/9/3 | 2.215 | 1.077 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/22/5 | 4.229 | 3.394 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/58/2 | 7.778 | 3.035 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/75/85 | 1.591 | 1.515 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/102/4 | 7.299 | 7.219 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/200/46 | 1.76 | 1.741 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/325/1 | 10.222 | 3.258 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/400/50 | 1.816 | 1.736 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1011/11 | 4.2 | 3.917 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1280/46 | 1.913 | 1.905 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/1502/23 | 2.663 | 2.506 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/2203/54 | 1.87 | 1.867 |
| bm<AlgType::str_member_first, wchar_t, L'\x03B1'>/3056/7 | 6.825 | 6.349 |
| bm<AlgType::str_member_first, char32_t>/2/3 | 1.148 | 1 |
| bm<AlgType::str_member_first, char32_t>/6/81 | 1.142 | 1.024 |
| bm<AlgType::str_member_first, char32_t>/7/4 | 0.909 | 1.029 |
| bm<AlgType::str_member_first, char32_t>/9/3 | 0.875 | 0.979 |
| bm<AlgType::str_member_first, char32_t>/22/5 | 0.938 | 1.074 |
| bm<AlgType::str_member_first, char32_t>/58/2 | 1.419 | 1.893 |
| bm<AlgType::str_member_first, char32_t>/75/85 | 1.037 | 1.217 |
| bm<AlgType::str_member_first, char32_t>/102/4 | 0.969 | 1.218 |
| bm<AlgType::str_member_first, char32_t>/200/46 | 0.987 | 1.256 |
| bm<AlgType::str_member_first, char32_t>/325/1 | 2.245 | 2.904 |
| bm<AlgType::str_member_first, char32_t>/400/50 | 1 | 1.308 |
| bm<AlgType::str_member_first, char32_t>/1011/11 | 1.023 | 1.294 |
| bm<AlgType::str_member_first, char32_t>/1280/46 | 0.875 | 1.278 |
| bm<AlgType::str_member_first, char32_t>/1502/23 | 0.891 | 1.29 |
| bm<AlgType::str_member_first, char32_t>/2203/54 | 0.957 | 1.306 |
| bm<AlgType::str_member_first, char32_t>/3056/7 | 1 | 1.29 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2/3 | 1.016 | 0.946 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/6/81 | 2.16 | 2.2 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/7/4 | 1.646 | 1.867 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/9/3 | 1.436 | 1.633 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/22/5 | 2.442 | 2.636 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/58/2 | 3.389 | 3.217 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/75/85 | 2.878 | 2.938 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/102/4 | 3.173 | 3.297 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/200/46 | 2.809 | 3.143 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/325/1 | 3.115 | 3.351 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/400/50 | 2.943 | 3.077 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1011/11 | 3.343 | 3.54 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1280/46 | 3.006 | 3.07 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/1502/23 | 3.182 | 3.162 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/2203/54 | 2.949 | 3.07 |
| bm<AlgType::str_member_first, char32_t, U'\x03B1'>/3056/7 | 3.492 | 3.81 |
| bm<AlgType::str_member_first_not, char>/2/3 | 1.024 | 1.011 |
| bm<AlgType::str_member_first_not, char>/6/81 | 0.956 | 0.782 |
| bm<AlgType::str_member_first_not, char>/7/4 | 1 | 0.974 |
| bm<AlgType::str_member_first_not, char>/9/3 | 1.272 | 1.318 |
| bm<AlgType::str_member_first_not, char>/22/5 | 1.524 | 1.6 |
| bm<AlgType::str_member_first_not, char>/58/2 | 2.833 | 3.255 |
| bm<AlgType::str_member_first_not, char>/75/85 | 1.059 | 0.976 |
| bm<AlgType::str_member_first_not, char>/102/4 | 2.864 | 3.151 |
| bm<AlgType::str_member_first_not, char>/200/46 | 1.173 | 1.125 |
| bm<AlgType::str_member_first_not, char>/325/1 | 5.581 | 6.095 |
| bm<AlgType::str_member_first_not, char>/400/50 | 1.152 | 1.029 |
| bm<AlgType::str_member_first_not, char>/1011/11 | 1.046 | 1.209 |
| bm<AlgType::str_member_first_not, char>/1280/46 | 1.163 | 1.117 |
| bm<AlgType::str_member_first_not, char>/1502/23 | 1.136 | 1.091 |
| bm<AlgType::str_member_first_not, char>/2203/54 | 1.15 | 1.103 |
| bm<AlgType::str_member_first_not, char>/3056/7 | 1.897 | 2.316 |
| bm<AlgType::str_member_first_not, wchar_t>/2/3 | 1 | 0.755 |
| bm<AlgType::str_member_first_not, wchar_t>/6/81 | 0.968 | 0.852 |
| bm<AlgType::str_member_first_not, wchar_t>/7/4 | 0.976 | 0.667 |
| bm<AlgType::str_member_first_not, wchar_t>/9/3 | 0.933 | 0.67 |
| bm<AlgType::str_member_first_not, wchar_t>/22/5 | 1.193 | 1.068 |
| bm<AlgType::str_member_first_not, wchar_t>/58/2 | 1.967 | 2.2 |
| bm<AlgType::str_member_first_not, wchar_t>/75/85 | 0.953 | 1 |
| bm<AlgType::str_member_first_not, wchar_t>/102/4 | 1.915 | 2.222 |
| bm<AlgType::str_member_first_not, wchar_t>/200/46 | 1 | 1.217 |
| bm<AlgType::str_member_first_not, wchar_t>/325/1 | 3.13 | 4.286 |
| bm<AlgType::str_member_first_not, wchar_t>/400/50 | 0.965 | 1.205 |
| bm<AlgType::str_member_first_not, wchar_t>/1011/11 | 0.987 | 1.281 |
| bm<AlgType::str_member_first_not, wchar_t>/1280/46 | 1 | 1.222 |
| bm<AlgType::str_member_first_not, wchar_t>/1502/23 | 1.02 | 1.29 |
| bm<AlgType::str_member_first_not, wchar_t>/2203/54 | 0.977 | 1.306 |
| bm<AlgType::str_member_first_not, wchar_t>/3056/7 | 1.23 | 1.602 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 | 0.769 | 0.73 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 | 0.897 | 0.752 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 | 1.568 | 1.17 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 | 1.702 | 0.764 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 | 2.773 | 2.439 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 | 5 | 1.882 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 | 0.856 | 0.855 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 | 5.535 | 5.222 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 | 0.839 | 0.843 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 | 6.147 | 2.304 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 | 0.832 | 0.853 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 | 2.5 | 2.689 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 | 0.852 | 0.869 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 | 0.979 | 0.842 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 | 0.886 | 0.851 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 | 4.193 | 4.372 |

@hazzlim hazzlim requested a review from a team as a code owner February 20, 2026 12:58
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Feb 20, 2026
@StephanTLavavej StephanTLavavej added the performance, ARM64, and ARM64EC labels Feb 20, 2026
@StephanTLavavej StephanTLavavej self-assigned this Feb 20, 2026
Comment on lines 6888 to 6892
// Heuristic of Haystack smaller than Neon width, or Needle larger than 2 x Neon width.
if (_Count1 * sizeof(_Ty) < 16 || _Count2 * sizeof(_Ty) >= 32) {
    return _Pos_from_ptr<_Ty>(
        _Fallback_find_not_2(_First1, _Last1, _First2, _Last2), _First1, _Last1);
}
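As a rough illustration of what the quoted condition selects, the predicate below restates it with the byte widths spelled out. The names `uses_fallback` and `neon_width_bytes` are hypothetical and do not appear in the PR; the element counts in the checks are arbitrary examples.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption for illustration: a Neon vector register is 16 bytes wide,
// matching the literals 16 and 32 in the quoted condition.
constexpr std::size_t neon_width_bytes = 16;

// Hypothetical restatement of the quoted heuristic: true when the fallback
// path would be taken for a haystack of count1 and a needle of count2 elements.
template <class Ty>
constexpr bool uses_fallback(std::size_t count1, std::size_t count2) {
    return count1 * sizeof(Ty) < neon_width_bytes || count2 * sizeof(Ty) >= 2 * neon_width_bytes;
}

// With 2-byte elements: a 7-element haystack is 14 bytes (below one vector),
// and a 16-element needle is 32 bytes (two full vectors).
static_assert(uses_fallback<std::uint16_t>(7, 4), "short haystack takes the fallback");
static_assert(uses_fallback<std::uint16_t>(100, 16), "long needle takes the fallback");
static_assert(!uses_fallback<std::uint16_t>(100, 4), "otherwise the vector path is used");
```

Note that the condition is evaluated in bytes, so the element-count thresholds halve each time the element size doubles.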
Contributor

@AlexGuteniev AlexGuteniev Feb 20, 2026

These cases showing SU < 1 seem to indicate a weakness of this choice.
It might be better to go with scalar, if _Shuffle_impl_dispatch isn't good here.

bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23
bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54

Contributor Author

Hmm, I'll take a look and see if the scalar is better. The results I got for these were a larger regression with the Shuffle implementation (~0.5x). When I took a look at what the baseline was doing, I saw that it does this:

    for (auto _Match_try = _Hay_start; _Match_try < _Hay_end; ++_Match_try) {
        if (!_Traits::find(_Needle, _Needle_size, *_Match_try)) {
            return static_cast<size_t>(_Match_try - _Haystack); // found a match
        }
    }
    return static_cast<size_t>(-1); // no match
}

hence doing this here. I'm not entirely sure why it's a regression, as they should be doing essentially the same thing (?)
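For readers without the header at hand, the quoted baseline loop can be approximated with standard names. This is only a sketch: `find_first_not_of_sketch` is a hypothetical function, and `std::char_traits<wchar_t>::find` stands in for `_Traits::find`.

```cpp
#include <cstddef>
#include <string>

// Sketch of the "look up each haystack element in the needle" strategy.
// Returns the index of the first haystack element that is absent from the
// needle, or size_t(-1) if every element occurs in the needle.
std::size_t find_first_not_of_sketch(const wchar_t* hay, std::size_t hay_size,
                                     const wchar_t* needle, std::size_t needle_size) {
    using traits = std::char_traits<wchar_t>;
    for (std::size_t i = 0; i < hay_size; ++i) {
        // traits::find returns nullptr when hay[i] does not occur in the needle
        if (traits::find(needle, needle_size, hay[i]) == nullptr) {
            return i; // found an element that is not in the needle
        }
    }
    return static_cast<std::size_t>(-1); // no such element
}
```

Each iteration performs one search over the needle, so the cost of that inner search (and whether it is inlined) dominates when the needle is long.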

Contributor

I think _Traits::find is an element-wise loop, not a vectorized loop, here. We expect that for the "not" variant it will exit early, so scalar may be better because it spends no time on setup.

Contributor

But if they are essentially the same, and the regression is due to spending some time on branching to here, then probably we can't do much about it.

Contributor

So it is wmemchr:

_NODISCARD static _CONSTEXPR17 const _Elem* find(
    _In_reads_(_Count) const _Elem* _First, const size_t _Count, const _Elem& _Ch) noexcept /* strengthened */ {
    // look for _Ch in [_First, _First + _Count)
#if _HAS_CXX17
    if (_STD _Is_constant_evaluated()) {
        if constexpr (is_same_v<_Elem, wchar_t>) {
            return __builtin_wmemchr(_First, _Ch, _Count);
        } else {
            return _Primary_char_traits::find(_First, _Count, _Ch);
        }
    }
#endif // _HAS_CXX17
    return reinterpret_cast<const _Elem*>(_CSTD wmemchr(reinterpret_cast<const wchar_t*>(_First), _Ch, _Count));
}

Contributor Author

Ah, I wonder if the call gets inlined in the header but not in my implementation, because I'm calling the C function rather than the template...

Contributor

Maybe using wmemchr in _Fallback_find_not_2 would be the way. I remember you optimized find for long runs more than is done on x64/x86, and more than in wmemchr.

And maybe the benchmark is also not fair enough, in the sense that it is biased toward something unusual.

Contributor Author

Because the new Neon implementation of Find is definitely faster than wmemchr

Contributor Author

It is a fair point that wmemchr may have better early-return characteristics. It does look like we weren't inlining; I'll re-spin the benchmarks.

Contributor Author

Now that it gets inlined, rather than having a call in the loop, the numbers look like this. (I included the speedup vs. the previous commit and the speedup vs. the baseline of main.)

| Benchmark | MSVC SU (prev) | MSVC SU (baseline) | Clang SU (prev) | Clang SU (baseline) |
|---|---|---|---|---|
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2/3 | 1 | 0.769 | 1 | 0.73 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/6/81 | 1.067 | 0.957 | 1.152 | 0.866 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/7/4 | 0.967 | 1.516 | 1 | 1.17 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/9/3 | 1 | 1.702 | 1.003 | 0.767 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/22/5 | 0.944 | 2.619 | 1.01 | 2.464 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/58/2 | 1 | 5 | 0.952 | 1.792 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/75/85 | 1.182 | 1.013 | 1.195 | 1.022 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/102/4 | 0.923 | 5.111 | 1.023 | 5.341 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/200/46 | 1.145 | 0.96 | 1.214 | 1.023 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/325/1 | 1.079 | 6.635 | 0.915 | 2.108 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/400/50 | 1.163 | 0.967 | 1.2 | 1.024 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1011/11 | 1 | 2.5 | 0.978 | 2.63 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1280/46 | 1.148 | 0.978 | 1.173 | 1.019 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/1502/23 | 1.175 | 1.15 | 1.239 | 1.043 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/2203/54 | 1.154 | 1.022 | 1.227 | 1.044 |
| bm<AlgType::str_member_first_not, wchar_t, L'\x03B1'>/3056/7 | 1.006 | 4.219 | 0.99 | 4.33 |

@StephanTLavavej StephanTLavavej removed their assignment Feb 21, 2026
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Feb 21, 2026
