std.mem: add countScalar and countAny #24104

nektro · 2025-06-06T22:41:10Z

this brings 'count' into sync with the available 'indexOf*Pos' options

nektro · 2025-06-07T21:28:31Z

CI failures were due to #24109

lib/std/mem.zig

Rexicon226 · 2025-08-15T05:48:27Z

@mrjbq7, I see we've moved over here before I was able to finish my idea and post it! 😄
Could you try benchmarking this? It is a very rough idea, but I think it's correct and might be faster.

const std = @import("std");
const N = std.simd.suggestVectorLength(u8) orelse @sizeOf(u8);
const V = @Vector(N, u8);
const Int = std.meta.Int(.unsigned, N);

pub fn countScalarButGood(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;

    if (haystack.len < N) {
        for (haystack) |item| {
            if (item == needle) found += 1;
        }
        return found;
    }

    const broad: V = @splat(needle);
    for (0..haystack.len / N) |i| {
        const h: V = @bitCast(haystack[i * N ..][0..N].*);
        const integer: Int = @bitCast(h == broad);
        found += @popCount(integer);
    }

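    // Tail: re-read the final N bytes, which overlap the last full chunk.
    const remaining = (haystack.len % N);
    if (remaining == 0) return found;
    // The low `overlapped` lanes were already counted by the loop above.
    const overlapped: std.math.Log2Int(Int) = @intCast(N - remaining);
    const mask: Int = (@as(Int, 1) << overlapped) - 1;
    const last: V = @bitCast(haystack[haystack.len - N ..][0..N].*);
    const integer: Int = @bitCast(last == broad);
    // Count only the lanes that have not been seen yet.
    found += @popCount(integer & ~mask);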

    return found;
}
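
To make the tail handling above concrete, here is a minimal sketch of just the mask arithmetic, with the vector width passed in by the caller. `tailKeepMask` is a hypothetical helper name, not part of the PR, and it assumes lane 0 of the comparison result lands in the least significant bit, as the snippet above relies on.

```zig
const std = @import("std");

/// Hypothetical helper: returns a mask selecting the lanes of the final,
/// overlapping n-byte window that the main loop has not counted yet.
/// Assumes lane 0 maps to bit 0 of the integer.
fn tailKeepMask(comptime n: u16, remaining: usize) std.meta.Int(.unsigned, n) {
    const Int = std.meta.Int(.unsigned, n);
    // The caller only reaches the tail when 0 < remaining < n.
    std.debug.assert(remaining > 0 and remaining < n);
    // Lanes shared with the last full chunk.
    const overlapped: std.math.Log2Int(Int) = @intCast(n - remaining);
    const already_counted: Int = (@as(Int, 1) << overlapped) - 1;
    return ~already_counted;
}
```

For a 16-lane vector with 3 leftover bytes, 13 lanes overlap the last full chunk, so only the top 3 bits survive (`0xE000`).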

Rexicon226 · 2025-08-15T05:54:50Z

I think you could do a branch tree and force unroll to make it even faster...

mrjbq7 · 2025-08-15T06:03:43Z

> @mrjbq7, I see we've moved over here before I was able to finish my idea and post it! 😄
> Could you try benchmarking this? It is a very rough idea, but I think it's correct and might be faster.

slower!

no needle in 1mb:

Benchmark 1 (2156 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.28ms ±  110us    2.14ms … 2.98ms        178 ( 8%)        0%
  peak_rss           1.43MB ± 5.80KB    1.17MB … 1.43MB          3 ( 0%)        0%
  cpu_cycles         8.14M  ± 93.7K     8.01M  … 9.07M         178 ( 8%)        0%
  instructions       21.6M  ± 1.09      21.6M  … 21.6M          90 ( 4%)        0%
  cache_references   61.2K  ± 5.64K     47.4K  … 71.0K           0 ( 0%)        0%
  cache_misses       1.16K  ±  125       856   … 1.57K           0 ( 0%)        0%
  branch_misses       375   ± 42.4       354   … 1.78K          76 ( 4%)        0%
Benchmark 2 (2691 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.83ms ± 90.9us    1.76ms … 2.43ms        287 (11%)        ⚡- 19.8% ±  0.2%
  peak_rss           1.43MB ± 6.35KB    1.11MB … 1.43MB          2 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles         6.27M  ± 21.8K     6.26M  … 7.10M         145 ( 5%)        ⚡- 23.0% ±  0.0%
  instructions       16.0M  ± 0.96      16.0M  … 16.0M         165 ( 6%)        ⚡- 26.0% ±  0.0%
  cache_references   56.1K  ± 4.96K     48.8K  … 66.9K           0 ( 0%)        ⚡-  8.4% ±  0.5%
  cache_misses        760   ± 40.7       625   … 1.11K          40 ( 1%)        ⚡- 34.4% ±  0.4%
  branch_misses       351   ± 5.09       339   …  384          127 ( 5%)        ⚡-  6.5% ±  0.4%
Benchmark 3 (1475 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.36ms ±  130us    3.21ms … 4.00ms        125 ( 8%)        💩+ 47.2% ±  0.3%
  peak_rss           1.43MB ± 10.8KB    1.11MB … 1.43MB          2 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         13.2M  ± 95.5K     13.1M  … 14.3M         174 (12%)        💩+ 62.4% ±  0.1%
  instructions       44.3M  ± 1.14      44.3M  … 44.3M          86 ( 6%)        💩+104.7% ±  0.0%
  cache_references   61.1K  ± 5.63K     46.4K  … 70.8K           0 ( 0%)          -  0.2% ±  0.6%
  cache_misses       1.02K  ±  123       738   … 1.35K           0 ( 0%)        ⚡- 12.3% ±  0.7%
  branch_misses       369   ± 8.81       349   …  419           43 ( 3%)        ⚡-  1.8% ±  0.6%

all needle in 1mb:

Benchmark 1 (57 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          88.6ms ± 4.97ms    82.1ms … 98.5ms          0 ( 0%)        0%
  peak_rss           1.43MB ± 12.1KB    1.37MB … 1.43MB          4 ( 7%)        0%
  cpu_cycles          460M  ± 6.93M      456M  …  495M           6 (11%)        0%
  instructions       1.36G  ± 8.50      1.36G  … 1.36G           0 ( 0%)        0%
  cache_references   59.5K  ± 5.28K     45.9K  … 75.1K           7 (12%)        0%
  cache_misses       1.79K  ± 1.25K     1.12K  … 7.51K           6 (11%)        0%
  branch_misses      1.00K  ± 82.0       857   … 1.19K           0 ( 0%)        0%
Benchmark 2 (2173 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.27ms ± 96.9us    2.17ms … 2.89ms        276 (13%)        ⚡- 97.4% ±  0.2%
  peak_rss           1.43MB ± 7.30KB    1.24MB … 1.43MB          3 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles         8.27M  ± 10.6K     8.26M  … 8.43M         113 ( 5%)        ⚡- 98.2% ±  0.1%
  instructions       28.0M  ± 1.07      28.0M  … 28.0M          75 ( 3%)        ⚡- 97.9% ±  0.0%
  cache_references   56.1K  ± 5.03K     48.9K  … 66.6K           0 ( 0%)        ⚡-  5.8% ±  2.2%
  cache_misses        786   ± 41.5       601   … 1.15K          43 ( 2%)        ⚡- 56.0% ±  3.0%
  branch_misses       352   ± 5.65       338   …  381           42 ( 2%)        ⚡- 64.9% ±  0.4%
Benchmark 3 (1474 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.36ms ±  134us    3.20ms … 3.96ms        122 ( 8%)        ⚡- 96.2% ±  0.3%
  peak_rss           1.43MB ± 8.86KB    1.24MB … 1.43MB          3 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles         13.2M  ± 99.4K     13.1M  … 14.3M         179 (12%)        ⚡- 97.1% ±  0.1%
  instructions       44.3M  ± 1.15      44.3M  … 44.3M         102 ( 7%)        ⚡- 96.7% ±  0.0%
  cache_references   60.6K  ± 5.64K     46.8K  … 71.2K           0 ( 0%)          +  1.8% ±  2.5%
  cache_misses       1.00K  ±  123       744   … 1.34K           0 ( 0%)        ⚡- 44.0% ±  4.0%
  branch_misses       368   ± 9.38       349   …  466           48 ( 3%)        ⚡- 63.2% ±  0.5%

mrjbq7 · 2025-08-15T06:04:17Z

it's possible i need to properly randomize the data for correct benchmarking, anyway... more to come
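
One way the randomized input mentioned above could be produced is sketched below. It fills the buffer from a fixed-seed PRNG so that runs stay repeatable, and pairs it with a plain-loop counter to check results against. `fillRandom` and `countScalarNaive` are hypothetical names for illustration, and the seed is arbitrary.

```zig
const std = @import("std");

/// Plain loop used as a reference implementation (hypothetical name).
fn countScalarNaive(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;
    for (haystack) |byte| {
        if (byte == needle) found += 1;
    }
    return found;
}

/// Fill `buf` with pseudo-random bytes from a fixed seed so that
/// every benchmark run sees the same input.
fn fillRandom(buf: []u8, seed: u64) void {
    var prng = std.Random.DefaultPrng.init(seed);
    prng.random().bytes(buf);
}
```

A faster implementation can then be compared against `countScalarNaive` on the same buffer before it is timed.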

Rexicon226 · 2025-08-15T06:05:04Z

could you share your benchmark quickly? im curious to get ~~nerd sniped again~~.

mrjbq7 · 2025-08-15T06:14:02Z

> could you share your benchmark quickly? im curious to get ~~nerd sniped again~~.

well, i tried a couple of them. basically i used a 1mb array with every index set to a constant value and searched for that value, then searched for a value that wasn't in the array. then i tried this version, which has few matches:

const std = @import("std");

// `countScalar` is the implementation under test (not shown here).
pub fn main() !void {
    const allocator = std.heap.smp_allocator;
    const n = 1_000_000;
    const bytes = try allocator.alloc(u8, n);
    defer allocator.free(bytes);
    // Repeating 0..255 pattern, so each byte value matches about n / 256 times.
    for (0..n) |i| {
        bytes[i] = @intCast(i % 256);
    }
    const count = countScalar(u8, bytes, 0);
    const expected: usize = @intFromFloat(@ceil(@as(f64, @floatFromInt(n)) / 256.0));
    if (count != expected) {
        std.debug.print("oops {d}\n", .{count});
        std.process.exit(1);
    }
}

this one produces less varied results:

Benchmark 1 (491 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          10.1ms ±  635us    8.74ms … 12.5ms          9 ( 2%)        0%
  peak_rss           1.56MB ±  133KB    1.20MB … 1.70MB          0 ( 0%)        0%
  cpu_cycles         45.2M  ± 3.06M     40.2M  … 53.8M          14 ( 3%)        0%
  instructions       69.4M  ± 2.01      69.4M  … 69.4M           4 ( 1%)        0%
  cache_references   93.9K  ± 8.46K     78.5K  …  116K           0 ( 0%)        0%
  cache_misses       4.67K  ± 4.80K     1.09K  … 16.7K           0 ( 0%)        0%
  branch_misses       445   ± 22.9       391   …  561           13 ( 3%)        0%
Benchmark 2 (547 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.11ms ±  622us    7.38ms … 11.5ms         10 ( 2%)        ⚡- 10.0% ±  0.8%
  peak_rss           1.43MB ±    0      1.43MB … 1.43MB          0 ( 0%)        ⚡-  8.3% ±  0.7%
  cpu_cycles         40.9M  ± 2.79M     36.3M  … 50.2M          52 (10%)        ⚡-  9.6% ±  0.8%
  instructions       61.1M  ± 1.78      61.1M  … 61.1M           5 ( 1%)        ⚡- 12.1% ±  0.0%
  cache_references   90.4K  ± 6.29K     79.4K  …  112K           7 ( 1%)        ⚡-  3.7% ±  1.0%
  cache_misses       3.36K  ± 4.19K      819   … 16.6K          64 (12%)        ⚡- 28.2% ± 11.7%
  branch_misses      1.49K  ± 1.65K      364   … 4.31K           0 ( 0%)        💩+235.5% ± 32.8%
Benchmark 3 (461 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          10.8ms ±  576us    9.23ms … 13.1ms          5 ( 1%)        💩+  6.9% ±  0.8%
  peak_rss           1.43MB ± 4.96KB    1.33MB … 1.43MB          1 ( 0%)        ⚡-  8.4% ±  0.8%
  cpu_cycles         49.0M  ± 2.89M     44.2M  … 58.2M          36 ( 8%)        💩+  8.4% ±  0.8%
  instructions       89.3M  ± 1.98      89.3M  … 89.3M          12 ( 3%)        💩+ 28.6% ±  0.0%
  cache_references   93.1K  ± 8.41K     76.1K  …  114K           0 ( 0%)          -  0.8% ±  1.1%
  cache_misses       3.87K  ± 4.44K      893   … 16.5K          43 ( 9%)        ⚡- 17.3% ± 12.6%
  branch_misses       396   ± 11.3       374   …  445           17 ( 4%)        ⚡- 11.1% ±  0.5%
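
The expected count in the benchmark program above is the number of indices `i` in `[0, n)` with `i % 256 == 0`, which is `ceil(n / 256)`. As a small sketch, the same value can be computed with integer arithmetic instead of going through `f64`. `expectedZeroCount` is a hypothetical name, not something from the PR.

```zig
const std = @import("std");

/// Number of indices i in [0, n) with i % 256 == 0, using integer math only.
/// Hypothetical helper for illustration.
fn expectedZeroCount(n: usize) usize {
    // divCeil can only fail for a zero denominator (or signed overflow),
    // neither of which can happen with an unsigned type and a constant 256.
    return std.math.divCeil(usize, n, 256) catch unreachable;
}
```

For `n = 1_000_000` this gives 3907, since 1_000_000 / 256 = 3906.25.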

Rexicon226 · 2025-08-15T07:04:36Z

Here are my results from running my benchmark script on an AMD Ryzen 9 7950X3D 16-Core Processor 4.20 GHz:
https://gist.github.com/Rexicon226/1e45ea2086283fb668948fb83f235e23

I'd be very curious for input from some people with different hardware, or if there are any mistakes in my benchmark script :). Just as a sanity check, I have confirmed that using random bytes for the input does generate matches on basically all runs, usually hundreds/thousands of them, so "stuff" is happening.

std.mem: add countScalar and countAny

a55dc08

Merge branch 'master' into nektro-patch-45095

da2f6ba

gabeuehlein mentioned this pull request Jul 9, 2025

Writergate #24329

Merged

12 tasks

nektro mentioned this pull request Aug 15, 2025

std.mem: adding countScalar() #24854

Closed

nektro added 2 commits August 14, 2025 22:46

Update lib/std/mem.zig

19b526b

Merge branch 'master' into nektro-patch-45095

c6545c0
