nektro
Contributor

@nektro nektro commented Jun 6, 2025

this brings 'count' into sync with the available 'indexOf*Pos' options

@nektro
Contributor Author

nektro commented Jun 7, 2025

The CI failures were due to #24109.

@gabeuehlein gabeuehlein mentioned this pull request Jul 9, 2025
12 tasks
@Rexicon226
Contributor

@mrjbq7, I see we've moved over here before I was able to finish my idea and post it! 😄
Could you try benchmarking this? It is a very rough idea, but I think it's correct and might be faster.

const std = @import("std");
const N = std.simd.suggestVectorLength(u8) orelse @sizeOf(u8);
const V = @Vector(N, u8);
const Int = std.meta.Int(.unsigned, N);

pub fn countScalarButGood(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;

    if (haystack.len < N) {
        for (haystack) |item| {
            if (item == needle) found += 1;
        }
        return found;
    }

    // Compare N bytes at a time; each matching lane becomes one set bit.
    const broad: V = @splat(needle);
    for (0..haystack.len / N) |i| {
        const h: V = @bitCast(haystack[i * N ..][0..N].*);
        const integer: Int = @bitCast(h == broad);
        found += @popCount(integer);
    }

    // Tail: re-read the last N bytes and mask off the lanes that overlap
    // with bytes the loop above has already counted.
    const remaining = (haystack.len % N);
    if (remaining == 0) return found;
    const overlapped: std.math.Log2Int(Int) = @intCast(N - remaining);
    const mask: Int = (@as(Int, 1) << overlapped) - 1;
    const last: V = @bitCast(haystack[haystack.len - N ..][0..N].*);
    const integer: Int = @bitCast(last == broad);
    found += @popCount(integer & ~mask);

    return found;
}
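
The tail handling above is the least obvious step, so here is a small worked sketch of just the mask arithmetic. It fixes `N = 8` purely for illustration (the code above derives `N` from the target), and `tailMask` is a hypothetical helper name, not something from the PR. It also assumes lane 0 of the comparison result maps to the least significant bit, as on little-endian targets.

```zig
const std = @import("std");

// Hypothetical fixed width for illustration only.
const N = 8;
const Int = std.meta.Int(.unsigned, N);

/// Returns the mask that keeps only the lanes of the final, overlapping
/// N-byte window that the main loop has not already counted.
fn tailMask(len: usize) Int {
    const remaining = len % N;
    std.debug.assert(len >= N and remaining != 0);
    const overlapped: std.math.Log2Int(Int) = @intCast(N - remaining);
    const already_counted: Int = (@as(Int, 1) << overlapped) - 1;
    return ~already_counted;
}

test "tail mask for an 11-byte haystack" {
    // 11 % 8 = 3 leftover bytes, so the last window overlaps 5 counted bytes
    // and only the top 3 lanes should contribute.
    try std.testing.expectEqual(@as(Int, 0b1110_0000), tailMask(11));
}
```

With an 11-byte haystack, the main loop counts bytes 0..7, the final window covers bytes 3..10, and the mask drops the five lanes (bytes 3..7) that would otherwise be counted twice.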

@Rexicon226
Contributor

I think you could do a branch tree and force unroll to make it even faster...
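
For reference, one possible reading of the "force unroll" part of this idea is sketched below. It covers only the unrolling, not the branch tree, and it swaps in a plain scalar loop for the tail instead of the overlapping-window trick. The names `countUnrolled` and `unroll` and the factor of 4 are assumptions made for illustration, and this version has not been benchmarked.

```zig
const std = @import("std");

const N = std.simd.suggestVectorLength(u8) orelse 1;
const V = @Vector(N, u8);
const Int = std.meta.Int(.unsigned, N);

// Hypothetical unroll factor; the best value likely depends on the target.
const unroll = 4;

pub fn countUnrolled(haystack: []const u8, needle: u8) usize {
    const broad: V = @splat(needle);
    const block = N * unroll;
    var found: usize = 0;
    var i: usize = 0;

    // Process `unroll` vectors per iteration; `inline for` expands the body
    // at compile time, so no inner loop remains in the generated code.
    while (i + block <= haystack.len) : (i += block) {
        inline for (0..unroll) |k| {
            const h: V = haystack[i + k * N ..][0..N].*;
            const bits: Int = @bitCast(h == broad);
            found += @popCount(bits);
        }
    }

    // Scalar tail for the remaining (fewer than N * unroll) bytes.
    for (haystack[i..]) |item| {
        found += @intFromBool(item == needle);
    }
    return found;
}
```

Whether this beats the simple `for` loop would need measuring, since the optimizer may already vectorize and unroll the naive version on its own.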

@mrjbq7

mrjbq7 commented Aug 15, 2025

> @mrjbq7, I see we've moved over here before I was able to finish my idea and post it! 😄 Could you try benchmarking this? It is a very rough idea, but I think it's correct and might be faster.

slower!

no needle in 1mb:

Benchmark 1 (2156 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.28ms ±  110us    2.14ms … 2.98ms        178 ( 8%)        0%
  peak_rss           1.43MB ± 5.80KB    1.17MB … 1.43MB          3 ( 0%)        0%
  cpu_cycles         8.14M  ± 93.7K     8.01M  … 9.07M         178 ( 8%)        0%
  instructions       21.6M  ± 1.09      21.6M  … 21.6M          90 ( 4%)        0%
  cache_references   61.2K  ± 5.64K     47.4K  … 71.0K           0 ( 0%)        0%
  cache_misses       1.16K  ±  125       856   … 1.57K           0 ( 0%)        0%
  branch_misses       375   ± 42.4       354   … 1.78K          76 ( 4%)        0%
Benchmark 2 (2691 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.83ms ± 90.9us    1.76ms … 2.43ms        287 (11%)        ⚡- 19.8% ±  0.2%
  peak_rss           1.43MB ± 6.35KB    1.11MB … 1.43MB          2 ( 0%)          +  0.0% ±  0.0%
  cpu_cycles         6.27M  ± 21.8K     6.26M  … 7.10M         145 ( 5%)        ⚡- 23.0% ±  0.0%
  instructions       16.0M  ± 0.96      16.0M  … 16.0M         165 ( 6%)        ⚡- 26.0% ±  0.0%
  cache_references   56.1K  ± 4.96K     48.8K  … 66.9K           0 ( 0%)        ⚡-  8.4% ±  0.5%
  cache_misses        760   ± 40.7       625   … 1.11K          40 ( 1%)        ⚡- 34.4% ±  0.4%
  branch_misses       351   ± 5.09       339   …  384          127 ( 5%)        ⚡-  6.5% ±  0.4%
Benchmark 3 (1475 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.36ms ±  130us    3.21ms … 4.00ms        125 ( 8%)        💩+ 47.2% ±  0.3%
  peak_rss           1.43MB ± 10.8KB    1.11MB … 1.43MB          2 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         13.2M  ± 95.5K     13.1M  … 14.3M         174 (12%)        💩+ 62.4% ±  0.1%
  instructions       44.3M  ± 1.14      44.3M  … 44.3M          86 ( 6%)        💩+104.7% ±  0.0%
  cache_references   61.1K  ± 5.63K     46.4K  … 70.8K           0 ( 0%)          -  0.2% ±  0.6%
  cache_misses       1.02K  ±  123       738   … 1.35K           0 ( 0%)        ⚡- 12.3% ±  0.7%
  branch_misses       369   ± 8.81       349   …  419           43 ( 3%)        ⚡-  1.8% ±  0.6%

all needle in 1mb:

Benchmark 1 (57 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          88.6ms ± 4.97ms    82.1ms … 98.5ms          0 ( 0%)        0%
  peak_rss           1.43MB ± 12.1KB    1.37MB … 1.43MB          4 ( 7%)        0%
  cpu_cycles          460M  ± 6.93M      456M  …  495M           6 (11%)        0%
  instructions       1.36G  ± 8.50      1.36G  … 1.36G           0 ( 0%)        0%
  cache_references   59.5K  ± 5.28K     45.9K  … 75.1K           7 (12%)        0%
  cache_misses       1.79K  ± 1.25K     1.12K  … 7.51K           6 (11%)        0%
  branch_misses      1.00K  ± 82.0       857   … 1.19K           0 ( 0%)        0%
Benchmark 2 (2173 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          2.27ms ± 96.9us    2.17ms … 2.89ms        276 (13%)        ⚡- 97.4% ±  0.2%
  peak_rss           1.43MB ± 7.30KB    1.24MB … 1.43MB          3 ( 0%)          +  0.2% ±  0.1%
  cpu_cycles         8.27M  ± 10.6K     8.26M  … 8.43M         113 ( 5%)        ⚡- 98.2% ±  0.1%
  instructions       28.0M  ± 1.07      28.0M  … 28.0M          75 ( 3%)        ⚡- 97.9% ±  0.0%
  cache_references   56.1K  ± 5.03K     48.9K  … 66.6K           0 ( 0%)        ⚡-  5.8% ±  2.2%
  cache_misses        786   ± 41.5       601   … 1.15K          43 ( 2%)        ⚡- 56.0% ±  3.0%
  branch_misses       352   ± 5.65       338   …  381           42 ( 2%)        ⚡- 64.9% ±  0.4%
Benchmark 3 (1474 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          3.36ms ±  134us    3.20ms … 3.96ms        122 ( 8%)        ⚡- 96.2% ±  0.3%
  peak_rss           1.43MB ± 8.86KB    1.24MB … 1.43MB          3 ( 0%)          +  0.2% ±  0.2%
  cpu_cycles         13.2M  ± 99.4K     13.1M  … 14.3M         179 (12%)        ⚡- 97.1% ±  0.1%
  instructions       44.3M  ± 1.15      44.3M  … 44.3M         102 ( 7%)        ⚡- 96.7% ±  0.0%
  cache_references   60.6K  ± 5.64K     46.8K  … 71.2K           0 ( 0%)          +  1.8% ±  2.5%
  cache_misses       1.00K  ±  123       744   … 1.34K           0 ( 0%)        ⚡- 44.0% ±  4.0%
  branch_misses       368   ± 9.38       349   …  466           48 ( 3%)        ⚡- 63.2% ±  0.5%

@mrjbq7

mrjbq7 commented Aug 15, 2025

it's possible i need to properly randomize the data for correct benchmarking, anyway... more to come
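
A minimal sketch of what randomized input could look like, assuming a Zig version where `std.Random.DefaultPrng` is available. The helper names `fillRandom` and `countNaive` are hypothetical and not part of the PR. A fixed seed keeps runs comparable, and the naive count supplies the expected value, since it can no longer be computed from a formula.

```zig
const std = @import("std");

/// Fills `bytes` with pseudo-random data. A fixed seed gives every
/// benchmarked binary the same input.
fn fillRandom(bytes: []u8, seed: u64) void {
    var prng = std.Random.DefaultPrng.init(seed);
    prng.random().bytes(bytes);
}

/// Straightforward reference count, used only to check results.
fn countNaive(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;
    for (haystack) |item| {
        found += @intFromBool(item == needle);
    }
    return found;
}

test "same seed gives the same data and count" {
    var a: [4096]u8 = undefined;
    var b: [4096]u8 = undefined;
    fillRandom(&a, 0x5eed);
    fillRandom(&b, 0x5eed);
    try std.testing.expectEqualSlices(u8, &a, &b);
    try std.testing.expectEqual(countNaive(&a, 0), countNaive(&b, 0));
}
```

In a benchmark, the reference count would ideally run outside the timed region so that it does not mask differences between the implementations being compared.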

@Rexicon226
Contributor

could you share your benchmark quickly? I'm curious to get nerd sniped again.

@mrjbq7

mrjbq7 commented Aug 15, 2025

> could you share your benchmark quickly? im curious to get nerd sniped again.

well, I tried a couple of setups. Basically, I filled a 1 MB array with a constant value and searched for that value (every byte matches), then searched for a value that isn't in the array (no matches). Then I tried this version, where only a few bytes match:

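const std = @import("std");

// `countScalar` stands for the implementation under test (not shown here).
pub fn main() !void {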
   const allocator = std.heap.smp_allocator;
   const n = 1_000_000;
   const bytes = try allocator.alloc(u8, n);
   defer allocator.free(bytes);
   for (0..n) |i| {
       bytes[i] = @intCast(i % 256);
   }
   const count = countScalar(u8, bytes, 0);
   const expected: usize = @intFromFloat(@ceil(@as(f64, @floatFromInt(n)) / 256.0));
   if (count != expected) {
       std.debug.print("oops {d}\n", .{count});
       std.process.exit(1);
   }
}

this one produces less varied results:

Benchmark 1 (491 runs): ./indexOfScalarPos
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          10.1ms ±  635us    8.74ms … 12.5ms          9 ( 2%)        0%
  peak_rss           1.56MB ±  133KB    1.20MB … 1.70MB          0 ( 0%)        0%
  cpu_cycles         45.2M  ± 3.06M     40.2M  … 53.8M          14 ( 3%)        0%
  instructions       69.4M  ± 2.01      69.4M  … 69.4M           4 ( 1%)        0%
  cache_references   93.9K  ± 8.46K     78.5K  …  116K           0 ( 0%)        0%
  cache_misses       4.67K  ± 4.80K     1.09K  … 16.7K           0 ( 0%)        0%
  branch_misses       445   ± 22.9       391   …  561           13 ( 3%)        0%
Benchmark 2 (547 runs): ./simpleForLoop
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          9.11ms ±  622us    7.38ms … 11.5ms         10 ( 2%)        ⚡- 10.0% ±  0.8%
  peak_rss           1.43MB ±    0      1.43MB … 1.43MB          0 ( 0%)        ⚡-  8.3% ±  0.7%
  cpu_cycles         40.9M  ± 2.79M     36.3M  … 50.2M          52 (10%)        ⚡-  9.6% ±  0.8%
  instructions       61.1M  ± 1.78      61.1M  … 61.1M           5 ( 1%)        ⚡- 12.1% ±  0.0%
  cache_references   90.4K  ± 6.29K     79.4K  …  112K           7 ( 1%)        ⚡-  3.7% ±  1.0%
  cache_misses       3.36K  ± 4.19K      819   … 16.6K          64 (12%)        ⚡- 28.2% ± 11.7%
  branch_misses      1.49K  ± 1.65K      364   … 4.31K           0 ( 0%)        💩+235.5% ± 32.8%
Benchmark 3 (461 runs): ./scalarButGood
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          10.8ms ±  576us    9.23ms … 13.1ms          5 ( 1%)        💩+  6.9% ±  0.8%
  peak_rss           1.43MB ± 4.96KB    1.33MB … 1.43MB          1 ( 0%)        ⚡-  8.4% ±  0.8%
  cpu_cycles         49.0M  ± 2.89M     44.2M  … 58.2M          36 ( 8%)        💩+  8.4% ±  0.8%
  instructions       89.3M  ± 1.98      89.3M  … 89.3M          12 ( 3%)        💩+ 28.6% ±  0.0%
  cache_references   93.1K  ± 8.41K     76.1K  …  114K           0 ( 0%)          -  0.8% ±  1.1%
  cache_misses       3.87K  ± 4.44K      893   … 16.5K          43 ( 9%)        ⚡- 17.3% ± 12.6%
  branch_misses       396   ± 11.3       374   …  445           17 ( 4%)        ⚡- 11.1% ±  0.5%

@Rexicon226
Contributor

Here are my results from running my benchmark script on an AMD Ryzen 9 7950X3D 16-Core Processor (4.20 GHz):
https://gist.github.com/Rexicon226/1e45ea2086283fb668948fb83f235e23
[image: benchmark results]

I'd be very curious for input from some people with different hardware, or if there are any mistakes in my benchmark script :). Just as a sanity check, I have confirmed that using random bytes for the input does generate matches on basically all runs, usually hundreds/thousands of them, so "stuff" is happening.
