std.mem: add countScalar and countAny #24104
base: master
The CI failures were #24109 |
@mrjbq7, I see we've moved over here before I was able to finish my idea and post it! 😄

    const std = @import("std");

    const N = std.simd.suggestVectorLength(u8) orelse @sizeOf(u8);
    const V = @Vector(N, u8);
    const Int = std.meta.Int(.unsigned, N);

    pub fn countScalarButGood(haystack: []const u8, needle: u8) usize {
        var found: usize = 0;

        // Short inputs: plain scalar loop.
        if (haystack.len < N) {
            for (haystack) |item| {
                if (item == needle) found += 1;
            }
            return found;
        }

        // Full blocks: compare N bytes at once and count the set bits.
        const broad: V = @splat(needle);
        for (0..haystack.len / N) |i| {
            const h: V = @bitCast(haystack[i * N ..][0..N].*);
            const integer: Int = @bitCast(h == broad);
            found += @popCount(integer);
        }

        const remaining = (haystack.len % N);
        if (remaining == 0) return found;

        // Tail: re-read the last N bytes and mask off the lanes
        // that the full blocks already counted.
        const overlapped: std.math.Log2Int(Int) = @intCast(N - remaining);
        const mask: Int = (@as(Int, 1) << overlapped) - 1;
        const last: V = @bitCast(haystack[haystack.len - N ..][0..N].*);
        const integer: Int = @bitCast(last == broad);
        found += @popCount(integer & ~mask);
        return found;
    }

|
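The tail step in `countScalarButGood` re-reads the last `N` bytes and masks off the lanes that the full blocks already counted. Here is a small worked check of that mask arithmetic, as a sketch only: `N = 16`, the 20-byte input, and the `countNaive` helper are assumptions chosen for illustration and are not part of the PR. It also assumes lane 0 maps to the least significant bit, as on little-endian targets.

```zig
const std = @import("std");

// Hypothetical scalar reference, not part of the PR.
fn countNaive(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;
    for (haystack) |item| {
        if (item == needle) found += 1;
    }
    return found;
}

test "tail mask drops lanes that full blocks already counted" {
    const N = 16; // assumed vector length, for illustration only
    // 20 bytes: one full block of 'a' plus a 4-byte tail "baab".
    const haystack = "aaaaaaaaaaaaaaaa" ++ "baab";

    const remaining = haystack.len % N; // 4 bytes not covered by full blocks
    const overlapped: u4 = @intCast(N - remaining); // 12 lanes already counted
    const mask: u16 = (@as(u16, 1) << overlapped) - 1; // low 12 bits set

    // Re-read the last N bytes, which overlap the previous block by 12 lanes.
    const last: @Vector(N, u8) = haystack[haystack.len - N ..][0..N].*;
    const broad: @Vector(N, u8) = @splat('a');
    const bits: u16 = @bitCast(last == broad);

    // Only the top 4 lanes (the uncounted tail) may contribute.
    // Assumes lane 0 is the least significant bit (little-endian targets).
    const tail_count = @popCount(bits & ~mask);
    try std.testing.expectEqual(countNaive(haystack[N..], 'a'), tail_count);
}
```

Running this with `zig test` compares the masked popcount against the scalar count of the tail alone, which is a cheap way to check the overlap handling for other lengths too.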
I think you could do a branch tree and force unroll to make it even faster... |
slower! no needle in 1mb:
all needle in 1mb:
|
it's possible i need to properly randomize the data for correct benchmarking, anyway... more to come |
could you share your benchmark quickly? im curious to get |
well, i tried a couple of them. basically i used a 1mb array with every index set to a constant value and looked for that value, then looked for a value that wasn't in the array. then i tried this version with not many finds:

    const std = @import("std");
    // assumed to be the function added by this PR
    const countScalar = std.mem.countScalar;

    pub fn main() !void {
        const allocator = std.heap.smp_allocator;
        const n = 1_000_000;
        const bytes = try allocator.alloc(u8, n);
        defer allocator.free(bytes);

        // fill with 0, 1, ..., 255, 0, 1, ... so each value is rare
        for (0..n) |i| {
            bytes[i] = @intCast(i % 256);
        }

        const count = countScalar(u8, bytes, 0);
        // 0 appears once per 256 bytes, rounded up
        const expected: usize = @intFromFloat(@ceil(@as(f64, @floatFromInt(n)) / 256.0));
        if (count != expected) {
            std.debug.print("oops {d}\n", .{count});
            std.process.exit(1);
        }
    }

this one produces less varied results:
|
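The two extreme cases described above (every byte matches, no byte matches) can be timed with a harness along these lines. This is a rough sketch, not the benchmark used in this thread: `countNaive` is a hypothetical stand-in for the function under test, the fill values are arbitrary, and it assumes a Zig version where `std.time.Timer` and `std.mem.doNotOptimizeAway` are available (as in 0.14).

```zig
const std = @import("std");

// Hypothetical stand-in for the function being benchmarked.
fn countNaive(haystack: []const u8, needle: u8) usize {
    var found: usize = 0;
    for (haystack) |item| {
        if (item == needle) found += 1;
    }
    return found;
}

pub fn main() !void {
    const allocator = std.heap.smp_allocator;
    const bytes = try allocator.alloc(u8, 1_000_000);
    defer allocator.free(bytes);
    @memset(bytes, 0xAA); // every index holds the same constant

    // First needle matches every byte, second matches none.
    const needles = [_]u8{ 0xAA, 0x55 };
    for (needles) |needle| {
        var timer = try std.time.Timer.start();
        const found = countNaive(bytes, needle);
        const elapsed_ns = timer.read();
        // Keep the optimizer from discarding the call.
        std.mem.doNotOptimizeAway(found);
        std.debug.print("needle=0x{x} found={d} ns={d}\n", .{ needle, found, elapsed_ns });
    }
}
```

A single timed call like this is noisy. Repeating each case many times and building with `-O ReleaseFast` would give more comparable numbers.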
Here are my results from running my benchmark script on a I'd be very curious for input from some people with different hardware, or if there are any mistakes in my benchmark script :). Just as a sanity check, I have confirmed that using random bytes for the input does generate matches on basically all runs, usually hundreds/thousands of them, so "stuff" is happening. |
this brings 'count' into sync with the available 'indexOf*Pos' options