perf: fold LSB-test `i32.and X 1` into `i32.ctz` in boolean contexts by ggreif · Pull Request #8562 · WebAssembly/binaryen

ggreif · 2026-04-01T09:39:03Z

Summary

An if-else conditioned on (i32.and X (i32.const 1)) tests the least significant bit of X. Since i32.ctz X == 0 iff the LSB of X is set, we can replace the condition with i32.ctz X and swap the branches — saving one instruction.

The second commit extends this to the primary pattern from the issue — eqz(and X 1) as a boolean condition (used in br_if, if, select) — handled in optimizeBoolean so all three sites benefit from one insertion.

Handles the constant on either side (left or right of and)
visitIf: (and X 1); if T E → (ctz X); if E T
optimizeBoolean: eqz(and X 1) → ctz X — covers the typical br_if (eqz (and X 1)) pattern

Motivation

Filed in #5752. The Motoko compiler already implements this in its own peephole optimizer (instrList.ml); the goal is to bring it to wasm-opt so that hand-written Wasm (e.g. the Motoko RTS, written in Rust) benefits too.

The optimizeBoolean rule alone fires 26–105 times across the three Motoko RTS variants (mo-rts-eop, mo-rts-incremental, mo-rts-non-incremental), targeting the is_skewed/is_scalar pointer-tagging checks in the GC hot path.

Applying wasm-opt --optimize-instructions to the Motoko RTS and running the benchmark suite shows the following gross effects (the submitted optimisation is a contributing factor alongside other rules triggered in the same pass):

Benchmark	Before	After	Δ
`heap-32` (GC-heavy, run 1)	1,153,792,735 instr	1,151,398,207 instr	−2,394,528 (−0.21%)
`heap-32` (run 2)	1,256,407,315 instr	1,253,408,059 instr	−2,999,256 (−0.24%)
`heap-64` (run 1)	1,324,057,357 instr	1,321,855,449 instr	−2,201,908 (−0.17%)
`heap-64` (run 2)	1,295,845,087 instr	1,293,744,743 instr	−2,100,344 (−0.16%)
`bignum`	2,504,499 cycles	2,504,383 cycles	−116
`candid-subtype-cost`	1,115,011 cycles	1,114,823 cycles	−188

The GC-heavy heap benchmarks benefit most, consistent with the is_skewed check firing frequently during pointer traversal.

Test plan

New lit test test/lit/passes/optimize-instructions-lsb-if.wast covers if (const left and right) and br_if (eqz (and X 1))
All three test cases produce i32.ctz in the output

🤖 Generated with Claude Code

Add ggreif/binaryen (branch gabor/lsb-if-ctz-flake) as a flake input, exposing a patched wasm-opt that folds LSB-test `i32.and X 1` patterns into `i32.ctz` (WebAssembly/binaryen#8562). Apply it to the non-debug RTS variants in installPhase, yielding ~0.2% instruction count reductions in GC-heavy benchmarks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

kripken · 2026-04-01T16:31:52Z

Interesting. I worry this is not always faster, though: AND usually has a cost of 1, while TZCNT often has 2: https://www.agner.org/optimize/instruction_tables.pdf

Perhaps check what LLVM does here? They likely reasoned about this thoroughly.

ggreif · 2026-04-05T10:22:47Z

Interesting. I worry this is not always faster, though: AND usually has a cost of 1, while TZCNT often has 2: https://www.agner.org/optimize/instruction_tables.pdf

Perhaps check what LLVM does here? They likely reasoned about this thoroughly.

I have answered a similar question here.

The i32.ctz approach is also semantically cleaner: it captures the "is LSB set?" intent more directly than and 1; eqz.

kripken · 2026-04-06T14:56:02Z

I agree it might be cleaner in a way. I also agree that VMs could alter what they emit, as you wrote in the linked issue. However, if this would regress performance on major VMs right now, we'd want to wait for them to fix that before landing anything.

MaxGraey · 2026-04-07T16:19:02Z

Even if JIT compilers start optimizing similarly to wasmtime, it still won’t solve the performance issue, for example, in runtimes with interpreters (some smart contracts, embedded oriented like wasm3 and etc). If such optimization is to be done at all, in my opinion, it should only be for “optimized for size” (-Os).

ggreif · 2026-04-07T21:37:37Z

Even if JIT compilers start optimizing similarly to wasmtime, it still won’t solve the performance issue, for example, in runtimes with interpreters (some smart contracts, embedded oriented like wasm3 and etc). If such optimization is to be done at all, in my opinion, it should only be for “optimized for size” (-Os).

That went through my thoughts too. I'll submit a revision soon.

…X; if E T` An if-else conditioned on `(i32.and X (i32.const 1))` tests the LSB of X. Since `i32.ctz X == 0` iff the LSB of X is set, we can replace the condition with `i32.ctz X` and swap the branches — saving one instruction. Handles the constant on either side (left or right of `and`). Relates to: WebAssembly#5752 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…an context In boolean contexts (if, br_if, select), `eqz(and X 1)` and `ctz X` have the same truthiness: both are truthy iff LSB(X) == 0. Replacing eqz+and with ctz saves one instruction and covers the primary pattern from WebAssembly#5752: i32.const 1; i32.and; i32.eqz; br_if N ==> i32.ctz; br_if N This fires via `optimizeBoolean`, so it covers `if`, `br_if`, and `select` conditions in one place. Observed ~26–105 hits across Motoko RTS variants. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Per WebAssembly#8562 review (kripken, MaxGraey): the `(if (i32.and X 1) ...)` and `eqz(and X 1)` → `i32.ctz X` rewrites save one instruction (a byte) but TZCNT can cost 1-2 cycles more than AND on common JIT VMs (Agner Fog tables), and JIT-less interpreters (wasm3, smart-contract runtimes) lack a fast path for ctz at all. The byte-saving is unambiguously the win we want under shrink modes; under speed modes the AND form stays. Restrict both folds to `getPassOptions().shrinkLevel >= 1` — fires under -Os and -Oz, no-ops everywhere else. Test rewritten with two RUN lines (DEFAULT + SHRINK prefixes) so both directions are asserted: the fold suppresses cleanly under the default --optimize-instructions invocation, and fires as before when --shrink-level=1 is added.

ggreif · 2026-05-08T21:58:34Z

Pushed a revision: the LSB→ctz fold is now gated on getPassOptions().shrinkLevel >= 1, so it fires under -Os / -Oz only and stays out of the default and speed-optimised pipelines. The lit test was split into DEFAULT (fold suppressed) and SHRINK (fold fires) prefixes.

tlively

It seems we could do something similar when there is no i32.eqz in the input as well:

(i32.and X (i32.const 1))
=>
(i32.eqz (i32.ctz X))

This would save one byte and would often be further optimized by removing the i32.eqz and flipping the branches in the next iteration.

In fact, I think we could optimize only the pattern without i32.eqz in the input and depend on the existing optimizations to remove the outer i32.eqz to cover all the cases this PR already covers.

tlively · 2026-05-08T22:54:52Z

Let's add tests for select as well.

kripken · 2026-05-08T23:16:00Z

+      // win we want under shrink modes; under speed modes the AND form
+      // stays. See WebAssembly/binaryen#8562.
+      if (auto* binary = curr->condition->dynCast<Binary>()) {
+        if (binary->op == AndInt32 && getPassOptions().shrinkLevel >= 1) {


How about making this >= 2, i.e., only in -Oz? -Os is meant to be a good balance between size and speed, and without more data I'm not sure how balanced this is. -Oz is "size at all costs".

I don't think so, as (i32.and X (i32.const 1)) would often feed conditionals and the proposed transform would unlock ripple effects. This is not only space but also time saving.

Do you mean that we would expect this transformation to be good for performance because of follow-on optimizations it enables, even though it appears to be bad for performance locally when taking into account the cycles the relevant native instructions take?

Do you have any data showing that this plays out in practice? If not, I agree that gating this behind -Oz (at least for now) makes sense.

We should definitely measure the ripple effects here before deciding what to do. Even in -Oz it might end up increasing size on average in general (for reasons like @MaxGraey mentioned).

Concretely, measuring this on files from various compilers would be useful, to get a broad range. I really don't have a good intuition here so such data seems necessary.

Per WebAssembly#8562 review (kripken, MaxGraey): the `(if (i32.and X 1) ...)` and `eqz(and X 1)` → `i32.ctz X` rewrites save one instruction (a byte) but TZCNT can cost 1-2 cycles more than AND on common JIT VMs (Agner Fog tables), and JIT-less interpreters (wasm3, smart-contract runtimes) lack a fast path for ctz at all. The byte-saving is unambiguously the win we want under shrink modes; under speed modes the AND form stays. Restrict both folds to `getPassOptions().shrinkLevel >= 1` — fires under -Os and -Oz, no-ops everywhere else. Test rewritten with two RUN lines (DEFAULT + SHRINK prefixes) so both directions are asserted: the fold suppresses cleanly under the default --optimize-instructions invocation, and fires as before when --shrink-level=1 is added.

ggreif · 2026-05-08T23:22:31Z

Added select coverage in the same lit test — two cases (constant on the right, constant on the left of the AND), non-constant arms so an unrelated select c1 c0 P → P simplification doesn't eat the select before the boolean fold runs. DEFAULT keeps the AND form (with the eqz-arm-swap that was already there); SHRINK collapses to ctz.

MaxGraey · 2026-05-09T04:29:09Z

(i32.and X (i32.const 1))

Even if such an expression is one byte larger compare to (i32.eqz (i32.ctz X)), it's fairly stable and repetitive since it's a predicate. That means compression will handle it very well due to sufficient dictionary-friendly entropy. Is it even worth it, considering that for interpreters and 1-tier compilers we consistently end up with a heavier instruction?

ggreif · 2026-05-09T11:28:25Z

compression will handle it very well

Obviously you (and me too) can come up with as many new metrics as you want. Every use case has its own optimum. I brought hard numbers for a real use case. On the Internet Computer a canister gets billed for the (weighted) instructions that ran in a basic block. We can optionally filter the binaries that rustc produces through binaryen and it gives good results. The results would be even better with this PR applied. That's all. Our Motoko codegen does peephole elimination for this already. But since the GCs are in the Rust part, we have to live with the LLVM codegen. A critical step in the GC is to differentiate between heap objects and scalar values. The and 1 p test is it. Runs on basically every reachable object.

MaxGraey · 2026-05-09T13:49:47Z

The fact that this looks slightly better in your particular compiler/runtime configuration (given the current wasmtime version and the exact shape of your heap tag check) does not make this pattern generally profitable for any Binaryen consumer. Moreover ctz is much less common in ordinary code, and unlike x & 1 (parity checks/LSB test) it is far more likely to be CSE'd away in this specific lowering pattern. I also updated wasmtime in Compiler Explorer (godbolt) and did not observe any improvement there for both approaches. So this may well be a peculiarity of the tier-1 / singlepass compiler rather than optimized cranelift compiler itself. Have you asked the wasmtime devs what could be causing this? Because then you would have solved perf problem at its root, including for other potential suboptimal code generation issues related to bitwise predicates in wasmtime.

…gn-bit tests When the result of a count-trailing/leading-zeros instruction is fed into a comparison against zero (the only thing the consumer cares about is whether the count is zero, not its numeric value), rewrite to test the corresponding bit of X directly: ctz(X) == 0 iff LSB of X is set iff (X & 1) != 0 clz(X) == 0 iff MSB of X is set iff X is signed-negative The bit-counting instruction can then be DCE'd. Backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC — saves ~3 cycles of latency on Intel x86_64 per occurrence and removes the false GPR dependency. JIT-less backends benefit even more: their bit-counting paths are typically loops. Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these mid-end rules in place, that fold is cycle-neutral on cranelift JITs even when fed unconditionally. Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (ctz(X) == 4 must NOT trigger — that's a numeric-value test on the count, a different rewrite family).

…/ sign-bit tests (#13332) * cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests When the result of a count-trailing/leading-zeros instruction is fed into a comparison against zero (the only thing the consumer cares about is whether the count is zero, not its numeric value), rewrite to test the corresponding bit of X directly: ctz(X) == 0 iff LSB of X is set iff (X & 1) != 0 clz(X) == 0 iff MSB of X is set iff X is signed-negative The bit-counting instruction can then be DCE'd. Backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC — saves ~3 cycles of latency on Intel x86_64 per occurrence and removes the false GPR dependency. JIT-less backends benefit even more: their bit-counting paths are typically loops. Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these mid-end rules in place, that fold is cycle-neutral on cranelift JITs even when fed unconditionally. Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (ctz(X) == 4 must NOT trigger — that's a numeric-value test on the count, a different rewrite family). * Add disas test for ctz/clz boolean-context fold Exercises three consumers (if, select, eqz) over the icmp-mediated shapes the egraph rewrites in `cranelift/codegen/src/opts/icmp.isle` target: `(ctz X) == 0`, `(ctz X) != 0`, and the analogous clz forms, across i32/i64. The blessed disassembly shows: - icmp-mediated cases collapse to a single bit test (`testl $1, %edx; jne` for ctz, `testl %edx, %edx; jl` for clz). - a bare `if (ctz X)` / `if (clz X)` form (no icmp interposed, i.e. the wasm-natural shape produced by frontends like Motoko's `moc`) compiles to full bsf+cmov+test or bsr+cmov+sub+test, since the brif's implicit zero-test is not visible to the value-level egraph rules. - `(ctz X) == 4` (numeric, not boolean) correctly stays as bsf+cmp+je — the rules don't over-fire. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tlively · 2026-05-11T15:50:34Z

@MaxGraey, I think the performance concerns are resolved by gating this behind -Os (or even better, -Oz, as suggested above). And when optimizing for size we mostly care about the uncompressed size (for simplicity of reasoning, if nothing else), so this transformation looks fine.

ggreif · 2026-05-11T17:24:21Z

performance concerns are resolved by gating this behind -Os

Also by recent wasmtime backend tweaks: bytecodealliance/wasmtime#13332, bytecodealliance/wasmtime#13334 and friends (to come).

ggreif requested a review from a team as a code owner April 1, 2026 09:39

ggreif requested review from tlively and removed request for a team April 1, 2026 09:39

ggreif changed the title ~~perf(OptimizeInstructions): fold i32.and X 1; if T E into i32.ctz X; if E T~~ perf: fold LSB-test i32.and X 1 into i32.ctz in boolean contexts Apr 1, 2026

This was referenced Apr 1, 2026

nix: apply patched wasm-opt (LSB-test → ctz) to RTS wasm files caffeinelabs/motoko#5967

Draft

optimise LSBit mask (followed by branch) to use ctz #5752

Open

ggreif and others added 2 commits May 8, 2026 23:55

ggreif force-pushed the gabor/lsb-if-ctz branch from 575cd27 to 5de41e5 Compare May 8, 2026 21:57

tlively reviewed May 8, 2026

View reviewed changes

Comment thread test/lit/passes/optimize-instructions-lsb-if.wast

Copy link
Copy Markdown

Member

tlively May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add tests for select as well.

kripken reviewed May 8, 2026

View reviewed changes

ggreif added 2 commits May 9, 2026 01:22

chore: remove nix files (not for upstream)

784ed83

ggreif force-pushed the gabor/lsb-if-ctz branch from 5de41e5 to 784ed83 Compare May 8, 2026 23:22

ggreif mentioned this pull request May 9, 2026

cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests bytecodealliance/wasmtime#13332

Merged

ggreif mentioned this pull request May 13, 2026

[REQUEST]: bump wasmtime to v44.0.1 (and v45.0.0 when it ships) in wasm.yaml compiler-explorer/compiler-explorer#8706

Closed

Conversation

ggreif commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation

Test plan

Uh oh!

kripken commented Apr 1, 2026

Uh oh!

ggreif commented Apr 5, 2026

Uh oh!

kripken commented Apr 6, 2026

Uh oh!

MaxGraey commented Apr 7, 2026

Uh oh!

ggreif commented Apr 7, 2026

Uh oh!

ggreif commented May 8, 2026

Uh oh!

tlively left a comment

Choose a reason for hiding this comment

Uh oh!

tlively May 8, 2026

Choose a reason for hiding this comment

Uh oh!

kripken May 8, 2026

Choose a reason for hiding this comment

Uh oh!

ggreif May 8, 2026

Choose a reason for hiding this comment

Uh oh!

tlively May 11, 2026

Choose a reason for hiding this comment

Uh oh!

kripken May 11, 2026

Choose a reason for hiding this comment

Uh oh!

kripken May 11, 2026

Choose a reason for hiding this comment

Uh oh!

ggreif commented May 8, 2026

Uh oh!

MaxGraey commented May 9, 2026

Uh oh!

ggreif commented May 9, 2026

Uh oh!

MaxGraey commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tlively commented May 11, 2026

Uh oh!

ggreif commented May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ggreif commented Apr 1, 2026 •

edited

Loading

MaxGraey commented May 9, 2026 •

edited

Loading

ggreif commented May 11, 2026 •

edited

Loading