docs: Normalizing Flow (RealNVP) example #1215

Merged
merged 4 commits into from
Jan 20, 2025

avik-pal
Member

No description provided.

examples/RealNVP/main.jl (review thread: outdated, resolved)
Contributor

github-actions bot commented Jan 20, 2025

Benchmark Results (ASV)

| | main | a5689ef... | main/a5689ef8f0b227... |
|---|---|---|---|
| basics/overhead | 0.155 ± 0.0018 μs | 0.124 ± 0.0015 μs | 1.25 |
| time_to_load | 0.906 ± 0.0027 s | 0.904 ± 0.0077 s | 1 |

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

Contributor

@github-actions github-actions bot left a comment

Lux Benchmarks

Benchmark suite Current: 8d8661b Previous: 46a012d Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3667 ns 3791 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4125 ns 4500 ns 0.92
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4645.5 ns 4875 ns 0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3875 ns 3666 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10750 ns 10167 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11020.5 ns 10458 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10750 ns 10750 ns 1
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10458 ns 10625 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1208 ns 1062.5 ns 1.14
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1167 ns 1167 ns 1
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1416 ns 1500 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1000 ns 1125 ns 0.89
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4000 ns 4083 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4083 ns 4042 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4312.5 ns 4208 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4125 ns 3958 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57292 ns 57542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46208.5 ns 46416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 37792 ns 47125 ns 0.80
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82834 ns 80875 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2044750 ns 2035395.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2084541.5 ns 2078396 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2093812.5 ns 2078708 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2012166 ns 1998584 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144959 ns 144250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 143958 ns 144166.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 147312.5 ns 145125 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143459 ns 153104.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1132375 ns 1120291.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1121187.5 ns 1113167 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1143521 ns 832708.5 ns 1.37
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1130667 ns 1117084 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3208 ns 3375 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3833 ns 3542 ns 1.08
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4375 ns 4166 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4041 ns 3125 ns 1.29
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8750 ns 9042 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9292 ns 8750 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10042 ns 10208 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9583 ns 8833 ns 1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17375 ns 17041 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15916.5 ns 15834 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17000 ns 16604.5 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14625 ns 16791 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 211687.5 ns 213750 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214667 ns 214875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 215645.5 ns 215667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216042 ns 226125 ns 0.96
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 583 ns 708 ns 0.82
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 709 ns 709 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 541 ns 1.08
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1458 ns 1375 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1500 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1458 ns 1458 ns 1
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7166 ns 7000 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5916 ns 5750 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5250 ns 6042 ns 0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10209 ns 9750 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220229.5 ns 222021 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 280542 ns 228542 ns 1.23
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229771 ns 229292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 226625 ns 213937.5 ns 1.06
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3875 ns 3875 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3875 ns 3959 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 3917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16667 ns 16917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16709 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16833 ns 17250 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17084 ns 16750 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 574667 ns 568792 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 577375 ns 578645.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 579083 ns 578083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 571125 ns 575625 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1424750 ns 1422625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1429834 ns 1420000 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1422000 ns 1422375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1424583.5 ns 1426708 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1069478.5 ns 1077687.5 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 957333 ns 960917 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1308750 ns 1353229.5 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1299833 ns 1315312 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5963396 ns 5961958 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4588687.5 ns 4633250 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4776312 ns 4975188 ns 0.96
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5565291.5 ns 5557125 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 583 ns 542 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2208 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2250 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2167 ns 2125 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3625 ns 4125 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4417 ns 4375 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5083 ns 5167 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4292 ns 4250 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10917 ns 11875 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11125 ns 11000 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12208 ns 11917 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11562.5 ns 11500 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6292 ns 7000 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6250 ns 6958 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7458 ns 8250 ns 0.90
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 6125 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16708 ns 18708.5 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17459 ns 18625 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18250 ns 18375 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18084 ns 16708 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 500 ns 625 ns 0.80
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 792 ns 708 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 708 ns 667 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8333 ns 8834 ns 0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9042 ns 8875 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9084 ns 9334 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9292 ns 8354.5 ns 1.11
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64583 ns 64459 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64583 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64500 ns 64916 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64750 ns 64625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 286250 ns 279250 ns 1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 289000 ns 282167 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 288958 ns 284125 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 286208 ns 278708 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3386208 ns 3278417 ns 1.03
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3103375.5 ns 3081000 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 2775562 ns 3021792 ns 0.92
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 3943042 ns 4040979.5 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7565083 ns 7620208 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7465000 ns 7449187.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7247770.5 ns 7493708.5 ns 0.97
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8240833.5 ns 8208791 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17542770.5 ns 18366417 ns 0.96
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17514333 ns 17522312.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17599312.5 ns 17580834 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14110520.5 ns 14093354.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23519041 ns 23631333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 34016771 ns 33504604 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41633812.5 ns 37034667 ns 1.12
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34946229 ns 34967583.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 188178833 ns 189693000 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 163938229 ns 165014875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 157572709 ns 152416688 ns 1.03
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 436134604 ns 434850958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 289055334 ns 289105312.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 263203333 ns 250867083 ns 1.05
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 304135917 ns 296775875 ns 1.02
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 475768979.5 ns 473537562.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22459 ns 22083 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24417 ns 22459 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24042 ns 25375 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24708.5 ns 24083 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104208 ns 103083 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 103292 ns 103250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 114292 ns 104542 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104229 ns 103041 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5917 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5958 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7042 ns 6708 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6458 ns 5791.5 ns 1.12
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15000 ns 14792 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15041 ns 15000 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16229.5 ns 16542 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15083 ns 14875 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3001375 ns 3002625 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2059667 ns 2079375 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2288292 ns 2272333 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4891812 ns 4882708 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23584417 ns 23536000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17994125 ns 18038562.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17280437.5 ns 16972167 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35020417 ns 34545146 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33372916.5 ns 33221458 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27623146 ns 27561792 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27689042 ns 27327000 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41048958.5 ns 42034750 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72000 ns 71417 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 73208 ns 71854.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75333 ns 75708 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 72041 ns 74708 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 318542 ns 205250.5 ns 1.55
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 205709 ns 206750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 271583.5 ns 208958 ns 1.30
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217521 ns 217416 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11625 ns 11875 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 12208 ns 11416 ns 1.07
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12750 ns 12958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11708.5 ns 11708 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 28292 ns 25667 ns 1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27709 ns 26541.5 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 30041 ns 27729.5 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 28667 ns 26667 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12333 ns 12812.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 14458 ns 12209 ns 1.18
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 14146 ns 14208 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11750 ns 12291.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25375 ns 25625 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25917 ns 25916.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26125 ns 26250 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26500 ns 26604 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 180458.5 ns 178792 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180125 ns 180750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 181229.5 ns 181917 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181709 ns 179166 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 583708.5 ns 593333 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 584584 ns 582708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 595646 ns 583667 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 583417 ns 584542 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5792 ns 6167 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6166 ns 5875 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7333 ns 6875 ns 1.07
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6250 ns 5708.5 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13000 ns 13791 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14000 ns 13917 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15062.5 ns 15667 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14500 ns 14458 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1192542 ns 1225312.5 ns 0.97
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1241937.5 ns 1241959 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1243354 ns 1289958.5 ns 0.96
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 1029250 ns 1011625 ns 1.02
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4104583 ns 4103042 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4448375 ns 4403333 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4799354.5 ns 4523854.5 ns 1.06
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3705479 ns 3709771 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1792 ns 1875 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1834 ns 1916 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4833 ns 4958 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4875 ns 5000 ns 0.97
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4875 ns 4958 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4875 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5583 ns 5833 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5333 ns 5917 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6895.5 ns 6667 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 5209 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10792 ns 11125 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10500 ns 11500 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11459 ns 11458 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12500 ns 10500 ns 1.19
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 291 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2792 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2708 ns 2833 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 3083 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3125 ns 2750 ns 1.14
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11209 ns 11459 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10917 ns 11625 ns 0.94
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12833 ns 12875 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10708.5 ns 10958 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24333 ns 25020.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24459 ns 25292 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24917 ns 25125 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25125 ns 24875 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4208 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16333 ns 16333 ns 1
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16292 ns 16375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16166 ns 16520.5 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16500 ns 16208 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5709 ns 5833 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 5833 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5792 ns 6042 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5833 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20708 ns 21000 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20709 ns 21000 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21041 ns 21417 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21000 ns 20709 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 417666 ns 422124.5 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 382708 ns 387791 ns 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 478542 ns 477333 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 103666 ns 103125 ns 1.01
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 917687 ns 921333 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 971125 ns 974250 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1191021 ns 1186458 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 372438 ns 457479.5 ns 0.81
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 79875 ns 80542 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80708 ns 80709 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 81250 ns 84896 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80042 ns 79833 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1909500 ns 1919250 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920042 ns 1876583 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1930937.5 ns 1946041 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1927500 ns 1921396 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1917 ns 0.93
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1792 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6062.5 ns 6417 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6000 ns 6666 ns 0.90
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7458 ns 7771 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5833.5 ns 6145.5 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8708 ns 9604.5 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9250 ns 9459 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 9500 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9334 ns 9041 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120020917 ns 120459792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174548125.5 ns 173682208 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 154992375 ns 147804000 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105137604 ns 105720875 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 614828500 ns 610206729.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 554699958.5 ns 555562500 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 464117084 ns 452099291.5 ns 1.03
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 626050875 ns 626409896 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 718752958.5 ns 657253583 ns 1.09
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 663526292 ns 665008062.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 608055083 ns 581676208.5 ns 1.05
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 862671000.5 ns 857648458 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58583 ns 57875 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47625 ns 47791 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38792 ns 47500 ns 0.82
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 86250 ns 83395.5 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1922417 ns 1915500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1973500 ns 1932792 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1969021 ns 1995084 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1914271 ns 1890500 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 264917 ns 267854.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 271042 ns 267708 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 268959 ns 269750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267708 ns 268166 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 694937.5 ns 594417 ns 1.17
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 696000 ns 681291 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 597833 ns 604895.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 590459 ns 689917 ns 0.86
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2216812.5 ns 2176375 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2213583 ns 2222812.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2163271 ns 2205042 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2106687.5 ns 2093562.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5496834 ns 5514416 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5520271 ns 5508500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5519416.5 ns 5535958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5545334 ns 5491750 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 644458 ns 638167 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 645875 ns 647708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 648958 ns 659416 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 638458 ns 643750 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1827417 ns 1822167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1726250 ns 1723042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1661146 ns 1727833 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2106729.5 ns 2106333 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58000 ns 58458 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46333 ns 46917 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 37708 ns 47292 ns 0.80
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85334 ns 84125 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2019542 ns 2030041 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2073521 ns 2004250 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2096979 ns 2122125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024062.5 ns 1985979.5 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13171958 ns 13357770.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12353000 ns 12440000 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12535625 ns 12492250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 14887895.5 ns 15108458 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 47042084 ns 47178791.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41832000 ns 41760334 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 41213000 ns 40950875 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58337834 ns 58205437.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 73900959 ns 97014458.5 ns 0.76
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 91258583 ns 91152834 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90284771 ns 90701604.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98634667 ns 98541521.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59000 ns 58959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47417 ns 47375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 38542 ns 47750 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 85375 ns 79958 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917458 ns 1918645.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1962334 ns 1971000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1979125 ns 1997667 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881458.5 ns 1889750 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 416 ns 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6167 ns 6292 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6583 ns 6542 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6541 ns 6834 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6667 ns 6125 ns 1.09
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2625 ns 2833 ns 0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2833 ns 2917 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2917 ns 2708 ns 1.08
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 285347854 ns 289426812.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339972375 ns 339624334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 320702208 ns 315284104.5 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271968041.5 ns 274668667 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1012766312.5 ns 1014634416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 957997541.5 ns 953687125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 865398041 ns 857733312.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1220255354 ns 1265357333 ns 0.96
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1439579208 ns 1675373667 ns 0.86
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1701281188 ns 1668941291 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1597552187.5 ns 1606744000 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1793880625 ns 1787636084 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1409958 ns 1409499.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1415208 ns 1413833 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1423333 ns 1419895.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1407208.5 ns 1458541.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4721229 ns 5016749.5 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5030333 ns 4651917 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5054229 ns 5058791 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4996896 ns 5012792 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 178769041 ns 171852250 ns 1.04
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 136771459 ns 129831062.5 ns 1.05
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 133977125 ns 115995771 ns 1.16
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 168161084 ns 168839667 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 820292083 ns 629070333 ns 1.30
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 490913021 ns 493488792 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 568885979.5 ns 456364583 ns 1.25
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 651602104.5 ns 675660292 ns 0.96
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8951709 ns 8950646 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8931709 ns 8924625 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7976063 ns 7865125 ns 1.01
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9843167 ns 9701750 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 36300187.5 ns 36024125 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 36985875 ns 37000208.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 34602250 ns 33425875 ns 1.04
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 38901708 ns 37661542 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47750 ns 47562.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47500 ns 47416 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47583 ns 47666 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47292 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50334 ns 50375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50375 ns 50584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50500 ns 50583 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6833 ns 6958.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6458 ns 6500 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7854 ns 8042 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6250 ns 6542 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 10042 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10062.5 ns 10437.5 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10208 ns 10500 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10209 ns 10375 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5666 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5750 ns 5958 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8021 ns 7417 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5542 ns 5458 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 15291 ns 13125 ns 1.17
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15625 ns 13250 ns 1.18
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16458 ns 13375 ns 1.23
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16416 ns 13208 ns 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 958 ns 1083 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1083 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1084 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1042 ns 1084 ns 0.96
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7917 ns 8000 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8167 ns 8292 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8000 ns 8500 ns 0.94
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8417 ns 8125 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23167 ns 23354.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23291 ns 23250 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23458 ns 23542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23250 ns 23125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52354.5 ns 52667 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52583 ns 52584 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53000 ns 52750 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 53791 ns 52417 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1451396 ns 1398084 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1396875 ns 1402791 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1404042 ns 1401792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1458812.5 ns 1402875 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5016750 ns 5010813 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5003833 ns 5016584 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5035166.5 ns 5062708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5016021 ns 5013500 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3055625 ns 3040417 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2092000 ns 2105083 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2274500 ns 2280208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4823354.5 ns 4865521 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24320916 ns 24414604.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18840458 ns 18876208.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 18000166 ns 17652979 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 36072458.5 ns 35825688 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33977125 ns 34006188 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28191875 ns 28283750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28516021 ns 27926083.5 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 42023416.5 ns 41742416.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144202645.5 ns 144750166 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147176208 ns 146949375 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126618792 ns 126208208.5 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 175416083 ns 173205292 ns 1.01
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1010254500 ns 1847080125 ns 0.55
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 996927062.5 ns 809911709 ns 1.23
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 703067167 ns 755677291 ns 0.93
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 678176000 ns 667449084 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 72750.5 ns 76791 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75000 ns 76042 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 76500 ns 76417 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73000 ns 72541 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 291041.5 ns 277229 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 292875 ns 193583 ns 1.51
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 196625 ns 205417 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 294437.5 ns 303083.5 ns 0.97
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35434270.5 ns 35472875 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36405125 ns 36379896 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32827229.5 ns 32315333.5 ns 1.02
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40770375 ns 40618416.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 148344125 ns 146765250 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 153993750 ns 153200125 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 142353042 ns 137307792 ns 1.04
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 287523521 ns 285301125 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 121428292 ns 120518062.5 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173818000.5 ns 174031666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 155253958 ns 148283312.5 ns 1.05
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106350875 ns 106552271 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 469442313 ns 469918416 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466327604 ns 466837917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 454651209 ns 437920916.5 ns 1.04
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 745969937.5 ns 739774042 ns 1.01
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 776002375 ns 711087896 ns 1.09
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 638998125 ns 640897313 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 640762875 ns 630411896 ns 1.02
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 850922103.5 ns 849787625 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1342417 ns 1302125 ns 1.03
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 993250 ns 905958 ns 1.10
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 682813 ns 938334 ns 0.73
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2105083 ns 1987437 ns 1.06
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2957084 ns 2951687.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2595625 ns 2611020.5 ns 0.99
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2486583 ns 2639896 ns 0.94
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3752166 ns 3702396 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5802750 ns 5801417 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5380979 ns 5727666.5 ns 0.94
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5878375 ns 5818916 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2919708.5 ns 2913834 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6000 ns 6166 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5125 ns 6209 ns 0.83
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10083 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225583.5 ns 212792 ns 1.06
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230667 ns 220834 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222041.5 ns 221166 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218333 ns 215459 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 303779979 ns 300445333 ns 1.01
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 224368541.5 ns 214002042 ns 1.05
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 220896979.5 ns 196386541 ns 1.12
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 307934542 ns 307720792 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1230688666 ns 1232629833 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 896640584 ns 899311645.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 845686687.5 ns 825300584 ns 1.02
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1152841667 ns 1150330250 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5041 ns 5458 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5125 ns 5416 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6833 ns 6750.5 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5041 ns 5084 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9542 ns 7667 ns 1.24
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9916 ns 7333 ns 1.35
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10166 ns 7500 ns 1.36
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10417 ns 7250 ns 1.44
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 625 ns 625 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8833 ns 9542 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9042 ns 9833 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9375 ns 9667 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9083 ns 9041 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352083 ns 352562.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351167 ns 351833 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 354625 ns 353416.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 354541 ns 366166 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 822084 ns 826208 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 807229 ns 775333.5 ns 1.04
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 825833 ns 808520.5 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 824771 ns 828833 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 334625 ns 340917 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 340208 ns 342729.5 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 444167 ns 453708 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 11187.5 ns 10687.5 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 709208.5 ns 709875 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 727584 ns 728042 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 1020166 ns 1005792 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 27375 ns 26667 ns 1.03
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 374125 ns 380187.5 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 348541 ns 355542 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 444667 ns 442146 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 30833.5 ns 30959 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 723500 ns 726667 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 780145.5 ns 778791.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1057729 ns 1034042 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 101834 ns 105042 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3459 ns 3583 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3542 ns 3542 ns 1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3687.5 ns 3708 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3333.5 ns 3542 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4458 ns 4583 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4417 ns 4333 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4292 ns 4375 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4166 ns 4167 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3167 ns 3833 ns 0.83
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3834 ns 3542 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4583 ns 4292 ns 1.07
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 3500 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8334 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8334 ns 8334 ns 1
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8583 ns 8708 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8541 ns 8625 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204500 ns 203709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211000 ns 209833 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209042 ns 213750 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200375 ns 200750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 647459 ns 611979.5 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 645416 ns 623084 ns 1.04
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 665708 ns 633542 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 594166.5 ns 630833 ns 0.94
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 990583.5 ns 991250 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1015416.5 ns 1017458.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 971208 ns 954833 ns 1.02
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 901437.5 ns 864916.5 ns 1.04
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4507562.5 ns 4517208 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4621500 ns 4768041 ns 0.97
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4636459 ns 4459667 ns 1.04
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4301375.5 ns 4281312 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 2917 ns 3625 ns 0.80
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3542 ns 3291 ns 1.08
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4041.5 ns 4250 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3917 ns 3166 ns 1.24
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7167 ns 7500 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 7458 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7458 ns 7687.5 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7459 ns 7084 ns 1.05
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1650812.5 ns 1644333 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1186187.5 ns 1183209 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1369271 ns 1370292 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2443833.5 ns 2475167 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12329542 ns 12346958.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9527625 ns 9593646 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9360666 ns 9292209 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18089562.5 ns 17963583.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17369792 ns 17361375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14284958 ns 14393542 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14468958.5 ns 14339750 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21214791.5 ns 21095083 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 133750 ns 88167 ns 1.52
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 89833 ns 88875 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 91000 ns 91875 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 91270.5 ns 134020.5 ns 0.68
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027625 ns 2027813 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2016354 ns 2027000.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2038374.5 ns 2054000 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2037125 ns 2028125 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 3500 ns 2792 ns 1.25
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2834 ns 2583 ns 1.10
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 1917 ns 3458 ns 0.55
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 1584 ns 1917 ns 0.83
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2792 ns 2709 ns 1.03
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3125 ns 2792 ns 1.12
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3167 ns 2792 ns 1.13
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 3125 ns 2833.5 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7375 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 3792 ns 6041 ns 0.63
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5333 ns 6167 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10167 ns 10125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225875 ns 242958 ns 0.93
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224083.5 ns 220917 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228291.5 ns 220417 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207313 ns 240375 ns 0.86
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3709 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3791 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14375 ns 14584 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14459 ns 14542 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14250 ns 14584 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14666 ns 14417 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 138875 ns 92125 ns 1.51
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93437 ns 92458 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 95042 ns 98562.5 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 117250 ns 118229 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930208 ns 1913333 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1920125 ns 1909771 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1935708.5 ns 1956333 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922896 ns 1924333 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 864187.5 ns 879000 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 821125.5 ns 818395.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1190500 ns 1219520.5 ns 0.98
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 964729 ns 966459 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2817104.5 ns 2822917 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2497500 ns 2496917 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3366021 ns 3359000 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3305875 ns 3411333 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17812.5 ns 17000 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15667 ns 15458.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17625 ns 19041 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18375 ns 16875 ns 1.09
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228270.5 ns 258834 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223604.5 ns 215125 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223667 ns 215792 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 230500 ns 227875 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 224458 ns 219062.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222041 ns 221375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222062.5 ns 222875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 222292 ns 220791 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 517875 ns 497625 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 499292 ns 535916 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 504375 ns 499208 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 499687.5 ns 511125 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 3542 ns 3833.5 ns 0.92
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4458 ns 4250 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 4833 ns 5166.5 ns 0.94
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 4042 ns 3792 ns 1.07
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7291 ns 7542 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7625 ns 7167 ns 1.06
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7583.5 ns 7542 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7750 ns 7667 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16770.5 ns 18667 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18458.5 ns 16708 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19167 ns 20584 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18541 ns 18084 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228062 ns 224209 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228312.5 ns 212687 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224770.5 ns 213167 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216229 ns 222979.5 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 3875 ns 4250 ns 0.91
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4250 ns 4333.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 4791.5 ns 5125 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4625 ns 3875 ns 1.19
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10458 ns 10542 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10000 ns 10791 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10583 ns 10959 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10584 ns 10333 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 2750 ns 3375 ns 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3459 ns 3333 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 3958.5 ns 4042 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3125 ns 2958 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7500 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7208 ns 7750 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 7625 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7583 ns 7208 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23579146 ns 23498333.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35112833.5 ns 34789375 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 41153812.5 ns 37689958 ns 1.09
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34850459 ns 34909542 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 184580562.5 ns 184647292 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 170108312 ns 163834583 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 151018749.5 ns 146363541.5 ns 1.03
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 274274854.5 ns 274565083 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 279619083 ns 278243563 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 258793834 ns 245760791.5 ns 1.05
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 233192250 ns 231789354 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 323747917 ns 324000854.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183041.5 ns 182625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 184875 ns 184458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 185229.5 ns 186250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181542 ns 181875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 598583 ns 628291.5 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 636812.5 ns 608229.5 ns 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 594187.5 ns 598250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 596666.5 ns 637791 ns 0.94
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3855416.5 ns 3874375 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3913188 ns 3917042 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3533500 ns 3534687.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4656020.5 ns 4554291 ns 1.02
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17344625 ns 17461354.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17841875 ns 17833459 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16885979 ns 16559937.5 ns 1.02
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 20150208 ns 19938750 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 750 ns 500 ns 1.50
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 666 ns 0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9187.5 ns 9292 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9125 ns 9458 ns 0.96
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9917 ns 9375 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9542 ns 9187.5 ns 1.04
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 654018000 ns 651812167 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 390984688 ns 390086667 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 393902709 ns 327502625 ns 1.20
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 684960271.5 ns 747314333 ns 0.92
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1893131833 ns 1879705041.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1645081395.5 ns 1650371917 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1542986334 ns 1514378771 ns 1.02
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2281810500 ns 2204966313 ns 1.03
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1639458 ns 1651458 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1188500 ns 1196083 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1383729.5 ns 1387103.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2316708 ns 2353958 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12708624.5 ns 12704667 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9913437 ns 9935187.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9739479 ns 9671333.5 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18414583 ns 18432334 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17682208 ns 17670625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14673584 ns 14743791.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14772312.5 ns 14593292 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21423125 ns 21437146 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26208 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26250 ns 26333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66916 ns 67166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67500 ns 67208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67209 ns 67917 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67875 ns 66958 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204167 ns 202875 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 209667 ns 210375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 208875 ns 209916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199417 ns 198750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 648625 ns 645354 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 673417 ns 637500.5 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 632542 ns 634542 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 628250 ns 634250 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 651417 ns 672209 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 567770.5 ns 637917 ns 0.89
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 666812 ns 665042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 654375 ns 664917 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2232125.5 ns 2224563 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1551083 ns 2248771 ns 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2304833 ns 2241125 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2286271 ns 2237000 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19333.5 ns 17417 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18000 ns 17333 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19333 ns 19500 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18167 ns 16875 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 260937.5 ns 260770.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219500 ns 219458.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229187.5 ns 229000 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 261521 ns 263334 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 750 ns 666 ns 1.13
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 667 ns 584 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9625 ns 10000 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9792 ns 9750 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10375 ns 10125 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9833 ns 9750 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5042 ns 5375 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5520.5 ns 5625 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6041 ns 6604.5 ns 0.91
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5000 ns 5000 ns 1
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7042 ns 7875 ns 0.89
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7292 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7584 ns 7687.5 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7333 ns 7334 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 2041 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2084 ns 2250 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2417 ns 2458 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2209 ns 2084 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6416 ns 6542 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6584 ns 6458 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6708 ns 6708 ns 1
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6584 ns 6541 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 752084 ns 747125 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 748833 ns 749958.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 754625 ns 747167 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748416.5 ns 771333.5 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 813792 ns 791000 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 792416.5 ns 780041.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 792083 ns 775416 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 813416 ns 794812.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 6959 ns 1.06
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5834 ns 6000 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5167 ns 6125 ns 0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10167 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 231771 ns 259750 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 250938 ns 238854 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 236833 ns 231104 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 253479.5 ns 250208 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9834 ns 10125 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10083 ns 10312.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10625 ns 10875 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10041 ns 10167 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24292 ns 24167 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24333 ns 24583 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25250 ns 25333 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24792 ns 24584 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106495021 ns 106104729.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 117604000 ns 117502187.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 123750792 ns 120758625 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117620437.5 ns 117423500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 394288333 ns 392280708 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 359155771 ns 358697709 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 359596000 ns 357440917 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 619251583 ns 540821208.5 ns 1.15
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 612403375 ns 781416292 ns 0.78
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 766456583.5 ns 760831458 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 748934875 ns 750885583.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 790773583 ns 784554021 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6937.5 ns 7583 ns 0.91
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6833 ns 6875 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8667 ns 8208 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6167 ns 7917 ns 0.78
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13667 ns 14542 ns 0.94
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14416.5 ns 13667 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14292 ns 14125 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14166 ns 14375 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5667 ns 5750 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5958 ns 6125 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7750 ns 7500 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5208 ns 5500 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12416.5 ns 12875 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12666 ns 12417 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13125 ns 12687.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12792 ns 13042 ns 0.98
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 5687.5 ns 5250 ns 1.08
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 6042 ns 5709 ns 1.06
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 5729.5 ns 6542 ns 0.88
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5292 ns 5375 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15542 ns 15750 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15833 ns 15375 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 16375 ns 15584 ns 1.05
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 15875 ns 15916 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 417 ns 0.80
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 417 ns 334 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6000 ns 6583 ns 0.91
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6708 ns 6625 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6667 ns 6625 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6583 ns 6375 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6083 ns 6041 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5875 ns 5959 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6000 ns 5875 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 20917 ns 21520.5 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21541 ns 21209 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21875 ns 21667 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21541 ns 21334 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 186042 ns 144062.5 ns 1.29
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 145584 ns 143042 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148916 ns 146334 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 144666 ns 188146 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1319667 ns 1317583 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1324750 ns 1321709 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1355500 ns 1365791.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1316750 ns 1318666 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25042 ns 24708 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 24437.5 ns 24375 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24895.5 ns 24375 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23333.5 ns 22374.5 ns 1.04
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 125083.5 ns 134750 ns 0.93
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 182083 ns 181250 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 183083 ns 130000 ns 1.41
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 129646 ns 130958 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 334 ns 333 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6354.5 ns 6625 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6500 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6708 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 6792 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 4625 ns 0.86
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4208 ns 4541.5 ns 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5334 ns 5333 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4292 ns 4583 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11458 ns 9875 ns 1.16
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11458 ns 9916.5 ns 1.16
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12542 ns 10417 ns 1.20
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11917 ns 10375 ns 1.15
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1625 ns 1667 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5666 ns 5750 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5666 ns 5750 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5958 ns 6083 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5875 ns 5709 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6809395.5 ns 6814041 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6387250 ns 6367459 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6520541.5 ns 6578812.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7587709 ns 7695958 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24068334 ns 24052709 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21246167 ns 21310875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 21108958 ns 21123834 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29944875 ns 29855166.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 37302104.5 ns 48838979.5 ns 0.76
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45420666.5 ns 45549667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45699104 ns 45706771 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49469250 ns 49408500 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 5875 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5708 ns 5709 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6958 ns 6708 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5333 ns 5541 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8208 ns 8875 ns 0.92
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8875 ns 8167 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8458 ns 8542 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8208 ns 1.02
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1552583.5 ns 1556417 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1272166.5 ns 1270792 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1637208 ns 1624187.5 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2141208 ns 2180520.5 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7887750 ns 7888792 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6579958 ns 6591250 ns 1.00
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7283375 ns 7197854 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10434541 ns 10478229.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 365645.5 ns 366500 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 374292 ns 371020.5 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 453083 ns 457708 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 22042 ns 33208.5 ns 0.66
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 723166 ns 723916.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 803834 ns 801750 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1081146 ns 1064875 ns 1.02
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 93437 ns 115334 ns 0.81
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397667 ns 397291 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287917 ns 287834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 212333 ns 288166 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 752334 ns 750833 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 671458 ns 661875 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 529958 ns 532416 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 472709 ns 535458 ns 0.88
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 975833 ns 973250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 539584 ns 670958 ns 0.80
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 635646 ns 644229 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 657667 ns 680667 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 646812 ns 648125 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2458312.5 ns 2459333 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2443792 ns 2456084 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2448042 ns 2464542 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2512229.5 ns 2456083 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3896 ns 3708 ns 1.05
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 3708 ns 3334 ns 1.11
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 2979.5 ns 4334 ns 0.69
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 2417 ns 2667 ns 0.91
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5667 ns 5500 ns 1.03
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5916 ns 5458 ns 1.08
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5917 ns 5625 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5875 ns 5542 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1464958 ns 1458167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1504208 ns 1500500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1493083 ns 1499333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1439375 ns 1437750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5125084 ns 5130750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5289187.5 ns 5285584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5312354 ns 5315979 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4700041 ns 4998959 ns 0.94
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3709 ns 3708 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3709 ns 3709 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15083 ns 15375 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15417 ns 15417 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15292 ns 15500 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15667 ns 15167 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 71375 ns 70667 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71208 ns 71208 ns 1
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71291 ns 71959 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71333 ns 71333 ns 1
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318833 ns 318500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 317834 ns 318000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 327791 ns 323666 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 322375 ns 317125 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1000 ns 1084 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1042 ns 1084 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1000 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8125 ns 8458 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8166 ns 8334 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8292 ns 8292 ns 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8209 ns 8375 ns 0.98
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 511604.5 ns 506709 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 486208 ns 492375 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 564291 ns 562708 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 214312 ns 222187.5 ns 0.96
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1381666.5 ns 1387250 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1452208 ns 1449208 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1762229 ns 1788375 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 883458 ns 865812.5 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6667 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6250 ns 6458 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6709 ns 6625 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 6458 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1720104.5 ns 1722042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1726417 ns 1723208.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1743083.5 ns 1721083 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1723104 ns 1723750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4360479 ns 4362042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4364124.5 ns 4261187.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4397250 ns 4415583.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4402666 ns 4366958.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6834 ns 6750 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6500 ns 6959 ns 0.93
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 9041.5 ns 6959 ns 1.30
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6792 ns 6708.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 32958.5 ns 51417 ns 0.64
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 32792 ns 32917 ns 1.00
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 52459 ns 33333 ns 1.57
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 48542 ns 51208.5 ns 0.95
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 17687.5 ns 17542 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18000 ns 17875 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18813 ns 18916 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17458 ns 17750 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53500 ns 53458 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53625 ns 53334 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53687.5 ns 53250 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53667 ns 53500 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75459 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75125 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75458 ns 75792 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75167 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 339583 ns 324375 ns 1.05
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 325792 ns 327625 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 335916 ns 329583 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 331104.5 ns 324208 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1489417 ns 1484375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1527750 ns 1527958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1517791 ns 1527583 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463542 ns 1462209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5114875 ns 5124708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4937750 ns 5280333 ns 0.94
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5302896 ns 5332500 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4980291 ns 4985875 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28209 ns 28250 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28208 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28292 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28333 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66459 ns 66459 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66584 ns 66458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66667 ns 66833 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 67125 ns 66416 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1471666.5 ns 1501229 ns 0.98
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1121500 ns 1127563 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 887000 ns 1119291.5 ns 0.79
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2217250 ns 2246375 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3056291.5 ns 3082875 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2124041.5 ns 2738375 ns 0.78
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2627041 ns 2760354 ns 0.95
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3821708 ns 3780667 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7928250 ns 7895333 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7900562.5 ns 7893459 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7965354 ns 7944812.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4845500 ns 4834521 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 80437.5 ns 80959 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81375 ns 80333 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 80708 ns 82166 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80542 ns 134375.5 ns 0.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027958 ns 2014625 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2019458 ns 2006229 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2036125.5 ns 2047021 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2017000 ns 2022958 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal force-pushed the ap/realnvp branch 2 times, most recently from fe87936 to 740af9d on January 20, 2025 18:27
@avik-pal force-pushed the ap/realnvp branch 2 times, most recently from 252bb69 to d79386d on January 20, 2025 21:37
@avik-pal merged commit 521fefd into main on Jan 20, 2025
11 of 14 checks passed
@avik-pal deleted the ap/realnvp branch on January 20, 2025 22:24