
Removing one extra copy #213

Closed
wants to merge 2 commits

Conversation


@parsifal-47 parsifal-47 commented Jan 5, 2025

Explicit allocation is not removed by the bufferization stage, but when no transformation is applied to the data, it is not needed. This change results in a 30% speedup for addition of long vectors; small vectors and softmax are not affected.
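The copy being removed can be pictured with a small NumPy sketch. This is illustrative only: the actual change is in the compiler's bufferization handling, and both function names here are hypothetical.

```python
import numpy as np

def add_with_copy(x, y):
    # Baseline: the result is staged through an explicit allocation,
    # then copied into the output buffer.
    tmp = np.empty_like(x)   # explicit allocation
    np.add(x, y, out=tmp)
    out = np.empty_like(x)
    np.copyto(out, tmp)      # the extra copy this change removes
    return out

def add_direct(x, y):
    # With the change: the result is written to the output directly,
    # since no transformation happens between producing and storing it.
    out = np.empty_like(x)
    np.add(x, y, out=out)
    return out

x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
assert np.array_equal(add_with_copy(x, y), add_direct(x, y))
```

Both versions compute the same result; the second simply avoids one full-size temporary buffer and the memcpy through it, which matters most for long vectors.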

With this change applied:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028035, min=0.025388, std=0.002643, 50pp=0.026588, max=0.033185
CPU: Avg=0.194125, min=0.109586, std=0.040768, 50pp=0.188898, max=0.337164
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.042732, min=0.031362, std=0.029388, 50pp=0.037026, max=0.170410
CPU: Avg=0.048043, min=0.031364, std=0.052494, 50pp=0.037026, max=0.276624
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045633, min=0.044927, std=0.000402, 50pp=0.045694, max=0.046703
CPU: Avg=0.255655, min=0.154600, std=0.025863, 50pp=0.260998, max=0.279482
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.079617, min=0.072278, std=0.001898, 50pp=0.080133, max=0.080801
CPU: Avg=0.084269, min=0.076415, std=0.018682, 50pp=0.080128, max=0.165613
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.105064, min=0.104668, std=0.000338, 50pp=0.105036, max=0.106324
CPU: Avg=0.420951, min=0.398625, std=0.005847, 50pp=0.421809, max=0.429125
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.160930, min=0.160456, std=0.000349, 50pp=0.160841, max=0.161741
CPU: Avg=0.162062, min=0.160459, std=0.005133, 50pp=0.160834, max=0.184401

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006638, min=0.005692, std=0.000240, 50pp=0.006683, max=0.006820
CPU: Avg=0.124803, min=0.017689, std=0.025586, 50pp=0.132032, max=0.140487
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021547, min=0.014423, std=0.030466, 50pp=0.014550, max=0.154347
CPU: Avg=0.028021, min=0.014423, std=0.058684, 50pp=0.014550, max=0.283819
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.013854, min=0.013340, std=0.001092, 50pp=0.013473, max=0.018173
CPU: Avg=0.159848, min=0.042767, std=0.030257, 50pp=0.158168, max=0.201928
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056862, min=0.055903, std=0.001415, 50pp=0.056542, max=0.062904
CPU: Avg=0.062644, min=0.055904, std=0.026587, 50pp=0.056541, max=0.178527
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.066440, min=0.057658, std=0.010104, 50pp=0.061157, max=0.091539
CPU: Avg=0.341102, min=0.296225, std=0.039675, 50pp=0.330194, max=0.468950
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247126, min=0.244027, std=0.000854, 50pp=0.247365, max=0.247956
CPU: Avg=0.249327, min=0.246240, std=0.008907, 50pp=0.247429, max=0.288095
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.240948, min=0.232588, std=0.005253, 50pp=0.242903, max=0.247918
CPU: Avg=0.909776, min=0.897949, std=0.007355, 50pp=0.911185, max=0.920746
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.990383, min=0.984627, std=0.005506, 50pp=0.988228, max=1.001983
CPU: Avg=0.990355, min=0.984624, std=0.005523, 50pp=0.988213, max=1.001986
Without this change (baseline):

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028019, min=0.024638, std=0.003926, 50pp=0.025822, max=0.037850
CPU: Avg=0.206308, min=0.087896, std=0.055139, 50pp=0.192890, max=0.363112
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.058581, min=0.050894, std=0.029926, 50pp=0.051244, max=0.188880
CPU: Avg=0.064444, min=0.050892, std=0.055465, 50pp=0.051246, max=0.306130
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045687, min=0.044865, std=0.000342, 50pp=0.045697, max=0.046348
CPU: Avg=0.255593, min=0.165552, std=0.024769, 50pp=0.261888, max=0.285987
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.113111, min=0.111636, std=0.001145, 50pp=0.112891, max=0.115315
CPU: Avg=0.117115, min=0.111639, std=0.017151, 50pp=0.112947, max=0.191720
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104407, min=0.104051, std=0.000239, 50pp=0.104377, max=0.104971
CPU: Avg=0.421223, min=0.393942, std=0.006962, 50pp=0.421754, max=0.427590
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.238135, min=0.232520, std=0.001440, 50pp=0.238539, max=0.239491
CPU: Avg=0.239498, min=0.236948, std=0.004708, 50pp=0.238564, max=0.259829

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006603, min=0.005808, std=0.000234, 50pp=0.006678, max=0.006708
CPU: Avg=0.123508, min=0.020989, std=0.024401, 50pp=0.131621, max=0.137913
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021495, min=0.014405, std=0.030274, 50pp=0.014520, max=0.153455
CPU: Avg=0.027889, min=0.014406, std=0.058143, 50pp=0.014520, max=0.281326
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.018824, min=0.013350, std=0.005081, 50pp=0.016140, max=0.027493
CPU: Avg=0.178647, min=0.031018, std=0.039892, 50pp=0.177880, max=0.234586
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056622, min=0.055601, std=0.000995, 50pp=0.056422, max=0.060713
CPU: Avg=0.062484, min=0.055604, std=0.026486, 50pp=0.056424, max=0.177926
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.058536, min=0.058117, std=0.000341, 50pp=0.058455, max=0.059743
CPU: Avg=0.322326, min=0.276561, std=0.015820, 50pp=0.322121, max=0.345675
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247781, min=0.244237, std=0.001217, 50pp=0.247492, max=0.250554
CPU: Avg=0.249642, min=0.247209, std=0.007360, 50pp=0.247554, max=0.281481
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.246608, min=0.235026, std=0.009394, 50pp=0.249331, max=0.263891
CPU: Avg=0.920643, min=0.899138, std=0.013884, 50pp=0.921552, max=0.947152
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.989336, min=0.984804, std=0.005962, 50pp=0.987003, max=1.012817
CPU: Avg=0.989328, min=0.984795, std=0.005961, 50pp=0.987010, max=1.012817
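The ~30% figure can be checked from the 16777216-element triton wall-clock averages quoted above, assuming the first set of numbers is with the change and the second is the baseline:

```python
# Speedup check from the benchmark averages quoted in this comment.
with_change = 0.160930     # bench_vecadd(16777216, 'triton'), first run
baseline = 0.238135        # bench_vecadd(16777216, 'triton'), second run

speedup = (baseline - with_change) / baseline
print(f"{speedup:.1%}")    # roughly 32%, consistent with the ~30% claim
```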

@parsifal-47

And in case the mask is removed as well:

diff --git a/python/examples/test_vec_add.py b/python/examples/test_vec_add.py
index db2fa09..ce68008 100644
--- a/python/examples/test_vec_add.py
+++ b/python/examples/test_vec_add.py
@@ -24,14 +24,13 @@ def add_kernel(
     block_start = pid * BLOCK_SIZE
     offsets = block_start + tl.arange(0, BLOCK_SIZE)
     # Create a mask to guard memory operations against out-of-bounds accesses.
-    mask = offsets < n_elements
     # Load x and y from DRAM, masking out any extra elements in case the input is not a
     # multiple of the block size.
-    x = tl.load(x_ptr + offsets, mask=mask)
-    y = tl.load(y_ptr + offsets, mask=mask)
+    x = tl.load(x_ptr + offsets)
+    y = tl.load(y_ptr + offsets)
     output = x + y
     # Write x + y back to DRAM.
-    tl.store(output_ptr + offsets, output, mask=mask)
+    tl.store(output_ptr + offsets, output)

No allocations, and a 43% speedup for vector addition:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.032111, min=0.025130, std=0.006946, 50pp=0.029137, max=0.045271
CPU: Avg=0.234663, min=0.077274, std=0.082792, 50pp=0.207875, max=0.483639
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.036748, min=0.022976, std=0.038412, 50pp=0.028940, max=0.203814
CPU: Avg=0.040923, min=0.022978, std=0.056583, 50pp=0.028942, max=0.287311
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045666, min=0.044756, std=0.001394, 50pp=0.045304, max=0.051570
CPU: Avg=0.254962, min=0.161122, std=0.023800, 50pp=0.256076, max=0.279010
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.063821, min=0.056867, std=0.001860, 50pp=0.064403, max=0.064681
CPU: Avg=0.068500, min=0.060092, std=0.018950, 50pp=0.064414, max=0.150998
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104448, min=0.104088, std=0.000209, 50pp=0.104432, max=0.105066
CPU: Avg=0.420013, min=0.393254, std=0.006998, 50pp=0.419529, max=0.426582
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.129658, min=0.128930, std=0.000492, 50pp=0.129620, max=0.130649
CPU: Avg=0.131077, min=0.128934, std=0.006420, 50pp=0.129618, max=0.158996
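The benchmark sizes above (4194304, 8388608, 16777216) are multiples of the block size, so the unmasked loads happen to stay in bounds there. In general, though, the last block can run past the end of the data, which is exactly what the mask guards. A small sketch of the offset arithmetic (assuming a hypothetical BLOCK_SIZE of 1024, mirroring block_start + tl.arange(0, BLOCK_SIZE) in the kernel):

```python
BLOCK_SIZE = 1024

def block_offsets(pid, block_size):
    # Offsets computed by program instance `pid`, as in the kernel.
    start = pid * block_size
    return [start + i for i in range(block_size)]

# n_elements a multiple of BLOCK_SIZE: every offset in the last block is valid.
n_elements = 4096
last_block = block_offsets(n_elements // BLOCK_SIZE - 1, BLOCK_SIZE)
assert all(o < n_elements for o in last_block)

# n_elements NOT a multiple: the last block (pid 3 of a 4-block grid)
# generates offsets 3072..4095, some of which are past the end.
n_elements = 4000
last_block = block_offsets(3, BLOCK_SIZE)
assert any(o >= n_elements for o in last_block)  # the accesses the mask guards
```

So dropping the mask is only a valid experiment for sizes that divide evenly by the block size, as in these benchmarks.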

@parsifal-47

This optimization is not correct for #218; it needs additional checks or a different approach.

@parsifal-47 parsifal-47 closed this Jan 9, 2025