
Removing one extra copy #213

Closed
wants to merge 2 commits

Conversation


@parsifal-47 parsifal-47 commented Jan 5, 2025

Explicit allocation is not removed by the bufferization stage, but when no transformation is applied to the data, it is not needed. This change results in a 30% speedup for addition of long vectors; small vectors and softmax are not affected.
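The copy being removed can be pictured with a small NumPy sketch. This is illustrative only: the actual change is in the compiler's bufferization handling, and both function names here are hypothetical.

```python
import numpy as np

def add_with_copy(x, y):
    # Baseline: the result is staged through an explicit allocation,
    # then copied into the output buffer.
    tmp = np.empty_like(x)   # explicit allocation
    np.add(x, y, out=tmp)
    out = np.empty_like(x)
    np.copyto(out, tmp)      # the extra copy this change removes
    return out

def add_direct(x, y):
    # With the change: the result is written to the output directly,
    # since no transformation happens between producing and storing it.
    out = np.empty_like(x)
    np.add(x, y, out=out)
    return out

x = np.arange(8, dtype=np.float32)
y = np.ones(8, dtype=np.float32)
assert np.array_equal(add_with_copy(x, y), add_direct(x, y))
```

Both versions compute the same result; the second simply avoids one full-size temporary buffer and the memcpy through it, which matters most for long vectors.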

With this change applied:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028035, min=0.025388, std=0.002643, 50pp=0.026588, max=0.033185
CPU: Avg=0.194125, min=0.109586, std=0.040768, 50pp=0.188898, max=0.337164
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.042732, min=0.031362, std=0.029388, 50pp=0.037026, max=0.170410
CPU: Avg=0.048043, min=0.031364, std=0.052494, 50pp=0.037026, max=0.276624
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045633, min=0.044927, std=0.000402, 50pp=0.045694, max=0.046703
CPU: Avg=0.255655, min=0.154600, std=0.025863, 50pp=0.260998, max=0.279482
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.079617, min=0.072278, std=0.001898, 50pp=0.080133, max=0.080801
CPU: Avg=0.084269, min=0.076415, std=0.018682, 50pp=0.080128, max=0.165613
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.105064, min=0.104668, std=0.000338, 50pp=0.105036, max=0.106324
CPU: Avg=0.420951, min=0.398625, std=0.005847, 50pp=0.421809, max=0.429125
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.160930, min=0.160456, std=0.000349, 50pp=0.160841, max=0.161741
CPU: Avg=0.162062, min=0.160459, std=0.005133, 50pp=0.160834, max=0.184401

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006638, min=0.005692, std=0.000240, 50pp=0.006683, max=0.006820
CPU: Avg=0.124803, min=0.017689, std=0.025586, 50pp=0.132032, max=0.140487
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021547, min=0.014423, std=0.030466, 50pp=0.014550, max=0.154347
CPU: Avg=0.028021, min=0.014423, std=0.058684, 50pp=0.014550, max=0.283819
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.013854, min=0.013340, std=0.001092, 50pp=0.013473, max=0.018173
CPU: Avg=0.159848, min=0.042767, std=0.030257, 50pp=0.158168, max=0.201928
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056862, min=0.055903, std=0.001415, 50pp=0.056542, max=0.062904
CPU: Avg=0.062644, min=0.055904, std=0.026587, 50pp=0.056541, max=0.178527
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.066440, min=0.057658, std=0.010104, 50pp=0.061157, max=0.091539
CPU: Avg=0.341102, min=0.296225, std=0.039675, 50pp=0.330194, max=0.468950
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247126, min=0.244027, std=0.000854, 50pp=0.247365, max=0.247956
CPU: Avg=0.249327, min=0.246240, std=0.008907, 50pp=0.247429, max=0.288095
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.240948, min=0.232588, std=0.005253, 50pp=0.242903, max=0.247918
CPU: Avg=0.909776, min=0.897949, std=0.007355, 50pp=0.911185, max=0.920746
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.990383, min=0.984627, std=0.005506, 50pp=0.988228, max=1.001983
CPU: Avg=0.990355, min=0.984624, std=0.005523, 50pp=0.988213, max=1.001986
Without this change (baseline):

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.028019, min=0.024638, std=0.003926, 50pp=0.025822, max=0.037850
CPU: Avg=0.206308, min=0.087896, std=0.055139, 50pp=0.192890, max=0.363112
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.058581, min=0.050894, std=0.029926, 50pp=0.051244, max=0.188880
CPU: Avg=0.064444, min=0.050892, std=0.055465, 50pp=0.051246, max=0.306130
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045687, min=0.044865, std=0.000342, 50pp=0.045697, max=0.046348
CPU: Avg=0.255593, min=0.165552, std=0.024769, 50pp=0.261888, max=0.285987
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.113111, min=0.111636, std=0.001145, 50pp=0.112891, max=0.115315
CPU: Avg=0.117115, min=0.111639, std=0.017151, 50pp=0.112947, max=0.191720
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104407, min=0.104051, std=0.000239, 50pp=0.104377, max=0.104971
CPU: Avg=0.421223, min=0.393942, std=0.006962, 50pp=0.421754, max=0.427590
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.238135, min=0.232520, std=0.001440, 50pp=0.238539, max=0.239491
CPU: Avg=0.239498, min=0.236948, std=0.004708, 50pp=0.238564, max=0.259829

$ python test_softmax.py
bench_softmax(1024, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.006603, min=0.005808, std=0.000234, 50pp=0.006678, max=0.006708
CPU: Avg=0.123508, min=0.020989, std=0.024401, 50pp=0.131621, max=0.137913
bench_softmax(1024, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.021495, min=0.014405, std=0.030274, 50pp=0.014520, max=0.153455
CPU: Avg=0.027889, min=0.014406, std=0.058143, 50pp=0.014520, max=0.281326
bench_softmax(2048, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.018824, min=0.013350, std=0.005081, 50pp=0.016140, max=0.027493
CPU: Avg=0.178647, min=0.031018, std=0.039892, 50pp=0.177880, max=0.234586
bench_softmax(2048, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.056622, min=0.055601, std=0.000995, 50pp=0.056422, max=0.060713
CPU: Avg=0.062484, min=0.055604, std=0.026486, 50pp=0.056424, max=0.177926
bench_softmax(4096, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.058536, min=0.058117, std=0.000341, 50pp=0.058455, max=0.059743
CPU: Avg=0.322326, min=0.276561, std=0.015820, 50pp=0.322121, max=0.345675
bench_softmax(4096, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.247781, min=0.244237, std=0.001217, 50pp=0.247492, max=0.250554
CPU: Avg=0.249642, min=0.247209, std=0.007360, 50pp=0.247554, max=0.281481
bench_softmax(8192, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.246608, min=0.235026, std=0.009394, 50pp=0.249331, max=0.263891
CPU: Avg=0.920643, min=0.899138, std=0.013884, 50pp=0.921552, max=0.947152
bench_softmax(8192, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.989336, min=0.984804, std=0.005962, 50pp=0.987003, max=1.012817
CPU: Avg=0.989328, min=0.984795, std=0.005961, 50pp=0.987010, max=1.012817
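The ~30% figure can be checked from the 16777216-element triton wall-clock averages quoted above, assuming the first set of numbers is with the change and the second is the baseline:

```python
# Speedup check from the benchmark averages quoted in this comment.
with_change = 0.160930     # bench_vecadd(16777216, 'triton'), first run
baseline = 0.238135        # bench_vecadd(16777216, 'triton'), second run

speedup = (baseline - with_change) / baseline
print(f"{speedup:.1%}")    # roughly 32%, consistent with the ~30% claim
```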

@parsifal-47

And in case the mask is removed as well:

diff --git a/python/examples/test_vec_add.py b/python/examples/test_vec_add.py
index db2fa09..ce68008 100644
--- a/python/examples/test_vec_add.py
+++ b/python/examples/test_vec_add.py
@@ -24,14 +24,13 @@ def add_kernel(
     block_start = pid * BLOCK_SIZE
     offsets = block_start + tl.arange(0, BLOCK_SIZE)
     # Create a mask to guard memory operations against out-of-bounds accesses.
-    mask = offsets < n_elements
     # Load x and y from DRAM, masking out any extra elements in case the input is not a
     # multiple of the block size.
-    x = tl.load(x_ptr + offsets, mask=mask)
-    y = tl.load(y_ptr + offsets, mask=mask)
+    x = tl.load(x_ptr + offsets)
+    y = tl.load(y_ptr + offsets)
     output = x + y
     # Write x + y back to DRAM.
-    tl.store(output_ptr + offsets, output, mask=mask)
+    tl.store(output_ptr + offsets, output)

No allocations, and a 43% speedup for vector addition:

$ python test_vec_add.py
bench_vecadd(4194304, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.032111, min=0.025130, std=0.006946, 50pp=0.029137, max=0.045271
CPU: Avg=0.234663, min=0.077274, std=0.082792, 50pp=0.207875, max=0.483639
bench_vecadd(4194304, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.036748, min=0.022976, std=0.038412, 50pp=0.028940, max=0.203814
CPU: Avg=0.040923, min=0.022978, std=0.056583, 50pp=0.028942, max=0.287311
bench_vecadd(8388608, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.045666, min=0.044756, std=0.001394, 50pp=0.045304, max=0.051570
CPU: Avg=0.254962, min=0.161122, std=0.023800, 50pp=0.256076, max=0.279010
bench_vecadd(8388608, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.063821, min=0.056867, std=0.001860, 50pp=0.064403, max=0.064681
CPU: Avg=0.068500, min=0.060092, std=0.018950, 50pp=0.064414, max=0.150998
bench_vecadd(16777216, 'torch') {}, 20 times, all results in seconds
Wall: Avg=0.104448, min=0.104088, std=0.000209, 50pp=0.104432, max=0.105066
CPU: Avg=0.420013, min=0.393254, std=0.006998, 50pp=0.419529, max=0.426582
bench_vecadd(16777216, 'triton') {}, 20 times, all results in seconds
Wall: Avg=0.129658, min=0.128930, std=0.000492, 50pp=0.129620, max=0.130649
CPU: Avg=0.131077, min=0.128934, std=0.006420, 50pp=0.129618, max=0.158996
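The benchmark sizes above (4194304, 8388608, 16777216) are multiples of the block size, so the unmasked loads happen to stay in bounds there. In general, though, the last block can run past the end of the data, which is exactly what the mask guards. A small sketch of the offset arithmetic (assuming a hypothetical BLOCK_SIZE of 1024, mirroring block_start + tl.arange(0, BLOCK_SIZE) in the kernel):

```python
BLOCK_SIZE = 1024

def block_offsets(pid, block_size):
    # Offsets computed by program instance `pid`, as in the kernel.
    start = pid * block_size
    return [start + i for i in range(block_size)]

# n_elements a multiple of BLOCK_SIZE: every offset in the last block is valid.
n_elements = 4096
last_block = block_offsets(n_elements // BLOCK_SIZE - 1, BLOCK_SIZE)
assert all(o < n_elements for o in last_block)

# n_elements NOT a multiple: the last block (pid 3 of a 4-block grid)
# generates offsets 3072..4095, some of which are past the end.
n_elements = 4000
last_block = block_offsets(3, BLOCK_SIZE)
assert any(o >= n_elements for o in last_block)  # the accesses the mask guards
```

So dropping the mask is only a valid experiment for sizes that divide evenly by the block size, as in these benchmarks.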

@parsifal-47

This optimization is not correct for #218; it needs additional checks or a different approach.

@parsifal-47 parsifal-47 closed this Jan 9, 2025