name            go time/op   asm time/op  delta
Sub10VW/1       10.1ns ± 2%   5.6ns ± 1%  -45.03%
Sub10VW/2       11.7ns ± 1%   6.1ns ± 1%  -48.01%
Sub10VW/3       13.2ns ± 2%   8.1ns ± 0%  -39.03%
Sub10VW/4       14.6ns ± 0%   8.4ns ± 0%  -42.77%
Sub10VW/5       14.9ns ± 1%   8.8ns ± 0%  -40.66%
Sub10VW/10      15.1ns ± 0%  10.6ns ± 3%  -30.12%
Sub10VW/100      116ns ± 1%    45ns ± 6%  -61.62%
Sub10VW/1000    1.22µs ± 1%  0.54µs ±14%  -55.85%
Sub10VW/10000   11.9µs ± 0%   5.4µs ± 1%  -54.85%
Sub10VW/100000   122µs ± 0%    62µs ± 0%  -49.45%

The Go implementation can check whether the carry is zero and switch to
copy() for free (no need to have a standard add10VW vs. add10VW large).
In the assembler version I chose to keep a single implementation of the
function and switch to a memcpy whenever the carry is 0 (checked every 4
Words). Since the carry is almost always 0, this logic is the likely
cause of the smaller gains between 5 and 15 Words (-30.12% at 10 Words).
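
For illustration only, the carry check described above can be sketched in
Go as follows. This is not the code from this change: the function names
addWordEarlyCopy and subWordEarlyCopy are hypothetical, uint stands in
for Word, the sketch assumes len(z) >= len(x), and it tests the carry on
every Word rather than every 4 Words as the assembler version does.

```go
package main

import (
	"fmt"
	"math/bits"
)

// addWordEarlyCopy (hypothetical name) computes z = x + y, where y is a
// single word, and returns the final carry. It assumes len(z) >= len(x).
func addWordEarlyCopy(z, x []uint, y uint) (c uint) {
	c = y
	for i := range x {
		if c == 0 {
			// Nothing left to propagate: the rest of z equals the rest of x.
			copy(z[i:], x[i:])
			return 0
		}
		z[i], c = bits.Add(x[i], c, 0)
	}
	return c
}

// subWordEarlyCopy (hypothetical name) computes z = x - y, where y is a
// single word, and returns the final borrow. It assumes len(z) >= len(x).
func subWordEarlyCopy(z, x []uint, y uint) (c uint) {
	c = y
	for i := range x {
		if c == 0 {
			// No borrow left: the rest of z equals the rest of x.
			copy(z[i:], x[i:])
			return 0
		}
		z[i], c = bits.Sub(x[i], c, 0)
	}
	return c
}

func main() {
	x := []uint{^uint(0), ^uint(0), 5}
	z := make([]uint, len(x))
	c := addWordEarlyCopy(z, x, 1)
	fmt.Println(z, c)
}
```

In this sketch the carry check sits on the loop condition path, so it
costs one compare per Word. Checking only every 4 Words, as the assembler
version does, trades a later switch to the copy for fewer branches.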
Also, past 1000 Words the performance gains seem to slowly drop (from
-55.85% at 1000 Words to -49.45% at 100000 Words). The very likely cause
is the simplistic memcpy implementation compared to runtime·memmove.