Use immediately invoked lambdas to make error checking less expensive #2249

Closed
SteveBronder opened this issue Dec 9, 2020 · 4 comments
@SteveBronder
Collaborator

Description

Something kind of neat I was reading the other day

https://rigtorp.se/iife/

The tl;dr is that by combining some compiler attributes with immediately invoked lambdas, we can decrease the size of the error functions, and since those error paths take up less space, the compiler can inline the surrounding code better. This is also nice because we don't really care about performance when a function errors, so it makes sense to put that code on a cold path.
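To make the idea concrete, here is a minimal sketch of the pattern, not the actual Stan implementation. The names `EXAMPLE_COLD_PATH`, `check_range_sketch`, and `usage_example` are hypothetical, and the error message is only an approximation. The hot path is a single comparison, and everything that builds the message sits inside an immediately invoked lambda marked `noinline` and `cold` on GCC-compatible compilers.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>

// Hypothetical macro name (an assumption, not taken from the Stan sources).
// Expands to nothing on compilers without GNU-style attributes.
#ifdef __GNUC__
#define EXAMPLE_COLD_PATH __attribute__((noinline, cold))
#else
#define EXAMPLE_COLD_PATH
#endif

// Simplified stand-in for a 1-based range check.
inline void check_range_sketch(const char* function, const char* name,
                               int max, int index) {
  if (index >= 1 && index <= max) {
    return;  // hot path: only the comparison lives here
  }
  // Cold path: the message-building code is wrapped in a lambda that is
  // invoked immediately, so it can be kept out of line from the caller.
  [&]() EXAMPLE_COLD_PATH {
    std::ostringstream msg;
    msg << function << ": index " << index << " out of range for " << name
        << "; expecting index to be between 1 and " << max;
    throw std::out_of_range(msg.str());
  }();
}

// Small usage example: an in-range index passes, an out-of-range one throws.
inline void usage_example() {
  check_range_sketch("usage_example", "x", 3, 2);
  bool threw = false;
  try {
    check_range_sketch("usage_example", "x", 3, 4);
  } catch (const std::out_of_range&) {
    threw = true;
  }
  assert(threw);
}
```

Calling `usage_example()` exercises both paths. Whether the attribute actually changes the generated code depends on the compiler and optimization level.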

Example

I have an example below on godbolt that has our current version of check_range() and a version that uses immediately invoked lambdas and some compiler attributes. You can Ctrl+F for "runner" to see the main change: the block of asm in compiler #2 (.L28) has the tag [clone .cold], which means those instructions are moved away from the hot code and should not be pre-fetched. For compiler #1, the block of asm it jumps to (.L50) is going to be pre-fetched because of the CPU's branch prediction.

https://gcc.godbolt.org/z/6zWTcY

I'm not sure how much these would save, but we do call these functions a lot.

Expected Output

Should be the same

Current Version:

v3.4.0

@t4c1
Contributor

t4c1 commented Dec 9, 2020

How portable between compilers are these attributes?

Do you have any example model, or at least microbenchmark using some semi-realistic set of function calls that is showing any significant speedup from this?

@SteveBronder
Collaborator Author

SteveBronder commented Dec 10, 2020

As portable as likely, i.e. we need to wrap it in:

#ifdef __GNUC__
#define STAN_COLD_PATH ...
#endif
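One detail the guard above leaves open is what happens on compilers that do not define `__GNUC__`: any use of the macro would fail to compile unless an empty fallback exists. The sketch below shows one way this could look. The macro bodies and the names `EXAMPLE_COLD_PATH`, `EXAMPLE_UNLIKELY`, `throw_negative`, and `checked_nonnegative` are assumptions for illustration, not the contents of the elided definition.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical macros (assumed names and bodies).
#ifdef __GNUC__
// GCC and Clang: keep the function out of line and mark it as rarely run.
#define EXAMPLE_COLD_PATH __attribute__((noinline, cold))
// Tell the optimizer the condition is expected to be false.
#define EXAMPLE_UNLIKELY(cond) __builtin_expect(!!(cond), 0)
#else
// Other compilers: fall back to plain code so call sites still compile.
#define EXAMPLE_COLD_PATH
#define EXAMPLE_UNLIKELY(cond) (cond)
#endif

// Out-of-line error path; only reached when the check fails.
EXAMPLE_COLD_PATH void throw_negative(int value) {
  throw std::domain_error("value must be nonnegative, got "
                          + std::to_string(value));
}

// Hot path: one branch, with the failure case hinted as unlikely.
inline int checked_nonnegative(int value) {
  if (EXAMPLE_UNLIKELY(value < 0)) {
    throw_negative(value);
  }
  return value;
}

// Small usage example showing both outcomes.
inline void usage_example() {
  assert(checked_nonnegative(3) == 3);
  bool threw = false;
  try {
    checked_nonnegative(-1);
  } catch (const std::domain_error&) {
    threw = true;
  }
  assert(threw);
}
```

With the fallback in place the behavior is identical everywhere; only the optimization hints differ between compilers.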

Here's a benchmark that just calls check_range N times (check_range is the new style and check_range2 is our current version).

https://gist.github.com/SteveBronder/84ba7d4bfb0f5ded4289f3b702f6b940

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_range_bench/2             2.30 ns         2.30 ns    305371235
check_range_bench/4             4.59 ns         4.59 ns    152487809
check_range_bench/8             9.18 ns         9.18 ns     75119108
check_range_bench/16            18.4 ns         18.4 ns     37927216
check_range_bench/32            37.6 ns         37.6 ns     18620833
check_range_bench/64            74.3 ns         74.3 ns      9278324
check_range_bench/128            148 ns          148 ns      4744552
check_range_bench/256            295 ns          295 ns      2376836
check_range_bench/512            589 ns          589 ns      1180993
check_range_bench/1024          1177 ns         1177 ns       592229
check_range_bench/2048          2356 ns         2356 ns       296866
check_range_bench/4096          4728 ns         4727 ns       148123
check_range_bench/8192          9412 ns         9411 ns        74231
check_range_bench/16384        18871 ns        18867 ns        36962
check_range_bench/32768        37648 ns        37644 ns        18548
check_range_bench/65536        75274 ns        75264 ns         9249
check_range_bench/100000      115019 ns       114990 ns         6077
check_range2_bench/2            4.07 ns         4.07 ns    172928699
check_range2_bench/4            8.01 ns         8.01 ns     87142023
check_range2_bench/8            16.0 ns         16.0 ns     43742467
check_range2_bench/16           41.0 ns         41.0 ns     16911338
check_range2_bench/32           73.0 ns         73.0 ns      9502761
check_range2_bench/64            137 ns          137 ns      5124898
check_range2_bench/128           264 ns          264 ns      2662095
check_range2_bench/256           518 ns          518 ns      1350521
check_range2_bench/512          1024 ns         1024 ns       683465
check_range2_bench/1024         2043 ns         2043 ns       342490
check_range2_bench/2048         4075 ns         4074 ns       171748
check_range2_bench/4096         8138 ns         8137 ns        85742
check_range2_bench/8192        16290 ns        16288 ns        43168
check_range2_bench/16384       32503 ns        32503 ns        21565
check_range2_bench/32768       65192 ns        65177 ns        10616
check_range2_bench/65536      129872 ns       129860 ns         5378
check_range2_bench/100000     198301 ns       198246 ns         3525

tbh I'm pretty surprised by these results. For it to take roughly 40% less time (e.g. 115019 ns vs. 198301 ns at N = 100000) is way more than I thought it would be.
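For readers who want to try a comparison like this without the gist, the following is a rough, dependency-free sketch of timing N in-range checks with `std::chrono`. It is not the benchmark from the gist, which appears to use a benchmarking library. `check_range_sketch` and `time_checks` are hypothetical names, and the numbers it produces will vary by machine and compiler flags.

```cpp
#include <cassert>
#include <chrono>
#include <stdexcept>

// Simplified stand-in for the function under test (1-based index).
inline void check_range_sketch(int max, int index) {
  if (index < 1 || index > max) {
    throw std::out_of_range("index out of range");
  }
}

// Runs n in-range checks per repetition and returns the average time of
// one repetition in nanoseconds.
inline double time_checks(int n, int repetitions) {
  assert(n > 0 && repetitions > 0);
  using clock = std::chrono::steady_clock;
  volatile int sink = 0;  // keeps the loop from being optimized away entirely
  const auto start = clock::now();
  for (int r = 0; r < repetitions; ++r) {
    for (int i = 1; i <= n; ++i) {
      check_range_sketch(n, i);  // never throws: i is always in [1, n]
      sink = i;
    }
  }
  const auto stop = clock::now();
  const std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / repetitions;
}
```

Calling `time_checks(1024, 10000)` for each variant of the check gives a crude per-batch time to compare. A dedicated benchmarking library handles warm-up and statistical noise far better than this loop does.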

@t4c1
Contributor

t4c1 commented Dec 10, 2020

Can you try to rerun the benchmark without the STAN_COLD_PATH? I bet it's the same thing I hit in #2205: check_range2 being too large, which makes the compiler not inline it. If I am right, the results should be more or less the same.

@SteveBronder
Collaborator Author

SteveBronder commented Dec 11, 2020

Oh, I think you're right! Running check_range with and without STAN_COLD_PATH shows the version with the cold path attribute is only a hair faster, which is much more sane. check_range2 in the results below is just check_range but without the STAN_COLD_PATH.

I think the pattern lends itself nicely to writing error checks that are better inlined, so I'll leave this issue open in case anyone wants to have a go at it.

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
check_range_bench/2            2.29 ns         2.29 ns    305431468
check_range_bench/4            4.59 ns         4.59 ns    152437809
check_range_bench/8            9.25 ns         9.25 ns     75988612
check_range_bench/16           18.4 ns         18.4 ns     38089731
check_range_bench/32           37.0 ns         37.0 ns     18912825
check_range_bench/64           73.7 ns         73.7 ns      9480755
check_range_bench/128           148 ns          148 ns      4751490
check_range_bench/256           294 ns          294 ns      2381701
check_range_bench/512           599 ns          599 ns      1184100
check_range_bench/1024         1173 ns         1173 ns       592631
check_range_bench/2048         2360 ns         2360 ns       297816
check_range_bench/4096         4731 ns         4730 ns       148880
check_range_bench/8192         9412 ns         9411 ns        74323
check_range_bench/16384       18821 ns        18818 ns        37104
check_range_bench/32768       37617 ns        37617 ns        18413
check_range_bench/65536       75068 ns        75055 ns         9302
check_range_bench/100000     115254 ns       115250 ns         6089
check_range2_bench/2            2.34 ns         2.34 ns    298569125
check_range2_bench/4            4.70 ns         4.70 ns    149291889
check_range2_bench/8            9.38 ns         9.38 ns     74682251
check_range2_bench/16           18.9 ns         18.9 ns     37359530
check_range2_bench/32           37.5 ns         37.5 ns     18684209
check_range2_bench/64           75.9 ns         75.9 ns      9222441
check_range2_bench/128           151 ns          150 ns      4634892
check_range2_bench/256           301 ns          301 ns      2335781
check_range2_bench/512           599 ns          599 ns      1162274
check_range2_bench/1024         1199 ns         1199 ns       583401
check_range2_bench/2048         2396 ns         2396 ns       290868
check_range2_bench/4096         4798 ns         4798 ns       146110
check_range2_bench/8192         9571 ns         9569 ns        72602
check_range2_bench/16384       19154 ns        19153 ns        36344
check_range2_bench/32768       38300 ns        38299 ns        18273
check_range2_bench/65536       76677 ns        76672 ns         9082
check_range2_bench/100000     117030 ns       117021 ns         5979
