Vectorize checks called by compiler #2556

SteveBronder · 2021-08-06T21:54:03Z

Summary

This PR vectorizes the check_* functions that the compiler calls for objects created in transformed parameters, transformed data, and generated quantities.

The stan compiler currently calls several of the check_* functions for transformed params (specifically the ones here where each check is generated as check_<matched_bound_name>). For several of these checks the compiler will generate a for loop to iterate over each underlying element of the matrix/vector/array, but for the new matrix type we don't want iterations like this that have to look at individual elements.

So with this PR instead of stanc generating the following to test if an array of vectors is greater than or equal to another vector,

      for (int sym1__ = 1; sym1__ <= m; ++sym1__) {
        current_statement__ = 27;
        for (int sym2__ = 1; sym2__ <= k; ++sym2__) {
          current_statement__ = 27;
          check_greater_or_equal(function__, "tp_9[sym1__, sym2__]",
                                 tp_9[(sym1__ - 1)][(sym2__ - 1)],
                                 rvalue(ds, "ds", index_uni(1))[(sym1__ - 1)][
                                 (sym2__ - 1)]);
        }
      }

it can just generate

check_greater_or_equal(function__, "tp_9", tp_9, ds);

You'll notice this also fixes a bug in the compiler: if a check failed, the error message named the offending object with the literal loop symbols, "tp_9[sym1__, sym2__]". In this impl we clean that up so that the actual index for arrays / vectors / matrices is reported, such as tp_9[1][5].
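As a rough illustration of the idea above (not the Stan Math implementation), a single vectorized call can loop internally and report the concrete 1-based position of the first failing element. The function name `check_greater_or_equal_sketch` and the error wording are assumptions made for this sketch, and it only handles an array of `std::vector<double>` checked against a bound of the same shape.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a vectorized check; not the Stan Math code.
// Verifies y[i][j] >= low[i][j] for every element in one call.
inline void check_greater_or_equal_sketch(
    const char* function, const char* name,
    const std::vector<std::vector<double>>& y,
    const std::vector<std::vector<double>>& low) {
  if (y.size() != low.size()) {
    throw std::invalid_argument(std::string(function) + ": size mismatch");
  }
  for (std::size_t i = 0; i < y.size(); ++i) {
    if (y[i].size() != low[i].size()) {
      throw std::invalid_argument(std::string(function) + ": size mismatch");
    }
    for (std::size_t j = 0; j < y[i].size(); ++j) {
      // The negated comparison also rejects NaN.
      if (!(y[i][j] >= low[i][j])) {
        // Report the concrete 1-based position, e.g. tp_9[1][5],
        // instead of loop symbols such as tp_9[sym1__, sym2__].
        throw std::domain_error(
            std::string(function) + ": " + name + "[" + std::to_string(i + 1)
            + "][" + std::to_string(j + 1)
            + "] must be greater than or equal to its lower bound");
      }
    }
  }
}
```

With this shape the generated model code needs only one call per variable, and the index text is assembled only when an element actually fails.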

Tests

Tests were changed for each check to test the vectorized version of the inputs (which then checks the underlying impls for matrices and scalars.) Tests can be run with

./runTests.py -j4 test/unit/math/prim/err/check_cholesky_factor_corr_test.cpp \
test/unit/math/prim/err/check_cholesky_factor_test.cpp \
test/unit/math/prim/err/check_corr_matrix_test.cpp \
test/unit/math/prim/err/check_cov_matrix_test.cpp \
test/unit/math/prim/err/check_greater_or_equal_test.cpp \
test/unit/math/prim/err/check_greater_test.cpp \
test/unit/math/prim/err/check_less_or_equal_test.cpp \
test/unit/math/prim/err/check_less_test.cpp \
test/unit/math/prim/err/check_ordered_test.cpp \
test/unit/math/prim/err/check_positive_ordered_test.cpp \
test/unit/math/prim/err/check_simplex_test.cpp \
test/unit/math/prim/err/check_unit_vector_test.cpp

Side Effects

Nope!

Release notes

Vectorize checks called by stanc compiler

Checklist

Math issue How to add static matrix? #1805
Copyright holder: Steve Bronder

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

stan-buildbot · 2021-08-10T11:09:49Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	2.96	3.0	0.98	-1.53% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.99	-1.31% slower
eight_schools/eight_schools.stan	0.11	0.11	0.99	-0.72% slower
gp_regr/gp_regr.stan	0.16	0.16	1.0	-0.0% slower
irt_2pl/irt_2pl.stan	5.84	5.96	0.98	-1.89% slower
performance.compilation	88.7	87.74	1.01	1.08% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.66	8.94	0.97	-3.17% slower
pkpd/one_comp_mm_elim_abs.stan	29.36	30.54	0.96	-3.99% slower
sir/sir.stan	126.16	126.0	1.0	0.13% faster
gp_regr/gen_gp_data.stan	0.03	0.03	1.0	-0.46% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.17	2.94	1.08	7.42% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.39	0.38	1.01	0.94% faster
arK/arK.stan	1.88	2.52	0.75	-33.92% slower
arma/arma.stan	0.84	0.83	1.0	0.48% faster
garch/garch.stan	0.54	0.67	0.8	-24.44% slower
Mean result: 0.968637515091

Jenkins Console Log
Blue Ocean
Commit hash: 64c6019

Machine information

ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

SteveBronder · 2021-08-10T21:52:30Z

@rok-cesnovar @serban-nicusor-toptal it looks like the github actions are taking v long and it won't let me look at the raw logs. Is there some way to see what's going on?

SteveBronder · 2021-08-10T22:52:35Z

@rok-cesnovar @serban-nicusor-catena nvm!

stan-buildbot · 2021-08-11T19:52:59Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	3.12	2.99	1.05	4.36% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	1.01	1.18% faster
eight_schools/eight_schools.stan	0.1	0.11	0.95	-5.23% slower
gp_regr/gp_regr.stan	0.16	0.16	1.02	1.86% faster
irt_2pl/irt_2pl.stan	5.83	5.87	0.99	-0.82% slower
performance.compilation	90.06	87.32	1.03	3.05% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.55	8.43	1.01	1.4% faster
pkpd/one_comp_mm_elim_abs.stan	30.56	30.05	1.02	1.68% faster
sir/sir.stan	127.78	128.07	1.0	-0.22% slower
gp_regr/gen_gp_data.stan	0.03	0.04	0.98	-2.25% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.97	2.91	1.02	2.15% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.39	0.39	1.02	1.66% faster
arK/arK.stan	1.87	2.53	0.74	-35.35% slower
arma/arma.stan	0.84	0.83	1.01	0.61% faster
garch/garch.stan	0.53	0.67	0.79	-27.19% slower
Mean result: 0.975136896002

Jenkins Console Log
Blue Ocean
Commit hash: 64c6019

bob-carpenter

I'm going to time myself out at 30 comments. A lot of them are redundant around a few themes:

- specifying what the function does in the first sentence of the doc
- clarifying strict vs. non-strict comparison
- efficiency concerns around the eager use of make_iter_name
- documenting what the constraints enforce
- not confusing types and values in the doc

Other than the efficiency tests, these are all minor. I'm happy to help with the doc changes once it's clarified for each test whether it's a strict inequality or not.

bob-carpenter · 2021-08-12T16:17:29Z

stan/math/prim/err/check_cholesky_factor.hpp

+ * elements are all positive.  Note that Cholesky factors need not
+ * be square, but require at least as many rows M as columns N
+ * (i.e., M &gt;= N).
+ * @tparam StdVec A standard vector with inner type inheriting from `MatrixBase`


[question]
Should this be checked? Or is that all that's going to get passed to it?

We only have a check_cholesky_factor that works for this definition so if someone tried passing something like a std::vector<std::vector<double>> it would just fail to compile

bob-carpenter · 2021-08-12T16:17:33Z

stan/math/prim/err/check_cholesky_factor.hpp

@@ -16,7 +19,7 @@ namespace math {
 * elements are all positive.  Note that Cholesky factors need not
 * be square, but require at least as many rows M as columns N
 * (i.e., M &gt;= N).
- * @tparam EigMat Type of the Cholesky factor (must be derived from \c
+ * @tparam Mat Type of the Cholesky factor (must be derived from \c


[optional]
What convention does Eigen use for these matrix args? I think it'd be nice to follow that. I think they may call it Derived.

For Eigen they use Derived in places like

template<typename Derived> bool solveInPlace(MatrixBase<Derived> &bAndX)

Where Derived is the inner type used in the CRTP of MatrixBase. Here we accept anything derived from MatrixBase. This function also takes in `var_value<Eigen::Matrix>` types, and I'll update the docs to reflect that.

bob-carpenter · 2021-08-12T16:18:24Z

stan/math/prim/err/check_cholesky_factor.hpp

+ *   factor, if number of rows is less than the number of columns,
+ *   if there are 0 columns, or if any element in matrix is NaN
+ */
+template <typename StdVec, require_std_vector_t<StdVec>* = nullptr>


[optional]
I'd rather keep shorter template parameters, like just V for standard vectors or maybe C for generic containers.

For very generic functions I like using V or T etc. but for functions with specific requirements I like that the template parameter's name gives an idea of what the requirement is to use the function.

bob-carpenter · 2021-08-12T16:19:50Z

stan/math/prim/err/check_cholesky_factor.hpp

+void check_cholesky_factor(const char* function, const char* name,
+                           const StdVec& y) {
+  for (size_t i = 0; i < y.size(); ++i) {
+    check_cholesky_factor(function, internal::make_iter_name(name, i).c_str(),


This needs a performance evaluation as it's going to proactively create string names for each entry, which is pretty expensive.

[question]
Is there a way to be lazy and avoid creating the name until the check fails?

I'm fine with running a little performance check here, I def expect it to be slower for small matrices though hopefully not much.

Is there a way to be lazy and avoid creating the name until the check fails?

I tried thinking about this but the only thing I could figure out is to change all of the checks to take in a lambda instead of a const char* that doesn't evaluate until a throw occurs.
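A minimal sketch of the lambda idea mentioned above, assuming a scalar check that accepts any callable returning a `std::string` in place of the `const char*` name. The names `check_positive_scalar` and `check_positive_all` are hypothetical and do not exist in Stan Math; the point is only that the callable runs on the failure path.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical scalar check. `make_name` is any callable returning a
// std::string; it is invoked only when the check fails.
template <typename NameFn>
inline void check_positive_scalar(const char* function, NameFn&& make_name,
                                  double y) {
  if (!(y > 0.0)) {
    throw std::domain_error(std::string(function) + ": " + make_name()
                            + " must be positive");
  }
}

// Hypothetical container check. The lambda captures the base name and the
// index, so no string is built for elements that pass.
inline void check_positive_all(const char* function, const char* name,
                               const std::vector<double>& y) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    check_positive_scalar(
        function,
        [name, i]() {
          return std::string(name) + "[" + std::to_string(i + 1) + "]";
        },
        y[i]);
  }
}
```

The trade-off is the one raised in the reply: every check signature would have to accept a callable instead of a plain `const char*`.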

bob-carpenter · 2021-08-12T16:21:04Z

stan/math/prim/err/check_cholesky_factor_corr.hpp

+void check_cholesky_factor_corr(const char* function, const char* name,
+                                const StdVec& y) {
+  for (size_t i = 0; i < y.size(); ++i) {
+    check_cholesky_factor_corr(function,


same question for all of these and efficiency.

bob-carpenter · 2021-08-12T16:57:01Z

stan/math/prim/err/check_less.hpp

        }();
      }
    }
+  } else {


[question]
What happens with things like arrays of matrices? In that case, there are more than two indexes.

So this function is templated by

template <typename T_y, typename T_high, require_all_matrix_t<T_y, T_high>* = nullptr>

So it requires that the template types are either Eigen matrix types or var_value<MatrixXd> types. The overloads below, templated with requires for std_vectors, are the ones that loop over arrays (and arrays of matrices etc.).
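The overload split described above can be sketched with standard C++ only. This is an illustration under assumptions: `check_less_sketch` is a hypothetical name, `std::enable_if` stands in for Stan Math's `require_*` aliases, and a scalar base case replaces the matrix overload. Names are built eagerly here for brevity, which is exactly the cost questioned elsewhere in this review.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Base case: selected only for arithmetic scalars.
template <typename T,
          typename std::enable_if<std::is_arithmetic<T>::value>::type* = nullptr>
inline void check_less_sketch(const std::string& name, const T& y,
                              double high) {
  if (!(y < high)) {
    throw std::domain_error(name + " must be less than "
                            + std::to_string(high));
  }
}

// Container case: selected for any std::vector<T>. Because it calls
// check_less_sketch on each element, arrays of arrays recurse until the
// scalar overload is reached, adding one index per nesting level.
template <typename T>
inline void check_less_sketch(const std::string& name,
                              const std::vector<T>& y, double high) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    check_less_sketch(name + "[" + std::to_string(i + 1) + "]", y[i], high);
  }
}
```

An array of matrices would follow the same route: the `std::vector` overload peels off the array index and hands each element to whichever overload accepts the element type.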

bob-carpenter · 2021-08-12T16:58:16Z

stan/math/prim/err/check_less.hpp

+}
+
+/**
+ * Check if each element of <code>y</code> is strictly less than each associated


[question]
Is this strictly less or less than or equal? I'm confused about what's being checked in these. For our constraints, all the checks should be less than or equal in order to deal with rounding/underflow errors.

So we have check_less and check_less_or_equal which works well with the rounding/underflow errors. There's other parts of Stan math which use check_less, should those be changed to check_less_or_equal? I think we would want to do that in a separate PR
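To make the strict versus non-strict distinction concrete, here is a small sketch with hypothetical helper names (`inv_logit_sketch`, `check_less_strict`, `check_less_or_equal_sketch`); none of these are the Stan Math functions. It assumes an upper-bounded value produced by an inverse-logit transform, which can round onto the bound in double precision.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>

// Hypothetical transform onto (0, 1); rounds to exactly 1.0 for large x.
inline double inv_logit_sketch(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Strict check: a value that lands exactly on the bound is rejected.
inline void check_less_strict(const char* name, double y, double high) {
  if (!(y < high)) {
    throw std::domain_error(std::string(name) + " must be less than "
                            + std::to_string(high));
  }
}

// Non-strict check: a value that lands exactly on the bound is accepted.
inline void check_less_or_equal_sketch(const char* name, double y,
                                       double high) {
  if (!(y <= high)) {
    throw std::domain_error(std::string(name)
                            + " must be less than or equal to "
                            + std::to_string(high));
  }
}
```

In IEEE double precision `inv_logit_sketch(40.0)` evaluates to exactly 1.0, so `check_less_strict("y", inv_logit_sketch(40.0), 1.0)` throws while `check_less_or_equal_sketch` accepts the same value.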

stan/math/prim/err/check_positive_ordered.hpp

bob-carpenter · 2021-08-12T17:01:22Z

stan/math/prim/err/check_simplex.hpp

- * @tparam C Eigen column type, either 1 if we have a column vector
- *         or -1 if we have a row vector. Moreover, we either have
- *         R = 1 and C = -1 or R = -1 and C = 1.
+ * @tparam T A type inheriting from EigenBase.


[optional]
Only sentences should have periods after them. Sorry the doc's already so inconsistent.

bob-carpenter · 2021-08-12T17:02:41Z

stan/math/prim/err/check_unit_vector.hpp

@@ -48,6 +52,26 @@ void check_unit_vector(const char* function, const char* name,
  }
 }

+/**
+ * Check if the each element in a standard vector is a unit vector.


Should say "unit Euclidean length" to specify what unit vector means here. "unit_vector" is a type in the Stan language.

stan-buildbot · 2021-08-14T01:55:57Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	3.02	3.05	0.99	-0.99% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.92	-8.9% slower
eight_schools/eight_schools.stan	0.11	0.1	1.05	4.34% faster
gp_regr/gp_regr.stan	0.16	0.16	1.0	-0.13% slower
irt_2pl/irt_2pl.stan	5.82	5.84	1.0	-0.27% slower
performance.compilation	90.34	87.87	1.03	2.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.58	8.4	1.02	2.15% faster
pkpd/one_comp_mm_elim_abs.stan	29.4	30.22	0.97	-2.79% slower
sir/sir.stan	129.64	131.95	0.98	-1.78% slower
gp_regr/gen_gp_data.stan	0.03	0.04	0.97	-3.52% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.98	2.91	1.02	2.38% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.39	0.41	0.94	-6.06% slower
arK/arK.stan	1.86	1.87	0.99	-0.56% slower
arma/arma.stan	0.83	0.78	1.07	6.28% faster
garch/garch.stan	0.54	0.56	0.96	-4.16% slower
Mean result: 0.994001642229

Jenkins Console Log
Blue Ocean
Commit hash: 87b15e5

SteveBronder · 2021-08-16T17:25:51Z

@bob-carpenter fixed up everything so this is ready for another look

SteveBronder · 2021-08-17T19:51:46Z

@bob-carpenter actually hang on I ran the performance checks and there's a regression I need to fix

SteveBronder · 2021-08-18T00:35:08Z

@bob-carpenter Alright so I think I have everything sorted out. I made a benchmark here for the vector case and the results seem good. The number in check_ge/*/manual_time is the size of the vector we are testing and Time is the manual benchmark time that doesn't include the construction of the std::vector<> for each test. For sizes 2..4096 this PR and develop are within a few nanoseconds of each other.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          26.7 ns         57.1 ns     26208169
check_ge/4/manual_time          26.8 ns         57.1 ns     26002894
check_ge/8/manual_time          27.6 ns         57.9 ns     25198473
check_ge/16/manual_time         30.1 ns         60.4 ns     23274275
check_ge/32/manual_time         37.4 ns         67.6 ns     18707376
check_ge/64/manual_time         62.7 ns         93.0 ns     11294734
check_ge/128/manual_time        91.9 ns          122 ns      7793109
check_ge/256/manual_time         157 ns          187 ns      4503566
check_ge/512/manual_time         271 ns          301 ns      2528503
check_ge/1024/manual_time        521 ns          552 ns      1232462
check_ge/2048/manual_time       1007 ns         1037 ns       679082
check_ge/4096/manual_time       1978 ns         2008 ns       354471

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          24.8 ns         55.1 ns     28259239
check_ge/4/manual_time          25.9 ns         56.2 ns     26950003
check_ge/8/manual_time          27.6 ns         57.9 ns     25538389
check_ge/16/manual_time         30.4 ns         60.8 ns     22979588
check_ge/32/manual_time         37.1 ns         67.4 ns     18895034
check_ge/64/manual_time         58.5 ns         88.7 ns     11736093
check_ge/128/manual_time        91.4 ns          122 ns      7773196
check_ge/256/manual_time         151 ns          181 ns      4606845
check_ge/512/manual_time         261 ns          291 ns      2675818
check_ge/1024/manual_time        506 ns          536 ns      1348204
check_ge/2048/manual_time       1019 ns         1049 ns       694327
check_ge/4096/manual_time       1968 ns         1998 ns       356119

But what should I benchmark against for the ones that have internal::make_iter_name(name, i).c_str()? For those right now the compiler just returns stuff like y[sym32__][1] where this PR actually returns back the index number like y[1][1], y[2][1], etc. So if we don't do make_iter_name() that's fine but then are we fine with error checks that just report that y[sym32__][1]?

To benchmark this right now I'm doing this for the current PR where we use the vectorized version and like what stanc3 does here for develop where we just loop over x_vec and y_vec.

There's def a big cost for constructing these names correctly. If we're not cool with paying that cost then I'm fine with just reporting y. I think a big part of the cost here is making the std::string for appending arithmetic types to the const char* we pass in, but then just getting the c_str() from the string. That means a lot of the temporary strings we have actually end up getting copied a bunch.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time           113 ns          144 ns      6226928
check_ge/4/manual_time           189 ns          219 ns      3674070
check_ge/8/manual_time           348 ns          378 ns      2010529
check_ge/16/manual_time          738 ns          768 ns       944092
check_ge/32/manual_time         1746 ns         1776 ns       390178
check_ge/64/manual_time         6281 ns         6311 ns       111164
check_ge/128/manual_time       19204 ns        19230 ns        36417
check_ge/256/manual_time       65290 ns        65314 ns        11477
check_ge/512/manual_time      226390 ns       226394 ns         2985
check_ge/1024/manual_time    1387108 ns      1387246 ns          511
check_ge/2048/manual_time    5066681 ns      5065742 ns          100
check_ge/4096/manual_time   20200250 ns     20195097 ns           37

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          27.2 ns         57.5 ns     25806638
check_ge/4/manual_time          33.3 ns         63.6 ns     21105640
check_ge/8/manual_time          68.0 ns         98.3 ns     10509997
check_ge/16/manual_time          174 ns          205 ns      4141638
check_ge/32/manual_time          775 ns          805 ns       941074
check_ge/64/manual_time         2572 ns         2602 ns       273127
check_ge/128/manual_time        9267 ns         9295 ns        76701
check_ge/256/manual_time       37915 ns        37944 ns        18639
check_ge/512/manual_time      144437 ns       144464 ns         4980
check_ge/1024/manual_time    1282011 ns      1282044 ns          543
check_ge/2048/manual_time    4706496 ns      4706217 ns          149
check_ge/4096/manual_time   18116610 ns     18116080 ns           38

rok-cesnovar · 2021-08-18T06:53:50Z

For those right now the compiler just returns stuff like y[sym32__][1] where this PR actually returns back the index number like y[1][1], y[2][1], etc. So if we don't do make_iter_name() that's fine but then are we fine with error checks that just report that y[sym32__][1]?

So this PR would then also fix stan-dev/stanc3#676 is what you are saying? And that fix comes with a performance penalty? Not sure I completely follow.

SteveBronder · 2021-08-18T09:03:12Z

Yes it also does that fix (once these are used in the compiler). The performance penalty comes from constructing the string for the index number.

stan-buildbot · 2021-08-18T22:28:13Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	2.85	2.95	0.97	-3.58% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	0.9	-11.09% slower
eight_schools/eight_schools.stan	0.11	0.1	1.01	0.79% faster
gp_regr/gp_regr.stan	0.16	0.15	1.01	1.03% faster
irt_2pl/irt_2pl.stan	5.81	5.84	0.99	-0.61% slower
performance.compilation	87.32	86.52	1.01	0.92% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.59	8.47	1.01	1.4% faster
pkpd/one_comp_mm_elim_abs.stan	30.67	29.66	1.03	3.29% faster
sir/sir.stan	140.03	127.18	1.1	9.17% faster
gp_regr/gen_gp_data.stan	0.03	0.03	0.98	-2.19% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.0	2.95	1.02	1.75% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.39	0.41	0.94	-6.32% slower
arK/arK.stan	2.5	1.85	1.35	25.92% faster
arma/arma.stan	0.75	0.85	0.89	-12.46% slower
garch/garch.stan	0.67	0.56	1.2	16.87% faster
Mean result: 1.02769815462

Jenkins Console Log
Blue Ocean
Commit hash: f96a182

SteveBronder · 2021-08-18T22:48:48Z

@bob-carpenter this is ready for another look (and if you can look at the perf tests of the above)

bob-carpenter · 2021-08-19T15:29:15Z

Interesting---so this isn't introducing any performance regressions in our current code? Should we run that again to be sure? I don't think we'd be willing to take a performance regression to check indices. Or is the code we'd be replacing also inefficient?

I may be wrong here (and would very much like to be), but the code pattern seems to be something like this:

for (...) {
  string msg = ... construct string "var[idx]" ...
  if (bad condition)
    throw(msg);
}

That is, the message string gets constructed eagerly. If that's not the case, then no worries on this PR and I just misunderstood.

Rather than the above, it's more efficient to use this pattern:

for (...) {
  if (bad condition) {
    string msg = ... construct string "var[idx]" ...;
    throw(msg);
  }
}

The problem is passing the indexes down to the embedded check. I think the way to do that in this code would be to have the check functions take in zero or more indexes which would get appended to the names if there are errors. It'd require some fast footwork to do it in general with variadic last arguments consisting of the list of current indices. That way, you don't ever need to construct a string until there's an error and the string construction loop gets unfolded without any redundant copying.

Do you think that'd be feasible? Or worthwhile? If we really are checking eagerly, I think it'd be a huge speed win. And it'd explain why everyone's been wanting to remove checks from the code!
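One way the variadic-index proposal above could look, as a sketch under assumptions: the names `append_indices_sketch` and `check_ge_lazy` are hypothetical, the bound is a single scalar, and only `std::vector` nesting is handled. Each level of the container overload forwards the indices accumulated so far plus its own, and the name string is assembled only inside the failing branch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Base case of the recursion: no indices left to append.
inline void append_indices_sketch(std::string&) {}

// Appends "[i]" (1-based) for each index in the pack, in order.
template <typename... Rest>
inline void append_indices_sketch(std::string& out, std::size_t idx,
                                  Rest... rest) {
  out += "[" + std::to_string(idx + 1) + "]";
  append_indices_sketch(out, rest...);
}

// Scalar check: the indices are only turned into text if the check fails.
template <typename... Idxs>
inline void check_ge_lazy(const char* function, const char* name, double y,
                          double low, Idxs... idxs) {
  if (!(y >= low)) {
    std::string full_name(name);
    append_indices_sketch(full_name, idxs...);
    throw std::domain_error(std::string(function) + ": " + full_name
                            + " must be greater than or equal to "
                            + std::to_string(low));
  }
}

// Container check: passes the indices seen so far plus the current one.
template <typename T, typename... Idxs>
inline void check_ge_lazy(const char* function, const char* name,
                          const std::vector<T>& y, double low, Idxs... idxs) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    check_ge_lazy(function, name, y[i], low, idxs..., i);
  }
}
```

On the passing path the only extra work is copying a few integers down the call chain, which is consistent with the aim of never constructing a string until there is an error.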

SteveBronder · 2021-08-19T20:24:56Z

Do you think that'd be feasible? Or worthwhile? If we really are checking eagerly, I think it'd be a huge speed win. And it'd explain why everyone's been wanting to remove checks from the code!

Yes that's a great idea!! With those changes the speed checks for std::vector<std::vector<double>> go down considerably to be a hair faster than our current checks.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          28.7 ns         59.4 ns     25040768
check_ge/4/manual_time          34.0 ns         64.4 ns     20434993
check_ge/8/manual_time          68.9 ns         99.2 ns     10456872
check_ge/16/manual_time          187 ns          218 ns      3692283
check_ge/32/manual_time          787 ns          817 ns       872611
check_ge/64/manual_time         2615 ns         2645 ns       266462
check_ge/128/manual_time        9372 ns         9403 ns        74366
check_ge/256/manual_time       37738 ns        37763 ns        18659
check_ge/512/manual_time      140853 ns       140872 ns         4971
check_ge/1024/manual_time    1197653 ns      1197643 ns          584
check_ge/2048/manual_time    4220595 ns      4220120 ns          166
check_ge/4096/manual_time   16148627 ns     16147081 ns           43

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          27.2 ns         57.5 ns     25806638
check_ge/4/manual_time          33.3 ns         63.6 ns     21105640
check_ge/8/manual_time          68.0 ns         98.3 ns     10509997
check_ge/16/manual_time          174 ns          205 ns      4141638
check_ge/32/manual_time          775 ns          805 ns       941074
check_ge/64/manual_time         2572 ns         2602 ns       273127
check_ge/128/manual_time        9267 ns         9295 ns        76701
check_ge/256/manual_time       37915 ns        37944 ns        18639
check_ge/512/manual_time      144437 ns       144464 ns         4980
check_ge/1024/manual_time    1282011 ns      1282044 ns          543
check_ge/2048/manual_time    4706496 ns      4706217 ns          149
check_ge/4096/manual_time   18116610 ns     18116080 ns           38

SteveBronder · 2021-08-19T20:30:02Z

Though I only implemented that for the check_less/greater(_or_equal) functions. For the other functions here:

1. I think the checks inside of those checks are going to be more expensive than the string construction.
2. Doing this for the rest of the functions in this PR would require a pretty darn big rewrite of our error handling, where pretty much everything takes in a parameter pack of indices.

I can do (2) but imo I'd rather do it in another PR since this PR is pretty darn big already. I think we can do quite a lot to actually speed up these checks. Like I think name in all of the error handlers should actually be a std::string. That would let us append to it for indices without constantly creating new const char* with c_str(). And we could then move those strings around if they are rvalues (which they almost always are).

bob-carpenter

Thanks. Especially for the extensive answers to all of the questions I had reading the code.

stan-buildbot · 2021-08-20T16:22:13Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan	2.86	2.89	0.99	-1.14% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.02	0.02	1.0	0.38% faster
eight_schools/eight_schools.stan	0.1	0.11	0.94	-6.78% slower
gp_regr/gp_regr.stan	0.16	0.15	1.02	1.79% faster
irt_2pl/irt_2pl.stan	5.87	5.78	1.02	1.54% faster
performance.compilation	87.59	86.88	1.01	0.81% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	8.72	8.52	1.02	2.36% faster
pkpd/one_comp_mm_elim_abs.stan	30.03	29.47	1.02	1.86% faster
sir/sir.stan	123.22	126.58	0.97	-2.72% slower
gp_regr/gen_gp_data.stan	0.03	0.03	1.01	0.92% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.99	2.9	1.03	3.07% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.41	0.4	1.04	3.57% faster
arK/arK.stan	1.86	1.85	1.0	0.49% faster
arma/arma.stan	0.82	0.91	0.9	-10.56% slower
garch/garch.stan	0.71	0.63	1.12	10.49% faster
Mean result: 1.00615684576

Jenkins Console Log
Blue Ocean
Commit hash: 31b69b6

SteveBronder and others added 14 commits August 5, 2021 17:54

Adds vectorized versions of the checks used by the compiler

b7dd98c

start adding vectorized tests

46ce2e8

adds docs and finishes tests

11d11b4

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

df9255c

…4.1 (tags/RELEASE_600/final)

fix cpplint

83224ea

check dims for matrices and vectors for sizes

a5ffbd5

fix mem error in test for check_greater

e82fd29

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

a678bf4

…4.1 (tags/RELEASE_600/final)

Fixup na vectorized check

f892c31

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

e318a49

…4.1 (tags/RELEASE_600/final)

fix error msg

e3f196c

space after equal to

c0fd490

Merge remote-tracking branch 'origin/develop' into fix/check-less-gre…

6cba218

…ater

update profile tests to sleep for a duration

64c6019

bob-carpenter self-requested a review August 12, 2021 16:14

bob-carpenter requested changes Aug 12, 2021

SteveBronder added 3 commits August 12, 2021 23:19

refactor docs for err checks

f4ed5c1

update docs

87b15e5

update docs

383f6e9

SteveBronder and others added 4 commits August 17, 2021 17:40

update for specializations for looking over standard vectors of scalars

3ff2da0

Merge commit '02dc560b022f328f917af80a1b9b7f1feb249ee4' into HEAD

2d1029c

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

2b1b480

…4.1 (tags/RELEASE_600/final)

Add unlikely to check_greater/less

6bf2271

SteveBronder and others added 4 commits August 17, 2021 19:49

Merge branch 'fix/check-less-greater' of github.com:stan-dev/math int…

71e8b6b

…o fix/check-less-greater

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

82dc731

…4.1 (tags/RELEASE_600/final)

inline string construction for greater/less test

9994174

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

6b14ee7

…4.1 (tags/RELEASE_600/final)

value_of_rec for scalar case

07f31cb

stan-buildbot and others added 3 commits August 18, 2021 14:09

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

ea2a9eb

…4.1 (tags/RELEASE_600/final)

value_of_rec for high low scalar in error

3a36e2b

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

f96a182

…4.1 (tags/RELEASE_600/final)

SteveBronder added 6 commits August 19, 2021 13:50

Merge remote-tracking branch 'origin/develop' into fix/check-less-gre…

4273c1e

…ater

Passes an idx parameter pack for lazy name construction

f26b24c

Do lazy construction in check_less/greater

4a02109

remove initializer list and pass in as arguments for check_greater_or…

840075f

…_equal

pass error handling as args to lambda

aa4d183

update docs

64c804f

[Jenkins] auto-formatting by clang-format version 6.0.0-1ubuntu2~16.0…

31b69b6

…4.1 (tags/RELEASE_600/final)

bob-carpenter approved these changes Aug 19, 2021

SteveBronder merged commit 7dd7b31 into develop Aug 20, 2021

rok-cesnovar deleted the fix/check-less-greater branch October 4, 2021 17:12
