
Vectorize checks called by compiler #2556


Merged 36 commits on Aug 20, 2021


SteveBronder
Collaborator

@SteveBronder SteveBronder commented Aug 6, 2021

Summary

This PR vectorizes the check_* functions that the compiler generates for objects created in transformed parameters, transformed data, and generated quantities.

The Stan compiler currently calls several of the check_* functions for transformed parameters (specifically the ones here, where each check is generated as check_<matched_bound_name>). For several of these checks the compiler generates a for loop that iterates over each underlying element of the matrix/vector/array, but for the new matrix type we don't want loops like this that have to look at individual elements.

So with this PR instead of stanc generating the following to test if an array of vectors is greater than or equal to another vector,

      for (int sym1__ = 1; sym1__ <= m; ++sym1__) {
        current_statement__ = 27;
        for (int sym2__ = 1; sym2__ <= k; ++sym2__) {
          current_statement__ = 27;
          check_greater_or_equal(function__, "tp_9[sym1__, sym2__]",
                                 tp_9[(sym1__ - 1)][(sym2__ - 1)],
                                 rvalue(ds, "ds", index_uni(1))[(sym1__ - 1)][
                                 (sym2__ - 1)]);
        }
      }

it can just generate

check_greater_or_equal(function__, "tp_9", tp_9, ds);

You'll notice this also fixes a bug in the compiler where, if an error occurred, the thrown object was named "tp_9[sym1__, sym2__]". In this implementation we clean that up so that the actual index for arrays / vectors / matrices is reported, such as tp_9[1][5].
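For illustration only, here is a minimal sketch of what a vectorized check with real indices in the error message could look like. It is not the Stan Math implementation: check_ge_sketch, the message wording, and the restriction to std::vector<double> are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a vectorized check (not the Stan Math code):
// every row of `y` is compared elementwise against `low`.
inline void check_ge_sketch(const char* function, const char* name,
                            const std::vector<std::vector<double>>& y,
                            const std::vector<double>& low) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    for (std::size_t j = 0; j < y[i].size(); ++j) {
      // at() guards against a row that is longer than `low`.
      if (!(y[i][j] >= low.at(j))) {
        // The message is assembled only after a violation is found, and it
        // carries the real 1-based indices instead of loop-variable names.
        std::ostringstream msg;
        msg << function << ": " << name << "[" << (i + 1) << "][" << (j + 1)
            << "] is " << y[i][j] << ", but must be greater than or equal to "
            << low[j];
        throw std::domain_error(msg.str());
      }
    }
  }
}

// Usage: the second row violates the bound at its second element.
inline void example_usage() {
  const std::vector<std::vector<double>> y{{1.0, 2.0}, {3.0, 1.0}};
  const std::vector<double> low{0.5, 1.5};
  try {
    check_ge_sketch("model", "tp_9", y, low);
    assert(false && "expected a domain_error");
  } catch (const std::domain_error& e) {
    assert(std::string(e.what()).find("tp_9[2][2]") != std::string::npos);
  }
}
```

The call site stays a single line, as in the generated code above, and the index bookkeeping moves into the library.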

Tests

Tests were changed for each check to test the vectorized version of the inputs (which then checks the underlying impls for matrices and scalars.) Tests can be run with

./runTests.py -j4 test/unit/math/prim/err/check_cholesky_factor_corr_test.cpp \
test/unit/math/prim/err/check_cholesky_factor_test.cpp \
test/unit/math/prim/err/check_corr_matrix_test.cpp \
test/unit/math/prim/err/check_cov_matrix_test.cpp \
test/unit/math/prim/err/check_greater_or_equal_test.cpp \
test/unit/math/prim/err/check_greater_test.cpp \
test/unit/math/prim/err/check_less_or_equal_test.cpp \
test/unit/math/prim/err/check_less_test.cpp \
test/unit/math/prim/err/check_ordered_test.cpp \
test/unit/math/prim/err/check_positive_ordered_test.cpp \
test/unit/math/prim/err/check_simplex_test.cpp \
test/unit/math/prim/err/check_unit_vector_test.cpp

Side Effects

Nope!

Release notes

Vectorize checks called by stanc compiler

Checklist

  • Math issue How to add static matrix? #1805

  • Copyright holder: Steve Bronder

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

@stan-buildbot
Contributor


Name                                                        Old Result  New Result  Ratio  Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan                              2.96        3.0         0.98   -1.53% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan                  0.02        0.02        0.99   -1.31% slower
eight_schools/eight_schools.stan                            0.11        0.11        0.99   -0.72% slower
gp_regr/gp_regr.stan                                        0.16        0.16        1.0    -0.0% slower
irt_2pl/irt_2pl.stan                                        5.84        5.96        0.98   -1.89% slower
performance.compilation                                     88.7        87.74       1.01   1.08% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan  8.66        8.94        0.97   -3.17% slower
pkpd/one_comp_mm_elim_abs.stan                              29.36       30.54       0.96   -3.99% slower
sir/sir.stan                                                126.16      126.0       1.0    0.13% faster
gp_regr/gen_gp_data.stan                                    0.03        0.03        1.0    -0.46% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan                    3.17        2.94        1.08   7.42% faster
pkpd/sim_one_comp_mm_elim_abs.stan                          0.39        0.38        1.01   0.94% faster
arK/arK.stan                                                1.88        2.52        0.75   -33.92% slower
arma/arma.stan                                              0.84        0.83        1.0    0.48% faster
garch/garch.stan                                            0.54        0.67        0.8    -24.44% slower

Mean result: 0.968637515091

Commit hash: 64c6019


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Collaborator Author

@rok-cesnovar @serban-nicusor-toptal it looks like the github actions are taking v long and it won't let me look at the raw logs. Is there some way to see what's going on?

@SteveBronder
Collaborator Author

@rok-cesnovar @serban-nicusor-catena nvm!

@stan-buildbot
Contributor


Name                                                        Old Result  New Result  Ratio  Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan                              3.12        2.99        1.05   4.36% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan                  0.02        0.02        1.01   1.18% faster
eight_schools/eight_schools.stan                            0.1         0.11        0.95   -5.23% slower
gp_regr/gp_regr.stan                                        0.16        0.16        1.02   1.86% faster
irt_2pl/irt_2pl.stan                                        5.83        5.87        0.99   -0.82% slower
performance.compilation                                     90.06       87.32       1.03   3.05% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan  8.55        8.43        1.01   1.4% faster
pkpd/one_comp_mm_elim_abs.stan                              30.56       30.05       1.02   1.68% faster
sir/sir.stan                                                127.78      128.07      1.0    -0.22% slower
gp_regr/gen_gp_data.stan                                    0.03        0.04        0.98   -2.25% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan                    2.97        2.91        1.02   2.15% faster
pkpd/sim_one_comp_mm_elim_abs.stan                          0.39        0.39        1.02   1.66% faster
arK/arK.stan                                                1.87        2.53        0.74   -35.35% slower
arma/arma.stan                                              0.84        0.83        1.01   0.61% faster
garch/garch.stan                                            0.53        0.67        0.79   -27.19% slower

Mean result: 0.975136896002

Commit hash: 64c6019

@bob-carpenter bob-carpenter self-requested a review August 12, 2021 16:14
Member

@bob-carpenter bob-carpenter left a comment

I'm going to time myself out at 30 comments. A lot of them are redundant around a few themes:

  • specifying what the function does in the first sentence of the doc
  • clarifying strict vs. non-strict comparison
  • efficiency concerns around the eager use of make_iter_name
  • documenting what the constraints enforce
  • not confusing types and values in the doc

Other than the efficiency tests, these are all minor. I'm happy to help with the doc changes once it's clarified for each test whether it's a strict inequality or not.

* elements are all positive. Note that Cholesky factors need not
* be square, but require at least as many rows M as columns N
* (i.e., M >= N).
* @tparam StdVec A standard vector with inner type inheriting from `MatrixBase`
Member

[question]
Should this be checked? Or is that all that's going to get passed to it?

Collaborator Author

We only have a check_cholesky_factor that works for this definition so if someone tried passing something like a std::vector<std::vector<double>> it would just fail to compile

@@ -16,7 +19,7 @@ namespace math {
* elements are all positive. Note that Cholesky factors need not
* be square, but require at least as many rows M as columns N
* (i.e., M >= N).
- * @tparam EigMat Type of the Cholesky factor (must be derived from \c
+ * @tparam Mat Type of the Cholesky factor (must be derived from \c
Member

[optional]
What convention does Eigen use for these matrix args? I think it'd be nice to follow that. I think they may call it Derived.

Collaborator Author

@SteveBronder SteveBronder Aug 12, 2021

For Eigen they use Derived in places like

    template<typename Derived>
    bool solveInPlace(MatrixBase<Derived> &bAndX)

Where Derived is the inner type used in the CRTP of MatrixBase. Here we accept anything derived from MatrixBase. This function also takes in var_value<Eigen::Matrix> types, and I'll update the docs to reflect that.
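For readers unfamiliar with the CRTP naming, a minimal sketch of the pattern follows. MatrixBaseSketch, DenseSketch, and count_elements are hypothetical names for this example only; Eigen's real MatrixBase is far richer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of the CRTP layout: the base class is parameterized
// on the concrete type that inherits from it.
template <typename Derived>
struct MatrixBaseSketch {
  const Derived& derived() const { return static_cast<const Derived&>(*this); }
  std::size_t size() const { return derived().rows() * derived().cols(); }
};

// A concrete type passes itself as the template argument of its base.
struct DenseSketch : MatrixBaseSketch<DenseSketch> {
  std::size_t rows() const { return 3; }
  std::size_t cols() const { return 2; }
};

// Accepts anything derived from MatrixBaseSketch; `Derived` is deduced as the
// concrete type, in the same spirit as solveInPlace(MatrixBase<Derived>&).
template <typename Derived>
std::size_t count_elements(const MatrixBaseSketch<Derived>& m) {
  return m.size();
}

// Usage: Derived is deduced as DenseSketch.
inline void example_usage() {
  DenseSketch d;
  assert(count_elements(d) == 6);
}
```

So "Derived" names the concrete expression type, while a parameter called Mat or EigMat names whatever the function template accepts as a whole.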

* factor, if number of rows is less than the number of columns,
* if there are 0 columns, or if any element in matrix is NaN
*/
template <typename StdVec, require_std_vector_t<StdVec>* = nullptr>
Member

[optional]
I'd rather keep shorter template parameters, like just V for standard vectors or maybe C for generic containers.

Collaborator Author

For very generic functions I like using V or T etc. but for functions with specific requirements I like that the template parameter's name gives an idea of what the requirement is to use the function.

void check_cholesky_factor(const char* function, const char* name,
const StdVec& y) {
for (size_t i = 0; i < y.size(); ++i) {
check_cholesky_factor(function, internal::make_iter_name(name, i).c_str(),
Member

This needs a performance evaluation as it's going to proactively create string names for each entry, which is pretty expensive.

[question]
Is there a way to be lazy and avoid creating the name until the check fails?

Collaborator Author

I'm fine with running a little performance check here, I def expect it to be slower for small matrices though hopefully not much.

> Is there a way to be lazy and avoid creating the name until the check fails?

I tried thinking about this but the only thing I could figure out is to change all of the checks to take in a lambda instead of a const char* that doesn't evaluate until a throw occurs.
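A rough sketch of that lambda idea, under the assumption that a check accepts any callable returning the name. check_positive_lazy is a hypothetical function, not part of Stan Math.

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical check that receives a callable instead of a const char*.
// `make_name` is invoked only on the failure path.
template <typename NameFn>
void check_positive_lazy(const char* function, NameFn&& make_name, double y) {
  if (!(y > 0)) {
    std::ostringstream msg;
    msg << function << ": " << make_name() << " is " << y
        << ", but must be positive";
    throw std::domain_error(msg.str());
  }
}

// Usage: the counter shows the name is built only when the check fails.
inline void example_usage() {
  int calls = 0;
  auto name = [&calls] {
    ++calls;
    return std::string("y[3]");
  };
  check_positive_lazy("model", name, 1.0);
  assert(calls == 0);  // passing check: no string was constructed
  try {
    check_positive_lazy("model", name, -1.0);
  } catch (const std::domain_error&) {
  }
  assert(calls == 1);  // failing check: name built exactly once
}
```

The cost is that every check signature changes from const char* to a template parameter, which is the rewrite this comment is wary of.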

void check_cholesky_factor_corr(const char* function, const char* name,
const StdVec& y) {
for (size_t i = 0; i < y.size(); ++i) {
check_cholesky_factor_corr(function,
Member

same question for all of these and efficiency.

}();
}
}
} else {
Member

[question]
What happens with things like arrays of matrices? In that case, there are more than two indexes.

Collaborator Author

So this function is templated by

template <typename T_y, typename T_high,
          require_all_matrix_t<T_y, T_high>* = nullptr>

So it requires that the template arguments are either Eigen matrix types or var_value<MatrixXd> types. The overloads below, templated with requires for std_vectors, are the ones that loop over arrays (and arrays of matrices etc.).
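As a sketch of how this kind of overload selection works, the example below rebuilds the `require_*_t<T>* = nullptr` idiom with standard-library traits only. The trait and alias names ending in _sketch are hypothetical; Stan Math's own require_* aliases are defined differently.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for std::vector specializations.
template <typename T>
struct is_std_vector_sketch : std::false_type {};
template <typename T, typename A>
struct is_std_vector_sketch<std::vector<T, A>> : std::true_type {};

// Hypothetical equivalents of the `require_*_t<T>* = nullptr` idiom: the
// alias is void when the trait holds and a substitution failure otherwise.
template <typename T>
using require_std_vector_sketch_t
    = std::enable_if_t<is_std_vector_sketch<std::decay_t<T>>::value>;
template <typename T>
using require_arithmetic_sketch_t
    = std::enable_if_t<std::is_arithmetic<std::decay_t<T>>::value>;

// Selected for scalars.
template <typename T, require_arithmetic_sketch_t<T>* = nullptr>
int nesting_depth(const T&) {
  return 0;
}

// Selected for std::vector; recurses into the element type, so arrays of
// arrays are handled by the same overload.
template <typename T, require_std_vector_sketch_t<T>* = nullptr>
int nesting_depth(const T&) {
  return 1 + nesting_depth(typename T::value_type{});
}

// Usage: each argument matches exactly one overload.
inline void example_usage() {
  assert(nesting_depth(1.0) == 0);
  assert(nesting_depth(std::vector<double>{}) == 1);
  assert(nesting_depth(std::vector<std::vector<double>>{}) == 2);
}
```

A type that satisfies neither requirement has no viable overload, which is why an unsupported argument fails at compile time rather than at run time.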

}

/**
* Check if each element of <code>y</code> is strictly less than each associated
Member

[question]
Is this strictly less or less than or equal? I'm confused about what's being checked in these. For our constraints, all the checks should be less than or equal in order to deal with rounding/underflow errors.

Collaborator Author

So we have check_less and check_less_or_equal, the latter of which works well with the rounding/underflow errors. There are other parts of Stan Math which use check_less; should those be changed to check_less_or_equal? I think we would want to do that in a separate PR.
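To make the strict versus non-strict distinction concrete, here is a small sketch with hypothetical scalar checks. The names and messages are assumptions for this example; only the comparison operator differs between the two.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical strict check: a value equal to the bound is rejected.
inline void check_less_sketch(double y, double high) {
  if (!(y < high)) {
    throw std::domain_error("y must be less than high");
  }
}

// Hypothetical non-strict check: a value equal to the bound is accepted,
// which tolerates a result that rounds onto the boundary.
inline void check_less_or_equal_sketch(double y, double high) {
  if (!(y <= high)) {
    throw std::domain_error("y must be less than or equal to high");
  }
}

// Usage: the boundary value passes only the non-strict check.
inline void example_usage() {
  check_less_or_equal_sketch(1.0, 1.0);
  bool threw = false;
  try {
    check_less_sketch(1.0, 1.0);
  } catch (const std::domain_error&) {
    threw = true;
  }
  assert(threw);
}
```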

* @tparam C Eigen column type, either 1 if we have a column vector
* or -1 if we have a row vector. Moreover, we either have
* R = 1 and C = -1 or R = -1 and C = 1.
* @tparam T A type inheriting from EigenBase.
Member

[optional]
Only sentences should have periods after them. Sorry the doc's already so inconsistent.

@@ -48,6 +52,26 @@ void check_unit_vector(const char* function, const char* name,
}
}

/**
* Check if the each element in a standard vector is a unit vector.
Member

Should say "unit Euclidean length" to specify what unit vector means here. "unit_vector" is a type in the Stan language.
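As a sketch of what "unit Euclidean length" means as a test, the function below checks that the squared norm is within a tolerance of 1. The name has_unit_length and the default tolerance are assumptions for this example and may differ from what check_unit_vector uses.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Returns true when the squared Euclidean norm of `x` is within `tol` of 1.
// The default tolerance is an assumption made for this sketch.
inline bool has_unit_length(const std::vector<double>& x, double tol = 1e-8) {
  double ssq = 0.0;
  for (double v : x) {
    ssq += v * v;
  }
  // A NaN element makes the comparison false, so NaN input is rejected.
  return std::fabs(1.0 - ssq) <= tol;
}

// Usage: (0.6, 0.8) has length 1, (1, 1) has length sqrt(2).
inline void example_usage() {
  assert(has_unit_length({0.6, 0.8}));
  assert(!has_unit_length({1.0, 1.0}));
}
```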

@stan-buildbot
Contributor


Name                                                        Old Result  New Result  Ratio  Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan                              3.02        3.05        0.99   -0.99% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan                  0.02        0.02        0.92   -8.9% slower
eight_schools/eight_schools.stan                            0.11        0.1         1.05   4.34% faster
gp_regr/gp_regr.stan                                        0.16        0.16        1.0    -0.13% slower
irt_2pl/irt_2pl.stan                                        5.82        5.84        1.0    -0.27% slower
performance.compilation                                     90.34       87.87       1.03   2.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan  8.58        8.4         1.02   2.15% faster
pkpd/one_comp_mm_elim_abs.stan                              29.4        30.22       0.97   -2.79% slower
sir/sir.stan                                                129.64      131.95      0.98   -1.78% slower
gp_regr/gen_gp_data.stan                                    0.03        0.04        0.97   -3.52% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan                    2.98        2.91        1.02   2.38% faster
pkpd/sim_one_comp_mm_elim_abs.stan                          0.39        0.41        0.94   -6.06% slower
arK/arK.stan                                                1.86        1.87        0.99   -0.56% slower
arma/arma.stan                                              0.83        0.78        1.07   6.28% faster
garch/garch.stan                                            0.54        0.56        0.96   -4.16% slower

Mean result: 0.994001642229

Commit hash: 87b15e5

@SteveBronder
Collaborator Author

@bob-carpenter fixed up everything so this is ready for another look

@SteveBronder
Collaborator Author

@bob-carpenter actually hang on I ran the performance checks and there's a regression I need to fix

@SteveBronder
Collaborator Author

@bob-carpenter Alright so I think I have everything sorted out. I made a benchmark here for the vector case and the results seem good. The number in check_ge/*/manual_time is the size of the vector we are testing and Time is the manual benchmark time that doesn't include the construction of the std::vector<> for each test. For sizes 2..4096 this PR and develop are within a few nanoseconds of each other.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          26.7 ns         57.1 ns     26208169
check_ge/4/manual_time          26.8 ns         57.1 ns     26002894
check_ge/8/manual_time          27.6 ns         57.9 ns     25198473
check_ge/16/manual_time         30.1 ns         60.4 ns     23274275
check_ge/32/manual_time         37.4 ns         67.6 ns     18707376
check_ge/64/manual_time         62.7 ns         93.0 ns     11294734
check_ge/128/manual_time        91.9 ns          122 ns      7793109
check_ge/256/manual_time         157 ns          187 ns      4503566
check_ge/512/manual_time         271 ns          301 ns      2528503
check_ge/1024/manual_time        521 ns          552 ns      1232462
check_ge/2048/manual_time       1007 ns         1037 ns       679082
check_ge/4096/manual_time       1978 ns         2008 ns       354471

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          24.8 ns         55.1 ns     28259239
check_ge/4/manual_time          25.9 ns         56.2 ns     26950003
check_ge/8/manual_time          27.6 ns         57.9 ns     25538389
check_ge/16/manual_time         30.4 ns         60.8 ns     22979588
check_ge/32/manual_time         37.1 ns         67.4 ns     18895034
check_ge/64/manual_time         58.5 ns         88.7 ns     11736093
check_ge/128/manual_time        91.4 ns          122 ns      7773196
check_ge/256/manual_time         151 ns          181 ns      4606845
check_ge/512/manual_time         261 ns          291 ns      2675818
check_ge/1024/manual_time        506 ns          536 ns      1348204
check_ge/2048/manual_time       1019 ns         1049 ns       694327
check_ge/4096/manual_time       1968 ns         1998 ns       356119
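The "manual time" idea, timing only the check and not the input construction, can be sketched with std::chrono alone. This is not the benchmark linked above (which reports Google Benchmark output); time_check_ns is a hypothetical helper, and a single pass like this is far noisier than a real benchmark harness.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Times only the comparison loop; the inputs are built before the clock
// starts. Returns elapsed nanoseconds for one pass over `n` elements, or a
// negative value if the comparison unexpectedly fails.
inline double time_check_ns(std::size_t n) {
  std::vector<double> y(n, 2.0);    // setup, excluded from the timing
  std::vector<double> low(n, 1.0);  // setup, excluded from the timing

  const auto start = std::chrono::steady_clock::now();
  bool all_ok = true;
  for (std::size_t i = 0; i < n; ++i) {
    all_ok = all_ok && (y[i] >= low[i]);
  }
  const auto stop = std::chrono::steady_clock::now();

  // The return value depends on the loop result, so the loop has an
  // observable effect.
  if (!all_ok) {
    return -1.0;
  }
  return std::chrono::duration<double, std::nano>(stop - start).count();
}

// Usage: the elapsed time is non-negative for a passing check.
inline void example_usage() {
  assert(time_check_ns(1024) >= 0.0);
}
```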

But what should I benchmark against for the ones that have internal::make_iter_name(name, i).c_str()? For those right now the compiler just returns stuff like y[sym32__][1] where this PR actually returns back the index number like y[1][1], y[2][1], etc. So if we don't do make_iter_name() that's fine but then are we fine with error checks that just report that y[sym32__][1]?

To benchmark this right now I'm doing this for the current PR, where we use the vectorized version, and something like what stanc3 does here for develop, where we just loop over x_vec and y_vec.

There's def a big cost for constructing these names correctly. If we're not cool with paying that cost then I'm fine with just reporting y. I think a big part of the cost here is making the std::string for appending arithmetic types to the const char* we pass in, but then just getting the c_str() from the string. That means a lot of the temporary strings we have actually end up getting copied a bunch.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time           113 ns          144 ns      6226928
check_ge/4/manual_time           189 ns          219 ns      3674070
check_ge/8/manual_time           348 ns          378 ns      2010529
check_ge/16/manual_time          738 ns          768 ns       944092
check_ge/32/manual_time         1746 ns         1776 ns       390178
check_ge/64/manual_time         6281 ns         6311 ns       111164
check_ge/128/manual_time       19204 ns        19230 ns        36417
check_ge/256/manual_time       65290 ns        65314 ns        11477
check_ge/512/manual_time      226390 ns       226394 ns         2985
check_ge/1024/manual_time    1387108 ns      1387246 ns          511
check_ge/2048/manual_time    5066681 ns      5065742 ns          100
check_ge/4096/manual_time   20200250 ns     20195097 ns           37

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          27.2 ns         57.5 ns     25806638
check_ge/4/manual_time          33.3 ns         63.6 ns     21105640
check_ge/8/manual_time          68.0 ns         98.3 ns     10509997
check_ge/16/manual_time          174 ns          205 ns      4141638
check_ge/32/manual_time          775 ns          805 ns       941074
check_ge/64/manual_time         2572 ns         2602 ns       273127
check_ge/128/manual_time        9267 ns         9295 ns        76701
check_ge/256/manual_time       37915 ns        37944 ns        18639
check_ge/512/manual_time      144437 ns       144464 ns         4980
check_ge/1024/manual_time    1282011 ns      1282044 ns          543
check_ge/2048/manual_time    4706496 ns      4706217 ns          149
check_ge/4096/manual_time   18116610 ns     18116080 ns           38

@rok-cesnovar
Member

> For those right now the compiler just returns stuff like y[sym32__][1] where this PR actually returns back the index number like y[1][1], y[2][1], etc. So if we don't do make_iter_name() that's fine but then are we fine with error checks that just report that y[sym32__][1]?

So this PR would then also fix stan-dev/stanc3#676 is what you are saying? And that fix comes with a performance penalty? Not sure I completely follow.

@SteveBronder
Collaborator Author

Yes it also does that fix (once these are used in the compiler). The performance penalty comes from constructing the string for the index number.

@stan-buildbot
Contributor


Name                                                        Old Result  New Result  Ratio  Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan                              2.85        2.95        0.97   -3.58% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan                  0.02        0.02        0.9    -11.09% slower
eight_schools/eight_schools.stan                            0.11        0.1         1.01   0.79% faster
gp_regr/gp_regr.stan                                        0.16        0.15        1.01   1.03% faster
irt_2pl/irt_2pl.stan                                        5.81        5.84        0.99   -0.61% slower
performance.compilation                                     87.32       86.52       1.01   0.92% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan  8.59        8.47        1.01   1.4% faster
pkpd/one_comp_mm_elim_abs.stan                              30.67       29.66       1.03   3.29% faster
sir/sir.stan                                                140.03      127.18      1.1    9.17% faster
gp_regr/gen_gp_data.stan                                    0.03        0.03        0.98   -2.19% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan                    3.0         2.95        1.02   1.75% faster
pkpd/sim_one_comp_mm_elim_abs.stan                          0.39        0.41        0.94   -6.32% slower
arK/arK.stan                                                2.5         1.85        1.35   25.92% faster
arma/arma.stan                                              0.75        0.85        0.89   -12.46% slower
garch/garch.stan                                            0.67        0.56        1.2    16.87% faster

Mean result: 1.02769815462

Commit hash: f96a182

@SteveBronder
Collaborator Author

@bob-carpenter this is ready for another look (and if you can look at the perf tests of the above)

@bob-carpenter
Member

bob-carpenter commented Aug 19, 2021

Interesting---so this isn't introducing any performance regressions in our current code? Should we run that again to be sure? I don't think we'd be willing to take a performance regression to check indices. Or is the code we'd be replacing also inefficient?

I may be wrong here (and would very much like to be), but the code pattern seems to be something like this:

for (...) {
  string msg = ... construct string "var[idx]" ...
  if (bad condition)
    throw(msg);
}

That is, the message string gets constructed eagerly. If that's not the case, then no worries on this PR and I just misunderstood.

Rather than the above, it's more efficient to use this pattern:

for (...) {
  if (bad condition) {
    string msg = ... construct string "var[idx]" ...;
    throw(msg);
  }
}

The problem is passing the indexes down to the embedded check. I think the way to do that in this code would be to have the check functions take in zero or more indexes which would get appended to the names if there are errors. It'd require some fast footwork to do it in general with variadic last arguments consisting of the list of current indices. That way, you don't ever need to construct a string until there's an error and the string construction loop gets unfolded without any redundant copying.
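One way the variadic-index idea could look, as a sketch under stated assumptions: check_ge_idx and append_indices are hypothetical names, indices are reported 1-based, and only double and nested std::vector inputs are handled. It is not the code that ended up in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Appends "[i]" for each index in the pack, reporting indices 1-based.
inline void append_indices(std::ostringstream&) {}
template <typename... Rest>
void append_indices(std::ostringstream& os, std::size_t idx, Rest... rest) {
  os << "[" << (idx + 1) << "]";
  append_indices(os, rest...);
}

// Scalar case: no string work happens unless the comparison fails.
template <typename... Idxs>
void check_ge_idx(const char* function, const char* name, double y, double low,
                  Idxs... idxs) {
  if (!(y >= low)) {
    std::ostringstream msg;
    msg << function << ": " << name;
    append_indices(msg, idxs...);
    msg << " is " << y << ", but must be greater than or equal to " << low;
    throw std::domain_error(msg.str());
  }
}

// Container case: forwards the indices seen so far plus the current one.
template <typename T, typename... Idxs>
void check_ge_idx(const char* function, const char* name,
                  const std::vector<T>& y, double low, Idxs... idxs) {
  for (std::size_t i = 0; i < y.size(); ++i) {
    check_ge_idx(function, name, y[i], low, idxs..., i);
  }
}

// Usage: the failing element is reported with its full index path.
inline void example_usage() {
  const std::vector<std::vector<double>> y{{1.0, 2.0}, {3.0, -1.0}};
  try {
    check_ge_idx("model", "y", y, 0.0);
    assert(false && "expected a domain_error");
  } catch (const std::domain_error& e) {
    assert(std::string(e.what()).find("y[2][2]") != std::string::npos);
  }
}
```

On the passing path the indices are only integers passed by value, so the loop does no allocation at all.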

Do you think that'd be feasible? Or worthwhile? If we really are checking eagerly, I think it'd be a huge speed win. And it'd explain why everyone's been wanting to remove checks from the code!

@SteveBronder
Collaborator Author

> Do you think that'd be feasible? Or worthwhile? If we really are checking eagerly, I think it'd be a huge speed win. And it'd explain why everyone's been wanting to remove checks from the code!

Yes that's a great idea!! With those changes the speed checks for std::vector<std::vector<double>> go down considerably to be a hair faster than our current checks.

This PR

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          28.7 ns         59.4 ns     25040768
check_ge/4/manual_time          34.0 ns         64.4 ns     20434993
check_ge/8/manual_time          68.9 ns         99.2 ns     10456872
check_ge/16/manual_time          187 ns          218 ns      3692283
check_ge/32/manual_time          787 ns          817 ns       872611
check_ge/64/manual_time         2615 ns         2645 ns       266462
check_ge/128/manual_time        9372 ns         9403 ns        74366
check_ge/256/manual_time       37738 ns        37763 ns        18659
check_ge/512/manual_time      140853 ns       140872 ns         4971
check_ge/1024/manual_time    1197653 ns      1197643 ns          584
check_ge/2048/manual_time    4220595 ns      4220120 ns          166
check_ge/4096/manual_time   16148627 ns     16147081 ns           43

Develop

--------------------------------------------------------------------
Benchmark                          Time             CPU   Iterations
--------------------------------------------------------------------
check_ge/2/manual_time          27.2 ns         57.5 ns     25806638
check_ge/4/manual_time          33.3 ns         63.6 ns     21105640
check_ge/8/manual_time          68.0 ns         98.3 ns     10509997
check_ge/16/manual_time          174 ns          205 ns      4141638
check_ge/32/manual_time          775 ns          805 ns       941074
check_ge/64/manual_time         2572 ns         2602 ns       273127
check_ge/128/manual_time        9267 ns         9295 ns        76701
check_ge/256/manual_time       37915 ns        37944 ns        18639
check_ge/512/manual_time      144437 ns       144464 ns         4980
check_ge/1024/manual_time    1282011 ns      1282044 ns          543
check_ge/2048/manual_time    4706496 ns      4706217 ns          149
check_ge/4096/manual_time   18116610 ns     18116080 ns           38

@SteveBronder
Collaborator Author

Though I only implemented that for the check_less/greater(_or_equal) functions. For the other functions here

  1. I think the checks inside of those checks are going to be more expensive than the string construction.
  2. Doing this for the rest of the functions in this PR would require a pretty darn big rewrite of our error handling, where pretty much everything takes in a parameter pack of indices.

I can do (2) but imo I'd rather do it in another PR since this PR is pretty darn big already. I think we can do quite a lot to actually speed up these checks. Like I think name in all of the error handlers should actually be a std::string. That would let us append to it for indices without constantly creating new const char* with c_str(). And we could then move those strings around if they are rvalues (which they almost always are)
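A small sketch of the std::string idea, assuming a helper that takes the name by value. with_index is a hypothetical name for this example, not an existing Stan Math function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: takes the name by value so rvalue arguments are moved
// in, then appends the 1-based index in place instead of building a new
// string and round-tripping through c_str().
inline std::string with_index(std::string name, std::size_t i) {
  name += '[';
  name += std::to_string(i + 1);
  name += ']';
  return name;  // a by-value parameter is implicitly moved on return
}

// Usage: chaining appends one index per nesting level.
inline void example_usage() {
  assert(with_index(with_index("y", 0), 4) == "y[1][5]");
}
```

Chained calls pass temporaries, so each level reuses the same buffer where capacity allows rather than copying the accumulated name.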

Member

@bob-carpenter bob-carpenter left a comment

Thanks. Especially for the extensive answers to all of the questions I had reading the code.

@stan-buildbot
Contributor


Name                                                        Old Result  New Result  Ratio  Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan                              2.86        2.89        0.99   -1.14% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan                  0.02        0.02        1.0    0.38% faster
eight_schools/eight_schools.stan                            0.1         0.11        0.94   -6.78% slower
gp_regr/gp_regr.stan                                        0.16        0.15        1.02   1.79% faster
irt_2pl/irt_2pl.stan                                        5.87        5.78        1.02   1.54% faster
performance.compilation                                     87.59       86.88       1.01   0.81% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan  8.72        8.52        1.02   2.36% faster
pkpd/one_comp_mm_elim_abs.stan                              30.03       29.47       1.02   1.86% faster
sir/sir.stan                                                123.22      126.58      0.97   -2.72% slower
gp_regr/gen_gp_data.stan                                    0.03        0.03        1.01   0.92% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan                    2.99        2.9         1.03   3.07% faster
pkpd/sim_one_comp_mm_elim_abs.stan                          0.41        0.4         1.04   3.57% faster
arK/arK.stan                                                1.86        1.85        1.0    0.49% faster
arma/arma.stan                                              0.82        0.91        0.9    -10.56% slower
garch/garch.stan                                            0.71        0.63        1.12   10.49% faster

Mean result: 1.00615684576

Commit hash: 31b69b6

@SteveBronder SteveBronder merged commit 7dd7b31 into develop Aug 20, 2021
@rok-cesnovar rok-cesnovar deleted the fix/check-less-greater branch October 4, 2021 17:12
5 participants