Conversation

bbbales2
Member

Summary

var<mat> versions for quad_form and trace_quad_form

Not the cleanest. Comments inline.

Release notes

var<mat> versions of quad_form and trace_quad_form

Checklist

  • Math issue: Make functions with custom autodiff var<mat> friendly (#2101)

  • Copyright holder: Columbia University

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

Member Author

@bbbales2 bbbales2 left a comment

Questions and comments

@@ -17,7 +17,7 @@ namespace math {
* @param y Matrix to test
* @throw <code>std::invalid_argument</code> if the matrix is not square
*/
template <typename T_y, require_eigen_t<T_y>* = nullptr>
template <typename T_y, require_matrix_t<T_y>* = nullptr>
Member Author

a_few_varmats has this (#2106), so this will get fixed when that merges

auto arena_A = to_arena(A);
auto arena_B = to_arena(B);

check_not_nan("multiply", "A", value_of(arena_A));
Member Author

Need to go through and make all the check functions work with var<mat> types. I'll make a list of the ones that need updating and add them to #2101. Here I just pass in the value.

Contributor

I think we should prefer passing values into check functions anyway. That way we don't need them to support vars. Also, checks on values will be faster than ones on Matrix<var>.

Member Author

Maybe this is an expression thing, again. Like, if we pass in a MatrixXd, then if we switch to values we have to do two full reads of the matrix to do a check. If we pass in a Matrix<var, R, C> it is one read. If we pass in an expression view of the values in a Matrix<var, R, C>, then it is still one read. Does this sound right?

Contributor

Yeah, but that is assuming the function calling the check does not need to extract values anyway. Most functions do need values.

Member Author

Most functions do need values.

That is true.

I guess if we just put an auto x_val = value_of_rec(x) inside all the checks, then that would be good? If someone passes in values, that is good. If someone passes in a mat<var>, it works via expressions.

Contributor

Yeah, it would work, but for mat<var> it would not be efficient. Also, what I am trying to say is that if we only pass in values, we don't need to change all the checks.

Member Author

Yeah, that makes sense. I don't want to accidentally make them slow.

I think our checks should work with var<mat> though. For the ones that read values, that just means adding a version that takes in a var<mat> and then calls itself with the value of the var<mat>.
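The forwarding pattern described here can be sketched with mock types; `var_matrix` and `check_positive` below are hypothetical stand-ins, not real Stan Math signatures. The value-based check does the work, and the var<mat> overload just delegates to it:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Mock stand-in for a var<mat>: values and adjoints stored as two
// contiguous arrays (struct-of-arrays), unlike Eigen::Matrix<var, ...>.
struct var_matrix {
  std::vector<double> val;
  std::vector<double> adj;
};

// Base check operates on plain values only.
inline void check_positive(const char* name, const std::vector<double>& x) {
  for (double v : x) {
    if (!(v > 0)) {
      throw std::domain_error(std::string(name) + " must be positive");
    }
  }
}

// Forwarding overload: the var<mat> version just re-dispatches on the
// values, so only the value-based implementation ever needs maintaining.
inline void check_positive(const char* name, const var_matrix& x) {
  check_positive(name, x.val);
}
```

A real version would be constrained with something like require_var_matrix_t, but the shape of the idea is the same.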

arena_res = (arena_res + arena_res.transpose()).eval();
}

return_t res = arena_res;
Member Author

arena_res is on the arena, so since return_t is a var<mat> it will just take the pointer arena_res has, right?

Contributor

Yes. arena_t<var<T>> is var<T>.

value_of(B).eval()), Mat1, Mat2>;

auto arena_A = to_arena(A);
auto arena_B = to_arena(B);
Member Author

If these matrices are either var<mat>s or just regular mats then this code is fine. If one of these is a mat<var>, though, then it's not as fine, because I access values and adjoints repeatedly. If the vals/adjoints are accessed more than once, it would be more economical to just copy to a var<mat>.

So my question is: should we just be copying to var<mat> by default in all these functions?

Contributor

How about you do something like:

const auto& A_adj = to_ref(arena_A)

Also it would be nice if we only copied these matrices to arena if we actually needed them in the reverse pass.

Member Author

Well, I'm not worried about evaluating expressions as much as converting things to var<mat> land so I can be lazy.

I checked and all the copying here is fine if inputs are a mix of mat<double> and var<mat>. The A inputs and the B inputs are required for both reverse passes. It's not so good for mat<var>.

Contributor

I see. to_arena does not convert to var<mat>, so there is still a need to handle mat<var> efficiently.

Member Author

I guess the solution is to just add a version of to_var_value (https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/to_var_value.hpp) that just forwards things through if it is given a var_value.
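A sketch of that pass-through overload, with toy types (this `var_value` and `matrix_of_var` are mocks for illustration, not the real types in to_var_value.hpp):

```cpp
#include <cassert>

// Toy stand-in for stan::math::var_value.
template <typename T>
struct var_value {
  T val;
};

// Toy stand-in for an Eigen matrix of vars.
struct matrix_of_var {
  double v;
};

// Converting overload: stand-in for the existing to_var_value conversion.
inline var_value<double> to_var_value(const matrix_of_var& m) {
  return {m.v};
}

// Proposed pass-through overload: if the input is already a var_value,
// forward it unchanged instead of converting.
template <typename T>
inline const var_value<T>& to_var_value(const var_value<T>& x) {
  return x;
}
```

With this, callers can apply to_var_value unconditionally and only pay for a conversion when one is actually needed.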

Member Author

"I think at a very high level" as a precursor to a sentence sounds like the most arrogant thing ever said. It is also very cringe on its own. I just meant "I'm not sure but", but it didn't come out that way lol.

Collaborator

Kind of interesting, I did a search on GitHub for quad_form in Stan programs

https://github.com/search?l=&p=5&q=quad_form+language%3AStan&type=Code

It looks like the most used case is having data as the second argument. Mostly just something to note. (aka I'd rather not spend too much time optimizing this part)

If these matrices are either var<mat> s or just regular mat s then this code is fine. If one of these is a mat<var> though then it's not as fine because I access values and adjoints repeatedly. If the vals/adjoints are accessed more than once, it would be more economical to just copy to a var<mat>.

I agree with Tadej that mat<var> -> var<mat> is v expensive. For this func it may be fine since we do a lot of mat muls

I also don't like to_var_value_if, if we want a var_value from a mat<var> we could just do

var_value<promote_scalar_t<double, Mat1>>

imo I'm not a huge fan of the do_something_if functions with templates. I personally label them as "easy to write, hard to read". They're also worse for small N (doing almost nothing is different from doing nothing, which is what you get by pulling out the logic).

I went through and moved all the var/double logic out.

https://github.com/stan-dev/math/compare/feature-pr2164/quad_form

The other benefit of that is we don't have to use forward_as

Member Author

I also don't like to_var_value_if, if we want a var_value from a mat<var> we could just do
var_value<promote_scalar_t<double, Mat1>>

I don't think this constructor exists though, so we have to call a conversion somehow.

For this func it may be fine since we do a lot of mat muls

I agree the conversion is more expensive than not doing the conversion in ideal cases, but the comparison is between doing the conversion, calling value_of like a ton of extra times (and there are like 9-10 of these in the implementations here), or writing special versions that support combos of mat<var> and var<mat> (and selectively copying out values).

The conversion is like 5 extra read ops and we do 3 every time we do an uncached .val() or .adj(). So I'd much rather be lazy on this one. And then I want to avoid writing three versions of the function to support the argument combos.

I went through and moved all the var/double logic out.

Hmm, I think what I wrote should be efficient for var<mat> and mat<double>. I prefer the shorter code, but if there is an efficiency problem I can change it.

Member Author

Also, that search is a neat trick.

Member Author

@bbbales2 bbbales2 Oct 27, 2020

For the 3 read ops to do a .val() on a mat<var>, one of those is pointer chasing, and I think that will be way more expensive than loading up .val() three times from a var<mat>.

Edit: changed mat<var> to var<mat>... Got this exactly backwards on the first try lol.
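The layout difference behind that pointer-chasing cost can be illustrated with toy AoS/SoA structs (a mock `vari`, not Stan Math's): reading values out of the pointer-per-element layout dereferences once per entry, while the struct-of-arrays layout is one contiguous scan.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// AoS, like mat<var>: each element points at its own (value, adjoint) pair.
struct vari {
  double val;
  double adj;
};
using mat_of_var = std::vector<vari*>;

// SoA, like var<mat>: all values together, all adjoints together.
struct var_of_mat {
  std::vector<double> val;
  std::vector<double> adj;
};

// One pointer dereference per element read.
inline double sum_vals(const mat_of_var& m) {
  double s = 0.0;
  for (const vari* v : m) s += v->val;
  return s;
}

// Contiguous read, no indirection.
inline double sum_vals(const var_of_mat& m) {
  return std::accumulate(m.val.begin(), m.val.end(), 0.0);
}
```

Both compute the same sum; the difference is purely in memory access patterns, which is what the read-op accounting above is getting at.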

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.13 3.09 1.01 1.23% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -1.36% slower
eight_schools/eight_schools.stan 0.12 0.12 1.02 1.53% faster
gp_regr/gp_regr.stan 0.18 0.18 0.98 -1.7% slower
irt_2pl/irt_2pl.stan 5.71 5.7 1.0 0.14% faster
performance.compilation 91.11 88.42 1.03 2.95% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.44 8.44 1.0 -0.02% slower
pkpd/one_comp_mm_elim_abs.stan 28.91 29.77 0.97 -2.96% slower
sir/sir.stan 133.67 136.1 0.98 -1.82% slower
gp_regr/gen_gp_data.stan 0.04 0.05 0.98 -1.62% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.95 1.0 0.05% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.37 1.02 1.48% faster
arK/arK.stan 1.8 1.8 1.0 -0.29% slower
arma/arma.stan 0.62 0.61 1.01 0.93% faster
garch/garch.stan 0.7 0.7 1.0 0.0% slower
Mean result: 0.999259235198

Jenkins Console Log
Blue Ocean
Commit hash: db2b5c9


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Member Author

@bbbales2 bbbales2 left a comment

Comments

inline var trace_quad_form(const Mat1& A, const Mat2& B) {
check_square("trace_quad_form", "A", A);
check_multiplicable("trace_quad_form", "A", A, "B", B);

arena_t<Mat1> arena_A = A;
arena_t<Mat2> arena_B = B;
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
Member Author

@bbbales2 bbbales2 Oct 30, 2020

I ended up going with to_var_value_if as the lazy support for mixed var<mat> and mat<var>.

The other thing I tried was templating out the functions but that is confusing in its own way. It looks like:

template <typename Mat1, typename Mat2,
          require_var_matrix_t<Mat1>* = nullptr,
          require_eigen_st<is_var, Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
  return trace_quad_form(A, to_var_value(B));
}

template <typename Mat1, typename Mat2,
          require_eigen_st<is_var, Mat1>* = nullptr,
          require_var_matrix_t<Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
  return trace_quad_form(to_var_value(A), B);
}

template <typename Mat1, typename Mat2,
          require_all_matrix_t<Mat1, Mat2>* = nullptr,
          require_any_var_matrix_t<Mat1, Mat2>* = nullptr,
          require_all_not_eigen_st<is_var, Mat1, Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
 ...
}

Edit: updated code

Collaborator

I'm not following why they both have to be coerced to var matrices.

Member Author

The other options are:

  1. Do not support mixes of mat<var> and var<mat>. I do not like that because we will have programs with both types (and so we could end up with mixed arguments)

  2. Write three different specializations. I do not like that because we found out optimizing for mat<var> is trickier than var<mat>. I do not think this is particularly necessary cuz this will be faster than not having autodiff.

Collaborator

Or just don't move the matrix<var> to a var<matrix>. I think Tadej and I both prefer just leaving them instead of doing the conversion. In another comment I have a branch with one quad_form that handles all the cases of vars with one function

feature/quad_form...feature/quad_form-pullout-logic

Collaborator

I'm going to sit down one day and figure out this whole val() and val_op() thing. Something something const correctness

Collaborator

The furthest I got was that it's something to do with Eigen's sizeof logic; there's a little more discussion in this Stack Overflow thread. The error doesn't occur with float, only with double. I found a modification that fixed the compilation, but it's hard to know whether that would affect other behaviour. Tricky.

Member Author

In another comment I have branch with one quad_form that handles all the cases of vars with one function

I was trying to avoid touching the pure mat<var> implementation since we decided to wait on benchmarks before we did that (which is why we decided not to merge #2019)

I think Tadej and I both prefer just leaving them

Yeah I saw the comments and I made a response where I argued the case that the conversion here was better: #2164 (comment)

Since I didn't see a response to that, I just went with the conversion to move this along.

I'm not on the same page with stuff like:

arena_t<promote_scalar_t<var, Mat1>> arena_A(A);
arena_t<promote_scalar_t<var, Mat2>> arena_B(B);
return_t res(arena_B.val_op().transpose() * arena_A.val_op() * arena_B.val_op());

Like up until this point I've been assuming if we read the .val() of a mat<var> more than once, we should make a copy of it on the arena.

Since this function does a lot of .val() calls on A and B, my understanding is that we need to make arena copies of the values, otherwise we're going to pointer-chase for every .val() that happens on a mat<var>, and it's the pointer chasing that is slow. I can just ditch the to_var_value stuff (it's gonna be faster than no autodiff either way) but I thought I made a pretty good argument back here: #2164 (comment).

Member Author

This is the big Q btw. Can do the control flow thing sure.


var res
= (value_of(arena_B).transpose() * value_of(arena_A) * value_of(arena_B))
.trace();

reverse_pass_callback([arena_A, arena_B, res]() mutable {
if (!is_constant<Mat1>::value)
forward_as<promote_scalar_t<var, Mat1>>(arena_A).adj()
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
Member Author

I think we should redo promote_var_matrix_t in another pull request.

I think it should just be promote_var_matrix_t<T> returns the var_mat with the same shape as T.

I think our current design of promote_var_matrix_t would be useful when writing two argument functions that took all combinations of var<mat> and mat<var>.

It seems now we are writing functions where the signatures are all separate:

One function does (mat<var>, mat<var>) and returns mat<var>, and then another function handles (var<mat>, mat<var>), (mat<var>, var<mat>), (var<mat>, var<mat>) and returns a var<mat> type.

So it's easy enough from the separation of functions to know what the return types should be, and so promote_var_matrix_t can simplify.
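The simplification proposed here could look roughly like this one-argument trait (toy `var_value` and hypothetical spelling, not the current promote_var_matrix_t):

```cpp
#include <type_traits>

// Toy stand-in for stan::math::var_value.
template <typename T>
struct var_value {
  using value_type = T;
};

template <typename T>
struct is_var_value : std::false_type {};
template <typename T>
struct is_var_value<var_value<T>> : std::true_type {};

// One-argument promotion: wrap T in var_value unless it already is one,
// which sidesteps nesting into var_value<var_value<...>>.
template <typename T>
using promote_var_matrix_t =
    typename std::conditional<is_var_value<T>::value, T, var_value<T>>::type;
```

The pass-through specialization is what distinguishes this from a plain var_value<T> spelling, which would double-wrap.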

Collaborator

I think what you're seeing here is moreso because of the pattern you're using. imo I'd rather just not use this pattern

Member Author

I think I could just do var_value<Mat1> since you mentioned it elsewhere.

Member Author

Oh wait, that doesn't work cuz we might end up with a var_value<var_value<...>>.

Anyway, you're right it's a result of the pattern so it's kinda downstream of the rest of the comments.

@bbbales2
Member Author

bbbales2 commented Nov 4, 2020

@rok-cesnovar @t4c1 @SteveBronder do we have an OpenCL machine I can log into to test this? I wasn't able to get OpenCL running on my local computer.

@rok-cesnovar
Member

Not really. I can give you access to our lab machine if needed, but I am not sure why quad_form or trace_quad_form would cause GLMs to fail. Maybe restart to double check?

@bbbales2
Member Author

bbbales2 commented Nov 4, 2020

The test is test/unit/math/opencl/rev/normal_id_glm_lpdf_test. Not obvious to me either why it would fail, but it could be something sneaky. Mind giving that a run real quick in this branch?

@rok-cesnovar
Member

Sure thing. Will run in about an hour or so.

@rok-cesnovar
Member

That hour was long.

Runs fine locally.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.15 3.04 1.04 3.39% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -0.9% slower
eight_schools/eight_schools.stan 0.12 0.12 0.98 -1.93% slower
gp_regr/gp_regr.stan 0.17 0.17 1.03 2.49% faster
irt_2pl/irt_2pl.stan 5.69 5.63 1.01 1.11% faster
performance.compilation 89.82 87.2 1.03 2.92% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.49 8.48 1.0 0.1% faster
pkpd/one_comp_mm_elim_abs.stan 31.28 29.37 1.06 6.09% faster
sir/sir.stan 134.54 134.04 1.0 0.37% faster
gp_regr/gen_gp_data.stan 0.04 0.04 1.02 1.88% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 2.95 1.0 0.09% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.37 1.03 2.89% faster
arK/arK.stan 1.79 1.79 1.0 -0.38% slower
arma/arma.stan 0.6 0.6 0.99 -0.82% slower
garch/garch.stan 0.76 0.74 1.02 1.57% faster
Mean result: 1.01316112208

Jenkins Console Log
Blue Ocean
Commit hash: 38eb76d


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2
Member Author

bbbales2 commented Nov 5, 2020

@SteveBronder ready for review!

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.01 3.35 0.9 -11.48% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.97 -2.76% slower
eight_schools/eight_schools.stan 0.12 0.12 0.99 -1.46% slower
gp_regr/gp_regr.stan 0.17 0.17 0.99 -1.42% slower
irt_2pl/irt_2pl.stan 5.67 5.72 0.99 -0.86% slower
performance.compilation 87.83 85.43 1.03 2.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.5 8.48 1.0 0.28% faster
pkpd/one_comp_mm_elim_abs.stan 29.43 29.12 1.01 1.04% faster
sir/sir.stan 135.42 133.66 1.01 1.3% faster
gp_regr/gen_gp_data.stan 0.04 0.04 0.99 -0.76% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 2.97 1.0 -0.46% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.4 0.96 -3.68% slower
arK/arK.stan 1.76 1.77 0.99 -0.67% slower
arma/arma.stan 0.6 0.6 1.01 0.91% faster
garch/garch.stan 0.74 0.74 1.0 0.05% faster
Mean result: 0.98954762523

Jenkins Console Log
Blue Ocean
Commit hash: 484eb85


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Collaborator

@SteveBronder SteveBronder left a comment

Couple fixes. Most of it really boils down to this: I think a lot of the problems you are seeing come from the pattern where we use forward_as and promote a matrix of var to a var matrix. I don't see why we need to do that promotion, and I think breaking out the control flow logic will make those problems go away.

Comment on lines 126 to 127
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
auto arena_B = to_arena(to_var_value_if<!is_constant<Mat2>::value>(B));
Collaborator

I'm confused about the logic here. Why does this need to move over to a var_value conditionally, or even at all? imo I'd rather have us use the scheme of moving the control flow up because it's a lot easier to read through.

Member Author

If it is a mat<var>, then it needs to be converted to a var<mat>; otherwise it passes through.

The alternative to these to_var_value_if statements is more function definitions and requires. Yes, to_var_value_if and friends are confusing, but I do not think this is more confusing than:

require_eigen_st<is_var, Mat1>* = nullptr,
require_var_matrix_t<Mat2>* = nullptr

which is something like what you'd need to add to the functions to specify that one of the things is an Eigen-like matrix with a var and the other thing is a var matrix.

The result of the to_var_value_if, regardless of what it is, needs to end up in the arena cause we use it in the reverse pass (hence everything is wrapped in a to_arena).

Collaborator

👏 Pull 👏 Out 👏 The 👏 Control 👏 Flow 👏

All of these problems go away if we just pull the control flow out into the function. There are a lot of good reasons to prefer pulling out control flow:

  1. Each path from the start to the return is linear. This makes it a lot easier to read and I don't have to go through each line and think, "Well what if X is var and y is not or x is var and y is var and....". In a block of the control flow I'm aware of which is which.

  2. Removes the need for widgets like forward_as<>() and more type trait layers like to_*_if<>(). All the paths do different things and work on different types. Yinz keep saying this leads to code duplication but I disagree! The code is different in each path; it makes the code more verbose, which, imo, is very different from duplication.

  3. Uses value_of() less (which requires making a holder<> class for Matrix<var>)

  4. Adding additional control flow is less annoying. Like if we want to do the .noalias() optimization for var<mat> like

      if (is_var_matrix<Mat1>::value) {
        arena_A.adj().noalias() += arena_B * (res.adj() * arena_B.transpose());
      } else {
        arena_A.adj() += arena_B * (res.adj() * arena_B.transpose());
      }

When we bunch the control flow together with the code this turns into

      if (!is_constant<Mat1>::value) {
        if (is_var_matrix<Mat1>) {
          forward_as<arena_t<promote_scalar_t<var, Mat1>>>(arena_A).adj().noalias()
              += value_of(arena_B) * C_adj_B_t;
        } else {
          forward_as<arena_t<promote_scalar_t<var, Mat1>>>(arena_A).adj()
              += value_of(arena_B) * C_adj_B_t;
        }
      }

That's a mouthful!

I have an example of breaking out the control flow for quad_form below. imo, way more clear! Yeah there's more literal lines of code, but like that's fine to me because I think it's simpler and easier to read.

feature/quad_form...feature/quad_form-pullout-logic
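For point 4 specifically, one middle ground is C++17 `if constexpr` (a sketch with mock types, and assuming C++17 is available): the untaken branch is discarded at compile time, so the constant path never needs an `.adj` member and no `forward_as` is required.

```cpp
#include <cassert>
#include <type_traits>

struct with_adj { double adj = 0.0; };  // stand-in for a var matrix
struct no_adj {};                       // stand-in for a double matrix

template <typename T>
void accumulate_adjoint(T& x, double g) {
  // Resolved at compile time: the no_adj instantiation compiles even
  // though no_adj has no .adj member, because the branch is discarded.
  if constexpr (std::is_same<T, with_adj>::value) {
    x.adj += g;
  }
}
```

This keeps the per-type logic in one function body without the nested forward_as casts shown above.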

Collaborator

If you have like 3 inputs or something then yeah I think it's a good idea to use the to_*_if() stuff cuz otherwise you end up with like a bunch of paths which feels silly, but for binary functions I think pulling out the control flow is much cleaner

Collaborator

@SteveBronder SteveBronder Nov 11, 2020

And also the one downside is that for some reason the version that pulls out the flow has to use val_op() otherwise we run into const correctness errors due to something internal in Eigen

Member Author

All of these problems go away if we just pull the control flow
"Well what if X is var and y is not or x is var and y is var and...."
Yinz keep saying this leads to code duplication but I disagree!

Sure. I disagree. I don't think any of this code is easy, and I have to read it very carefully line by line anyway.

I can pull out the control flow though (though I don't think it gets rid of the to_var_value problem).

Uses value_of() less (which requires making a holder<> class for Matrix<var>)

I was wondering why you were swapping those all the time. I just liked value_of more cause I don't have to worry about the types. If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Like if we want to do the .noalias() optimization for var<mat> like

Well if we converted mat<var> to var<mat> we wouldn't need the extra logic. But if it's necessary, sure; .noalias() is a good optimization.

If you have like 3 inputs or something

For what it's worth, it is COming Soon (TM) with trace_gen_quad_form, but we can do different code there.

Collaborator

Sure. I disagree. I don't think any of this code is easy, and I have to read it very carefully line by line anyway.

The asm etc. that's generated for the two versions can be v different. Like, for example, with to_arena_if<>(), even when the bool template is false it's not a no-op and still has to make an Eigen map. For smaller N, stuff like that matters.

I can pull out the control flow though (though I don't think it gets rid of the to_var_value problem).

For this I'd use something like the version I put up where it doesn't do the conversion

I was wondering why you were swapping those all the time. I just liked value_of more cause I don't have to worry about the types.

Yeah so value_of() for Eigen types calls

template <typename EigMat, require_eigen_t<EigMat>* = nullptr,
          require_not_st_arithmetic<EigMat>* = nullptr>
inline auto value_of(EigMat&& M) {
  return make_holder(
      [](auto& a) {
        return a.unaryExpr([](const auto& scal) { return value_of(scal); });
      },
      std::forward<EigMat>(M));
}

So it's good to call value_of() when we are doing stuff like

// A is an input argument
arena_t<promote_scalar_t<double, Mat1>> arena_A(value_of(A));

But once we have the arena_t we should just call .val() etc.

If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Actually @t4c1, is there a reason we don't have a separate version for plain and non-plain Eigen types? I think we could have an overload for when it's an Eigen matrix of vars that is not an expression and just return .val()

Well if we converted mat<var> to var<mat> we wouldn't need the extra logic. But if it's necessary, sure; .noalias() is a good optimization.

Yeah it would be v cool if we could do that swap and have it be efficient but idt it's worth the cost of swapping stuff around.

For what it's worth, it is COming Soon (TM) with trace_gen_quad_form, but we can do different code there.

Oh yeah, I'd be a monster if I asked for breaking up the control flow for anything more than a binary function

Contributor

Uses value_of() less (which requires making a holder<> class for Matrix<var>)

If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Holder is only needed for rvalue inputs (which would not work without it). For lvalues, make_holder just calls the given functor with the given argument.

Actually @t4c1 why is there a reason we don't have a separate version for plain and non-plain Eigen types? I think we could have an overload for when it's an Eigen matrix of vars that are not an expression and just return .val()

.val() does exactly the same thing as value_of called on Matrices.

Collaborator:

.val() does exactly the same thing as value_of called on Matrices.

The slight difference is that value_of will always return a CWiseUnaryOp whereas .val() will return either a CWiseUnaryView or CWiseUnaryOp, depending on const-ness

Comment on lines 132 to 133
auto arena_res = to_arena(value_of(arena_B).transpose() * value_of(arena_A)
* value_of(arena_B));
Collaborator:

It looks like we access the value of arena_B a lot. It may be worth doing the read just once and saving it as arena_b_val

Member Author:

At this point it is a var_value or an arena_t<Eigen::MatrixXd>, so it's safe to use value_of

Comment on lines +121 to +124
using return_t
= promote_var_matrix_t<decltype(value_of(B).transpose().eval()
* value_of(A) * value_of(B).eval()),
Mat1, Mat2>;
Collaborator:

promote_var_matrix_t is really intended for when you don't know whether you will return a var<matrix> or matrix<var>. Here we know by the template requires that at least one of the inputs must be a var<matrix> so the return type is just going to be a var_value<Eigen::Matrix<double, -1, -1>>

Member Author:

Yeah, I was a little nervous about compile time sized matrices, which is why I kept the template.

In that case I guess I could have just done var_value<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>.

Collaborator:

In that case I guess I could have just done var_value<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>.

Yes though you want

var_value<plain_type_t<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>> because var_value<> can accept expressions as the template (For slices such as Block<>

Member Author:

var_value<> can accept expressions

Oh right. Can't let the guard down for a second on this stuff :/

Comment on lines 141 to 154
reverse_pass_callback([arena_A, arena_B, res]() mutable {
auto C_adj_B_t = (res.adj() * value_of(arena_B).transpose()).eval();

if (!is_constant<Mat1>::value)
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
+= value_of(arena_B) * C_adj_B_t;

if (!is_constant<Mat2>::value)
forward_as<promote_var_matrix_t<Mat2, Mat1, Mat2>>(arena_B).adj()
+= value_of(arena_A) * C_adj_B_t.transpose()
+ value_of(arena_A).transpose() * value_of(arena_B) * res.adj();
});

return res;
Collaborator:

[optional/rant] Another thing that's nice about moving out the control flow is that it lets you have the immediate evaluation for C_adj_B_t when both are var types, and when just one is a var you can do C_adj_B_t inline for the accumulation.

Member Author:

Yeah I see what you're saying.

Are we sure there's no temporary created in an expression like A * B * C though? I assume that is broken up into D = B * C and A * D if A, B, and C are all Eigen MatrixXd types.

@SteveBronder (Collaborator), Nov 11, 2020:

I'm not sure; the godbolt for that is getting past the level of asm I can read. Though it looks like it makes at least one alloc and then can reuse that allocation with a memset under certain conditions. The main thing is that

A += B * C * D;

Needs to make at least one copy for (B * C * D) because of aliasing issues between the left and right hand side. Reading over the asm, I think it does one alloc here and reuses the memory for the other multiply. And I don't think we can use .noalias() for Eigen matrices of vars, since .noalias() is like the restrict keyword and we access the value and adjoint through the same pointer on both sides (aka aliasing).

Though we can use .noalias() for var matrices when we do things like

A.adj().noalias() += ...

Since the values and adjoints of the var matrix are actually different pointers so there is in fact no aliasing. Does the restrict thing make sense?

https://godbolt.org/z/h5df65

Member Author:

I'm not sure; the godbolt for that is getting past the level of asm I can read.

Alright I probably shouldn't have asked that question. Imo we're spending too long on this function.

We could do this efficiently like:

auto C_adj_B_t = to_ref_if<!is_constant<Mat1>::value && !is_constant<Mat2>::value>
      (res.adj() * value_of(arena_B).transpose());

But the bigger question is to_var_value_if and the thing here: #2164 (comment) so let's get those worked out

Comment on lines +104 to +105
* @tparam Mat1 type of the first (square) matrix
* @tparam Mat2 type of the second matrix
Collaborator:

More informative docs would be good here

Member Author:

It is hard to be more specific. Both arguments are matrix types in the sense that they satisfy require_matrix_t, and one of them needs to be a reverse mode autodiff type.

We could say inherits from EigenBase or is a var_value<T> with T such that it inherits from EigenBase but that is sorta just redefining require_matrix_t in the docs.

Suggestions?

Collaborator:

We could say inherits from EigenBase or is a var_value with T such that it inherits from EigenBase but that is sorta just redefining require_matrix_t in the docs.

I usually write that out. If someone is on the website, the requires don't show up.

Member Author:

More docs added

Comment on lines 30 to 58
/**
* Forward var_values on.
*
* @tparam T type of the input
* @param a matrix to convert
*/
template <typename T, require_var_t<T>* = nullptr>
auto to_var_value(T&& a) {
return std::forward<T>(a);
}

/**
* If the condition is true, calls `to_var_value` on the argument, otherwise
* forward the argument through.
*
* @tparam T type of argument
* @param a argument
* @return argument copied/evaluated on AD stack
*/
template <bool Condition, typename T, std::enable_if_t<!Condition>* = nullptr>
inline auto to_var_value_if(T&& a) {
return std::forward<T>(a);
}

template <bool Condition, typename T, std::enable_if_t<Condition>* = nullptr>
inline auto to_var_value_if(const T& a) {
return to_var_value(a);
}

Collaborator:

I like having the to_var_value for var<mat>, but imo to_var_value_if() is going to cause more problems than it solves

inline var trace_quad_form(const Mat1& A, const Mat2& B) {
check_square("trace_quad_form", "A", A);
check_multiplicable("trace_quad_form", "A", A, "B", B);

arena_t<Mat1> arena_A = A;
arena_t<Mat2> arena_B = B;
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
Collaborator:

I'm not following; why do they both have to be coerced to var matrices?


var res
= (value_of(arena_B).transpose() * value_of(arena_A) * value_of(arena_B))
.trace();

reverse_pass_callback([arena_A, arena_B, res]() mutable {
if (!is_constant<Mat1>::value)
forward_as<promote_scalar_t<var, Mat1>>(arena_A).adj()
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
Collaborator:

I think what you're seeing here is more a consequence of the pattern you're using. imo I'd rather just not use this pattern

@bbbales2 (Member Author):

@SteveBronder I broke out the control flow and such here.

I'm just removing the to_var_value optimizations (and the to_var_value_if code) since that is not agreed on. It's not gonna be possible to answer that without benchmarks, and not doing it is faster than what is in develop now anyway.

Anyway it'll need another review. I stuck with value_of to avoid the val/val_op thing, but it sounds like value_of isn't using holders for plain_types so we're safe (#2164 (comment)).

@stan-buildbot (Contributor):


| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 2.97 | 3.03 | 0.98 | -2.25% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -1.76% slower |
| eight_schools/eight_schools.stan | 0.12 | 0.11 | 1.05 | 4.58% faster |
| gp_regr/gp_regr.stan | 0.17 | 0.17 | 0.99 | -0.72% slower |
| irt_2pl/irt_2pl.stan | 5.63 | 5.65 | 1.0 | -0.49% slower |
| performance.compilation | 86.46 | 85.34 | 1.01 | 1.3% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.49 | 8.48 | 1.0 | 0.16% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.1 | 29.58 | 1.02 | 1.72% faster |
| sir/sir.stan | 132.47 | 134.02 | 0.99 | -1.17% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 1.01 | 1.02% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.98 | 2.99 | 1.0 | -0.24% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.37 | 0.4 | 0.92 | -8.23% slower |
| arK/arK.stan | 1.78 | 1.78 | 1.0 | -0.03% slower |
| arma/arma.stan | 0.61 | 0.61 | 1.01 | 0.72% faster |
| garch/garch.stan | 0.74 | 0.74 | 1.01 | 0.67% faster |
Mean result: 0.997537721224

Jenkins Console Log
Blue Ocean
Commit hash: 3cf52b4


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2 (Member Author):

@SteveBronder this is ready for review again. It was greenlit before I just added some more docs for the template types and merged in develop

@SteveBronder (Collaborator) left a comment:

Looks good! Two comments though. imo I'd rather not pay for making the holder(), which we can avoid by using .val() instead of value_of() like I have below for quad_form, but I think this code is also fine

https://github.com/stan-dev/math/compare/feature/quad_form-pullout-logic#diff-11ae2c18d3cbe94d5f48b9f855242f977ee3ec1d490c43723e09417f95c829f3R35

Comment on lines +130 to +131
check_not_nan("multiply", "A", value_of(arena_A));
check_not_nan("multiply", "B", value_of(arena_B));
Collaborator:

Once we do arena_t<promote_scalar_t<var, Mat1>> arena_A = A; we know that arena_A has a var scalar, so we can use .val() directly. I'm pretty sure value_of() is going to have to make a holder() type, whereas calling .val() directly won't have to. Is there a reason you can't just use the version I wrote below that handles var<mat> and mat<var>?

https://github.com/stan-dev/math/compare/feature/quad_form-pullout-logic#diff-11ae2c18d3cbe94d5f48b9f855242f977ee3ec1d490c43723e09417f95c829f3R35

Member Author:

The conclusion of this discussion is that the value_of calls here are fine and they don't create holders: #2164 (comment).

Is there a reason you can't just use the version I wrote below that handles var and mat?

I am basically using that code. The differences are:

  1. value_of instead of worrying about the difference in val and val_op

  2. We talked about at the last release to stop updating mat<var> code unless we have benchmarks showing it's faster, so I kept the old mat<var> stuff in place.

  3. The code in that pull request returns a 1x1 matrix for the vector^T matrix vector case. This is not correct (see: https://mc-stan.org/docs/2_25/functions-reference/dot-products-and-specialized-products.html). I think there is a bug in the test framework that is letting this pass. I haven't made an issue for that yet.

Sound good?

Member Author:

I made an issue for the 1x1 thing: #2198

Comment on lines +133 to +139
if (is_var_matrix<Mat1>::value) {
arena_A.adj().noalias()
+= res.adj() * value_of(arena_B) * value_of(arena_B).transpose();
} else {
arena_A.adj()
+= res.adj() * value_of(arena_B) * value_of(arena_B).transpose();
}
Collaborator:

[side note] I wonder if it would be useful or look nice to have something like

update_adjoint(arena_A, res.adj() * value_of(arena_B) * value_of(arena_B).transpose());

Where for var<mat> it's just

lhs.adj().nolias() += rhs

and for mat<var> it is

lhs.adj() += rhs

So we don't have to have these if-else statements all over

Member Author:

What about:

noalias(lhs.adj()) += rhs

Member Author:

Yeah, what I said doesn't make sense because you'd still need the is_var_matrix logic.

I can add this if you want. I'm happy with ifs though.

@stan-buildbot (Contributor):


| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 3.05 | 3.55 | 0.86 | -16.39% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.06% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.12 | 0.98 | -1.54% slower |
| gp_regr/gp_regr.stan | 0.17 | 0.16 | 1.03 | 2.51% faster |
| irt_2pl/irt_2pl.stan | 5.71 | 5.62 | 1.02 | 1.49% faster |
| performance.compilation | 86.29 | 85.54 | 1.01 | 0.86% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.48 | 8.5 | 1.0 | -0.27% slower |
| pkpd/one_comp_mm_elim_abs.stan | 31.0 | 30.51 | 1.02 | 1.6% faster |
| sir/sir.stan | 132.53 | 134.85 | 0.98 | -1.75% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 0.99 | -0.62% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.97 | 2.95 | 1.01 | 0.82% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.38 | 0.37 | 1.03 | 2.86% faster |
| arK/arK.stan | 1.77 | 1.77 | 1.0 | -0.36% slower |
| arma/arma.stan | 0.6 | 0.6 | 1.0 | -0.17% slower |
| garch/garch.stan | 0.74 | 0.59 | 1.26 | 20.86% faster |
Mean result: 1.01063798211

Jenkins Console Log
Blue Ocean
Commit hash: fe0b223



@SteveBronder SteveBronder merged commit 7b9b75f into develop Nov 19, 2020