Conversation

bbbales2
Member

Summary

var<mat> versions for quad_form and trace_quad_form

Not the cleanest. Comments inline.

Release notes

var<mat> versions of quad_form and trace_quad_form

Checklist

  • Math issue: Make functions with custom autodiff var<mat> friendly (#2101)

  • Copyright holder: Columbia University

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

Member Author

@bbbales2 bbbales2 left a comment

Questions and comments

@@ -17,7 +17,7 @@ namespace math {
* @param y Matrix to test
* @throw <code>std::invalid_argument</code> if the matrix is not square
*/
template <typename T_y, require_eigen_t<T_y>* = nullptr>
template <typename T_y, require_matrix_t<T_y>* = nullptr>
Member Author

a_few_varmats has this (#2106), so this will get fixed when that merges

auto arena_A = to_arena(A);
auto arena_B = to_arena(B);

check_not_nan("multiply", "A", value_of(arena_A));
Member Author

Need to go through and make all the check functions work with var<mat> types. I'll make a list of the ones that need updating and add them to #2101. Here I just pass in the value.

Contributor

I think we should prefer passing values into check functions anyway. That way we don't need them to support vars. Also, checks on values will be faster than ones on Matrix<var>.

Member Author

Maybe this is an expression thing, again. Like, if we pass in a MatrixXd, then if we switch to values we have to do two full reads of the matrix to do a check. If we pass in a Matrix<var, R, C> it is one read. If we pass in an expression view of the values in a Matrix<var, R, C>, then it is still one read. Does this sound right?

Contributor

Yeah, but that is assuming the function calling the check does not need to extract values anyway. Most functions do need values.

Member Author

Most functions do need values.

That is true.

I guess if we just put an auto x_val = value_of_rec(x) inside all the checks, then that would be good? If someone passes in values, that is good. If someone passes in a mat<var>, it works via expressions.

Contributor

Yeah, it would work, but for mat<var> it would not be efficient. Also, what I am trying to say is that if we only pass in values, we don't need to change all the checks.

Member Author

Yeah, that makes sense. I don't want to accidentally make them slow.

I think our checks should work with var<mat> though. For the ones that read values, that just means adding a version that takes in a var<mat> and then calls itself with the value of the var<mat>.
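The forwarding pattern described here can be sketched with mock types; `var_matrix` and `check_positive` below are hypothetical stand-ins, not real Stan Math signatures. The value-based check does the work, and the var<mat> overload just delegates to it:

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Mock stand-in for a var<mat>: values and adjoints stored as two
// contiguous arrays (struct-of-arrays), unlike Eigen::Matrix<var, ...>.
struct var_matrix {
  std::vector<double> val;
  std::vector<double> adj;
};

// Base check operates on plain values only.
inline void check_positive(const char* name, const std::vector<double>& x) {
  for (double v : x) {
    if (!(v > 0)) {
      throw std::domain_error(std::string(name) + " must be positive");
    }
  }
}

// Forwarding overload: the var<mat> version just re-dispatches on the
// values, so only the value-based implementation ever needs maintaining.
inline void check_positive(const char* name, const var_matrix& x) {
  check_positive(name, x.val);
}
```

A real version would be constrained with something like require_var_matrix_t, but the shape of the idea is the same.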

arena_res = (arena_res + arena_res.transpose()).eval();
}

return_t res = arena_res;
Member Author

arena_res is on the arena, so since return_t is a var<mat> it will just take the pointer arena_res has, right?

Contributor

Yes. arena_t<var<T>> is var<T>.

value_of(B).eval()), Mat1, Mat2>;

auto arena_A = to_arena(A);
auto arena_B = to_arena(B);
Member Author

If these matrices are either var<mat>s or just regular mats then this code is fine. If one of these is a mat<var>, though, then it's not as fine, because I access values and adjoints repeatedly. If the vals/adjoints are accessed more than once, it would be more economical to just copy to a var<mat>.

So my question is: should we just be copying to var<mat> by default in all these functions?

Contributor

How about you do something like:

const auto& A_adj = to_ref(arena_A)

Also it would be nice if we only copied these matrices to arena if we actually needed them in the reverse pass.

Member Author

Well, I'm not worried about evaluating expressions as much as converting things to var<mat> land so I can be lazy.

I checked and all the copying here is fine if inputs are a mix of mat<double> and var<mat>. The A inputs and the B inputs are required for both reverse passes. It's not so good for mat<var>.

Contributor

I see. to_arena does not convert to var<mat>, so there is still a need to handle mat<var> efficiently.

Member Author

I guess the solution is to just add a version of to_var_value (https://github.com/stan-dev/math/blob/develop/stan/math/rev/fun/to_var_value.hpp) that just forwards things through if it is given a var_value.
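A sketch of that pass-through overload, with toy types (this `var_value` and `matrix_of_var` are mocks for illustration, not the real types in to_var_value.hpp):

```cpp
#include <cassert>

// Toy stand-in for stan::math::var_value.
template <typename T>
struct var_value {
  T val;
};

// Toy stand-in for an Eigen matrix of vars.
struct matrix_of_var {
  double v;
};

// Converting overload: stand-in for the existing to_var_value conversion.
inline var_value<double> to_var_value(const matrix_of_var& m) {
  return {m.v};
}

// Proposed pass-through overload: if the input is already a var_value,
// forward it unchanged instead of converting.
template <typename T>
inline const var_value<T>& to_var_value(const var_value<T>& x) {
  return x;
}
```

With this, callers can apply to_var_value unconditionally and only pay for a conversion when one is actually needed.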

Member Author

"I think at a very high level" as a precursor to a sentence sounds like the most arrogant thing ever said. It is also very cringe on its own. I just meant "I'm not sure but", but it didn't come out that way lol.

Collaborator

Kind of interesting, I did a search on GitHub for quad_form in Stan programs

https://github.com/search?l=&p=5&q=quad_form+language%3AStan&type=Code

It looks like the most used case is having data as the second argument. Mostly just something to note. (aka I'd rather not spend too much time optimizing this part)

If these matrices are either var<mat> s or just regular mat s then this code is fine. If one of these is a mat<var> though then it's not as fine because I access values and adjoints repeatedly. If the vals/adjoints are accessed more than once, it would be more economical to just copy to a var<mat>.

I agree with Tadej that mat<var> -> var<mat> is v expensive. For this func it may be fine since we do a lot of mat muls

I also don't like to_var_value_if, if we want a var_value from a mat<var> we could just do

var_value<promote_scalar_t<double, Mat1>>

imo I'm not a huge fan of the do_something_if functions with templates. I personally label them as "easy to write, hard to read". They're also worse for small N (doing almost nothing is different from doing nothing, which is what you get by pulling out the logic).

I went through and moved all the var/double logic out.

https://github.com/stan-dev/math/compare/feature-pr2164/quad_form

The other benefit of that is we don't have to use forward_as

Member Author

I also don't like to_var_value_if, if we want a var_value from a mat<var> we could just do
var_value<promote_scalar_t<double, Mat1>>

I don't think this constructor exists though, so we have to call a conversion somehow.

For this func it may be fine since we do a lot of mat muls

I agree the conversion is more expensive than not doing the conversion in ideal cases, but the comparison is between doing the conversion, calling value_of like a ton of extra times (and there are like 9-10 of these in the implementations here), or writing special versions that support combos of mat<var> and var<mat> (and selectively copying out values).

The conversion is like 5 extra read ops and we do 3 every time we do an uncached .val() or .adj(). So I'd much rather be lazy on this one. And then I want to avoid writing three versions of the function to support the argument combos.

I went through and moved all the var/double logic out.

Hmm, I think what I wrote should be efficient for var<mat> and mat<double>. I prefer the shorter code, but if there is an efficiency problem I can change it.

Member Author

Also, that search is a neat trick.

Member Author

@bbbales2 bbbales2 Oct 27, 2020

For the 3 read ops to do a .val() on a mat<var>, one of those is pointer chasing, and I think that will be way more expensive than loading up .val() three times from a var<mat>.

Edit: changed mat<var> to var<mat>... Got this exactly backwards on the first try lol.
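The layout difference behind that pointer-chasing cost can be illustrated with toy AoS/SoA structs (a mock `vari`, not Stan Math's): reading values out of the pointer-per-element layout dereferences once per entry, while the struct-of-arrays layout is one contiguous scan.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// AoS, like mat<var>: each element points at its own (value, adjoint) pair.
struct vari {
  double val;
  double adj;
};
using mat_of_var = std::vector<vari*>;

// SoA, like var<mat>: all values together, all adjoints together.
struct var_of_mat {
  std::vector<double> val;
  std::vector<double> adj;
};

// One pointer dereference per element read.
inline double sum_vals(const mat_of_var& m) {
  double s = 0.0;
  for (const vari* v : m) s += v->val;
  return s;
}

// Contiguous read, no indirection.
inline double sum_vals(const var_of_mat& m) {
  return std::accumulate(m.val.begin(), m.val.end(), 0.0);
}
```

Both compute the same sum; the difference is purely in memory access patterns, which is what the read-op accounting above is getting at.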

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.13 3.09 1.01 1.23% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -1.36% slower
eight_schools/eight_schools.stan 0.12 0.12 1.02 1.53% faster
gp_regr/gp_regr.stan 0.18 0.18 0.98 -1.7% slower
irt_2pl/irt_2pl.stan 5.71 5.7 1.0 0.14% faster
performance.compilation 91.11 88.42 1.03 2.95% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.44 8.44 1.0 -0.02% slower
pkpd/one_comp_mm_elim_abs.stan 28.91 29.77 0.97 -2.96% slower
sir/sir.stan 133.67 136.1 0.98 -1.82% slower
gp_regr/gen_gp_data.stan 0.04 0.05 0.98 -1.62% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.96 2.95 1.0 0.05% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.37 1.02 1.48% faster
arK/arK.stan 1.8 1.8 1.0 -0.29% slower
arma/arma.stan 0.62 0.61 1.01 0.93% faster
garch/garch.stan 0.7 0.7 1.0 0.0% slower
Mean result: 0.999259235198

Jenkins Console Log
Blue Ocean
Commit hash: db2b5c9


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Member Author

@bbbales2 bbbales2 left a comment

Comments

inline var trace_quad_form(const Mat1& A, const Mat2& B) {
check_square("trace_quad_form", "A", A);
check_multiplicable("trace_quad_form", "A", A, "B", B);

arena_t<Mat1> arena_A = A;
arena_t<Mat2> arena_B = B;
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
Member Author

@bbbales2 bbbales2 Oct 30, 2020

I ended up going with to_var_value_if as the lazy support for mixed var<mat> and mat<var>.

The other thing I tried was templating out the functions but that is confusing in its own way. It looks like:

template <typename Mat1, typename Mat2,
          require_var_matrix_t<Mat1>* = nullptr,
          require_eigen_st<is_var, Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
  return trace_quad_form(A, to_var_value(B));
}

template <typename Mat1, typename Mat2,
          require_eigen_st<is_var, Mat1>* = nullptr,
          require_var_matrix_t<Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
  return trace_quad_form(to_var_value(A), B);
}

template <typename Mat1, typename Mat2,
          require_all_matrix_t<Mat1, Mat2>* = nullptr,
          require_any_var_matrix_t<Mat1, Mat2>* = nullptr,
          require_all_not_eigen_st<is_var, Mat1, Mat2>* = nullptr>
inline var trace_quad_form(const Mat1& A, const Mat2& B) {
 ...
}

Edit: updated code

Collaborator

I'm not following why they both have to be coerced to var matrices.

Member Author

The other options are:

  1. Do not support mixes of mat<var> and var<mat>. I do not like that because we will have programs with both types (and so we could end up with mixed arguments)

  2. Write three different specializations. I do not like that because we found out optimizing for mat<var> is trickier than var<mat>. I do not think this is particularly necessary cuz this will be faster than not having autodiff.

Collaborator

Or just don't move the matrix<var> to a var<matrix>. I think Tadej and I both prefer just leaving them instead of doing the conversion. In another comment I have a branch with one quad_form that handles all the cases of vars with one function

feature/quad_form...feature/quad_form-pullout-logic

Collaborator

I'm going to sit down one day and figure out this whole val() and val_op() thing. Something something const correctness

Collaborator

The furthest I got was that it's something to do with Eigen's sizeof logic; there's a little more discussion in this Stack Overflow thread. The error doesn't occur with float, only with double. I found a modification that fixed the compilation, but it's hard to know whether that would affect other behaviour. Tricky.

Member Author

In another comment I have branch with one quad_form that handles all the cases of vars with one function

I was trying to avoid touching the pure mat<var> implementation since we decided to wait on benchmarks before we did that (which is why we decided not to merge #2019)

I think Tadej and I both prefer just leaving them

Yeah I saw the comments and I made a response where I argued the case that the conversion here was better: #2164 (comment)

Since I didn't see a response to that, I just went with the conversion to move this along.

I'm not on the same page with stuff like:

arena_t<promote_scalar_t<var, Mat1>> arena_A(A);
arena_t<promote_scalar_t<var, Mat2>> arena_B(B);
return_t res(arena_B.val_op().transpose() * arena_A.val_op() * arena_B.val_op());

Like up until this point I've been assuming if we read the .val() of a mat<var> more than once, we should make a copy of it on the arena.

Since this function does a lot of .val() calls on A and B, my understanding is that we need to make arena copies of the values, otherwise we're going to pointer-chase for every .val() that happens on a mat<var>, and it's the pointer chasing that is slow. I can just ditch the to_var_value stuff (it's gonna be faster than no autodiff either way) but I thought I made a pretty good argument back here: #2164 (comment).

Member Author

This is the big Q btw. Can do the control flow thing sure.


var res
= (value_of(arena_B).transpose() * value_of(arena_A) * value_of(arena_B))
.trace();

reverse_pass_callback([arena_A, arena_B, res]() mutable {
if (!is_constant<Mat1>::value)
forward_as<promote_scalar_t<var, Mat1>>(arena_A).adj()
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
Member Author

I think we should redo promote_var_matrix_t in another pull request.

I think it should just be promote_var_matrix_t<T> returns the var_mat with the same shape as T.

I think our current design of promote_var_matrix_t would be useful when writing two argument functions that took all combinations of var<mat> and mat<var>.

It seems now we are writing functions where the signatures are all separate:

One function does (mat<var>, mat<var>) and returns mat<var>, and then another function handles (var<mat>, mat<var>), (mat<var>, var<mat>), (var<mat>, var<mat>) and returns a var<mat> type.

So it's easy enough from the separation of functions to know what the return types should be, and so promote_var_matrix_t can simplify.
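The simplification proposed here could look roughly like this one-argument trait (toy `var_value` and hypothetical spelling, not the current promote_var_matrix_t):

```cpp
#include <type_traits>

// Toy stand-in for stan::math::var_value.
template <typename T>
struct var_value {
  using value_type = T;
};

template <typename T>
struct is_var_value : std::false_type {};
template <typename T>
struct is_var_value<var_value<T>> : std::true_type {};

// One-argument promotion: wrap T in var_value unless it already is one,
// which sidesteps nesting into var_value<var_value<...>>.
template <typename T>
using promote_var_matrix_t =
    typename std::conditional<is_var_value<T>::value, T, var_value<T>>::type;
```

The pass-through specialization is what distinguishes this from a plain var_value<T> spelling, which would double-wrap.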

Collaborator

I think what you're seeing here is moreso because of the pattern you're using. imo I'd rather just not use this pattern

Member Author

I think I could just do var_value<Mat1> since you mentioned it elsewhere.

Member Author

Oh wait, that doesn't work cuz we might end up with a var_value<var_value<...>>.

Anyway, you're right it's a result of the pattern so it's kinda downstream of the rest of the comments.

@bbbales2
Member Author

bbbales2 commented Nov 4, 2020

@rok-cesnovar @t4c1 @SteveBronder do we have an OpenCL machine I can log into to test this? I wasn't able to get OpenCL running on my local computer.

@rok-cesnovar
Member

Not really. I can give you access to our lab machine if needed, but I am not sure why quad_form or trace_quad_form would cause GLMs to fail. Maybe restart to double check?

@bbbales2
Member Author

bbbales2 commented Nov 4, 2020

The test is test/unit/math/opencl/rev/normal_id_glm_lpdf_test. Not obvious to me either why it would fail, but it could be something sneaky. Mind giving that a run real quick in this branch?

@rok-cesnovar
Member

Sure thing. Will run in about an hour or so.

@rok-cesnovar
Member

That hour was long.

Runs fine locally.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.15 3.04 1.04 3.39% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.99 -0.9% slower
eight_schools/eight_schools.stan 0.12 0.12 0.98 -1.93% slower
gp_regr/gp_regr.stan 0.17 0.17 1.03 2.49% faster
irt_2pl/irt_2pl.stan 5.69 5.63 1.01 1.11% faster
performance.compilation 89.82 87.2 1.03 2.92% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.49 8.48 1.0 0.1% faster
pkpd/one_comp_mm_elim_abs.stan 31.28 29.37 1.06 6.09% faster
sir/sir.stan 134.54 134.04 1.0 0.37% faster
gp_regr/gen_gp_data.stan 0.04 0.04 1.02 1.88% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 2.95 1.0 0.09% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.37 1.03 2.89% faster
arK/arK.stan 1.79 1.79 1.0 -0.38% slower
arma/arma.stan 0.6 0.6 0.99 -0.82% slower
garch/garch.stan 0.76 0.74 1.02 1.57% faster
Mean result: 1.01316112208

Jenkins Console Log
Blue Ocean
Commit hash: 38eb76d


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2
Member Author

bbbales2 commented Nov 5, 2020

@SteveBronder ready for review!

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change (1 - new / old)
gp_pois_regr/gp_pois_regr.stan 3.01 3.35 0.9 -11.48% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.97 -2.76% slower
eight_schools/eight_schools.stan 0.12 0.12 0.99 -1.46% slower
gp_regr/gp_regr.stan 0.17 0.17 0.99 -1.42% slower
irt_2pl/irt_2pl.stan 5.67 5.72 0.99 -0.86% slower
performance.compilation 87.83 85.43 1.03 2.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.5 8.48 1.0 0.28% faster
pkpd/one_comp_mm_elim_abs.stan 29.43 29.12 1.01 1.04% faster
sir/sir.stan 135.42 133.66 1.01 1.3% faster
gp_regr/gen_gp_data.stan 0.04 0.04 0.99 -0.76% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 2.95 2.97 1.0 -0.46% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.39 0.4 0.96 -3.68% slower
arK/arK.stan 1.76 1.77 0.99 -0.67% slower
arma/arma.stan 0.6 0.6 1.01 0.91% faster
garch/garch.stan 0.74 0.74 1.0 0.05% faster
Mean result: 0.98954762523

Jenkins Console Log
Blue Ocean
Commit hash: 484eb85


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Collaborator

@SteveBronder SteveBronder left a comment

Couple fixes. Most of it really boils down to this: I think a lot of the problems you are seeing come from the pattern where we use forward_as and promote a matrix of var to a var matrix. I don't see why we need to do that promotion, and I think breaking out the control flow logic will make those problems go away.

Comment on lines 126 to 127
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
auto arena_B = to_arena(to_var_value_if<!is_constant<Mat2>::value>(B));
Collaborator

I'm confused about the logic here. Why does this need to move over to a var_value conditionally, or even at all? imo I'd rather have us use the scheme of moving the control flow up because it's a lot easier to read through.

Member Author

If it is a mat<var>, then it needs to be converted to a var<mat>; otherwise it passes through.

The alternative to these to_var_value_if statements is more function definitions and requires. Yes, to_var_value_if and friends are confusing, but I do not think this is more confusing than:

require_eigen_st<is_var, Mat1>* = nullptr,
require_var_matrix_t<Mat2>* = nullptr

which is something like what you'd need to add to the functions to specify that one of the things is an Eigen-like matrix with a var and the other thing is a var matrix.

The result of the to_var_value_if, regardless of what it is, needs to end up in the arena cause we use it in the reverse pass (hence everything is wrapped in a to_arena).

Collaborator

👏 Pull 👏 Out 👏 The 👏 Control 👏 Flow 👏

All of these problems go away if we just pull the control flow out into the function. There are a lot of good reasons to prefer pulling out control flow:

  1. Each path from the start to the return is linear. This makes it a lot easier to read and I don't have to go through each line and think, "Well what if X is var and y is not or x is var and y is var and....". In a block of the control flow I'm aware of which is which.

  2. Removes the need for widgets like forward_as<>() and more type trait layers like to_*_if<>(). All the paths do different things and work on different types. Yinz keep saying this leads to code duplication but I disagree! The code is different in each path; it makes the code more verbose, which, imo, is very different from duplication.

  3. Uses value_of() less (which requires making a holder<> class for Matrix<var>)

  4. Adding additional control flow is less annoying. Like if we want to do the .noalias() optimization for var<mat> like

      if (is_var_matrix<Mat1>::value) {
        arena_A.adj().noalias() += arena_B * (res.adj() * arena_B.transpose());
      } else {
        arena_A.adj() += arena_B * (res.adj() * arena_B.transpose());
      }

When we bunch the control flow together with the code this turns into

      if (!is_constant<Mat1>::value) {
        if (is_var_matrix<Mat1>) {
          forward_as<arena_t<promote_scalar_t<var, Mat1>>>(arena_A).adj().noalias()
              += value_of(arena_B) * C_adj_B_t;
        } else {
          forward_as<arena_t<promote_scalar_t<var, Mat1>>>(arena_A).adj()
              += value_of(arena_B) * C_adj_B_t;
        }
      }

That's a mouthful!

I have an example of breaking out the control flow for quad_form below. imo, way more clear! Yeah there's more literal lines of code, but like that's fine to me because I think it's simpler and easier to read.

feature/quad_form...feature/quad_form-pullout-logic
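For point 4 specifically, one middle ground is C++17 `if constexpr` (a sketch with mock types, and assuming C++17 is available): the untaken branch is discarded at compile time, so the constant path never needs an `.adj` member and no `forward_as` is required.

```cpp
#include <cassert>
#include <type_traits>

struct with_adj { double adj = 0.0; };  // stand-in for a var matrix
struct no_adj {};                       // stand-in for a double matrix

template <typename T>
void accumulate_adjoint(T& x, double g) {
  // Resolved at compile time: the no_adj instantiation compiles even
  // though no_adj has no .adj member, because the branch is discarded.
  if constexpr (std::is_same<T, with_adj>::value) {
    x.adj += g;
  }
}
```

This keeps the per-type logic in one function body without the nested forward_as casts shown above.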

Collaborator

If you have like 3 inputs or something then yeah I think it's a good idea to use the to_*_if() stuff cuz otherwise you end up with like a bunch of paths which feels silly, but for binary functions I think pulling out the control flow is much cleaner

Collaborator

@SteveBronder SteveBronder Nov 11, 2020

And also the one downside is that for some reason the version that pulls out the flow has to use val_op() otherwise we run into const correctness errors due to something internal in Eigen

Member Author

All of these problems go away if we just pull the control flow
"Well what if X is var and y is not or x is var and y is var and...."
Yinz keep saying this leads to code duplication but I disagree!

Sure. I disagree. I don't think any of this code is easy, and I have to read it very carefully line by line anyway.

I can pull out the control flow though (though I don't think it gets rid of the to_var_value problem).

Uses value_of() less (which requires making a holder<> class for Matrix<var>)

I was wondering why you were swapping those all the time. I just liked value_of more cause I don't have to worry about the types. If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Like if we want to do the .noalias() optimization for var<mat> like

Well if we converted mat<var> to var<mat> we wouldn't need the extra logic. But if it's necessary, sure; .noalias() is a good optimization.

If you have like 3 inputs or something

For what it's worth, it is COming Soon (TM) with trace_gen_quad_form, but we can do different code there.

Collaborator

Sure. I disagree. I don't think any of this code is easy, and I have to read it very carefully line by line anyway.

The asm etc. that's generated for the two versions can be v different. Like, for example, with to_arena_if<>(), even when the bool template is false it's not a no-op and still has to make an Eigen map. For smaller N, stuff like that matters.

I can pull out the control flow though (though I don't think it gets rid of the to_var_value problem).

For this I'd use something like the version I put up where it doesn't do the conversion

I was wondering why you were swapping those all the time. I just liked value_of more cause I don't have to worry about the types.

Yeah so value_of() for Eigen types calls

template <typename EigMat, require_eigen_t<EigMat>* = nullptr,
          require_not_st_arithmetic<EigMat>* = nullptr>
inline auto value_of(EigMat&& M) {
  return make_holder(
      [](auto& a) {
        return a.unaryExpr([](const auto& scal) { return value_of(scal); });
      },
      std::forward<EigMat>(M));
}

So it's good to call value_of() when we are doing stuff like

// A is an input argument
arena_t<promote_scalar_t<double, Mat1>> arena_A(value_of(A));

But once we have the arena_t we should just call .val() etc.

If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Actually @t4c1, is there a reason we don't have a separate version for plain and non-plain Eigen types? I think we could have an overload for when it's an Eigen matrix of vars that is not an expression and just return .val()

Well if we converted mat<var> to var<mat> we wouldn't need the extra logic. But if it's necessary, sure; .noalias() is a good optimization.

Yeah it would be v cool if we could do that swap and have it be efficient but idt it's worth the cost of swapping stuff around.

For what it's worth, it is COming Soon (TM) with trace_gen_quad_form, but we can do different code there.

Oh yeah, I'd be a monster if I asked for breaking up the control flow for anything more than a binary function

Contributor

Uses value_of() less (which requires making a holder<> class for Matrix<var>)

If we add a plain type overload could value_of just point to .val() (or .val_op()) instead of using a holder?

Holder is only needed for rvalue inputs (which would not work without it). For lvalues, make_holder just calls the given functor with the given argument.

Actually @t4c1 why is there a reason we don't have a separate version for plain and non-plain Eigen types? I think we could have an overload for when it's an Eigen matrix of vars that are not an expression and just return .val()

.val() does exactly the same thing as value_of called on Matrices.

Collaborator:

.val() does exactly the same thing as value_of called on Matrices.

The slight difference is that value_of will always return a CWiseUnaryOp whereas .val() will return either a CWiseUnaryView or CWiseUnaryOp, depending on const-ness

Comment on lines 132 to 133
auto arena_res = to_arena(value_of(arena_B).transpose() * value_of(arena_A)
* value_of(arena_B));
Collaborator:

It looks like we access the value of arena_B a lot. It may be worth doing the read just once and saving it as arena_b_val

Member Author:

At this point it is a var_value or an arena_t<Eigen::MatrixXd>, so it's safe to use value_of

Comment on lines +121 to +124
using return_t
= promote_var_matrix_t<decltype(value_of(B).transpose().eval()
* value_of(A) * value_of(B).eval()),
Mat1, Mat2>;
Collaborator:

promote_var_matrix_t is really intended for when you don't know whether you will return a var<matrix> or matrix<var>. Here we know by the template requires that at least one of the inputs must be a var<matrix> so the return type is just going to be a var_value<Eigen::Matrix<double, -1, -1>>

Member Author:

Yeah, I was a little nervous about compile time sized matrices, which is why I kept the template.

In that case I guess I could have just done var_value<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>.

Collaborator:

In that case I guess I could have just done var_value<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>.

Yes though you want

var_value<plain_type_t<decltype(value_of(B).transpose().eval() * value_of(A) * value_of(B).eval())>> because var_value<> can accept expressions as the template (For slices such as Block<>

Member Author:

var_value<> can accept expressions

Oh right. Can't let the guard down for a second on this stuff :/

Comment on lines 141 to 154
reverse_pass_callback([arena_A, arena_B, res]() mutable {
auto C_adj_B_t = (res.adj() * value_of(arena_B).transpose()).eval();

if (!is_constant<Mat1>::value)
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
+= value_of(arena_B) * C_adj_B_t;

if (!is_constant<Mat2>::value)
forward_as<promote_var_matrix_t<Mat2, Mat1, Mat2>>(arena_B).adj()
+= value_of(arena_A) * C_adj_B_t.transpose()
+ value_of(arena_A).transpose() * value_of(arena_B) * res.adj();
});

return res;
Collaborator:

[optional/rant] Another thing that's nice about moving out the control flow is that it lets you have the immediate evaluation for C_adj_B_t when both are var types, and when just one is a var you can do C_adj_B_t inline for the accumulation.

Member Author:

Yeah I see what you're saying.

Are we sure there's no temporary created in an expression like A * B * C though? I assume that is broken up into D = B * C and A * D if A, B, and C are all Eigen MatrixXd types.

@SteveBronder (Collaborator), Nov 11, 2020:

I'm not sure; the godbolt for that is getting past the level of asm I can read. Though it looks like it makes at least one alloc and then can reuse that allocation with a memset under certain conditions. The main thing is that

A += B * C * D;

Needs to make at least one copy for (B * C * D) because of aliasing issues between the left and right hand side. Reading over the asm, I think it does one alloc here and reuses the memory for the other multiply. And I don't think we can use .noalias() for Eigen matrices of vars, since .noalias() is like the restrict keyword and we access the value and adjoint through the same pointer on both sides (aka aliasing).

Though we can use .noalias() for var matrices when we do things like

A.adj().noalias() += ...

Since the values and adjoints of the var matrix are actually different pointers so there is in fact no aliasing. Does the restrict thing make sense?

https://godbolt.org/z/h5df65

Member Author:

I'm not sure; the godbolt for that is getting past the level of asm I can read.

Alright I probably shouldn't have asked that question. Imo we're spending too long on this function.

We could do this efficiently like:

auto C_adj_B_t = to_ref_if<!is_constant<Mat1>::value && !is_constant<Mat2>::value>
      (res.adj() * value_of(arena_B).transpose());

But the bigger question is to_var_value_if and the thing here: #2164 (comment) so let's get those worked out

Comment on lines +104 to +105
* @tparam Mat1 type of the first (square) matrix
* @tparam Mat2 type of the second matrix
Collaborator:

More informative docs would be good here

Member Author:

It is hard to be more specific. Both arguments are matrix types in the sense that they satisfy require_matrix_t, and one of them needs to be a reverse mode autodiff type.

We could say inherits from EigenBase or is a var_value<T> with T such that it inherits from EigenBase but that is sorta just redefining require_matrix_t in the docs.

Suggestions?

Collaborator:

We could say inherits from EigenBase or is a var_value with T such that it inherits from EigenBase but that is sorta just redefining require_matrix_t in the docs.

I usually write that out. If someone is on the website, the requires don't show up.

Member Author:

More docs added

Comment on lines 30 to 58
/**
* Forward var_values on.
*
* @tparam T type of the input
* @param a matrix to convert
*/
template <typename T, require_var_t<T>* = nullptr>
auto to_var_value(T&& a) {
return std::forward<T>(a);
}

/**
* If the condition is true, calls `to_var_value` on the argument, otherwise
* forward the argument through.
*
* @tparam T type of argument
* @param a argument
* @return argument copied/evaluated on AD stack
*/
template <bool Condition, typename T, std::enable_if_t<!Condition>* = nullptr>
inline auto to_var_value_if(T&& a) {
return std::forward<T>(a);
}

template <bool Condition, typename T, std::enable_if_t<Condition>* = nullptr>
inline auto to_var_value_if(const T& a) {
return to_var_value(a);
}

Collaborator:

I like having the to_var_value for var<mat>, but imo to_var_value_if() is going to cause more problems than it solves

inline var trace_quad_form(const Mat1& A, const Mat2& B) {
check_square("trace_quad_form", "A", A);
check_multiplicable("trace_quad_form", "A", A, "B", B);

arena_t<Mat1> arena_A = A;
arena_t<Mat2> arena_B = B;
auto arena_A = to_arena(to_var_value_if<!is_constant<Mat1>::value>(A));
Collaborator:

I'm not following; why do they both have to be coerced to var matrices?


var res
= (value_of(arena_B).transpose() * value_of(arena_A) * value_of(arena_B))
.trace();

reverse_pass_callback([arena_A, arena_B, res]() mutable {
if (!is_constant<Mat1>::value)
forward_as<promote_scalar_t<var, Mat1>>(arena_A).adj()
forward_as<promote_var_matrix_t<Mat1, Mat1, Mat2>>(arena_A).adj()
Collaborator:

I think what you're seeing here is more a consequence of the pattern you're using. imo I'd rather just not use this pattern

@bbbales2 (Member Author):

@SteveBronder I broke out the control flow and such here.

I'm just removing the to_var_value optimizations (and the to_var_value_if code) since that is not agreed on. It's not gonna be possible to answer that without benchmarks, and not doing it is faster than what is in develop now anyway.

Anyway it'll need another review. I stuck with value_of to avoid the val/val_op thing, but it sounds like value_of isn't using holders for plain_types so we're safe (#2164 (comment)).

@stan-buildbot (Contributor):


| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 2.97 | 3.03 | 0.98 | -2.25% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -1.76% slower |
| eight_schools/eight_schools.stan | 0.12 | 0.11 | 1.05 | 4.58% faster |
| gp_regr/gp_regr.stan | 0.17 | 0.17 | 0.99 | -0.72% slower |
| irt_2pl/irt_2pl.stan | 5.63 | 5.65 | 1.0 | -0.49% slower |
| performance.compilation | 86.46 | 85.34 | 1.01 | 1.3% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.49 | 8.48 | 1.0 | 0.16% faster |
| pkpd/one_comp_mm_elim_abs.stan | 30.1 | 29.58 | 1.02 | 1.72% faster |
| sir/sir.stan | 132.47 | 134.02 | 0.99 | -1.17% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 1.01 | 1.02% faster |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.98 | 2.99 | 1.0 | -0.24% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.37 | 0.4 | 0.92 | -8.23% slower |
| arK/arK.stan | 1.78 | 1.78 | 1.0 | -0.03% slower |
| arma/arma.stan | 0.61 | 0.61 | 1.01 | 0.72% faster |
| garch/garch.stan | 0.74 | 0.74 | 1.01 | 0.67% faster |
Mean result: 0.997537721224

Jenkins Console Log
Blue Ocean
Commit hash: 3cf52b4


Machine information
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@bbbales2 (Member Author):

@SteveBronder this is ready for review again. It was greenlit before I just added some more docs for the template types and merged in develop

@SteveBronder (Collaborator) left a comment:

Looks good! Two comments though. imo I'd rather not pay for making the holder(), which we can avoid by using .val() instead of value_of() like I have below for quad_form, but I think this code is also fine

https://github.com/stan-dev/math/compare/feature/quad_form-pullout-logic#diff-11ae2c18d3cbe94d5f48b9f855242f977ee3ec1d490c43723e09417f95c829f3R35

Comment on lines +130 to +131
check_not_nan("multiply", "A", value_of(arena_A));
check_not_nan("multiply", "B", value_of(arena_B));
Collaborator:

Once we do arena_t<promote_scalar_t<var, Mat1>> arena_A = A; we know that arena_A has a var scalar, so we can use .val() directly. I'm pretty sure value_of() is going to have to make a holder() type, whereas calling .val() directly won't have to. Is there a reason you can't just use the version I wrote below that handles var<mat> and mat<var>?

https://github.com/stan-dev/math/compare/feature/quad_form-pullout-logic#diff-11ae2c18d3cbe94d5f48b9f855242f977ee3ec1d490c43723e09417f95c829f3R35

Member Author:

The conclusion of this discussion is that the value_of calls here are fine and they don't create holders: #2164 (comment).

Is there a reason you can't just use the version I wrote below that handles var and mat?

I am basically using that code. The differences are:

  1. value_of instead of worrying about the difference in val and val_op

  2. We talked about at the last release to stop updating mat<var> code unless we have benchmarks showing it's faster, so I kept the old mat<var> stuff in place.

  3. The code in that pull request returns a 1x1 matrix for the vector^T matrix vector case. This is not correct (see: https://mc-stan.org/docs/2_25/functions-reference/dot-products-and-specialized-products.html). I think there is a bug in the test framework that is letting this pass. I haven't made an issue for that yet.

Sound good?

Member Author:

I made an issue for the 1x1 thing: #2198

Comment on lines +133 to +139
if (is_var_matrix<Mat1>::value) {
arena_A.adj().noalias()
+= res.adj() * value_of(arena_B) * value_of(arena_B).transpose();
} else {
arena_A.adj()
+= res.adj() * value_of(arena_B) * value_of(arena_B).transpose();
}
Collaborator:

[side note] I wonder if it would be useful or look nice to have something like

update_adjoint(arena_A, res.adj() * value_of(arena_B) * value_of(arena_B).transpose());

Where for var<mat> it's just

lhs.adj().nolias() += rhs

and for mat<var> it is

lhs.adj() += rhs

So we don't have to have these if-else statements all over

Member Author:

What about:

noalias(lhs.adj()) += rhs

Member Author:

Yeah, what I said doesn't make sense because you'd still need the is_var_matrix logic.

I can add this if you want. I'm happy with ifs though.

@stan-buildbot (Contributor):


| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
|---|---|---|---|---|
| gp_pois_regr/gp_pois_regr.stan | 3.05 | 3.55 | 0.86 | -16.39% slower |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.06% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.12 | 0.98 | -1.54% slower |
| gp_regr/gp_regr.stan | 0.17 | 0.16 | 1.03 | 2.51% faster |
| irt_2pl/irt_2pl.stan | 5.71 | 5.62 | 1.02 | 1.49% faster |
| performance.compilation | 86.29 | 85.54 | 1.01 | 0.86% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.48 | 8.5 | 1.0 | -0.27% slower |
| pkpd/one_comp_mm_elim_abs.stan | 31.0 | 30.51 | 1.02 | 1.6% faster |
| sir/sir.stan | 132.53 | 134.85 | 0.98 | -1.75% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 0.99 | -0.62% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 2.97 | 2.95 | 1.01 | 0.82% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.38 | 0.37 | 1.03 | 2.86% faster |
| arK/arK.stan | 1.77 | 1.77 | 1.0 | -0.36% slower |
| arma/arma.stan | 0.6 | 0.6 | 1.0 | -0.17% slower |
| garch/garch.stan | 0.74 | 0.59 | 1.26 | 20.86% faster |
Mean result: 1.01063798211

Jenkins Console Log
Blue Ocean
Commit hash: fe0b223



@SteveBronder SteveBronder merged commit 7b9b75f into develop Nov 19, 2020