Adds flags for compiler optimizations #2020

Merged (33 commits) on Aug 24, 2020

SteveBronder
Collaborator

@SteveBronder SteveBronder commented Aug 18, 2020

Summary

After adding a template parameter to vari we saw a 20% regression in the SIR model. This PR adds compiler flag options so that, if a user has either clang >= 8 or gcc >= 5, flto and other compiler optimization flags are turned on. flto is only turned on for clang >= 8 because -dumpversion reports 4.2.1 for both clang-6 and clang-7, so we cannot tell them apart. clang-6 requires either the gold linker plugin for ar or llvm-ar. Since the gold plugin is a non-standard extension that only some installations of ar have, we turn off flto by default for both clang-6 and clang-7.
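The gating described above can be sketched as follows. This is only an illustration in Python, not the makefile logic from the PR; the function names and the constants are hypothetical, and the thresholds are simply the ones stated in this summary.

```python
from typing import Tuple

# Thresholds as stated in the summary; names are hypothetical.
MIN_CLANG_MAJOR = 8
MIN_GCC_MAJOR = 5


def parse_version(dumpversion: str) -> Tuple[int, ...]:
    """Turn a -dumpversion string such as '10.0.0' or '9' into a tuple of ints."""
    parts = dumpversion.strip().split(".")
    return tuple(int(part) for part in parts if part.isdigit())


def extra_optimizations_enabled(compiler: str, dumpversion: str) -> bool:
    """Return True if the extra optimization flags would be switched on."""
    version = parse_version(dumpversion)
    if not version:
        return False  # unparseable version: stay conservative
    major = version[0]
    if compiler == "clang":
        # clang-6 and clang-7 both report 4.2.1, so they fall below the
        # threshold and are treated the same way.
        return major >= MIN_CLANG_MAJOR
    if compiler == "gcc":
        return major >= MIN_GCC_MAJOR
    return False  # unknown compiler: leave the defaults alone
```

With this sketch, a clang that reports 4.2.1 is never given the flags, which is the conservative behaviour the summary asks for.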

Tests

This is a make feature, so there are no new tests. Because our benchmarks use clang-6 they won't show the new benchmark values, so I will paste them here. These tests were run on Ubuntu 20.04 with an AMD Threadripper 1950X and DDR4 RAM. Tests for each compiler were run 10 times and then averaged.

Clang-10

The clang-10 results were run with 5000 warmup and 5000 samples, though overall I didn't see the increase in iteration numbers help produce better reliability in the test numbers.

Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 7.1 6.71 1.06 5.49% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.02 0.88 -13.0% slower
eight_schools/eight_schools.stan 0.18 0.17 1.03 3.28% faster
arK/arK.stan 5.27 5.28 1.0 -0.16% slower
irt_2pl/irt_2pl.stan 15.57 15.87 0.98 -1.88% slower
performance.compilation 48.23 58.49 0.82 -21.28% slower
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 23.93 22.65 1.06 5.38% faster
pkpd/one_comp_mm_elim_abs.stan 59.22 56.66 1.05 4.31% faster
sir/sir.stan 224.85 190.03 1.18 15.48% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.25 0.25 1.01 0.75% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 6.93 6.61 1.05 4.66% faster
gp_regr/gp_regr.stan 0.33 0.32 1.03 2.54% faster
arma/arma.stan 0.76 0.75 1.02 1.96% faster
gp_regr/gen_gp_data.stan 0.03 0.03 1.0 0.24% faster
garch/garch.stan 1.97 1.75 1.13 11.35% faster
Mean result: 1.0199644747
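For readers checking the table, the Ratio and Performance change columns can be reproduced from the two timing columns. The sketch below assumes Ratio is old / new (inferred from the values, not stated in the table) and uses the formula from the column header for the change. Because the timings shown are already rounded, recomputed values can differ from the table in the last digit. The function names are hypothetical.

```python
def ratio(old: float, new: float) -> float:
    """Speed ratio; values above 1 mean the new build is faster."""
    return old / new


def performance_change(old: float, new: float) -> float:
    """Fractional change, using the formula in the table header: 1 - new / old."""
    return 1.0 - new / old


def describe(old: float, new: float) -> str:
    """Format one row the way the benchmark output does."""
    change = performance_change(old, new)
    label = "faster" if change >= 0 else "slower"
    return f"{ratio(old, new):.2f} {change * 100:.2f}% {label}"


# Example with the sir/sir.stan timings from the table above.
print(describe(224.85, 190.03))
```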

GCC-10

Because I didn't see any increased reliability from running more than the standard number of iterations the gcc results were run with 1000 warmup and 1000 samples.

Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 2.34 1.98 1.18 15.29% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 1.05 4.75% faster
eight_schools/eight_schools.stan 0.05 0.05 0.99 -0.91% slower
arK/arK.stan 1.24 1.25 0.99 -0.95% slower
irt_2pl/irt_2pl.stan 4.22 4.03 1.05 4.43% faster
performance.compilation 86.27 96.83 0.89 -12.23% slower
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 4.59 4.7 0.98 -2.53% slower
pkpd/one_comp_mm_elim_abs.stan 14.17 14.03 1.01 0.95% faster
sir/sir.stan 87.9 70.69 1.24 19.57% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.23 0.23 1.0 0.11% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 1.68 1.7 0.99 -1.08% slower
gp_regr/gp_regr.stan 0.09 0.1 0.93 -8.02% slower
arma/arma.stan 0.33 0.34 0.98 -1.56% slower
gp_regr/gen_gp_data.stan 0.02 0.02 0.99 -1.37% slower
garch/garch.stan 0.34 0.37 0.91 -9.7% slower
Mean result: 1.01176292109

Side Effects

There are some compiler flags which require the gcc graphite compiler extension, though to my knowledge this comes with most standard builds of gcc. If you build gcc from source, I believe you need to specify whether the graphite extension should be compiled. If a user does not have graphite set up, a warning is given and the program continues to run.

One of the interesting things I found out is that several of these tests have a pretty wide variance. For instance, here are the results of running develop against develop 10 times:

---RESULTS---

Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 2.34 2.29 1.02 2.37% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.01 0.01 0.92 -8.15% slower
eight_schools/eight_schools.stan 0.05 0.05 1.01 0.92% faster
arK/arK.stan 1.27 1.3 0.98 -1.86% slower
irt_2pl/irt_2pl.stan 4.23 4.18 1.01 1.24% faster
performance.compilation 60.81 60.2 1.01 1.0% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 4.63 4.79 0.97 -3.4% slower
pkpd/one_comp_mm_elim_abs.stan 14.07 14.15 0.99 -0.57% slower
sir/sir.stan 82.19 81.9 1.0 0.36% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.25 0.25 1.03 2.95% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 1.71 1.73 0.99 -1.28% slower
gp_regr/gp_regr.stan 0.1 0.1 1.01 1.36% faster
arma/arma.stan 0.35 0.34 1.02 2.22% faster
gp_regr/gen_gp_data.stan 0.03 0.03 1.05 5.15% faster
garch/garch.stan 0.35 0.34 1.01 1.03% faster

So low_dim_corr_gauss and gen_gp_data can have pretty large variance even with a decent number of runs.
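One rough way to read the tables with this variance in mind is to treat the develop-against-develop ratio as a noise band and only trust changes that clearly exceed it. The sketch below is a crude heuristic, not part of the benchmark tooling: the ratios are copied from the develop-vs-develop and GCC-10 tables in this description, while the safety factor and the noise floor are arbitrary assumptions.

```python
# Ratios (old / new) copied from the tables in this description.
AA_RATIOS = {  # develop vs develop
    "low_dim_corr_gauss": 0.92,
    "gen_gp_data": 1.05,
    "sir": 1.00,
}
PR_RATIOS_GCC10 = {  # develop vs this PR, gcc-10
    "low_dim_corr_gauss": 1.05,
    "gen_gp_data": 0.99,
    "sir": 1.24,
}


def exceeds_noise(pr_ratio: float, aa_ratio: float,
                  safety: float = 2.0, floor: float = 0.02) -> bool:
    """True if the PR's deviation from 1.0 is larger than the noise band.

    safety and floor are arbitrary choices for illustration.
    """
    noise = max(abs(aa_ratio - 1.0), floor)
    return abs(pr_ratio - 1.0) > safety * noise


for name, pr_ratio in PR_RATIOS_GCC10.items():
    print(name, exceeds_noise(pr_ratio, AA_RATIOS[name]))
```

Under these assumptions only the sir change stands out from the noise, and the two small models do not.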

Release notes

Add compiler flags to give better optimizations

Checklist

  • Closes Performance regression due to absence of lto #2008

  • Copyright holder: Steve Bronder

    The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
    - Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
    - Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)

  • the basic tests are passing

    • unit tests pass (to run, use: ./runTests.py test/unit)
    • header checks pass, (make test-headers)
    • dependencies checks pass, (make test-math-dependencies)
    • docs build, (make doxygen)
    • code passes the built in C++ standards checks (make cpplint)
  • the code is written in idiomatic C++ and changes are documented in the doxygen

  • the new changes are tested

Member

@rok-cesnovar rok-cesnovar left a comment

Thanks! The performance tests look good.

I am a bit confused as to which clang versions allow these flags. In the Summary you mention clang 8 but in the side effects you write: "the most important one being that if a user uses a clang-[7-10] they must also have llvm-ar-[7-10] hooked into their path as well for flto.". I think you mention that there is some confusion with clang and versions anyways?

Another question is regarding "they must also have llvm-ar-[7-10] hooked into their path". What does this mean for a typical Stan user who is not an expert on C++ toolchains? Is this installed for Mac users with the Xcode toolchain, and does it work out of the box? Does one have to do anything special on Ubuntu apart from "sudo apt install clang"?

I am not as worried about anyone who compiles gcc on their own. Those who do will not need much assistance with setting up a working toolchain anyway.

One of the interesting things I found out is that several of these tests have a pretty wide variance.

Yep, when doing some performance benchmarks in the past, I had to run like 40000 iterations for garch, gp_regr, eight_schools, and other quick models to get consistent timings.

@rok-cesnovar
Member

We also need a quick summary to understand in which cases there is a regression. If I understand correctly then the logic is:

if g++ version < 5 or clang version < 8
  - some regression is expected for models where memory allocation is a big part of the model. Which I guess is ODE models? Any others?
else
  - no changes expected?

@wds15
Contributor

wds15 commented Aug 18, 2020

Just as FYI: I did find performance regression for SIR, yes, but also for a mixture logistic regression model!

Thanks for putting this together! I expected that this would be quick and easy, but it looks like it was a lot of work.

@rok-cesnovar
Member

Just as FYI: I did find performance regression for SIR, yes, but also for a mixture logistic regression model!

Oh, thanks for the info. Then it will be a bit more challenging to get an intuition on.

IMHO g++ < 5 and clang < 8 is not going to be a lot of our user base. It's mostly going to be RTools 3.5 Windows users, who will hopefully upgrade sooner rather than later. Anyone who cares much about performance and execution times has probably already upgraded to RTools 4.0. The last 3 Ubuntu LTS releases come with a compiler version >= 5.

I am not so sure about Macs, but if I am reading this right, the supported clang versions in Xcode for Sierra and anything above are fine for us. I don't think there are many users still left on El Capitan and older.

@SteveBronder
Collaborator Author

Thanks! The performance tests look good.

np!

I am a bit confused as to which clang versions allow these flags. In the Summary you mention clang 8 but in the side effects you write: "the most important one being that if a user uses a clang-[7-10] they must also have llvm-ar-[7-10] hooked into their path as well for flto.". I think you mention that there is some confusion with clang and versions anyways?

Yes sorry I submitted this before bed. This was a previous goof. If the user is using clang > 7 then we can use the system ar. I'll copy/paste what I wrote above

So this is a clang oddity. clang-6 and clang-7 both give 4.2.1 for -dumpversion. Using -v instead of -dumpversion can give some weird results across operating systems and requires some extra trickery to handle the gotchas. So while we could support flto by default with clang-7, clang-6 needs llvm-ar. And since we can't tell them apart, I think it's just easier to only support flto by default for systems with clang > 7.
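To make the "-v trickery" concrete, here is a rough sketch of what parsing the version banner could look like. It is an illustration only and not what this PR does. The Apple banner is the one from the Jenkins machine in this thread; the Ubuntu-style banner is an assumed example. As far as I know, Apple's version numbers follow Xcode releases rather than upstream LLVM releases, so the two cannot be compared directly, which is one of the gotchas. The function and pattern names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches banners such as "clang version 10.0.0-4ubuntu1" (assumed example)
# and "Apple LLVM version 7.0.2 (clang-700.1.81)".
BANNER_PATTERN = re.compile(r"(clang|Apple LLVM|Apple clang) version (\d+)\.(\d+)")


def parse_banner(first_line: str) -> Optional[Tuple[str, int, int]]:
    """Return (vendor, major, minor) from the first line of the banner, or None."""
    match = BANNER_PATTERN.search(first_line)
    if match is None:
        return None
    # Apple numbering is not the upstream LLVM numbering, so keep the vendor
    # alongside the version instead of comparing the numbers blindly.
    vendor = "apple" if match.group(1).startswith("Apple") else "llvm"
    return vendor, int(match.group(2)), int(match.group(3))


print(parse_banner("Apple LLVM version 7.0.2 (clang-700.1.81)"))
print(parse_banner("clang version 10.0.0-4ubuntu1"))
```

Even with the banner parsed, a separate rule per vendor would still be needed, which is why gating on -dumpversion alone is the simpler option.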

Another question is regarding "they must also have llvm-ar-[7-10] hooked into their path". What does this mean for a typical Stan user who is not an expert on C++ toolchains? Is this installed for Mac users with the Xcode toolchain, and does it work out of the box? Does one have to do anything special on Ubuntu apart from "sudo apt install clang"?

My b you can ignore that

I am not as worried about anyone who compiles gcc on their own. Those who do will not need much assistance with setting up a working toolchain anyway.

That's what I figure as well

One of the interesting things I found out is that several of these tests have a pretty wide variance.

Yep, when doing some performance benchmarks in the past, I had to run like 40000 iterations for garch, gp_regr, eight_schools, and other quick models to get consistent timings.

Ugh yeah it makes me question whether we should really use those for benchmarks. Maybe we need to change up how we do the performance tests to something like how google perf does them where they run N times till they get a stable estimate.
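A minimal sketch of the "run N times till they get a stable estimate" idea is below. It is not how any existing benchmark harness is implemented; the stopping rule (relative standard error of the mean below a tolerance), the default values, and the function name are all assumptions for illustration. The sample argument is any callable that runs the model once and returns its wall time in seconds.

```python
import statistics
from typing import Callable, List


def measure_until_stable(sample: Callable[[], float],
                         rel_tol: float = 0.02,
                         min_runs: int = 5,
                         max_runs: int = 100) -> List[float]:
    """Collect timings until the mean is stable or max_runs is reached.

    Stops once the standard error of the mean is at most rel_tol times the
    mean. min_runs must be at least 2 so a standard deviation exists.
    """
    timings: List[float] = []
    while len(timings) < max_runs:
        timings.append(sample())
        if len(timings) >= min_runs:
            mean = statistics.mean(timings)
            sem = statistics.stdev(timings) / len(timings) ** 0.5
            if mean > 0 and sem / mean <= rel_tol:
                break
    return timings
```

Quick models such as garch or eight_schools would then simply be run more often than sir, instead of every model getting the same fixed number of runs.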

@SteveBronder
Collaborator Author

Just as FYI: I did find performance regression for SIR, yes, but also for a mixture logistic regression model!

Thanks for putting this together! I expected that this would be quick and easy, but it looks like it was a lot of work.

Yes these flags should fix up both of those cases. I can look at that logistic regression you posted as well. May be a good idea to add it to the perf tests.

IMHO g++ < 5 and clang < 8 is not going to be a lot of our user base. It's mostly going to be RTools 3.5 Windows users, who will hopefully upgrade sooner rather than later. Anyone who cares much about performance and execution times has probably already upgraded to RTools 4.0. The last 3 Ubuntu LTS releases come with a compiler version >= 5.

Yep!

I am not so sure about Macs, but if I am reading this right, the supported clang versions in Xcode for Sierra and anything above are fine for us. I don't think there are many users still left on El Capitan and older.

Is our testing mac on El Capitan or Sierra? I know it has clang-7 as the default compiler so it's either an updated El Capitan or an outdated Sierra

@SteveBronder
Collaborator Author

Oh I just realized this is PR #2020

Let's quadruple check this one so it doesn't blow up and ruin everything

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 4.18 4.15 1.01 0.93% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.03 3.24% faster
eight_schools/eight_schools.stan 0.09 0.09 0.99 -1.08% slower
gp_regr/gp_regr.stan 0.2 0.19 1.01 1.34% faster
irt_2pl/irt_2pl.stan 5.4 5.45 0.99 -0.86% slower
performance.compilation 87.44 86.27 1.01 1.34% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 7.74 7.52 1.03 2.83% faster
pkpd/one_comp_mm_elim_abs.stan 26.27 25.99 1.01 1.09% faster
sir/sir.stan 130.35 140.49 0.93 -7.77% slower
gp_regr/gen_gp_data.stan 0.05 0.04 1.05 4.39% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.03 2.94 1.03 2.92% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.41 0.39 1.07 6.47% faster
arK/arK.stan 1.81 1.75 1.03 3.21% faster
arma/arma.stan 0.74 0.97 0.76 -30.9% slower
garch/garch.stan 0.53 0.51 1.04 4.08% faster
Mean result: 1.00023283158

Jenkins Console Log
Blue Ocean
Commit hash: 3ead9cb


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@rok-cesnovar
Member

Oh I just realized this is PR #2020
Let's quadruple check this one so it doesn't blow up and ruin everything

Or let's just merge and get over it ASAP :)

Joking aside, this looks good now, there just seems to be a Windows related issue at the Cmdstan level. Will see if I can debug it a bit more.

@SteveBronder
Collaborator Author

Ah, mingw's gcc ran into this compiler bug. It was fixed in gcc-8. Just to be safe, I'm not turning on flto for anything less than gcc-8, because I couldn't tell from the discussion whether the patch was backported to earlier gcc versions.

@rok-cesnovar
Member

Not sure I understand. This Windows machine uses g++ 8.3.0. How does setting the limit to 7 help here?

@SteveBronder
Collaborator Author

Oh shoot I thought that part of the testing used Rtools 3.5. I'll have to dive into that more to figure out what's going on then

@SteveBronder
Collaborator Author

Hmm okay this hints that the issue is when a linked library is not compiled with flto but other parts are. I'll have to play around with this

@rok-cesnovar
Member

All Jenkins agents are now running RTools 4.0. The RTools 3.5 tests were moved to Github Actions. There will be an overview doc of all of this once the CI refactor is done.

@rok-cesnovar
Member

Hm, ok, then it might be the boost program options we build in cmdstan and statically link. Not sure where this pops up. Will take a look on my Windows machine.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 4.2 4.07 1.03 3.13% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 1.03 3.0% faster
eight_schools/eight_schools.stan 0.09 0.09 1.03 3.34% faster
gp_regr/gp_regr.stan 0.2 0.19 1.04 3.68% faster
irt_2pl/irt_2pl.stan 5.37 5.37 1.0 -0.02% slower
performance.compilation 87.4 86.11 1.01 1.48% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 7.68 7.49 1.03 2.5% faster
pkpd/one_comp_mm_elim_abs.stan 27.17 26.67 1.02 1.87% faster
sir/sir.stan 132.69 140.2 0.95 -5.66% slower
gp_regr/gen_gp_data.stan 0.04 0.04 1.0 -0.16% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.04 3.0 1.01 1.15% faster
pkpd/sim_one_comp_mm_elim_abs.stan 0.41 0.38 1.07 6.11% faster
arK/arK.stan 1.8 1.78 1.02 1.57% faster
arma/arma.stan 0.75 0.96 0.78 -28.61% slower
garch/garch.stan 0.53 0.51 1.04 3.88% faster
Mean result: 1.00339652968

Jenkins Console Log
Blue Ocean
Commit hash: 506b98a


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@rok-cesnovar
Member

Can confirm the suspicion. You can build the model; the problem is make build, which also builds bin/stansummary, and that requires Boost program options. Stansummary is statically linked in. I presume that lib needs -flto and the rest of the flags?

I am guessing you need to add the flags here: https://github.com/stan-dev/cmdstan/blob/develop/make/command
Let me know if I can help/test anything.

@SteveBronder
Collaborator Author

@rok-cesnovar what do you think about just adding the makevar variables in this PR then in cmdstan adding these optimization flags? I think doing these flags upstream gives us a larger view on what we want to use flto on. I also don't totally trust windows and would bet a dollar we'll have to do something particular for rtools4.0 and pystan

@rok-cesnovar
Member

rok-cesnovar commented Aug 20, 2020

Yes, that seems like a good plan. Having them defined in math and used in cmdstan seems fine to me.

@stan-buildbot
Contributor


Name Old Result New Result Ratio Performance change( 1 - new / old )
gp_pois_regr/gp_pois_regr.stan 4.16 4.23 0.98 -1.78% slower
low_dim_corr_gauss/low_dim_corr_gauss.stan 0.02 0.02 0.98 -1.78% slower
eight_schools/eight_schools.stan 0.09 0.09 0.97 -3.2% slower
gp_regr/gp_regr.stan 0.2 0.19 1.02 2.07% faster
irt_2pl/irt_2pl.stan 5.26 5.23 1.01 0.64% faster
performance.compilation 88.88 87.33 1.02 1.73% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan 8.21 8.24 1.0 -0.35% slower
pkpd/one_comp_mm_elim_abs.stan 27.2 26.77 1.02 1.58% faster
sir/sir.stan 134.37 140.58 0.96 -4.63% slower
gp_regr/gen_gp_data.stan 0.05 0.04 1.01 1.2% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan 3.3 3.34 0.99 -1.07% slower
pkpd/sim_one_comp_mm_elim_abs.stan 0.38 0.38 0.99 -1.13% slower
arK/arK.stan 1.84 1.85 0.99 -0.52% slower
arma/arma.stan 0.59 0.74 0.8 -24.98% slower
garch/garch.stan 0.62 0.73 0.85 -17.01% slower
Mean result: 0.972500184531

Jenkins Console Log
Blue Ocean
Commit hash: 0ce5180


Machine information ProductName: Mac OS X ProductVersion: 10.11.6 BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@SteveBronder
Collaborator Author

@rok-cesnovar I modified this to just create the makevar variables. I will open up another PR for cmdstan/rstan/pystan to add the optimization flags

@rok-cesnovar
Member

Cool. The only thing I am a bit worried about is math performance tests won't have these flags now that they will live in Cmdstan. But I guess that can be handled separately?

@SteveBronder
Collaborator Author

I thought the performance tests ran using the develop version of cmdstan, is that false?

@rok-cesnovar
Member

rok-cesnovar commented Aug 23, 2020

Sorry, I mean if someone will do performance tests with Math-only (with Google benchmark for example), that won't be representative of how it will actually run in the context of Cmdstan.

rstan/pystan do not use the math makefiles anyway, so it's a different story there.

@SteveBronder
Collaborator Author

Yeah that's true, though imo that's fine. It's still technically a regression in the math lib, but an upstream performance improvement.

Member

@rok-cesnovar rok-cesnovar left a comment

LGTM

@rok-cesnovar rok-cesnovar merged commit 0591065 into develop Aug 24, 2020
@rok-cesnovar rok-cesnovar deleted the feature/flto-flags branch August 24, 2020 06:47
Successfully merging this pull request may close these issues.

Performance regression due to absence of lto
4 participants