Adds flags for compiler optimizations #2020
Conversation
Thanks! The performance tests look good.
I am a bit confused as to which clang versions allow these flags. In the Summary you mention clang 8, but in the side effects you write: "the most important one being that if a user uses a clang-[7-10] they must also have llvm-ar-[7-10] hooked into their path as well for flto." I think you mention that there is some confusion with clang versions anyways?
Another question is regarding "they must also have llvm-ar-[7-10] hooked into their path". What does this mean for a typical Stan user who is not an expert on C++ toolchains? Is this installed for Mac users with the Xcode toolchain and does it work out of the box? Does one have to do anything special on Ubuntu apart from "sudo apt install clang"?
I am not as worried about anyone that compiles gcc on their own. Those that do don't require much assistance with setting up a working toolchain anyways.
One of the interesting things I found out is that several of these tests have a pretty wide variance.
Yep, when doing some performance benchmarks in the past, I had to run like 40000 iterations for garch, gp_regr, eight_schools, and other quick models to get consistent timings.
We also need a quick summary to understand in which cases there is a regression. If I understand correctly then the logic is:
Just as an FYI: I did find a performance regression for SIR, yes, but also for a mixture logistic regression model! Thanks for putting this together, it looks like a lot of work! I expected that this would be quick and easy, but it looks like it's a lot of work.
Oh, thanks for the info. Then it will be a bit more challenging to get an intuition on. IMHO g++ < 5 and clang < 8 is not going to be a lot of our user base. It's mostly going to be RTools 3.5 Windows users, who hopefully will upgrade sooner rather than later. Anyone that cares much about performance and execution times has probably already upgraded to RTools 4.0. The last 3 Ubuntu LTS releases come with a compiler version >= 5. I am not so sure on Macs, but if I am reading this right, the supported clang versions in Xcode for anything at or above Sierra are fine for us. I don't think there are many users still left on El Capitan and older.
np!
Yes, sorry, I submitted this before bed. This was a previous goof. If the user is using clang > 7 then we can use the system ar. So this is a clang oddity: clang-6 and clang-7 both give 4.2.1 for -dumpversion. Using -v instead of -dumpversion can give some weird results across operating systems and requires some extra trickery and gotchas to figure out. So while we could support flto by default with clang-7, clang-6 needs the llvm-ar. And since we can't tell them apart, I think it's just easier to only support flto by default for systems with clang > 7.
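For anyone following along, here is a minimal sketch of what that version check looks like in practice. This is illustrative only, not the PR's actual makefile code; the variable names are assumptions.

```make
# Illustrative sketch (not the PR's makefile): query the compiler version the
# way a GNU Make file might. CXX defaults to clang++ here just for the demo.
CXX ?= clang++

# Per the discussion above, clang-6 and clang-7 both print 4.2.1 for
# -dumpversion, so the parsed major version is "4" for both and they cannot
# be told apart; clang >= 8 reports its real version.
CXX_VERSION := $(shell $(CXX) -dumpversion)
CXX_MAJOR   := $(firstword $(subst ., ,$(CXX_VERSION)))

$(info -dumpversion reports $(CXX_VERSION), parsed major version: $(CXX_MAJOR))
```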
My bad, you can ignore that.
That's what I figure as well
Ugh, yeah, it makes me question whether we should really use those for benchmarks. Maybe we need to change up how we do the performance tests to something like how Google Benchmark does them, where they run N times until they get a stable estimate.
Yes, these flags should fix up both of those cases. I can look at that logistic regression you posted as well. It may be a good idea to add it to the perf tests.
Yep!
Is our testing Mac on El Capitan or Sierra? I know it has clang-7 as the default compiler, so it's either an updated El Capitan or an outdated Sierra.
Oh, I just realized this is PR #2020. Let's quadruple check this one so it doesn't blow up and ruin everything.
Jenkins Console Log, machine information: ProductName: Mac OS X, ProductVersion: 10.11.6, BuildVersion: 15G22010.
Or let's just merge and get over it ASAP :) Joking aside, this looks good now; there just seems to be a Windows-related issue at the Cmdstan level. Will see if I can debug it a bit more.
Ah, MinGW's gcc ran into this compiler bug. This was fixed in gcc-8; just to be safe, I'm not turning on flto for anything less than gcc-8 because I couldn't tell from the discussion whether this patch was backported to earlier gcc versions.
Not sure I understand. This Windows machine uses g++ 8.3.0. How does setting the limit to 7 help here?
Oh shoot, I thought that part of the testing used RTools 3.5. I'll have to dive into that more to figure out what's going on then.
Hmm, okay, this hints that the issue is when a linked library is not compiled with flto but other parts are. I'll have to play around with this.
All Jenkins agents are now running RTools 4.0. The RTools 3.5 tests were moved to GitHub Actions. There will be an overview doc of all of this once the CI refactor is done.
Hm, ok, then it might be the Boost program options we build in cmdstan and statically link. Not sure where this pops up. Will take a look on my Windows machine.
Can confirm the suspicion. You can build the model; the problem is, I am guessing, that you need to add the flags here: https://github.com/stan-dev/cmdstan/blob/develop/make/command
@rok-cesnovar what do you think about just adding the makevar variables in this PR and then adding these optimization flags in cmdstan? I think defining these flags upstream gives us a larger view of what we want to use flto on. I also don't totally trust Windows and would bet a dollar we'll have to do something particular for RTools 4.0 and pystan.
Yes, that seems like a good plan. Having them defined in math and using them in cmdstan seems fine to me.
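To make the split concrete, here is a rough sketch of the kind of arrangement being proposed. The variable name CXXFLAGS_OPTIM and the example flags are hypothetical, not necessarily what the PR or cmdstan ends up using.

```make
# Hypothetical sketch of the proposed split; CXXFLAGS_OPTIM is an illustrative
# name, not necessarily the variable the PR defines.

# In the math makefiles: define the optimization flags, but do not force them
# into every build.
CXXFLAGS_OPTIM ?= -O3 -flto

# In cmdstan (e.g. in make/command or its main makefile): opt in by adding the
# variable to both the compile and the link lines, since -flto must also be
# passed at link time.
CXXFLAGS += $(CXXFLAGS_OPTIM)
LDFLAGS  += $(CXXFLAGS_OPTIM)
```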
@rok-cesnovar I modified this to just create the makevar variables. I will open up another PR for cmdstan/rstan/pystan to add the optimization flags.
Cool. The only thing I am a bit worried about is that math performance tests won't have these flags now that they will live in Cmdstan. But I guess that can be handled separately?
I thought the performance tests ran using the develop version of cmdstan, is that false?
Sorry, I mean that if someone does performance tests with Math only (with Google Benchmark, for example), that won't be representative of how it will actually run in the context of Cmdstan. rstan/pystan do not use the math makefiles anyways, so it's a different story there.
Yeah, that's true, though IMO I think that's fine. It's still technically a regression in the math lib, but an upstream performance improvement.
LGTM
Summary
After adding a template parameter to vari we saw a 20% regression in the SIR model. This PR includes some extra compiler flag options so that if a user has either clang >= 8 or gcc >= 5, flto and other compiler optimization flags will turn on. flto will only turn on for clang >= 8 because clang-6 and clang-7's -dumpversion both give 4.2.1, so we cannot tell them apart. clang-6 requires the use of either the ar gold linker plugin or llvm-ar. Since the gold plugin is a non-standard extension for some versions of ar, we turn off flto by default for both clang 6 and 7.
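As a rough sketch of the version gating described above (illustrative only, not the PR's exact makefile code; CXX_TYPE and CXXFLAGS_OPTIM are assumed names), the logic might look something like this:

```make
# Illustrative sketch of the version gate described in this summary; the
# variable names (CXX_TYPE, CXXFLAGS_OPTIM) are assumptions, not the PR's code.
CXX_MAJOR := $(firstword $(subst ., ,$(shell $(CXX) -dumpversion)))

ifeq ($(CXX_TYPE),gcc)
  # gcc reports its real version via -dumpversion, so gcc >= 5 is detectable.
  ifeq ($(shell test $(CXX_MAJOR) -ge 5 && echo true),true)
    CXXFLAGS_OPTIM += -flto
  endif
else ifeq ($(CXX_TYPE),clang)
  # clang-6 and clang-7 both report 4.2.1, so only clang >= 8 is detectable;
  # older clang would need llvm-ar or the gold plugin for -flto anyway.
  ifeq ($(shell test $(CXX_MAJOR) -ge 8 && echo true),true)
    CXXFLAGS_OPTIM += -flto
  endif
endif
```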
Tests
This is a make feature, so there are no new tests. Because our benchmarks use clang-6 they won't give the new benchmark values, so I will paste them here. These tests were run on Ubuntu 20.04 with an AMD Threadripper 1950x and DDR4 RAM. Tests for each compiler were run 10 times and then averaged together.
Clang-10
The clang-10 results were run with 5000 warmup and 5000 sampling iterations, though overall I didn't see the increased iteration count produce more reliable test numbers.
GCC-10
Because I didn't see any increased reliability from running more than the standard number of iterations, the gcc results were run with 1000 warmup and 1000 sampling iterations.
Side Effects
There are some compiler flags which require the gcc graphite compiler extension, though to my knowledge this comes with most standard builds of gcc unless you build gcc from source, in which case I believe you need to specify whether the graphite extension should be compiled or not. If a user does not have graphite set up, a warning is given and the program continues to run.
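Per the paragraph above, the PR simply lets that warning through. As a purely illustrative alternative (the flag list and the CXXFLAGS_OPTIM name are assumptions, not the PR's code), a makefile could probe whether the compiler silently accepts a graphite-dependent flag before adding it:

```make
# Illustrative probe, not the PR's logic: compile an empty program with
# graphite-dependent flags and only keep them if no diagnostics are emitted.
GRAPHITE_FLAGS := -fgraphite-identity -floop-nest-optimize
GRAPHITE_CHECK := $(shell echo 'int main(){}' | $(CXX) $(GRAPHITE_FLAGS) -x c++ -o /dev/null - 2>&1)

ifeq ($(GRAPHITE_CHECK),)
  CXXFLAGS_OPTIM += $(GRAPHITE_FLAGS)
endif
```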
One of the interesting things I found out is that several of these tests have a pretty wide variance. For instance, here are the results of running develop against develop 10 times:
---RESULTS---
So low_dim_corr_gauss and gen_gp_data can have pretty large variance with even a decent number of runs.
Release notes
Add compiler flags to give better optimizations
Checklist
Closes: Performance regression due to absence of lto (#2008)
Copyright holder: Steve Bronder
The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
- the basic tests are passing
  - unit tests (./runTests.py test/unit)
  - header checks (make test-headers)
  - dependency checks (make test-math-dependencies)
  - docs build (make doxygen)
  - code lint (make cpplint)
- the code is written in idiomatic C++ and changes are documented in the doxygen
- the new changes are tested