Consistent significance #956
Conversation
Can you say more about wanting something different for triage? I think our goal in the long run should probably be that human triage is focused primarily on working through the perf-regression label, with triage reports being automatically generated and only reviewed for noise (feeding into adjusting our heuristics, likely), primarily prepared for TWIR and compiler team meetings to provide loose awareness of ongoing work.
What I'm really getting at is: do we believe that the significance thresholds we have (0.1% for non-dodgy test cases and 1% for dodgy ones) are the right ones, and are significance thresholds alone what should determine whether an entire run is classified as a regression? It seems like most performance runs tend to return at least one test result that is significant. For example, here, do we want this to be flagged as a regression? My feeling is no, but we would with the changes in this PR. It feels like marking a PR as a regression/improvement/mixed should not be based on one test result showing up as negative, but rather on some threshold of the number of significant test results. In other words, "significance" is localized to a given test case based on historical data. However, when interpreting whether an entire run should be classified as a regression, it's helpful to look across test results for different test cases. If only one test result shows a significant change, that's not a strong case that the PR as a whole is a regression.
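As a concrete illustration of the per-test-case thresholds mentioned above (0.1% normally, 1% for dodgy test cases), here is a minimal Rust sketch; `TestResult`, its fields, and `is_significant` are hypothetical names for illustration, not the actual rustc-perf types:

```rust
/// Hypothetical view of one test result in a comparison.
struct TestResult {
    /// Relative change between the two artifacts, e.g. 0.003 for +0.3%.
    relative_change: f64,
    /// Whether historical data marks this test case as noisy ("dodgy").
    is_dodgy: bool,
}

impl TestResult {
    /// Per-test-case significance under the thresholds discussed above:
    /// 0.1% for ordinary test cases, 1% for dodgy ones.
    fn is_significant(&self) -> bool {
        let threshold = if self.is_dodgy { 0.01 } else { 0.001 };
        self.relative_change.abs() >= threshold
    }
}

fn main() {
    let quiet = TestResult { relative_change: 0.003, is_dodgy: false };
    let noisy = TestResult { relative_change: 0.003, is_dodgy: true };
    // A 0.3% change clears the 0.1% bar but not the 1% bar.
    assert!(quiet.is_significant());
    assert!(!noisy.is_significant());
}
```

The open question in the comment above is what happens after this per-test-case step: how many such significant results it takes before the whole run deserves a label.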
Yeah, I think we want to iterate on our heuristics for sure, no objections there. I think we can go a long way with the current approach(es) though. For your example, I'm not convinced it wasn't a localized regression, though it's definitely hard to say. If we look at the graph: The spike on the very left is the commit with the encoding-opt regression; the subsequent drop seems to be reproduced across many benchmarks -- so may not be related to this increase. OTOH, I do think that spending time investigating a single benchmark 0.3% change is likely not worth it in most cases, so highlighting this on PR diffs and elsewhere may not be useful. In practice, I think we likely want several levels of "confidence" and show those in different places. For example, if we presume we had 3 levels (maybe interesting, probably interesting, definitely interesting) -- let's assume all of these are composed of 'significant' changes under the current 0.1% / 1% heuristic -- then maybe a reasonable approach is:
I think we can try to define maybe/probably/definitely based on the number of improvements & regressions. Maybe start with something simple:
The goal would then be to take this and enhance it; for example, maybe noticing "hey, all the changes are in -opt builds" might lower the significance threshold? Not sure.
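To make the three-level idea concrete, here is one possible encoding as a sketch; the `Confidence` enum and the cutoffs (1 and 3 significant changes) are purely illustrative assumptions, not anything settled in this thread:

```rust
/// Illustrative confidence levels for a whole comparison.
#[derive(Debug, PartialEq)]
enum Confidence {
    MaybeRelevant,
    ProbablyRelevant,
    DefinitelyRelevant,
}

/// Classify a comparison by how many of its test results cleared the
/// per-test-case significance threshold. The cutoffs are placeholders.
fn confidence(significant_changes: usize) -> Option<Confidence> {
    match significant_changes {
        0 => None,
        1 => Some(Confidence::MaybeRelevant),
        2..=3 => Some(Confidence::ProbablyRelevant),
        _ => Some(Confidence::DefinitelyRelevant),
    }
}

fn main() {
    // One significant result: perhaps show on the comparison page, skip in triage.
    assert_eq!(confidence(1), Some(Confidence::MaybeRelevant));
    // Many significant results: label the PR and include it in the triage report.
    assert_eq!(confidence(10), Some(Confidence::DefinitelyRelevant));
}
```

Each level could then be surfaced in a different place (comparison page, PR label, triage report), which is the "show those in different places" idea from the comment above.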
@Mark-Simulacrum Thanks for the great feedback! I added some docs which attempt to outline in prose how we do performance analysis. I also added the idea of summary analysis confidence, and we only show triage cases where we're sure that the summary is relevant.
Docs look great -- thanks! Are we running the variance stuff in production today? I'm surprised how quickly we can compute those on the comparison page if so -- I'd have expected us to need a beefier machine or optimizations on the database layer or something. But if we can get away with it, that seems great :)
8d03db3 to 960c67f
@Mark-Simulacrum we are already running the variance stuff in production, and it seems to be running fine 😊
@Mark-Simulacrum this is getting close, but I'm still not fully satisfied. I believe our understanding of what counts as a regression, improvement, or mixed performance change is a bit more nuanced than what we have encoded here so far. For example, this comparison seems to be pretty clearly a regression. There is, however, one test case that yields an improvement. Under these changes, this comparison would be labeled as a "mixed" performance change, because we currently decide how to label comparisons based solely on whether they contain exclusively regressions, exclusively improvements, or a mix of the two. This lacks any consideration of the magnitude and number of each kind of change. Magnitude and number interact in interesting ways. For example, take the following comparisons. How would you classify them?
It's hard to encode where the line is, but here's my attempt:
The label a comparison receives (improvement, regression or mixed) is based on the following:
I think this is a good start. We could make further improvements at some point, for example by adding more nuance to the labeling: we could still label a comparison as mixed but give an indication if one type of change is more apparent.
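Here is a rough sketch of how magnitude and count could interact when picking a label; the 1% magnitude cutoff, the 3x weight for large changes, and the 2:1 ratio are made-up assumptions for illustration, not the exact rules implemented in this PR:

```rust
#[derive(Debug, PartialEq)]
enum Label {
    Improvement,
    Regression,
    Mixed,
}

/// One significant test result: positive = regression, negative = improvement.
struct Change {
    relative_change: f64,
}

/// Weight a change by magnitude so one tiny improvement cannot turn a
/// clear regression into "mixed". The 1% cutoff and 3x weight are placeholders.
fn weight(c: &Change) -> f64 {
    if c.relative_change.abs() >= 0.01 { 3.0 } else { 1.0 }
}

fn label(changes: &[Change]) -> Label {
    let regressions: f64 = changes.iter().filter(|c| c.relative_change > 0.0).map(weight).sum();
    let improvements: f64 = changes.iter().filter(|c| c.relative_change < 0.0).map(weight).sum();
    if regressions > 2.0 * improvements {
        Label::Regression
    } else if improvements > 2.0 * regressions {
        Label::Improvement
    } else {
        Label::Mixed
    }
}

fn main() {
    // Two large regressions plus one small improvement still reads as a regression,
    // rather than "mixed".
    let changes = vec![
        Change { relative_change: 0.02 },
        Change { relative_change: 0.015 },
        Change { relative_change: -0.002 },
    ];
    assert_eq!(label(&changes), Label::Regression);
}
```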
@Mark-Simulacrum I updated this to take the magnitude of changes into account. I've tested it with a few triage runs, and it looks relatively good to me. What do you think about merging this and keeping an eye out for how the labeling ends up in practice?
Yeah, that seems like a good start. I'm a little worried about how complicated we're making it -- but I think most people likely don't need to know the rules. If we were to try and extend or improve it, I might try to look at diagnosing regression/mixed/improvement across the board but also within each "scenarioish" (basically ignoring the benchmark, but including everything else). That would for example let us say "regression on incremental builds" but "improvement on non-incremental". Anyway, seems good for now; we can play it by ear. I think it might be useful to have triage reports generated with a much more primitive heuristic for several weeks (e.g., just significance) but mark them as "would have been shown under the current proposed advanced algorithm" -- that way it's much easier to evaluate whether we agree with it.
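The per-scenario grouping idea could look roughly like the sketch below; the `Change` struct and the profile/scenario strings are assumptions chosen for illustration, not the actual rustc-perf data model:

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of a significant test result, tagged with everything
/// except the benchmark name ("scenarioish": profile + scenario).
struct Change {
    profile: &'static str,   // e.g. "opt", "debug", "check"
    scenario: &'static str,  // e.g. "full", "incr-full", "incr-patched"
    relative_change: f64,    // positive = regression, negative = improvement
}

fn main() {
    let changes = vec![
        Change { profile: "debug", scenario: "incr-full", relative_change: 0.012 },
        Change { profile: "debug", scenario: "incr-full", relative_change: 0.008 },
        Change { profile: "opt", scenario: "full", relative_change: -0.004 },
    ];

    // Group by (profile, scenario), ignoring the benchmark, and report the
    // direction of the significant changes inside each group.
    let mut groups: BTreeMap<(&str, &str), (usize, usize)> = BTreeMap::new();
    for c in &changes {
        let entry = groups.entry((c.profile, c.scenario)).or_insert((0, 0));
        if c.relative_change > 0.0 { entry.0 += 1 } else { entry.1 += 1 }
    }
    for ((profile, scenario), (regressions, improvements)) in groups {
        println!("{profile}/{scenario}: {regressions} regressions, {improvements} improvements");
    }
}
```

Each group could then be labeled with the same improvement/regression/mixed rule used for the comparison as a whole, giving statements like "regression on incremental builds".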
This moves the definition of significance to one central place (almost) in the API, instead of it being calculated both on the client and in the API.
Some things to note:
Fixes #952
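For a rough picture of what computing significance "in one central place" on the API side could mean, here is a sketch with illustrative names (`ComparisonEntry`, `significance_threshold`, `make_entry`) that are not the actual rustc-perf API; the thresholds are the 0.1% / 1% values from the thread:

```rust
/// Illustrative shape of a comparison entry as the API might return it,
/// with significance already decided server-side.
#[derive(Debug)]
struct ComparisonEntry {
    benchmark: String,
    relative_change: f64,
    is_significant: bool,
}

/// The single place where the threshold lives: 0.1% normally,
/// 1% for dodgy test cases.
fn significance_threshold(is_dodgy: bool) -> f64 {
    if is_dodgy { 0.01 } else { 0.001 }
}

fn make_entry(benchmark: &str, relative_change: f64, is_dodgy: bool) -> ComparisonEntry {
    ComparisonEntry {
        benchmark: benchmark.to_string(),
        relative_change,
        is_significant: relative_change.abs() >= significance_threshold(is_dodgy),
    }
}

fn main() {
    let e = make_entry("regex-opt", 0.004, false);
    // The client just reads `is_significant` instead of re-deriving it.
    println!("{e:?}");
}
```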