
Consistent significance #956


Merged
merged 12 commits into rust-lang:master on Aug 11, 2021

Conversation

rylev
Member

@rylev rylev commented Aug 4, 2021

This moves the definition of significance to one central place in the API (almost), instead of it being calculated both on the client and in the API.

Some things to note:

  • We're now using the same significance threshold for both triage reports and the comparison.html page. The threshold has been placed at 0.1% for test cases not considered "dodgy" (i.e., showing noise or high variability) and 1% for test cases considered dodgy (see the sketch after this list).
  • This is a change to the triage thresholds from before, which will likely lead to more cases being identified in the triage report. TODO: we should consider changing how we handle triage so that it does not rely purely on what we consider "significant". Perhaps we can classify changes as noteworthy for the triage report if they meet an additional threshold - perhaps 2 times the significance threshold. I'm very open to ideas here.
  • We are using a mix of log change and relative change, and it's not clear why, so I started moving more exclusively to relative change.
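
For reference, here is a minimal sketch of the threshold selection described above. It is not the exact code in this PR; the `is_dodgy` flag and the function names are placeholders for whatever the historical variance data actually provides.

```rust
/// Per-test-case significance threshold, as described above (sketch only).
/// `is_dodgy` stands in for whatever "noisy / high variability" flag the
/// historical variance data provides.
fn significance_threshold(is_dodgy: bool) -> f64 {
    if is_dodgy {
        0.01 // 1% for test cases with a noisy or highly variable history
    } else {
        0.001 // 0.1% for test cases with a stable history
    }
}

/// A change counts as significant once its relative change crosses the
/// threshold for that test case.
fn is_significant(relative_change: f64, is_dodgy: bool) -> bool {
    relative_change.abs() >= significance_threshold(is_dodgy)
}
```

Under these thresholds, a 0.5% change would be significant on a stable test case but not on a dodgy one.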

Fixes #952

@rylev rylev requested a review from Mark-Simulacrum August 4, 2021 14:22
@Mark-Simulacrum
Member

we should consider changing how we handle triage so that it does not rely purely on what we consider "significant". Perhaps we can classify changes as noteworthy for the triage report if they meet an additional threshold - perhaps 2 times the significance threshold. I'm very open to ideas here.

Can you say more about wanting something different for triage? I think our goal in the long run should probably be that human triage is focused primarily on working through the perf-regression label, with triage reports being automatically generated and only reviewed for noise (feeding into adjusting our heuristics, likely), primarily prepared for TWIR and compiler team meetings to provide loose awareness of ongoing work.

@rylev
Member Author

rylev commented Aug 4, 2021

What I'm really getting at is: do we believe that the significance thresholds we have (0.1% for non-dodgy test cases and 1% for dodgy ones) are the right ones, and should significance thresholds alone determine whether an entire run is classified as a regression? It seems like most performance runs return at least one test result that is significant. For example, here, do we want this run to be flagged as a regression? My feeling is no, but with the changes in this PR we would.

It feels like marking a PR as a regression, improvement, or mixed should not be based on a single test result showing up as negative, but rather on some threshold of the number of significant test results.

In other words, "significance" is localized to a given test case based on historical data. However, when interpreting whether an entire run should be classified as a regression, it's helpful to look across test results for different test cases. If only one test result is significant, that's not a strong case that the PR as a whole is a regression.

@Mark-Simulacrum
Member

Yeah, I think we want to iterate on our heuristics for sure, no objections there. I think we can go a long way with the current approach(es) though.

For your example, I'm not convinced it wasn't a localized regression, though it's definitely hard to say. If we look at the graph:

[graph image omitted]

The spike on the very left is the commit with the encoding-opt regression; the subsequent drop seems to be reproduced across many benchmarks -- so may not be related to this increase.

OTOH, I do think that spending time investigating a single benchmark 0.3% change is likely not worth it in most cases, so highlighting this on PR diffs and elsewhere may not be useful. In practice, I think we likely want several levels of "confidence" and show those in different places.

For example, if we presume we had 3 levels (maybe interesting, probably interesting, definitely interesting) -- let's assume all of these are composed of 'significant' changes under the current 0.1% / 1% heuristic -- then maybe a reasonable approach is:

  • On PR comments, only note regression/improvement for probably & definitely interesting changes.
  • In performance reports (e.g., for TWIR & compiler meetings), highlight only definitely interesting changes
  • In semiregular triage (weekly today), highlight maybe interesting changes only -- goal is to tune our heuristics and make sure we're not missing things.
  • In comparison pages, highlight probably & definitely interesting, but include (grayed out or not colored, etc) maybe interesting. Humans can try to gauge on the comparison page if there's something to them.

I think we can try to define maybe/probably/definitely based on the number of improvements & regressions. Maybe to start, something simple (sketched in code after this list):

  • Maybe: 0-3 changes (positive or negative)
  • Probably: 3-5 changes (positive or negative)
  • Definitely: >5 changes (positive or negative)
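
A rough sketch of how those buckets could be encoded is below. The names are hypothetical, every change counted is assumed to already pass the 0.1% / 1% significance heuristic, and since the ranges above overlap at 3 and 5, the sketch picks one reading of the boundaries.

```rust
/// Hypothetical confidence buckets for a whole comparison.
enum Confidence {
    Maybe,      // reviewed in semiregular (weekly) triage only
    Probably,   // also noted on PR comments and highlighted on comparison pages
    Definitely, // also highlighted in TWIR / compiler meeting reports
}

/// The prose ranges overlap at the boundaries (0-3, 3-5, >5); this sketch
/// resolves them as 0-3, 4-5, and 6 or more significant changes.
fn confidence(significant_changes: usize) -> Confidence {
    match significant_changes {
        0..=3 => Confidence::Maybe,
        4..=5 => Confidence::Probably,
        _ => Confidence::Definitely,
    }
}
```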

The goal would be to then take this and enhance it -- for example, maybe noticing "hey, all the changes are in -opt builds", which might lower the significance threshold? Not sure.

@rylev
Member Author

rylev commented Aug 4, 2021

@Mark-Simulacrum Thanks for the great feedback! I added some docs which attempt to outline in prose how we do performance analysis. I also added the idea of summary analysis confidence, and we only show triage cases where we're sure that the summary is relevant.

@Mark-Simulacrum
Member

Docs look great -- thanks!

Are we running the variance stuff in production today? I'm surprised how quickly we can compute those on the comparison page if so -- I'd have expected us to need a beefier machine or optimizations on the database layer or something. But if we can get away with it, that seems great :)

@rylev rylev force-pushed the consistent-significance branch from 8d03db3 to 960c67f on August 5, 2021 07:36
@rylev
Member Author

rylev commented Aug 5, 2021

@Mark-Simulacrum we are already running the variance stuff in production, and it seems to be running fine 😊

@rylev
Member Author

rylev commented Aug 5, 2021

@Mark-Simulacrum this is getting close, but I'm still not fully satisfied. I believe our understanding of what counts as a regression, improvement, or mixed performance change is a bit more nuanced than what we have encoded here so far.

For example, this comparison seems to be pretty clearly a regression. There is, however, one test case that yields an improvement, so under these changes this comparison would be labeled as a "mixed" performance change. This is because we currently decide how to label comparisons based solely on whether they contain exclusively regressions, exclusively improvements, or a mix of the two, without any consideration for the magnitude and number of each kind of change.

Magnitude and number interact in interesting ways. For example, take the following comparisons. How would you classify them?

  • 20 regressions, 1 improvement (the regressions range from 0.2% to 2%, the improvement is 0.3% change)
  • 20 regressions, 1 improvement (the regressions range from 0.2% to 2%, the improvement is a 20% change)
  • 20 regressions, 1 improvement (the regressions are all under 1%, the improvement is a 5% change)
  • 10 regressions, 10 improvements (the regressions are all between 1% and 5% and the improvements are all below 1% change)

It's hard to encode where the line is, but here's my attempt:
Each change receives a "magnitude" category:

  • Small: 0.2% (the minimum significant change) to 1%
  • Medium: 1% to 4%
  • Large: 4% to 10%
  • Very large: 10% and above

The label a comparison receives (improvement, regression or mixed) is based on the following rules (a rough code sketch follows this list):

  • If both improvements and regressions of medium or larger magnitude are present, then the comparison is mixed.
  • If there are only improvements or only regressions, then the comparison is labeled as that kind.
  • If one kind of change includes medium or larger magnitudes (and the other kind does not), then the comparison is mixed only if 15% or more of the total changes are of the other (small magnitude) kind. For example:
    • given 20 regressions (with at least 1 of medium magnitude) and only small-magnitude improvements, the comparison is mixed only if there are 4 or more improvements.
    • given 5 regressions (with at least 1 of medium magnitude) and only small-magnitude improvements, the comparison is mixed only if there is 1 or more improvement.
  • If all changes of both kinds are of small magnitude, then the comparison is mixed unless 90% or more of the total changes are of one kind. For example:
    • given 20 changes of both kinds, all of small magnitude, the result is mixed unless 2 or fewer of the changes are of one kind.
    • given 5 changes of both kinds, all of small magnitude, the result is always mixed.
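
Here is a rough sketch of those rules in code. The type and function names are mine, and the exact boundary handling (e.g., whether a 1% change counts as small or medium) is a choice made for the sketch, not necessarily what this PR implements.

```rust
#[derive(PartialEq, PartialOrd)]
enum Magnitude {
    Small,     // 0.2% (the minimum significant change) up to 1%
    Medium,    // 1% up to 4%
    Large,     // 4% up to 10%
    VeryLarge, // 10% and above
}

/// Bucket a relative change (e.g. 0.015 == 1.5%) into a magnitude category.
/// The input is assumed to already be a significant change.
fn magnitude(relative_change: f64) -> Magnitude {
    let pct = relative_change.abs() * 100.0;
    if pct < 1.0 {
        Magnitude::Small
    } else if pct < 4.0 {
        Magnitude::Medium
    } else if pct < 10.0 {
        Magnitude::Large
    } else {
        Magnitude::VeryLarge
    }
}

enum Label {
    Regression,
    Improvement,
    Mixed,
}

/// Label a comparison from its significant relative changes, following the
/// rules listed above.
fn label(regressions: &[f64], improvements: &[f64]) -> Label {
    // Only one kind of change present: label the comparison with that kind.
    if improvements.is_empty() {
        return Label::Regression;
    }
    if regressions.is_empty() {
        return Label::Improvement;
    }
    let total = (regressions.len() + improvements.len()) as f64;
    let reg_medium = regressions.iter().any(|&c| magnitude(c) >= Magnitude::Medium);
    let imp_medium = improvements.iter().any(|&c| magnitude(c) >= Magnitude::Medium);
    match (reg_medium, imp_medium) {
        // Both kinds contain a medium-or-larger change: mixed.
        (true, true) => Label::Mixed,
        // Only regressions reach medium or above: mixed only when the small
        // improvements make up 15% or more of all changes.
        (true, false) if improvements.len() as f64 / total >= 0.15 => Label::Mixed,
        (true, false) => Label::Regression,
        // Symmetric case when only improvements reach medium or above.
        (false, true) if regressions.len() as f64 / total >= 0.15 => Label::Mixed,
        (false, true) => Label::Improvement,
        // Everything is small: mixed unless 90% or more are of one kind.
        (false, false) if regressions.len() as f64 / total >= 0.9 => Label::Regression,
        (false, false) if improvements.len() as f64 / total >= 0.9 => Label::Improvement,
        (false, false) => Label::Mixed,
    }
}
```

Plugging in the worked example above: with 20 regressions (at least one of medium magnitude) and 3 small improvements, 3 / 23 ≈ 13% is below 15%, so the label stays "regression"; with 4 improvements, 4 / 24 ≈ 17% crosses the line and the label becomes "mixed".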

I think this is a good start. We could make further improvements at some point, for example by adding more nuance to the labeling: we could label a comparison as mixed but give an indication when one kind of change is more prominent.

@rylev
Member Author

rylev commented Aug 5, 2021

@Mark-Simulacrum I updated this to take the magnitude of changes into account. I've tested it with a few triage runs, and it looks relatively good to me. What do you think about merging this and keeping an eye on how the labeling works out in practice?

@Mark-Simulacrum
Member

Yeah, that seems like a good start. I'm a little worried about how complicated we're making it -- but I think most people likely don't need to know the rules.

I think if we were to try and extend it / improve it, I might try to look at diagnosing regression/mixed/improvement across the board but also within each "scenarioish" (basically ignoring the benchmark, but including everything else). That would for example let us say "regression on incremental builds" but "improvement on non-incremental".

Anyway, seems good for now; we can play it by ear. I think it might be useful to have triage reports generated with a much more primitive heuristic for several weeks (e.g., just significance) but mark them as "would have been shown under the current proposed advanced algorithm" -- that way it's much easier to evaluate whether we agree with it.

@Mark-Simulacrum Mark-Simulacrum merged commit 555043f into rust-lang:master Aug 11, 2021
@rylev rylev deleted the consistent-significance branch August 11, 2021 16:36