Consistent significance #956
Conversation
Can you say more about wanting something different for triage? I think our goal in the long run should probably be that human triage is focused primarily on working through the perf-regression label, with triage reports being automatically generated and only reviewed for noise (feeding into adjusting our heuristics, likely), primarily prepared for TWIR and compiler team meetings to provide loose awareness of ongoing work.
What I'm really getting at is: do we believe that the significance thresholds we have (0.1% for non-dodgy test cases and 1% for dodgy ones) are the right ones, and are significance thresholds alone what should determine whether an entire run is classified as a regression? It seems like most performance runs tend to return at least one test result that is significant. For example, here, do we want this to be flagged as a regression? My feeling is no, but we would with the changes in this PR. It feels like marking a PR as a regression/improvement/mixed should not be based on one test result showing up as negative, but rather on some threshold of the number of significant test results. In other words, "significance" is localized to a given test case based on historical data. However, when interpreting whether an entire run should be classified as a regression, it's helpful to look across test results for different test cases. If only one test result shows a significant change, that's not a strong case that the PR as a whole is a regression.
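As a concrete illustration of the per-test-case thresholds mentioned above (0.1% normally, 1% for dodgy test cases), here is a minimal Rust sketch; `TestResult`, its fields, and `is_significant` are hypothetical names for illustration, not the actual rustc-perf types:

```rust
/// Hypothetical view of one test result in a comparison.
struct TestResult {
    /// Relative change between the two artifacts, e.g. 0.003 for +0.3%.
    relative_change: f64,
    /// Whether historical data marks this test case as noisy ("dodgy").
    is_dodgy: bool,
}

impl TestResult {
    /// Per-test-case significance under the thresholds discussed above:
    /// 0.1% for ordinary test cases, 1% for dodgy ones.
    fn is_significant(&self) -> bool {
        let threshold = if self.is_dodgy { 0.01 } else { 0.001 };
        self.relative_change.abs() >= threshold
    }
}

fn main() {
    let quiet = TestResult { relative_change: 0.003, is_dodgy: false };
    let noisy = TestResult { relative_change: 0.003, is_dodgy: true };
    // A 0.3% change clears the 0.1% bar but not the 1% bar.
    assert!(quiet.is_significant());
    assert!(!noisy.is_significant());
}
```

The open question in the comment above is what happens after this per-test-case step: how many such significant results it takes before the whole run deserves a label.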
Yeah, I think we want to iterate on our heuristics for sure, no objections there. I think we can go a long way with the current approach(es) though. For your example, I'm not convinced it wasn't a localized regression, though it's definitely hard to say. If we look at the graph: The spike on the very left is the commit with the encoding-opt regression; the subsequent drop seems to be reproduced across many benchmarks -- so may not be related to this increase. OTOH, I do think that spending time investigating a single benchmark 0.3% change is likely not worth it in most cases, so highlighting this on PR diffs and elsewhere may not be useful. In practice, I think we likely want several levels of "confidence" and show those in different places. For example, if we presume we had 3 levels (maybe interesting, probably interesting, definitely interesting) -- let's assume all of these are composed of 'significant' changes under the current 0.1% / 1% heuristic -- then maybe a reasonable approach is:
I think we can try to define maybe/probably/definitely based on the number of improvements & regressions. Maybe start with something simple:
The goal would then be to take this and enhance it; for example, maybe noticing "hey, all the changes are in -opt builds" might lower the significance threshold? Not sure.
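To make the three-level idea concrete, here is one possible encoding as a sketch; the `Confidence` enum and the cutoffs (1 and 3 significant changes) are purely illustrative assumptions, not anything settled in this thread:

```rust
/// Illustrative confidence levels for a whole comparison.
#[derive(Debug, PartialEq)]
enum Confidence {
    MaybeRelevant,
    ProbablyRelevant,
    DefinitelyRelevant,
}

/// Classify a comparison by how many of its test results cleared the
/// per-test-case significance threshold. The cutoffs are placeholders.
fn confidence(significant_changes: usize) -> Option<Confidence> {
    match significant_changes {
        0 => None,
        1 => Some(Confidence::MaybeRelevant),
        2..=3 => Some(Confidence::ProbablyRelevant),
        _ => Some(Confidence::DefinitelyRelevant),
    }
}

fn main() {
    // One significant result: perhaps show on the comparison page, skip in triage.
    assert_eq!(confidence(1), Some(Confidence::MaybeRelevant));
    // Many significant results: label the PR and include it in the triage report.
    assert_eq!(confidence(10), Some(Confidence::DefinitelyRelevant));
}
```

Each level could then be surfaced in a different place (comparison page, PR label, triage report), which is the "show those in different places" idea from the comment above.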
@Mark-Simulacrum Thanks for the great feedback! I added some docs which attempt to outline in prose how we do performance analysis. I also added the idea of summary analysis confidence, and we only show triage cases where we're sure that the summary is relevant.
Docs look great -- thanks! Are we running the variance stuff in production today? I'm surprised how quickly we can compute those on the comparison page if so -- I'd have expected us to need a beefier machine or optimizations on the database layer or something. But if we can get away with it, that seems great :)
8d03db3 to 960c67f
@Mark-Simulacrum we are already running the variance stuff in production, and it seems to be running fine 😊
@Mark-Simulacrum this is getting close, but I'm still not fully satisfied. I believe our understanding of what counts as a regression, improvement, or mixed performance change is a bit more nuanced than what we have encoded here so far. For example, this comparison seems to be pretty clearly a regression. There is, however, one test case that yields an improvement. Under these changes, this comparison would be labeled as a "mixed" performance change, because we currently decide how to label comparisons based solely on whether they contain exclusively regressions, exclusively improvements, or a mix of the two. This lacks any consideration of the magnitude and number of each kind of change. Magnitude and number interact in interesting ways. For example, take the following comparisons. How would you classify them?
It's hard to encode where the line is, but here's my attempt:
The label a comparison receives (improvement, regression or mixed) is based on the following:
I think this is a good start. We could make further improvements at some point, for example by adding more nuance to the labeling: we could still label a comparison as mixed but give an indication if one type of change is more apparent.
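Here is a rough sketch of how magnitude and count could interact when picking a label; the 1% magnitude cutoff, the 3x weight for large changes, and the 2:1 ratio are made-up assumptions for illustration, not the exact rules implemented in this PR:

```rust
#[derive(Debug, PartialEq)]
enum Label {
    Improvement,
    Regression,
    Mixed,
}

/// One significant test result: positive = regression, negative = improvement.
struct Change {
    relative_change: f64,
}

/// Weight a change by magnitude so one tiny improvement cannot turn a
/// clear regression into "mixed". The 1% cutoff and 3x weight are placeholders.
fn weight(c: &Change) -> f64 {
    if c.relative_change.abs() >= 0.01 { 3.0 } else { 1.0 }
}

fn label(changes: &[Change]) -> Label {
    let regressions: f64 = changes.iter().filter(|c| c.relative_change > 0.0).map(weight).sum();
    let improvements: f64 = changes.iter().filter(|c| c.relative_change < 0.0).map(weight).sum();
    if regressions > 2.0 * improvements {
        Label::Regression
    } else if improvements > 2.0 * regressions {
        Label::Improvement
    } else {
        Label::Mixed
    }
}

fn main() {
    // Two large regressions plus one small improvement still reads as a regression,
    // rather than "mixed".
    let changes = vec![
        Change { relative_change: 0.02 },
        Change { relative_change: 0.015 },
        Change { relative_change: -0.002 },
    ];
    assert_eq!(label(&changes), Label::Regression);
}
```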
@Mark-Simulacrum I updated this to take the magnitude of changes into account. I've tested it with a few triage runs, and it looks relatively good to me. What do you think about merging this and keeping an eye out for how the labeling ends up in practice?
Yeah, that seems like a good start. I'm a little worried about how complicated we're making it -- but I think most people likely don't need to know the rules. If we were to try and extend or improve it, I might try to look at diagnosing regression/mixed/improvement across the board but also within each "scenarioish" (basically ignoring the benchmark, but including everything else). That would for example let us say "regression on incremental builds" but "improvement on non-incremental". Anyway, seems good for now; we can play it by ear. I think it might be useful to have triage reports generated with a much more primitive heuristic for several weeks (e.g., just significance) but mark them as "would have been shown under the current proposed advanced algorithm" -- that way it's much easier to evaluate whether we agree with it.
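The per-scenario grouping idea could look roughly like the sketch below; the `Change` struct and the profile/scenario strings are assumptions chosen for illustration, not the actual rustc-perf data model:

```rust
use std::collections::BTreeMap;

/// Hypothetical shape of a significant test result, tagged with everything
/// except the benchmark name ("scenarioish": profile + scenario).
struct Change {
    profile: &'static str,   // e.g. "opt", "debug", "check"
    scenario: &'static str,  // e.g. "full", "incr-full", "incr-patched"
    relative_change: f64,    // positive = regression, negative = improvement
}

fn main() {
    let changes = vec![
        Change { profile: "debug", scenario: "incr-full", relative_change: 0.012 },
        Change { profile: "debug", scenario: "incr-full", relative_change: 0.008 },
        Change { profile: "opt", scenario: "full", relative_change: -0.004 },
    ];

    // Group by (profile, scenario), ignoring the benchmark, and report the
    // direction of the significant changes inside each group.
    let mut groups: BTreeMap<(&str, &str), (usize, usize)> = BTreeMap::new();
    for c in &changes {
        let entry = groups.entry((c.profile, c.scenario)).or_insert((0, 0));
        if c.relative_change > 0.0 { entry.0 += 1 } else { entry.1 += 1 }
    }
    for ((profile, scenario), (regressions, improvements)) in groups {
        println!("{profile}/{scenario}: {regressions} regressions, {improvements} improvements");
    }
}
```

Each group could then be labeled with the same improvement/regression/mixed rule used for the comparison as a whole, giving statements like "regression on incremental builds".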
This moves the definition of significance to one central place (almost) in the API, instead of it being calculated both on the client and in the API.
Some things to note:
Fixes #952
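For a rough picture of what computing significance "in one central place" on the API side could mean, here is a sketch with illustrative names (`ComparisonEntry`, `significance_threshold`, `make_entry`) that are not the actual rustc-perf API; the thresholds are the 0.1% / 1% values from the thread:

```rust
/// Illustrative shape of a comparison entry as the API might return it,
/// with significance already decided server-side.
#[derive(Debug)]
struct ComparisonEntry {
    benchmark: String,
    relative_change: f64,
    is_significant: bool,
}

/// The single place where the threshold lives: 0.1% normally,
/// 1% for dodgy test cases.
fn significance_threshold(is_dodgy: bool) -> f64 {
    if is_dodgy { 0.01 } else { 0.001 }
}

fn make_entry(benchmark: &str, relative_change: f64, is_dodgy: bool) -> ComparisonEntry {
    ComparisonEntry {
        benchmark: benchmark.to_string(),
        relative_change,
        is_significant: relative_change.abs() >= significance_threshold(is_dodgy),
    }
}

fn main() {
    let e = make_entry("regex-opt", 0.004, false);
    // The client just reads `is_significant` instead of re-deriving it.
    println!("{e:?}");
}
```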