
Commit 555043f

Merge pull request #956 from rylev/consistent-significance
Consistent significance
2 parents a238596 + a1a97a8 commit 555043f

5 files changed: +504 -273 lines changed

docs/comparison-analysis.md

+65
@@ -0,0 +1,65 @@
# Comparison Analysis

The following is a detailed explanation of the process undertaken to automate the analysis of test results for two artifacts of interest (artifacts A and B).

This analysis can also be done by hand, by using the [comparison page](https://perf.rust-lang.org/compare.html) and entering the two artifacts of interest in the form at the top.

## The goal

The goal of the analysis is to determine whether artifact B represents a performance improvement, regression, or mixed result relative to artifact A. Typically, artifact B is artifact A with a certain pull request's changes applied, but this is not required.

Performance analysis is typically used to determine whether a pull request has introduced performance regressions or improvements.

## What is being compared?

At the core of comparison analysis is the collection of test results for the two artifacts being compared. For each test case, the statistics for the two artifacts are compared and a relative change percentage is obtained using the following formula:

```
100 * ((statisticForArtifactB - statisticForArtifactA) / statisticForArtifactA)
```
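
As a concrete illustration, here is a minimal sketch of that formula in Rust (an illustration only, not the actual rustc-perf code; the function name and signature are assumptions):

```rust
/// Relative change from artifact A to artifact B, as a percentage.
/// Positive means the statistic grew from A to B.
fn relative_change_pct(statistic_a: f64, statistic_b: f64) -> f64 {
    100.0 * ((statistic_b - statistic_a) / statistic_a)
}

fn main() {
    // A test that measures 1.00 on artifact A and 1.03 on artifact B
    // shows a +3% relative change.
    println!("{:.1}%", relative_change_pct(1.00, 1.03));
}
```
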
20+
21+
## High-level analysis description
22+
23+
Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level the analysis performed takes the following form:
24+
25+
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
26+
27+
* If there are improvements and regressions with magnitude of medium or above, then the comparison is mixed.
* If there are only improvements or only regressions, then the comparison is labeled with that kind.
* If one kind of change is of medium or above magnitude (and the other kind is not), then the comparison is mixed if 15% or more of the total changes are of the other (small magnitude) kind. For example:
  * given 20 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there are 4 or more improvements.
  * given 5 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there is 1 or more improvement.
* If both kinds of changes are all of low magnitude, then the comparison is mixed unless 90% or more of the total changes are of one kind. For example:
  * given 20 changes of both kinds all of low magnitude, the result is mixed unless 2 or fewer of the changes are of one kind (i.e., 18 or more are of the other).
  * given 5 changes of both kinds all of low magnitude, the result is always mixed.
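
A minimal sketch of these rules in Rust (the names, the count-based representation, and the handling of empty inputs are assumptions made for illustration):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Improvement,
    Regression,
    Mixed,
}

/// `improvements`/`regressions` count the significant changes in each
/// direction; the booleans say whether that kind contains any change of
/// medium or larger magnitude. Assumes at least one change exists.
fn classify(
    improvements: usize,
    regressions: usize,
    improvements_medium_or_above: bool,
    regressions_medium_or_above: bool,
) -> Verdict {
    let total = (improvements + regressions) as f64;
    if improvements == 0 {
        return Verdict::Regression;
    }
    if regressions == 0 {
        return Verdict::Improvement;
    }
    match (improvements_medium_or_above, regressions_medium_or_above) {
        // Medium-or-above changes in both directions: always mixed.
        (true, true) => Verdict::Mixed,
        // Only regressions reach medium magnitude: mixed once the small
        // improvements make up 15% or more of all changes.
        (false, true) if improvements as f64 / total >= 0.15 => Verdict::Mixed,
        (false, true) => Verdict::Regression,
        // The symmetric case for improvements.
        (true, false) if regressions as f64 / total >= 0.15 => Verdict::Mixed,
        (true, false) => Verdict::Improvement,
        // Everything is low magnitude: mixed unless 90%+ lean one way.
        (false, false) if improvements as f64 / total >= 0.9 => Verdict::Improvement,
        (false, false) if regressions as f64 / total >= 0.9 => Verdict::Regression,
        (false, false) => Verdict::Mixed,
    }
}

fn main() {
    // 20 regressions (some medium) and 3 low-magnitude improvements:
    // 3/23 is about 13%, below 15%, so the verdict stays "regression".
    assert_eq!(classify(3, 20, false, true), Verdict::Regression);
    // With a 4th improvement, 4/24 is about 17%: "mixed".
    assert_eq!(classify(4, 20, false, true), Verdict::Mixed);
}
```
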

Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are at least confident that the results are "probably relevant", while for the triage report, we only report if we are confident the results are "definitely relevant".

### What makes a test result significant?

A test result is significant if the relative change percentage meets some threshold. What the threshold is depends on whether the test case is "dodgy" or not (see below for an examination of "dodginess"). For dodgy test cases, the threshold is set at 1%. For non-dodgy test cases, the threshold is set at 0.1%.
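
In code, the significance check might look like the following sketch (whether "meets" means a strict or inclusive comparison is an assumption here):

```rust
/// Thresholds from the text above: 1% for dodgy test cases, 0.1% otherwise.
fn is_significant(relative_change_pct: f64, is_dodgy: bool) -> bool {
    let threshold = if is_dodgy { 1.0 } else { 0.1 };
    relative_change_pct.abs() >= threshold
}

fn main() {
    assert!(is_significant(0.5, false)); // 0.5% change on a stable test case
    assert!(!is_significant(0.5, true)); // the same change on a dodgy one
}
```
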
### What makes a test case "dodgy"?

A test case is "dodgy" if it shows signs of either being noisy or highly variable.

To determine noise and high variability, the previous 100 test results for the test case in question are examined by calculating relative delta changes between adjacent test results. This is done with the following formula (where `testResult1` is the test result immediately preceding `testResult2`):

```
(testResult2 - testResult1) / testResult1
```

Any relative delta change that is above a threshold (currently 0.1) is considered "significant" for the purposes of dodginess detection.

A highly variable test case is one where a certain percentage (currently 5%) of relative delta changes are significant. The logic is that test cases should only display significant relative delta changes a small percentage of the time.

A noisy test case is one where, of all the non-significant relative delta changes, the average delta change is still above some threshold (currently 0.001). The logic is that non-significant changes should, on average, be very close to 0; if they are not close to zero, the test case is noisy.
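
Putting the two heuristics together, a minimal sketch (the helper names are invented, and averaging the absolute values of the deltas is an assumption about what "average delta change" means):

```rust
/// Relative delta changes between adjacent test results.
fn relative_deltas(history: &[f64]) -> Vec<f64> {
    history.windows(2).map(|w| (w[1] - w[0]) / w[0]).collect()
}

/// `history` would hold the previous 100 test results for the test case.
fn is_dodgy(history: &[f64]) -> bool {
    let deltas = relative_deltas(history);
    if deltas.is_empty() {
        return false;
    }
    let (significant, non_significant): (Vec<f64>, Vec<f64>) =
        deltas.iter().copied().partition(|d| d.abs() > 0.1);

    // Highly variable: 5% or more of the deltas are significant.
    let highly_variable = significant.len() as f64 / deltas.len() as f64 >= 0.05;

    // Noisy: the non-significant deltas do not average out near zero.
    let noisy = !non_significant.is_empty()
        && non_significant.iter().map(|d| d.abs()).sum::<f64>()
            / non_significant.len() as f64
            > 0.001;

    highly_variable || noisy
}

fn main() {
    // A flat history is not dodgy; one with frequent large swings is.
    let stable = vec![100.0; 10];
    let jumpy = vec![100.0, 100.0, 130.0, 100.0, 135.0, 100.0];
    assert!(!is_dodgy(&stable));
    assert!(is_dodgy(&jumpy));
}
```
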
### How is confidence in whether a test analysis is "relevant" determined?

The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change).

The actual algorithm for determining confidence may change, but in general the following rules apply (see the sketch after this list):

* Definitely relevant: any number of very large changes, a small number of large and/or medium changes, or a large number of small changes.
* Probably relevant: any number of large changes, more than 1 medium change, or a smaller but still substantial number of small changes.
* Maybe relevant: anything that does not fit into the above two categories ends up here.
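
A sketch of these rules in Rust; since the text only gives qualitative sizes ("a small amount", "substantial"), every numeric cutoff below is an invented placeholder:

```rust
#[derive(Debug, PartialEq)]
enum Relevance {
    Maybe,
    Probably,
    Definitely,
}

/// Counts of significant test results, bucketed by magnitude.
fn relevance(very_large: usize, large: usize, medium: usize, small: usize) -> Relevance {
    if very_large > 0 || large + medium >= 3 || small >= 20 {
        Relevance::Definitely
    } else if large > 0 || medium > 1 || small >= 10 {
        Relevance::Probably
    } else {
        Relevance::Maybe
    }
}

fn main() {
    assert_eq!(relevance(1, 0, 0, 0), Relevance::Definitely);
    assert_eq!(relevance(0, 0, 2, 0), Relevance::Probably);
    assert_eq!(relevance(0, 0, 1, 3), Relevance::Maybe);
}
```
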

site/src/api.rs

+16 -8
```diff
@@ -138,7 +138,6 @@ pub mod bootstrap {
 }
 
 pub mod comparison {
-    use crate::comparison;
     use collector::Bound;
     use database::Date;
     use serde::{Deserialize, Serialize};
@@ -153,13 +152,12 @@ pub mod comparison {
 
     #[derive(Debug, Clone, Serialize)]
     pub struct Response {
-        /// The variance data for each benchmark, if any.
-        pub variance: Option<HashMap<String, comparison::BenchmarkVariance>>,
         /// The names for the previous artifact before `a`, if any.
         pub prev: Option<String>,
 
-        pub a: ArtifactData,
-        pub b: ArtifactData,
+        pub a: ArtifactDescription,
+        pub b: ArtifactDescription,
+        pub comparisons: Vec<Comparison>,
 
         /// The names for the next artifact after `b`, if any.
         pub next: Option<String>,
@@ -169,15 +167,25 @@ pub mod comparison {
         pub is_contiguous: bool,
     }
 
-    /// A serializable wrapper for `comparison::ArtifactData`.
     #[derive(Debug, Clone, Serialize)]
-    pub struct ArtifactData {
+    pub struct ArtifactDescription {
         pub commit: String,
         pub date: Option<Date>,
         pub pr: Option<u32>,
-        pub data: HashMap<String, Vec<(String, f64)>>,
         pub bootstrap: HashMap<String, u64>,
     }
+
+    /// A serializable wrapper for `comparison::ArtifactData`.
+    #[derive(Debug, Clone, Serialize)]
+    pub struct Comparison {
+        pub benchmark: String,
+        pub profile: String,
+        pub scenario: String,
+        pub is_significant: bool,
+        pub is_dodgy: bool,
+        pub historical_statistics: Option<Vec<f64>>,
+        pub statistics: (f64, f64),
+    }
 }
 
 pub mod status {
```