Skip to content

Commit 0af149a

Browse files
committed
Take magnitude into account
1 parent ae89411 commit 0af149a

File tree

3 files changed

+156
-50
lines changed

3 files changed

+156
-50
lines changed

docs/comparison-analysis.md

+15-7
Original file line numberDiff line numberDiff line change
@@ -22,7 +22,16 @@ At the core of comparison analysis are the collection of test results for the tw
2222

2323
Analysis of the changes is performed in order to determine whether artifact B represents a performance change over artifact A. At a high level the analysis performed takes the following form:
2424

25-
How many _significant_ test results indicate performance changes? If all significant test results indicate regressions (i.e., all percent relative changes are positive), then artifact B represents a performance regression over artifact A. If all significant test results indicate improvements (i.e., all percent relative changes are negative), then artifact B represents a performance improvement over artifact B. If some significant test results indicate improvement and others indicate regressions, then the performance change is mixed.
25+
How many _significant_ test results indicate performance changes and what is the magnitude of the changes (i.e., how large are the changes regardless of the direction of change)?
26+
27+
* If there are improvements and regressions with magnitude of medium or above then the comparison is mixed.
28+
* If there are only either improvements or regressions then the comparison is labeled with that kind.
29+
* If one kind of changes are of medium or above magnitude (and the other kind are not), then the comparison is mixed if 15% or more of the total changes are the other (small magnitude) kind. For example:
30+
* given 20 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there are 4 or more improvements.
31+
* given 5 regressions (with at least 1 of medium magnitude) and all improvements of low magnitude, the comparison is only mixed if there 1 or more improvements.
32+
* If both kinds of changes are all low magnitude changes, then the comparison is mixed unless 90% or more of total changes are of one kind. For example:
33+
* given 20 changes of different kinds all of low magnitude, the result is mixed unless only 2 or fewer of the changes are of one kind.
34+
* given 5 changes of different kinds all of low magnitude, the result is always mixed.
2635

2736
Whether we actually _report_ an analysis or not depends on the context and how _confident_ we are in the summary of the results (see below for an explanation of how confidence is derived). For example, in pull request performance "try" runs, we report a performance change if we are at least confident that the results are "probably relevant", while for the triage report, we only report if the we are confident the results are "definitely relevant".
2837

@@ -48,10 +57,9 @@ A noisy test case is one where of all the non-significant relative delta changes
4857

4958
### How is confidence in whether a test analysis is "relevant" determined?
5059

51-
The confidence in whether a test analysis is relevant depends on the number of significant test results. Depending on that number a confidence level is reached:
52-
53-
* Maybe relevant: 0-3 changes
54-
* Probably relevant: 4-6 changes
55-
* Definitely relevant: >6 changes
60+
The confidence in whether a test analysis is relevant depends on the number of significant test results and their magnitude (how large a change is regardless of the direction of the change).
5661

57-
Note: changes can be any combination of positive or negative changes.
62+
The actual algorithm for determining confidence may change, but in general the following rules apply:
63+
* Definitely relevant: any number of very large changes, a small amount of large and/or medium changes, or a large amount of small changes.
64+
* Probably relevant: any number of large changes, more than 1 medium change, or smaller but still substantial amount of small changes.
65+
* Maybe relevant: if it doesn't fit into the above two categories, it ends in this category.

site/src/comparison.rs

+140-42
Original file line numberDiff line numberDiff line change
@@ -133,7 +133,7 @@ async fn populate_report(comparison: &Comparison, report: &mut HashMap<Direction
133133
}
134134

135135
pub struct ComparisonSummary {
136-
/// Significant comparisons
136+
/// Significant comparisons ordered by magnitude
137137
comparisons: Vec<TestResultComparison>,
138138
}
139139

@@ -150,8 +150,8 @@ impl ComparisonSummary {
150150
}
151151

152152
let cmp = |b1: &TestResultComparison, b2: &TestResultComparison| {
153-
b1.log_change()
154-
.partial_cmp(&b2.log_change())
153+
b2.log_change()
154+
.partial_cmp(&b1.log_change())
155155
.unwrap_or(std::cmp::Ordering::Equal)
156156
};
157157
comparisons.sort_by(cmp);
@@ -161,41 +161,108 @@ impl ComparisonSummary {
161161

162162
/// The direction of the changes
163163
pub fn direction(&self) -> Option<Direction> {
164-
let d = match (self.largest_improvement(), self.largest_regression()) {
165-
(None, None) => return None,
166-
(Some(c), None) => c.direction(),
167-
(None, Some(c)) => c.direction(),
168-
(Some(a), Some(b)) if a.is_increase() == b.is_increase() => a.direction(),
169-
_ => Direction::Mixed,
170-
};
171-
Some(d)
164+
if self.comparisons.len() == 0 {
165+
return None;
166+
}
167+
168+
let (regressions, improvements): (Vec<&TestResultComparison>, _) =
169+
self.comparisons.iter().partition(|c| c.is_regression());
170+
171+
if regressions.len() == 0 {
172+
return Some(Direction::Improvement);
173+
}
174+
175+
if improvements.len() == 0 {
176+
return Some(Direction::Regression);
177+
}
178+
179+
let total_num = self.comparisons.len();
180+
let regressions_ratio = regressions.len() as f64 / total_num as f64;
181+
182+
let has_medium_and_above_regressions = regressions
183+
.iter()
184+
.any(|c| c.magnitude().is_medium_or_above());
185+
let has_medium_and_above_improvements = improvements
186+
.iter()
187+
.any(|c| c.magnitude().is_medium_or_above());
188+
match (
189+
has_medium_and_above_improvements,
190+
has_medium_and_above_regressions,
191+
) {
192+
(true, true) => return Some(Direction::Mixed),
193+
(true, false) => {
194+
if regressions_ratio >= 0.15 {
195+
Some(Direction::Mixed)
196+
} else {
197+
Some(Direction::Improvement)
198+
}
199+
}
200+
(false, true) => {
201+
if regressions_ratio < 0.85 {
202+
Some(Direction::Mixed)
203+
} else {
204+
Some(Direction::Regression)
205+
}
206+
}
207+
(false, false) => {
208+
if regressions_ratio >= 0.1 && regressions_ratio <= 0.9 {
209+
Some(Direction::Mixed)
210+
} else if regressions_ratio <= 0.1 {
211+
Some(Direction::Improvement)
212+
} else {
213+
Some(Direction::Regression)
214+
}
215+
}
216+
}
172217
}
173218

174-
/// The n largest changes ordered by the magnitude of their change
175-
pub fn largest_changes(&self, n: usize) -> Vec<&TestResultComparison> {
176-
let mut changes: Vec<&TestResultComparison> = self.comparisons.iter().take(n).collect();
177-
changes.sort_by(|a, b| {
178-
b.log_change()
179-
.abs()
180-
.partial_cmp(&a.log_change().abs())
181-
.unwrap_or(std::cmp::Ordering::Equal)
182-
});
183-
changes
219+
pub fn relevant_changes<'a>(&'a self) -> [Option<&TestResultComparison>; 2] {
220+
match self.direction() {
221+
Some(Direction::Improvement) => [self.largest_improvement(), None],
222+
Some(Direction::Regression) => [self.largest_regression(), None],
223+
Some(Direction::Mixed) => [self.largest_improvement(), self.largest_regression()],
224+
None => [None, None],
225+
}
184226
}
185227

186228
pub fn largest_improvement(&self) -> Option<&TestResultComparison> {
187-
self.comparisons.iter().filter(|s| !s.is_increase()).next()
229+
self.comparisons
230+
.iter()
231+
.filter(|s| !s.is_regression())
232+
.next()
188233
}
189234

190235
pub fn largest_regression(&self) -> Option<&TestResultComparison> {
191-
self.comparisons.iter().filter(|s| s.is_increase()).next()
236+
self.comparisons.iter().filter(|s| s.is_regression()).next()
192237
}
193238

194239
pub fn confidence(&self) -> ComparisonConfidence {
195-
match self.comparisons.len() {
196-
0..=3 => ComparisonConfidence::MaybeRelevant,
197-
4..=6 => ComparisonConfidence::ProbablyRelevant,
198-
_ => ComparisonConfidence::DefinitelyRelevant,
240+
let mut num_small_changes = 0;
241+
let mut num_medium_changes = 0;
242+
let mut num_large_changes = 0;
243+
let mut num_very_large_changes = 0;
244+
for c in self.comparisons.iter() {
245+
match c.magnitude() {
246+
Magnitude::Small => num_small_changes += 1,
247+
Magnitude::Medium => num_medium_changes += 1,
248+
Magnitude::Large => num_large_changes += 1,
249+
Magnitude::VeryLarge => num_very_large_changes += 1,
250+
}
251+
}
252+
253+
match (
254+
num_medium_changes,
255+
num_large_changes,
256+
num_very_large_changes,
257+
) {
258+
(_, _, vl) if vl > 0 => ComparisonConfidence::DefinitelyRelevant,
259+
(m, l, _) if m + (l * 2) > 5 => ComparisonConfidence::DefinitelyRelevant,
260+
(m, l, _) if m > 1 || l > 0 => ComparisonConfidence::ProbablyRelevant,
261+
(m, _, _) => match m * 2 + num_small_changes {
262+
0..=5 => ComparisonConfidence::MaybeRelevant,
263+
6..=10 => ComparisonConfidence::ProbablyRelevant,
264+
_ => ComparisonConfidence::DefinitelyRelevant,
265+
},
199266
}
200267
}
201268

@@ -215,7 +282,7 @@ impl ComparisonSummary {
215282
let end = &comparison.b.artifact;
216283
let link = &compare_link(start, end);
217284

218-
for change in self.largest_changes(2) {
285+
for change in self.relevant_changes().iter().filter_map(|s| *s) {
219286
write!(result, "- ").unwrap();
220287
change.summary_line(&mut result, Some(link))
221288
}
@@ -666,7 +733,7 @@ impl TestResultComparison {
666733
(b / a).ln()
667734
}
668735

669-
fn is_increase(&self) -> bool {
736+
fn is_regression(&self) -> bool {
670737
let (a, b) = self.results;
671738
b > a
672739
}
@@ -681,6 +748,19 @@ impl TestResultComparison {
681748
self.relative_change().abs() > threshold
682749
}
683750

751+
fn magnitude(&self) -> Magnitude {
752+
let mag = self.relative_change().abs();
753+
if mag < 0.01 {
754+
Magnitude::Small
755+
} else if mag < 0.04 {
756+
Magnitude::Medium
757+
} else if mag < 0.1 {
758+
Magnitude::Large
759+
} else {
760+
Magnitude::VeryLarge
761+
}
762+
}
763+
684764
fn is_dodgy(&self) -> bool {
685765
self.variance
686766
.as_ref()
@@ -703,24 +783,13 @@ impl TestResultComparison {
703783

704784
pub fn summary_line(&self, summary: &mut String, link: Option<&str>) {
705785
use std::fmt::Write;
706-
let magnitude = self.relative_change().abs();
707-
let size = if magnitude > 0.10 {
708-
"Very large"
709-
} else if magnitude > 0.05 {
710-
"Large"
711-
} else if magnitude > 0.01 {
712-
"Moderate"
713-
} else if magnitude > 0.005 {
714-
"Small"
715-
} else {
716-
"Very small"
717-
};
786+
let magnitude = self.magnitude();
718787

719788
let percent = self.relative_change() * 100.0;
720789
write!(
721790
summary,
722791
"{} {} in {}",
723-
size,
792+
magnitude,
724793
self.direction(),
725794
match link {
726795
Some(l) => format!("[instruction counts]({})", l),
@@ -736,13 +805,15 @@ impl TestResultComparison {
736805
.unwrap();
737806
}
738807
}
808+
739809
impl std::cmp::PartialEq for TestResultComparison {
740810
fn eq(&self, other: &Self) -> bool {
741811
self.benchmark == other.benchmark
742812
&& self.profile == other.profile
743813
&& self.scenario == other.scenario
744814
}
745815
}
816+
746817
impl std::cmp::Eq for TestResultComparison {}
747818

748819
impl std::hash::Hash for TestResultComparison {
@@ -772,6 +843,33 @@ impl std::fmt::Display for Direction {
772843
}
773844
}
774845

846+
/// The relative size of a performance change
847+
#[derive(Debug)]
848+
enum Magnitude {
849+
Small,
850+
Medium,
851+
Large,
852+
VeryLarge,
853+
}
854+
855+
impl Magnitude {
856+
fn is_medium_or_above(&self) -> bool {
857+
!matches!(self, Self::Small)
858+
}
859+
}
860+
861+
impl std::fmt::Display for Magnitude {
862+
fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
863+
let s = match self {
864+
Self::Small => "Small",
865+
Self::Medium => "Moderate",
866+
Self::Large => "Large",
867+
Self::VeryLarge => "Very large",
868+
};
869+
f.write_str(s)
870+
}
871+
}
872+
775873
async fn generate_report(
776874
start: &Bound,
777875
end: &Bound,

site/src/github.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -630,7 +630,7 @@ async fn categorize_benchmark(
630630
"This change led to significant {} in compiler performance.\n",
631631
category
632632
);
633-
for change in summary.largest_changes(2) {
633+
for change in summary.relevant_changes().iter().filter_map(|c| *c) {
634634
write!(result, "- ").unwrap();
635635
change.summary_line(&mut result, None)
636636
}

0 commit comments

Comments
 (0)