[BOLT] Impute missing trace fall-through #145258
Conversation
Created using spr 1.3.4
@llvm/pr-subscribers-bolt
Author: Amir Ayupov (aaupov)
Changes: This PR adds imputing of missing fall-throughs for top-of-stack (TOS) branch stack entries (the full description follows the diff below).
Full diff: https://github.com/llvm/llvm-project/pull/145258.diff (3 files affected):
diff --git a/bolt/include/bolt/Core/MCPlusBuilder.h b/bolt/include/bolt/Core/MCPlusBuilder.h
index 804100db80793..50c5f5bcb9303 100644
--- a/bolt/include/bolt/Core/MCPlusBuilder.h
+++ b/bolt/include/bolt/Core/MCPlusBuilder.h
@@ -430,6 +430,12 @@ class MCPlusBuilder {
return Analysis->isIndirectBranch(Inst);
}
+ bool IsUnconditionalJump(const MCInst &Inst) const {
+ const MCInstrDesc &Desc = Info->get(Inst.getOpcode());
+ // barrier captures returns and unconditional branches
+ return Desc.isCall() || Desc.isBarrier();
+ }
+
/// Returns true if the instruction is memory indirect call or jump
virtual bool isBranchOnMem(const MCInst &Inst) const {
llvm_unreachable("not implemented");
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 98e4bba872846..00c6f56520fdc 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -499,6 +499,10 @@ class DataAggregator : public DataReader {
/// If \p FileBuildID has no match, then issue an error and exit.
void processFileBuildID(StringRef FileBuildID);
+ /// Infer missing fall-throughs for branch-only traces (LBR top-of-stack
+ /// entries).
+ void imputeFallThroughs();
+
/// Debugging dump methods
void dump() const;
void dump(const PerfBranchSample &Sample) const;
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 88229bb31a2ad..2fcc2561cc212 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -77,6 +77,11 @@ FilterPID("pid",
cl::Optional,
cl::cat(AggregatorCategory));
+static cl::opt<bool> ImputeTraceFallthrough(
+ "impute-trace-fall-through",
+ cl::desc("impute missing fall-throughs for branch-only traces"),
+ cl::Optional, cl::cat(AggregatorCategory));
+
static cl::opt<bool>
IgnoreBuildID("ignore-build-id",
cl::desc("continue even if build-ids in input binary and perf.data mismatch"),
@@ -529,6 +534,60 @@ void DataAggregator::parsePerfData(BinaryContext &BC) {
deleteTempFiles();
}
+void DataAggregator::imputeFallThroughs() {
+ if (Traces.empty())
+ return;
+
+ std::pair PrevBranch(Trace::EXTERNAL, Trace::EXTERNAL);
+ uint64_t AggregateCount = 0;
+ uint64_t AggregateFallthroughSize = 0;
+ uint64_t InferredTraces = 0;
+
+ // Helper map with whether the instruction is a call/ret/unconditional branch
+ std::unordered_map<uint64_t, bool> IsUncondJumpMap;
+ auto checkUncondJump = [&](const uint64_t Addr) {
+ auto isUncondJump = [&](auto MI) {
+ return MI && BC->MIB->IsUnconditionalJump(*MI);
+ };
+ auto It = IsUncondJumpMap.find(Addr);
+ if (It != IsUncondJumpMap.end())
+ return It->second;
+ BinaryFunction *Func = getBinaryFunctionContainingAddress(Addr);
+ if (!Func)
+ return false;
+ const uint64_t Offset = Addr - Func->getAddress();
+ if (Func->hasInstructions()
+ ? isUncondJump(Func->getInstructionAtOffset(Offset))
+ : isUncondJump(Func->disassembleInstructionAtOffset(Offset))) {
+ IsUncondJumpMap.emplace(Addr, true);
+ return true;
+ }
+ return false;
+ };
+
+ for (auto &[Trace, Info] : Traces) {
+ if (Trace.From == Trace::EXTERNAL)
+ continue;
+ std::pair CurrentBranch(Trace.Branch, Trace.From);
+ if (Trace.To == Trace::BR_ONLY) {
+ uint64_t InferredBytes = PrevBranch == CurrentBranch
+ ? AggregateFallthroughSize / AggregateCount
+ : !checkUncondJump(Trace.From);
+ Trace.To = Trace.From + InferredBytes;
+ outs() << "Inferred " << Trace << " " << InferredBytes << '\n';
+ ++InferredTraces;
+ } else {
+ if (CurrentBranch != PrevBranch)
+ AggregateCount = AggregateFallthroughSize = 0;
+ if (Trace.To != Trace::EXTERNAL)
+ AggregateFallthroughSize += (Trace.To - Trace.From) * Info.TakenCount;
+ AggregateCount += Info.TakenCount;
+ }
+ PrevBranch = CurrentBranch;
+ }
+ outs() << "Inferred " << InferredTraces << " traces\n";
+}
+
Error DataAggregator::preprocessProfile(BinaryContext &BC) {
this->BC = &BC;
@@ -541,6 +600,9 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
// Sort parsed traces for faster processing.
llvm::sort(Traces, llvm::less_first());
+ if (opts::ImputeTraceFallthrough)
+ imputeFallThroughs();
+
if (opts::HeatmapMode) {
if (std::error_code EC = printLBRHeatMap())
return errorCodeToError(EC);
This PR imputes missing fall-throughs for top-of-stack (TOS) branch stack entries, addressing fall-through undercounting for branches that appear at the top of the branch stack.
Branch stack sampling
Branch stack sampling (Intel LBR, Arm BRBE) provides a stack of (branch, target) entries for the last N taken branches.
Two consecutive entries represent the fall-through path where no branch
is taken, potentially spanning multiple conditional branches.
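As an illustration, a minimal sketch (hypothetical StackEntry/fallThroughRanges names, not BOLT code) of how consecutive entries yield fall-through ranges, assuming the newest-first ordering perf uses for LBR entries:

#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical branch stack entry: the source and target of one taken branch.
struct StackEntry {
  uint64_t From; // branch source address
  uint64_t To;   // branch target address
};

// Derive fall-through ranges from a branch stack ordered newest-first:
// each fall-through starts at the target of the older branch and ends at
// the source of the next (newer) branch. The newest entry has no newer
// neighbor, so its own fall-through end is unknown -- the gap this PR
// addresses.
std::vector<std::pair<uint64_t, uint64_t>>
fallThroughRanges(const std::vector<StackEntry> &Stack) {
  std::vector<std::pair<uint64_t, uint64_t>> Ranges;
  for (size_t I = 1; I < Stack.size(); ++I)
    Ranges.emplace_back(Stack[I].To, Stack[I - 1].From);
  return Ranges;
}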
DataAggregator representation/traces
The internal representation of a unit of the branch profile is a
trace, a triple of branch source, branch target/fall-through start,
and fall-through end/next branch:
llvm-project/bolt/include/bolt/Profile/DataAggregator.h, lines 116 to 118 in 89c6144
Branch source and branch target match the branch stack entry.
Fall-through end is the branch start for the next stack entry.
Naturally, the top-of-stack (TOS) entry doesn't have a fall-through end. Internally, this case is represented with a magic value BR_ONLY in the To field:
llvm-project/bolt/include/bolt/Profile/DataAggregator.h, line 111 in 89c6144
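For orientation, a rough sketch of the shape of that record, using the field and sentinel names from the patch; the sentinel values here are placeholders, and the actual definition lives in DataAggregator.h at the lines referenced above:

#include <cstdint>
#include <limits>

// Sketch only: field and sentinel names follow the patch; the sentinel
// values are placeholders, not the ones defined in DataAggregator.h.
struct Trace {
  static constexpr uint64_t EXTERNAL = std::numeric_limits<uint64_t>::max() - 1;
  static constexpr uint64_t BR_ONLY = std::numeric_limits<uint64_t>::max();

  uint64_t Branch; // branch source (EXTERNAL if the branch is unknown/external)
  uint64_t From;   // branch target, i.e. fall-through start
  uint64_t To;     // fall-through end / next branch source; BR_ONLY for a
                   // top-of-stack entry with no recorded fall-through end
};

The Implementation section below relies on traces being sorted by their three fields and on BR_ONLY comparing greater than any real address, so the branch-only trace sorts last within its group.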
Traces and fall-throughs
Traces combine taken branches and fall-throughs into one record. This helps achieve efficiencies throughout profile pre-aggregation and profile parsing, and, most critically for BOLT, enables distinguishing call-continuation fall-throughs from the rest.
This information is used to set the weight of the call to call-continuation fall-through CFG edge when the call and its continuation land in separate basic blocks. The effect is non-trivial, providing a measurable CPU win over using separate branches and non-differentiated fall-throughs.
Motivation: top-of-stack biased branches
Intuitively, for any branch in the program, the probability for it to be
captured in a branch stack should be proportional to its execution
frequency, and if captured, it should appear in any given entry with
equal probability of ~1/N, N being the size of the branch stack.
If that were the case, all fall-throughs would be undercounted by roughly that magnitude (1/32 for Intel LBR, or ~3%) relative to taken edges.
However, internal studies indicate that Intel LBR sampling is not
completely fair: the observed probability distribution of the percentage
of times an entry appears on the top of the stack does not follow a
binomial distribution (normal bell curve).
Relaxing the assumption of a constant probability of 1/N and modeling
the probability as a beta distribution, the resulting distribution more
closely resembles the actual observed distribution (wider bell curve).
This means that for branches that are biased to be at the top of the
stack, the fall-through path is undercounted to a larger extent than
for other branches, and uniform adjustment is thus inadequate.
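A sketch of this argument in formulas, under the assumptions stated above (N is the branch stack size and a given branch is captured in S samples): under the uniform model each captured sample sits at the top of the stack with probability p = 1/N, so the expected number of samples with a missing fall-through is E[X] = S/N (about 3% of S for N = 32). Relaxing the constant-p assumption and drawing p ~ Beta(α, β) with mean 1/N makes X beta-binomial, with

Var[X] = S·p·(1-p)·(α+β+S)/(α+β+1) > S·p·(1-p), where p = α/(α+β) = 1/N,

i.e. overdispersed relative to the binomial: some branches lose a much larger share of their fall-throughs than the uniform 1/N estimate suggests, consistent with the wider observed bell curve.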
Implementation
TOS entries are easily identifiable by a BR_ONLY value in the Trace.To field. To get an expected fall-through length for such traces:
- group traces by their (Branch, From) pair,
- accumulate the fall-through lengths of the valid traces in the group, weighted by their taken counts,
- divide by the aggregate taken count to get the weighted average fall-through length.
Since traces are sorted by their three fields, the grouping happens naturally: valid fall-through traces come first within a group, and the BR_ONLY trace comes last. This allows computing the weighted average in a single pass over the traces.
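For illustration with hypothetical numbers: if a (Branch, From) group contains valid fall-through traces of 8 bytes (taken count 30) and 16 bytes (taken count 10), the aggregates are 8*30 + 16*10 = 400 bytes over a count of 40, so the branch-only trace in that group gets an imputed fall-through of 400/40 = 10 bytes, i.e. To = From + 10.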
Care is taken in two special cases where the group has no valid fall-through traces to average over:
- for a conditional branch, the fall-through length defaults to one byte;
- for an unconditional control transfer (an unconditional jump, call, or return), the fall-through length is set to zero. This still allows extending it to cover the call to call-continuation fall-through CFG edge.
Effect
For a large binary, this change improves the profile quality metrics, in particular those computed using the fall-through ranges.