
[BOLT] Impute missing trace fall-through #145258


Open
wants to merge 3 commits into
base: users/aaupov/spr/main.bolt-impute-missing-trace-fall-through

Conversation

aaupov
Contributor

@aaupov aaupov commented Jun 23, 2025

This PR imputes missing fall-throughs for top-of-stack (TOS) entries,
addressing fall-through undercounting for branches that appear at the top of
the branch stack more often than they should.

Branch stack sampling

Branch stack sampling (LBR, BRBE) provides a stack of (branch, target)
entries for the last N taken branches:

  • Intel LBR, SKL+: 32,
  • AMD BRS/LBRv2: 16,
  • ARM BRBE: 8/16/32/64 depending on the implementation,
  • arbitrary for trace-synthesized branch stacks (ARM ETM, Intel PT).

Two consecutive entries represent the fall-through path where no branch
is taken, potentially spanning multiple conditional branches.

DataAggregator representation/traces

The internal representation of a unit of the branch profile is a
trace, a triple of branch source, branch target/fall-through start,
and fall-through end/next branch:

uint64_t Branch; // branch source address
uint64_t From;   // branch target / fall-through start
uint64_t To;     // fall-through end / next branch source

Branch source and branch target match the branch stack entry.
Fall-through end is the branch start for the next stack entry.
Naturally, the top-of-stack (TOS) entry doesn't have a fall-through end.
Internally, this case is represented with the magic value BR_ONLY in the
To field:

static constexpr const uint64_t BR_ONLY = -1ULL;
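
For illustration, a hypothetical two-entry branch stack (all addresses are
made up), with the newest entry first:

  TOS entry:   branch at 0x140, target 0x200
  older entry: branch at 0x300, target 0x110

produces the traces:

  {Branch = 0x300, From = 0x110, To = 0x140}    // fall-through 0x110..0x140 observed
  {Branch = 0x140, From = 0x200, To = BR_ONLY}  // TOS entry: no fall-through end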

Traces and fall-throughs

Traces combine taken branches and fall-throughs into one record. This improves
efficiency throughout profile pre-aggregation and profile parsing and, most
critically for BOLT, makes it possible to distinguish call-continuation
fall-throughs from the rest.

This information is used to set the weight of the call-to-call-continuation
fall-through CFG edge when the call and its continuation land in separate basic
blocks. The effect is non-trivial: it provides a measurable CPU win over using
separate branches and undifferentiated fall-throughs.
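
For example (hypothetical layout): if a call ends basic block A and its return
address starts basic block B, a trace whose fall-through range spans the call
site into B is what lets BOLT attach a weight to the A -> B call-continuation
edge; with branches and fall-throughs recorded separately, that edge cannot be
attributed as precisely.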

Motivation: top-of-stack biased branches

Intuitively, for any branch in the program, the probability for it to be
captured in a branch stack should be proportional to its execution
frequency, and if captured, it should appear in any given entry with
equal probability of ~1/N, N being the size of the branch stack.

If that were the case, all fall-throughs would be undercounted by roughly that
fraction (1/32, or ~3%, for Intel LBR) relative to taken edges, since only the
TOS entry lacks a fall-through.

However, internal studies indicate that Intel LBR sampling is not completely
fair: the observed distribution of the fraction of times a branch appears at
the top of the stack does not follow the binomial distribution (an
approximately normal bell curve) that a constant 1/N probability would predict.

Relaxing the assumption of a constant 1/N probability and instead modeling the
per-branch probability with a beta distribution produces a wider bell curve
that more closely resembles the observed distribution.

This means that for branches biased toward the top of the stack, the
fall-through path is undercounted to a larger extent than for other branches,
so a uniform adjustment is inadequate.
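
A minimal simulation sketch (not part of the patch) contrasting the two models;
the beta parameters and sample counts below are illustrative assumptions, not
measured values:

#include <cmath>
#include <cstdio>
#include <random>

int main() {
  constexpr int N = 32;         // branch stack depth (Intel LBR, SKL+)
  constexpr int Captures = 320; // times a given branch is captured in a stack
  constexpr int Branches = 100000;
  std::mt19937 Gen(42);

  // Model 1: fixed 1/N chance of landing at the top of the stack.
  std::binomial_distribution<int> Fixed(Captures, 1.0 / N);

  // Model 2: per-branch top-of-stack probability drawn from Beta(A, B) with
  // mean 1/N; Beta(a, b) = Ga / (Ga + Gb) for Ga ~ Gamma(a, 1), Gb ~ Gamma(b, 1).
  const double A = 0.5, B = A * (N - 1);
  std::gamma_distribution<double> GA(A, 1.0), GB(B, 1.0);

  const double Mean = double(Captures) / N;
  double Var1 = 0, Var2 = 0;
  for (int I = 0; I < Branches; ++I) {
    const int TOS1 = Fixed(Gen);
    const double Ga = GA(Gen), Gb = GB(Gen);
    const double P = Ga + Gb > 0 ? Ga / (Ga + Gb) : 1.0 / N;
    const int TOS2 = std::binomial_distribution<int>(Captures, P)(Gen);
    Var1 += (TOS1 - Mean) * (TOS1 - Mean);
    Var2 += (TOS2 - Mean) * (TOS2 - Mean);
  }
  // The beta model yields a visibly wider spread of top-of-stack counts.
  printf("stddev: fixed 1/N = %.2f, beta = %.2f\n",
         std::sqrt(Var1 / Branches), std::sqrt(Var2 / Branches));
}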

Implementation

TOS entries are easily identifiable by the BR_ONLY value in the Trace.To
field. To get an expected fall-through length for such traces:

  • group traces by the first two fields (Branch, From),
  • using traces that do have a fall-through end, compute the weighted average
    fall-through length,
  • use that average for the trace without a fall-through end.

Since traces are sorted by their three fields, the grouping happens naturally:
valid fall-through traces come first within a group and the BR_ONLY trace comes
last, which allows computing the weighted average in a single pass over the
traces.
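
A worked example with hypothetical addresses and counts: suppose the group with
Branch = 0x1000, From = 0x1010 contains two complete traces, one with
To = 0x1040 and count 30 and one with To = 0x1020 and count 10, followed by a
BR_ONLY trace. The weighted average fall-through length is
(0x30 * 30 + 0x10 * 10) / (30 + 10) = 40 bytes = 0x28, so the BR_ONLY trace is
imputed To = 0x1010 + 0x28 = 0x1038.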

Care is taken in two special cases:

  • Fall-through start is an unconditional jump (unconditional branch,
    call, or return). In this case, the fall-through length is set to zero.
    This still allows extending it to cover the call-to-continuation
    fall-through CFG edge.
  • Traces in external code (DSO, JIT) are skipped.

Effect

For a large binary, this change improves the profile quality metrics:

  • no significant increase in mismatching traces (the primary raw profile
    quality metric),
  • increased profile density: 599.4 -> 620.6 (density is computed using the
    fall-through ranges),
  • call graph (CG) flow imbalance: 11.56% -> 8.28%,
  • CFG flow imbalance:
    • weighted: 3.63% -> 3.34%,
    • worst: 17.13% -> 22.05% (the only regression).

Created using spr 1.3.4
@aaupov aaupov marked this pull request as ready for review June 23, 2025 05:06
@aaupov aaupov requested a review from ShatianWang June 23, 2025 05:06
@llvmbot llvmbot added the BOLT label Jun 23, 2025
@llvmbot
Member

llvmbot commented Jun 23, 2025

@llvm/pr-subscribers-bolt

Author: Amir Ayupov (aaupov)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/145258.diff

3 Files Affected:

  • (modified) bolt/include/bolt/Core/MCPlusBuilder.h (+6)
  • (modified) bolt/include/bolt/Profile/DataAggregator.h (+4)
  • (modified) bolt/lib/Profile/DataAggregator.cpp (+62)
diff --git a/bolt/include/bolt/Core/MCPlusBuilder.h b/bolt/include/bolt/Core/MCPlusBuilder.h
index 804100db80793..50c5f5bcb9303 100644
--- a/bolt/include/bolt/Core/MCPlusBuilder.h
+++ b/bolt/include/bolt/Core/MCPlusBuilder.h
@@ -430,6 +430,12 @@ class MCPlusBuilder {
     return Analysis->isIndirectBranch(Inst);
   }
 
+  bool IsUnconditionalJump(const MCInst &Inst) const {
+    const MCInstrDesc &Desc = Info->get(Inst.getOpcode());
+    // barrier captures returns and unconditional branches
+    return Desc.isCall() || Desc.isBarrier();
+  }
+
   /// Returns true if the instruction is memory indirect call or jump
   virtual bool isBranchOnMem(const MCInst &Inst) const {
     llvm_unreachable("not implemented");
diff --git a/bolt/include/bolt/Profile/DataAggregator.h b/bolt/include/bolt/Profile/DataAggregator.h
index 98e4bba872846..00c6f56520fdc 100644
--- a/bolt/include/bolt/Profile/DataAggregator.h
+++ b/bolt/include/bolt/Profile/DataAggregator.h
@@ -499,6 +499,10 @@ class DataAggregator : public DataReader {
   /// If \p FileBuildID has no match, then issue an error and exit.
   void processFileBuildID(StringRef FileBuildID);
 
+  /// Infer missing fall-throughs for branch-only traces (LBR top-of-stack
+  /// entries).
+  void imputeFallThroughs();
+
   /// Debugging dump methods
   void dump() const;
   void dump(const PerfBranchSample &Sample) const;
diff --git a/bolt/lib/Profile/DataAggregator.cpp b/bolt/lib/Profile/DataAggregator.cpp
index 88229bb31a2ad..2fcc2561cc212 100644
--- a/bolt/lib/Profile/DataAggregator.cpp
+++ b/bolt/lib/Profile/DataAggregator.cpp
@@ -77,6 +77,11 @@ FilterPID("pid",
   cl::Optional,
   cl::cat(AggregatorCategory));
 
+static cl::opt<bool> ImputeTraceFallthrough(
+    "impute-trace-fall-through",
+    cl::desc("impute missing fall-throughs for branch-only traces"),
+    cl::Optional, cl::cat(AggregatorCategory));
+
 static cl::opt<bool>
 IgnoreBuildID("ignore-build-id",
   cl::desc("continue even if build-ids in input binary and perf.data mismatch"),
@@ -529,6 +534,60 @@ void DataAggregator::parsePerfData(BinaryContext &BC) {
   deleteTempFiles();
 }
 
+void DataAggregator::imputeFallThroughs() {
+  if (Traces.empty())
+    return;
+
+  std::pair PrevBranch(Trace::EXTERNAL, Trace::EXTERNAL);
+  uint64_t AggregateCount = 0;
+  uint64_t AggregateFallthroughSize = 0;
+  uint64_t InferredTraces = 0;
+
+  // Helper map with whether the instruction is a call/ret/unconditional branch
+  std::unordered_map<uint64_t, bool> IsUncondJumpMap;
+  auto checkUncondJump = [&](const uint64_t Addr) {
+    auto isUncondJump = [&](auto MI) {
+      return MI && BC->MIB->IsUnconditionalJump(*MI);
+    };
+    auto It = IsUncondJumpMap.find(Addr);
+    if (It != IsUncondJumpMap.end())
+      return It->second;
+    BinaryFunction *Func = getBinaryFunctionContainingAddress(Addr);
+    if (!Func)
+      return false;
+    const uint64_t Offset = Addr - Func->getAddress();
+    if (Func->hasInstructions()
+            ? isUncondJump(Func->getInstructionAtOffset(Offset))
+            : isUncondJump(Func->disassembleInstructionAtOffset(Offset))) {
+      IsUncondJumpMap.emplace(Addr, true);
+      return true;
+    }
+    return false;
+  };
+
+  for (auto &[Trace, Info] : Traces) {
+    if (Trace.From == Trace::EXTERNAL)
+      continue;
+    std::pair CurrentBranch(Trace.Branch, Trace.From);
+    if (Trace.To == Trace::BR_ONLY) {
+      uint64_t InferredBytes = PrevBranch == CurrentBranch
+                                   ? AggregateFallthroughSize / AggregateCount
+                                   : !checkUncondJump(Trace.From);
+      Trace.To = Trace.From + InferredBytes;
+      outs() << "Inferred " << Trace << " " << InferredBytes << '\n';
+      ++InferredTraces;
+    } else {
+      if (CurrentBranch != PrevBranch)
+        AggregateCount = AggregateFallthroughSize = 0;
+      if (Trace.To != Trace::EXTERNAL)
+        AggregateFallthroughSize += (Trace.To - Trace.From) * Info.TakenCount;
+      AggregateCount += Info.TakenCount;
+    }
+    PrevBranch = CurrentBranch;
+  }
+  outs() << "Inferred " << InferredTraces << " traces\n";
+}
+
 Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   this->BC = &BC;
 
@@ -541,6 +600,9 @@ Error DataAggregator::preprocessProfile(BinaryContext &BC) {
   // Sort parsed traces for faster processing.
   llvm::sort(Traces, llvm::less_first());
 
+  if (opts::ImputeTraceFallthrough)
+    imputeFallThroughs();
+
   if (opts::HeatmapMode) {
     if (std::error_code EC = printLBRHeatMap())
       return errorCodeToError(EC);
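
The imputation is gated behind the new --impute-trace-fall-through aggregator
option introduced above; as a cl::opt<bool> without an explicit initializer it
defaults to off, so the aggregator's behavior is unchanged unless the flag is
passed.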

aaupov added 2 commits June 23, 2025 12:54
Created using spr 1.3.4