Description
I'm going to try to assemble an MRE but it's going to take me a little time because I'm working inside of a proprietary monorepo. Let me describe what I'm seeing:
When building the same Rust codebase with both Cargo and Bazel on Apple Silicon (ARM64) and x86_64 Linux, the Bazel-built binary shows approximately 60% worse performance in benchmarks compared to the Cargo-built version, despite seemingly identical optimization settings.
Environment
Platform: macOS (ARM64) and Linux (x86_64)
Rust version: nightly-2025-02-07
rules_rust version: 0.60.0
Both builds use static linking
Both builds have identical environment variables
What I've Tried
- Verified LTO is properly enabled in both builds (both thin and fat LTO, with and without LTO=manual). linker-plugin-lto only worked on Linux and barely helped (~1.5%).
- Confirmed optimizations are correctly set (opt-level=3, codegen-units=1).
- Tried adding -Ctarget-cpu=native and ARM-specific target features on both platforms.
- Verified the same code paths are being executed (same performance profile allocation percentages).
- Used --subcommands and RUSTC_LOG=rustc_codegen_ssa::back::lto=info to confirm LTO was firing; it was on both platforms.
- Manually set rustc_flags on the binary target, library targets, etc., a la https://github.com/bazelbuild/examples/tree/main/rust-examples/03-comp-opt
- The only thing that has significantly helped performance (-10%) has been to set this:
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cllvm-args=-import-instr-limit=5000
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cllvm-args=-inline-threshold=10000
But even with that it's quite a bit slower than the Cargo version.
Primary hypothesis: Dependency resolution method differences between Cargo and Bazel
My best guess is that the way the Bazel build links the rust_library targets defeats cross-crate inlining, monomorphization, and optimization during LTO. My project's performance is extremely sensitive to inlining. Caveat: the most performance-sensitive/critical libraries are both extern in Bazel and Cargo.
Secondary hypotheses:
- Different LLVM optimization pass ordering between Cargo and Bazel builds
- Dependency build order affecting optimization outcomes (I don't rate this highly)
Reproduction Steps
- Build the same binary with both Cargo and Bazel using release configurations
- Run benchmarks on both binaries (both outside of the sandbox)
- Observe approximately 60% worse performance in the Bazel-built binary
The benchmark is a rust_binary building a benches/ file that uses the Criterion benchmark harness internally.
I think I need a way to force Bazel to use --extern-style dependency resolution rather than -L library paths, similar to how Cargo resolves dependencies in workspace mode, to get proper monomorphization and LTO behavior. I cannot use Bazel for this Rust project if this requires rewriting all the crate dependencies as one big source tree.
Frankly I'm a little baffled because I know major projects are using Bazel for their builds and I don't know how I'm running into a perf issue like this that I haven't been able to see mentioned anywhere on the web. And I looked very hard for answers. I've been working on this for the last few days.
Here's a sample .bazelrc I used in my testing:
Linux:
build:release --action_env=RUSTC_LOG=rustc_codegen_ssa::back::lto=info
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=fat
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Clinker-plugin-lto
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cembed-bitcode=yes
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Clink-arg=-fuse-ld=lld
macOS:
build:release --action_env=RUSTC_LOG=rustc_codegen_ssa::back::lto=info
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=fat
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off
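To make the difference between the two platform configs explicit, here is a minimal Python sketch that pulls the options attached to build:release out of each .bazelrc text and lists the ones that only appear on Linux. The function name config_options and the LINUX/MACOS constants are made up for illustration; the config lines are the ones quoted above.

```python
def config_options(bazelrc_text, config="release"):
    """Return the options attached to `build:<config>` lines, in file order."""
    prefix = f"build:{config}"
    options = []
    for line in bazelrc_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        head, _, rest = line.partition(" ")
        if head == prefix:
            options.extend(rest.split())
    return options


FLAG = "--@rules_rust//rust/settings:extra_rustc_flag="

# Options shared by both platform configs quoted above.
COMMON = f"""\
build:release --action_env=RUSTC_LOG=rustc_codegen_ssa::back::lto=info
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=fat
build:release {FLAG}-Copt-level=3
build:release {FLAG}-Ccodegen-units=1
build:release {FLAG}-Cdebug-assertions=off
build:release {FLAG}-Coverflow-checks=off
"""

LINUX = COMMON + f"""\
build:release {FLAG}-Clinker-plugin-lto
build:release {FLAG}-Cembed-bitcode=yes
build:release {FLAG}-Clink-arg=-fuse-ld=lld
"""

MACOS = COMMON

macos_options = set(config_options(MACOS))
linux_only = [o for o in config_options(LINUX) if o not in macos_options]
for option in linux_only:
    print(option)
```

With the configs above, this lists only the three linker-related flags (linker-plugin-lto, embed-bitcode, and the lld link-arg), so everything else is identical between the platforms.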
I am desperately hoping that I simply missed something obvious. My command for building the binary would typically be something like:
bazel build --config=release //crate:bench_name
and then I'd run the benchmark binary in bazel-bin/ with the --bench argument.
One difference between the binaries I noticed was that the Bazel-generated binary was ~8.8 MiB vs. ~8.3 MiB for the Cargo-built binary. I first noticed the performance disparity with a very vanilla bazel build -c opt generated binary.
Toolchain setup from MODULE.bazel:
bazel_dep(name = "rules_rust", version = "0.60.0")
rust = use_extension("@rules_rust//rust:extensions.bzl", "rust")
rust.toolchain(
    edition = "2021",
    versions = ["nightly/2025-02-07"],
    dev_components = True,
)
use_repo(rust, "rust_toolchains")
register_toolchains("@rust_toolchains//:all")
[...]
crates.from_cargo(
    name = "crates",
    cargo_lockfile = "//:Cargo.lock",
    generate_binaries = True,
    manifests = [
        (Cargo.tomls of the workspace and constituent crates)
    ],
)
Then the rust_binary for the benchmark:
rust_binary(
    name = "my_benchmark",
    srcs = ["benches/my_benchmark.rs"],
    aliases = aliases(),
    edition = "2021",
    proc_macro_deps = all_crate_deps(
        proc_macro = True,
        proc_macro_dev = True,
    ),
    deps = [
        ":lib",
        (other in-workspace dependencies)
    ] + all_crate_deps(
        normal = True,
        normal_dev = True,
    ),
)
There are two particular dependencies, exposed as rust_library targets in the Bazel build, which are very inlining-sensitive; either one not getting inlined properly would explain the overall benchmark being slower without any obvious hot spots. Comparing the params and rustc invocations against Cargo's doesn't show any obvious differences, e.g.:
From Bazel, benchmark program params in bazel-out:
--extern=pkg-a=bazel-out/darwin_arm64-opt/bin/pkg-a/libpkg-a-262915972.rlib
-Ldependency=bazel-out/darwin_arm64-opt/bin/pkg-a
--extern=pkg-b=bazel-out/darwin_arm64-opt/bin/pkg-b/libpkg-b-2509656758.rlib
-Ldependency=bazel-out/darwin_arm64-opt/bin/pkg-b
From Cargo:
-L
dependency=/Users/callen/.cargo/cache/release/deps
--extern
pkg-a=/Users/callen/.cargo/cache/release/deps/libpkg-a-27b8eb42159f5c37.rlib
--extern
pkg-b=/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rlib
Then buried in Cargo's deps:
/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rmeta: (…source listing…)
/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rlib: (…source listing…)
/Users/callen/.cargo/cache/release/deps/pkg-b-f26c14e53d574db5.d: (…source listing…)
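Eyeballing the two invocations is error-prone because Bazel writes flags as -Ckey=value while Cargo writes -C key=value. A rough Python sketch for diffing only the codegen options is below. It assumes the Bazel params file has one argument per line and that the Cargo command line was copied from verbose build output. The helper names (codegen_flags, diff_flags) and the sample argument strings are invented for illustration, not taken from my real build.

```python
import shlex


def codegen_flags(argv):
    """Collect rustc -C options into {key: [values]}, accepting both spellings."""
    flags = {}
    args = iter(argv)
    for arg in args:
        if arg in ("-C", "--codegen"):
            opt = next(args, "")          # value is the following token
        elif arg.startswith("--codegen="):
            opt = arg[len("--codegen="):]
        elif arg.startswith("-C"):
            opt = arg[2:]                 # glued form, e.g. -Copt-level=3
        else:
            continue
        key, _, value = opt.partition("=")
        # Some options (link-arg, llvm-args) may repeat, so keep a list.
        flags.setdefault(key, []).append(value)
    return flags


def diff_flags(left, right):
    """Return {key: (left_values, right_values)} for keys that differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Illustrative inputs only (hypothetical, shortened).
bazel_params = """\
--extern=pkg-a=libpkg-a.rlib
-Copt-level=3
-Ccodegen-units=1
-Cllvm-args=-inline-threshold=10000
"""
cargo_cmdline = (
    "rustc --crate-name my_benchmark -C opt-level=3 "
    "-C codegen-units=1 -C lto=fat --extern pkg-a=libpkg-a.rlib"
)

bazel_flags = codegen_flags(bazel_params.splitlines())
cargo_flags = codegen_flags(shlex.split(cargo_cmdline))
for key, (bazel_value, cargo_value) in diff_flags(bazel_flags, cargo_flags).items():
    print(f"{key}: bazel={bazel_value} cargo={cargo_value}")
```

Running the same diff on every rust_library action, not just the final binary, would show whether a dependency crate is being compiled with different codegen options than the top-level target.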
I've been using nm -C and nm -gUC with awk to generate symbol table counts and they're strongly similar. The main pattern I'm seeing is additional symbols getting included in the Bazel binary, seemingly because dead-code elimination isn't firing properly for the library artifacts.
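For anyone who wants to reproduce the symbol comparison, here is a minimal Python sketch of the same idea. It assumes the output of nm -C for each binary has been saved as text, and it strips the 16-hex-digit ::h... hash suffix that legacy Rust mangling appends, since those hashes differ between the two builds. The function names and the sample nm lines are hypothetical.

```python
import re
from collections import Counter

# Legacy-mangled Rust symbols end in ::h followed by 16 hex digits.
_HASH_SUFFIX = re.compile(r"::h[0-9a-f]{16}$")


def parse_nm(text):
    """Return a list of (type, name) pairs parsed from `nm -C` output."""
    symbols = []
    for line in text.splitlines():
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue
        first, rest = fields
        if len(first) == 1:
            # Undefined symbols have no address column: "U name".
            sym_type, name = first, rest
        else:
            parts = rest.split(None, 1)  # "T name" after the address
            if len(parts) < 2:
                continue
            sym_type, name = parts
        symbols.append((sym_type, _HASH_SUFFIX.sub("", name.strip())))
    return symbols


def type_counts(symbols):
    """Count symbols per nm type letter (T, t, U, ...)."""
    return Counter(sym_type for sym_type, _ in symbols)


def only_in_first(first, second):
    """Names present in `first` but absent from `second`, sorted."""
    other = {name for _, name in second}
    return sorted({name for _, name in first} - other)


# Made-up sample output, for illustration only.
bazel_nm = """\
0000000100003f50 T _main
0000000100004000 t pkg_b::parse::h0123456789abcdef
0000000100004100 t pkg_b::unused_helper::hfedcba9876543210
                 U _malloc
"""
cargo_nm = """\
0000000100003e00 T _main
0000000100003f00 t pkg_b::parse::haaaaaaaaaaaaaaaa
                 U _malloc
"""

bazel_symbols = parse_nm(bazel_nm)
cargo_symbols = parse_nm(cargo_nm)
print(type_counts(bazel_symbols))
print(only_in_first(bazel_symbols, cargo_symbols))
```

The list from only_in_first is the interesting part: functions that survive in the Bazel binary but not in the Cargo one are candidates for code that was inlined or eliminated in one build and not the other.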
Turning LTO off for both builds makes the Cargo and Bazel benchmarks about the same.
One difference I just noticed:
Bazel rustc for building the final benchmark binary:
--emit=link=bazel-out/darwin_arm64-opt/bin/crate/benchmark
--emit=dep-info
Cargo, for the same:
--emit=dep-info,link
Update: after some research, I think the =path argument to emit=link is just setting the output path, so it shouldn't make a difference.
It looks like the rlibs have both object code and bitcode in both Bazel and Cargo.
The rust_library definitions look like this:
rust_library(
    name = "pkg-b",
    srcs = glob(["src/**/*.rs"]),
    crate_features = [
        "default",
    ],
    edition = "2021",
    proc_macro_deps = [
        "//pkg-b/macros",
    ] + all_crate_deps(proc_macro = True),
    deps = [
        "//pkg-b/vendored-crate-1",
        "//pkg-b/vendored-crate-2",
        "//pkg-b/vendored-crate-3",
        "//pkg-b/vendored-crate-4",
    ] + all_crate_deps(),
)
rust_library(
    name = "pkg-a",
    srcs = glob(["src/**/*.rs"]),
    aliases = aliases(),
    crate_features = [
        "default",
        "bazel_build",
    ],
    edition = "2021",
    proc_macro_deps = [
        "//pkg-b/macros",
    ] + all_crate_deps(proc_macro = True),
    deps = [
        "//pkg-c",
        "//pkg-b/vendored-crate-2",
        "//pkg-b",
    ] + all_crate_deps(),
)
New update: enabling build:release --@rules_rust//rust/settings:lto=thin worked locally on Linux and macOS (still same perf issue tho) but barfed on the proc macro crates in CI/CD with a linker error that is indicative of "tried improperly to LTO a proc macro crate".
Update: A Reddit user suggested a possibility:
"Wondering if the CC and CXX compilers, and every other environment variable, are the same between Cargo and Bazel. I think you may be compiling your sys dependencies differently."
Here's my reply:
Could be, but all the perf impact is in the pure Rust crates that live inside the Cargo workspace. There's very, very little -sys in the dep tree and none of it is in the critical path. If any of it had been, it would've shown up when I profiled the optimized builds with debug symbols.
Update, follow-up:
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=thin
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebuginfo=0
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cstrip=debuginfo
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cforce-frame-pointers=no
This made the benchmark 3.5x as slow as the original benchmark with Cargo.