Rust application much slower when built with rules_rust than with Cargo

I'm going to try to assemble an MRE but it's going to take me a little time because I'm working inside of a proprietary monorepo. Let me describe what I'm seeing:

When building the same Rust codebase with both Cargo and Bazel on Apple Silicon (ARM64) and x86_64 Linux, the Bazel-built binary shows approximately 60% worse performance in benchmarks compared to the Cargo-built version, despite seemingly identical optimization settings.


Environment

- Platform: macOS (ARM64) and Linux (x86_64)
- Rust version: nightly-2025-02-07
- rules_rust version: 0.60.0
- Both builds use static linking
- Both builds have identical environment variables

What I've Tried

- Verified LTO is properly enabled in both builds (both thin and fat LTO, with and without LTO=manual)
- `linker-plugin-lto` only worked on Linux, barely helped (~1.5%)
- Confirmed optimizations are correctly set (opt-level=3, codegen-units=1)
- Tried adding `-Ctarget-cpu=native` and ARM-specific target features on both platforms
- Verified the same code paths are being executed (same performance profile allocation percentages)
- Used `--subcommands` and `RUSTC_LOG=rustc_codegen_ssa::back::lto=info` to confirm LTO was firing; it was, on both platforms
- Manually setting `rustc_flags` on the binary target, library targets, etc. a la https://github.com/bazelbuild/examples/tree/main/rust-examples/03-comp-opt
- The only thing that has significantly helped performance (-10%) has been to set this:

```
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cllvm-args=-import-instr-limit=5000
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cllvm-args=-inline-threshold=10000
```

But even with that it's quite a bit slower than the Cargo version.

Primary hypothesis: Dependency resolution method differences between Cargo and Bazel

My best guess is that the way the Bazel build links the `rust_library` targets defeats cross-crate inlining, monomorphization and optimization during LTO. My project's performance is extremely sensitive to inlining. Caveat: the most performance-critical libraries are `extern` in both Bazel and Cargo.


Secondary hypotheses:

- Different LLVM optimization pass ordering between Cargo and Bazel builds
- Dependency build order affecting optimization outcomes (I don't rate this highly)

Reproduction Steps

- Build the same binary with both Cargo and Bazel using release configurations
- Run benchmarks on both binaries (both outside of the sandbox)
- Observe approximately 60% worse performance in the Bazel-built binary

The benchmark is a `rust_binary` building a `benches/` file that uses the Criterion benchmark harness internally.

I think I need a way to force Bazel to use `--extern`-style dependency resolution rather than `-L` library paths, similar to how Cargo resolves dependencies in workspace mode, to get proper monomorphization and LTO behavior. I cannot use Bazel for this Rust project if this requires rewriting all the crate dependencies as one big source tree.

Frankly I'm a little baffled because I know major projects are using Bazel for their builds and I don't know how I'm running into a perf issue like this that I haven't been able to see mentioned anywhere on the web. And I looked *very* hard for answers. I've been working on this for the last few days.

Here's a sample `.bazelrc` I used in my testing:

Linux:

```
build:release --action_env=RUSTC_LOG=rustc_codegen_ssa::back::lto=info
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=fat
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off

build:release --@rules_rust//rust/settings:extra_rustc_flag=-Clinker-plugin-lto
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cembed-bitcode=yes
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Clink-arg=-fuse-ld=lld
```

macOS:

```
build:release --action_env=RUSTC_LOG=rustc_codegen_ssa::back::lto=info
build:release --compilation_mode=opt
build:release --@rules_rust//rust/settings:lto=fat
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off
```
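When comparing against Cargo's release profile, it can help to list exactly which rustc flags a config contributes. Here's a minimal Python sketch, assuming the `.bazelrc` uses the `build:<config> --flag=value` layout shown above. The function name `rustc_flags_for_config` is hypothetical, and it does not follow `import`/`try-import` lines or nested `--config` expansion.

```python
def rustc_flags_for_config(bazelrc_text, config="release"):
    """Collect extra_rustc_flag values set on `build:<config>` lines."""
    prefix = f"build:{config}"
    marker = "extra_rustc_flag="
    flags = []
    for raw in bazelrc_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split()
        if parts[0] != prefix:
            continue  # some other command or config
        for opt in parts[1:]:
            idx = opt.find(marker)
            if idx != -1:
                # Keep everything after the first "extra_rustc_flag=",
                # so values containing "=" (e.g. -Cllvm-args=...) survive.
                flags.append(opt[idx + len(marker):])
    return flags
```

Running this over the Linux and macOS snippets and sorting the results gives a flat list that can be checked line by line against the `[profile.release]` settings Cargo uses.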

I am desperately hoping that I simply missed something obvious. My command for building the binary would typically be something like:

`bazel build --config=release //crate:bench_name` and then I'd run the benchmark binary in `bazel-bin/` with the `--bench` argument.

One difference between the binaries I noticed was that the Bazel-generated binary was ~8.8 MiB vs. the ~8.3 MiB of the Cargo-built binary. I first noticed the performance disparity with a very vanilla `bazel build -c opt` generated binary.

Toolchain setup from `MODULE.bazel`:

```
bazel_dep(name = "rules_rust", version = "0.60.0")

rust = use_extension("@rules_rust//rust:extensions.bzl", "rust")
rust.toolchain(
    edition = "2021",
    versions = ["nightly/2025-02-07"],
    dev_components = True,
)
use_repo(rust, "rust_toolchains")

register_toolchains("@rust_toolchains//:all")

[...]

crates.from_cargo(
    name = "crates",
    cargo_lockfile = "//:Cargo.lock",
    generate_binaries = True,
    manifests = [
       (Cargo.tomls of the workspace and constituent crates)
    ],
)
```

Then the `rust_binary` for the benchmark:

```
rust_binary(
    name = "my_benchmark",
    srcs = ["benches/my_benchmark.rs"],
    aliases = aliases(),
    edition = "2021",
    proc_macro_deps = all_crate_deps(
        proc_macro = True,
        proc_macro_dev = True,
    ),
    deps = [
        ":lib",
        (other in-workspace dependencies)
    ] + all_crate_deps(
        normal = True,
        normal_dev = True,
    ),
)
```

There are two particular dependencies which are exposed as `rust_library`s in the Bazel build which are _very_ inlining sensitive. Either one not getting inlined properly would explain the overall benchmark being slower without any obvious hot-spots. Comparing the params and rustc invocations from Bazel and Cargo doesn't show any obvious differences, e.g.:

From Bazel, benchmark program params in bazel-out:

```
--extern=pkg-a=bazel-out/darwin_arm64-opt/bin/pkg-a/libpkg-a-262915972.rlib
-Ldependency=bazel-out/darwin_arm64-opt/bin/pkg-a
--extern=pkg-b=bazel-out/darwin_arm64-opt/bin/pkg-b/libpkg-b-2509656758.rlib
-Ldependency=bazel-out/darwin_arm64-opt/bin/pkg-b
```

From Cargo:

```
-L
dependency=/Users/callen/.cargo/cache/release/deps
--extern
pkg-a=/Users/callen/.cargo/cache/release/deps/libpkg-a-27b8eb42159f5c37.rlib
--extern
pkg-b=/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rlib
```
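Eyeballing the two invocations is error-prone because Bazel and Cargo spell the same options differently (`--extern=name=path` vs. `--extern name=path`, and so on). Here's a rough Python sketch that normalizes both spellings into comparable pairs. The function name `normalize_rustc_args` is hypothetical, it only special-cases the flags that appear in this issue, and it deliberately drops build-specific paths so that only real flag differences remain.

```python
VALUE_FLAGS = {"--extern", "-L", "-C", "--emit"}

def normalize_rustc_args(args):
    """Turn a rustc argument list into a set of (flag, value) pairs."""
    pairs = set()
    it = iter(args)
    for arg in it:
        if arg in VALUE_FLAGS:
            # Separate spelling: the value is the next argument.
            flag, value = arg, next(it, "")
        elif arg.startswith("--") and "=" in arg:
            # Joined long spelling: --flag=value
            flag, value = arg.split("=", 1)
        elif arg[:2] in ("-L", "-C") and len(arg) > 2:
            # Joined short spelling: -Lkind=path, -Copt-level=3
            flag, value = arg[:2], arg[2:]
        else:
            flag, value = arg, ""
        if flag in ("--extern", "-L"):
            # Keep the crate name / search kind, drop the path after "=".
            pairs.add((flag, value.split("=", 1)[0]))
        elif flag == "--emit":
            # Split "dep-info,link" and drop any "=output_path" suffix.
            for kind in value.split(","):
                pairs.add((flag, kind.split("=", 1)[0]))
        else:
            pairs.add((flag, value))
    return pairs
```

Feeding it one argument per list element from each build and taking the set difference in both directions shows only the flags that one side has and the other lacks.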

Then buried in Cargo's deps:

```
/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rmeta: (…source listing…)

/Users/callen/.cargo/cache/release/deps/libpkg-b-f26c14e53d574db5.rlib: (…source listing…)

/Users/callen/.cargo/cache/release/deps/pkg-b-f26c14e53d574db5.d: (…source listing…)
```

I've been using `nm -C` and `nm -gUC` with awk to generate symbol table counts and they're strongly similar. The main pattern I'm seeing is additional symbols getting included in the Bazel binary seemingly because dead-code elimination isn't firing for the library artifacts properly.
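For anyone who wants to repeat the symbol comparison without awk, here's a minimal Python sketch that takes the text output of `nm -C` from each binary and reports the symbols present in one but not the other. The function names are hypothetical. It assumes the usual `address type name` column layout (with the address column missing for undefined symbols) and strips the trailing `::h<16 hex digits>` hash from legacy-mangled Rust names so that the same function compares equal across builds.

```python
import re
import string
from collections import Counter

HASH_SUFFIX = re.compile(r"::h[0-9a-f]{16}$")

def parse_nm(nm_output):
    """Map symbol name -> nm type letter for each line of nm output."""
    symbols = {}
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        # Defined symbols start with a hex address; undefined ones do not.
        if len(fields[0]) > 1 and all(c in string.hexdigits for c in fields[0]):
            fields = fields[1:]
        if len(fields) < 2 or len(fields[0]) != 1:
            continue  # not a "type name" pair, skip it
        kind = fields[0]
        # Demangled names may contain spaces, so rejoin the remainder.
        name = HASH_SUFFIX.sub("", " ".join(fields[1:]))
        symbols[name] = kind
    return symbols

def symbols_only_in(left, right):
    """Symbols in `left` that are absent from `right`, plus counts by type."""
    extra = {name: kind for name, kind in left.items() if name not in right}
    return extra, Counter(extra.values())
```

Calling `symbols_only_in(parse_nm(bazel_nm), parse_nm(cargo_nm))` lists the symbols that survive only in the Bazel binary, which is where missed dead-code elimination would show up.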

Turning LTO off for both builds makes the Cargo and Bazel benchmarks about the same.

One difference I just noticed:

Bazel rustc for building the final benchmark binary:

```
--emit=link=bazel-out/darwin_arm64-opt/bin/crate/benchmark
--emit=dep-info
```

Cargo, for the same:

```
--emit=dep-info,link
```

Update: after some research, I think the `=path` argument to `--emit=link` just sets the output path and shouldn't make a difference.

It looks like the rlibs have both object code and bitcode in both Bazel and Cargo.

The `rust_library` definitions look like this:

```
rust_library(
    name = "pkg-b",
    srcs = glob(["src/**/*.rs"]),
    crate_features = [
        "default",
    ],
    edition = "2021",
    proc_macro_deps = [
        "//pkg-b/macros",
    ] + all_crate_deps(proc_macro = True),
    deps = [
        "//pkg-b/vendored-crate-1",
        "//pkg-b/vendored-crate-2",
        "//pkg-b/vendored-crate-3",
        "//pkg-b/vendored-crate-4",
    ] + all_crate_deps(),
)
```

```
rust_library(
    name = "pkg-a",
    srcs = glob(["src/**/*.rs"]),
    aliases = aliases(),
    crate_features = [
        "default",
        "bazel_build",
    ],
    edition = "2021",
    proc_macro_deps = [
        "//pkg-b/macros",
    ] + all_crate_deps(proc_macro = True),
    deps = [
        "//pkg-c",
        "//pkg-b/vendored-crate-2",
        "//pkg-b",
    ] + all_crate_deps(),
)
```

New update: enabling `build:release --@rules_rust//rust/settings:lto=thin` worked locally on Linux and macOS (still same perf issue tho) but barfed on the proc macro crates in CI/CD with a linker error that is indicative of "tried improperly to LTO a proc macro crate".

Update: A Reddit user suggested a possibility:

>Wondering if the CC and CXX compilers, and every other environment variable are the same between cargo and bazel. I think you may be compiling your sys dependencies differently.

Here's my reply:

>>I think you may be compiling your sys dependencies differently.

>Could be, but all the perf impact is the pure Rust crates that live inside the Cargo workspace. There's very very little `-sys` in the dep tree and none of it is in the critical path. If any of it had been, it would've shown up when I profiled the optimized builds with debug symbols.

Update, follow-up:

```
    build:release --compilation_mode=opt
    build:release --@rules_rust//rust/settings:lto=thin
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Copt-level=3
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Ccodegen-units=1
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebug-assertions=off
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Coverflow-checks=off
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cdebuginfo=0
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cstrip=debuginfo
    build:release --@rules_rust//rust/settings:extra_rustc_flag=-Cforce-frame-pointers=no
```

>This made the benchmark 3.5x as slow as the original benchmark with Cargo.

Rust application much slower when built with rules_rust than with Cargo #3407
