fix(node): Set backfill max job duration to a safe amount of time by Evalir · Pull Request #90 · init4tech/node-components

Evalir · 2026-02-24T21:45:47Z

Description

Our timeout crash is caused by a footgun on Reth's backfill threshold configuration API. Here's how it roughly looks:

reth allows the mdbx tx timeout sentinel thread to have a configurable max timeout, hidden behind a flag (--db.read-transaction-timeout). We currently, AFAIK, do not configure this. The default value is 5 minutes. Any transactions that live beyond that are killed.
Reth, for the exex backfill, sets the max_duration for its backfill processes to 30 seconds. This does not exceed the default max timeout for the mdbx sentinel thread.
The ExecutionStageThresholds default for max duration for execution jobs is 10 minutes. This, by far, exceeds the mdbx timeout sentinel thread max duration. On the execution stage for the reth node this is safe as it calls disable_long_read_transaction_safety(), which bypasses the timeout. This is not done on backfill, so this default is unsafe. This is therefore a massive footgun for any exex that configures backfill thresholds.
We were only tweaking the max_blocks config value, and using the unsafe default for max duration, when setting the backfill thresholds. This leads to us avoiding an OOM crash, but then running into an mdbx crash due to the sentinel thread forcibly killing the transaction.
Before set_backfill_thresholds was added, what caused the OOM was the massive amount of blocks being processed. max_duration had a sane default (30s). Once it was introduced, this new bug was added due to the ::default() usage.
The fix, is therefore, to limit the max duration to a reasonably low time. 30s should be fine.
- Note: 30s is a reasonably margin due to a few reasons:
  - We avoid the hard 5m timeout.
  - The intended default for exex backfill was already 30s.
  - There's a 60s timeout warning already, this avoids it entirely. We've seen this error spuriously over the entirety of the signet node's lifecycle.

This now exposes a new BACKFILL_MAX_DURATION, which tweaks the max duration time, in seconds.

Related Issue

Testing

Tests pass locally
New tests added (if applicable)

Our timeout crash is caused by a footgun on Reth's backfill threshold configuration API. Here's how it roughly looks: - reth allows the mdbx tx timeout sentinel thread to have a configurable max timeout, hidden behind a flag (--db.read-transaction-timeout). We currently, AFAIK, do not configure this. The default value is 5 minutes. Any transactions that live beyond that are killed. - Reth, for the exex backfill, sets the max_duration for its backfill processes to 30 seconds. This does not exceed the default max timeout for the mdbx sentinel thread. - The `ExecutionStageThresholds` default for max duration for execution jobs is 10 minutes. This, by far, exceeds the mdbx timeout sentinel thread max duration. On the execution stage for the reth node this is safe as it calls `disable_long_read_transaction_safety()`, which bypasses the timeout. This is not done on backfill, so this default is unsafe. This is therefore a massive footgun for any exex that configures backfill thresholds. - We were only tweaking the `max_blocks` config value, and using the unsafe default for max duration, when setting the backfill thresholds. This leads to us avoiding an OOM crash, but then running into an mdbx crash due to the sentinel thread forcibly killing the transaction. - Before `set_backfill_thresholds` was added, what caused the OOM was the massive amount of blocks being processed. `max_duration` had a sane default (30s). Once it was introduced, this new bug was added due to the `::default()` usage. - The fix, is therefore, to limit the max duration to a reasonably low time. 30s should be fine. - Note: 30s is a reasonably margin due to a few reasons: - We avoid the hard 5m timeout. - The intended default for exex backfill was already 30s. - There's a 60s timeout warning already, this avoids it entirely. We've seen this error spuriously over the entirety of the signet node's lifecycle.

Evalir · 2026-02-24T21:46:06Z

fix(node): Set backfill max job duration to a safe amount of time #90 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

dylanlott · 2026-02-24T22:06:32Z

crates/node/src/node.rs

                max_blocks: Some(max_blocks),
+                // Keep the backfill mdbx read transaction open for no longer than 30 seconds.
+                // This prevents MDBX from killing the read transaction, leading to a crash if reth's default mdbx read transaction timeout is exceeded.
+                max_duration: Some(Duration::from_secs(30)),


Do we want to expose this to configs with 30 seconds as the default?

No. The correct fix for this is for reth to make this footgun impossible

Sure but do we actually think that'll happen? I just can easily see a future where we're back to tweaking this and the above max_blocks values because of another change.

Right I see what you mean now. Yeah, we can expose this

dylanlott

LGTM. Let's get this loaded up and syncing ASAP. Good detective work! 🔍

dylanlott · 2026-02-24T22:18:58Z

crates/node/src/node.rs

                max_blocks: Some(max_blocks),
+                // Keep the backfill mdbx read transaction open for no longer than 30 seconds.
+                // This prevents MDBX from killing the read transaction, leading to a crash if reth's default mdbx read transaction timeout is exceeded.
+                max_duration: Some(Duration::from_secs(30)),


Sure but do we actually think that'll happen? I just can easily see a future where we're back to tweaking this and the above max_blocks values because of another change.

Evalir requested a review from a team as a code owner February 24, 2026 21:45

dylanlott reviewed Feb 24, 2026

View reviewed changes

Evalir requested review from Fraser999, dylanlott, prestwich and rymcol February 24, 2026 22:16

dylanlott approved these changes Feb 24, 2026

View reviewed changes

Evalir added 2 commits February 24, 2026 23:54

chore: expose backfill max duration config on env var

f72ab07

chore: version bump

b0c71d7

Evalir enabled auto-merge (squash) February 24, 2026 22:56

Evalir merged commit 9017fbc into main Feb 24, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(node): Set backfill max job duration to a safe amount of time#90

fix(node): Set backfill max job duration to a safe amount of time#90
Evalir merged 3 commits intomainfrom
evalir/fix-timeout-crash

Evalir commented Feb 24, 2026 •

edited

Loading

Uh oh!

Evalir commented Feb 24, 2026

Uh oh!

dylanlott Feb 24, 2026

Uh oh!

Evalir Feb 24, 2026

Uh oh!

dylanlott Feb 24, 2026

Uh oh!

Evalir Feb 24, 2026

Uh oh!

dylanlott left a comment

Uh oh!

dylanlott Feb 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Evalir commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Related Issue

Testing

Uh oh!

Evalir commented Feb 24, 2026

Uh oh!

dylanlott Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Evalir Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

dylanlott Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Evalir Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

dylanlott left a comment

Choose a reason for hiding this comment

Uh oh!

dylanlott Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Evalir commented Feb 24, 2026 •

edited

Loading