
Let BackgroundProcessor drive HTLC forwarding #3891


Open
tnull wants to merge 12 commits into main from 2025-06-batch-forwarding-delays

Conversation

@tnull tnull (Contributor) commented Jun 25, 2025

Closes #3768.
Closes #1101.

Previously, we'd require the user to manually call process_pending_htlc_forwards as part of PendingHTLCsForwardable event handling. Here, we instead move this responsibility to BackgroundProcessor, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling process_pending_htlc_forwards individually for each PendingHTLCsForwardable event. The per-event approach had been unintuitive anyway: subsequent PendingHTLCsForwardable events could lead to overlapping batch intervals, with the shortest timespan 'winning' every time, since process_pending_htlc_forwards would of course handle all pending HTLCs at once.

To this end, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms and drop the PendingHTLCsForwardable event.
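For illustration, the new flow looks roughly like this from the background-processing side; a minimal sketch, assuming a stand-in `ChannelManagerHandle` type and a `sample_batch_delay()` helper rather than the actual LDK API:

```rust
use std::time::{Duration, Instant};

struct ChannelManagerHandle;

impl ChannelManagerHandle {
    fn process_pending_htlc_forwards(&self) {
        // In LDK this would forward/fail all currently-pending HTLCs as one batch.
    }
}

fn sample_batch_delay() -> Duration {
    // Stand-in for a draw from the pre-sampled log-normal table (mean ~50ms).
    Duration::from_millis(50)
}

fn run_forwarding_loop(cm: &ChannelManagerHandle, should_exit: impl Fn() -> bool) {
    let mut last_call = Instant::now();
    let mut cur_batch_delay = sample_batch_delay();
    while !should_exit() {
        if last_call.elapsed() >= cur_batch_delay {
            // One batching round: handle everything that queued up since the last call.
            cm.process_pending_htlc_forwards();
            last_call = Instant::now();
            // Re-sample so the forwarding cadence isn't a fixed, observable interval.
            cur_batch_delay = sample_batch_delay();
        }
        std::thread::sleep(Duration::from_millis(5));
    }
}
```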

Draft for now as I'm still cleaning up the code base as part of the final commit dropping PendingHTLCsForwardable.

ldk-reviews-bot commented Jun 25, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@tnull tnull marked this pull request as draft June 25, 2025 15:12
@joostjager joostjager (Contributor) commented Jun 25, 2025

Does this in any way limit users to not have delays or not have batching? Assuming that's what they want.

@tnull tnull (Contributor, Author) commented Jun 25, 2025

Does this in any way limit users to not have delays or not have batching? Assuming that's what they want.

On the contrary actually: it effectively reduces the (mean and min) forwarding delay quite a bit, which we can allow as we're going to add larger receiver-side delays in the next step. And, while it gets rid of the event, users are still free to call process_pending_htlc_forwards on a faster schedule if they really want to. IMO, this should result in a win-win situation: substantially reduced forwarding delays on average and by default, while still considerably improving receiver anonymity.

@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from ceb3335 to 9ba691c Compare June 26, 2025 08:13
@joostjager joostjager (Contributor) commented:

Isn't it the case that without the event, as a user you are forced to "poll" for forwards, making extra delays unavoidable?

@tnull tnull (Contributor, Author) commented Jun 26, 2025

Isn't it the case that without the event, as a user you are forced to "poll" for forwards, making extra delays unavoidable?

LDK always processes HTLCs in batches (note that process_pending_htlc_forwards never allowed forwarding just a single HTLC, for good reason). Having some batching delay makes a lot of sense in any scenario. And given that 'polling' is really cheap, users could consider doing that frequently. But they really shouldn't try to skip the batching entirely, as IO overhead/delay would come back to bite them (especially on busier forwarding nodes), and of course they should be 'good citizens' providing some privacy by default for the network.

@joostjager joostjager (Contributor) commented:

Polling may be cheap, but forcing users to poll when there is an event mechanism available, is that really the right choice? Perhaps the event is beneficial for testing, debugging and monitoring too?

@tnull tnull (Contributor, Author) commented Jun 26, 2025

Polling may be cheap, but forcing users to poll when there is an event mechanism available, is that really the right choice? Perhaps the event is beneficial for testing, debugging and monitoring too?

The event never carried any information, so it is not helpful for debugging or 'informational' purposes. Plus, it means at least 1-2 more rounds of ChannelManager persistence, just to queue and remove the event. So since we don't need it anymore, we should def. drop it in production. As you know I was on the fence whether to drop it for testing, but now went this way, especially given that nobody indicated a strong opinion either way. If we indeed want to introspect the holding cell during testing (or, e.g., in fuzzing), we should add another approach to do it, but that's up for discussion.

@joostjager joostjager (Contributor) commented Jun 26, 2025

But at least the event could wake up the background processor, whereas now nothing is waking it up for forwards and the user is forced to call into the channel manager at a high frequency? Not sure if there is a lighter way to wake up the BP without persistence involved.

Also if you have to call into channel manager always anyway, aren't there more events/notifiers that can be dropped?

As you know I was on the fence whether to drop it for testing, but now went this way, especially given that nobody indicated a strong opinion either way.

I may have missed this deciding moment.

If the assertions were useless to begin with, no problem dropping them of course. I can imagine though that at some points, a peek into the pending htlc state is still required to not reduce the coverage of the tests?

@tnull tnull (Contributor, Author) commented Jun 26, 2025

But at least the event could wake up the background processor, whereas now nothing is waking it up for forwards and the user is forced to call into the channel manager at a high frequency? Not sure if there is a lighter way to wake up the BP without persistence involved.

Also if you have to call into channel manager always anyway, aren't there more events/notifiers that can be dropped?

As you know I was on the fence whether to drop it for testing, but now went this way, especially given that nobody indicated a strong opinion either way.

I may have missed this deciding moment.

Again, the default behavior we had intended to switch to for quite some time is to introduce batching intervals (especially given that the current event-based approach was essentially broken/race-y). This is what is implemented here. If users want to bend the recommended/default approach they are free to do so, but I don't think it makes sense to keep all the legacy codepaths, including persistence overhead, around if it's not used anymore.

If the assertions were useless to begin with, no problem dropping them of course. I can imagine though that at some points, a peek into the pending htlc state is still required to not reduce the coverage of the tests?

I don't think this is generally the case, no. The 'assertion' that is mainly dropped is 'we generated an event'; everything else remains the same.

@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from 9ba691c to b38c19e Compare June 26, 2025 09:49
@joostjager joostjager (Contributor) commented:

Again, the default behavior we had intended to switch to for quite some time is to introduce batching intervals (especially given that the current event-based approach was essentially broken/race-y). This is what is implemented here. If users want to bend the recommended/default approach they are free to do so, but I don't think it makes sense to keep all the legacy codepaths, including persistence overhead, around if it's not used anymore.

This doesn't rule out a notification when there's something to forward, to at least not keep spinning when there's nothing to do?

@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from c1a0b35 to d35c944 Compare June 26, 2025 13:17
@tnull tnull self-assigned this Jun 26, 2025
@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from d35c944 to c21aeab Compare June 27, 2025 09:29
@tnull tnull requested a review from TheBlueMatt June 27, 2025 09:29
@tnull tnull marked this pull request as ready for review June 27, 2025 09:29
@tnull tnull (Contributor, Author) commented Jun 27, 2025

Finished for now with the test refactoring after dropping the PendingHTLCsForwardable event. This should be good for a first round of (concept) review. Whether or not we should add a notifier on top is up for debate.

@tnull tnull removed the request for review from TheBlueMatt June 27, 2025 09:36
@tnull tnull moved this to Goal: Merge in Weekly Goals Jun 27, 2025
ldk-reviews-bot commented:

✅ Added second reviewer: @valentinewallace

@tnull tnull requested review from TheBlueMatt and removed request for TheBlueMatt June 27, 2025 09:51
@@ -360,12 +376,24 @@ macro_rules! define_run_body {
break;
}

if $timer_elapsed(&mut last_forwards_processing_call, cur_batch_delay) {
$channel_manager.get_cm().process_pending_htlc_forwards();
Contributor:

Looked a bit closer at this function. There is a lot of logic in there. Also various locks obtained.

ldk-reviews-bot commented:

🔔 1st Reminder

Hey @valentinewallace! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

ldk-reviews-bot commented:

🔔 2nd Reminder

Hey @valentinewallace! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from c21aeab to e2ad6ca Compare July 2, 2025 09:55
Previously, all `TIMER` constants were `u64`s implicitly assumed to
represent seconds. Here, we switch them over to be `Duration`s, which
allows for the introduction of sub-second timers. Moreover, it avoids
any future confusion due to the implicitly assumed units.
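For context, switching to `Duration`-typed timers looks roughly like this; constant names and values here are examples, not necessarily the crate's actual ones:

```rust
use std::time::{Duration, Instant};

// Example constants only; the crate's actual names and values may differ.
const PING_TIMER: Duration = Duration::from_secs(10);
const REBROADCAST_TIMER: Duration = Duration::from_secs(30);

// With `Duration`-typed timers, sub-second intervals need no unit juggling.
fn timer_elapsed(last_call: &mut Instant, timeout: Duration) -> bool {
    if last_call.elapsed() >= timeout {
        *last_call = Instant::now();
        true
    } else {
        false
    }
}
```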
@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch 2 times, most recently from dff8088 to dc3022a Compare July 18, 2025 10:26
@tnull tnull (Contributor, Author) left a comment:

Alright, now addressed all the pending feedback (including the expect_pending_* cleanup and increased test coverage) and rebased on main to address a minor conflict.

Let me know when this can be squashed.

@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch 4 times, most recently from 599afd5 to c6b7a66 Compare July 18, 2025 12:01

// Check whether to exit the loop again, as some time might have passed since we
// checked above.
if $loop_exit_check {
Contributor:

Isn't exit happening already immediately in the selector inside $await below, making this new check redundant?

Contributor Author:

No, it's not, because in the $await case we'd always await E or F once before checking/exiting if the bool is set. So it's not immediate.

@joostjager joostjager (Contributor) commented Jul 18, 2025:

I thought that the sleeper used for E and F is also checking for immediate exit?

pmt.is_auto_retryable_now()
|| !pmt.is_auto_retryable_now()
&& pmt.remaining_parts() == 0
&& !pmt.is_fulfilled()
Contributor:

I noticed that in check_retry_payments, is_fulfilled() isn't checked. Not sure if that needs to be consistent?

Contributor Author:

That is the (pre-existing) abandon part. The new part relating to retry is is_auto_retryable_now.

@@ -6336,7 +6336,7 @@ where

// Returns whether or not we need to re-persist.
fn internal_process_pending_htlc_forwards(&self) -> NotifyOption {
let should_persist = NotifyOption::DoPersist;
let mut should_persist = NotifyOption::SkipPersistNoEvents;
self.process_pending_update_add_htlcs();
Contributor:

Nothing in process_pending_update_add_htlcs that triggers persist?

Contributor Author:

I think most (all?) cases would result in forward_htlcs etc. being set, but now added some more broad checks to process_pending_update_add_htlcs to be on the safe side.
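For reference, the 'only persist if we actually did something' pattern under discussion has roughly this shape; the enum mirrors LDK's `NotifyOption`, while the processing body is a simplified stand-in:

```rust
// The enum mirrors LDK's `NotifyOption`; the processing body is a simplified stand-in.
#[derive(Debug, PartialEq)]
enum NotifyOption {
    DoPersist,
    SkipPersistHandleEvents,
    SkipPersistNoEvents,
}

fn internal_process_pending_forwards(pending_forwards: &mut Vec<u64>) -> NotifyOption {
    // Start from the cheapest outcome and only upgrade when state actually changed.
    let mut should_persist = NotifyOption::SkipPersistNoEvents;
    if !pending_forwards.is_empty() {
        // Stand-in for decoding/forwarding the queued HTLCs.
        pending_forwards.clear();
        should_persist = NotifyOption::DoPersist;
    }
    should_persist
}
```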

@@ -6329,9 +6334,19 @@ where
/// Users implementing their own background processing logic should call this in irregular,
/// randomly-distributed intervals.
pub fn process_pending_htlc_forwards(&self) {
if self
Contributor:

This still feels quite unnecessary to me. Yes, a second caller will hit the lock, but currently we only call this in one location.

Contributor Author:

Right, but users might choose to call it on their own timeline in addition to our calls, right? Also not super sure if we need it, but it should be super cheap and possibly avoids some unnecessary lock congestion.

Contributor:

I'd just leave it out until it becomes a problem if we are not sure.
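For reference, the guard being debated is roughly the following `AtomicBool` pattern; struct and field names here are illustrative, not LDK's actual ones:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

struct Forwarder {
    pending_htlc_processing: AtomicBool,
}

impl Forwarder {
    fn process_pending_htlc_forwards(&self) {
        // If another call is already draining the current batch, simply return.
        if self
            .pending_htlc_processing
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return;
        }
        // ... actual batch processing would happen here ...
        self.pending_htlc_processing.store(false, Ordering::Release);
    }
}
```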

// log_normal_data <- round(rlnorm(n, meanlog = meanlog, sdlog = sdlog))
// cat(log_normal_data, file = "log_normal_data.txt", sep = ", ")
// ```
static FWD_DELAYS_MILLIS: [u16; 10000] = [
Contributor:

Maybe the least you can do is just scale that distribution, so that there is some control over the batch delay?
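One hypothetical way to expose such control, sketched under the assumption of a user-facing scale knob (which this PR does not add): keep the log-normal shape but scale each draw.

```rust
use std::time::Duration;

// Stand-in for a draw from the pre-sampled log-normal table (mean ~50ms).
fn base_delay_millis() -> u64 {
    50
}

// Scale the draw by a caller-chosen percentage to lengthen or shorten batches.
fn scaled_batch_delay(scale_percent: u64) -> Duration {
    Duration::from_millis(base_delay_millis() * scale_percent / 100)
}
```

E.g. `scaled_batch_delay(200)` would double the mean batch delay to roughly 100ms, while `scaled_batch_delay(20)` would shrink it to roughly 10ms, without changing the shape of the distribution.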

@@ -1081,7 +1081,7 @@ fn invalid_async_receive_with_retry<F1, F2>(
// Fail the HTLC backwards to enable us to more easily modify the now-Retryable outbound to test
// failures on the recipient's end.
nodes[2].node.fail_htlc_backwards(&payment_hash);
expect_pending_htlcs_forwardable_conditions(
expect_htlc_failure_conditions(
Contributor:

This looks like a clean rename that could be isolated?

Contributor Author:

Yes, that would be doable if you prefer. Now made it a non-fixup commit.

ldk-reviews-bot commented:

🔔 1st Reminder

Hey @TheBlueMatt @valentinewallace! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.


@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from c6b7a66 to 7ef6769 Compare July 21, 2025 11:35
@tnull tnull (Contributor, Author) commented Jul 21, 2025

Addressed the pending (literally, lol) feedback.

@TheBlueMatt TheBlueMatt (Collaborator) left a comment:

LGTM, basically. Feel free to squash fixups AFAIC.

fn rand_batch_delay_millis() -> u16 {
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let mut random_bytes = [0u8; USIZE_LEN];
possiblyrandom::getpossiblyrandom(&mut random_bytes);
Collaborator:

I don't think we want the fallback logic as-is. possiblyrandom is a bit smarter than the coarse std-or-not we use in lightning (eg some users may turn on lightning no-std to disable time in an environment where random is perfectly available, eg SGX). If we're using possiblyrandom we should always call it and let its internal fallback logic decide whether to provide zeros. We can also just take one of the 50s that are in the middle of FWD_DELAYS_MILLIS and move it to the first position....that's still random :p.
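A rough sketch of the suggested approach, assuming the `possiblyrandom` crate is a dependency: always call it (it decides internally whether to fall back to zeros) and keep a ~50ms entry at index 0 so the all-zeros fallback still yields the mean delay. The table below is a placeholder for the real 10,000-entry one.

```rust
// Placeholder for the real 10_000-entry table; note the ~50ms entry at index 0 so the
// all-zeros fallback of `possiblyrandom` still yields the mean delay.
static FWD_DELAYS_MILLIS: [u16; 4] = [50, 23, 81, 47];

fn rand_batch_delay_millis() -> u16 {
    const USIZE_LEN: usize = core::mem::size_of::<usize>();
    let mut random_bytes = [0u8; USIZE_LEN];
    // Always call into `possiblyrandom`; it internally decides whether real randomness
    // is available and falls back to zeros otherwise.
    possiblyrandom::getpossiblyrandom(&mut random_bytes);
    let index = usize::from_le_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
    FWD_DELAYS_MILLIS[index]
}
```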

@@ -1365,6 +1340,7 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
},
}
}
nodes[$node].process_pending_htlc_forwards();
Collaborator:

Given we only process in prod when ChannelManager says we need to, let's keep that behavior here.

@tnull tnull (Contributor, Author) commented Jul 21, 2025:

Given we only process in prod when ChannelManager says we need to, let's keep that behavior here.

Ah, we actually reverted that as Joost had concerns whether that was stable (e.g., in case we'd add some behavior but don't update the checks). We now use needs_pending_htlc_processing to skip the BP wakeup, but if we're in the BP loop and the delay is up, we just call process_pending_htlc_forwards to make sure we always process eventually. Should I still add an if checking on needs_pending_htlc_processing here, as it might be able to detect such bugs, even if we don't use it exactly like that in prod?

Collaborator:

in case we'd add some behavior but don't update the checks

right, but presumably that would be a bug we want the fuzzer to catch :)

We now use needs_pending_htlc_processing to skip the BP wakeup, but if we're in the BP loop and the delay is up, we just call process_pending_htlc_forwards to make sure we always process eventually.

Yea, that's great in prod, but the extra time required to go around the loop in order to unblock an HTLC would presumably still be something we consider a bug.

Should I still add an if checking on needs_pending_htlc_processing here, as it might be able to detect such bugs, even if we don't use it exactly like that in prod?

Yea, I think so, though thinking about it again I think it may need to be a `while` loop? The fuzzing loop doesn't get woken by the ChannelManager having more work to do, and it's possible that we do a processing step and then end up with more processing to do as a result, which in the BP would make the loop go around again fast, but here would not.
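A sketch of the `while`-based draining suggested here; the `Node` type and its internals are stand-ins, with only the two method names taken from the discussion above:

```rust
use std::cell::Cell;

// Stand-in node: only the two method names follow the discussion above.
struct Node {
    pending_batches: Cell<u32>,
}

impl Node {
    fn needs_pending_htlc_processing(&self) -> bool {
        self.pending_batches.get() > 0
    }
    fn process_pending_htlc_forwards(&self) {
        // Processing one batch may queue follow-up work (e.g. a forward that then
        // fails back), which is why a single call isn't sufficient in the fuzz loop.
        self.pending_batches.set(self.pending_batches.get().saturating_sub(1));
    }
}

fn drain_forwards(node: &Node) {
    // Unlike the BackgroundProcessor, the fuzz loop is never woken to come back around,
    // so keep draining until the manager reports nothing is left to process.
    while node.needs_pending_htlc_processing() {
        node.process_pending_htlc_forwards();
    }
}
```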

tnull added 10 commits July 21, 2025 15:38
Previously, we'd require the user to manually call
`process_pending_htlc_forwards` as part of `PendingHTLCsForwardable`
event handling. Here, we rather move this responsibility to
`BackgroundProcessor`, which simplifies the flow and allows us to
implement reasonable forwarding delays on our side rather than
delegating to users' implementations.

Note this also introduces batching rounds rather than calling
`process_pending_htlc_forwards` individually for each
`PendingHTLCsForwardable` event, which had been unintuitive anyways, as
subsequent `PendingHTLCsForwardable` could lead to overlapping batch
intervals, resulting in the shortest timespan 'winning' every time, as
`process_pending_htlc_forwards` would of course handle all pending HTLCs
at once.
Now that we have `BackgroundProcessor` drive the batch forwarding of
HTLCs, we implement random sampling of batch delays from a log-normal
distribution with a mean of 50ms.
.. instead we just move 50 ms up to first position
.. as `forward_htlcs` now does the same thing
.. as `fail_htlcs_backwards_internal` now does the same thing
We move the code into the `optionally_notify` closure, but maintain the
behavior for now. In the next step, we'll use this to make sure we only
repersist when necessary.
We skip repersisting `ChannelManager` when nothing is actually
processed.
We add a reentrancy guard to disallow entering
`process_pending_htlc_forwards` multiple times. This makes sure that
we'd skip any additional processing calls if a prior round/batch of
processing is still underway.
@tnull tnull (Contributor, Author) left a comment:

Squashed fixups, and included one more that reverts the fallback logic requested elsewhere


@tnull tnull force-pushed the 2025-06-batch-forwarding-delays branch from 7ef6769 to bc80d0a Compare July 21, 2025 13:44
Successfully merging this pull request may close these issues: Revisit PendingHTLCsForwardable delay duration; Randomize PendingHTLCsForwardable::time_forwardable internally.