Let BackgroundProcessor drive HTLC forwarding #3891
base: main
Conversation
👋 Thanks for assigning @valentinewallace as a reviewer!
Does this in any way limit users to not have delays or not have batching? Assuming that's what they want.
On the contrary actually: it effectively reduces the (mean and min forwarding) delay quite a bit, which we can allow as we're gonna add larger receiver-side delays in the next step. And, while it gets rid of the event, users are still free to call
ceb3335 to 9ba691c Compare
Isn't it the case that without the event, as a user you are forced to "poll" for forwards, making extra delays unavoidable?
LDK always processes HTLCs in batches (note that
Polling may be cheap, but forcing users to poll when there is an event mechanism available, is that really the right choice? Perhaps the event is beneficial for testing, debugging and monitoring too?
The event never carried any information, so it's not helpful for debugging or 'informational' purposes. Plus, it means at least 1-2 more rounds of
But at least the event could wake up the background processor, whereas now nothing is waking it up for forwards and the user is forced to call into the channel manager at a high frequency? Not sure if there is a lighter way to wake up the BP without persistence involved. Also, if you have to call into the channel manager always anyway, aren't there more events/notifiers that can be dropped?
I may have missed this deciding moment. If the assertions were useless to begin with, no problem dropping them of course. I can imagine though that at some points, a peek into the pending htlc state is still required to not reduce the coverage of the tests? |
Again, the default behavior we had intended to switch to for quite some time is to introduce batching intervals (especially given that the current event-based approach was essentially broken/race-y). This is what is implemented here. If users want to bend the recommended/default approach they are free to do so, but I don't think it makes sense to keep all the legacy codepaths, including persistence overhead, around if it's not used anymore.
I don't think this is generally the case, no. The 'assertion' that is mainly dropped is 'we generated an event', everything else remains the same.
9ba691c to b38c19e Compare
This doesn't rule out a notification when there's something to forward, to at least not keep spinning when there's nothing to do?
c1a0b35 to d35c944 Compare
d35c944 to c21aeab Compare
Finished for now with the test refactoring post-dropping
✅ Added second reviewer: @valentinewallace
@@ -360,12 +376,24 @@ macro_rules! define_run_body {
break;
}

if $timer_elapsed(&mut last_forwards_processing_call, cur_batch_delay) {
$channel_manager.get_cm().process_pending_htlc_forwards();
Looked a bit closer at this function. There is a lot of logic in there. Also various locks obtained.
🔔 1st Reminder Hey @valentinewallace! This PR has been waiting for your review.
🔔 2nd Reminder Hey @valentinewallace! This PR has been waiting for your review.
c21aeab to e2ad6ca Compare
Previously, all `TIMER` constants were `u64`s implicitly assumed to represent seconds. Here, we switch them over to be `Duration`s, which allows for the introduction of sub-second timers. Moreover, it avoids any future confusion due to the implicitly assumed units.
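As a minimal illustration of the switch (constant names and values here are placeholders, not the exact upstream definitions):

```rust
use std::time::{Duration, Instant};

// Hypothetical names/values for illustration; implicit `u64` seconds become
// explicit `Duration`s, which also allows sub-second timers such as the
// forwarding batch delay.
const PING_TIMER: Duration = Duration::from_secs(10);
const FRESHNESS_TIMER: Duration = Duration::from_secs(60);
const FORWARDS_PROCESSING_TIMER: Duration = Duration::from_millis(50);

// Returns true (and resets the timer) once the given interval has elapsed.
fn timer_elapsed(last_call: &mut Instant, timer: Duration) -> bool {
	if last_call.elapsed() >= timer {
		*last_call = Instant::now();
		true
	} else {
		false
	}
}
```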
dff8088 to dc3022a Compare
Alright, now addressed all the pending feedback (including `expect_pending_*` cleanup, increasing test coverage) and rebased on `main` to address a minor conflict.
Let me know when this can be squashed.
599afd5 to c6b7a66 Compare
// Check whether to exit the loop again, as some time might have passed since we
// checked above.
if $loop_exit_check {
Isn't exit happening already immediately in the selector inside `$await` below, making this new check redundant?
No, it's not, because in the `$await` case we'd always await `E` or `F` once before checking/exiting if the bool is set. So it's not immediate.
I thought that the sleeper used for `E` and `F` is also checking for immediate exit?
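As a rough, synchronous sketch of the control flow under discussion (the real code is the `define_run_body!` macro, so all names here are simplified stand-ins):

```rust
// Simplified stand-in for the background-processor loop body: process forwards,
// then re-check the exit flag before sleeping, since processing may have taken
// long enough that one more sleeper round would noticeably delay shutdown.
fn run_body(should_exit: impl Fn() -> bool, process_forwards: impl Fn(), sleep: impl Fn()) {
	loop {
		if should_exit() {
			break;
		}
		process_forwards();
		// Some time might have passed above, so check again before sleeping.
		if should_exit() {
			break;
		}
		sleep();
	}
}
```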
pmt.is_auto_retryable_now()
|| !pmt.is_auto_retryable_now()
&& pmt.remaining_parts() == 0
&& !pmt.is_fulfilled()
I noticed that in `check_retry_payments`, `is_fulfilled()` isn't checked. Not sure if that needs to be consistent?
That is the (pre-existing) abandon part. The new part relating to retry is `is_auto_retryable_now`.
lightning/src/ln/channelmanager.rs
Outdated
@@ -6336,7 +6336,7 @@ where

// Returns whether or not we need to re-persist.
fn internal_process_pending_htlc_forwards(&self) -> NotifyOption {
let should_persist = NotifyOption::DoPersist;
let mut should_persist = NotifyOption::SkipPersistNoEvents;
self.process_pending_update_add_htlcs();
Nothing in `process_pending_update_add_htlcs` that triggers persist?
I think most (all?) cases would result in `forward_htlcs` etc. being set, but now added some more broad checks to `process_pending_update_add_htlcs` to be on the safe side.
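To illustrate the persistence-skipping pattern under discussion, a simplified sketch; the `NotifyOption` variants mirror the diff above, while the function and its inputs are stand-ins:

```rust
// Simplified stand-in for `internal_process_pending_htlc_forwards`: start out
// assuming no persistence is needed, and only upgrade to `DoPersist` once some
// pending work is actually processed.
enum NotifyOption {
	DoPersist,
	SkipPersistNoEvents,
}

fn internal_process(pending_forwards: &[u64]) -> NotifyOption {
	let mut should_persist = NotifyOption::SkipPersistNoEvents;
	for _htlc in pending_forwards {
		// Processing mutated channel state, so flag that we need to re-persist.
		should_persist = NotifyOption::DoPersist;
	}
	should_persist
}
```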
@@ -6329,9 +6334,19 @@ where
/// Users implementing their own background processing logic should call this in irregular,
/// randomly-distributed intervals.
pub fn process_pending_htlc_forwards(&self) {
if self
This still feels quite unnecessary to me. Yes, a second caller will hit the lock, but currently we only call this in one location.
Right, but users might choose to call it on their own timeline in addition to our calls, right? Also not super sure if we need it, but it should be super cheap and possibly avoids some unnecessary lock contention.
I'd just leave it out until it becomes a problem if we are not sure.
// log_normal_data <- round(rlnorm(n, meanlog = meanlog, sdlog = sdlog))
// cat(log_normal_data, file = "log_normal_data.txt", sep = ", ")
// ```
static FWD_DELAYS_MILLIS: [u16; 10000] = [
Maybe the least you can do is just scale that distribution, so that there is some control over the batch delay?
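One possible shape of such a scaling knob, sketched against a truncated stand-in for the table; `scale_permille` and the values below are hypothetical:

```rust
// Truncated stand-in for the full 10000-entry table generated from the
// log-normal distribution (mean ~50ms).
static FWD_DELAYS_MILLIS: [u16; 8] = [50, 32, 71, 45, 58, 39, 66, 49];

// Scales a sampled delay by `scale_permille` / 1000, giving some control over
// the effective batch delay without regenerating the distribution.
fn scaled_batch_delay_millis(index: usize, scale_permille: u64) -> u64 {
	let base = FWD_DELAYS_MILLIS[index % FWD_DELAYS_MILLIS.len()] as u64;
	base * scale_permille / 1000
}
```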
@@ -1081,7 +1081,7 @@ fn invalid_async_receive_with_retry<F1, F2>(
// Fail the HTLC backwards to enable us to more easily modify the now-Retryable outbound to test
// failures on the recipient's end.
nodes[2].node.fail_htlc_backwards(&payment_hash);
expect_pending_htlcs_forwardable_conditions(
expect_htlc_failure_conditions(
This looks like a clean rename that could be isolated?
Yes, that would be doable if you prefer. Now made it a non-fixup commit.
🔔 1st Reminder Hey @TheBlueMatt @valentinewallace! This PR has been waiting for your review.
1 similar comment
c6b7a66 to 7ef6769 Compare
Addressed the pending (literally, lol) feedback.
LGTM, basically. Feel free to squash fixups AFAIC.
fn rand_batch_delay_millis() -> u16 {
const USIZE_LEN: usize = core::mem::size_of::<usize>();
let mut random_bytes = [0u8; USIZE_LEN];
possiblyrandom::getpossiblyrandom(&mut random_bytes);
I don't think we want the fallback logic as-is. `possiblyrandom` is a bit smarter than the coarse std-or-not we use in `lightning` (eg some users may turn on `lightning` no-std to disable time in an environment where random is perfectly available, eg SGX). If we're using `possiblyrandom` we should always call it and let its internal fallback logic decide whether to provide zeros. We can also just take one of the 50s that are in the middle of `FWD_DELAYS_MILLIS` and move it to the first position....that's still random :p.
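A minimal sketch of the approach described here, assuming the `possiblyrandom::getpossiblyrandom` call from the diff and a truncated stand-in for `FWD_DELAYS_MILLIS` with a ~50ms entry moved to index 0:

```rust
// Truncated stand-in for the precomputed delay table; per the suggestion above,
// a ~50ms entry sits at index 0 so that the all-zeros fallback of
// `possiblyrandom` still yields a sensible delay.
static FWD_DELAYS_MILLIS: [u16; 6] = [50, 12, 87, 34, 63, 41];

fn rand_batch_delay_millis() -> u16 {
	const USIZE_LEN: usize = core::mem::size_of::<usize>();
	// Always ask `possiblyrandom` for bytes and let its internal fallback logic
	// decide whether to hand back zeros on platforms without a randomness source.
	let mut random_bytes = [0u8; USIZE_LEN];
	possiblyrandom::getpossiblyrandom(&mut random_bytes);
	// With the all-zeros fallback, `index` is 0 and we land on the ~50ms entry.
	let index = usize::from_le_bytes(random_bytes) % FWD_DELAYS_MILLIS.len();
	FWD_DELAYS_MILLIS[index]
}
```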
@@ -1365,6 +1340,7 @@ pub fn do_test<Out: Output>(data: &[u8], underlying_out: Out, anchors: bool) {
},
}
}
nodes[$node].process_pending_htlc_forwards();
Given we only process in prod when `ChannelManager` says we need to, let's keep that behavior here.
> Given we only process in prod when `ChannelManager` says we need to, let's keep that behavior here.

Ah, we actually reverted that as Joost had concerns whether that was stable (e.g., in case we'd add some behavior but don't update the checks). We now use `needs_pending_htlc_processing` to skip the BP wakeup, but if we're in the BP loop and the delay is up we just call `process_pending_htlc_forwards` to make sure we'd always process eventually. Should I still add an `if` checking on `needs_pending_htlc_processing` here, as it might be able to detect such bugs, even if we don't use it exactly like that in prod?
> in case we'd add some behavior but don't update the checks

Right, but presumably that would be a bug we want the fuzzer to catch :)

> We now use the `needs_pending_htlc_processing` to skip the BP wakeup, but if we're in the BP loop and the delay is up we just call `process_pending_htlc_forwards` to make sure we'd always process eventually.

Yea, that's great in prod, but the extra time required to go around the loop in order to unblock an HTLC would presumably still be something we consider a bug.

> Should I still add an `if` checking on `needs_pending_htlc_processing` here, as it might be able to detect such bugs, even if we don't use it exactly like that in prod?

Yea, I think so, though thinking about it again I think it may need to be a `while`? The fuzzing loop doesn't get woken by the `ChannelManager` having more work to do, and it's possible that we do a processing step and then end up with more processing to do as a result, which in the BP would make the loop go around again fast, but here would not.
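A rough sketch of the `while`-based variant suggested above; the fuzz node type is abstracted behind a placeholder trait, and only the two methods named in the thread are assumed:

```rust
// Placeholder trait standing in for the fuzz harness' node handle; upstream
// these are methods on `ChannelManager`.
trait HtlcProcessing {
	fn needs_pending_htlc_processing(&self) -> bool;
	fn process_pending_htlc_forwards(&self);
}

// Keep processing while the manager reports outstanding HTLC work: one round
// can itself queue more work, and unlike the background processor the fuzz
// loop isn't woken up to come around again quickly.
fn drain_pending_forwards<N: HtlcProcessing>(node: &N) {
	while node.needs_pending_htlc_processing() {
		node.process_pending_htlc_forwards();
	}
}
```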
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations. Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyways, as subsequent `PendingHTLCsForwardable` events could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.
Now that we have `BackgroundProcessor` drive the batch forwarding of HTLCs, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms.
.. instead we just move 50 ms up to first position
.. as `forward_htlcs` now does the same thing
.. as `fail_htlcs_backwards_internal` now does the same thing
We move the code into the `optionally_notify` closure, but maintain the behavior for now. In the next step, we'll use this to make sure we only repersist when necessary.
We skip repersisting `ChannelManager` when nothing is actually processed.
We add a reentrancy guard to disallow entering `process_pending_htlc_forwards` multiple times. This makes sure that we'd skip any additional processing calls if a prior round/batch of processing is still underway.
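A minimal sketch of such a reentrancy guard, assuming an `AtomicBool` flag; the struct and field names are illustrative, not the actual `ChannelManager` layout:

```rust
use core::sync::atomic::{AtomicBool, Ordering};

struct Forwarder {
	// Set while a batch of pending HTLC forwards is being processed.
	processing_htlc_forwards: AtomicBool,
}

impl Forwarder {
	fn process_pending_htlc_forwards(&self) {
		// If another call is already mid-batch, skip this round entirely rather
		// than blocking on the internal locks.
		if self
			.processing_htlc_forwards
			.compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
			.is_err()
		{
			return;
		}
		// ... actual batch processing would happen here ...
		self.processing_htlc_forwards.store(false, Ordering::Release);
	}
}
```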
Squashed fixups, and included one more that reverts the fallback logic requested elsewhere.
7ef6769 to bc80d0a Compare
Closes #3768.
Closes #1101.
Previously, we'd require the user to manually call `process_pending_htlc_forwards` as part of `PendingHTLCsForwardable` event handling. Here, we rather move this responsibility to `BackgroundProcessor`, which simplifies the flow and allows us to implement reasonable forwarding delays on our side rather than delegating to users' implementations.

Note this also introduces batching rounds rather than calling `process_pending_htlc_forwards` individually for each `PendingHTLCsForwardable` event, which had been unintuitive anyways, as subsequent `PendingHTLCsForwardable` could lead to overlapping batch intervals, resulting in the shortest timespan 'winning' every time, as `process_pending_htlc_forwards` would of course handle all pending HTLCs at once.

To this end, we implement random sampling of batch delays from a log-normal distribution with a mean of 50ms and drop the `PendingHTLCsForwardable` event.

Draft for now as I'm still cleaning up the code base as part of the final commit dropping `PendingHTLCsForwardable`.