Batch MPP claims into single ChannelMonitorUpdate#4552
Conversation
|
👋 Thanks for assigning @TheBlueMatt as a reviewer! |
|
I've completed a thorough re-review of the entire PR diff. All significant issues from my prior review remain outstanding. I found no new critical issues beyond what was previously flagged. Re-review of PR #4552 — Batch MPP claims into single ChannelMonitorUpdateAll previously flagged issues remain relevant and unaddressed. No new critical bugs found in this pass. Prior issues still outstanding (not re-posted inline):
Minor new observations:
|
395b30d to
88d62c4
Compare
88d62c4 to
5880633
Compare
|
🔔 1st Reminder Hey @jkczyz! This PR has been waiting for your review. |
5880633 to
d801ec9
Compare
It's intentionally left unused right now, pending any design decision that would be made for the msat values that field holds, in the case of a single claim, there's a
This is why that field exist, the absense of the direction it should take is the reason it is intentionally currently left unused. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4552 +/- ##
==========================================
+ Coverage 86.12% 86.36% +0.23%
==========================================
Files 157 158 +1
Lines 108922 109493 +571
Branches 108922 109493 +571
==========================================
+ Hits 93812 94561 +749
+ Misses 12495 12381 -114
+ Partials 2615 2551 -64
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
|
🔔 2nd Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
|
🔔 3rd Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
TheBlueMatt
left a comment
There was a problem hiding this comment.
Hmm, would this not be simpler by using the holding cell instead? When we're failing, we call queue_fail_htlc which just shoves the failure into the holding cell then call check_free_peer_holding_cells to release them all at once. Rather than rebuilding that logic for claims, we should be able to do something similar. We'd have to handle the monitor updates with the preimages a bit differently, but not drastically so.
d801ec9 to
5992824
Compare
5992824 to
f466457
Compare
f466457 to
dddd924
Compare
086a669 to
2a28f1d
Compare
2a28f1d to
fc83c75
Compare
|
🔔 1st Reminder Hey @TheBlueMatt! This PR has been waiting for your review. |
| // When the caller asked us to push into the holding cell for batching | ||
| // (`force_holding_cell`) we skip producing a `ChannelMonitorUpdate` here; the caller is | ||
| // responsible for either flushing the holding cell (producing one combined update with | ||
| // every queued preimage and the commitment) or building a preimage-only update for |
There was a problem hiding this comment.
This is a massive layer violation - the ChannelManager shouldn't be making decisions around how to build ChannelMonitorUpdates, we shouldn't need a manual build_preimage_only_monitor_update, etc. Instead, we should be able to queue all the updates, then use the existing check_free_holding_cells logic (or check_free_peer_holding_cells to filter by the channels we actually updated) and have the channel automatically include the preimage updatesteps in the ChannelMonitorUpdate it builds.
fc83c75 to
4e596b0
Compare
| let mon_update_id = self.context.latest_monitor_update_id; | ||
|
|
||
| self.queue_fulfill_htlc(htlc_id_arg, payment_preimage_arg, None, None, false, logger); | ||
| self.context.latest_monitor_update_id = mon_update_id; |
There was a problem hiding this comment.
Bug: Spurious MONITOR_UPDATE_IN_PROGRESS and corrupted blocked update IDs
queue_fulfill_htlc now has side effects that the old get_update_fulfill_htlc did not:
- Since
can_generate_new_commitment()is false (asserted above),will_flush_immediately = falseinsidequeue_fulfill_htlc. - The
!will_flush_immediatelybranch callsmonitor_updating_paused(...)at line 7690, which setsMONITOR_UPDATE_IN_PROGRESSon the channel state. - If
blocked_monitor_updatesis non-empty, lines 7678-7684 shift all blocked update IDs up by 1.
The caller resets latest_monitor_update_id at line 7624, but does NOT:
- Clear
MONITOR_UPDATE_IN_PROGRESS(no monitor update was actually submitted, so nothing will callmonitor_updating_restoredto clear it) - Un-shift the
blocked_monitor_updatesIDs
The old code called get_update_fulfill_htlc directly, which did NOT call monitor_updating_paused and did NOT shift blocked update IDs when taking the holding-cell path. This is a regression.
Consequences:
MONITOR_UPDATE_IN_PROGRESSis set with no corresponding in-flight monitor update. After peer reconnection,can_generate_new_commitment()returns false (due toMONITOR_UPDATE_IN_PROGRESS), potentially preventing the channel from making progress.- If
blocked_monitor_updatesexisted, their IDs are now off-by-one from what the ChannelMonitor expects, which could cause update ID mismatches.
In practice, this may be partially mitigated if a MonitorUpdatesComplete background event fires for this channel during startup (clearing the flag via channel_monitor_updated → monitor_updating_restored), but that's fragile and shouldn't be relied upon.
Fix: Avoid queue_fulfill_htlc here. Either call get_update_fulfill_htlc directly (as the old code did), or add explicit cleanup:
self.queue_fulfill_htlc(htlc_id_arg, payment_preimage_arg, None, None, false, logger);
self.context.latest_monitor_update_id = mon_update_id;
// queue_fulfill_htlc may have set MONITOR_UPDATE_IN_PROGRESS and shifted
// blocked_monitor_updates — undo both since we're discarding the update.
self.context.channel_state.clear_monitor_update_in_progress();(Plus un-shifting blocked updates if applicable, which is more complex — calling get_update_fulfill_htlc directly is simpler.)
4e596b0 to
6abcee3
Compare
| None | ||
| } | ||
| }) | ||
| .collect(); |
There was a problem hiding this comment.
Bug (minor): check_free_peer_holding_cells does not check pending_events for ReleaseRAAChannelMonitorUpdate events when building the set of RAA-blocked channels. Compare with the raa_blocked computation in claim_mpp_part (line 9903-9920) and raa_monitor_updates_held (line 13068-13085), which both check pending_events in addition to actions_blocking_raa_monitor_updates.
If a ReleaseRAAChannelMonitorUpdate event is pending but no actions_blocking_raa_monitor_updates entry exists for the channel, check_free_peer_holding_cells will attempt to flush the holding cell. The flush's combined ChannelMonitorUpdate will then be blocked by raa_monitor_updates_held inside handle_new_monitor_update (because it DOES check pending_events). The end result is correct (preimage is in the blocked update), but the discrepancy is confusing and could mask issues.
More importantly, the comment on line 14012-14015 says "The preimage half of any queued claim has already been persisted as a standalone monitor update by queue_fulfill_htlc", but this is only true when raa_blocked was true in queue_fulfill_htlc. If queue_fulfill_htlc's raa_blocked check at line 9892-9920 saw no blocker (perhaps the ReleaseRAAChannelMonitorUpdate event was added between the queue and flush), the preimage was deferred to the flush. Then the flush produces a combined update that gets blocked by raa_monitor_updates_held — meaning the preimage sits in a blocked_monitor_updates queue in memory, not yet durably persisted to the ChannelMonitor. A crash before the blocker clears would lose the preimage.
Consider also checking pending_events for ReleaseRAAChannelMonitorUpdate in this function, consistent with the other two call sites.
There was a problem hiding this comment.
The discrepancy is intentional rather than buggy, this call site's purpose is just to avoid wasting work generating a commitment_signed that the peer will reject (or that we can't usefully proceed with). It's not part of the preimage-safety invariant.
Adding the pending_events check here would incorrectly supress legitimate flushes, since the presence of a ReleaseRAAChannelMonitorUpdate event only blocks future inbound-RAA-derived monitor updates, not outbound commitment generation.
6abcee3 to
92fd2d9
Compare
| htlc_value_msat, | ||
| update_blocked: true, | ||
| }; | ||
| return UpdateFulfillFetch::DuplicateClaim {}; |
There was a problem hiding this comment.
Bug (low severity): Behavioral change weakens preimage-safety net in fulfill/fail race
The old code for this branch returned UpdateFulfillFetch::NewClaim { monitor_update, htlc_value_msat, update_blocked: true } — the preimage-containing ChannelMonitorUpdate was built before the holding-cell check and persisted to the monitor even though the HTLC was also being failed. The new code returns DuplicateClaim, so the preimage is never persisted to the ChannelMonitor.
If in release mode (where the debug_assert doesn't fire) the channel is force-closed and the counterparty broadcasts a commitment that includes this HTLC, the monitor won't have the preimage to claim on-chain.
In practice this should be unreachable because claim_funds and the fail path don't race (both are gated by claimable_payments), so the debug_assert catches it in testing. But the safety net is now weaker.
Consider persisting the preimage even in this edge case, or adding a comment explicitly documenting that this is safe because the race is structurally prevented.
When an MMP payment has multiple parts that arrive over the same final-hop channel, we previously processed each part sequentially through the full commitment dance: claim one part, build a ChannelMonitorUpdate, ship a commitment_signed, wait for the peer's RAA + next commitment_signed, then move to the next part. This added significant claim latency and wasted round trips on unnecessary monitor updates. Instead, queue all MPP part claims into the channel's holding cell first, then flush them together in a single pass. The holding-cell flush produces one combined ChannelMonitorUpdate (preimage + commitment steps) and one commitment_signed carrying all update_fulfill_htlc messages at once, eliminating the intermediate round trips. To enable this, the old get_update_fulfill_htlc_and_commit is split into queue_fulfill_htlc (pushes into the holding cell) and the existing holding- cell flush path. ClaimedMPPPayment RAA blockers are deliberately ignored during both the queue and flush phases so the first MPP part on a channel does not force subsequent parts onto the standalone preimage path. When the channel cannot immediately flush its holding cell (e.g. peer is disconnected or another monitor update is in flight), a standalone preimage- only ChannelMonitorUpdate is persisted for HTLC-preimage safety, matching the prior behavior.
92fd2d9 to
6f6a028
Compare
Tests has been updated to reflect this new batching of MPP claims
test_single_channel_multiple_mppauto_retry_partial_failuretest_keysend_dup_hash_partial_mppcloses #3986