simln-lib: add htlc interceptor for simulated nodes #261

Open · wants to merge 6 commits into main

Conversation

Collaborator

@elnosh elnosh commented May 6, 2025

for #255

Mostly took changes from these two commits to add an Interceptor trait.

Some of the things I changed from those commits:

  • intercept_htlc now returns a Result to communicate the outcome of the intercepted htlc instead of sending it through a channel.
  • Spawns a task in a JoinSet for each interceptor's intercept_htlc so that an interceptor holding the htlc for a long time does not block the others. It then waits for the tasks in the JoinSet to complete. If any of them returns a result to fail the htlc, it drops the JoinSet, since there is no need to wait for the remaining tasks when the htlc will fail anyway. (See the sketch below.)
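A minimal, self-contained sketch of that JoinSet pattern, with stand-in types where the real simln-lib definitions would be (InterceptRequest, ForwardingError, and the trait shape here are simplified assumptions; async_trait is used so the trait is object-safe):

```rust
use std::collections::HashMap;
use std::sync::Arc;

use async_trait::async_trait;
use tokio::task::JoinSet;

type CustomRecords = HashMap<u64, Vec<u8>>;

#[derive(Debug)]
struct ForwardingError(String);

#[derive(Clone)]
struct InterceptRequest; // stand-in for the real request type

#[async_trait]
trait Interceptor: Send + Sync {
    async fn intercept_htlc(
        &self,
        req: InterceptRequest,
    ) -> Result<CustomRecords, ForwardingError>;
}

/// Run every interceptor concurrently and wait for all of them, but stop
/// early on the first failure: dropping the JoinSet aborts the tasks that
/// are still running, since the htlc will fail anyway.
async fn run_interceptors(
    interceptors: Vec<Arc<dyn Interceptor>>,
    req: InterceptRequest,
) -> Result<CustomRecords, ForwardingError> {
    let mut intercepts = JoinSet::new();
    for interceptor in interceptors {
        let req = req.clone();
        intercepts.spawn(async move { interceptor.intercept_htlc(req).await });
    }

    let mut records = CustomRecords::new();
    while let Some(res) = intercepts.join_next().await {
        match res.expect("interceptor task panicked") {
            // The real code also has to handle duplicate record keys here.
            Ok(custom) => records.extend(custom),
            Err(e) => {
                drop(intercepts);
                return Err(e);
            }
        }
    }
    Ok(records)
}
```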

Contributor

carlaKC commented May 6, 2025

Drive-by comment: could we split the logic adding a unique index and the interceptor into separate commits, for the sake of smaller, more reviewable logical chunks?

Contributor

@carlaKC carlaKC left a comment

Main comments are around how we handle shutdown in a reasonable way. Specifically:

  • An interceptor has had a critical failure: how does it tell the simulator to shut down?
  • The simulator needs to terminate the interceptors: how can we cleanly do this so that interception code isn't left in a bad state?

#[error("DuplicateCustomRecord: key {0}")]
DuplicateCustomRecord(u64),
#[error("InterceptorError: {0}")]
InterceptorError(Box<dyn Error + Send + Sync + 'static>),
Contributor

Let's go ahead and define this as an error type in the top-level library?

I think that simln could use an overhaul in the way we do errors (I did not understand rust errors when we started this project 🙈 ), and defining a reasonable general error type seems like a good start.
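As a rough illustration of the kind of general error type this could start from (names and variants here are hypothetical, not the actual simln-lib definitions), using thiserror:

```rust
use thiserror::Error;

/// Hypothetical top-level library error; variants shown are illustrative.
#[derive(Debug, Error)]
pub enum LibraryError {
    #[error("duplicate custom record: key {0}")]
    DuplicateCustomRecord(u64),

    #[error("interceptor error: {0}")]
    InterceptorError(Box<dyn std::error::Error + Send + Sync + 'static>),

    #[error("critical error, shutting down: {0}")]
    Critical(String),
}
```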

Comment on lines 699 to 736
```rust
/// The short channel id for the incoming channel that this htlc was delivered on.
pub incoming_htlc: HtlcRef,
```
Contributor

nit: note that this is a unique identifier for the htlc

```rust
/// Custom records provided by the incoming htlc.
pub incoming_custom_records: CustomRecords,

/// The short channel id for the outgoing channel that this htlc should be forwarded over.
```
Contributor

nit: expand doc to note that None indicates that the intercepting node is the receiver.

```rust
            }
        },
        Ok(Err(e)) => {
            drop(intercepts);
```
Contributor

I think that we may need a more gentle shutdown than dropping the task (which afaik will force-abort the intercept_htlc call).

I can see it being difficult to write interceptor code that works with another system when your code may be run halfway and then killed - you could end up with funny states. My instinct is to provide a triggered pair and instruct interceptors to listen on it for shutdown signals? Open to other approaches as well.

I do think that the ability to shut down all the other interceptors once the first error is reached is a really, really nice feature.
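For context, a rough sketch of the triggered-pair pattern being suggested (interceptor and simulator side; the work being raced here is a stand-in sleep, and all names are illustrative):

```rust
use std::time::Duration;

use triggered::{trigger, Listener};

/// Interceptor-side: race the actual interception work against the shutdown
/// signal so a long hold can be wound down cleanly instead of force-aborted.
async fn intercept_with_shutdown(shutdown_listener: Listener) -> Result<(), String> {
    tokio::select! {
        // The real interception work (stand-in sleep here).
        _ = tokio::time::sleep(Duration::from_secs(60)) => Ok(()),
        // Shutdown signalled: get state in order and exit promptly.
        _ = shutdown_listener => Err("shutdown signal received".to_string()),
    }
}

#[tokio::main]
async fn main() {
    let (shutdown_trigger, shutdown_listener) = trigger();
    let handle = tokio::spawn(intercept_with_shutdown(shutdown_listener));

    // Simulator-side: e.g. on the first interceptor error, signal the rest.
    shutdown_trigger.trigger();
    let _ = handle.await;
}
```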

Collaborator Author

sounds good 👍 provide that trigger in the InterceptRequest and send on it if we need to fail other interceptors?
Just to note: although intercept_htlc may be force-aborted, the interceptor should still get notified about htlc resolution through notify_resolution, but I think we'd still like to shut down gently through intercept_htlc?

Contributor

@carlaKC carlaKC May 9, 2025

provide that trigger in the InterceptRequest and send on it if we need to fail other interceptors?

I think that we should handle the triggering?

  • Give each interceptor a listener
  • On first interceptor error, trigger shutdown

Then interceptors are responsible for listening on that shutdown listener and getting their state in order before they exit. We still wait for each to finish, but that should be relatively quick because we've signaled that it's shutdown time.

Interceptor should still get notified about htlc resolution through notify_resolution

As-is I don't think we'd notify if the HTLC is failed by the interceptor? It'll never be "fully" forwarded by the node, so we don't notify it being resolved.

Edit: although come to think of it, we probably do want to notify the failure even if one of the interceptors has returned a fail outcome - the others may have returned success and aren't aware that it never actually ended up going through.

Collaborator Author

  • Give each interceptor a listener
  • On first interceptor error, trigger shutdown

Then interceptors are responsible for listening on that shutdown listener and getting their state in order before they exit. We still wait for each to finish, but that should be relatively quick because we've signaled that it's shutdown time.

got it, will do.

As-is I don't think we'd notify if the HTLC is failed by the interceptor? It'll never be "fully" forwarded by the node, so we don't notify it being resolved.

I thought we would. Because if an interceptor returns an error, we then return it in add_htlcs and then call remove_htlcs here: https://github.com/elnosh/sim-ln/blob/2de405fdcb37c8ffe4bd8d2cc0077ef7099cc3ed/simln-lib/src/sim_node.rs#L1274, which internally calls notify_resolution for each interceptor.

Contributor

I thought we would.

Ok cool, forgot when that is/isn't called but sgtm!

```rust
pub trait Interceptor: Send + Sync {
    /// Implemented by HTLC interceptors that provide input on the resolution of HTLCs forwarded in the simulation.
    async fn intercept_htlc(&self, req: InterceptRequest)
        -> Result<CustomRecords, ForwardingError>;
```
Contributor

If we only have one layer of Result here, how does the intercept_htlc call tell the simulator that it's hit a critical error and wants to shut down (rather than a forwarding error for this payment specifically)?

Collaborator Author

I wasn't sure if that's something we'd want. I didn't like the idea of nested Results. It could instead be communicated through specific variants in ForwardingError? The simulation could check with is_critical whether the ForwardingError returned warrants a shutdown.

Contributor

It could instead be communicated through specific variants in ForwardingError

That seems like a bit of a layering violation to me. An InterceptorError for interceptors that might want to fail forwards with their own ForwardingFailure reason that isn't defined in simln (eg, you don't have enough reputation) makes sense to me, but we're stretching its definition a bit to cover things like unexpected internal state errors.

Nested results are definitely ugly, but I think that it more clearly represents the two types of failures that we could have here? One for "fail this HTLC", one for "something has gone wrong with my interception". If we define a type for the inner result the function signatures won't be too ugly at least.
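For example, a named alias along these lines (stand-in types; illustrative, not the actual simln-lib definitions) keeps the signature readable while separating the two failure modes:

```rust
use std::collections::HashMap;

// Stand-ins for the real simln-lib types.
type CustomRecords = HashMap<u64, Vec<u8>>;
struct ForwardingError;
struct CriticalError;

/// Inner Result: "fail this HTLC". Outer Result: "something has gone wrong
/// with my interception" and the simulator should shut down.
type InterceptorResult = Result<Result<CustomRecords, ForwardingError>, CriticalError>;

fn handle(res: InterceptorResult) {
    match res {
        Ok(Ok(records)) => { let _ = records; /* forward, merging custom records */ }
        Ok(Err(_forwarding_err)) => { /* fail this HTLC; simulation continues */ }
        Err(_critical_err) => { /* interceptor is broken; trigger shutdown */ }
    }
}
```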

Collaborator Author

fair! will change 👍

Collaborator Author

elnosh commented May 9, 2025

  • An interceptor has had a critical failure: how does it tell the simulator to shut down?

I wasn't sure if letting the interceptor shut down the simulation was something we'd want, but it could be done through specific variants in the ForwardingError returned? The simulation could check with is_critical whether the ForwardingError returned warrants a shutdown, although interceptors would need to be aware of this and trigger a shutdown by returning those specific variants.

  • The simulator needs to terminate the interceptors: how can we cleanly do this so that interception code isn't left in a bad state?

should we include something like a channel or shutdown signal in InterceptRequest?

Contributor

carlaKC commented May 12, 2025

WDYT about adapting this latency interceptor and surfacing it on the CI?

Nice to have a user of the interceptor that we can run + test in master rather than merging in some unused code and having to test it externally.

Collaborator Author

elnosh commented May 12, 2025

sgtm, as I was doing exactly that and testing externally heh.

Comment on lines 679 to 720
```rust
async fn notify_resolution(
    &self,
    _res: InterceptResolution,
) -> Result<(), Box<dyn Error + Send + Sync + 'static>> {
    Ok(())
}
```
Collaborator Author

Rethinking this, I'm wondering what type of error this should be, or whether this method should return a Result at all? Leaning towards not returning a Result, since the simulation is just notifying the interceptor of a resolution.

Contributor

Perhaps the interceptor wants to signal shutdown on the notify_resolution because it got information it wasn't expecting? Or a database became unavailable, etc.

Collaborator Author

elnosh commented May 14, 2025

Added changes addressing the comments. Using nested Results now to differentiate between normal and critical errors that could be returned. If a CriticalError is received, it returns and triggers a shutdown. If a ForwardingError happens during interception, we send a signal to the other interceptors to let them know they should shut down.

Also took your commit carlaKC@2a35bd5 as suggested to add the latency interceptor.

If the simulation runs with the LatencyInterceptor, it can log some error messages during shutdown:

2025-05-14T15:59:40.049Z ERROR [simln_lib] Track payment failed for 7a56bab4c9b27d3cc09438dd6c32255eb4ecb4559337e9125c8920f8b071ab61: Track payment error: shutdown during payment tracking.

because we triggered a shutdown while an intercepted payment (with the added latency) hasn't resolved.

@elnosh elnosh requested a review from carlaKC May 14, 2025 16:11
Contributor

@carlaKC carlaKC left a comment

This is looking really good! Only two major comments on shutdown + cli option for latency.

Been a bit nitty about docs because this feature really will only be used when people use simln as a library, so it's important to get the docs readable.

```rust
#[error("DuplicateCustomRecord: key {0}")]
DuplicateCustomRecord(u64),
#[error("InterceptorError: {0}")]
InterceptorError(String),
```
Contributor

nit: docs on these new variants, here + above

Comment on lines 752 to 753
```rust
// The listener on which the interceptor will receive shutdown signals.
pub shutdown_listener: Listener,
```
Contributor

Let's add a little more instruction here for end users? Just explaining that implementations must listen on this channel, because if they don't they're at risk of blocking quick resolution of HTLCs when other interceptors quickly return a failure outcome.

```rust
    pub success: bool,
}

pub type CustomRecords = HashMap<u64, Vec<u8>>;
```
Contributor

We're aiming to enforce docs on public types in future so let's add docs here and below?

Comment on lines 815 to 816
```rust
/// Optional set of interceptors that will be called every time a HTLC is added to a simulated channel.
interceptors: Vec<Arc<dyn Interceptor>>,
```
Contributor

I think that it would be useful to add some docs explaining how multiple interceptors interact with each other, specifically:

  • That custom records will be merged, but conflicts will fail the HTLC (sketched below)
  • That any single interceptor returning a forwarding failure will result in the HTLC being failed, and the others will receive a shutdown signal when this happens

I think that this is probably the most natural place for this to live? Doesn't really fit on the trait because this is one specific way of using it.
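A sketch of the merge-with-conflict behaviour from the first bullet (hypothetical helper; the real code surfaces this as the DuplicateCustomRecord error):

```rust
use std::collections::HashMap;

type CustomRecords = HashMap<u64, Vec<u8>>;

#[derive(Debug)]
enum MergeError {
    DuplicateCustomRecord(u64),
}

/// Merge custom records returned by multiple interceptors. Two interceptors
/// writing different values under the same key is a conflict, which fails
/// the merge (and with it, the HTLC).
fn merge_custom_records(all: Vec<CustomRecords>) -> Result<CustomRecords, MergeError> {
    let mut merged = CustomRecords::new();
    for records in all {
        for (key, value) in records {
            match merged.get(&key) {
                Some(existing) if *existing != value => {
                    return Err(MergeError::DuplicateCustomRecord(key));
                }
                _ => {
                    merged.insert(key, value);
                }
            }
        }
    }
    Ok(merged)
}
```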

Comment on lines +1060 to +1062
```rust
let channel = node_lock
    .get_mut(&scid)
    .ok_or(CriticalError::ChannelNotFound(scid))?;
```
Contributor

nit: pre-existing, but could you update the function docs to note that if we run into a critical error we don't bother to fail back the HTLC, with the expectation that the simulation will shut down shortly.

```rust
// the HTLC. If any of the interceptors did return an error, we send a shutdown signal
// to the other interceptors that may have not returned yet.
let mut interceptor_failure = None;
while let Some(res) = intercepts.join_next().await {
```
Contributor

I think there's one last shutdown case we need to think about here:

  • Waiting on long resolving interceptors
  • Error elsewhere in the simulation (eg, it hits its total runtime and shuts down)

We'll keep waiting here, because we've created our own triggered pair. I think that we can do this by selecting on intercepts.join_next() and pass in SimGraph's shutdown_trigger and shut down the interceptors if we get the high level signal that it's time to shut down.

We're somewhat running into a flaw with triggered here - that we can't create a child trigger that would shut down with the parent, but maybe that's a feature not a bug. Either way, I think it's outside of the scope of this PR to rework all that so I think a select is our best option (even if ugly).

Another reason to pull this out to a function IMO - always good to have some unit tests checking that all this slippery shutdown stuff works as expected! Sadly we might have to make a manual mock, because mockall doesn't play nice with async, but hopefully it's minimal boilerplate.
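A sketch of the select being described (simplified stand-in types; assumes the triggered Listener used elsewhere in simln):

```rust
use tokio::task::JoinSet;
use triggered::Listener;

/// Wait on interceptor tasks, but stop if the simulation-wide shutdown
/// listener fires first, so we don't block on long-resolving interceptors.
async fn wait_for_interceptors(
    mut intercepts: JoinSet<Result<(), String>>,
    mut shutdown_listener: Listener,
) {
    loop {
        tokio::select! {
            res = intercepts.join_next() => match res {
                Some(_outcome) => { /* handle one interceptor's result */ }
                None => break, // all interceptors finished
            },
            _ = &mut shutdown_listener => {
                // High-level shutdown: signal interceptors and stop waiting.
                break;
            }
        }
    }
}
```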

Collaborator Author

I think that we can do this by selecting on intercepts.join_next() and pass in SimGraph's shutdown_trigger and shut down the interceptors if we get the high level signal that it's time to shut down.

I think we'll need to pass in a shutdown listener from the upstream Simulation since currently we don't have a way to listen for a shutdown signal in SimGraph afaik. shutdown_trigger is the one we use to trigger on critical errors.

We're somewhat running into a flaw with triggered here - that we can't create a child trigger that would shut down with the parent, but maybe that's a feature not a bug. Either way, I think it's outside of the scope of this PR to rework all that so I think a select is our best option (even if ugly).

Maybe for another PR, but we could look into CancellationToken - it looks like it lets you create "child tokens" that can get cancelled along with the parent.
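For reference, a minimal sketch of tokio_util's CancellationToken child-token behaviour:

```rust
use tokio_util::sync::CancellationToken;

#[tokio::main]
async fn main() {
    let parent = CancellationToken::new();
    let child = parent.child_token();

    let task = tokio::spawn(async move {
        // Resolves when the child (or any of its ancestors) is cancelled.
        child.cancelled().await;
        println!("child token cancelled");
    });

    // Cancelling the parent cancels all of its child tokens too.
    parent.cancel();
    task.await.unwrap();
}
```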

Contributor

I think we'll need to pass in a shutdown listener from the upstream Simulation since currently we don't have a way to listen for a shutdown signal in SimGraph afaik.

Right, we need shutdown_listener 🤦‍♀️ SGTM, we can add it to SimGraph and then pass it along.

maybe for another PR but could look into CancellationToken looks like it lets you create "child tokens" that can get cancelled along with the parent

Yeah I like the idea of investigating this - took a look when we updated the leaky joinset and liked it.

Collaborator Author

Moved the htlc interception logic to a separate method. I didn't end up needing to use select here, since we are passing the shutdown listener to SimGraph now so it can be passed directly in the InterceptRequest to the interceptors.


```rust
/// Notifies the interceptor that a previously intercepted htlc has been resolved. Default implementation is a no-op
/// for cases where the interceptor only cares about interception, not resolution of htlcs.
async fn notify_resolution(&self, _res: InterceptResolution) -> Result<(), CriticalError> {
```
Contributor

note in docs that this function should not be blocking


```rust
/// Tests intercepted htlc success.
#[tokio::test]
async fn test_intercepted_htlc_success() {
```
Contributor

very nice tests 👌

```rust
let mut mock_interceptor_1 = MockTestInterceptor::new();
mock_interceptor_1
    .expect_intercept_htlc()
    .returning(|_| Ok(Ok(CustomRecords::default())));
```
Contributor

Pity we can't have an async closure here - would be nice to make this select on shutdown + a long sleep so we know we're correctly shutting down a long waiting interceptor when another errors.

```diff
@@ -87,6 +88,9 @@ pub struct Cli {
     /// simulated nodes.
     #[clap(long)]
     pub speedup_clock: Option<u16>,
+    /// Latency to optionally introduce for simulated nodes.
+    #[clap(long)]
+    pub latency_ms: Option<f32>,
```
Contributor

Somebody would have to try really hard to mess this up, but technically this could be 0 or negative which doesn't make sense here.

Eg:

```
sim-cli -l debug -s ln_10_simln.json --latency-ms="-2"
Error: Simulated Network Error: Could not create possion: lambda is not positive in Poisson distribution
```

Also, I don't think we need the granularity of a fraction of a millisecond, so I think it's okay to make this a u32? Then validate:

  • It's not zero (just don't set it in that case)
  • It's not set when we run a simulation with real nodes (like we do for clock speedup)
  • nit: note in doc that this is expressed in milliseconds, since the doc is what sim-cli --help will print.
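A sketch of the first and third points in clap (hypothetical; the field name follows the diff above, and the range check rejects zero at parse time):

```rust
use clap::Parser;

#[derive(Parser)]
pub struct Cli {
    /// Latency in milliseconds to optionally introduce for simulated nodes.
    /// Only valid when running with simulated (not real) nodes.
    #[clap(long, value_parser = clap::value_parser!(u32).range(1..))]
    pub latency_ms: Option<u32>,
}

fn main() {
    let cli = Cli::parse();
    // The "not set with real nodes" check would happen at runtime, like the
    // existing clock speedup validation.
    println!("latency_ms = {:?}", cli.latency_ms);
}
```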

@elnosh elnosh requested a review from carlaKC May 16, 2025 14:10