
Do not panic in switch synchronization task #8714


Open · wants to merge 1 commit into main

Conversation

internet-diglett (Contributor)
There were some lingering `expect()` and `unwrap()` statements in the sync switch configuration background task. Some of these were in places we believed to be impossible to hit, but a recent core dump shows that we have somehow triggered one of them, so we're now logging an error and continuing in each case.
@jgallagher (Contributor) left a comment

I'm not sure I'm the best person to review this; I don't really have any context on the various networking machinery involved. That said, the changes make me a little uneasy for reasons noted below; I'm worried the panic we saw is indicative that we have some structural / type-level issues that we're catching at runtime, and these changes kinda paper over those issues (and in some cases make it more likely we won't notice them until later).

Comment on lines +895 to +904:

```rust
// There should always be a set of prefixes for a given announce set id.
// If this is `None` we need to audit the history of the specified bgp config
// and announce set to see what went wrong.
error!(
    log,
    "bgp config references an announce set that does not exist";
    "bgp_config_id" => ?config.id(),
    "announce_set_id" => ?config.bgp_announce_set_id,
);
vec![]
```

Getting rid of potential panics is good, but this change makes me nervous:

  1. We don't have any way of escalating error logs, and AFAIK we don't regularly check for them even on systems where we could (e.g., dogfood). We noticed #8579 (Nexus panic inside sync_switch_configuration background task) specifically because of the panic; if this had been an error log all along, would we have just logged an "ought to be impossible" condition and not even realized it had happened?
  2. The comment says if this happens we should audit the history of the BGP config; have we done that on dogfood to understand how it happened? (I'm worried that if we haven't done that, there might be ways we could get here that we don't understand, possibly because of other bugs?)

Comment on lines +911 to +918:

```rust
let net = match Ipv4Net::new(prefix.value, prefix.length) {
    Ok(v) => v,
    Err(e) => {
        error!(
            log,
            "failed to create Ipv4Net from Prefix4";
            "prefix4" => ?prefix,
            "error" => %DisplayErrorChain::new(&e),
```

This conversion seems like it should be infallible, I think? Could Prefix4 (a) require at runtime that its length be valid (i.e., <= 32) and (b) have an infallible method that gives back an Ipv4Net?

If a Prefix4 with a length greater than 32 is a sensible thing, I'd retract my questions. But I think if we have such a Prefix4 floating around that's already indicative of some missed input validation or something, right? We shouldn't be catching that this late.
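The reviewer's suggestion could look roughly like the following std-only sketch. `Prefix4` mirrors the type named in the diff, but its fields are assumptions; the real `Ipv4Net` lives in another crate, so a plain `(addr, len)` pair stands in for it here:

```rust
use std::net::Ipv4Addr;

/// Hypothetical sketch: validate the prefix length once, at construction,
/// so every later conversion is infallible by invariant.
#[derive(Debug, Clone, Copy)]
pub struct Prefix4 {
    value: Ipv4Addr,
    length: u8, // invariant: length <= 32, enforced in `new`
}

impl Prefix4 {
    /// Fallible construction: the only place a bad length can be rejected.
    pub fn new(value: Ipv4Addr, length: u8) -> Result<Self, String> {
        if length > 32 {
            return Err(format!("prefix length {length} exceeds 32"));
        }
        Ok(Self { value, length })
    }

    /// Infallible accessor: the invariant above makes this conversion total,
    /// so no caller needs an `unwrap()` or an error log at the point of use.
    pub fn net(&self) -> (Ipv4Addr, u8) {
        (self.value, self.length)
    }
}
```

With this shape, the only code that can observe an out-of-range length is whatever feeds `new`, which is exactly where the missed input validation would live.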

Comment on lines +946 to +947:

```rust
// Theoretically this should never be possible. If we have a PortSettingsChange::Apply, that
// means we have an active port_settings record and the port_settings_id should be `Some(Uuid)`.
```

Is this actually impossible, as in it's guaranteed that the settings we read from the DB always have an ID? If so, is there a way to structure the types such that the ID is not optional?
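One way to make that invariant unrepresentable, sketched with invented stand-ins (`PortSettingsId`, `PortSettings`, and this `PortSettingsChange` are simplified placeholders, not the real Nexus definitions):

```rust
/// Stand-in for a Uuid newtype.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PortSettingsId(u128);

#[derive(Debug)]
struct PortSettings {
    id: PortSettingsId, // not Option: a loaded record always carries its id
}

#[derive(Debug)]
enum PortSettingsChange {
    // `Apply` carries the settings (and thus the id) by construction, so
    // "Apply without an id" cannot even be written down, let alone hit at runtime.
    Apply(PortSettings),
    Clear,
}

fn settings_id(change: &PortSettingsChange) -> Option<PortSettingsId> {
    match change {
        PortSettingsChange::Apply(s) => Some(s.id),
        PortSettingsChange::Clear => None,
    }
}
```

The `Option` then only appears where it genuinely reflects the domain (`Clear` has no settings), not as a runtime assertion inside `Apply`.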

```rust
"config" => ?desired_config,
"error" => %e,
);
// if we cannot successfully serialize the config, we cannot store the config
```

I think this one should go back to being an `.expect("BootstoreConfig should be serializable as JSON")`:

  1. A serialization failure means something has changed with the definition of the BootstoreConfig type such that it's no longer possible to represent it as JSON. That would be extremely bad.
  2. As noted above, replacing this panic with an error log means we're likely to miss it if this happens. Since in this case the only way for it to start panicking is a code change (and an unlikely one at that), I think we should panic so that we catch such a change as early as possible.
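The pattern being argued for, in a self-contained sketch (`to_json` is a std-only stand-in for `serde_json::to_string`, and this `BootstoreConfig` is an invented placeholder, not the real type):

```rust
#[derive(Debug)]
struct BootstoreConfig {
    generation: u64,
}

// Stand-in serializer so the sketch stays std-only; the real code would
// call serde_json::to_string and get a Result back.
fn to_json(cfg: &BootstoreConfig) -> Result<String, String> {
    Ok(format!("{{\"generation\":{}}}", cfg.generation))
}

fn store_config(cfg: &BootstoreConfig) -> String {
    // A failure here can only come from a change to the type definition that
    // makes it unserializable, i.e. a code bug, so panic as early as possible
    // rather than logging an error that nobody is watching for.
    to_json(cfg).expect("BootstoreConfig should be serializable as JSON")
}
```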

@jgallagher (Contributor)

> I'm worried the panic we saw is indicative that we have some structural / type-level issues that we're catching at runtime, and these changes kinda paper over those issues (and in some cases make it more likely we won't notice them until later).

To expand on this a bit: there are a bunch of ways to handle errors. I think I'd start describing those by splitting them into two categories: an error that can only happen if we have a mistake in code, and errors that must be handled at runtime. I'm sure there are good sources that go into the difference in different languages, but I think in Rust this is boiling down to "if something is wrong here, should I assert! / panic! / unwrap() / expect(), or should I do something to try to handle this error at runtime?".

There are a bunch of cases where I'd claim it's correct to use assert/panic/etc. even in production code: any time we're ensuring some internal precondition or consistency that we have full control over and that should never happen unless we have a bug or have made an error in reasoning about things. If we're in this state, the program can't know how to proceed, because something is fundamentally wrong, and all we can really do is panic. I'll claim that the inability to serialize BootstoreConfig to JSON in this PR is one such case: the only way this can fail at runtime is if the type has changed to something that can't be serialized as JSON at all (i.e., we've introduced a bug: we need to serialize a thing as JSON that can't be serialized as JSON).

There are way more kinds of errors that shouldn't result in panics, of course. Bad input, I/O errors, DNS lookup failures, ..., this list is basically unending.
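The two categories can be contrasted in a few lines of std-only Rust (both functions are invented examples for illustration, not code from this PR):

```rust
fn parse_port(input: &str) -> Option<u16> {
    match input.parse::<u16>() {
        Ok(p) => Some(p),
        Err(e) => {
            // Runtime error: bad input is expected in production, so we
            // log (or otherwise report) and keep going in a degraded state.
            eprintln!("failed to parse port from {input:?}: {e}");
            None
        }
    }
}

fn default_port(ports: &[u16]) -> u16 {
    // Internal invariant: callers construct this list themselves, so an
    // empty list can only mean a code bug. Panicking surfaces the bug
    // immediately instead of letting it hide in a log nobody reads.
    *ports.first().expect("port list is constructed non-empty by the caller")
}
```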

My concern about the changes in this PR is that, of the unwraps we had before, I don't know which ones actually should still be panics because they were really asserting some condition about the system as a whole that should always be true, and which ones were incorrectly panicking in some case we ought to handle somehow. (We don't have a lot of great ways to handle this other than logging at the moment, but that's a different problem.) It sounds from your comment on the issue (#8579 (comment)) like maybe all of these are really in the "this should never happen and if it does we have a bug somewhere else (maybe in database validation?)" case? If that's right, I think these changes aren't the way we should go, and instead we need to find and fix the root cause(s).

This is certainly more work than just converting the panics to error logs. In terms of triaging for R16, I think we're kinda in the same boat in terms of some of that work though; I think some reasonable questions are:

  • Do we understand what happened on dogfood?
  • Could it happen on a customer system?
  • What's the effect if it does happen? (dogfood apparently came back from this fine, but if we don't know what happened we also don't know whether that would be true if it happened somewhere else.)

@internet-diglett (Contributor, Author) commented Jul 29, 2025

@jgallagher Your points make sense to me. I was primarily trying to add information that would give us more context about the time periods and sequences of events surrounding these issues; I failed to consider that we didn't have any other canaries to tell us when we actually hit this problem and prompt us to go look at the logs.

It's probably a good idea to log and panic here for the future because some of these tables (like the dependent tables in switch port settings) don't keep historical information (we hard delete those records, we don't soft delete them).

I'm currently looking into the issue on Dogfood.

@internet-diglett (Contributor, Author)

Update on issue: #8579 (comment)

Successfully merging this pull request may close these issues:

Nexus panic inside sync_switch_configuration background task