
Do not panic in switch synchronization task #8714


Open · wants to merge 1 commit into main

Conversation

internet-diglett (Contributor)
There were some lingering `expect()` and `unwrap()` statements in the sync switch configuration background task. Some of these were in places we believed to be impossible to hit, but a recent core dump shows that we have somehow triggered one of them, so we're now logging an error and continuing in each case.
@jgallagher (Contributor) left a comment

I'm not sure I'm the best person to review this; I don't really have any context on the various networking machinery involved. That said, the changes make me a little uneasy for reasons noted below; I'm worried the panic we saw is indicative that we have some structural / type-level issues that we're catching at runtime, and these changes kinda paper over those issues (and in some cases make it more likely we won't notice them until later).

Comment on lines +895 to +904:

```rust
// There should always be a set of prefixes for a given announce set id.
// If this is `None` we need to audit the history of the specified bgp config
// and announce set to see what went wrong.
error!(
    log,
    "bgp config references an announce set that does not exist";
    "bgp_config_id" => ?config.id(),
    "announce_set_id" => ?config.bgp_announce_set_id,
);
vec![]
```

Getting rid of potential panics is good, but this change makes me nervous:

  1. We don't have any way of escalating error logs, and AFAIK we don't regularly check for them even on systems where we could (e.g., dogfood). We noticed #8579 (Nexus panic inside sync_switch_configuration background task) specifically because of the panic; if this had been an error log all along, would we have just logged an "ought to be impossible" condition and not even realized it had happened?
  2. The comment says if this happens we should audit the history of the BGP config; have we done that on dogfood to understand how it happened? (I'm worried that if we haven't done that, there might be ways we could get here that we don't understand, possibly because of other bugs?)

Comment on lines +911 to +918:

```rust
let net = match Ipv4Net::new(prefix.value, prefix.length) {
    Ok(v) => v,
    Err(e) => {
        error!(
            log,
            "failed to create Ipv4Net from Prefix4";
            "prefix4" => ?prefix,
            "error" => %DisplayErrorChain::new(&e),
```

This conversion seems like it should be infallible, I think? Could Prefix4 (a) require at runtime that its length be valid (i.e., <= 32) and (b) have an infallible method that gives back an Ipv4Net?

If a Prefix4 with a length greater than 32 is a sensible thing, I'd retract my questions. But I think if we have such a Prefix4 floating around that's already indicative of some missed input validation or something, right? We shouldn't be catching that this late.
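The reviewer's suggestion could look roughly like the following std-only sketch. `Prefix4` mirrors the type named in the diff, but its fields are assumptions; the real `Ipv4Net` lives in another crate, so a plain `(addr, len)` pair stands in for it here:

```rust
use std::net::Ipv4Addr;

/// Hypothetical sketch: validate the prefix length once, at construction,
/// so every later conversion is infallible by invariant.
#[derive(Debug, Clone, Copy)]
pub struct Prefix4 {
    value: Ipv4Addr,
    length: u8, // invariant: length <= 32, enforced in `new`
}

impl Prefix4 {
    /// Fallible construction: the only place a bad length can be rejected.
    pub fn new(value: Ipv4Addr, length: u8) -> Result<Self, String> {
        if length > 32 {
            return Err(format!("prefix length {length} exceeds 32"));
        }
        Ok(Self { value, length })
    }

    /// Infallible accessor: the invariant above makes this conversion total,
    /// so no caller needs an `unwrap()` or an error log at the point of use.
    pub fn net(&self) -> (Ipv4Addr, u8) {
        (self.value, self.length)
    }
}
```

With this shape, the only code that can observe an out-of-range length is whatever feeds `new`, which is exactly where the missed input validation would live.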

Comment on lines +946 to +947:

```rust
// Theoretically this should never be possible. If we have a PortSettingsChange::Apply, that
// means we have an active port_settings record and the port_settings_id should be `Some(Uuid)`.
```

Is this actually impossible, as in it's guaranteed that the settings we read from the DB always have an ID? If so, is there a way to structure the types such that the ID is not optional?
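One way to make that invariant unrepresentable, sketched with invented stand-ins (`PortSettingsId`, `PortSettings`, and this `PortSettingsChange` are simplified placeholders, not the real Nexus definitions):

```rust
/// Stand-in for a Uuid newtype.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PortSettingsId(u128);

#[derive(Debug)]
struct PortSettings {
    id: PortSettingsId, // not Option: a loaded record always carries its id
}

#[derive(Debug)]
enum PortSettingsChange {
    // `Apply` carries the settings (and thus the id) by construction, so
    // "Apply without an id" cannot even be written down, let alone hit at runtime.
    Apply(PortSettings),
    Clear,
}

fn settings_id(change: &PortSettingsChange) -> Option<PortSettingsId> {
    match change {
        PortSettingsChange::Apply(s) => Some(s.id),
        PortSettingsChange::Clear => None,
    }
}
```

The `Option` then only appears where it genuinely reflects the domain (`Clear` has no settings), not as a runtime assertion inside `Apply`.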

```rust
"config" => ?desired_config,
"error" => %e,
);
// if we cannot successfully serialize the config, we cannot store the config
```

I think this one should go back to being an `.expect("BootstoreConfig should be serializable as JSON")`:

  1. A serialization failure means something has changed with the definition of the BootstoreConfig type such that it's no longer possible to represent it as JSON. That would be extremely bad.
  2. As noted above, replacing this panic with an error log means we're likely to miss it if this happens. Since in this case the only way for it to start panicking is a code change (and an unlikely one at that), I think we should panic so that we catch such a change as early as possible.
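The pattern being argued for, in a self-contained sketch (`to_json` is a std-only stand-in for `serde_json::to_string`, and this `BootstoreConfig` is an invented placeholder, not the real type):

```rust
#[derive(Debug)]
struct BootstoreConfig {
    generation: u64,
}

// Stand-in serializer so the sketch stays std-only; the real code would
// call serde_json::to_string and get a Result back.
fn to_json(cfg: &BootstoreConfig) -> Result<String, String> {
    Ok(format!("{{\"generation\":{}}}", cfg.generation))
}

fn store_config(cfg: &BootstoreConfig) -> String {
    // A failure here can only come from a change to the type definition that
    // makes it unserializable, i.e. a code bug, so panic as early as possible
    // rather than logging an error that nobody is watching for.
    to_json(cfg).expect("BootstoreConfig should be serializable as JSON")
}
```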

@jgallagher (Contributor)

> I'm worried the panic we saw is indicative that we have some structural / type-level issues that we're catching at runtime, and these changes kinda paper over those issues (and in some cases make it more likely we won't notice them until later).

To expand on this a bit: there are a bunch of ways to handle errors. I think I'd start describing those by splitting them into two categories: an error that can only happen if we have a mistake in code, and errors that must be handled at runtime. I'm sure there are good sources that go into the difference in different languages, but I think in Rust this is boiling down to "if something is wrong here, should I assert! / panic! / unwrap() / expect(), or should I do something to try to handle this error at runtime?".

There are a bunch of cases where I'd claim it's correct to use assert/panic/etc. even in production code: any time we're ensuring some internal precondition or consistency that we have full control over and that should never happen unless we have a bug or have made an error in reasoning about things. If we're in this state, the program can't know how to proceed, because something is fundamentally wrong, and all we can really do is panic. I'll claim that the inability to serialize BootstoreConfig to JSON in this PR is one such case: the only way this can fail at runtime is if the type has changed to something that can't be serialized as JSON at all (i.e., we've introduced a bug: we need to serialize a thing as JSON that can't be serialized as JSON).

There are way more kinds of errors that shouldn't result in panics, of course. Bad input, I/O errors, DNS lookup failures, ..., this list is basically unending.
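The two categories can be contrasted in a few lines of std-only Rust (both functions are invented examples for illustration, not code from this PR):

```rust
fn parse_port(input: &str) -> Option<u16> {
    match input.parse::<u16>() {
        Ok(p) => Some(p),
        Err(e) => {
            // Runtime error: bad input is expected in production, so we
            // log (or otherwise report) and keep going in a degraded state.
            eprintln!("failed to parse port from {input:?}: {e}");
            None
        }
    }
}

fn default_port(ports: &[u16]) -> u16 {
    // Internal invariant: callers construct this list themselves, so an
    // empty list can only mean a code bug. Panicking surfaces the bug
    // immediately instead of letting it hide in a log nobody reads.
    *ports.first().expect("port list is constructed non-empty by the caller")
}
```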

My concern about the changes in this PR is that, of the unwraps we had before, I don't know which ones actually should still be panics because they were really asserting some condition about the system as a whole that should always be true, and which ones were incorrectly panicking in some case we ought to handle somehow. (We don't have a lot of great ways to handle this other than logging at the moment, but that's a different problem.) It sounds from your comment on the issue (#8579 (comment)) like maybe all of these are really in the "this should never happen and if it does we have a bug somewhere else (maybe in database validation?)" case? If that's right, I think these changes aren't the way we should go, and instead we need to find and fix the root cause(s).

This is certainly more work than just converting the panics to error logs. In terms of triaging for R16, I think we're kinda in the same boat in terms of some of that work though; I think some reasonable questions are:

  • Do we understand what happened on dogfood?
  • Could it happen on a customer system?
  • What's the effect if it does happen? (dogfood apparently came back from this fine, but if we don't know what happened we also don't know whether that would be true if it happened somewhere else.)

@internet-diglett (Contributor, Author) commented Jul 29, 2025

@jgallagher Your points make sense to me. I was primarily trying to add information that would give us more context about the time periods and sequences of events surrounding these issues; I failed to consider that we didn't have any other canaries to tell us when we actually hit this problem and prompt us to go look at the logs.

It's probably a good idea to log and panic here for the future because some of these tables (like the dependent tables in switch port settings) don't keep historical information (we hard delete those records, we don't soft delete them).

I'm currently looking into the issue on Dogfood.

@internet-diglett (Contributor, Author)

Update on issue: #8579 (comment)

Successfully merging this pull request may close these issues:

Nexus panic inside sync_switch_configuration background task