Do not panic in switch synchronization task #8714
Conversation
There were some lingering `expect()` and `unwrap()` statements in the sync switch configuration background task. Some of these were in places we believed to be impossible to hit, but a recent core dump shows that we have somehow triggered one of them, so we're now logging an error and continuing in each case.
I'm not sure I'm the best person to review this; I don't really have any context on the various networking machinery involved. That said, the changes make me a little uneasy for reasons noted below; I'm worried the panic we saw is indicative that we have some structural / type-level issues that we're catching at runtime, and these changes kinda paper over those issues (and in some cases make it more likely we won't notice them until later).
```rust
// There should always be a set of prefixes for a given announce set id.
// If this is `None` we need to audit the history of the specified bgp config
// and announce set to see what went wrong.
error!(
    log,
    "bgp config references an announce set that does not exist";
    "bgp_config_id" => ?config.id(),
    "announce_set_id" => ?config.bgp_announce_set_id,
);
vec![]
```
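A minimal, self-contained sketch of the pattern discussed here, log the inconsistency and fall back to an empty set rather than panic, so the background task keeps running. The `HashMap` and all names are illustrative stand-ins for the real datastore lookup:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the datastore: announce-set id -> prefixes.
type AnnounceSets = HashMap<u32, Vec<String>>;

// Instead of `.expect("announce set must exist")`, log the inconsistency
// and return an empty prefix list so the caller can continue.
fn announce_prefixes(sets: &AnnounceSets, announce_set_id: u32) -> Vec<String> {
    match sets.get(&announce_set_id) {
        Some(prefixes) => prefixes.clone(),
        None => {
            // The real code emits a structured `error!` slog entry.
            eprintln!(
                "bgp config references an announce set that does not exist \
                 (announce_set_id={announce_set_id})"
            );
            vec![]
        }
    }
}
```

The trade-off, as the review below points out, is that nothing escalates the log line: the task silently degrades instead of crashing loudly.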
Getting rid of potential panics is good, but this change makes me nervous:

- We don't have any way of escalating error logs, and AFAIK we don't regularly check for them even on systems where we could (e.g., dogfood). We noticed Nexus panic inside `sync_switch_configuration` background task #8579 specifically because of the panic; if this had been an error log all along, would we have just logged an "ought to be impossible" condition and not even realized it had happened?
- The comment says if this happens we should audit the history of the BGP config; have we done that on dogfood to understand how it happened? (I'm worried that if we haven't done that, there might be ways we could get here that we don't understand, possibly because of other bugs?)
```rust
let net = match Ipv4Net::new(prefix.value, prefix.length) {
    Ok(v) => v,
    Err(e) => {
        error!(
            log,
            "failed to create Ipv4Net from Prefix4";
            "prefix4" => ?prefix,
            "error" => %DisplayErrorChain::new(&e),
```
This conversion seems like it should be infallible, I think? Could `Prefix4` (a) require at runtime that its `length` be valid (i.e., `<= 32`) and (b) have an infallible method that gives back an `Ipv4Net`?

If a `Prefix4` with a length greater than 32 is a sensible thing, I'd retract my questions. But I think if we have such a `Prefix4` floating around, that's already indicative of some missed input validation or something, right? We shouldn't be catching that this late.
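One way to make the conversion infallible, sketched with a hypothetical `Prefix4` and a minimal local `Ipv4Net` standing in for the real type: validate the length once at construction, so every later conversion is guaranteed to succeed.

```rust
use std::net::Ipv4Addr;

// Minimal stand-in for the real Ipv4Net type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Ipv4Net {
    addr: Ipv4Addr,
    width: u8,
}

#[derive(Debug)]
struct InvalidPrefixLength(u8);

// Hypothetical Prefix4 that enforces `length <= 32` at the boundary.
#[derive(Debug, Clone, Copy)]
struct Prefix4 {
    value: Ipv4Addr,
    length: u8,
}

impl Prefix4 {
    // Fallible construction: bad input is rejected here, exactly once.
    fn new(value: Ipv4Addr, length: u8) -> Result<Self, InvalidPrefixLength> {
        if length > 32 {
            return Err(InvalidPrefixLength(length));
        }
        Ok(Self { value, length })
    }

    // Infallible by construction: the invariant was checked in `new`.
    fn to_net(self) -> Ipv4Net {
        Ipv4Net { addr: self.value, width: self.length }
    }
}
```

With this shape, the `match` on `Ipv4Net::new(...)` in the diff above disappears entirely; invalid lengths are caught where the prefix first enters the system.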
```rust
// Theoretically this should never be possible. If we have a PortSettingsChange::Apply, that
// means we have an active port_settings record and the port_settings_id should be `Some(Uuid)`.
```
Is this actually impossible, as in it's guaranteed that the settings we read from the DB always have an ID? If so, is there a way to structure the types such that the ID is not optional?
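A sketch of one possible restructuring, with illustrative names: carry the id inside the `Apply` variant itself, so code handling an apply never sees an `Option` at all.

```rust
// Dependency-free stand-in for Uuid.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SettingsId(u128);

// Hypothetical shape: an Apply cannot be constructed without an id,
// so there is no `Option<SettingsId>` to unwrap downstream.
#[derive(Debug)]
enum PortSettingsChange {
    Apply { settings_id: SettingsId },
    Clear,
}

fn describe(change: &PortSettingsChange) -> String {
    match change {
        PortSettingsChange::Apply { settings_id } => {
            format!("apply settings {}", settings_id.0)
        }
        PortSettingsChange::Clear => "clear settings".to_string(),
    }
}
```

The idea is "make illegal states unrepresentable": if the database row can genuinely lack an id, that fact surfaces at the one place the enum is built, not in every consumer.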
"config" => ?desired_config, | ||
"error" => %e, | ||
); | ||
// if we cannot successfully serialize the config, we cannot store the config |
I think this one should go back to being an `.expect("BootstoreConfig should be serializable as JSON")`:

- A serialization failure means something has changed with the definition of the `BootstoreConfig` type such that it's no longer possible to represent it as JSON. That would be extremely bad.
- As noted above, replacing this panic with an error log means we're likely to miss it if this happens. Since in this case the only way for it to start panicking is a code change (and an unlikely one at that), I think we should panic so that we catch such a change as early as possible.
To expand on this a bit: there are a bunch of ways to handle errors. I think I'd start describing those by splitting them into two categories: errors that can only happen if we have a mistake in code, and errors that must be handled at runtime. I'm sure there are good sources that go into the difference in different languages, but I think in Rust this boils down to "if something is wrong here, should I panic, or should I return an error?"

There are a bunch of cases where I'd claim it's correct to use assert/panic/etc. even in production code: any time we're ensuring some internal precondition or consistency that we have full control over and that should never fail unless we have a bug or have made an error in reasoning about things. If we're in this state, the program can't know how to proceed, because something is fundamentally wrong, and all we can really do is panic. I'll claim that the inability to serialize `BootstoreConfig` falls into this category.

There are way more kinds of errors that shouldn't result in panics, of course. Bad input, I/O errors, DNS lookup failures, ...; this list is basically unending.

My concern about the changes on this PR is that of the unwraps we had before, I don't know which ones actually should still be panics because they were really asserting some condition about the system as a whole that should always be true, and which ones were incorrectly panicking in some case we ought to handle somehow. (We don't have a lot of great ways to handle this other than logging at the moment, but that's a different problem.) It sounds like from your comment on the issue (#8579 (comment)) that maybe all of these are really in the "this should never happen, and if it does we have a bug somewhere else (maybe in database validation?)" case? If that's right, I think these changes aren't the way we should go, and instead we need to find and fix the root cause(s). This is certainly more work than just converting the panics to error logs.
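A small sketch of the two categories, with made-up functions: an internal invariant whose failure means a code bug (so we panic early), and a runtime condition callers must handle (so we return a `Result`).

```rust
// Category 1: internal precondition we fully control. If this assertion
// ever fires, the code has a bug; panicking surfaces it immediately.
fn midpoint(lo: u32, hi: u32) -> u32 {
    assert!(lo <= hi, "caller violated precondition: lo <= hi");
    lo + (hi - lo) / 2
}

// Category 2: runtime condition outside our control (here, user input).
// Return a Result and let the caller decide how to recover.
fn parse_port(s: &str) -> Result<u16, std::num::ParseIntError> {
    s.trim().parse::<u16>()
}
```

The question raised in this review is precisely which category each of the removed `unwrap()`s belonged to, and that can't be answered by mechanically swapping panics for logs.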
In terms of triaging for R16, I think we're kinda in the same boat in terms of some of that work though; I think some reasonable questions are:
@jgallagher Your points make sense to me. I was primarily trying to add information to give us more context about the time periods and sequences of events surrounding these issues; I failed to consider that we didn't have any other canaries to let us know when we actually ran into this problem and cause us to actually look at the logs.
I'm currently looking into the issue on Dogfood.
Update on issue: #8579 (comment)
Related: Nexus panic inside `sync_switch_configuration` background task #8579