Create db_metadata_nexus_state table #8845
7b98014 to 7bafa74
SET LOCAL disallow_full_table_scans = off;

INSERT INTO omicron.public.db_metadata_nexus (nexus_id, last_drained_blueprint_id, state)
It may be possible to not include this populate step, depending on how reconfigurator execution plans on populating these records. After all, the "no db_metadata_nexus records" case is already treated specially for backwards-compatibility.
This would also let us delete the data migrations in nexus/tests/integration_tests/schema.rs
/// Returns an error if:
/// - Any db_metadata_nexus records already exist (should only be called
///   during initial setup)
pub async fn initialize_nexus_access_from_blueprint_on_connection(
If we are okay with a period of time in which db_metadata_nexus records do not exist, but blueprint execution could otherwise be functional, this change may not be necessary.
However, I think the presence of active records for live Nexuses acts as a strong guard against quiescing, as documented in https://rfd.shared.oxide.computer/rfd/588, so they are populated here, within RSS setup.
Yeah, I like the approach you've got here.
7bafa74 to d70e41f
Thanks! I have a lot of small comments here but I think this is close to what we've discussed.
One lingering thing that makes me nervous is that there are so many implicit assumptions and constraints on the datastore functions in db_metadata. This is not a blocker for this PR! But I wonder if this would benefit from an approach that used distinct types for the different phases. I'll think about this and bring it up elsewhere.
// Before proceeding, all records must be in the "inactive" or "not_yet" states.
//
// We explicitly look for any records violating this, rather than explicitly looking for
This feels less future-proofed to me, on the grounds that if we added some new state that's logically like one of these other two states, we'll erroneously not include it here and so not notice something in that state. It feels safer to me to look for active explicitly.
"on the grounds that if we added some new state that's logically like one of these other two states"
What if the new state is logically like "active"? Or needs different handling than these other two states? Perhaps it's like unquiescing_to_deal_with_expungement, to try to deal with https://rfd.shared.oxide.computer/rfd/0588#_trying_to_handle_permanent_failures_during_handoff
Truly, I have no idea what kind of future state we would want, but if I use:
let active_count = dsl::db_metadata_nexus.filter(dsl::state.eq(active))
// Proceed with the handoff only if "active_count" == 0
...
Then I wouldn't be handling this case correctly.
I was trying to follow the conditions for handoff we agreed on in RFD 588:
To carry out the handoff:
Precondition: all records in this table must have state not_yet or quiesced.
Any other state - whether it's active, unquiescing_to_deal_with_expungement, or something else - would violate that constraint, as written.
I think this might be more obvious if I renamed this variable from active_count to not_not_yet_and_not_quiesced_count, but that feels much wordier.
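To make the disagreement above concrete, here is a minimal in-memory sketch of the two counting strategies. It is not the PR's Diesel query: the `NexusState` enum, both function names, and the extra `UnquiescingToDealWithExpungement` variant are hypothetical stand-ins used only to illustrate the argument.

```rust
// Hypothetical in-memory stand-in for the `state` column of db_metadata_nexus.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NexusState {
    Active,
    NotYet,
    Quiesced,
    // Imagined future state, present only to illustrate the argument.
    UnquiescingToDealWithExpungement,
}

/// Counts records that are explicitly `Active`.
pub fn active_count(states: &[NexusState]) -> usize {
    states.iter().filter(|s| **s == NexusState::Active).count()
}

/// Counts records that violate the handoff precondition, i.e. anything
/// that is neither `NotYet` nor `Quiesced`.
pub fn violating_count(states: &[NexusState]) -> usize {
    states
        .iter()
        .filter(|s| !matches!(s, NexusState::NotYet | NexusState::Quiesced))
        .count()
}
```

With a record in the imagined future state, `active_count` reports zero (so a check based on it would let the handoff proceed), while `violating_count` reports one and blocks it. The trade-off the reviewer raises still applies: a new state that is logically like `not_yet` or `quiesced` would also be counted as a violation until the filter is updated.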
} = identity_check
else {
    return Err(BackoffError::permanent(
        "Nexus ID needed for handoff",
This should truly be impossible, right? I assume we wouldn't have returned NeedsHandoff if the identity check policy was DontCare. I don't think we should crash or anything but just wanted to be clear on my understanding and I think it's worth a comment to this effect. In the future it'd be great if we could rework it so this case wasn't representable.
Yeah, this is within the implementation details of check_schema_and_access, but NeedsHandoff is only returned when access is DoesNotHaveAccessYet. This can only be returned by DataStore::check_nexus_access, which is only invoked when the IdentityCheckPolicy::CheckAndTakeover variant is used (which has the explicit Nexus UUID).
I could pass the Nexus UUID back out through the NeedsHandoff enum? Gave this a shot in ef27f21, removed this error case.
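A rough sketch of the idea in this reply, carrying the Nexus ID inside the handoff variant so the "ID missing" case cannot be represented. Every type and function here is a simplified, hypothetical stand-in; the shapes in ef27f21 may differ, and only the names `NeedsHandoff`, `DontCare`, and `CheckAndTakeover` are taken from the discussion.

```rust
// All definitions below are hypothetical simplifications, not the PR's types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NexusId(pub u128);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IdentityCheckPolicy {
    DontCare,
    CheckAndTakeover { nexus_id: NexusId },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaAction {
    Ready,
    // The ID travels with the variant, so the caller never has to
    // re-derive it from the policy (and cannot find it missing).
    NeedsHandoff { nexus_id: NexusId },
}

/// `has_access` stands in for the result of an access check; in this
/// sketch it is only consulted when the policy carries an ID.
pub fn check(policy: IdentityCheckPolicy, has_access: bool) -> SchemaAction {
    match policy {
        // No identity to check, so a handoff can never be requested.
        IdentityCheckPolicy::DontCare => SchemaAction::Ready,
        IdentityCheckPolicy::CheckAndTakeover { nexus_id } => {
            if has_access {
                SchemaAction::Ready
            } else {
                SchemaAction::NeedsHandoff { nexus_id }
            }
        }
    }
}
```

A caller can then destructure `SchemaAction::NeedsHandoff { nexus_id }` directly, with no fallible `let ... else` on the policy.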
nexus/src/bin/schema-updater.rs
Outdated
    println!("Update to {version} complete");
}
SchemaAction::NeedsHandoff | SchemaAction::Refuse => {
    println!("Cannot update to version {version}")
Again, these should be impossible, right? Here I'd suggest being more explicit and reporting this as some kind of internal error.
Updated in 2760930
}

/// Registers a Nexus instance as having active access to the database
pub async fn database_nexus_access_insert(
Can we make this non-pub?
}

/// Checks if any db_metadata_nexus records exist in the database using an existing connection
pub async fn database_nexus_access_any_exist_on_connection(
Can we make this non-pub?
}

/// Checks if any db_metadata_nexus records exist in the database
pub async fn database_nexus_access_any_exist(&self) -> Result<bool, Error> {
non-pub?
/// Describes how a consumer may want to react to schema and access
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum ConsumerPolicy {
Oh, I wasn't thinking of the current MUPdate-based update process.
How does this PR affect that? It seems like this PR does change Nexus to automatically try to update the schema?
/// Describes what should be done with a schema
#[derive(Debug, Copy, Clone, PartialEq)]
pub enum SchemaAction {
DatastoreSetupAction?
- SchemaAction::NeedsHandoff | SchemaAction::Refuse => {
-     println!("Cannot update to version {version}")
+ DatastoreSetupAction::Refuse => {
+     println!("Refusing to update to version {version}")
why would this be? (I'm imagining the support person seeing this message and not knowing what this means or what to do next.)
One example could be if the "version the schema updater wants to upgrade to" is older than the observed version on disk (e.g., really old schema-updater, newer deployment). Running the schema-updater ls command should make this immediately clear.
Implements Nexus's db_metadata_nexus_state table, to consider access by other Nexuses that may be concurrently executing with older schemas.

Database Schema & Migration
- db_metadata_nexus table to track Nexus access with states: active, not_yet, inactive

Validation
- check_schema_and_access() function validates both schema version compatibility and Nexus access
- SchemaAction enum guides database initialization based on "access" / "schema" combinations
- attempt_handoff() function enables atomic transition of Nexus access from not_yet to active states

Backwards compatibility

Fixes #8501
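
As an illustration of how an action could be derived from "access" / "schema" combinations, here is a small sketch. It is a guess at the general shape, not the PR's logic: the `Access` and `Action` enums, the `decide` function, the integer versions, and the order of the checks are all assumptions. Only the `NeedsHandoff` and `Refuse` outcomes, the `DoesNotHaveAccessYet` access result, and the "desired version older than the version on disk" refusal come from the discussion above.

```rust
use std::cmp::Ordering;

// Hypothetical access results; only DoesNotHaveAccessYet is named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Access {
    HasAccess,
    DoesNotHaveAccessYet,
    LockedOut,
}

// Hypothetical outcomes; NeedsHandoff and Refuse are named in the PR,
// Ready and Update are assumed for the sake of the example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Ready,
    Update,
    NeedsHandoff,
    Refuse,
}

/// Picks an action from the caller's access and from the comparison of
/// the desired schema version against the version found in the database.
pub fn decide(access: Access, desired: u32, found: u32) -> Action {
    match (access, desired.cmp(&found)) {
        // A locked-out caller never touches the database.
        (Access::LockedOut, _) => Action::Refuse,
        // The database is already newer than this binary expects.
        (_, Ordering::Less) => Action::Refuse,
        // Access has not been granted yet: ask for a handoff first.
        (Access::DoesNotHaveAccessYet, _) => Action::NeedsHandoff,
        (Access::HasAccess, Ordering::Equal) => Action::Ready,
        (Access::HasAccess, Ordering::Greater) => Action::Update,
    }
}
```

For example, `decide(Access::HasAccess, 3, 5)` yields `Action::Refuse`, which matches the "really old schema-updater, newer deployment" case described in the review thread.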