Create `db_metadata_nexus_state` table #8845

smklein · 2025-08-14T23:59:05Z

Implements Nexus's db_metadata_nexus_state table, which lets a Nexus account for database access by other Nexuses
that may be running concurrently with older schemas.

Database Schema & Migration

Added db_metadata_nexus table to track Nexus access with states: active, not_yet, inactive
Migration automatically populates records for existing Nexus zones from the current target blueprint

Validation

New check_schema_and_access() function validates both schema version compatibility and Nexus access
SchemaAction enum guides database initialization based on "access" / "schema" combinations
attempt_handoff() function enables atomic transition of Nexus access from not_yet to active states
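
As a rough illustration of how the "access" / "schema" combinations could map onto an action, here is a minimal sketch. Only `SchemaAction`, `NeedsHandoff`, `Refuse`, and `DoesNotHaveAccessYet` are names that appear in this PR's discussion; the `NexusAccess` and `SchemaStatus` enums, the `Ready` and `Update` variants, and the mapping itself are assumptions made for illustration, not the code under review.

```rust
/// Hypothetical summary of what the access records say about this Nexus.
#[derive(Debug, Copy, Clone, PartialEq)]
enum NexusAccess {
    /// No records exist at all (treated specially for backwards compatibility).
    NoRecords,
    HasAccess,
    DoesNotHaveAccessYet,
    Quiesced,
}

/// Hypothetical comparison of the desired schema version with the one found in the database.
#[derive(Debug, Copy, Clone, PartialEq)]
enum SchemaStatus {
    DesiredMatchesFound,
    DesiredNewerThanFound,
    DesiredOlderThanFound,
}

#[derive(Debug, Copy, Clone, PartialEq)]
enum SchemaAction {
    Ready,
    Update,
    NeedsHandoff,
    Refuse,
}

/// Assumed mapping from (access, schema) to an action. Arms are checked in order.
fn schema_action(access: NexusAccess, schema: SchemaStatus) -> SchemaAction {
    match (access, schema) {
        // A Nexus that has given up access must not use the database again.
        (NexusAccess::Quiesced, _) => SchemaAction::Refuse,
        // Never run against a schema newer than this binary understands.
        (_, SchemaStatus::DesiredOlderThanFound) => SchemaAction::Refuse,
        // A Nexus still waiting for access has to go through handoff first.
        (NexusAccess::DoesNotHaveAccessYet, _) => SchemaAction::NeedsHandoff,
        // Remaining cases: HasAccess or NoRecords.
        (_, SchemaStatus::DesiredMatchesFound) => SchemaAction::Ready,
        (_, SchemaStatus::DesiredNewerThanFound) => SchemaAction::Update,
    }
}
```

Matching on the pair keeps every combination visible in one place, and the compiler rejects the match if a new variant is added without deciding how to handle it.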

Backwards compatibility

Supports deployments upgrading from pre-existing schemas, and populates records for new deployments
Support for both explicit Nexus IDs and omitted Nexus IDs (for the schema updater binary)

Fixes #8501

smklein · 2025-08-15T20:18:20Z

schema/crdb/populate-db-metadata-nexus/up04.sql

+
+SET LOCAL disallow_full_table_scans = off;
+
+INSERT INTO omicron.public.db_metadata_nexus (nexus_id, last_drained_blueprint_id, state)


It may be possible to not include this populate step, depending on how reconfigurator execution plans on populating these records. After all, the "no db_metadata_nexus records" case is already treated specially for backwards-compatibility.

This would also let us delete the data migrations in nexus/tests/integration_tests/schema.rs

smklein · 2025-08-15T20:20:55Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+    /// Returns an error if:
+    /// - Any db_metadata_nexus records already exist (should only be called
+    /// during initial setup)
+    pub async fn initialize_nexus_access_from_blueprint_on_connection(


If we are okay with a period of time in which db_metadata_nexus records do not exist, but blueprint execution could otherwise be functional, this change may not be necessary.

However, I think the presence of active records for live Nexuses acts as a strong guard against quiescing, as documented in https://rfd.shared.oxide.computer/rfd/588 , so they are populated here, within RSS setup.

Yeah, I like the approach you've got here.
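
For readers following along, the "only during initial setup" contract on this function can be sketched against an in-memory stand-in for the table. Everything here is hypothetical: `NexusRecords`, `initialize_from_blueprint`, and the `u128` IDs stand in for the real datastore, blueprint, and UUID types, and the real function runs on a database connection.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Copy, Clone, PartialEq)]
enum NexusState {
    Active,
    NotYet,
    Quiesced,
}

/// In-memory stand-in for the `db_metadata_nexus` table, keyed by Nexus ID.
#[derive(Debug, Default)]
struct NexusRecords {
    rows: BTreeMap<u128, NexusState>,
}

impl NexusRecords {
    /// Inserts one `active` record per Nexus zone in the initial blueprint.
    ///
    /// Returns an error if any record already exists, since this is only
    /// meant to be called during initial setup. On success, returns the
    /// number of records now in the table.
    fn initialize_from_blueprint(&mut self, nexus_ids: &[u128]) -> Result<usize, String> {
        if !self.rows.is_empty() {
            return Err("db_metadata_nexus records already exist".to_string());
        }
        for id in nexus_ids {
            self.rows.insert(*id, NexusState::Active);
        }
        Ok(self.rows.len())
    }
}
```

A second call fails instead of overwriting, which is what makes the presence of `active` records usable as a guard later on.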

davepacheco

Thanks! I have a lot of small comments here but I think this is close to what we've discussed.

One lingering thing that makes me nervous is that there are so many implicit assumptions and constraints on the datastore functions in db_metadata. This is not a blocker for this PR! But I wonder if this would benefit from an approach that used distinct types for the different phases. I'll think about this and bring it up elsewhere.

davepacheco · 2025-08-25T21:16:04Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+
+        // Before proceeding, all records must be in the "inactive" or "not_yet" states.
+        //
+        // We explicitly look for any records violating this, rather than explicitly looking for


This feels less future-proofed to me, on the grounds that if we added some new state that's logically like one of these other two states, we'll erroneously not include it here and so not notice something in that state. It feels safer to me to look for active explicitly.

> on the grounds that if we added some new state that's logically like one of these other two states

What if the new state is logically like "active"? Or needs different handling than these other two states? Perhaps it's like unquiescing_to_deal_with_expungement to try to deal with https://rfd.shared.oxide.computer/rfd/0588#_trying_to_handle_permanent_failures_during_handoff

Truly, I have no idea what kind of future state we would want, but if I use:

    let active_count = dsl::db_metadata_nexus.filter(dsl::state.eq(active))
    // Proceed if "active_count" == 0
    ...

Then I wouldn't be handling this case correctly.

I was trying to follow the conditions for handoff we agreed on in RFD 588:

To carry out the handoff:
Precondition: all records in this table must have state not_yet or quiesced.

Any other state - whether it's active, unquiescing_to_deal_with_expungement, or something else - would violate that constraint, as written.

I think this might be more obvious if I renamed this variable from active_count to not_not_yet_and_not_quiesced_count but that feels much wordier.
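
To make the two positions concrete, here is a small in-memory sketch of the "look for violators" approach. It is not the Diesel query from the PR: `Record`, `State`, and this `attempt_handoff` signature are hypothetical, and in the real datastore the check and the update would need to happen atomically.

```rust
#[derive(Debug, Copy, Clone, PartialEq)]
enum State {
    Active,
    NotYet,
    Quiesced,
}

#[derive(Debug, Clone, PartialEq)]
struct Record {
    nexus_id: u32,
    state: State,
}

/// Moves the given Nexuses from `NotYet` to `Active`, but only if every
/// record is currently `NotYet` or `Quiesced`.
fn attempt_handoff(records: &mut [Record], new_nexus_ids: &[u32]) -> Result<(), String> {
    // Count records that violate the precondition. Because this is a negative
    // filter, any state added to the enum later also counts as a violation
    // until someone explicitly allows it here.
    let violating = records
        .iter()
        .filter(|r| !matches!(r.state, State::NotYet | State::Quiesced))
        .count();
    if violating > 0 {
        return Err(format!("{violating} record(s) are neither not_yet nor quiesced"));
    }

    for r in records.iter_mut() {
        if r.state == State::NotYet && new_nexus_ids.contains(&r.nexus_id) {
            r.state = State::Active;
        }
    }
    Ok(())
}
```

With this shape, a hypothetical future state blocks handoff by default, whereas a filter on `Active` alone would let it through unnoticed.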

davepacheco · 2025-08-26T16:46:19Z

nexus/db-queries/src/db/datastore/mod.rs

+                            } = identity_check
+                            else {
+                                return Err(BackoffError::permanent(
+                                    "Nexus ID needed for handoff",


This should truly be impossible, right? I assume we wouldn't have returned NeedsHandoff if the identity check policy was DontCare. I don't think we should crash or anything but just wanted to be clear on my understanding and I think it's worth a comment to this effect. In the future it'd be great if we could rework it so this case wasn't representable.

Yeah, this is within the implementation details of check_schema_and_access, but NeedsHandoff is only returned when access is DoesNotHaveAccessYet. This can only be returned by DataStore::check_nexus_access, which is only invoked when the IdentityCheckPolicy::CheckAndTakeover variant is used (which has the explicit Nexus UUID).

I could pass the Nexus UUID back out through the NeedsHandoff enum? Gave this a shot in ef27f21, removed this error case.
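
A sketch of what "passing the UUID back out" can look like, so the impossible branch stops being representable. The field names, the `Ready` variant, the `has_access` flag, and the `u128` stand-in for a UUID are assumptions for illustration; see ef27f21 for the actual change.

```rust
#[derive(Debug, Copy, Clone, PartialEq)]
enum IdentityCheckPolicy {
    /// No Nexus ID available (for example, the schema updater binary).
    DontCare,
    /// Check access for this Nexus and take over if needed.
    CheckAndTakeover { nexus_id: u128 },
}

#[derive(Debug, Copy, Clone, PartialEq)]
enum SchemaAction {
    Ready,
    /// Carries the ID, so the caller never has to re-derive it from the policy.
    NeedsHandoff { nexus_id: u128 },
}

/// `has_access` stands in for the result of the real access check.
fn setup_action(policy: IdentityCheckPolicy, has_access: bool) -> SchemaAction {
    match policy {
        IdentityCheckPolicy::CheckAndTakeover { nexus_id } if !has_access => {
            SchemaAction::NeedsHandoff { nexus_id }
        }
        // `DontCare` can never produce `NeedsHandoff`: there is no ID to put in it.
        _ => SchemaAction::Ready,
    }
}

/// Caller side: the ID comes straight out of the variant, with no error path.
fn handoff_target(action: SchemaAction) -> Option<u128> {
    match action {
        SchemaAction::NeedsHandoff { nexus_id } => Some(nexus_id),
        SchemaAction::Ready => None,
    }
}
```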

davepacheco · 2025-08-26T16:49:04Z

nexus/src/bin/schema-updater.rs

+                    println!("Update to {version} complete");
+                }
+                SchemaAction::NeedsHandoff | SchemaAction::Refuse => {
+                    println!("Cannot update to version {version}")


Again, these should be impossible, right? Here I'd suggest being more explicit and reporting this as some kind of internal error.

Updated in 2760930

davepacheco · 2025-08-26T16:57:56Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+    }
+
+    /// Registers a Nexus instance as having active access to the database
+    pub async fn database_nexus_access_insert(


Can we make this non-pub?

davepacheco · 2025-08-26T16:58:24Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+    }
+
+    /// Checks if any db_metadata_nexus records exist in the database using an existing connection
+    pub async fn database_nexus_access_any_exist_on_connection(


Can we make this non-pub?

davepacheco · 2025-08-26T16:58:39Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+    }
+
+    /// Checks if any db_metadata_nexus records exist in the database
+    pub async fn database_nexus_access_any_exist(&self) -> Result<bool, Error> {


davepacheco · 2025-08-26T17:09:58Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+
+/// Describes how a consumer may want to react to schema and access
+#[derive(Debug, Copy, Clone, PartialEq)]
+pub enum ConsumerPolicy {


Oh, I wasn't thinking of the current MUPdate-based update process.

How does this PR affect that? It seems like this PR does change Nexus to automatically try to update the schema?

davepacheco · 2025-08-26T17:10:35Z

nexus/db-queries/src/db/datastore/db_metadata.rs

+
+/// Describes what should be done with a schema
+#[derive(Debug, Copy, Clone, PartialEq)]
+pub enum SchemaAction {


DatastoreSetupAction?

davepacheco · 2025-08-26T20:19:04Z

nexus/src/bin/schema-updater.rs

-                SchemaAction::NeedsHandoff | SchemaAction::Refuse => {
-                    println!("Cannot update to version {version}")
+                DatastoreSetupAction::Refuse => {
+                    println!("Refusing to update to version {version}")


why would this be? (I'm imagining the support person seeing this message and not knowing what this means or what to do next.)

One example could be if the "version the schema updater wants to upgrade to" is older than the observed version on disk (e.g., really old schema-updater, newer deployment). Running the schema-updater ls command should make this immediately clear.
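
One way to make that case self-explanatory is to say why the update is refused and what to run next. A minimal sketch, assuming a simplified three-part version and a hypothetical `refusal_reason` helper; the message wording is illustrative, not what the schema-updater prints:

```rust
/// Simplified (major, minor, patch) schema version. The derived ordering
/// compares fields left to right.
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct Version(u64, u64, u64);

/// Returns an operator-facing explanation if the update should be refused
/// because the target is older than what the database already has.
fn refusal_reason(found: Version, target: Version) -> Option<String> {
    if target < found {
        Some(format!(
            "Refusing to update to version {}.{}.{}: the database is already at \
             newer version {}.{}.{}. Run `schema-updater ls` to compare versions.",
            target.0, target.1, target.2, found.0, found.1, found.2
        ))
    } else {
        None
    }
}
```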

smklein · 2025-08-27T00:11:26Z

Just added population of the db_metadata_nexus records in c5e4509, with tests in 403b1d7.

Next step: splitting this into smaller PRs, as we discussed in the update sync today.

…es (#8924) Split off of #8845 Creates the schema, ensures it stays up-to-date. Does not attempt to read it. First part of #8501: adding schema for records, writing them. Not yet reading these records.

smklein force-pushed the db_metadata_nexus branch 4 times, most recently from 7b98014 to 7bafa74 August 15, 2025 18:55

smklein commented Aug 15, 2025

smklein mentioned this pull request Aug 18, 2025

Extract schema update policy from implementation #8807

Closed

davepacheco mentioned this pull request Aug 20, 2025

start updating quiesce for new Nexus handoff #8875

Open

Create db_metadata_nexus_state table

d70e41f

smklein force-pushed the db_metadata_nexus branch from 7bafa74 to d70e41f August 21, 2025 21:22

smklein changed the title ~~Create db_metadata_nexus_state table~~ Create db_metadata_nexus_state table Aug 21, 2025

smklein requested review from davepacheco and jgallagher August 21, 2025 21:28

smklein marked this pull request as ready for review August 21, 2025 21:28

Update data migration tests from 181 -> 182

1ef5e5e

davepacheco reviewed Aug 25, 2025

smklein added 14 commits August 25, 2025 15:53

merge

865efed

patch schema version, update up04.sql

f6e90a3

s/inactive/quiesced

914248b

unique index

f46e22a

Only update non-expunged Nexuses, update data migration test too

db416d0

new_with_timeout tweaks

b25ce91

IdentityCheckPolicy

edeee09

503

2573384

Better handling of SchemaAction::Handoff

2f0c43f

remove ConsumerPolicy

165fd0e

Error types, unused code, comments

3edce16

line lengths

46f920a

comment

f5a4792

use blueprint in-memory, rather than doing db queries

4b2b58a

davepacheco reviewed Aug 26, 2025

smklein added 2 commits August 26, 2025 10:49

s/SchemaAction/DatastoreSetupAction

641ed0a

feed-forward nexus_id to avoid impossible code paths

ef27f21

smklein mentioned this pull request Aug 26, 2025

Let Nexus see schema_dir to self-update #8912

Open

smklein added 3 commits August 26, 2025 11:30

schema-updater comments

2760930

less pub

42be76b

warnings

875d8e6

davepacheco approved these changes Aug 26, 2025

smklein added 2 commits August 26, 2025 16:21

reconfigurator execution to populate/destroy records

c5e4509

tests

403b1d7

davepacheco mentioned this pull request Aug 27, 2025

quiesce needs to keep track of blueprint ids #8919

Open

merge with main (now rev 184)

09237bd

This was referenced Aug 27, 2025

(1/N) db_metadata_nexus schema changes, db queries. Populate the tables #8924

Merged

(5/N) Read database access records on boot #8925

Open

(3/N) db_metadata_nexus queries #8931

Open

(4/N) db_metadata_nexus queries (handoff) #8932

Open

