Add bg task for collecting chicken switches from DB #8462

Merged: 6 commits, merged Jul 1, 2025
10 changes: 9 additions & 1 deletion dev-tools/omdb/src/bin/omdb/nexus/chicken_switches.rs
@@ -6,6 +6,7 @@

use crate::Omdb;
use crate::check_allow_destructive::DestructiveOperationToken;
use clap::ArgAction;
use clap::Args;
use clap::Subcommand;
use http::StatusCode;
@@ -33,6 +34,7 @@ pub enum ChickenSwitchesCommands {

#[derive(Debug, Clone, Args)]
pub struct ChickenSwitchesSetArgs {
#[clap(long, action=ArgAction::Set)]
planner_enabled: bool,
}

@@ -100,7 +102,13 @@ async fn chicken_switches_show(
println!(" modified time: {time_modified}");
println!(" planner enabled: {planner_enabled}");
}
Err(err) => eprintln!("error: {:#}", err),
Err(err) => {
if err.status() == Some(StatusCode::NOT_FOUND) {
println!("No chicken switches enabled");
} else {
eprintln!("error: {:#}", err)
}
}
}

Ok(())
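A note on the `ArgAction::Set` change above: by default, clap treats a `bool` field as a presence flag, which would leave no way to pass `--planner-enabled false` to turn the planner back off. `ArgAction::Set` makes the flag take an explicit `true`/`false` value instead. Below is a dependency-free sketch of the resulting parsing behavior; the `parse_bool_flag` helper is hypothetical, for illustration only, not part of omdb or clap:

```rust
/// Parse `--<flag> <true|false>` from an argument list, mimicking clap's
/// `ArgAction::Set` on a `bool` field: the flag requires an explicit value
/// rather than acting as a presence toggle.
fn parse_bool_flag(args: &[&str], flag: &str) -> Option<bool> {
    let mut iter = args.iter();
    while let Some(arg) = iter.next() {
        if *arg == flag {
            // The value must follow the flag and be exactly "true"/"false".
            return match iter.next() {
                Some(&"true") => Some(true),
                Some(&"false") => Some(false),
                _ => None,
            };
        }
    }
    None
}

fn main() {
    let flag = "--planner-enabled";
    assert_eq!(parse_bool_flag(&[flag, "true"], flag), Some(true));
    assert_eq!(parse_bool_flag(&[flag, "false"], flag), Some(false));
    // A bare flag with no value is rejected, unlike a presence flag.
    assert_eq!(parse_bool_flag(&[flag], flag), None);
}
```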
12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -56,6 +56,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chicken_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -260,6 +264,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chicken_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -451,6 +459,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chicken_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

18 changes: 18 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -268,6 +268,10 @@ task: "blueprint_rendezvous"
owned rendezvous tables that other subsystems consume


task: "chicken_switches_watcher"
watch db for chicken switch changes


task: "crdb_node_id_collector"
Collects node IDs of running CockroachDB zones

@@ -543,6 +547,13 @@ task: "blueprint_rendezvous"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint

task: "chicken_switches_watcher"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "chicken_switches_watcher" (don't know how to interpret details: Object {"chicken_switches_updated": Bool(false)})

task: "crdb_node_id_collector"
configured period: every <REDACTED_DURATION>m
currently executing: no
@@ -1083,6 +1094,13 @@ task: "blueprint_rendezvous"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
last completion reported error: no blueprint

task: "chicken_switches_watcher"
configured period: every <REDACTED_DURATION>s
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "chicken_switches_watcher" (don't know how to interpret details: Object {"chicken_switches_updated": Bool(false)})

task: "crdb_node_id_collector"
configured period: every <REDACTED_DURATION>m
currently executing: no
2 changes: 1 addition & 1 deletion docs/reconfigurator.adoc
@@ -175,7 +175,7 @@ We're being cautious about rolling out that kind of automation. Instead, today,

`omdb` uses the Nexus internal API to do these things. Since this can only be done using `omdb`, Reconfigurator can really only be used by Oxide engineering and support, not customers.

The planner background task is currently disabled by default, but can be enabled by setting the Nexus configuration option `blueprints.disable_planner = false`. To get to the long term vision where the system is doing all this on its own in response to operator input, we'll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it _would_ do, etc.
The planner background task is currently disabled by default, but can be enabled via `omdb nexus chicken-switches --planner-enabled`. To get to the long term vision where the system is doing all this on its own in response to operator input, we'll need to get confidence that continually executing the planner will have no ill effects on working systems. This might involve more operational experience with it, more safeties, and tools for pausing execution, previewing what it _would_ do, etc.

== Design patterns

16 changes: 10 additions & 6 deletions nexus-config/src/nexus_config.rs
@@ -594,9 +594,6 @@ pub struct PhantomDiskConfig {
#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct BlueprintTasksConfig {
/// background planner chicken switch
pub disable_planner: bool,

/// period (in seconds) for periodic activations of the background task that
/// reads the latest target blueprint from the database
#[serde_as(as = "DurationSeconds<u64>")]
@@ -622,6 +619,11 @@ pub struct BlueprintTasksConfig {
/// collects the node IDs of CockroachDB zones
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs_collect_crdb_node_ids: Duration,

/// period (in seconds) for periodic activations of the background task that
/// reads chicken switches from the database
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs_load_chicken_switches: Duration,
}

#[serde_as]
@@ -1079,12 +1081,12 @@ mod test {
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
blueprints.period_secs_rendezvous = 300
blueprints.period_secs_collect_crdb_node_ids = 180
blueprints.period_secs_load_chicken_switches = 5
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
@@ -1247,13 +1249,14 @@ mod test {
period_secs: Duration::from_secs(30),
},
blueprints: BlueprintTasksConfig {
disable_planner: true,
period_secs_load: Duration::from_secs(10),
period_secs_plan: Duration::from_secs(60),
period_secs_execute: Duration::from_secs(60),
period_secs_collect_crdb_node_ids:
Duration::from_secs(180),
period_secs_rendezvous: Duration::from_secs(300),
period_secs_load_chicken_switches:
Duration::from_secs(5)
},
sync_service_zone_nat: SyncServiceZoneNatConfig {
period_secs: Duration::from_secs(30)
@@ -1396,12 +1399,12 @@ mod test {
physical_disk_adoption.period_secs = 30
decommissioned_disk_cleaner.period_secs = 30
phantom_disks.period_secs = 30
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
blueprints.period_secs_rendezvous = 300
blueprints.period_secs_collect_crdb_node_ids = 180
blueprints.period_secs_load_chicken_switches = 5
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
@@ -1424,6 +1427,7 @@
alert_dispatcher.period_secs = 42
webhook_deliverator.period_secs = 43
sp_ereport_ingester.period_secs = 44

[default_region_allocation_strategy]
type = "random"
"##,
1 change: 1 addition & 0 deletions nexus/background-task-interface/src/init.rs
@@ -48,6 +48,7 @@ pub struct BackgroundTasks {
pub task_alert_dispatcher: Activator,
pub task_webhook_deliverator: Activator,
pub task_sp_ereport_ingester: Activator,
pub task_chicken_switches_loader: Activator,

// Handles to activate background tasks that do not get used by Nexus
// at-large. These background tasks are implementation details as far as
2 changes: 1 addition & 1 deletion nexus/examples/config-second.toml
@@ -118,12 +118,12 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
blueprints.period_secs_rendezvous = 300
blueprints.period_secs_collect_crdb_node_ids = 180
blueprints.period_secs_load_chicken_switches = 5
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
2 changes: 1 addition & 1 deletion nexus/examples/config.toml
@@ -104,12 +104,12 @@ phantom_disks.period_secs = 30
physical_disk_adoption.period_secs = 30
support_bundle_collector.period_secs = 30
decommissioned_disk_cleaner.period_secs = 60
blueprints.disable_planner = true
blueprints.period_secs_load = 10
blueprints.period_secs_plan = 60
blueprints.period_secs_execute = 60
blueprints.period_secs_rendezvous = 300
blueprints.period_secs_collect_crdb_node_ids = 180
blueprints.period_secs_load_chicken_switches = 5
sync_service_zone_nat.period_secs = 30
switch_port_settings_manager.period_secs = 30
region_replacement.period_secs = 30
19 changes: 18 additions & 1 deletion nexus/src/app/background/init.rs
@@ -96,6 +96,7 @@ use super::tasks::blueprint_execution;
use super::tasks::blueprint_load;
use super::tasks::blueprint_planner;
use super::tasks::blueprint_rendezvous;
use super::tasks::chicken_switches::ChickenSwitchesLoader;
use super::tasks::crdb_node_id_collector;
use super::tasks::decommissioned_disk_cleaner;
use super::tasks::dns_config;
@@ -230,6 +231,7 @@ impl BackgroundTasksInitializer {
task_alert_dispatcher: Activator::new(),
task_webhook_deliverator: Activator::new(),
task_sp_ereport_ingester: Activator::new(),
task_chicken_switches_loader: Activator::new(),

task_internal_dns_propagation: Activator::new(),
task_external_dns_propagation: Activator::new(),
@@ -306,6 +308,7 @@ impl BackgroundTasksInitializer {
task_alert_dispatcher,
task_webhook_deliverator,
task_sp_ereport_ingester,
task_chicken_switches_loader,
// Add new background tasks here. Be sure to use this binding in a
// call to `Driver::register()` below. That's what actually wires
// up the Activator to the corresponding background task.
@@ -476,13 +479,26 @@ impl BackgroundTasksInitializer {
inventory_watcher
};

let chicken_switches_loader =
ChickenSwitchesLoader::new(datastore.clone());
let chicken_switches_watcher = chicken_switches_loader.watcher();
driver.register(TaskDefinition {
name: "chicken_switches_watcher",
description: "watch db for chicken switch changes",
period: config.blueprints.period_secs_load_chicken_switches,
task_impl: Box::new(chicken_switches_loader),
opctx: opctx.child(BTreeMap::new()),
watchers: vec![],
activator: task_chicken_switches_loader,
});

// Background task: blueprint planner
//
// Replans on inventory collection and changes to the current
// target blueprint.
let blueprint_planner = blueprint_planner::BlueprintPlanner::new(
datastore.clone(),
config.blueprints.disable_planner,
chicken_switches_watcher.clone(),
inventory_watcher.clone(),
rx_blueprint.clone(),
);
@@ -496,6 +512,7 @@
watchers: vec![
Box::new(inventory_watcher.clone()),
Box::new(rx_blueprint.clone()),
Box::new(chicken_switches_watcher),
],
activator: task_blueprint_planner,
});
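The loader/watcher wiring registered above follows the same producer/consumer pattern as the other Nexus background tasks: `ChickenSwitchesLoader` owns the sending side of a watch channel, periodically reads the switches from the database, and publishes only when the value actually changed (hence the `chicken_switches_updated` field in the task's status output); interested tasks hold cheap receiver clones. A simplified, dependency-free model of that change-detection behavior — the types and names here are illustrative stand-ins for `tokio::sync::watch`, not the real Nexus API:

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone, Debug, PartialEq, Eq, Default)]
struct Switches {
    planner_enabled: bool,
}

/// Minimal stand-in for a `tokio::sync::watch` channel: a shared latest
/// value plus a version counter that bumps only on real changes.
#[derive(Clone)]
struct SwitchesWatch {
    inner: Arc<Mutex<(Switches, u64)>>,
}

impl SwitchesWatch {
    fn new(initial: Switches) -> Self {
        Self { inner: Arc::new(Mutex::new((initial, 0))) }
    }

    /// Like one loader activation: publish the freshly loaded value and
    /// report whether anything actually changed.
    fn publish(&self, loaded: Switches) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 == loaded {
            false
        } else {
            guard.0 = loaded;
            guard.1 += 1;
            true
        }
    }

    fn latest(&self) -> Switches {
        self.inner.lock().unwrap().0.clone()
    }
}

fn main() {
    let watch = SwitchesWatch::new(Switches::default());
    let rx = watch.clone(); // receivers are cheap handle clones

    // First load flips the planner on: a real update.
    assert!(watch.publish(Switches { planner_enabled: true }));
    assert!(rx.latest().planner_enabled);

    // Re-loading an identical row is not reported as an update.
    assert!(!watch.publish(Switches { planner_enabled: true }));
}
```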
27 changes: 22 additions & 5 deletions nexus/src/app/background/tasks/blueprint_planner.rs
@@ -12,6 +12,7 @@ use nexus_db_queries::context::OpContext;
use nexus_db_queries::db::DataStore;
use nexus_reconfigurator_planning::planner::Planner;
use nexus_reconfigurator_preparation::PlanningInputFromDb;
use nexus_types::deployment::ReconfiguratorChickenSwitches;
use nexus_types::deployment::{Blueprint, BlueprintTarget};
use nexus_types::internal_api::background::BlueprintPlannerStatus;
use omicron_common::api::external::LookupType;
@@ -24,7 +25,7 @@ use tokio::sync::watch::{self, Receiver, Sender};
/// Background task that runs the update planner.
pub struct BlueprintPlanner {
datastore: Arc<DataStore>,
disabled: bool,
rx_chicken_switches: Receiver<ReconfiguratorChickenSwitches>,
rx_inventory: Receiver<Option<CollectionUuid>>,
rx_blueprint: Receiver<Option<Arc<(BlueprintTarget, Blueprint)>>>,
tx_blueprint: Sender<Option<Arc<(BlueprintTarget, Blueprint)>>>,
@@ -33,12 +34,18 @@ pub struct BlueprintPlanner {
impl BlueprintPlanner {
pub fn new(
datastore: Arc<DataStore>,
disabled: bool,
rx_chicken_switches: Receiver<ReconfiguratorChickenSwitches>,
rx_inventory: Receiver<Option<CollectionUuid>>,
rx_blueprint: Receiver<Option<Arc<(BlueprintTarget, Blueprint)>>>,
) -> Self {
let (tx_blueprint, _) = watch::channel(None);
Self { datastore, disabled, rx_inventory, rx_blueprint, tx_blueprint }
Self {
datastore,
rx_chicken_switches,
rx_inventory,
rx_blueprint,
tx_blueprint,
}
}

pub fn watcher(
Expand All @@ -51,7 +58,8 @@ impl BlueprintPlanner {
/// If it is different from the current target blueprint,
/// save it and make it the current target.
pub async fn plan(&mut self, opctx: &OpContext) -> BlueprintPlannerStatus {
if self.disabled {
let switches = self.rx_chicken_switches.borrow_and_update().clone();
if !switches.planner_enabled {
debug!(&opctx.log, "blueprint planning disabled, doing nothing");
return BlueprintPlannerStatus::Disabled;
}
@@ -251,6 +259,7 @@ mod test {
use super::*;
use crate::app::background::tasks::blueprint_load::TargetBlueprintLoader;
use crate::app::background::tasks::inventory_collection::InventoryCollector;
use nexus_inventory::now_db_precision;
use nexus_test_utils_macros::nexus_test;

type ControlPlaneTestContext =
@@ -291,10 +300,18 @@
let rx_collector = collector.watcher();
collector.activate(&opctx).await;

// Enable the planner
let (_tx, chicken_switches_collector_rx) =
watch::channel(ReconfiguratorChickenSwitches {
version: 1,
planner_enabled: true,
time_modified: now_db_precision(),
});

// Finally, spin up the planner background task.
let mut planner = BlueprintPlanner::new(
datastore.clone(),
false,
chicken_switches_collector_rx,
rx_collector,
rx_loader.clone(),
);
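The key behavioral change in `plan()` above is that the disabled check moved from a boot-time config bool to a live watch channel: each activation reads the current switches via `borrow_and_update()`, so the planner can be enabled or disabled at runtime without restarting Nexus. A minimal sketch of that gating logic — the `Status` enum here mirrors only the `Disabled` case of the real `BlueprintPlannerStatus`, and the rest is placeholder:

```rust
#[derive(Debug, PartialEq, Eq)]
enum Status {
    Disabled,
    Ran,
}

struct Switches {
    planner_enabled: bool,
}

/// Each activation re-reads the latest switches, so flipping
/// `planner_enabled` in the database takes effect on the next run.
fn plan(switches: &Switches) -> Status {
    if !switches.planner_enabled {
        // Mirrors the real task's "blueprint planning disabled" early return.
        return Status::Disabled;
    }
    // ... the real planner work (load inputs, plan, set target) goes here ...
    Status::Ran
}

fn main() {
    assert_eq!(plan(&Switches { planner_enabled: false }), Status::Disabled);
    assert_eq!(plan(&Switches { planner_enabled: true }), Status::Ran);
}
```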