@bfjelds (Member) commented Jan 16, 2026

🔍 Description

Available rollback detection is too complicated and hard to understand. It also requires parsing from the initial state.

Find a simpler approach that parses from the most recent state, so fewer entries must be parsed. Also add handling to HostStatus parsing so that parse failures in unrelated entries (i.e., HostStatus entries that are not required for rollback calculation) do not stop rollback.

@bfjelds (Member, Author) commented Jan 17, 2026

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds (Member, Author) commented Jan 17, 2026

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds (Member, Author) commented Jan 17, 2026

/AzurePipelines run [GITHUB]-trident-pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bfjelds changed the title from "DRAFT: engineering: simplify and improve resliency of available rollback detection" to "engineering: simplify and improve resliency of available rollback detection" on Jan 17, 2026
@bfjelds marked this pull request as ready for review January 17, 2026 02:30
@bfjelds requested a review from a team as a code owner January 17, 2026 02:30
Copilot AI review requested due to automatic review settings January 17, 2026 02:30
Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors the manual rollback detection logic to simplify the approach and improve resilience. The refactoring changes the parsing direction from oldest-to-newest to newest-to-oldest, which allows the system to determine available rollbacks more efficiently by requiring fewer HostStatus entries to be parsed. Additionally, the PR adds error handling in the datastore layer to prevent parse failures of unrelated HostStatus entries from blocking rollback detection.

Changes:

  • Refactored manual rollback detection to parse from newest to oldest instead of oldest to newest
  • Added graceful error handling for HostStatus parsing failures in the datastore
  • Updated protobuf enum naming to follow SCREAMING_SNAKE_CASE convention
  • Fixed a spelling error in a comment
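
The newest-to-oldest approach described in the overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `Row`, `HostStatus`, `parse_lenient`, `recent_statuses`, the `state=` encoding, and the stop condition are all hypothetical. The thread itself only states that parsing now runs in DESC order and that parse failures yield `None`.

```rust
/// Hypothetical stand-in for a stored datastore row: (id, serialized status).
type Row = (u64, String);

/// Hypothetical, heavily reduced host status.
#[derive(Debug, Clone, PartialEq)]
struct HostStatus {
    id: u64,
    state: String,
}

/// Assumed lenient parser: a row that cannot be parsed yields `None`
/// instead of an error, so it cannot abort the whole scan.
fn parse_lenient(row: &Row) -> Option<HostStatus> {
    let state = row.1.strip_prefix("state=")?;
    Some(HostStatus {
        id: row.0,
        state: state.to_string(),
    })
}

/// Walks rows from newest to oldest and stops at the first row whose state
/// equals `boundary` (an assumed stop condition). Rows older than the
/// boundary are never parsed; unparseable rows before it are skipped.
fn recent_statuses(rows: &[Row], boundary: &str) -> Vec<HostStatus> {
    let mut sorted: Vec<&Row> = rows.iter().collect();
    sorted.sort_by(|a, b| b.0.cmp(&a.0)); // newest (highest id) first
    let mut out = Vec::new();
    for row in sorted {
        if let Some(hs) = parse_lenient(row) {
            let stop = hs.state == boundary;
            out.push(hs);
            if stop {
                break;
            }
        }
    }
    out
}
```

Because the scan stops at the boundary, a corrupt entry that is older than the boundary is never touched, and a corrupt entry that is newer is skipped rather than treated as fatal.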

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| proto/harpoon.proto | Updated enum value names to follow protobuf conventions (breaking API change) |
| crates/trident/src/datastore.rs | Changed parsing order to DESC and added error handling to return None for parse failures |
| crates/trident/src/engine/manual_rollback/utils.rs | Major refactoring: simplified logic with Operation abstraction, reversed parsing direction, removed volume-specific tracking |
| crates/trident/src/engine/storage/rebuild.rs | Fixed spelling in comment: "hasnt" to "hasn't" |

Comment on lines +249 to +251
NO_ROLLBACK_AVAILABLE = 0;
AB_ROLLBACK_EXPECTED = 1;
RUNTIME_ROLLBACK_EXPECTED = 2;

Copilot AI Jan 17, 2026

These enum value name changes are breaking API changes. Clients using the old enum values (NoRollbackAvailable, AbRollbackExpected, RuntimeRollbackExpected) will need to update to the new values (NO_ROLLBACK_AVAILABLE, AB_ROLLBACK_EXPECTED, RUNTIME_ROLLBACK_EXPECTED). While SCREAMING_SNAKE_CASE is the standard convention for protobuf enums, ensure that all API clients are updated accordingly, or consider adding a deprecation period if this is a public API.

Copilot uses AI. Check for mistakes.
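
For context on the compatibility concern: renaming a protobuf enum value leaves its number unchanged, so binary-encoded messages are unaffected, while anything that refers to the value by name (generated code, JSON or text-format payloads) is affected. One transitional option is a shim that accepts both spellings. The sketch below is hypothetical: the enum type name `RollbackExpectation` and the function `from_any_name` do not appear in this thread, and only the value names and numbers are taken from it.

```rust
/// Hypothetical Rust mirror of the renamed proto enum. The numbers match
/// the values quoted in the review comment; the type name is assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(i32)]
enum RollbackExpectation {
    NoRollbackAvailable = 0,
    AbRollbackExpected = 1,
    RuntimeRollbackExpected = 2,
}

/// Accepts both the legacy and the SCREAMING_SNAKE_CASE spelling of each
/// value name and maps them to the same variant.
fn from_any_name(name: &str) -> Option<RollbackExpectation> {
    match name {
        "NoRollbackAvailable" | "NO_ROLLBACK_AVAILABLE" => {
            Some(RollbackExpectation::NoRollbackAvailable)
        }
        "AbRollbackExpected" | "AB_ROLLBACK_EXPECTED" => {
            Some(RollbackExpectation::AbRollbackExpected)
        }
        "RuntimeRollbackExpected" | "RUNTIME_ROLLBACK_EXPECTED" => {
            Some(RollbackExpectation::RuntimeRollbackExpected)
        }
        _ => None,
    }
}
```

Whether such a shim is worthwhile depends on whether any client actually exchanges these values by name, which the thread does not say.
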
}
// Parse host status groups, where [provisioned & *finalize & *stage] == "group"
// into operations
for hs in host_statuses.iter() {

Copilot AI Jan 17, 2026

The iteration at line 239 includes the first element that was already processed at lines 224-227. While the logic handles this by skipping Provisioned states with Unknown kind (lines 243-246), it would be clearer and more efficient to skip the first element in the iteration. Consider changing the iteration to start from the second element: for hs in host_statuses.iter().skip(1).

Suggested change
- for hs in host_statuses.iter() {
+ for hs in host_statuses.iter().skip(1) {
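
As a standalone illustration of the pattern in this suggestion, the sketch below handles the first (newest) element on its own and then visits only the remaining elements with `skip(1)`. The function `summarize` and its string inputs are hypothetical and stand in for the real host status handling.

```rust
/// Handles the newest entry separately, then visits only the older entries.
/// `skip(1)` on an empty iterator simply yields nothing, so an empty slice
/// needs no special case.
fn summarize(statuses: &[&str]) -> (Option<String>, Vec<String>) {
    let newest = statuses.first().map(|s| format!("newest:{s}"));
    let older = statuses
        .iter()
        .skip(1) // the first element was already handled above
        .map(|s| format!("older:{s}"))
        .collect();
    (newest, older)
}
```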

    rollback_filters: &mut Vec<OperationKind>,
) -> Result<bool, TridentError> {
    // For ManualRollback operations, do not add to the operation list, and
    // configure ongoing_rollback_operation_type so that subsesquent Update

Copilot AI Jan 17, 2026

Spelling error in comment: "subsesquent" should be "subsequent".

Suggested change
- // configure ongoing_rollback_operation_type so that subsesquent Update
+ // configure ongoing_rollback_operation_type so that subsequent Update

Comment on lines +315 to +338
Ok(Self {
    chain: operation_list
        .iter()
        .map(|op| {
            let from_hs = op
                .clone()
                .from_host_status
                .expect("to_host_status must be present for rollbackable operation");
            ManualRollbackChainItem {
                spec: from_hs.spec.clone(),
                ab_active_volume: from_hs.ab_active_volume,
                install_index: from_hs.install_index,
                kind: match &op.kind {
                    OperationKind::AbUpdate => ManualRollbackKind::Ab,
                    OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
                    kind => panic!(
                        "Unexpected operation kind for rollbackable operation: {:?}",
                        kind
                    ),
                },
            }
        })
        .collect(),
})

Copilot AI Jan 17, 2026

Using expect() here will cause a panic if from_host_status is None. This should return a Result with a proper error instead of panicking in production code. Consider using .ok_or_else() to return a TridentError if from_host_status is None.

Suggested change
- Ok(Self {
-     chain: operation_list
-         .iter()
-         .map(|op| {
-             let from_hs = op
-                 .clone()
-                 .from_host_status
-                 .expect("to_host_status must be present for rollbackable operation");
-             ManualRollbackChainItem {
-                 spec: from_hs.spec.clone(),
-                 ab_active_volume: from_hs.ab_active_volume,
-                 install_index: from_hs.install_index,
-                 kind: match &op.kind {
-                     OperationKind::AbUpdate => ManualRollbackKind::Ab,
-                     OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
-                     kind => panic!(
-                         "Unexpected operation kind for rollbackable operation: {:?}",
-                         kind
-                     ),
-                 },
-             }
-         })
-         .collect(),
- })
+ let mut chain_items: Vec<ManualRollbackChainItem> = Vec::new();
+ for op in operation_list.iter() {
+     let from_hs = match op.from_host_status.clone() {
+         Some(hs) => hs,
+         None => {
+             trace!(
+                 "from_host_status is None for operation {:?}, skipping non-rollbackable operation",
+                 op.kind
+             );
+             continue;
+         }
+     };
+     let item = ManualRollbackChainItem {
+         spec: from_hs.spec.clone(),
+         ab_active_volume: from_hs.ab_active_volume,
+         install_index: from_hs.install_index,
+         kind: match &op.kind {
+             OperationKind::AbUpdate => ManualRollbackKind::Ab,
+             OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
+             kind => panic!(
+                 "Unexpected operation kind for rollbackable operation: {:?}",
+                 kind
+             ),
+         },
+     };
+     chain_items.push(item);
+ }
+ Ok(Self { chain: chain_items })

let from_hs = op
    .clone()
    .from_host_status
    .expect("to_host_status must be present for rollbackable operation");

Copilot AI Jan 17, 2026

The error message says "to_host_status must be present" but the code is actually checking from_host_status. The error message should be corrected to say "from_host_status must be present for rollbackable operation".

Suggested change
- .expect("to_host_status must be present for rollbackable operation");
+ .expect("from_host_status must be present for rollbackable operation");

Comment on lines +315 to +338
Ok(Self {
    chain: operation_list
        .iter()
        .map(|op| {
            let from_hs = op
                .clone()
                .from_host_status
                .expect("to_host_status must be present for rollbackable operation");
            ManualRollbackChainItem {
                spec: from_hs.spec.clone(),
                ab_active_volume: from_hs.ab_active_volume,
                install_index: from_hs.install_index,
                kind: match &op.kind {
                    OperationKind::AbUpdate => ManualRollbackKind::Ab,
                    OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
                    kind => panic!(
                        "Unexpected operation kind for rollbackable operation: {:?}",
                        kind
                    ),
                },
            }
        })
        .collect(),
})

Copilot AI Jan 17, 2026

Using panic! in production code is not appropriate for handling unexpected state. This should return a Result with a TridentError instead. Consider propagating the error up through the call chain or converting this to a proper error return.

Suggested change
- Ok(Self {
-     chain: operation_list
-         .iter()
-         .map(|op| {
-             let from_hs = op
-                 .clone()
-                 .from_host_status
-                 .expect("to_host_status must be present for rollbackable operation");
-             ManualRollbackChainItem {
-                 spec: from_hs.spec.clone(),
-                 ab_active_volume: from_hs.ab_active_volume,
-                 install_index: from_hs.install_index,
-                 kind: match &op.kind {
-                     OperationKind::AbUpdate => ManualRollbackKind::Ab,
-                     OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
-                     kind => panic!(
-                         "Unexpected operation kind for rollbackable operation: {:?}",
-                         kind
-                     ),
-                 },
-             }
-         })
-         .collect(),
- })
+ let mut chain: Vec<ManualRollbackChainItem> = Vec::new();
+ for op in operation_list.iter() {
+     let from_hs = op
+         .clone()
+         .from_host_status
+         .expect("to_host_status must be present for rollbackable operation");
+     let kind = match &op.kind {
+         OperationKind::AbUpdate => ManualRollbackKind::Ab,
+         OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
+         unexpected_kind => {
+             return Err(TridentError::from(Error::msg(format!(
+                 "Unexpected operation kind for rollbackable operation: {:?}",
+                 unexpected_kind
+             ))));
+         }
+     };
+     chain.push(ManualRollbackChainItem {
+         spec: from_hs.spec.clone(),
+         ab_active_volume: from_hs.ab_active_volume,
+         install_index: from_hs.install_index,
+         kind,
+     });
+ }
+ Ok(Self { chain })
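
The two suggestions on lines +315 to +338 each remove one panic path while leaving the other in place: the first keeps `panic!`, and the second keeps `expect()`. A version that returns an error in both cases might look like the sketch below. It is self-contained and therefore uses simplified, hypothetical stand-ins: `ChainError`, `ChainItem`, `build_chain`, and the reduced `HostStatus` and `Operation` structs are assumptions, since the real `TridentError` constructors and struct definitions are not shown in this thread.

```rust
#[derive(Debug, Clone, PartialEq)]
enum OperationKind {
    AbUpdate,
    RuntimeUpdate,
    ManualRollback,
}

#[derive(Debug, Clone, PartialEq)]
enum ManualRollbackKind {
    Ab,
    Runtime,
}

/// Reduced stand-in: only one field of the real host status is modeled.
#[derive(Debug, Clone, PartialEq)]
struct HostStatus {
    install_index: usize,
}

#[derive(Debug, Clone)]
struct Operation {
    kind: OperationKind,
    from_host_status: Option<HostStatus>,
}

#[derive(Debug, PartialEq)]
struct ChainItem {
    install_index: usize,
    kind: ManualRollbackKind,
}

/// Hypothetical error type used in place of TridentError.
#[derive(Debug, PartialEq)]
enum ChainError {
    MissingFromHostStatus(OperationKind),
    UnexpectedKind(OperationKind),
}

/// Builds the chain without panicking: a missing `from_host_status` and an
/// unexpected operation kind both surface as `Err`. Collecting an iterator
/// of `Result` values stops at the first error.
fn build_chain(ops: &[Operation]) -> Result<Vec<ChainItem>, ChainError> {
    ops.iter()
        .map(|op| -> Result<ChainItem, ChainError> {
            let from_hs = op
                .from_host_status
                .as_ref()
                .ok_or_else(|| ChainError::MissingFromHostStatus(op.kind.clone()))?;
            let kind = match &op.kind {
                OperationKind::AbUpdate => ManualRollbackKind::Ab,
                OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime,
                other => return Err(ChainError::UnexpectedKind(other.clone())),
            };
            Ok(ChainItem {
                install_index: from_hs.install_index,
                kind,
            })
        })
        .collect()
}
```

Note the design difference from the first suggestion, which logs and skips an operation with no `from_host_status`. Whether skipping or failing is the right behavior depends on invariants of the operation list that this thread does not spell out.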

Comment on lines +238 to +240
ANY_ROLLBACK = 0;
AB_ROLLBACK = 1;
RUNTIME_ROLLBACK = 2;

Copilot AI Jan 17, 2026

These enum value name changes are breaking API changes. Clients using the old enum values (AnyRollbackRequested, AbRollbackRequested, RuntimeRollbackRequested) will need to update to the new values (ANY_ROLLBACK, AB_ROLLBACK, RUNTIME_ROLLBACK). While SCREAMING_SNAKE_CASE is the standard convention for protobuf enums, ensure that all API clients are updated accordingly, or consider adding a deprecation period if this is a public API.
