-
Notifications
You must be signed in to change notification settings - Fork 15
engineering: simplify and improve resliency of available rollback detection #459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/AzurePipelines run [GITHUB]-trident-pr-e2e |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the manual rollback detection logic to simplify the approach and improve resilience. The refactoring changes the parsing direction from oldest-to-newest to newest-to-oldest, which allows the system to determine available rollbacks more efficiently by requiring fewer HostStatus entries to be parsed. Additionally, the PR adds error handling in the datastore layer to prevent parse failures of unrelated HostStatus entries from blocking rollback detection.
Changes:
- Refactored manual rollback detection to parse from newest to oldest instead of oldest to newest
- Added graceful error handling for HostStatus parsing failures in the datastore
- Updated protobuf enum naming to follow SCREAMING_SNAKE_CASE convention
- Fixed a spelling error in a comment
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| proto/harpoon.proto | Updated enum value names to follow protobuf conventions (breaking API change) |
| crates/trident/src/datastore.rs | Changed parsing order to DESC and added error handling to return None for parse failures |
| crates/trident/src/engine/manual_rollback/utils.rs | Major refactoring: simplified logic with Operation abstraction, reversed parsing direction, removed volume-specific tracking |
| crates/trident/src/engine/storage/rebuild.rs | Fixed spelling in comment: "hasnt" to "hasn't" |
| NO_ROLLBACK_AVAILABLE = 0; | ||
| AB_ROLLBACK_EXPECTED = 1; | ||
| RUNTIME_ROLLBACK_EXPECTED = 2; |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These enum value name changes are breaking API changes. Clients using the old enum values (NoRollbackAvailable, AbRollbackExpected, RuntimeRollbackExpected) will need to update to the new values (NO_ROLLBACK_AVAILABLE, AB_ROLLBACK_EXPECTED, RUNTIME_ROLLBACK_EXPECTED). While SCREAMING_SNAKE_CASE is the standard convention for protobuf enums, ensure that all API clients are updated accordingly, or consider adding a deprecation period if this is a public API.
| } | ||
| // Parse host status groups, where [provisioned & *finalize & *stage] == "group" | ||
| // into operations | ||
| for hs in host_statuses.iter() { |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iteration at line 239 includes the first element that was already processed at lines 224-227. While the logic handles this by skipping Provisioned states with Unknown kind (lines 243-246), it would be clearer and more efficient to skip the first element in the iteration. Consider changing the iteration to start from the second element: for hs in host_statuses.iter().skip(1).
| for hs in host_statuses.iter() { | |
| for hs in host_statuses.iter().skip(1) { |
| rollback_filters: &mut Vec<OperationKind>, | ||
| ) -> Result<bool, TridentError> { | ||
| // For ManualRollback operations, do not add to the operation list, and | ||
| // configure ongoing_rollback_operation_type so that subsesquent Update |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spelling error in comment: "subsesquent" should be "subsequent".
| // configure ongoing_rollback_operation_type so that subsesquent Update | |
| // configure ongoing_rollback_operation_type so that subsequent Update |
| Ok(Self { | ||
| chain: operation_list | ||
| .iter() | ||
| .map(|op| { | ||
| let from_hs = op | ||
| .clone() | ||
| .from_host_status | ||
| .expect("to_host_status must be present for rollbackable operation"); | ||
| ManualRollbackChainItem { | ||
| spec: from_hs.spec.clone(), | ||
| ab_active_volume: from_hs.ab_active_volume, | ||
| install_index: from_hs.install_index, | ||
| kind: match &op.kind { | ||
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | ||
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | ||
| kind => panic!( | ||
| "Unexpected operation kind for rollbackable operation: {:?}", | ||
| kind | ||
| ), | ||
| }, | ||
| } | ||
| }) | ||
| .collect(), | ||
| }) |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using expect() here will cause a panic if from_host_status is None. This should return a Result with a proper error instead of panicking in production code. Consider using .ok_or_else() to return a TridentError if from_host_status is None.
| Ok(Self { | |
| chain: operation_list | |
| .iter() | |
| .map(|op| { | |
| let from_hs = op | |
| .clone() | |
| .from_host_status | |
| .expect("to_host_status must be present for rollbackable operation"); | |
| ManualRollbackChainItem { | |
| spec: from_hs.spec.clone(), | |
| ab_active_volume: from_hs.ab_active_volume, | |
| install_index: from_hs.install_index, | |
| kind: match &op.kind { | |
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | |
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | |
| kind => panic!( | |
| "Unexpected operation kind for rollbackable operation: {:?}", | |
| kind | |
| ), | |
| }, | |
| } | |
| }) | |
| .collect(), | |
| }) | |
| let mut chain_items: Vec<ManualRollbackChainItem> = Vec::new(); | |
| for op in operation_list.iter() { | |
| let from_hs = match op.from_host_status.clone() { | |
| Some(hs) => hs, | |
| None => { | |
| trace!( | |
| "from_host_status is None for operation {:?}, skipping non-rollbackable operation", | |
| op.kind | |
| ); | |
| continue; | |
| } | |
| }; | |
| let item = ManualRollbackChainItem { | |
| spec: from_hs.spec.clone(), | |
| ab_active_volume: from_hs.ab_active_volume, | |
| install_index: from_hs.install_index, | |
| kind: match &op.kind { | |
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | |
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | |
| kind => panic!( | |
| "Unexpected operation kind for rollbackable operation: {:?}", | |
| kind | |
| ), | |
| }, | |
| }; | |
| chain_items.push(item); | |
| } | |
| Ok(Self { chain: chain_items }) |
| let from_hs = op | ||
| .clone() | ||
| .from_host_status | ||
| .expect("to_host_status must be present for rollbackable operation"); |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message says "to_host_status must be present" but the code is actually checking from_host_status. The error message should be corrected to say "from_host_status must be present for rollbackable operation".
| .expect("to_host_status must be present for rollbackable operation"); | |
| .expect("from_host_status must be present for rollbackable operation"); |
| Ok(Self { | ||
| chain: operation_list | ||
| .iter() | ||
| .map(|op| { | ||
| let from_hs = op | ||
| .clone() | ||
| .from_host_status | ||
| .expect("to_host_status must be present for rollbackable operation"); | ||
| ManualRollbackChainItem { | ||
| spec: from_hs.spec.clone(), | ||
| ab_active_volume: from_hs.ab_active_volume, | ||
| install_index: from_hs.install_index, | ||
| kind: match &op.kind { | ||
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | ||
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | ||
| kind => panic!( | ||
| "Unexpected operation kind for rollbackable operation: {:?}", | ||
| kind | ||
| ), | ||
| }, | ||
| } | ||
| }) | ||
| .collect(), | ||
| }) |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using panic! in production code is not appropriate for handling unexpected state. This should return a Result with a TridentError instead. Consider propagating the error up through the call chain or converting this to a proper error return.
| Ok(Self { | |
| chain: operation_list | |
| .iter() | |
| .map(|op| { | |
| let from_hs = op | |
| .clone() | |
| .from_host_status | |
| .expect("to_host_status must be present for rollbackable operation"); | |
| ManualRollbackChainItem { | |
| spec: from_hs.spec.clone(), | |
| ab_active_volume: from_hs.ab_active_volume, | |
| install_index: from_hs.install_index, | |
| kind: match &op.kind { | |
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | |
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | |
| kind => panic!( | |
| "Unexpected operation kind for rollbackable operation: {:?}", | |
| kind | |
| ), | |
| }, | |
| } | |
| }) | |
| .collect(), | |
| }) | |
| let mut chain: Vec<ManualRollbackChainItem> = Vec::new(); | |
| for op in operation_list.iter() { | |
| let from_hs = op | |
| .clone() | |
| .from_host_status | |
| .expect("to_host_status must be present for rollbackable operation"); | |
| let kind = match &op.kind { | |
| OperationKind::AbUpdate => ManualRollbackKind::Ab, | |
| OperationKind::RuntimeUpdate => ManualRollbackKind::Runtime, | |
| unexpected_kind => { | |
| return Err(TridentError::from(Error::msg(format!( | |
| "Unexpected operation kind for rollbackable operation: {:?}", | |
| unexpected_kind | |
| )))); | |
| } | |
| }; | |
| chain.push(ManualRollbackChainItem { | |
| spec: from_hs.spec.clone(), | |
| ab_active_volume: from_hs.ab_active_volume, | |
| install_index: from_hs.install_index, | |
| kind, | |
| }); | |
| } | |
| Ok(Self { chain }) |
| ANY_ROLLBACK = 0; | ||
| AB_ROLLBACK = 1; | ||
| RUNTIME_ROLLBACK = 2; |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These enum value name changes are breaking API changes. Clients using the old enum values (AnyRollbackRequested, AbRollbackRequested, RuntimeRollbackRequested) will need to update to the new values (ANY_ROLLBACK, AB_ROLLBACK, RUNTIME_ROLLBACK). While SCREAMING_SNAKE_CASE is the standard convention for protobuf enums, ensure that all API clients are updated accordingly, or consider adding a deprecation period if this is a public API.
🔍 Description
Available rollback detection is too complicated and hard to understand. It also requires parsing from the initial state.
Find simpler approach that parses from the most recent state, so fewer things must be parsed. Also add handling to HostStatus parsing to keep unrelated (i.e. HostStatus entries that are not required for rollback calculation) parse failures from stopping rollback.