HDDS-14498. Zero Downtime Upgrade Design (ZDU) #9664
| 6. The finalize command is sent to SCM by the admin - this is what is used to switch the cluster to act as the new version. Upon receipt of the finalize command: | ||
| 7. SCM will finalize itself over Ratis, saving the new finalized version. | ||
| 8. It will notify datanodes over the heartbeat to finalize. | ||
| 9. After all healthy datanodes have been finalized, OM can be finalized. To do this, OM will have been polling SCM periodically to see if it should finalize. Only after SCM and all datanodes have been finalized will OM get a “ready to finalize” response from the poll. The OM leader will then send a finalize command over Ratis to all OMs. | ||
| 10. As OM is the entry point to the cluster for external clients, finalizing OM unlocks any new features in the upgraded version. |
These don't render correctly
| 1. Upgrade all SCMs to the new version | ||
| 2. Upgrade Recon to the new version | ||
| 3. Upgrade all Datanodes to the new version | ||
| 4. Upgrade all OMs to the new version | ||
| 5. Upgrade all S3 gateways to the new version |
Could SCM or OM bootstrapping during Step 1 or Step 4 lead to any potential issue? e.g. in a busy cluster, a newly upgraded OM finds itself lagging behind (by a lot), and decides to bootstrap from an older-version OM.
At least DB schema won't be changed until finalization.
This should not cause an issue, because the apparent versions of the components will remain the same in the Ratis ring even as the software is updated. That means components running the newer software version will still write data in a way that older components can understand when bootstrapping (and vice versa). Check out the table around line 107 and the appendix to see how the apparent version moves in lock step for a Ratis ring.
Finalization to move the apparent version forward can be done from a Ratis snapshot because the version is written to the DB as well as the version file. This is already handled in the current upgrade flow because finalization is an online operation.
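To make the software/apparent version distinction concrete, here is a minimal sketch. The `Node` class, its field names, and the integer versions are hypothetical and are not Ozone code; the sketch only illustrates that upgrading the software leaves the apparent version untouched until finalization, so a mixed-software ring still acts as a single version.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one member of a Ratis ring (not Ozone code)."""
    software_version: int  # version of the installed binaries
    apparent_version: int  # version the node acts as when writing data

    def upgrade_software(self, new_version: int) -> None:
        # Installing new binaries does not change how the node writes data.
        self.software_version = new_version

    def finalize(self) -> None:
        # Finalization moves the apparent version up to the software version.
        self.apparent_version = self.software_version


def ring_is_compatible(ring: list[Node]) -> bool:
    # Assumption: members can bootstrap from each other as long as they
    # all act as the same apparent version.
    return len({n.apparent_version for n in ring}) == 1


ring = [Node(1, 1), Node(1, 1), Node(1, 1)]
ring[0].upgrade_software(2)  # mixed software versions mid-upgrade
mixed_ok = ring_is_compatible(ring)
```

In this model a lagging member that bootstraps from a peer on older software still sees data written at the shared apparent version.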
| Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period. | ||
| DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either. |
| DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either. | |
| Disk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either. |
| 3. Upgrade all Datanodes to the new version | ||
| 4. Upgrade all OMs to the new version |
Is it correct to assume that decommissioned datanodes would be ignored during the upgrade in Step 3? If so, what if they are recommissioned later (say after Step 4)?
The upgrade/restart steps are done by an admin, possibly with an orchestration layer, so Ozone doesn't decide whether or not the decom nodes get upgraded. If they are upgraded, nothing about the decom/maintenance/recom process is expected to change, since ZDU means all existing operations are allowed throughout the upgrade and finalization process.
Starting on line 227 we spec out how datanodes are handled relative to SCM, including the case where they are offline and come back later. Let me know if there are more questions in that area. Note that once SCM is finalized, any datanodes that later appear with the old software version will be fenced out until the admin upgrades them.
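As a rough illustration of the fencing rule described above, the check could look like the sketch below. The function name, parameters, and integer version encoding are all hypothetical and not taken from the design doc; only the rule itself (old-software datanodes are fenced once SCM is finalized) comes from this thread.

```python
def should_fence_datanode(scm_finalized: bool,
                          scm_version: int,
                          dn_software_version: int) -> bool:
    """Hypothetical check SCM might apply when a datanode registers.

    Before SCM finalizes, old-software datanodes are still accepted,
    which is what lets the rolling upgrade proceed node by node.
    """
    if not scm_finalized:
        return False
    # After finalization, a datanode running older software is fenced
    # until the admin upgrades it.
    return dn_software_version < scm_version
```

For example, a recommissioned datanode still on the old software would be accepted while SCM is pre-finalized and fenced afterwards.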
The doc currently doesn't specify whether nodes undergoing decom or maintenance will be instructed to finalize by SCM. I think we should still send them the finalize commands so they don't block further upgrade steps unnecessarily. @sodonnel what do you think?
| During the upgrade, the cluster’s fault tolerance will not change. As nodes are being restarted with the new versions, we still require 2 OMs and SCMs active at all times to remain available. If any nodes fail to start in the new version, our existing fault tolerance accounts for this. The node should be brought online either by resolving the issue or downgrading it before others are restarted. Note that all nodes must be running the newest software version for finalization to begin, but the cluster remains fully operational with existing features until then. | ||
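The availability requirement above can be expressed as a small pre-restart check. This is only a sketch: it assumes a three-member ring with the usual Raft majority quorum, and the function names are hypothetical, not part of Ozone or any orchestration tool.

```python
def majority(ring_size: int) -> int:
    # Raft-style quorum: more than half of the ring members.
    return ring_size // 2 + 1


def safe_to_restart_one(ring_size: int, active_members: int) -> bool:
    """True if taking one more member down still leaves a quorum."""
    return active_members - 1 >= majority(ring_size)
```

With three OMs, restarting one node is safe only while all three are up; if one has already failed to start in the new version, it should be brought back online (fixed or downgraded) before the next restart.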
| Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period. |
> ZDU can take several days
Then it would be nice to have a dashboard for this (not just in CM). It can show how much time has been taken for each component, how much overall progress has been made, estimated time remaining for each component and overall. New metrics can be added as needed.
Definitely. This was in my head but I realized there was no Jira. I filed HDDS-14825. We should be able to do this with just the software and apparent version metrics. Finalization status can be derived from that by the dashboard.
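A dashboard could derive progress from those two metrics along the lines of the sketch below. The input shape (a list of `(software_version, apparent_version)` pairs per node) and the function name are assumptions made for illustration; the actual metric names are not specified in this thread.

```python
def upgrade_progress(nodes: list[tuple[int, int]],
                     target_version: int) -> dict[str, float]:
    """Summarize upgrade state from per-node version metrics.

    Each entry in `nodes` is a hypothetical
    (software_version, apparent_version) pair for one node.
    """
    if not nodes:
        return {"upgraded": 0.0, "finalized": 0.0}
    upgraded = sum(1 for software, _ in nodes if software == target_version)
    finalized = sum(1 for _, apparent in nodes if apparent == target_version)
    return {
        "upgraded": upgraded / len(nodes),
        "finalized": finalized / len(nodes),
    }


progress = upgrade_progress([(2, 1), (2, 1), (1, 1), (2, 2)], target_version=2)
```

Time-based estimates would need timestamps in addition to these two values.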
This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.
I have updated the doc based on two design flaws that were exposed during development:

1. If the finalize command is sent only to SCM, OM cannot reliably learn when to finalize.
   There may be releases which only increase the OM component version. In this case SCM will be automatically finalized on startup. If OM only polls SCM and finalizes automatically when SCM's software and apparent versions match, it would incorrectly finalize on startup without an admin command, preventing downgrade. Additionally, the finalize and status commands cannot use one CLI to contact both OM and SCM, since they are usually configured with different Kerberos principals.
   The updated upgrade finalize and status commands now go to OM only and pass through to SCM. OM writes a marker to its DB to indicate that a finalize command was given and that it should start polling SCM for HDDS finalization status. The order of component finalization (SCM -> DNs -> OM) remains the same.
2. Missing handling of mixed software datanodes during container replication.
   The previous design only accounted for datanodes with different apparent versions but identical software versions. A Datanode client/source in the new software version pushing a container to a Datanode server/target with an old software version would still have compatibility issues.
   In the new design, the lowest apparent version to use on the replication path is provided by SCM, similar to what is done on the write path. This negates the need for Datanodes to finalize automatically when contacted by a newer peer.
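The first fix can be summarized as a two-condition gate. The sketch below is illustrative only: the function and parameter names are hypothetical, and it models the marker and the poll result as plain booleans rather than DB or RPC state.

```python
def om_should_finalize(finalize_marker_in_db: bool,
                       hdds_finalized: bool) -> bool:
    """Hypothetical OM-side decision after polling SCM.

    finalize_marker_in_db: the admin issued a finalize command to OM.
    hdds_finalized: SCM reports that it and the datanodes are finalized.
    """
    # Both conditions are required. Polling alone is not enough, because
    # SCM may already be finalized when a release only bumps the OM version.
    return finalize_marker_in_db and hdds_finalized


# OM-only release: SCM is finalized at startup, but no admin command yet.
auto_finalize_on_startup = om_should_finalize(False, True)
```

Requiring the marker keeps OM pre-finalized, and therefore downgradable, until the admin explicitly asks for finalization.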
What changes were proposed in this pull request?
This is a design document for Ozone Zero Downtime Upgrade (ZDU).
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14498
How was this patch tested?
N/A