HDDS-14498. Zero Downtime Upgrade Design (ZDU) #9664

Open
sodonnel wants to merge 16 commits into apache:master from sodonnel:HDDS-14498-zdu-design

@sodonnel
Contributor

What changes were proposed in this pull request?

This is a design document for Ozone Zero Downtime Upgrade (ZDU).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14498

How was this patch tested?

N/A

@smengcl smengcl self-requested a review January 27, 2026 19:43
Comment thread hadoop-hdds/docs/content/design/zdu-design.md
Comment on lines +171 to +175
6. The finalize command is sent to SCM by the admin - this is what is used to switch the cluster to act as the new version. Upon receipt of the finalize command:
7. SCM will finalize itself over Ratis, saving the new finalized version.
8. It will notify datanodes over the heartbeat to finalize.
9. After all healthy datanodes have been finalized, OM can be finalized. To do this, OM will have been polling SCM periodically to see if it should finalize. Only after SCM and all datanodes have been finalized will OM get a “ready to finalize” response from the poll. The OM leader will then send a finalize command over Ratis to all OMs.
10. As OM is the entry point to the cluster for external clients, finalizing OM unlocks any new features in the upgraded version.
Contributor

These don't render correctly

@errose28 errose28 added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Feb 5, 2026
Contributor

@smengcl smengcl left a comment

Thanks @sodonnel @errose28 @fapifta for the design.

Comment on lines +39 to +43
1. Upgrade all SCMs to the new version
2. Upgrade Recon to the new version
3. Upgrade all Datanodes to the new version
4. Upgrade all OMs to the new version
5. Upgrade all S3 gateways to the new version
Contributor

Could SCM or OM bootstrapping during Step 1 or Step 4 lead to any potential issue? e.g. in a busy cluster, a newly upgraded OM finds itself lagging behind (by a lot), and decides to bootstrap from an older-version OM.

At least DB schema won't be changed until finalization.

Contributor

This should not cause an issue, because the apparent versions of the components will remain the same in the Ratis ring even as the software is updated. That means the components with the newer software version will still write data in a way that the older components bootstrapping can understand (and vice versa). Check out the table around line 107 and the appendix to see how the apparent version moves in lock step for a Ratis ring.

Finalization to move the apparent version forward can be done from a Ratis snapshot because the version is written to the DB as well as the version file. This is already handled in the current upgrade flow because finalization is an online operation.
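
To make the lock-step behaviour concrete, here is a minimal sketch. It is not Ozone code: the `Member` class, the function names, and the use of plain integers for versions are all assumptions made for illustration. It only models the two rules described in this thread: every member of a ring writes with the shared apparent version, and that version moves forward only once all members run the new software.

```python
from dataclasses import dataclass


@dataclass
class Member:
    # Hypothetical model of one OM or SCM in a Ratis ring.
    name: str
    software_version: int  # version of the installed binaries
    apparent_version: int  # version the member currently behaves as


def ring_write_version(ring):
    """Return the single apparent version all members write with."""
    versions = {m.apparent_version for m in ring}
    if len(versions) != 1:
        raise ValueError("apparent versions in a ring must match")
    return versions.pop()


def can_finalize(ring, target):
    """Finalization may begin only once every member runs the target software."""
    return all(m.software_version == target for m in ring)


def finalize(ring, target):
    """Move the apparent version of every member forward together."""
    if not can_finalize(ring, target):
        raise RuntimeError("some members still run older software")
    for m in ring:
        m.apparent_version = target


# Rolling upgrade in progress: om1 has new software, all still act as version 1.
ring = [Member("om1", 2, 1), Member("om2", 1, 1), Member("om3", 1, 1)]
```

In this model a lagging member that bootstraps from `om2` or `om3` still receives data written at version 1, regardless of which members have already had their software replaced.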


Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period.

DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.
Contributor

Suggested change
DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.
Disk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.

Comment on lines +41 to +42
3. Upgrade all Datanodes to the new version
4. Upgrade all OMs to the new version
Contributor

Is it correct to assume that decommissioned datanodes would be ignored during the upgrade in Step 3? If so, what if they are recommissioned later (say after Step 4)?

Contributor

The upgrade/restart steps are done by an admin, possibly with an orchestration layer, so Ozone doesn't decide whether or not the decom nodes get upgraded. If they do, nothing about the decom/maintenance/recom process is expected to change, since ZDU means all existing operations are allowed throughout the upgrade and finalization process.

Starting on line 227 we spec out how datanodes are handled relative to SCM, which includes the case where they are offline and come back later. Let me know if there are more questions in that area. Note that once SCM is finalized, any datanodes that later appear with the old software version will be fenced out until the admin upgrades them.

The doc currently doesn't specify whether nodes undergoing decom or maintenance will be instructed to finalize by SCM. I think we should still send them the finalize commands so they don't block further upgrade steps unnecessarily. @sodonnel what do you think?
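
The fencing rule mentioned above can be sketched as a small predicate. This is an illustration under stated assumptions rather than the SCM implementation: the function names are hypothetical, versions are modelled as integers, and "SCM is finalized" is taken to mean that its software and apparent versions match.

```python
def scm_is_finalized(scm_software, scm_apparent):
    """Assumption: SCM counts as finalized when both versions match."""
    return scm_software == scm_apparent


def admit_datanode(scm_software, scm_apparent, dn_software):
    """Return True if the datanode may join, False if it is fenced out."""
    if scm_is_finalized(scm_software, scm_apparent) and dn_software < scm_software:
        # Old software showing up after finalization: fence until upgraded.
        return False
    return True
```

Under this model a recommissioned datanode that still runs old software is accepted while SCM is pre-finalized (`admit_datanode(2, 1, 1)`) and rejected after finalization (`admit_datanode(2, 2, 1)`).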


During the upgrade, the cluster’s fault tolerance will not change. As nodes are being restarted with the new versions, we still require 2 OMs and SCMs active at all times to remain available. If any nodes fail to start in the new version, our existing fault tolerance accounts for this. The node should be brought online either by resolving the issue or downgrading it before others are restarted. Note that all nodes must be running the newest software version for finalization to begin, but the cluster remains fully operational with existing features until then.

Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period.
Contributor

ZDU can take several days

Then it would be nice to have a dashboard for this (not just in CM). It can show how much time has been taken for each component, how much overall progress has been made, estimated time remaining for each component and overall. New metrics can be added as needed.

Contributor

Definitely. This was in my head but I realized there was no Jira. I filed HDDS-14825. We should be able to do this with just the software and apparent version metrics. Finalization status can be derived from that by the dashboard.
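
As a rough sketch of how a dashboard could derive progress from those two metrics alone: the metric layout, the status labels, and the function names below are assumptions for illustration, not existing Ozone metrics or APIs.

```python
def status(software, apparent):
    """Derive a per-component status from the two version metrics."""
    if apparent > software:
        raise ValueError("apparent version cannot exceed software version")
    return "FINALIZED" if apparent == software else "PRE_FINALIZED"


def summarize(metrics, target):
    """Summarize upgrade progress towards the target version.

    metrics maps a component name to (software_version, apparent_version).
    """
    upgraded = sum(1 for sw, _ in metrics.values() if sw == target)
    finalized = sum(
        1 for sw, ap in metrics.values() if sw == target and ap == target
    )
    return {"total": len(metrics), "upgraded": upgraded, "finalized": finalized}


# Hypothetical snapshot taken midway through an upgrade to version 2.
metrics = {"scm1": (2, 2), "dn1": (2, 1), "dn2": (1, 1), "om1": (2, 1)}
```

Elapsed and estimated remaining time would need timestamps in addition to these two values, which this sketch does not model.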

@github-actions

github-actions Bot commented Apr 2, 2026

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

@github-actions github-actions Bot added the stale label Apr 2, 2026
@errose28 errose28 removed the stale label Apr 2, 2026
@github-actions github-actions Bot added the stale label Apr 24, 2026
@errose28 errose28 removed the stale label Apr 24, 2026
@github-actions github-actions Bot added the stale label May 31, 2026
@errose28 errose28 removed the stale label Jun 1, 2026
@errose28
Contributor

errose28 commented Jun 2, 2026

I have updated the doc based on two design flaws that were exposed during development:

If the finalize command is sent only to SCM, OM cannot reliably learn when to finalize

There may be releases which only increase the OM component version. In this case SCM will be automatically finalized on startup. If OM only polls SCM and finalizes automatically when SCM's software and apparent versions match, it would incorrectly finalize on startup without an admin command, preventing downgrade. Additionally, the finalize and status commands cannot use one CLI to contact both OM and SCM, since they are usually configured with different Kerberos principals. The updated upgrade finalize and status commands now go to OM only and pass through to SCM. OM writes a marker to its DB to indicate that a finalize command was given and that it should start polling SCM for HDDS finalization status. The order of component finalization (SCM -> DNs -> OM) remains the same.
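
A minimal sketch of the marker-gated decision described above, assuming a hypothetical `OmUpgradeState` class in which an in-memory flag stands in for the marker that OM persists to its DB. It shows why an HDDS layer that is already finalized cannot, by itself, trigger OM finalization.

```python
class OmUpgradeState:
    """Hypothetical model of OM's finalization state."""

    def __init__(self):
        self.finalize_requested = False  # stands in for the DB marker
        self.finalized = False

    def on_admin_finalize(self):
        """Admin sent the finalize command to OM: record the marker."""
        self.finalize_requested = True

    def poll(self, hdds_finalized):
        """Called periodically with SCM's answer on HDDS finalization."""
        if self.finalize_requested and hdds_finalized:
            self.finalized = True
        return self.finalized


# OM-only release: SCM reports finalized right after startup.
om = OmUpgradeState()
before_command = om.poll(hdds_finalized=True)   # no marker yet, stays pre-finalized
om.on_admin_finalize()
after_command = om.poll(hdds_finalized=True)    # marker present, OM finalizes
```

Without the marker check, the first `poll` would have finalized OM immediately and removed the option to downgrade.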

Missing handling of mixed software datanodes during container replication

The previous design only accounted for datanodes with different apparent versions but identical software versions. A Datanode client/source in the new software version pushing a container to a Datanode server/target with an old software version would still have compatibility issues. In the new design the lowest apparent version to use on the replication path is provided by SCM, similar to what is done on the write path. This negates the need for Datanodes to finalize automatically when contacted by a newer peer.
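
One way to picture the replication rule, as a sketch only: the comment states that SCM provides the lowest apparent version to use, but not how it is computed. The version below assumes SCM takes the minimum of the apparent versions reported by the two datanodes involved; the function name and the dictionary layout are hypothetical.

```python
def replication_version(apparent_versions, source, target):
    """Pick the version a container push should use.

    apparent_versions maps datanode id to its reported apparent version.
    Assumption: SCM picks the lower of the two peers' versions.
    """
    return min(apparent_versions[source], apparent_versions[target])


# dn1 is upgraded and finalized, dn2 still behaves as version 1.
reported = {"dn1": 2, "dn2": 1}
version_for_push = replication_version(reported, "dn1", "dn2")
```

Because both peers are told which version to use, neither has to finalize itself in reaction to being contacted by a newer peer.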

Labels

design zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496

6 participants