HDDS-14498. Zero Downtime Upgrade Design (ZDU) by sodonnel · Pull Request #9664 · apache/ozone

sodonnel · 2026-01-23T18:54:00Z

What changes were proposed in this pull request?

This is a design document for Ozone Zero Downtime Upgrade (ZDU).

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14498

How was this patch tested?

N/A

ptlrs · 2026-01-29T02:28:56Z

+6. The finalize command is sent to SCM by the admin - this is what is used to switch the cluster to act as the new version. Upon receipt of the finalize command:  
+   7. SCM will finalize itself over Ratis, saving the new finalized version.
+   8. It will notify datanodes over the heartbeat to finalize.
+   9. After all healthy datanodes have been finalized, OM can be finalized. To do this, OM will have been polling SCM periodically to see if it should finalize. Only after SCM and all datanodes have been finalized will OM get a “ready to finalize” response from the poll. The OM leader will then send a finalize command over Ratis to all OMs.
+   10. As OM is the entry point to the cluster for external clients, finalizing OM unlocks any new features in the upgraded version.


These don't render correctly

smengcl

Thanks @sodonnel @errose28 @fapifta for the design.

smengcl · 2026-03-08T20:44:57Z

+1. Upgrade all SCMs to the new version  
+2. Upgrade Recon to the new version  
+3. Upgrade all Datanodes to the new version  
+4. Upgrade all OMs to the new version  
+5. Upgrade all S3 gateways to the new version


Could SCM or OM bootstrapping during Step 1 or Step 4 lead to any potential issue? e.g. in a busy cluster, a newly upgraded OM finds itself lagging behind (by a lot), and decides to bootstrap from an older-version OM.

At least DB schema won't be changed until finalization.

This should not cause an issue, because the apparent versions of the components will remain the same in the Ratis ring even as the software is updated. That means the components with the newer software version will still write data in a way that the older components bootstrapping can understand (and vice versa). Check out the table around line 107 and the appendix to see how the apparent version moves in lock step for a Ratis ring.

Finalization to move the apparent version forward can be done from a Ratis snapshot because the version is written to the DB as well as the version file. This is already handled in the current upgrade flow because finalization is an online operation.
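To make the software version versus apparent version distinction concrete, here is a minimal sketch of a ring whose members upgrade their binaries one at a time while continuing to write at the old apparent version. The `Member` class and the `write_version` and `finalize` helpers are hypothetical names for illustration only, not Ozone APIs, and the in-memory loop stands in for a finalize command replicated over Ratis.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One member of a Ratis ring (for example, an OM). Hypothetical model."""
    name: str
    software_version: int  # version of the installed binaries
    apparent_version: int  # version the member behaves as; persisted in its DB


def write_version(ring):
    """Version used for data written through the ring.

    Every member must agree, so an old member bootstrapping from a new one
    (or the reverse) can read what it receives.
    """
    versions = {m.apparent_version for m in ring}
    if len(versions) != 1:
        raise ValueError(f"apparent versions diverged: {sorted(versions)}")
    return versions.pop()


def finalize(ring):
    """Move the whole ring forward together (stand-in for a replicated step)."""
    targets = {m.software_version for m in ring}
    if len(targets) != 1:
        raise ValueError("all members must run the same software before finalizing")
    target = targets.pop()
    for m in ring:
        m.apparent_version = target


ring = [Member("om1", 1, 1), Member("om2", 1, 1), Member("om3", 1, 1)]
ring[0].software_version = 2         # rolling restart: om1 is upgraded first
mixed_version = write_version(ring)  # still the old version, so bootstrap is safe
for m in ring[1:]:
    m.software_version = 2
finalize(ring)
final_version = write_version(ring)
```

In this model, a lagging member can bootstrap from any peer during the rolling restart because `write_version` does not change until `finalize` runs.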

smengcl · 2026-03-08T20:45:38Z

+
+Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period.
+
+DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.


Suggested change

DIsk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.

Disk and datanode balancing could be safely suspended if required. For disk balancing, the process is all within the same datanode process, so mixed component versions are not a concern. Cross node balancing uses the container replication mechanism internally, and we would not gain much by pausing it during upgrades either.

smengcl · 2026-03-08T20:54:29Z

+3. Upgrade all Datanodes to the new version  
+4. Upgrade all OMs to the new version  


Is it correct to assume that decommissioned datanodes would be ignored during the upgrade in Step 3? If so, what if they are recommissioned later (say after Step 4)?

These upgrade/restart steps are done by an admin, possibly with an orchestration layer, so Ozone doesn't decide whether or not the decom nodes get upgraded. If they do, nothing about the decom/maintenance/recom process is expected to change, though, since ZDU means all existing operations are allowed throughout the upgrade and finalization process.

Starting on line 227 we spec out how datanodes are handled relative to SCM, which includes if they are offline and come back later. Let me know if there are more questions in that area. Note that once SCM is finalized, any datanodes that later appear with the old software version will be fenced out until the admin upgrades them.
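As a rough illustration of the fencing rule, the check below admits a datanode only if its software version is at least SCM's apparent version. The function name and the integer versions are assumptions made for this sketch; the design doc, not this snippet, defines the actual registration behaviour.

```python
def scm_admits_datanode(scm_apparent_version, dn_software_version):
    """Return True if SCM would let this datanode join (hypothetical check).

    Before finalization SCM's apparent version is still the old one, so
    datanodes on either old or new software pass. After finalization, a
    datanode still on old software is fenced until an admin upgrades it.
    Decommission or maintenance state plays no part in this check.
    """
    return dn_software_version >= scm_apparent_version


old, new = 1, 2
during_upgrade = scm_admits_datanode(scm_apparent_version=old, dn_software_version=old)
late_old_node = scm_admits_datanode(scm_apparent_version=new, dn_software_version=old)
late_new_node = scm_admits_datanode(scm_apparent_version=new, dn_software_version=new)
```

A decommissioned node that is recommissioned after finalization would fall into the `late_old_node` case only if the admin skipped it during the software upgrade.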

The doc currently doesn't specify whether nodes undergoing decom or maintenance will be instructed to finalize by SCM. I think we should still send them the finalize commands so they don't block further upgrade steps unnecessarily. @sodonnel what do you think?

smengcl · 2026-03-08T21:00:35Z

+
+During the upgrade, the cluster’s fault tolerance will not change. As nodes are being restarted with the new versions, we still require 2 OMs and SCMs active at all times to remain available. If any nodes fail to start in the new version, our existing fault tolerance accounts for this. The node should be brought online either by resolving the issue or downgrading it before others are restarted. Note that all nodes must be running the newest software version for finalization to begin, but the cluster remains fully operational with existing features until then.
+
+Initially, this design considered pausing some background operations to remove risk during upgrade. Snapshots are an area with complex storage requirements that must be mirrored across the OMs. However a ZDU can take several days and removing the ability to take or delete snapshots during that time would impact backup and disaster recovery schedules which would not be acceptable. Similarly block deletion was considered and similar concerns were uncovered around freeing space on clusters with capacity issues. It would also not be wise to suspend replication for an extended period.


ZDU can take several days

Then it would be nice to have a dashboard for this (not just in CM). It can show how much time has been taken for each component, how much overall progress has been made, estimated time remaining for each component and overall. New metrics can be added as needed.

Definitely. This was in my head but I realized there was no Jira. I filed HDDS-14825. We should be able to do this with just the software and apparent version metrics. Finalization status can be derived from that by the dashboard.
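A minimal sketch of how a dashboard could derive per-component status from only those two metrics, assuming it also knows the target version of the upgrade. The status labels and function names are hypothetical and not taken from HDDS-14825.

```python
def component_status(software_version, apparent_version, target_version):
    """Classify one component from its two version metrics."""
    if software_version < target_version:
        return "NOT_UPGRADED"   # still running the old binaries
    if apparent_version < target_version:
        return "UPGRADED"       # new binaries, not yet finalized
    return "FINALIZED"


def progress(metrics, target_version):
    """Count components per status; metrics maps name -> (software, apparent)."""
    counts = {"NOT_UPGRADED": 0, "UPGRADED": 0, "FINALIZED": 0}
    for software_version, apparent_version in metrics.values():
        status = component_status(software_version, apparent_version, target_version)
        counts[status] += 1
    return counts


sample = {"scm1": (2, 2), "dn1": (2, 1), "dn2": (1, 1), "om1": (1, 1)}
summary = progress(sample, target_version=2)
```

Time taken and time remaining per component would need timestamps in addition to these two metrics.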

github-actions · 2026-04-02T00:09:57Z

This PR has been marked as stale due to 21 days of inactivity. Please comment or remove the stale label to keep it open. Otherwise, it will be automatically closed in 7 days.

errose28 · 2026-06-02T21:47:19Z

I have updated the doc based on two design flaws that were exposed during development:

If the finalize command is sent only to SCM, OM cannot reliably learn when to finalize

There may be releases which only increase the OM component version. In this case SCM will be automatically finalized on startup. If OM only polls SCM and finalizes automatically when SCM's software and apparent versions match, it would incorrectly finalize on startup without an admin command, preventing downgrade. Additionally, the finalize and status commands cannot use one CLI to contact both OM and SCM, since they are usually configured with different Kerberos principals. The updated upgrade finalize and status commands now go to OM only and pass through to SCM. OM writes a marker to its DB to indicate that a finalize command was given and that it should start polling SCM for HDDS finalization status. The order of component finalization (SCM -> DNs -> OM) remains the same.
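The marker-gated flow can be sketched as follows. `Scm`, `Om`, and their methods are hypothetical stand-ins, and the sketch collapses SCM and datanode finalization into a single flag and omits the OM leader replicating the finalize step over Ratis.

```python
class Scm:
    """Stand-in for SCM; only tracks whether HDDS (SCM + datanodes) is finalized."""

    def __init__(self, hdds_finalized):
        self.hdds_finalized = hdds_finalized

    def finalize(self):
        # Simplification: the real flow finalizes SCM over Ratis and then
        # waits for healthy datanodes to finalize via heartbeats.
        self.hdds_finalized = True


class Om:
    def __init__(self, scm, software_version, apparent_version):
        self.scm = scm
        self.software_version = software_version
        self.apparent_version = apparent_version
        self.finalize_requested = False  # models the marker persisted in the OM DB

    def handle_finalize_command(self):
        """Admin entry point: record the marker, then pass through to SCM."""
        self.finalize_requested = True
        self.scm.finalize()

    def poll_scm(self):
        """Periodic check; finalizes OM only after an explicit admin command."""
        if self.finalize_requested and self.scm.hdds_finalized:
            self.apparent_version = self.software_version
            self.finalize_requested = False
        return self.apparent_version


# Release that only bumps the OM version: SCM is already finalized at startup.
scm = Scm(hdds_finalized=True)
om = Om(scm, software_version=2, apparent_version=1)
before_command = om.poll_scm()  # no marker, so OM stays at 1 and can be downgraded
om.handle_finalize_command()
after_command = om.poll_scm()   # marker set and HDDS finalized, so OM moves to 2
```

Without the marker, the first `poll_scm` call would have finalized OM immediately, which is the flaw described above.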

Missing handling of mixed software datanodes during container replication

The previous design only accounted for datanodes with different apparent versions but identical software versions. A Datanode client/source in the new software version pushing a container to a Datanode server/target with an old software version would still have compatibility issues. In the new design the lowest apparent version to use on the replication path is provided by SCM, similar to what is done on the write path. This negates the need for Datanodes to finalize automatically when contacted by a newer peer.
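One way to picture this: SCM picks the version a replication command should use from the apparent versions datanodes have reported to it. The sketch assumes SCM takes the minimum over the source and target of the command; the doc may define the scope differently (for example, cluster-wide), and all names here are hypothetical.

```python
def replication_version(reported_apparent_versions, source, target):
    """Version SCM would attach to a replication command (assumed rule).

    Using the lower of the two apparent versions means a datanode on new
    software never pushes a container in a form its peer cannot read.
    """
    return min(reported_apparent_versions[source], reported_apparent_versions[target])


# dn1 has finalized; dn2 is upgraded or old but still reports the old apparent version.
reported = {"dn1": 2, "dn2": 1, "dn3": 2}
mixed_pair = replication_version(reported, source="dn1", target="dn2")
finalized_pair = replication_version(reported, source="dn1", target="dn3")
```

Because the version travels with the command, neither datanode has to infer or change its own apparent version when contacted by a newer peer.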

S O'Donnell and others added 7 commits January 23, 2026 18:51

HDDS-14498. Zero Downtime Upgrade Design (ZDU)

82e8206

add image as png rather than embedded

46d6ced

Fix image path

d2990ea

Fix headers

8805065

Add more specifics to coomponent version section

606075b

Add appendix with table and update existing tables

7aa0bb4

Rat + front matter

d175224

errose28 added the design label Jan 26, 2026

errose28 mentioned this pull request Jan 26, 2026

HDDS-14298. [Website v2] Ozone Enhancement Proposals apache/ozone-site#188

Open

3 tasks

errose28 added 6 commits January 26, 2026 14:43

Fix missing column and formatting

3e362a3

Add HDDSVersion requirements and migration

52d4fe6

Remove trailing whitespace

3c1939a

Add new section on changes to existing framework

b802eca

Why did Obsidian delete the front matter

d5c61ee

Authors as list

5a62cca

smengcl self-requested a review January 27, 2026 19:43

jojochuang reviewed Jan 28, 2026

Comment thread hadoop-hdds/docs/content/design/zdu-design.md

Add Datanode upgrade step to appendix

b72a527

ptlrs reviewed Jan 29, 2026

errose28 added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Feb 5, 2026

smengcl reviewed Mar 8, 2026

github-actions Bot added the stale label Apr 2, 2026

errose28 removed the stale label Apr 2, 2026

github-actions Bot added the stale label Apr 24, 2026

errose28 removed the stale label Apr 24, 2026

github-actions Bot added the stale label May 31, 2026

errose28 removed the stale label Jun 1, 2026

errose28 added 2 commits June 2, 2026 17:28

Update finalize flow, ratis version checks, and DN replication versio…

21929cb

…n passing

Fix typos

ed46cd1


6 participants