publish strategy for DNS Failover or workload Migration #1081

Open
philbrookes opened this issue Dec 13, 2024 · 3 comments

@philbrookes
Contributor

philbrookes commented Dec 13, 2024

Is blocked by: Kuadrant/dns-operator#390

Why

Currently there is no way to ask the DNS Operator to publish or unpublish a DNS Record only when a certain level of redundancy is encountered. This means that gracefully removing a DNS Record requires manual intervention and an understanding of the internal workings of the DNS Operator.

Some Example Use Cases

  • DNS Failover (rapidly switch to alternative DNS Configuration) to a secondary site
  • Workload migration (removing workload from one cluster in favour of a new cluster)
  • Extra clusters during periods of high load

What

Add an optional publishStrategy to the DNS Policy CRD, which will allow an administrator to define rules that, when met, instruct the DNS Operator to publish or unpublish the records from the zone and set a condition in the status.

How

Diagram

https://miro.com/app/board/uXjVL32kOMY=/

Kuadrant operator changes

The DNS Policy and DNS Record CRDs will have a new field added to their spec:

publishStrategy:
  rule: <syntax to be confirmed>
  republish: true|false (default false)

This is read by the kuadrant-operator and propagated into any relevant DNS Records.

When the DNS Operator acts on these instructions it will set a condition in the DNS Record.

This condition will be propagated back into the relevant DNS Policy.
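
To make the two directions of propagation concrete, here is a minimal Python sketch (the operators themselves are not written in Python, this is only a model). The class names, the `conditions` dictionary, and the condition type `Published` are assumptions made for illustration; the proposal has not fixed any of them, and the rule syntax is still to be confirmed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PublishStrategy:
    rule: str                # opaque here: the rule syntax is still to be confirmed
    republish: bool = False  # default false, as in the proposed spec


@dataclass
class DNSRecord:
    name: str
    publish_strategy: Optional[PublishStrategy] = None
    conditions: dict = field(default_factory=dict)


@dataclass
class DNSPolicy:
    name: str
    publish_strategy: Optional[PublishStrategy] = None
    conditions: dict = field(default_factory=dict)


def propagate_strategy(policy, records):
    """Policy -> records: copy the publishStrategy into each relevant DNS Record."""
    for record in records:
        record.publish_strategy = policy.publish_strategy


def propagate_conditions(policy, records, condition_type="Published"):
    """Records -> policy: surface each record's condition on the policy.

    "Published" is a hypothetical condition type used only for this sketch.
    """
    policy.conditions[condition_type] = {
        r.name: r.conditions.get(condition_type) for r in records
    }
```

A record that has not yet been reconciled by the DNS Operator simply shows up with no value for the condition in the policy status.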

DNS Operator Changes

The DNS Operator will read the publishStrategy from the DNS Record on reconcile and, based on its values, interrogate the zone to see whether the publish rule is met. If it is, the operator will publish the records; if not, it will ensure the records are unpublished. In either case it will update the condition in the DNS Record status to reflect the decision.

If the strategy sets republish to true then, for as long as the DNS Record exists, the DNS Operator will republish these records whenever the count of unowned leaf records drops below the resiliency requirement.
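
The reconcile decision could be modelled roughly as follows. This is a Python sketch under stated assumptions, not the operator's implementation: the rule is represented as a plain callable over the unowned records because its syntax is unconfirmed, `ever_unpublished` is hypothetical bookkeeping, and the reading that `republish: false` keeps withdrawn records out of the zone is my interpretation of the text above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ZoneRecord:
    owner: str
    healthy: bool = True


@dataclass
class RecordState:
    owner: str
    republish: bool = False
    published: bool = False
    ever_unpublished: bool = False  # hypothetical bookkeeping for this sketch


# A rule takes the records NOT owned by this cluster and says whether to publish.
Rule = Callable[[Sequence[ZoneRecord]], bool]


def fewer_unowned_than(n: int) -> Rule:
    """Example rule: publish while fewer than n unowned records are in the zone."""
    return lambda unowned: len(unowned) < n


def reconcile(state: RecordState, zone: Sequence[ZoneRecord], rule: Rule) -> bool:
    """Decide the desired published state and record it on `state`."""
    unowned = [r for r in zone if r.owner != state.owner]
    rule_met = rule(unowned)
    if rule_met and state.ever_unpublished and not state.republish:
        # Assumed semantics: with republish false, withdrawn records stay withdrawn.
        desired = False
    else:
        desired = rule_met
    if state.published and not desired:
        state.ever_unpublished = True
    state.published = desired
    return desired
```

With `fewer_unowned_than(1)` and `republish` left at its default, a cluster publishes while it is alone, withdraws once another cluster's records appear, and does not come back if those records later disappear. Setting `republish` to true makes the last step publish again.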

Use cases expanded

DNS Failover

To enact DNS Failover with this config, the rule for publishing could be set to "when all other records are marked as unhealthy".

Example

Cluster 1 publishing strategy is always publish
Cluster 2 publishing strategy is: "when number of active records unhealthy >= n"

  • Cluster 1 is currently published and healthy and cluster 2 has no published records.
  • An event occurs that causes the workload to begin malfunctioning on cluster 1.
  • All the records for cluster 1 are marked as unhealthy in the registry (but not removed as they are the only records available)
  • cluster 2 reconciles and sees that all the records currently in the zone are unhealthy; as this satisfies its publishing rule, it publishes its records
  • cluster 1 reconciles, sees there are records other than its own, and so unpublishes its own records for being unhealthy
  • eventually cluster 1 is healthy again and republishes its records
  • cluster 2 sees records in the zone that are healthy and unpublishes its own records.
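
The sequence above can be replayed with a small Python simulation. It is only a sketch: the zone is reduced to a dictionary mapping an owner to the health of its published records, the two function names are made up for illustration, and the standby rule is taken to be "no other record in the zone is healthy" (which is also true when there are no other records at all).

```python
def reconcile_primary(zone, owner, healthy):
    """Always-publish strategy: withdraw only if unhealthy AND other records exist."""
    others = {o for o in zone if o != owner}
    if healthy or not others:
        # Stay published; if unhealthy and alone, the records are kept but flagged.
        zone[owner] = healthy
    else:
        zone.pop(owner, None)


def reconcile_standby(zone, owner, healthy=True):
    """Failover strategy: publish only while no other record in the zone is healthy."""
    others_health = [h for o, h in zone.items() if o != owner]
    if not any(others_health):
        zone[owner] = healthy
    else:
        zone.pop(owner, None)


zone = {}
reconcile_primary(zone, "cluster-1", healthy=True)   # cluster 1 published and healthy
reconcile_standby(zone, "cluster-2")                 # cluster 2 stays unpublished
steady_state = dict(zone)

reconcile_primary(zone, "cluster-1", healthy=False)  # workload malfunctions, records kept
reconcile_standby(zone, "cluster-2")                 # rule met, cluster 2 publishes
reconcile_primary(zone, "cluster-1", healthy=False)  # others exist, cluster 1 withdraws
failed_over = dict(zone)

reconcile_primary(zone, "cluster-1", healthy=True)   # cluster 1 recovers and republishes
reconcile_standby(zone, "cluster-2")                 # healthy records seen, cluster 2 withdraws
recovered = dict(zone)
```

Note that the outcome depends on both clusters reconciling after each change; between reconciles the zone can briefly contain both record sets.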

Workload migration

Cluster 1 has a workload that needs to be migrated to cluster 2.

  • workload is created on cluster 2
  • publishing strategy on cluster 1 is set to: "no other records exist" and republish false
  • records created by cluster 2
  • cluster 1 sees other records exist and unpublishes its records from the zone
  • admin sees that the status update on the DNS Policy in cluster 1 (all records removed from zone) happened more than one TTL ago
  • admin can safely remove the workload from cluster 1.
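
The two checks in this flow, the "no other records exist" rule and the wait of more than one TTL, could look like the Python sketch below. The function names and the idea of reading the unpublish time from the condition's transition timestamp are assumptions for illustration. The TTL wait matters because resolvers may keep serving the old, cached answer for up to one TTL after the records leave the zone.

```python
from datetime import datetime, timedelta, timezone


def no_other_records_exist(zone_owners, owner):
    """Publishing rule for cluster 1: true only if every record in the zone is ours."""
    return all(o == owner for o in zone_owners)


def safe_to_remove_workload(unpublished_at, ttl_seconds, now=None):
    """True once more than one TTL has passed since the records left the zone.

    `unpublished_at` is assumed to come from the time the (hypothetical)
    unpublished condition was set on the DNS Policy status.
    """
    now = now or datetime.now(timezone.utc)
    return now - unpublished_at > timedelta(seconds=ttl_seconds)
```

For example, with a 300 second TTL the admin would wait until at least five minutes after the status update before removing the workload from cluster 1.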

Extra clusters during high load

This case would require that the rule is able to query metrics, which is not confirmed yet.

Cluster 1 has the workload and publishes always
Cluster 2 has the workload and has a publishing rule: when requests per minute > x.

@Boomatang
Contributor

Is there an RFC for this work?

@Boomatang Boomatang added the RFC required Requires an RFC to back it up label Jan 9, 2025
@maleck13
Collaborator

related to Kuadrant/dns-operator#356

@philbrookes philbrookes changed the title from "Graceful unpublish DNS Records" to "publish strategy for DNS Failover or workload Migration" Jan 27, 2025
@maleck13 maleck13 removed the RFC required Requires an RFC to back it up label Jan 28, 2025
@maleck13 maleck13 moved this to Todo in Kuadrant Jan 28, 2025
@maleck13 maleck13 added the RFC required Requires an RFC to back it up label Jan 28, 2025
@maleck13
Collaborator

@philbrookes I feel like this is an overarching epic that has migration and failover as features within it, WDYT?

Perhaps, naming-wise, "DNSRecord publish criteria", with two features under it: 1) DNS Failover and 2) Workload Migration?

@trepel trepel self-assigned this Mar 3, 2025