Skip to content

Add 3-node HA + cloud-ui in TF + local-fast SC for DC-Controlplane#102

Open
HiranAdikari wants to merge 7 commits into
wso2:mainfrom
HiranAdikari:feature/dc-controlplane-3node-ha
Open

Add 3-node HA + cloud-ui in TF + local-fast SC for DC-Controlplane#102
HiranAdikari wants to merge 7 commits into
wso2:mainfrom
HiranAdikari:feature/dc-controlplane-3node-ha

Conversation

@HiranAdikari
Copy link
Copy Markdown
Contributor

Summary

Three related changes to the dc-controlplane + dc-controlplane-services
modules, plus a small k8s-cluster change to support per-pool SC overrides.

1. 3-node HA dc-controlplane (2 commits)

  • dc-controlplane module now provisions 3 control-plane VMs.
  • kube-vip runs as a static pod on each control-plane node, exposing
    the apiserver via an ARP-announced VIP.
  • Static pod template at dc-controlplane/templates/kube-vip-rke2.yaml.tftpl.
  • Node cloud-init template at dc-controlplane/templates/node-cloud-init.yaml.tftpl
    pulls in the kube-vip manifest at bootstrap.
  • Kube-vip pod renamed to avoid colliding with the existing DaemonSet on
    Harvester host clusters that ship their own kube-vip.

2. cloud-ui in Terraform (2 commits)

  • dc-controlplane-services module now creates the cloud-ui
    Deployment + Service, with optional Ingress (cloud_ui_ingress_enabled).
  • The Deployment's image field is lifecycle.ignored so CI's
    kubectl set image rollouts don't fight Terraform.

3. Local-fast StorageClass for control-plane disks (3 commits)

  • New create_local_storage_class flag (default true) in
    dc-controlplane. When set, the module creates a 1-replica,
    best-effort-local Longhorn SC on the Harvester host cluster and
    threads it through each pool's storage_class_name.
  • Also bakes a VirtualMachineImage with that SC pinned. Required
    because Harvester ignores storageClassName on cloned PVCs — they
    silently inherit the source image's SC.
  • Production sites with adequate host IOPS can flip the flag off to
    recover full replicated SC for control-plane root disks.

Test plan

  • terraform fmt -recursive -check clean
  • terraform validate clean across affected modules
  • Running in a 3-node Harvester dev environment for several
    weeks — control-plane Ready, kube-vip serving the apiserver VIP,
    cloud-ui Deployment shape matches the TF, control-plane root disks
    on the local-fast SC.

Risks

  • Default behaviour change of create_local_storage_class=true affects
    any environment consuming this module on Harvester host storage
    without configured vm_storage_class_override. Operators upgrading
    should review the flag.
  • kube-vip pod rename is a forced replacement — bootstrap loses the
    apiserver VIP for ~30s during the rolling restart of control-plane
    nodes. Plan a maintenance window if applying to a live HA cluster.

🤖 Generated with Claude Code

HiranAdikari and others added 7 commits May 19, 2026 12:31
Replace the list-passthrough machine_pools input with a per-node-count
interface (node_count, node_ips, sizing, vips). The module now
generates one machine_pool per node, each with quantity=1, stable
names node1/node2/node3, and its own cloud-init that pins enp1s0 to
the matching static IP from node_ips. node_count is validated to
1/3/5 so callers cannot accidentally pick a count that cannot form an
etcd quorum.

A kube-vip static-pod manifest is rendered once at module level (ARP
mode, leader-elected) and shipped via cloud-init write_files to
/var/lib/rancher/rke2/server/manifests/kube-vip.yaml on every server
node. The API VIP is added automatically to the apiserver tls-san so
clients reaching the cluster through kube-vip see a cert-valid name;
operators can add more SANs via tls_san_extra. The API VIP is NOT
allocated from the Harvester IPPool — it is reserved out-of-band and
claimed by kube-vip via ARP. The IPPool keeps a single ingress VIP
range.

To support per-node user_data without duplicating the k8s-cluster
module, the machine_pools list entry gains an optional user_data
field. When set on a pool it overrides the module-level user_data
for that pool only; existing single-user_data callers are unaffected.

Adds api_vip, ingress_vip, and node_ips outputs so consumer layers
can build /etc/hosts entries, kubeconfigs, and Rancher proxy URLs
without re-deriving the values.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The harvester-cloud-provider Helm chart that RKE2 installs automatically
on harvester clusters ships its own kube-vip DaemonSet in kube-system,
configured for Service-type LoadBalancer announcements (cp_enable=false,
svc_enable=true). Naming our apiserver-VIP static pod kube-vip too is
fine technically (different name suffixes) but turns kubectl listings
into "which kube-vip am I looking at" guessing games.

Rename to kube-vip-apiserver so the role is obvious at a glance. Adds
a short comment to the manifest explaining the split. No behaviour
change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cloud-ui Deployment + Service are applied out-of-band by the
cloud-ui GitHub workflow. The matching Ingress was applied the same
way, which means a fresh cluster lost cloud-ui routing until the
next workflow run — easy to miss during a planned rebuild.

Move just the Ingress (not the Deployment/Service) into this module
so TF restores routing as part of the cluster bring-up; the Service
the Ingress references appears when the workflow next runs.

Opt-in: cloudui_hostname empty string means no Ingress is created,
so existing consumers without a cloud-ui are unaffected. When set,
the Ingress reuses the dc-api self-signed cert via SAN — the
existing wildcard in ingress_additional_dns_names already covers
sibling hostnames in the same env, so one browser cert warning
covers both UIs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds kubernetes_deployment.cloud_ui and kubernetes_service.cloud_ui,
gated on a new cloudui_image variable (empty = skip). Mirrors the
dc-api dual-ownership pattern: TF owns the Deployment shape, CI's
kubectl set image rolls the image, lifecycle ignores image so the
two don't fight.

Before this, a cluster destroy+recreate left cloud-ui completely
missing until the next push to the cloud-ui workflow. With the
Deployment in TF the rebuild brings cloud-ui up immediately with
whatever cloudui_image is set to (default :latest), and CI takes
over rolling forward on the next push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
On slow / overloaded host storage, the default 2-replica Longhorn
StorageClass replicates every etcd fsync across the network, which is
the bottleneck that causes 3-node etcd bootstrap to time out and
single-node steady-state to hang. The fix is to put the control-plane
VM root disks on a 1-replica, strict-local Longhorn volume so fsync
becomes a local-disk call. The application-layer etcd quorum is what
makes the loss of disk-level HA acceptable.

dc-controlplane module:
  - Creates a kubernetes_storage_class_v1 on the Harvester host
    cluster (numberOfReplicas=1, dataLocality=strict-local) using
    the existing kubernetes.harvester provider alias. Gated on a new
    create_local_storage_class flag, default true.
  - vm_storage_class_override lets operators point at a pre-existing
    custom SC if they prefer.
  - Plumbs the resolved SC name into each generated machine_pools
    entry's storage_class_name field.

k8s-cluster module:
  - machine_pools entries gain an optional storage_class_name. When
    set, it's threaded into the disk_info JSON as storageClassName,
    which Harvester honours per-disk. When unset, the disk_info JSON
    omits the field entirely — preserves the original behaviour
    (use the host cluster's default SC).

In-cluster PVCs (e.g. dc-postgres) are unaffected — those continue to
use the downstream cluster's default SC, which the Harvester CSI
driver translates to the host's default replicated SC.

Production environments with adequate host IOPS should set
create_local_storage_class = false to recover full per-disk
replication for the control-plane VM root disks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harvester's validator webhook rejects StorageClasses with
dataLocality=strict-local because it would forbid VM live-migration
of any workload using that SC, which conflicts with Harvester's
operational model. Switch to best-effort: Longhorn still tries to
place the (single) replica on the same Harvester host as the
consuming VM, but allows migration when capacity demands it.

For our control-plane use case the practical effect is identical:
at volume creation Longhorn picks the local host and stays there;
fsyncs are local. The only behavioural difference would be the
fallback when local capacity is exhausted — strict-local refuses
to place at all, best-effort places remotely as a last resort.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Harvester's VM provisioner ignores the storageClassName field that
HarvesterConfig.disk_info advertises — boot-disk PVCs are cloned
from the source image's PVC and silently inherit the SOURCE image's
StorageClass. So even though our pools requested
dcapi-controlplane-local, the cloned PVCs ended up on
longhorn-ubuntu-22-04 (the SC the upstream Ubuntu image was
imported with) — full 2-replica replicated writes, no fsync win.

The fix: when create_local_storage_class is true, the module now
also creates its OWN VirtualMachineImage in the project namespace,
downloaded from var.node_image_url (default upstream Ubuntu 22.04
cloud image), with storageClassName pinned to the local SC.
Clones from this image inherit dcapi-controlplane-local — finally
giving us single-replica strict-local fsyncs.

node_image_name becomes optional. When the module creates its own
image, that image's reference flows into local.effective_image_name
and gets passed to each pool. node_image_name is only honored when
create_local_storage_class is false (the prod path that uses a
pre-existing image, typically from a bootstrap layer). A
precondition fails fast if the operator turns off local SC without
supplying an image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR introduces cloud-ui workload support and performs a comprehensive refactor of the control-plane module to enable HA RKE2 cluster provisioning with static node configuration. The cloud-ui module conditionally deploys a hardened nginx container, service, and ingress when configuration variables are provided. The control-plane refactor replaces dynamic machine-pool configuration with per-node static provisioning via cloud-init, introduces kube-vip for API server and ingress VIP announcement, adds optional Longhorn-backed local storage classes for VM root disks, and extends the machine-pool schema to support per-pool cloud-init and storage-class overrides. All changes are additive or refactoring; no existing resources are removed.

Suggested reviewers

  • gnudeep
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description provides a clear summary of changes but does not follow the repository's required template structure with sections like Purpose, Goals, Approach, User stories, Release note, Documentation, etc. Restructure the description to match the template: add Purpose (with issue links), Goals, Approach, User stories, Release note, Documentation, Training, Certification, Marketing, test sections, and security checks.
✅ Passed checks (4 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately summarizes the three main changes: 3-node HA control plane setup, cloud-ui Terraform management, and local-fast StorageClass configuration for DC-Controlplane.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/management/dc-controlplane-services/main.tf`:
- Around line 825-827: The TLS config may not include var.cloudui_hostname in
the certificate SANs, causing mTLS/hostname validation failures; update the
certificate creation so the generated cert covers the cloud-ui hostname by
adding var.cloudui_hostname to the certificate's subjectAltNames/hosts list used
to create kubernetes_secret_v1.dc_api_tls (or the cert-manager Certificate
resource) and ensure the tls block continues to reference
kubernetes_secret_v1.dc_api_tls.metadata[0].name so the ingress uses the secret
containing the cert that includes var.cloudui_hostname.

In `@modules/management/dc-controlplane-services/variables.tf`:
- Around line 64-67: The variable description for cloudui_hostname is out of
date: it still states the cloud-ui Deployment and Service are deployed
separately, but the module now creates them when cloudui_image is set. Update
the description of variable "cloudui_hostname" to state that when non-empty the
module creates an Ingress that routes the hostname to a Service named "cloud-ui"
in the dc-system namespace and that if cloudui_image is provided the module will
also create the cloud-ui Deployment and Service (or set to empty string to opt
out).

In `@modules/management/dc-controlplane/main.tf`:
- Around line 83-127: The module pins the created Harvester image to
var.local_storage_class_name instead of the resolved
local.vm_storage_class_name, so a vm_storage_class_override is ignored; update
the harvester_image.dcapi_node resource to set storage_class_name =
local.vm_storage_class_name, and keep the locals.effective_image_name logic
as-is so the module will prefer the created image when present (ensuring the
image you create is actually pinned to the resolved storage class).

In `@modules/management/dc-controlplane/variables.tf`:
- Around line 193-223: Update the documentation and variable descriptions to
match the actual StorageClass type used (best-effort) instead of saying
strict-local: change the module header comment and the description fields for
variable "create_local_storage_class" and "local_storage_class_name" (and any
nearby comments that mention strict-local) to state that the created Longhorn
StorageClass uses best-effort topology/migration semantics and explain the
resulting placement/migration behavior and tradeoffs; keep
"vm_storage_class_override" behavior note (empty string defers to
create_local_storage_class) but ensure its description does not reference
strict-local either.
- Around line 34-52: Add a validation that enforces node_count matches the
length of node_ips so provisioning can't silently mismatch sizes: in the
variable "node_ips" validation (or as an additional validation on "node_count")
add a condition requiring length(var.node_ips) == var.node_count and provide a
clear error_message like "length(node_ips) must equal node_count"; reference the
variables node_count and node_ips when adding the check so the module fails fast
on mismatched values.

In `@modules/workloads/k8s-cluster/main.tf`:
- Around line 118-122: The user_data assignment currently treats whitespace-only
strings as non-empty; update the conditional checks that use
try(each.value.user_data, "") and try(each.value.user_data, null) to apply
trimspace(...) before comparing to "" (e.g., trimspace(try(each.value.user_data,
"")) != "") so whitespace-only overrides are treated as empty and fall back to
local.effective_node_user_data; make the same change to the other similar block
that uses try(each.value.node_user_data, "") so both override paths use
trimspace checks.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7052de2a-efb4-4d6c-be83-5463e1382656

📥 Commits

Reviewing files that changed from the base of the PR and between fdddf35 and 5e10815.

📒 Files selected for processing (9)
  • modules/management/dc-controlplane-services/main.tf
  • modules/management/dc-controlplane-services/variables.tf
  • modules/management/dc-controlplane/main.tf
  • modules/management/dc-controlplane/outputs.tf
  • modules/management/dc-controlplane/templates/kube-vip-rke2.yaml.tftpl
  • modules/management/dc-controlplane/templates/node-cloud-init.yaml.tftpl
  • modules/management/dc-controlplane/variables.tf
  • modules/workloads/k8s-cluster/main.tf
  • modules/workloads/k8s-cluster/variables.tf

Comment on lines +825 to +827
tls {
hosts = [var.cloudui_hostname]
secret_name = kubernetes_secret_v1.dc_api_tls.metadata[0].name
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ensure cloud-ui TLS hostname is covered by the generated certificate.

Line 826 uses var.cloudui_hostname, but cert SANs are not automatically guaranteed to include it. This can cause certificate mismatch when cloud-ui ingress is enabled.

Suggested fix
 resource "tls_self_signed_cert" "dc_api" {
@@
-  dns_names = concat(
-    [var.dcapi_hostname],
-    var.ingress_additional_dns_names,
-  )
+  dns_names = distinct(compact(concat(
+    [var.dcapi_hostname],
+    var.cloudui_hostname != "" ? [var.cloudui_hostname] : [],
+    var.ingress_additional_dns_names,
+  )))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/management/dc-controlplane-services/main.tf` around lines 825 - 827,
The TLS config may not include var.cloudui_hostname in the certificate SANs,
causing mTLS/hostname validation failures; update the certificate creation so
the generated cert covers the cloud-ui hostname by adding var.cloudui_hostname
to the certificate's subjectAltNames/hosts list used to create
kubernetes_secret_v1.dc_api_tls (or the cert-manager Certificate resource) and
ensure the tls block continues to reference
kubernetes_secret_v1.dc_api_tls.metadata[0].name so the ingress uses the secret
containing the cert that includes var.cloudui_hostname.

Comment on lines +64 to +67
variable "cloudui_hostname" {
type = string
description = "Public hostname for the cloud-ui ingress. When non-empty, this module creates an Ingress (and re-uses the dc-api self-signed cert via SAN) that routes the hostname to a Service named 'cloud-ui' in the dc-system namespace. The cloud-ui Deployment + Service themselves are deployed separately (out-of-band CI workflow); leaving the Ingress here means a cluster rebuild restores routing as soon as the workflow drops the Service back in. Set to empty string to opt out."
default = ""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update cloudui_hostname description to match current module behavior.

The text says cloud-ui Deployment/Service are deployed separately, but this module now creates them when cloudui_image is set. Please align the description so consumers don’t misconfigure ownership expectations.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/management/dc-controlplane-services/variables.tf` around lines 64 -
67, The variable description for cloudui_hostname is out of date: it still
states the cloud-ui Deployment and Service are deployed separately, but the
module now creates them when cloudui_image is set. Update the description of
variable "cloudui_hostname" to state that when non-empty the module creates an
Ingress that routes the hostname to a Service named "cloud-ui" in the dc-system
namespace and that if cloudui_image is provided the module will also create the
cloud-ui Deployment and Service (or set to empty string to opt out).

Comment on lines +83 to +127
# Resolved SC name for the VM root disks. Empty string = let Harvester use
# the host cluster's default StorageClass (the original module behaviour).
locals {
vm_storage_class_name = (
var.vm_storage_class_override != "" ? var.vm_storage_class_override :
var.create_local_storage_class ? var.local_storage_class_name :
""
)
}

# ── VirtualMachineImage on the local SC ──────────────────────────────────────
# Harvester's VM provisioner ignores the storageClassName field passed via
# HarvesterConfig.disk_info — it clones the boot disk from the source image's
# PVC and the cloned PVC inherits the SOURCE image's StorageClass, not the
# requested one. To actually land VM root disks on dcapi-controlplane-local
# we have to give the source image itself that SC. So when the operator opts
# into the local SC, the module also creates its own VirtualMachineImage in
# the project namespace, downloaded from the same upstream URL, but with
# storageClassName pinned. VMs cloned from this image inherit the local SC.
resource "harvester_image" "dcapi_node" {
count = var.create_local_storage_class ? 1 : 0

name = "dcapi-controlplane-ubuntu"
namespace = var.project_name
display_name = var.node_image_display_name
source_type = "download"
url = var.node_image_url

storage_class_name = var.local_storage_class_name

depends_on = [
module.project,
kubernetes_storage_class_v1.vm_local,
]
}

# Resolved image reference passed to each machine_pool. When the module
# created its own image, use that. Otherwise the operator must supply
# node_image_name themselves (typically from a bootstrap layer's output).
locals {
effective_image_name = (
var.create_local_storage_class && length(harvester_image.dcapi_node) > 0
? "${var.project_name}/${harvester_image.dcapi_node[0].name}"
: var.node_image_name
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor vm_storage_class_override when selecting the source image.

vm_storage_class_override wins in local.vm_storage_class_name, but the generated harvester_image is still pinned to var.local_storage_class_name, and effective_image_name always prefers that image while create_local_storage_class = true. Because the boot disk inherits the source image StorageClass, the override is ignored on the path where it matters most.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/management/dc-controlplane/main.tf` around lines 83 - 127, The module
pins the created Harvester image to var.local_storage_class_name instead of the
resolved local.vm_storage_class_name, so a vm_storage_class_override is ignored;
update the harvester_image.dcapi_node resource to set storage_class_name =
local.vm_storage_class_name, and keep the locals.effective_image_name logic
as-is so the module will prefer the created image when present (ensuring the
image you create is actually pinned to the resolved storage class).

Comment on lines +34 to +52
variable "node_count" {
type = number
description = "Number of control-plane nodes. Must be 1, 3 or 5 to form an etcd quorum. 3 is the recommended HA default."
default = 3

validation {
condition = contains([1, 3, 5], var.node_count)
error_message = "node_count must be 1, 3 or 5 (etcd quorum requirement)."
}
}

variable "node_ips" {
type = list(string)
description = "Per-node static management IPs. List length must equal node_count. Order is significant — index N maps to pool 'node{N+1}'."

validation {
condition = length(var.node_ips) == length(distinct(var.node_ips))
error_message = "node_ips entries must be unique."
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Enforce node_count against node_ips.

The module builds per-node cloud-init and machine pools from var.node_ips, so node_count currently does not affect provisioning. A mismatch here can silently create the wrong control-plane size. Add a validation or precondition that requires length(node_ips) == node_count.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/management/dc-controlplane/variables.tf` around lines 34 - 52, Add a
validation that enforces node_count matches the length of node_ips so
provisioning can't silently mismatch sizes: in the variable "node_ips"
validation (or as an additional validation on "node_count") add a condition
requiring length(var.node_ips) == var.node_count and provide a clear
error_message like "length(node_ips) must equal node_count"; reference the
variables node_count and node_ips when adding the check so the module fails fast
on mismatched values.

Comment on lines +193 to +223
# ── VM root-disk storage ─────────────────────────────────────────────────────
# By default, this module creates a single-replica, strict-local Longhorn
# StorageClass on the Harvester host cluster and points the control-plane VM
# root disks at it. The fsync-latency drop versus the default 2-replica SC is
# 5-10× — necessary for etcd quorum to bootstrap and stay healthy on slow or
# overloaded host storage. The trade is that losing the Harvester host where
# a VM lives loses that VM's root disk; etcd quorum at the application layer
# is what compensates.
#
# For environments with fast underlying storage (good NVMe + spare IOPS),
# set create_local_storage_class = false to fall back to the host cluster's
# default StorageClass and recover full per-disk replication.
#
# In-cluster PVCs (e.g. dc-postgres) are unaffected — they continue to use
# the downstream cluster's default StorageClass, which the Harvester CSI
# driver translates to the host's default (replicated) SC.
variable "create_local_storage_class" {
type = bool
description = "When true, create a single-replica strict-local Longhorn StorageClass on the Harvester host cluster for the control-plane VM root disks. Recommended for dev / slow-disk environments where etcd fsync latency is the bottleneck. Set false in production when host storage has adequate IOPS for the default 2-replica SC."
default = true
}

variable "local_storage_class_name" {
type = string
description = "Name of the local-fast StorageClass created on the Harvester host cluster when create_local_storage_class = true."
default = "dcapi-controlplane-local"
}

variable "vm_storage_class_override" {
type = string
description = "Explicit StorageClass name for the control-plane VM root disks. Takes precedence over create_local_storage_class — use this to point at a pre-existing custom SC on the Harvester cluster. Empty string defers to the create_local_storage_class logic."
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align the storage-class description with the implementation.

These comments and the create_local_storage_class description still say strict-local, but the resource now uses best-effort. Keeping the wording stale will mislead operators about the expected placement and migration behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/management/dc-controlplane/variables.tf` around lines 193 - 223,
Update the documentation and variable descriptions to match the actual
StorageClass type used (best-effort) instead of saying strict-local: change the
module header comment and the description fields for variable
"create_local_storage_class" and "local_storage_class_name" (and any nearby
comments that mention strict-local) to state that the created Longhorn
StorageClass uses best-effort topology/migration semantics and explain the
resulting placement/migration behavior and tradeoffs; keep
"vm_storage_class_override" behavior note (empty string defers to
create_local_storage_class) but ensure its description does not reference
strict-local either.

Comment on lines +118 to +122
user_data = (
try(each.value.user_data, null) != null && try(each.value.user_data, "") != ""
? each.value.user_data
: local.effective_node_user_data
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Treat whitespace-only overrides as empty values.

The checks on Line 119 and Line 134 only compare against "", so whitespace-only values are treated as valid overrides. Use trimspace(...) != "" to match the documented “non-empty” behavior.

Proposed fix
     user_data = (
-      try(each.value.user_data, null) != null && try(each.value.user_data, "") != ""
+      try(each.value.user_data, null) != null && trimspace(try(each.value.user_data, "")) != ""
       ? each.value.user_data
       : local.effective_node_user_data
     )
@@
-        try(each.value.storage_class_name, null) != null && try(each.value.storage_class_name, "") != "" ? {
+        try(each.value.storage_class_name, null) != null && trimspace(try(each.value.storage_class_name, "")) != "" ? {
           storageClassName = each.value.storage_class_name
         } : {}

Also applies to: 125-137

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/workloads/k8s-cluster/main.tf` around lines 118 - 122, The user_data
assignment currently treats whitespace-only strings as non-empty; update the
conditional checks that use try(each.value.user_data, "") and
try(each.value.user_data, null) to apply trimspace(...) before comparing to ""
(e.g., trimspace(try(each.value.user_data, "")) != "") so whitespace-only
overrides are treated as empty and fall back to local.effective_node_user_data;
make the same change to the other similar block that uses
try(each.value.node_user_data, "") so both override paths use trimspace checks.

@HiranAdikari HiranAdikari changed the title feat(dc-controlplane): 3-node HA + cloud-ui in TF + local-fast SC 3-node HA + cloud-ui in TF + local-fast SC May 22, 2026
@HiranAdikari HiranAdikari changed the title 3-node HA + cloud-ui in TF + local-fast SC Add 3-node HA + cloud-ui in TF + local-fast SC for DC-Controlplane May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant