Merged
67 changes: 67 additions & 0 deletions CHANGELOG.rst
@@ -6,6 +6,73 @@ cozystack.installer Release Notes
Unreleased
==========

Ubuntu Secure Boot: pre-install drbd-dkms from LINBIT PPA.

- ``examples/ubuntu/prepare-ubuntu.yml`` now installs ``drbd-dkms``
from the LINBIT PPA on Ubuntu hosts and configures
``options drbd usermode_helper=disabled`` via
``/etc/modprobe.d/cozystack-drbd.conf``. On hosts where UEFI
Secure Boot is enabled (most bare-metal installs and Secure-Boot
cloud SKUs), kernel lockdown rejects the unsigned modules built
by piraeus-operator's in-cluster compile path
(``Key was rejected by service``). With drbd-dkms installed,
dkms+shim signs the module against a per-host MOK key and
piraeus-operator's loader auto-detects host-loaded DRBD and
exits cleanly.
- ``drbd-dkms`` Depends on ``drbd-utils (>= 9.28.0)``, so the userspace
is pulled onto the host as a transitive apt dependency. It is
unused at runtime — piraeus-operator's satellite container ships
its own copy and invokes that one. The playbook runs
``systemctl mask drbd.service`` on the host so the userspace
cannot be enabled by accident and race the satellite.
- The ``/etc/modprobe.d/`` drop-in is written BEFORE drbd-dkms is
installed so any auto-modprobe triggered by a future package
postinst loads the module with ``usermode_helper=disabled`` —
without that param, piraeus-operator's loader die()s on the
host-loaded module.
- The initial ``modprobe drbd`` is tolerated (``ignore_errors: true``)
because the dkms-generated MOK key is not enrolled until the
operator confirms it via the shim console on the next reboot.
Persistence in ``/etc/modules-load.d/`` is gated on a successful
modprobe so ``systemd-modules-load.service`` does not fail every
boot before MOK enrollment. A reminder task fires when the modprobe
is deferred, pointing at the enrollment step.
- New opt-out variable ``cozystack_enable_drbd_dkms`` (default
``true``) for Talos hosts or operators who deliberately use the
in-cluster compile path. New variable ``cozystack_drbd_ppa``
(default ``ppa:linbit/linbit-drbd9-stack``) for sites that mirror
the LINBIT archive internally — overridable from inventory (the
default is in the task's ``| default(...)`` filter, not in
play-level ``vars:`` where it would outrank inventory).
- Automated only on Ubuntu releases LINBIT keeps current — Jammy
(22.04) and Noble (24.04) as of 2026. Interim non-LTS releases

medium

Hardcoding a future year like '2026' can make the documentation quickly become outdated. Consider rephrasing to be more evergreen, for example: '...as of this writing.' or simply stating the releases without a time-based qualifier.

Contributor Author

The "as of 2026" anchor is intentional. The supported-series list is time-dependent — LINBIT adds new Ubuntu LTS codenames as they become current and drops older ones. The temporal anchor signals to a reader who lands on this CHANGELOG in 2028 that the list reflects state at writing and they should check the LINBIT PPA detail page for the current set. Removing the year would actually make the line less informative for future readers, since they would have no way to tell whether the list is current or stale. Same rationale applies to the related lines in README.md and the playbook comment.

(Oracular 24.10, Plucky 25.04) and the next LTS before LINBIT
publishes for it are skipped with a notice. Gating is by release
name (``ansible_distribution_release``), not version number, because
LINBIT's PPA is keyed by release name and version-based gates
silently leak interim releases. The supported list is exposed as
``cozystack_drbd_supported_releases`` (default ``[jammy, noble]``)
so operators can extend it from inventory once LINBIT publishes
for a new series. Debian, RHEL, and SUSE are not automated either —
LINBIT does not publish a Debian PPA, and the RHEL/SUSE flow needs
a different repo plus pre-signed kmods.

Fix: tolerated-modprobe pattern previously silenced its own gates.

- The pre-existing ``Load ZFS kernel module now`` and
``Enable multipathd service`` tasks used ``failed_when: false`` to
tolerate failures. Ansible's ``failed_when`` is evaluated after the
module returns and rewrites the registered variable's ``failed``
attribute to match — so every downstream gate of the form
``when: _cozystack_X.failed | default(false)`` was permanently
False. The persistence drop-in for ZFS was written even when
modprobe failed (which then crashed
``systemd-modules-load.service`` every boot), and the multipathd
warn task never fired. Switched to ``ignore_errors: true``, which
lets the module's outcome through to the registered variable while
still tolerating the failure for play-execution purposes. Same fix
applied to the new DRBD modprobe task.

Ubuntu 26.04 LTS support and namespace adoption.

- ``examples/ubuntu/`` now boots end-to-end on Ubuntu 26.04 LTS. Two
41 changes: 36 additions & 5 deletions CLAUDE.md
@@ -71,9 +71,16 @@ Shallow "looks right" research will be rejected in plan review.
- Kernel modules on nodes:
- Drop a file in `/etc/modules-load.d/cozystack*.conf` for persistence.
- `community.general.modprobe` for immediate load.
- Use `failed_when: false` only when the failure is benign (module
built into kernel, only one of two vendor-specific modules applies);
document the reason inline.
- **`failed_when: false` rewrites the registered variable's `failed`
field to `False`.** That breaks every downstream gate of the form
`when: _var.failed | default(false)`. Use `failed_when: false`
ONLY when nothing reads the registered variable (e.g., handler
tasks, or modprobe tasks gated by a follow-up `stat` on
`/sys/module/<name>` instead of by `.failed`). When the registered
variable IS consulted by a downstream `when:`, use
`ignore_errors: true` instead — it preserves the module's actual
`failed` flag while still tolerating the failure for play
execution.
- Kernel headers: Piraeus builds DRBD against the running kernel, not
the staged one, so pin to the running kernel when the distro allows.
- Ubuntu/Debian: `linux-headers-{{ ansible_kernel }}` works.
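
The `failed_when: false` versus `ignore_errors: true` rule above can be sketched as follows. This is a minimal illustration, not the playbook's actual tasks: the task names, the `_example_drbd_modprobe` variable, and the `cozystack-drbd.conf` file name under `/etc/modules-load.d/` are assumptions made for the example.

```yaml
# Sketch only: names are illustrative, not taken from prepare-ubuntu.yml.
- name: Load DRBD kernel module now (failure tolerated)
  community.general.modprobe:
    name: drbd
    state: present
  register: _example_drbd_modprobe
  # ignore_errors leaves _example_drbd_modprobe.failed set to true on failure.
  # failed_when: false would instead force it to false and defeat the gates below.
  ignore_errors: true

- name: Persist the module load only after a successful modprobe
  ansible.builtin.copy:
    dest: /etc/modules-load.d/cozystack-drbd.conf
    content: "drbd\n"
    mode: "0644"
  when: not (_example_drbd_modprobe.failed | default(false))

- name: Warn when the module load was deferred
  ansible.builtin.debug:
    msg: "modprobe drbd failed; check the module signature or enrollment, then re-run."
  when: _example_drbd_modprobe.failed | default(false)
```

With `ignore_errors: true` the play continues past a failed modprobe, and both `when:` gates still see the real outcome.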
@@ -97,8 +104,32 @@ their own pods. Do **not** install on the host:

- `qemu-kvm`, `libvirt*` — KubeVirt bundles these.
- `openvswitch-switch` / `openvswitch` userspace — Kube-OVN bundles OVS.
- `drbd-utils`, `drbd-dkms`, `kmod-drbd*` — Piraeus init-container
compiles DRBD 9.x from source at runtime (kernel headers are enough).
- `drbd-utils` userspace (`drbdadm`, `drbdmeta`, `drbdsetup`) — Piraeus
ships these in the satellite container.

**Exception**: `drbd-dkms` IS installed on Ubuntu LTS hosts (22.04 /
24.04) where the in-cluster Piraeus compile path is unavailable —
UEFI Secure Boot enabled, kernel lockdown rejects the unsigned
modules built by the in-cluster compiler with `Key was rejected by
service`. See `cozystack_enable_drbd_dkms` in
`examples/ubuntu/prepare-ubuntu.yml`. `drbd-dkms` Depends on
`drbd-utils (>= 9.28.0)`, so `drbdadm`/`drbdmeta`/`drbdsetup`
land on the host as a transitive apt dependency. They are
unused — the satellite container ships its own copy. The
playbook `systemctl mask`s `drbd.service` so the host-side
userspace cannot be enabled accidentally and race the satellite.
Talos ships pre-signed DRBD modules in extensions and does not
need this.

The collection does not automate the equivalent flow on RHEL/SUSE
(LINBIT-managed RPM repo + pre-signed kmods). Operators on those
distros either build and sign drbd-dkms manually or disable Secure
Boot. LINBIT's PPA is keyed by Ubuntu release name and currently
publishes for jammy + noble; interim releases (oracular, plucky)
and the next LTS pre-publication fall back to the same manual
path. The supported list is exposed as
`cozystack_drbd_supported_releases` so operators can extend it
from inventory once LINBIT publishes for a new series.

The host only needs the kernel modules and, for KVM, a working `/dev/kvm`.

25 changes: 23 additions & 2 deletions README.md
@@ -67,14 +67,31 @@ LINSTOR uses LVM thin pools by default for local block storage.

#### Required: Kernel headers (Piraeus DRBD loader)

LINSTOR uses DRBD 9.x for replication. The Piraeus operator's init container compiles the DRBD kernel module from source **against the running kernel** at runtime, so only kernel headers must be installed on the host — **no DRBD host packages are needed**. Pin the headers package to `ansible_kernel` so a staged-but-not-yet-booted kernel update doesn't install headers for the wrong kernel.
LINSTOR uses DRBD 9.x for replication. The Piraeus operator's init container compiles the DRBD kernel module from source **against the running kernel** at runtime, so kernel headers must be installed on the host. Pin the headers package to `ansible_kernel` so a staged-but-not-yet-booted kernel update doesn't install headers for the wrong kernel.

| Ubuntu/Debian | RHEL/CentOS | openSUSE/SLE |
| --- | --- | --- |
| `linux-headers-{{ ansible_kernel }}` | `kernel-devel-{{ ansible_kernel }}` plus `kernel-modules-extra-{{ ansible_kernel }}` | `kernel-default-devel` (zypper resolves to running kernel — SUSE's NVR format differs from `uname -r`) |

On Oracle Linux the playbook auto-detects the UEK kernel (`uek` substring in `ansible_kernel`) and installs `kernel-uek-devel-{{ ansible_kernel }}` / `kernel-uek-modules-extra-{{ ansible_kernel }}` instead. Oracle Linux is not on the validated-end-to-end list; this code path is retained best-effort for users who still run the example playbook there. ZFS automation skips on UEK kernels because OpenZFS does not publish kmod builds for UEK.
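
As a rough sketch of the pinning described above, the Ubuntu/Debian column of the table could translate into a task like the following. The task name is illustrative, and the sketch assumes fact gathering is enabled (so `ansible_kernel` is populated) and that the play already runs with privilege escalation.

```yaml
# Sketch only: installs headers for the running kernel, not a staged one.
- name: Install kernel headers matching the running kernel
  ansible.builtin.apt:
    name: "linux-headers-{{ ansible_kernel }}"
    state: present
```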

#### Required on Ubuntu Secure Boot hosts: drbd-dkms

The in-cluster compile path above produces unsigned modules. UEFI Secure Boot's kernel lockdown rejects them at `insmod` time with `Key was rejected by service`, breaking the satellite Pod boot. This is common on bare-metal Ubuntu installs (Secure Boot is the firmware default on most modern boards) and on Secure-Boot-enabled cloud SKUs. Standard cloud-VM images on AWS/GCP/Azure typically ship without Secure Boot, so the in-cluster path works there as-is.

`examples/ubuntu/prepare-ubuntu.yml` installs `drbd-dkms` from the LINBIT PPA on Ubuntu LTS hosts (22.04 / 24.04) so dkms+shim signs the module against a per-host MOK key. The Piraeus loader detects the host-loaded module and exits cleanly without attempting its own compile + insmod.

`drbd-dkms` has a hard apt dependency on `drbd-utils` (≥ 9.28.0), so `drbdadm`/`drbdmeta`/`drbdsetup` land on the host transitively. They are unused at runtime: piraeus-operator's satellite container ships its own copy of `drbd-utils` and invokes that one. The host's `drbd.service` ships disabled by maintainer; the playbook also `systemctl mask`s it so a future `systemctl enable drbd` cannot accidentally race the satellite container. Net result: the only DRBD state managed on the host is the kernel module.

On hosts where Secure Boot is disabled, the module loads on first run and no MOK enrollment is needed. The prepare playbook is still safe to run there: the dkms build succeeds, and the kernel accepts the module without an enrolled key because lockdown is not enforced. On Secure Boot hosts, the dkms-generated MOK key must be enrolled at the shim console on the next reboot (Enroll MOK → View key → Continue → dkms password). The playbook tolerates the initial modprobe failure pending enrollment and emits a reminder; an idempotent re-run after enrollment converges.
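
The host-side flow described in this section can be sketched roughly as below. This is an illustration under stated assumptions, not a copy of `prepare-ubuntu.yml`: task names are made up, the play is assumed to run with `become: true` and gathered facts, and the ordering (drop-in first, then PPA and package, then the mask) follows the description above.

```yaml
# Sketch only: task names are illustrative; values come from the text above.
- name: Pre-install drbd-dkms on supported Ubuntu releases
  when:
    - cozystack_enable_drbd_dkms | default(true) | bool
    - "ansible_distribution_release in (cozystack_drbd_supported_releases | default(['jammy', 'noble']))"
  block:
    # Written first so any modprobe triggered by a package postinst
    # already sees usermode_helper=disabled.
    - name: Write DRBD modprobe options
      ansible.builtin.copy:
        dest: /etc/modprobe.d/cozystack-drbd.conf
        content: "options drbd usermode_helper=disabled\n"
        mode: "0644"

    - name: Add the LINBIT PPA
      ansible.builtin.apt_repository:
        repo: "{{ cozystack_drbd_ppa | default('ppa:linbit/linbit-drbd9-stack') }}"
        state: present

    - name: Install drbd-dkms (drbd-utils arrives as a dependency)
      ansible.builtin.apt:
        name: drbd-dkms
        state: present
        update_cache: true

    - name: Mask the host drbd.service so it cannot race the satellite
      ansible.builtin.systemd:
        name: drbd.service
        masked: true
```

The tolerated `modprobe drbd` and the MOK enrollment reminder would follow these tasks.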

Variables (define in inventory to override):

- `cozystack_enable_drbd_dkms` (bool, default `true`): set `false` on Talos hosts (Talos ships signed DRBD modules in extensions) or where Secure Boot is disabled and you prefer the in-cluster compile path.
- `cozystack_drbd_ppa` (string, default `ppa:linbit/linbit-drbd9-stack`): point at a local mirror of the LINBIT archive.

The PPA-based path is automated only on Ubuntu releases LINBIT keeps current — Jammy (22.04 LTS) and Noble (24.04 LTS) as of 2026. Interim non-LTS releases (Oracular 24.10, Plucky 25.04, etc.) and the next LTS (Resolute 26.04) before LINBIT publishes for them are not in the LINBIT PPA, so the playbook skips the install and emits a notice on those hosts. The supported list is exposed as `cozystack_drbd_supported_releases` (default `[jammy, noble]`); operators can extend it from inventory once LINBIT publishes for a new series, without waiting for a collection release. Operators on unsupported releases must build and sign `drbd-dkms` manually, downgrade to a supported LTS, or disable Secure Boot. Debian is not automated either (no LINBIT Debian PPA). RHEL/SUSE need a separate flow (LINBIT-managed RPM repo + pre-signed kmods) and are out of scope for this collection.

medium

Similar to the changelog, the hardcoded year '2026' here may become stale. Phrasing this without a specific year would improve long-term maintainability of the documentation.

Contributor Author

Same as the CHANGELOG comment: the "as of 2026" anchor is intentional and load-bearing. It tells a future reader that the LINBIT PPA series list is time-dependent and the snapshot recorded here is from a specific point. The variables-table entry below this paragraph (cozystack_drbd_supported_releases) and the inline troubleshooting note also explicitly invite operators to override the list once LINBIT publishes for a new series — the temporal phrasing complements that.


#### Required: Multipath DRBD blacklist

> **Silent failure if omitted.** `multipathd` defaults to grabbing any device matching common patterns including DRBD's `drbd*`. Once that happens LINSTOR cannot access its own volumes and volumes become unreadable after the next reboot.
@@ -167,9 +184,10 @@ informational notice:

Other subsystem notes:

- **Ubuntu 26.04 LTS:** two changes to be aware of.
- **Ubuntu 26.04 LTS:** three changes to be aware of.
1. *Auto-applied by `examples/ubuntu/site.yml`*: `sudo-rs` ships as the default `/usr/bin/sudo` alternative and does not honour ansible's `become_method: sudo` privilege-escalation pseudo-tty — every `become: true` task hangs with `Timeout (12s) waiting for privilege escalation prompt`. The classical sudo binary is co-installed at `/usr/bin/sudo.ws`. `site.yml` imports `prepare-sudo.yml` first, which switches the `sudo` alternative back via `update-alternatives` using a `raw` command (so it works even when become is broken). The play is a no-op on releases without sudo-rs. If you bypass `site.yml` and call the prepare playbooks directly, run `prepare-sudo.yml` before any task with `become: true` on 26.04 hosts.
2. *Manual inventory setting on 26.04 hosts*: the playbook auto-skips `linux-modules-extra-*` on Ubuntu 26.04+ because the package no longer exists for kernel 7.x — `openvswitch` and `vport-geneve` are bundled into `linux-image-generic`. The auto-skip relies on `ansible_distribution_version`; on hosts where that fact is unreliable, set `cozystack_ubuntu_extra_packages: []` in inventory to skip the apt install explicitly.
3. *DRBD via `drbd-dkms` is not automated on releases LINBIT does not publish for*: LINBIT's PPA only ships drbd-dkms for the LTS series they keep current (Jammy 22.04 + Noble 24.04 as of 2026). Interim releases (Oracular 24.10, Plucky 25.04) and the next LTS (Resolute 26.04) before LINBIT publishes are skipped with a notice; on Secure Boot hosts the in-cluster compile path will fail with `Key was rejected by service`. Build and sign `drbd-dkms` manually, downgrade to a supported LTS, extend `cozystack_drbd_supported_releases` from inventory once LINBIT publishes for your release, or disable Secure Boot.

medium

This is another instance of the hardcoded year '2026'. Removing it would prevent the documentation from becoming outdated in the future.

Contributor Author

Same rationale as the other two "as of 2026" mentions: the LINBIT PPA's published-series set changes over time, so anchoring the claim to a calendar year helps a future reader identify the snapshot rather than assume the list is current. Removing the year would obscure that signal.

- **Cloud providers (Ubuntu on OCI, AWS, GCP):** stock Ubuntu cloud images ship an iptables INPUT chain that ends with `REJECT icmp-host-prohibited`, which blocks k3s ports 2380/6443 between nodes. Set `cozystack_flush_iptables: true` in your inventory so the prepare playbook flushes the INPUT chain before k3s installs. Oracle Linux images on OCI do not have this restriction out of the box.
- **Rocky 10 / Alma 10 (and other RHEL 10 rebuilds):** the `iptables` userspace binary is not installed by default. `examples/rhel/prepare-rhel.yml` installs `iptables-nft` so the `cozystack_flush_iptables` task and k3s kube-proxy replacement have a working `iptables` wrapper over nftables.
- **ARM64 (aarch64):** OpenZFS does not publish aarch64 RPMs for RHEL-family distributions via `zfsonlinux.org/epel`. Cozystack itself targets x86_64.
@@ -345,6 +363,9 @@ vars to opt out of the corresponding prepare step:
| `cozystack_enable_kubevirt` | `true` | Example playbooks: load KubeVirt kernel modules. Set `false` to skip. |
| `cozystack_flush_iptables` | `false` | Example playbooks: flush the iptables INPUT chain before k3s installs. Set `true` on Ubuntu/Debian cloud images (OCI/AWS/GCP) where the default INPUT chain ends with `REJECT icmp-host-prohibited` and blocks k3s inter-node ports 2380/6443. |
| `cozystack_zfs_release_rpm_extra` | `{}` | `examples/rhel/` only: merged on top of the built-in `cozystack_zfs_release_rpm_by_major` dict, so you can add (or override) a single EL-major → OpenZFS release RPM entry from inventory without wiping the base dict. Example: `{"10": "https://zfsonlinux.org/epel/zfs-release-X-Y.el10.noarch.rpm"}` once upstream ships one. |
| `cozystack_enable_drbd_dkms` | `true` | `examples/ubuntu/` only: install `drbd-dkms` from the LINBIT PPA on Ubuntu LTS 22.04 / 24.04 hosts so DRBD's kernel module is signed via dkms+shim under Secure Boot. Set `false` on Talos hosts (Talos ships pre-signed DRBD modules in extensions) or where Secure Boot is disabled and the in-cluster compile path is preferred. The toggle stops *future* installs but does NOT undo a prior install — manually `apt purge drbd-dkms` and remove the LINBIT entry from `/etc/apt/sources.list.d/` if you flipped to `false` after a successful run. |
| `cozystack_drbd_ppa` | `ppa:linbit/linbit-drbd9-stack` | `examples/ubuntu/` only: override to point at a Launchpad PPA mirror of the LINBIT archive. `ansible.builtin.apt_repository` resolves the signing key for `ppa:` URIs by querying Launchpad's REST API directly (no extra packages required). Non-Launchpad URIs (`deb http://internal-mirror/...`) work but you must manage the apt signing key separately — drop a keyring under `/etc/apt/keyrings/` and add `signed-by=` to the repo line. |
| `cozystack_drbd_supported_releases` | `[jammy, noble]` | `examples/ubuntu/` only: list of Ubuntu release codenames LINBIT's PPA publishes drbd-dkms for. Extend from inventory when LINBIT adds a new series (e.g. `[jammy, noble, resolute]`) without waiting for a collection release. The playbook skips the install and emits a notice on Ubuntu hosts whose `ansible_distribution_release` is not in this list. |
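
As a small example of the last row, extending the supported list from inventory could look like this. The file location is hypothetical, and adding `resolute` only makes sense once LINBIT actually publishes `drbd-dkms` for that series.

```yaml
# Hypothetical inventory vars file (for example group_vars/all.yml).
cozystack_drbd_supported_releases:
  - jammy
  - noble
  - resolute  # assumption: add only after LINBIT publishes for this series
```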

## Using with k3s
