Investigate: Oracle Linux 9/10 multi-master k3s fails on OCI — port 2380 blocked between nodes #26

@lexfrei

Summary

When provisioning a 3-node multi-master Cozystack cluster on Oracle Cloud Infrastructure (OCI) with examples/rhel/site.yml, the k3s embedded etcd cannot establish peer communication: the joining nodes cannot reach port 2380 (etcd peer) on the bootstrap server. Port 6443 (kube-apiserver) works fine over the same VNIC. The same OpenTofu/OCI configuration with Ubuntu 22.04 / 24.04 comes up fully, with 87/87 HelmReleases Ready.

Reproduction

  • 3-node OCI cluster (VM.Standard3.Flex, 4 OCPU / 32 GiB / 256 GB), Oracle Linux 9.7 or 10.1 with default UEK kernel
  • Same VCN, same subnet, NSG configured INGRESS all from 0.0.0.0/0 and EGRESS all to 0.0.0.0/0
  • All three nodes in the server inventory group
  • cozystack_flush_iptables: true
  • Run examples/rhel/site.yml

Result: the joining nodes (server[1], server[2]) fail with "MemberAdd request timed out" and "transport: authentication handshake failed: context deadline exceeded". From a joining OL node, bash -c 'echo > /dev/tcp/<server-private-ip>/2380' reports BLOCKED: the TCP SYN times out and no SYN-ACK comes back. The same test against :6443 succeeds.
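
The /dev/tcp test above can be wrapped so that a silent drop (timeout) is distinguishable from an active refusal. This is a minimal sketch, not part of the playbook: the helper name probe_port and its output labels are hypothetical, and it assumes bash and coreutils timeout are available on the node.

```shell
#!/usr/bin/env bash
# probe_port HOST PORT [SECONDS]  (hypothetical helper for this investigation)
# Prints one of:
#   OPEN     TCP handshake completed
#   TIMEOUT  no answer before the deadline (matches the BLOCKED case in this issue)
#   REFUSED  fast failure, usually an RST; also covers other immediate errors
probe_port() {
  local host=$1 port=$2 secs=${3:-3} rc=0
  # bash's /dev/tcp pseudo-device performs a plain TCP connect.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null || rc=$?
  case $rc in
    0)   echo OPEN ;;
    124) echo TIMEOUT ;;   # exit status 124 is how timeout(1) reports expiry
    *)   echo REFUSED ;;
  esac
}

# Example, run from a joining node (replace the placeholder with the real address):
#   probe_port <server-private-ip> 6443
#   probe_port <server-private-ip> 2380
```

A TIMEOUT on 2380 next to an OPEN on 6443 points at a packet drop on the path or on the receiving host, rather than at etcd not listening.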

What is not the cause

  • iptables INPUT is empty / policy ACCEPT after the playbook flushes (verified)
  • firewalld and nftables services are inactive (verified)
  • NSG / security list permit all (verified — same as the working Ubuntu cluster)
  • etcd is listening on the external IP (ss -lnt confirms LISTEN ... <ip>:2380)
  • Local connect to <own-ip>:2380 from the same node works
  • iptables-save shows no REJECT/DROP except harmless KUBE-FIREWALL (loopback only) and OVN-POSTROUTING (set-bound)
  • iptables-save does emit # Warning: iptables-legacy tables present, use iptables-legacy-save to see them, but iptables-legacy-save is not present in the package
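
Even without iptables-legacy-save installed, it may be possible to see which legacy tables the kernel has registered. The sketch below assumes the warning is driven by the kernel's legacy table registry exposed at /proc/net/ip_tables_names (this is my understanding of the nft-based iptables tools, not something verified on these nodes); the helper name list_legacy_tables is hypothetical.

```shell
# list_legacy_tables [FILE]
# Prints each legacy iptables table registered in the kernel.
# FILE defaults to /proc/net/ip_tables_names; reading it normally needs root.
list_legacy_tables() {
  src=${1:-/proc/net/ip_tables_names}
  if [ ! -r "$src" ]; then
    echo "cannot read $src (not root, or ip_tables module not loaded)"
    return 1
  fi
  # /proc files report size 0, so read the content instead of testing the size.
  tables=$(cat "$src")
  if [ -n "$tables" ]; then
    printf '%s\n' "$tables" | sed 's/^/legacy table: /'
  else
    echo "no legacy tables registered"
  fi
}

# Example: sudo sh -c '. ./this-file.sh; list_legacy_tables'
```

If a table such as filter shows up here, its rules are invisible to the nft-backed iptables-save, and the legacy binaries would need to be installed to dump them.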

Hypotheses to investigate

  • SELinux in Enforcing mode on OL is blocking the unprivileged TLS handshake to the etcd peer port (Ubuntu has AppArmor with permissive defaults)
  • A hidden iptables-legacy ruleset is filtering at the kernel netfilter layer
  • oracle-cloud-agent adds packet filtering or a network policy that is not visible via standard tooling
  • The OCI VNIC source/destination check or stateful tracking interacts with the OL kernel differently than with the Ubuntu kernel
  • A path MTU / TCP MSS issue specific to the OL kernel breaks the etcd peer TLS handshake (large packets)
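
A few of these hypotheses can be checked with standard tools. The following is a sketch under stated assumptions: the helper names df_payload and pmtu_probe are hypothetical, the 9000-byte MTU is an assumption about the OCI VNIC (confirm with ip link show), and the SELinux and tcpdump commands in the comments assume the audit and tcpdump packages are installed.

```shell
# df_payload MTU
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet of size MTU.
df_payload() {
  echo $(( $1 - 20 - 8 ))   # minus 20-byte IPv4 header and 8-byte ICMP header
}

# pmtu_probe HOST MTU
# Sends 3 pings with the Don't Fragment bit set; failures suggest a smaller path MTU.
pmtu_probe() {
  ping -M do -c 3 -W 2 -s "$(df_payload "$2")" "$1"
}

# Examples, run from a joining node (replace the placeholder with the real address):
#   pmtu_probe <server-private-ip> 1500
#   pmtu_probe <server-private-ip> 9000   # assumed OCI VNIC MTU
#
# SELinux: check the mode and look for recent denials.
#   getenforce
#   sudo ausearch -m avc -ts recent
#
# On the bootstrap server: see whether SYNs for 2380 arrive at all.
#   sudo tcpdump -ni any 'tcp port 2380 and tcp[tcpflags] & tcp-syn != 0'
```

The tcpdump capture is the most discriminating of these: if SYNs never show up on the bootstrap server, the drop happens before the packet reaches the host's netfilter hooks, which would shift attention toward the VNIC or cloud agent hypotheses.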

What works

Ubuntu 22.04 / 24.04 on OCI with the same module: 3-node multi-master, 87/87 HelmReleases Ready. Documented in README.

Scope

Out of scope for the feat/node-prerequisites PR: the playbook prepares the nodes correctly, and the failure is at the OS/cloud interaction layer. Filed for separate investigation.
