feat(charts/cozystack): add recommended sysctl and etcd defaults #131
IvanHunters wants to merge 1 commit into main
Conversation
Add recommended sysctl parameters for TCP orphan handling, network backlog, and TCP keepalive to improve DRBD connection stability and prevent port exhaustion. Add etcd quota and request size limits to prevent etcd running out of space with large LINSTOR CRD datasets and allow larger CRD objects to be stored.

Signed-off-by: ohotnikov.ivan <ohotnikov.ivan@e-queo.net>
📝 Walkthrough

The PR updates the Talos machine configuration Helm template helper to add recommended network tuning sysctls (TCP orphan handling, backlog, and keepalive settings) and etcd cluster extraArgs for quota and request size limits.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Code Review
This pull request introduces several kernel sysctl optimizations for TCP orphan handling, network backlog, and keepalive settings. It also increases etcd's storage quota and maximum request size to accommodate large LINSTOR CRD datasets. A review comment suggests that the Kubernetes API server should also be configured with a matching --max-resource-write-bytes limit to ensure the increased etcd request size is fully effective.
```yaml
{{- toYaml .Values.advertisedSubnets | nindent 6 }}
extraArgs:
  quota-backend-bytes: "8589934592" # 8GiB - prevent etcd running out of space with large LINSTOR CRD datasets
  max-request-bytes: "10485760" # 10MiB - allow larger CRD objects to be stored
```
Increasing max-request-bytes in etcd is a necessary step for handling large objects, but it is often insufficient on its own. To fully support larger CRD objects (such as large LINSTOR datasets), the Kubernetes API server should also be configured with a matching --max-resource-write-bytes limit in its extraArgs. Without this corresponding change, the API server may reject large requests before they reach etcd, rendering the increased etcd limit ineffective for those operations.
Actionable comments posted: 1
🧹 Nitpick comments (1)
charts/cozystack/templates/_helpers.tpl (1)
32-42: Sysctl defaults look reasonable; consider making them configurable.

The added TCP/network tunings are sensible defaults for clusters running DRBD/LINSTOR replication. A few non-blocking observations:

- These sysctls apply uniformly to controlplane and workers (intended, since DRBD runs cluster-wide), but they are also imposed on clusters that don't use DRBD. If a deployment uses cozystack without LINSTOR/DRBD, the aggressive `tcp_keepalive_time=600` (vs. kernel default `7200`) and shortened `tcp_fin_timeout=30` will still apply.
- All values are hardcoded: there's no way to override via `values.yaml` without forking the chart. This matches the style of the existing sysctls above (lines 29–31), so it is consistent, but you may eventually want to expose at least the DRBD-specific knobs (similar to how `nr_hugepages` is gated via `$.Values.nr_hugepages`).

♻️ Optional: gate DRBD-tuning sysctls behind a value (e.g. `.Values.drbdTuning`)

```diff
 net.ipv4.neigh.default.gc_thresh1: "4096"
 net.ipv4.neigh.default.gc_thresh2: "8192"
 net.ipv4.neigh.default.gc_thresh3: "16384"
+{{- if (default true $.Values.drbdTuning) }}
 # TCP orphan handling
 net.ipv4.tcp_orphan_retries: "3"
 net.ipv4.tcp_fin_timeout: "30"
 # Network backlog
 net.core.netdev_max_backlog: "5000"
 net.core.netdev_budget: "600"
 net.core.netdev_budget_usecs: "8000"
 # TCP keepalive (early detection of dead connections)
 net.ipv4.tcp_keepalive_time: "600"
 net.ipv4.tcp_keepalive_intvl: "10"
 net.ipv4.tcp_keepalive_probes: "6"
+{{- end }}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@charts/cozystack/templates/_helpers.tpl` around lines 32 - 42, The hardcoded sysctl entries (net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_keepalive_probes, net.ipv4.tcp_fin_timeout, net.core.netdev_* and tcp_orphan_retries) should be made configurable or gated so non-DRBD clusters are not forced to use aggressive values; add a boolean flag (e.g. .Values.drbdTuning default false) or a map (.Values.sysctls) in values.yaml and update the template to only render these DRBD-specific tunings when .Values.drbdTuning is true (or allow per-key overrides via the map), following the existing pattern used for nr_hugepages to locate and change the rendering logic in the helper template.
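As a rough sketch of the per-key-override variant the prompt mentions (the `.Values.sysctls` map is hypothetical and not a key in the chart today; `dict`, `mergeOverwrite`, and `default` are standard Sprig functions available in Helm templates):

```yaml
{{- /* defaults stay in the template; operators override individual keys via values */}}
{{- $defaults := dict
      "net.ipv4.tcp_orphan_retries" "3"
      "net.ipv4.tcp_fin_timeout" "30"
      "net.ipv4.tcp_keepalive_time" "600"
      "net.ipv4.tcp_keepalive_intvl" "10"
      "net.ipv4.tcp_keepalive_probes" "6" }}
{{- range $k, $v := mergeOverwrite $defaults (default (dict) $.Values.sysctls) }}
{{ $k }}: {{ $v | quote }}
{{- end }}
```

A user could then set, e.g., `sysctls: {net.ipv4.tcp_keepalive_time: "7200"}` in values.yaml to soften a single knob without forking the chart.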
📒 Files selected for processing (1)
charts/cozystack/templates/_helpers.tpl
```yaml
extraArgs:
  quota-backend-bytes: "8589934592" # 8GiB - prevent etcd running out of space with large LINSTOR CRD datasets
  max-request-bytes: "10485760" # 10MiB - allow larger CRD objects to be stored
```
🧩 Analysis chain
🌐 Web query:
Does kube-apiserver in Kubernetes 1.30+ require any flag (e.g., --max-request-bytes) to be coordinated when etcd's --max-request-bytes is raised to 10MiB, or does the apiserver pass requests through transparently up to etcd's limit?
💡 Result:
No, kube-apiserver in Kubernetes 1.30+ does not require any flag like --max-request-bytes to be coordinated when etcd's --max-request-bytes is raised to 10MiB. The apiserver passes etcd requests transparently up to etcd's limit, but enforces its own independent limit on incoming client request body sizes for write operations.

kube-apiserver has a hardcoded default MaxRequestBodyBytes of 3MiB (3145728 bytes) for decoding write request bodies (e.g., CREATE/UPDATE/PATCH), and fails requests larger than this with "Request entity too large: limit is 3145728" before they reach etcd. This limit accounts for JSON-to-protobuf conversion expansion when storing in etcd (etcd default 1.5MiB, apiserver allows ~2x). It is not exposed as a command-line flag in standard kube-apiserver; users of the generic apiserver library can configure it programmatically.

etcd's --max-request-bytes independently limits gRPC requests from kube-apiserver (e.g., Put, Txn). Raising it to 10MiB allows larger backend requests without apiserver changes, as the apiserver does not impose additional byte limits on outgoing etcd calls beyond what fits its object model; it transparently forwards transformed storage requests to etcd. No changes in 1.30+ alter this; official docs list no --max-request-bytes flag for kube-apiserver, and source code confirms the client-side body limit is separate from etcd's server-side limit.
Citations:
- 1: https://v1-30.docs.kubernetes.io/docs/reference/generated/kube-apiserver/
- 2: https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/
- 3: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go
- 4: https://stackoverflow.com/questions/60468110/kubernetes-object-size-limitations
- 5: Adding a limit on the size of request body the apiserver will decode for write operations kubernetes/kubernetes#73805
- 6: [KUBE-APISERVER] Error from server (RequestEntityTooLarge Request entity too large: limit is 3145728 kubernetes/kubernetes#105863
- 7: https://v1-31.docs.kubernetes.io/docs/reference/generated/kube-apiserver/
- 8: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/server/config.go
- 9: https://github.com/kubernetes/kubernetes/blob/master/vendor/go.etcd.io/etcd/server/v3/embed/config.go
- 10: https://etcd.io/docs/v3.6/dev-guide/limit/
- 11: Restriction on MaxRequestSize and maxWALEntrySize etcd-io/etcd#14114
- 12: https://github.com/kubernetes/kubernetes/blob/da53a247633/cmd/kube-apiserver/app/options/options.go
Etcd defaults align with upstream guidance — note the kube-apiserver request-size boundary.
Both values are correctly placed inside the eq .MachineType "controlplane" block, so they apply only to control-plane nodes where etcd runs. The values match etcd's documented recommendations:
- `quota-backend-bytes: 8589934592` (8 GiB) is the upstream-suggested maximum (per the etcd "How to debug large db size issue" blog post).
- `max-request-bytes: 10485760` (10 MiB) sits at etcd's documented soft cap; exactly at the 10 MB boundary and safe from WAL-decode failure risk (Restriction on MaxRequestSize and maxWALEntrySize, etcd-io/etcd#14114).
Important clarification: kube-apiserver (1.30+) has a hardcoded 3MiB limit on incoming client request bodies for write operations and does not expose this as a configurable flag. This means API clients cannot send requests larger than 3MiB, even though etcd can accept 10MiB. The 10MiB etcd setting still provides value for direct etcd operations or future use cases, but no apiserver flag coordination is needed.
Operational considerations to keep in mind:

- Memory footprint: etcd (boltDB) mmaps the backend file, so raising `quota-backend-bytes` from the default 2 GiB to 8 GiB allows the resident set on each control-plane node to grow up to ~8 GiB just for etcd. Ensure control-plane VMs are sized accordingly.
- Defragmentation: larger backends grow fragmentation. Confirm that periodic `etcdctl defrag` (or equivalent) is in place; otherwise the 8 GiB quota can still be hit by fragmented free pages.
- Alerting: recommend adding/keeping alerts on `etcd_mvcc_db_total_size_in_bytes` and `etcd_server_quota_backend_bytes` so operators see growth trends well before hitting the new ceiling.
- LINSTOR CRD objects: the 10 MiB etcd setting is sized to handle LINSTOR CRD workloads. Document that this setting exists to support that use case and not as a general invitation to store large blobs in arbitrary CRs (which remain bound by the apiserver's 3 MiB client limit).
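The alerting bullet could be sketched as a Prometheus rule using the two metrics named above; the 80% threshold and `for` duration are illustrative choices, not upstream recommendations:

```yaml
groups:
  - name: etcd-backend-quota
    rules:
      - alert: EtcdBackendNearQuota
        # fires when the boltDB backend exceeds 80% of quota-backend-bytes
        expr: etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "etcd backend size is above 80% of its quota"
```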
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@charts/cozystack/templates/_helpers.tpl` around lines 134 - 136, The etcd
extraArgs (quota-backend-bytes and max-request-bytes) are fine but need an
in-template clarification: add a comment next to quota-backend-bytes and
max-request-bytes (the keys under extraArgs) stating these apply only to
control-plane nodes and that kube-apiserver enforces a 3MiB hard client request
body limit (so the 10MiB etcd setting is for direct etcd or special use cases
like LINSTOR CRDs), and mention operational considerations (memory
sizing/defrag/alerts) so operators understand sizing tradeoffs and why these
values were chosen.
lexfrei left a comment
Title scope
Recent PRs touching charts/cozystack/templates/_helpers.tpl use a concept-style scope rather than a path: feat(cozystack): enable allocateNodeCIDRs by default (#91), fix(chart): cast metadata.id to string for regexMatch … (#110), feat(helpers): add bond interface discovery helpers (#94). feat(charts/cozystack): … is the first path-style scope I see in this repo. feat(cozystack): would mirror #91, which was a similar defaults-tweak for this same chart.
Merge ordering with #130
This PR edits cluster.etcd.advertisedSubnets immediately adjacent to where #130 wraps the same block in an if/else discovery fallback. There's no semantic conflict, but git merge-tree reports a textual conflict — whichever PR lands second will need a small rebase.
TCP keepalive blast radius
tcp_keepalive_time/intvl/probes = 600/10/6 is roughly 12× more aggressive on idle timeout than the kernel defaults (7200/75/9). The PR body frames this as DRBD-focused, but net.ipv4.tcp_keepalive_* are kernel-wide and apply to every long-lived idle TCP socket on the node — pods holding NFS mounts, idle DB clients, MQ consumers, etc. all inherit the new failure characteristics. Likely still the right default for a Cozystack cluster, but worth calling out explicitly in the PR description so operators understand the scope.
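For scale: with keepalive enabled on a socket, a dead peer is declared after roughly tcp_keepalive_time + tcp_keepalive_intvl × tcp_keepalive_probes of silence, i.e. 600 + 10 × 6 = 660 s under these values versus 7200 + 75 × 9 = 7875 s under kernel defaults. If an operator needs the defaults back on a particular node, Talos supports per-node sysctl overrides in the machine config; a sketch (verify the patch mechanism against your Talos version):

```yaml
# Talos machine-config patch (illustrative): restore kernel-default keepalives on one node
machine:
  sysctls:
    net.ipv4.tcp_keepalive_time: "7200"
    net.ipv4.tcp_keepalive_intvl: "75"
    net.ipv4.tcp_keepalive_probes: "9"
```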
The other sysctls (tcp_orphan_retries, tcp_fin_timeout, netdev_*) are more conservative and look fine.
etcd quota and request size
quota-backend-bytes: 8 GiB is etcd's documented upper recommended ceiling, not a typical default — etcd's own default is 2 GiB. Same direction for max-request-bytes: 10 MiB (vs etcd's 1.5 MiB default). Both are correct numbers for a LINSTOR-heavy control plane, but they're strong opinions to bake in for every Cozystack cluster: a small dev cluster still pays for the larger backend in etcd RSS regardless of whether it ever uses it.
Could these be exposed in values.yaml with the proposed numbers as defaults? For example:
```yaml
etcd:
  quotaBackendBytes: "8589934592"
  maxRequestBytes: "10485760"
```

The recommendation still ships as the default, but operators can tune without forking the chart. #91 used this exact pattern for allocateNodeCIDRs: defaults-tweak plus values exposure in the same change.
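On the template side, the wiring could look roughly like this (a sketch; `etcd.quotaBackendBytes` and `etcd.maxRequestBytes` are the proposed names from the values example above, not existing chart keys, with the defaults living in values.yaml):

```yaml
{{- if eq .MachineType "controlplane" }}
extraArgs:
  quota-backend-bytes: {{ .Values.etcd.quotaBackendBytes | quote }}
  max-request-bytes: {{ .Values.etcd.maxRequestBytes | quote }}
{{- end }}
```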
Tests
`pkg/engine/render_test.go` already asserts on the rendered `sysctls:` block (lines 114, 211). It would be cheap to extend those assertions to the new sysctl keys, and to add a check that `etcd.extraArgs` renders only for `MachineType: controlplane`; that's already correct in the template (the block sits inside `{{- if eq .MachineType "controlplane" }}`), but it isn't covered by a test today.
Style nit
Inline YAML comments after string scalars (`quota-backend-bytes: "8589934592" # 8GiB - …`) parse fine, but the rest of this template uses Go template comments (`{{- /* … */ -}}`) for explanations; see #130 for examples. Aligning would help readability.
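Concretely, the inline comments could move into template-comment form (illustrative; same values, same meaning):

```yaml
{{- /* 8 GiB: prevent etcd running out of space with large LINSTOR CRD datasets */}}
quota-backend-bytes: "8589934592"
{{- /* 10 MiB: allow larger CRD objects to be stored */}}
max-request-bytes: "10485760"
```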
Summary
Adds recommended sysctl and etcd defaults to Talos machine configuration to improve DRBD connection stability and prevent etcd storage issues with large LINSTOR CRD datasets.
Changes

Machine sysctls

- `tcp_orphan_retries=3` and `tcp_fin_timeout=30` to prevent orphaned TCP connections from accumulating with DRBD peer connections
- `netdev_max_backlog`, `netdev_budget`, and `netdev_budget_usecs` to prevent packet loss under high DRBD replication traffic

Cluster etcd
Related