
feat(apply): pre-flight Talos version check and decode-error hints#133

Merged
lexfrei merged 6 commits into main from feat/preflight-version-check
May 5, 2026



@lexfrei lexfrei commented May 5, 2026

Why

talm apply -i against a node in maintenance mode can fail with a cryptic strict-decoder error like:

error applying new configuration: rpc error: code = Unknown desc = failed to parse config: unknown keys found during decoding:
machine:
    install:
        grubUseUKICmdline: true

The field is not in the user's nodes/*.yaml or in any chart template — it is auto-injected by machinery during config generation when the contract derived from templateOptions.talosVersion outpaces what's compiled into the running maintenance binary. Real-world example: cozystack/cozystack#2442. Root cause for that user was boot-to-talos -yes defaulting to Talos v1.11.6 while the install image targeted v1.12.6, but the cryptic error gave them no clue.

Scope is laid out in cozystack/talm#132. This PR addresses the talm-side improvements from item 3 of that issue, plus a pre-flight detection complement.

What

Two complementary improvements:

  1. Pre-flight version check: before apply actually sends the config, read the running Talos version from the COSI Versions.runtime.talos.dev/runtime/version resource (NonSensitive, reachable via --insecure Reader role) and compare it against the configured contract. When the configured contract is strictly newer than the running version, print a warning with a hint pointing at the maintenance image vs contract mismatch. Best-effort: any read or parse failure returns silently and never blocks apply.

  2. Backstop hint on the decode error: when c.ApplyConfiguration returns an unknown keys found during decoding: error, attach a cockroachdb/errors.WithHint annotation pointing at the same mismatch. The custom error printer in the entrypoint renders hints as hint: ... lines under the main error text. Existing fmt.Errorf errors continue to render unchanged.

Both apply paths are covered (template-rendering and direct-patch). The check is advisory only: pre-flight never aborts apply, because the user may have a valid reason to pin a lower contract.
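
The comparison rule described above can be sketched roughly as follows. This is a stdlib-only illustration, not the code from pkg/commands/preflight.go: the contract type, parseContract, and evaluateMismatch are hypothetical names, and the sketch assumes a contract is compared on major and minor only. It follows the description at this point in the PR, so an unparseable or empty configured version simply yields no warning.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// contract is a simplified, hypothetical stand-in for a version contract.
// Assumption: only major and minor take part in the comparison.
type contract struct{ major, minor int }

// parseContract accepts inputs such as "v1.12", "1.12" or "v1.11.6".
func parseContract(version string) (contract, error) {
	parts := strings.Split(strings.TrimPrefix(version, "v"), ".")
	if len(parts) < 2 {
		return contract{}, fmt.Errorf("cannot parse version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return contract{}, fmt.Errorf("bad major in %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return contract{}, fmt.Errorf("bad minor in %q: %w", version, err)
	}
	return contract{major: major, minor: minor}, nil
}

// newerThan reports whether c is strictly newer than o.
func (c contract) newerThan(o contract) bool {
	return c.major > o.major || (c.major == o.major && c.minor > o.minor)
}

// evaluateMismatch returns a warning line when the configured contract is
// strictly newer than the running version, and "" in every other case.
// Parse failures are swallowed so the check stays best-effort.
func evaluateMismatch(configured, running string) string {
	cfg, err := parseContract(configured)
	if err != nil {
		return ""
	}
	run, err := parseContract(running)
	if err != nil {
		return ""
	}
	if !cfg.newerThan(run) {
		return ""
	}
	return fmt.Sprintf("pre-flight: configured talosVersion=%s is newer than running Talos %s",
		configured, running)
}

func main() {
	// Contract ahead of the maintenance image: a warning is produced.
	fmt.Println(evaluateMismatch("v1.12", "v1.11.6"))
	// Matching contract: no warning, so this prints an empty line.
	fmt.Println(evaluateMismatch("v1.11", "v1.11.6"))
}
```

Returning a string instead of an error keeps the caller from ever treating a mismatch as fatal, which matches the advisory intent stated above.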

Implementation notes

  • New direct dependency: github.com/cockroachdb/errors for WithHint / GetAllHints. Minimal usage today; future sites can attach hints without callers having to format them.
  • New file: pkg/commands/preflight.go with preflightCheckTalosVersion, evaluateVersionMismatch, annotateApplyConfigError, and the two hint constants.
  • pkg/commands/apply.go calls into both helpers from each apply closure.
  • main.go (entrypoint) renders hints to stderr after the main error message via errors.GetAllHints. SilenceErrors/SilenceUsage were already set on the root cobra command.
  • Pre-flight reads via safe.StateGet[*runtime.Version] — the same pattern as talosctl/cmd/talos/support.go upstream. Uses c.COSI.Get semantics (no state.WithSkipProtobufUnmarshal) since the runtime package's protobuf types are imported directly.

Tests

pkg/commands/preflight_test.go:

  • TestEvaluateVersionMismatch — table-driven, 8 cases covering all comparison branches and unparseable inputs. Asserts that the warning carries a hint and mentions the running version.
  • TestAnnotateApplyConfigError — 3 cases (nil, unrelated error, strict-decoder error). Asserts hint attachment via errors.GetAllHints.

go test ./... — all pass. golangci-lint run ./... — 0 issues.

Verification (manual)

End-to-end:

  1. Pre-flight on real maintenance Talos: build talm, run ./talm apply -f nodes/node1.yaml -i against a node where Versions.runtime.talos.dev reports a version older than templateOptions.talosVersion. Expect a pre-flight: warning on stderr with a hint, then apply continues.
  2. Backstop on real strict-decoder error: same node, force a config that injects an unknown field. Expect the trailing error to include hint: ... line(s).
  3. No regression when versions match: contract equal to running Talos. Expect no warning, no hint.
  4. No regression when COSI read fails: target a server that returns PermissionDenied. Expect apply to proceed silently.

Closes (or addresses item 3 of) #132.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added pre-flight checks to detect Talos version mismatches before applying configurations, providing warnings when configured versions are newer than running versions.
    • Enhanced error messages with detailed hints when configuration decoding fails.
  • Documentation

    • Added guidance on version compatibility requirements, clarifying that configured Talos versions must match the running version on target nodes.
  • Tests

    • Added unit tests for version mismatch detection and error annotation.

lexfrei added 2 commits May 5, 2026 14:51
Add github.com/cockroachdb/errors as a direct dependency and surface its
WithHint annotations after the main error message in the entrypoint
error printer.

Existing fmt.Errorf-wrapped errors continue to render unchanged. New
sites can now attach actionable hints that print on their own lines as
'hint: ...' without callers having to format them.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
Pre-flight: read the running Talos version from the COSI Versions
resource on the target node and compare it to the contract derived from
templateOptions.talosVersion / --talos-version. When the configured
contract is strictly newer than the running version, print a warning
with a hint pointing at the maintenance image / contract mismatch.
Best-effort: any read or parse failure returns silently and never
blocks apply.

Backstop: when c.ApplyConfiguration returns the maintenance strict
decoder error 'unknown keys found during decoding: ...', attach the
same hint via errors.WithHint so users see actionable guidance instead
of a cryptic gRPC message.

Both paths cover the template-rendering and direct-patch apply flows.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
coderabbitai Bot commented May 5, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fe3e1c53-eaa3-46b9-8a49-a8a4c039b41c

📥 Commits

Reviewing files that changed from the base of the PR and between c2659c1 and 19245e4.

📒 Files selected for processing (2)
  • pkg/commands/apply.go
  • pkg/commands/preflight.go
📝 Walkthrough

Walkthrough

This PR adds Talos version preflight checking to the talm apply command. It introduces a pre-flight version compatibility validator that reads the running Talos version from target nodes, compares it against the configured version, and emits warnings when the configured version is newer. Error handling infrastructure is added to surface error hints from CockroachDB's error library, and version mismatches or apply failures are annotated with relevant hints.

Changes

Talos Version Preflight Validation

| Layer / File(s) | Summary |
| --- | --- |
| Dependency Infrastructure (go.mod) | Adds cockroachdb/errors and related dependencies (errors, logtags, redact) plus getsentry/sentry-go and tooling packages (kr/pretty, kr/text, rogpeppe/go-internal) for error annotation and hint extraction. |
| Error Hint Printing (main.go) | Execute error handler imports cockroachdb/errors and prints all error hints to os.Stderr after the main error output. |
| Preflight Version & Error Logic (pkg/commands/preflight.go) | Implements preflightCheckTalosVersion which reads node runtime version via COSI, evaluates version mismatch against configured contract, and prints warnings with hints; adds annotateApplyConfigError to attach decoder-error hints to apply failures; defines two hint constants for version mismatches and unknown-field errors. |
| Command Integration (pkg/commands/apply.go) | Integrates preflight checks into both template-rendering and direct-patch apply paths; calls preflightCheckTalosVersion per node before applying config and wraps apply errors with annotateApplyConfigError. |
| Tests & Documentation (pkg/commands/preflight_test.go, README.md) | Tests cover version mismatch evaluation, error hint annotation, and full preflight flow with stubbed version readers; README documents version compatibility requirement and pre-flight warning/error-hint behavior. |

Sequence Diagram

sequenceDiagram
    participant User as User/CLI
    participant App as talm apply
    participant Client as Talos Client
    participant Node as Target Node
    participant Config as Configuration Apply
    
    User->>App: talm apply
    loop For each node
        App->>Client: Connect (node-scoped context)
        Client->>Node: Read runtime.Version from COSI
        Node-->>Client: Return running Talos version
        App->>App: Evaluate version mismatch<br/>(configured vs running)
        alt Version mismatch detected
            App->>App: Print warning + hints to stderr
        end
        App->>Config: Apply configuration
        Config-->>Config: Apply result
        alt Apply error with unknown fields
            App->>App: Annotate error with decode hint
            App->>App: Extract and print all hints
        end
    end
    Config-->>User: Success or error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Version checks before we leap,
Contracts matched while nodes do sleep,
Hints extracted, warnings clear,
Apply configuration without fear!
Configuration harmony, at last! 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main changes: adding a pre-flight Talos version check and decode-error hints to the apply command, which are the core features described in the PR objectives. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a pre-flight check to compare the configured Talos version against the version running on the node, providing warnings and hints if there's a mismatch. It also enhances error reporting by using the github.com/cockroachdb/errors package to display actionable hints for common configuration issues, such as unknown keys during decoding. Feedback suggests adding a timeout to the pre-flight network call to ensure it remains non-blocking and truly best-effort.

Comment thread pkg/commands/preflight.go
"context"
"fmt"
"io"
"strings"

medium

The time package is required to implement a timeout for the pre-flight version check.

Suggested change
- "strings"
+ "strings"
+ "time"

Contributor Author


Added in b6e0618: time is now imported and used by the new preflightCOSIReadTimeout constant in cosiVersionReader.

Comment thread pkg/commands/preflight.go Outdated
Comment on lines +81 to +85
res, err := safe.StateGet[*runtime.Version](
ctx,
c.COSI,
resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined),
)

medium

The pre-flight check performs a network call to the Talos node via safe.StateGet. To ensure this check remains truly "best-effort" and does not block the apply process if a node is unresponsive or the network is slow, it should have a dedicated timeout. Adding a short timeout (e.g., 2 seconds) ensures that this informational check never significantly delays the actual configuration application.

Suggested change
- res, err := safe.StateGet[*runtime.Version](
-     ctx,
-     c.COSI,
-     resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined),
- )
+ ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
+ defer cancel()
+ res, err := safe.StateGet[*runtime.Version](
+     ctx,
+     c.COSI,
+     resource.NewMetadata(runtime.NamespaceName, runtime.VersionType, "version", resource.VersionUndefined),
+ )

Contributor Author


Done in b6e0618 — wrapped the safe.StateGet call inside cosiVersionReader with context.WithTimeout(ctx, preflightCOSIReadTimeout) where the constant is 2 * time.Second. Short enough to stay invisible on a healthy node, long enough to clear any expected roundtrip; on a hung node the reader returns ok=false and the existing silent-on-error contract takes over.

lexfrei added 2 commits May 5, 2026 17:49
…flight per-node

Pre-flight in pkg/commands/preflight.go had two gaps that exactly
matched the reproduction cases from #132.

1. Empty configuredVersion was short-circuited to silent return.
   Machinery treats an unset contract as TalosVersionCurrent (a nil
   *VersionContract that compares as strictly greater than every
   concrete version), so it still injects machine.install.grubUseUKICmdline
   even when templateOptions.talosVersion is unset. Drop the early
   return: evaluateVersionMismatch now leaves configuredContract nil
   when configuredVersion is empty and lets contract.Greater carry the
   semantics. The warning prints 'configured talosVersion=current'
   (machinery's *VersionContract.String() for the nil contract) so the
   user can see what was inferred.

2. The direct-patch apply path called preflightCheckTalosVersion with
   the multi-node context withApplyClient produces. COSI does not
   support multi-node proxying — see the existing precedent in
   pkg/commands/rotate_ca_handler.go:317 — so the read fails silently
   and no warning ever surfaces for any direct-patch apply against
   more than one node. Iterate GlobalArgs.Nodes and invoke preflight
   per node with client.WithNode, matching the per-node fan-out the
   template-rendering path already inherits from applyTemplatesPerNode.

The function now takes a versionReader so tests can drive the wired
behavior without a live COSI server. cosiVersionReader wraps the
production safe.StateGet[*runtime.Version] read at the apply call
sites. preflight_test.go gains TestPreflightCheckTalosVersion with
five cases (newer/empty-as-current/match/reader-error/unparseable)
and the empty-configured row in TestEvaluateVersionMismatch flips
from 'no warning' to 'warning'.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
The talosVersion / --talos-version setting was only described as an
output-format selector. It is also (and more critically) the contract
that decides which fields machinery injects into the generated config,
which must match what the maintenance Talos parser on the node knows.
A user who sets it from install.image (a different artifact) lands on
'unknown keys found during decoding: machine.install.grubUseUKICmdline'
and has no doc-side guidance — see #132 and the
reproduction in cozystack/cozystack#2442.

Add a short paragraph that names the failure mode, points at the
pre-flight warning the apply path now prints, and tells the user the
two ways out (reboot into a maintenance image matching the contract,
or lower the contract). Keep it next to the existing output-format
note where readers are already looking at this setting.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
@lexfrei lexfrei marked this pull request as ready for review May 5, 2026 14:56
Address review feedback from gemini-code-assist on
pkg/commands/preflight.go:21,85: the pre-flight COSI read had no
timeout, so a slow or unresponsive node could turn an informational
best-effort check into a blocker for apply. Wrap the safe.StateGet
call with context.WithTimeout (2s) inside cosiVersionReader. The
timeout is short enough to stay invisible on a healthy node and long
enough to clear any expected roundtrip; on a hung node, the reader
returns ok=false and the existing silent-on-error contract takes over.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/commands/apply.go`:
- Around line 171-180: The code only iterates GlobalArgs.Nodes so when nodes are
resolved from the talosconfig context by wrapWithNodeContext those targets are
skipped; change the loop to iterate the actual resolved node list (e.g. obtain
nodes from wrapWithNodeContext or from the client/context after wrapping)
instead of GlobalArgs.Nodes so preflightCheckTalosVersion runs for every
resolved node and the earlier fmt.Printf shows the correct nodes; update the
call sites around cosiVersionReader, preflightCheckTalosVersion,
wrapWithNodeContext, GlobalArgs.Nodes and applyCmdFlags.talosVersion to use the
resolved nodes collection.
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 41a55578-01a5-4de8-922c-15b8915303c9

📥 Commits

Reviewing files that changed from the base of the PR and between 2545b3b and c2659c1.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • README.md
  • go.mod
  • main.go
  • pkg/commands/apply.go
  • pkg/commands/preflight.go
  • pkg/commands/preflight_test.go

Comment thread pkg/commands/apply.go Outdated
…itted

Address review feedback from coderabbitai on
pkg/commands/apply.go:178: the direct-patch closure iterated
GlobalArgs.Nodes only, but wrapWithNodeContext fills ctx via
client.WithNodes from the talosconfig context when --nodes is omitted
without mutating GlobalArgs.Nodes itself. The preflight loop was a
no-op in that path and the log line printed an empty node list.
Mirror wrapWithNodeContext's resolution into a local targetNodes
slice and use it for both the log line and the per-node preflight
loop. With --nodes set, behavior is unchanged.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
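
The node-resolution fallback this commit describes might look roughly like the sketch below. resolveTargetNodes and its parameters are hypothetical names for illustration; the actual change mirrors wrapWithNodeContext's resolution inside the direct-patch closure, and the log format here is invented.

```go
package main

import "fmt"

// resolveTargetNodes prefers nodes given on the command line and falls back
// to the nodes from the talosconfig context when none were passed.
// It returns a copy so callers cannot mutate either source slice.
func resolveTargetNodes(cliNodes, contextNodes []string) []string {
	source := cliNodes
	if len(source) == 0 {
		source = contextNodes
	}
	return append([]string(nil), source...)
}

func main() {
	// --nodes omitted: the talosconfig context supplies the targets.
	targets := resolveTargetNodes(nil, []string{"10.0.0.2", "10.0.0.3"})
	fmt.Printf("applying to nodes=%v\n", targets)

	// The same slice drives both the log line and the per-node loop.
	for _, node := range targets {
		fmt.Println("pre-flight for", node)
	}
}
```

Using one resolved slice for both the log line and the loop avoids the mismatch called out in the review, where the log printed an empty list and the loop ran zero times.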
@lexfrei lexfrei self-assigned this May 5, 2026
@lexfrei lexfrei merged commit 04fbd6c into main May 5, 2026
5 checks passed
lexfrei added a commit that referenced this pull request May 6, 2026
The auth template-rendering apply path puts the target node under
the plural "nodes" metadata key so helpers.ForEachResource and
apid's machine-API backend resolver can read it. Reusing that ctx
for the COSI version preflight breaks the preflight: Talos's apid
director rejects every COSI method whose outgoing context carries
the plural key, regardless of slice length, and cosiVersionReader
swallows errors and returns ok=false on rejection. End user sees
no version-mismatch warning even when the running Talos predates
the configured talosVersion -- defeating the whole point of the
preflight added in PR #133.

Add cosiPreflightContext: clone the outgoing metadata, drop
"nodes", attach "node" with the same target, hand the rebuilt
context to the COSI caller. ApplyConfiguration keeps the original
ctx unchanged. The helper is a noop on the insecure (maintenance)
path that carries no node metadata at all and on multi-node ctx
where the single-target shortcut would be unsafe.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>
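
The metadata rewrite that cosiPreflightContext performs can be pictured with a plain map. This sketch deliberately avoids the gRPC metadata package: md is a hypothetical stand-in for outgoing metadata, and singleNodeMetadata is an illustrative name, not the helper's real signature. It assumes the no-op cases are exactly "no nodes key" and "more than one node", as the commit message describes.

```go
package main

import "fmt"

// md is a hypothetical stand-in for gRPC outgoing metadata (key to values).
type md map[string][]string

// singleNodeMetadata returns a copy of in where the plural "nodes" key is
// replaced by a singular "node" key carrying the same target.
// It is a no-op when "nodes" is absent or names more than one node.
func singleNodeMetadata(in md) md {
	nodes := in["nodes"]
	if len(nodes) != 1 {
		return in
	}

	out := make(md, len(in))
	for key, values := range in {
		if key == "nodes" {
			continue // drop the plural key
		}
		out[key] = append([]string(nil), values...) // clone, do not alias
	}
	out["node"] = []string{nodes[0]}
	return out
}

func main() {
	original := md{"nodes": {"10.0.0.2"}, "authorization": {"token"}}
	rewritten := singleNodeMetadata(original)

	// The original map is untouched, so the apply call can keep using it.
	fmt.Println(len(original["nodes"]), len(rewritten["nodes"]), rewritten["node"])
}
```
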
lexfrei added a commit that referenced this pull request May 6, 2026
…riptions

Source comments and tests had bug-number references (#77,
PR #133) that violate the project's commit-message and code-comment
standards: comments must be self-explanatory to a reader without
access to the issue tracker. Replace each citation with a description
of the bug class itself -- 'duplicate primitive-array entries per
round-trip' instead of '#77', 'the version-mismatch warning that
preflightCheckTalosVersion exists to surface' instead of 'PR #133'.
The surrounding prose already says what the issue is, so the numbers
came out without losing context.

Pre-existing references in code this branch does not touch
(preflight_test.go's #132 reproduction comments, engine_test.go's
#66 fixture name) are left as-is -- out of scope here.

Assisted-By: Claude <noreply@anthropic.com>
Signed-off-by: Aleksei Sviridkin <f@lex.la>