
Conversation

@tchap
Contributor

@tchap tchap commented Jul 21, 2025

The command now removes any temporary resources when a signal is received,
i.e. the whole temporary namespace in general, or just the pods in case
--run-namespace is specified.

Since there is currently no way to interrupt log streaming in kubectl,
I ended up pulling kubectl logs in as an internal package and adding RunLogsContext.
I sent a PR to add the method upstream as well, but that will only become available
once we rebase onto the next release.

Let me know if you have a better idea. I did it this way because some
of the methods on LogsOptions are private, so I couldn't reimplement
the logic without basically rewriting the whole thing anyway.
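
Roughly, the cleanup-on-signal flow amounts to the sketch below (simplified; runMustGather and cleanup are placeholders here, not the exact functions in this PR):

package main

import (
    "context"
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

// Placeholder for the gathering work; it just waits for cancellation.
func runMustGather(ctx context.Context) error {
    <-ctx.Done()
    return ctx.Err()
}

// Placeholder for deleting the temporary namespace or pods.
func cleanup(ctx context.Context) {
    fmt.Println("deleting temporary resources")
}

func main() {
    // Cancel ctx on SIGINT/SIGTERM so in-flight work stops early.
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    err := runMustGather(ctx)

    // Run cleanup with a fresh context so it is not skipped because ctx was cancelled.
    cleanup(context.Background())

    if err != nil && err != context.Canceled {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}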

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Jul 21, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-59311, which is invalid:

  • expected the bug to target the "4.20.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The command now removes any temporary resources when a signal is received,
i.e. the whole temporary namespace in general, or just the pods in case
--run-namespace is specified.

Since there is currently no way to interrupt log streaming in kubectl,
I ended up pulling kubectl logs in as an internal package and adding RunLogsContext.
I sent a PR to add the method upstream as well, but that will only become available
once we rebase onto the next release.

Let me know if you have a better idea. I did it this way because some
of the methods on LogsOptions are private, so I couldn't reimplement
the logic without basically rewriting the whole thing anyway.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tchap
Contributor Author

tchap commented Jul 21, 2025

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Jul 21, 2025
@openshift-ci-robot

@tchap: This pull request references Jira Issue OCPBUGS-59311, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @zhouying7780

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Jul 21, 2025
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2025
@tchap tchap force-pushed the must-gather-process-signals branch from 2338818 to 71dd73e on July 21, 2025 14:07
@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 21, 2025
@tchap
Contributor Author

tchap commented Jul 21, 2025

Ok, let's wait until kubernetes/kubernetes#133067 is merged and available. We will revisit the PR then.

@tchap
Contributor Author

tchap commented Jul 23, 2025

/hold

until the upstream PR is merged and the function becomes available.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 23, 2025
}

func inspectSecretInfo(info *resource.Info, o *InspectOptions) error {
func inspectSecretInfo(ctx context.Context, info *resource.Info, o *InspectOptions) error {
Member

I think that instead of waiting for the upstream changes to land downstream, we can iterate with subsequent complementary efforts. For example, the first step would be wiring a context through the oc adm inspect command. Callers can simply pass context.Background() for now.
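
Something along these lines, for illustration (the options type below is a stand-in, not the real InspectOptions):

package main

import (
    "context"
    "fmt"
)

// options stands in for InspectOptions; only the shape of the change matters here.
type options struct{ names []string }

// run accepts a context now so cancellation can be wired in later
// without another signature change.
func (o *options) run(ctx context.Context) error {
    for _, name := range o.names {
        select {
        case <-ctx.Done():
            return ctx.Err()
        default:
        }
        fmt.Println("inspecting", name)
    }
    return nil
}

func main() {
    o := &options{names: []string{"secret/foo", "configmap/bar"}}
    // Existing callers simply pass context.Background() for now.
    if err := o.run(context.Background()); err != nil {
        fmt.Println(err)
    }
}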

Contributor Author

True, I will get to that when I have some spare time.

kcmdutil.CheckErr(o.Complete(args))
kcmdutil.CheckErr(o.Validate())
kcmdutil.CheckErr(o.Run())
kcmdutil.CheckErr(o.Run(context.TODO()))
Member

Run usually does not take a context. It is better to define context.TODO() inside Run.
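
In other words, keep the cobra-facing signature context-free and create the context inside, roughly like this sketch (not the actual oc code):

package main

import (
    "context"
    "fmt"
)

type options struct{}

// Run keeps the context-free signature that the command wiring expects;
// the context is created here and threaded into the context-aware logic.
func (o *options) Run() error {
    ctx := context.TODO()
    return o.runWithContext(ctx)
}

func (o *options) runWithContext(ctx context.Context) error {
    fmt.Println("running with", ctx)
    return nil
}

func main() {
    _ = (&options{}).Run()
}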


// gatherOperatorResourceData gathers all kubeapiserver.operator.openshift.io resources
func (o *InspectOptions) gatherOperatorResourceData(destDir string, ctx *resourceContext) error {
func (o *InspectOptions) gatherOperatorResourceData(ctx context.Context, destDir string, rCtx *resourceContext) error {
Member

Renaming the ctx parameter (the *resourceContext) to rCtx would be an innocent, safe change that can easily be handled in a separate PR.

@tchap tchap force-pushed the must-gather-process-signals branch from 71dd73e to cfbad27 on September 18, 2025 05:50
@coderabbitai

coderabbitai bot commented Sep 18, 2025

Walkthrough

Refactors must-gather to be context- and signal-aware with a cleanup-hook mechanism, typed workqueue, and context propagation across all operations. Introduces an internal logs package implementing configurable, concurrent log streaming with a new LogsOptions API, command wiring, and extensive tests. Updates must-gather tests for new context-driven signatures and cleanup flow.
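
For orientation, the concurrency-controlled streaming summarized here boils down to a pattern like the sketch below (illustrative only; MaxFollowConcurrency and the real streaming code live in the internal logs package):

package main

import (
    "fmt"
    "sync"
)

// streamAll starts one worker per source but allows at most max of them to run
// at once, which is the idea behind a MaxFollowConcurrency-style limit.
func streamAll(sources []string, max int, stream func(string)) {
    sem := make(chan struct{}, max)
    var wg sync.WaitGroup
    for _, src := range sources {
        wg.Add(1)
        sem <- struct{}{} // blocks while max workers are already running
        go func(s string) {
            defer wg.Done()
            defer func() { <-sem }()
            stream(s)
        }(src)
    }
    wg.Wait()
}

func main() {
    pods := []string{"pod/a", "pod/b", "pod/c", "pod/d"}
    streamAll(pods, 2, func(s string) { fmt.Println("streaming logs from", s) })
}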

Changes

  • Must-gather context refactor (pkg/cli/admin/mustgather/mustgather.go, pkg/cli/admin/mustgather/mustgather_test.go):
    Convert must-gather to context-driven execution with signal handling; add cleanup hooks; propagate ctx through image resolution, k8s API, pods, logs, copying, backup; switch to typed, ctx-aware workqueue; adjust resource lifecycle and error wrapping; update tests to use ctx and explicit cleanup.
  • Internal logs implementation and tests (pkg/cli/internal/logs/doc.go, pkg/cli/internal/logs/logs.go, pkg/cli/internal/logs/logs_test.go):
    Add internal logs package mirroring kubectl logs with LogsOptions, command, flags, validation, and RunLogsContext; support sequential/parallel streaming, prefixing, concurrency control, and error handling; provide DefaultConsumeRequest; include comprehensive unit tests and package doc.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120+ minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 13.33%, which is insufficient; the required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The title "WIP: OCPBUGS-59311: oc adm must-gather: Perform cleanup on signal received" accurately and succinctly summarizes the primary change, adding signal-triggered cleanup to the oc adm must-gather command, and includes the Jira ID for traceability, so reviewers can quickly grasp intent. The inclusion of the affected command name and the bug reference makes the title directly related to the changeset. The "WIP" prefix correctly signals work in progress but does not obscure the core purpose of the PR.
  • Description Check (✅ Passed): The description is directly related to the changeset: it explains that temporary resources are cleaned on signal (with special-case behavior for --run-namespace), documents vendoring kubectl logs to add RunLogsContext for interruptible streaming, and notes the upstream PR and why the copy was necessary. It provides relevant rationale and the current hold/upstream dependency status, so reviewers can understand both intent and the reason for the vendored code. Overall, the description aligns with the code changes and is not off-topic.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci
Contributor

openshift-ci bot commented Sep 18, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
Once this PR has been reviewed and has the lgtm label, please assign ingvagabund for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tchap tchap changed the title from "OCPBUGS-59311: oc adm must-gather: Perform cleanup on signal received" to "WIP: OCPBUGS-59311: oc adm must-gather: Perform cleanup on signal received" Sep 18, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 18, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/cli/admin/mustgather/mustgather.go (2)

863-881: Nit: leading slash in path.Join component

Using path.Join(destDir, "/gather.logs") still resolves to destDir/gather.logs because Join cleans its result, but the leading slash reads like an absolute path and is misleading.

Apply this fix:

-        dest, err := os.OpenFile(path.Join(destDir, "/gather.logs"), os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0o755)
+        dest, err := os.OpenFile(path.Join(destDir, "gather.logs"), os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0o755)

93-106: Shell script: incorrect self-session detection prevents correct kill loop

The disk-usage checker uses [[ "$sid" -eq "${$}" ]], which compares session IDs against the current PID rather than the current session ID. Compute the current session ID and compare against that.

- ps -o sess --no-headers | sort -u | while read sid; do
-   [[ "$sid" -eq "${$}" ]] && continue
-   pkill --signal SIGKILL --session "$sid"
+ self_sid="$(ps -o sess= -p $$)"
+ ps -o sess --no-headers | sort -u | while read -r sid; do
+   [[ "$sid" -eq "$self_sid" ]] && continue
+   pkill --signal SIGKILL --session "$sid"
🧹 Nitpick comments (4)
pkg/cli/internal/logs/doc.go (1)

1-6: Add license header and clarify temporary fork scope

Include the Apache 2.0 license header (consistent with logs.go) and note this copy is for internal use only and will be removed once kubectl v34+ is available.

+/*
+Copyright 2014 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
pkg/cli/admin/mustgather/mustgather.go (3)

710-723: Queue shutdown timing: optional simplification

You can defer ShutDownWithDrain after starting workers (and omit the extra goroutine) to follow the common workqueue pattern; current approach works but is atypical.

- queue.ShutDownWithDrain()
- go func() {
-   <-ctx.Done()
-   queue.ShutDown()
- }()
+ defer queue.ShutDownWithDrain()

1057-1063: Use structured log writer for cleanup error

Prefer writing to o.LogOut instead of fmt.Printf for consistency and testability.

-       if err := o.Client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
-           fmt.Printf("%v\n", err)
+       if err := o.Client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
+           fmt.Fprintf(o.LogOut, "%v\n", err)
        } else {
            o.PrinterDeleted.PrintObj(ns, o.LogOut)
        }

231-246: Minor: flag name typo in error message

The flag is --image-stream (singular). Error text says --image-streams.

-       if len(o.ImageStreams) != 0 {
-           return fmt.Errorf("--image-streams %s", errStr)
-       }
+       if len(o.ImageStreams) != 0 {
+           return fmt.Errorf("--image-stream %s", errStr)
+       }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0e4fabc and cfbad27.

📒 Files selected for processing (5)
  • pkg/cli/admin/mustgather/mustgather.go (29 hunks)
  • pkg/cli/admin/mustgather/mustgather_test.go (3 hunks)
  • pkg/cli/internal/logs/doc.go (1 hunks)
  • pkg/cli/internal/logs/logs.go (1 hunks)
  • pkg/cli/internal/logs/logs_test.go (1 hunks)
🔇 Additional comments (10)
pkg/cli/admin/mustgather/mustgather_test.go (3)

122-122: Tests updated for context-aware completeImages: LGTM


217-222: Tests updated for context-aware getNamespace: LGTM

Good switch to ctx; assertions read cleanly.


234-235: Explicit cleanup(ctx) call: LGTM

Using context.Background() here is appropriate to ensure cleanup proceeds even if the run ctx was cancelled.

pkg/cli/internal/logs/logs_test.go (2)

204-275: Concurrency limit tests cover MaxFollowConcurrency well

Good coverage for both success and failure paths when following multiple streams.


757-822: DefaultConsumeRequest error/partial read handling: LGTM

Solid cases for timeouts and newline edge cases.

pkg/cli/admin/mustgather/mustgather.go (2)

587-611: Signal-aware shutdown: LGTM

Nice use of signal.NotifyContext with a second-signal fast-exit. Message-once guard via atomic.Bool is correct.
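
For readers outside the diff, the pattern being described is roughly the following sketch (not the PR's exact code):

package main

import (
    "context"
    "fmt"
    "os"
    "os/signal"
    "sync/atomic"
    "syscall"
    "time"
)

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
    defer stop()

    // Guard so the "cleaning up" notice is printed at most once, even if several
    // code paths observe the cancellation.
    var notified atomic.Bool

    go func() {
        <-ctx.Done()
        if notified.CompareAndSwap(false, true) {
            fmt.Fprintln(os.Stderr, "interrupt received, cleaning up; interrupt again to exit immediately")
        }
        // Listen for a second signal only after the first one; it forces a fast exit.
        second := make(chan os.Signal, 1)
        signal.Notify(second, syscall.SIGINT, syscall.SIGTERM)
        <-second
        os.Exit(130)
    }()

    <-ctx.Done()                // stand-in for the gather work being interrupted
    time.Sleep(2 * time.Second) // stand-in for cleanup running with context.Background()
    fmt.Println("cleanup finished")
}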


613-618: Background cleanup on exit: LGTM

Running cleanup with context.Background() avoids cancellation-related skips.

pkg/cli/internal/logs/logs.go (3)

360-368: Signal-friendly RunLogs wrapper: LGTM


398-427: Parallel consumption with per-source prefixes: LGTM

Pipe fan-in with buffered, line-bounded writes is appropriate; CloseWithError correctly propagates first error when IgnoreLogErrors=false.


474-494: DefaultConsumeRequest reads: LGTM

Line-buffered approach prevents sub-line interleaving across concurrent streams.
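
A minimal illustration of that line-buffered copy (the shape is assumed here; the real DefaultConsumeRequest reads from a rest.ResponseWrapper):

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "io"
    "os"
    "strings"
)

// consume copies in to out one complete line at a time, so concurrent streams
// sharing the same destination never interleave partial lines.
func consume(in io.Reader, out io.Writer) error {
    r := bufio.NewReader(in)
    for {
        line, err := r.ReadBytes('\n')
        if len(line) > 0 {
            if _, werr := out.Write(line); werr != nil {
                return werr
            }
        }
        if err != nil {
            if err == io.EOF {
                // Terminate a trailing partial line so the next stream starts on a fresh line.
                if len(line) > 0 && !bytes.HasSuffix(line, []byte("\n")) {
                    fmt.Fprintln(out)
                }
                return nil
            }
            return err
        }
    }
}

func main() {
    _ = consume(strings.NewReader("first line\nsecond line\ntrailing partial"), os.Stdout)
}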

Comment on lines +1297 to +1311
func (o *MustGatherOptions) cleanup(ctx context.Context) {
    o.cleanupMux.Lock()
    hooks := append([]func(context.Context){}, o.cleanupHooks...)
    o.cleanupMux.Unlock()

    var wg sync.WaitGroup
    wg.Add(len(o.cleanupHooks))
    for _, hook := range hooks {
        go func() {
            defer wg.Done()
            hook(ctx)
        }()
    }
    wg.Wait()
}

⚠️ Potential issue

Race/deadlock risk: WaitGroup uses len(o.cleanupHooks) after unlock

cleanup() copies hooks under lock but uses len(o.cleanupHooks) for wg.Add after unlocking. Concurrent mutation would miscount and deadlock. Also capture hook in the goroutine arg to avoid loop-variable capture.

Apply this safer pattern:

- var wg sync.WaitGroup
- wg.Add(len(o.cleanupHooks))
- for _, hook := range hooks {
-   go func() {
-     defer wg.Done()
-     hook(ctx)
-   }()
- }
+ var wg sync.WaitGroup
+ for _, hook := range hooks {
+   wg.Add(1)
+   go func(h func(context.Context)) {
+     defer wg.Done()
+     h(ctx)
+   }(hook)
+ }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (o *MustGatherOptions) cleanup(ctx context.Context) {
    o.cleanupMux.Lock()
    hooks := append([]func(context.Context){}, o.cleanupHooks...)
    o.cleanupMux.Unlock()

    var wg sync.WaitGroup
    wg.Add(len(o.cleanupHooks))
    for _, hook := range hooks {
        go func() {
            defer wg.Done()
            hook(ctx)
        }()
    }
    wg.Wait()
}

func (o *MustGatherOptions) cleanup(ctx context.Context) {
    o.cleanupMux.Lock()
    hooks := append([]func(context.Context){}, o.cleanupHooks...)
    o.cleanupMux.Unlock()

    var wg sync.WaitGroup
    for _, hook := range hooks {
        wg.Add(1)
        go func(h func(context.Context)) {
            defer wg.Done()
            h(ctx)
        }(hook)
    }
    wg.Wait()
}
🤖 Prompt for AI Agents
In pkg/cli/admin/mustgather/mustgather.go around lines 1297 to 1311, the
WaitGroup is being initialized with len(o.cleanupHooks) after releasing the lock
which can race if o.cleanupHooks is mutated concurrently, and the goroutine
captures the loop variable unsafely. Fix by computing and using the length of
the local copied slice (hooks) for wg.Add before launching goroutines, and
capture each hook in the goroutine signature (e.g., go func(h
func(context.Context)) { ... }(hook)) so each goroutine invokes the correct
hook; keep the copy-under-lock and unlock before running hooks but base wg.Add
and the iteration on the local hooks slice.

Comment on lines +167 to +169
containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
}
}

⚠️ Potential issue

Nit: consider escaping the literal braces in containerNameFromRefSpecRegexp

Go's RE2 syntax treats a '{' that does not start a valid repetition as a literal, so this pattern compiles and matches spec.containers{...} as written; escaping the braces just makes the intent explicit.

- containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
+ containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers)\{(.+)\}`),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
}
}
containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers)\{(.+)\}`),
}
}
🤖 Prompt for AI Agents
In pkg/cli/internal/logs/logs.go around lines 167 to 169, the regex relies on RE2
treating the unescaped braces as literals; escape them explicitly so the intent to match
"spec.containers{...}" is clear (e.g. replace `{`/`}` with `\{`/`\}` in the regexp
literal) and keep using the backtick raw string so the backslashes are preserved for the
regexp engine.

@openshift-ci
Contributor

openshift-ci bot commented Sep 18, 2025

@tchap: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/e2e-metal-ipi-ovn-ipv6 (commit cfbad27, required: false): rerun with /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

