WIP: OCPBUGS-59311: oc adm must-gather: Perform cleanup on signal received #2062
base: main
Conversation
@tchap: This pull request references Jira Issue OCPBUGS-59311, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

/jira refresh

@tchap: This pull request references Jira Issue OCPBUGS-59311, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Force-pushed from 2338818 to 71dd73e (compare)
Ok, let's wait until kubernetes/kubernetes#133067 is merged and available. We will revisit the PR then.

/hold until the upstream PR and function become available.
```diff
 }

-func inspectSecretInfo(info *resource.Info, o *InspectOptions) error {
+func inspectSecretInfo(ctx context.Context, info *resource.Info, o *InspectOptions) error {
```
I think, instead of waiting for the upstream changes to land downstream, we can iterate with subsequent complementary efforts. For example, the first step would be wiring a context through the oc adm inspect command; callers can simply pass context.Background() for now.
True, I will get to that when I have some spare time.
pkg/cli/admin/inspect/inspect.go (outdated)

```diff
 kcmdutil.CheckErr(o.Complete(args))
 kcmdutil.CheckErr(o.Validate())
-kcmdutil.CheckErr(o.Run())
+kcmdutil.CheckErr(o.Run(context.TODO()))
```
Run usually does not take a context. It is better to define context.TODO inside Run.
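A minimal sketch of that shape, assuming Run keeps its context-free signature for now; the InspectOptions fields and the gatherAll helper here are illustrative, not the real ones:

```go
package inspect

import "context"

// InspectOptions stands in for the real options struct (illustrative only).
type InspectOptions struct{}

// Run keeps its context-free signature, so the cobra wiring and existing
// callers do not change; the context is created inside.
func (o *InspectOptions) Run() error {
	ctx := context.TODO() // swapped for a signal-aware context in a follow-up
	return o.gatherAll(ctx)
}

// gatherAll is a hypothetical context-aware worker the options would call.
func (o *InspectOptions) gatherAll(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	default:
		return nil
	}
}
```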
pkg/cli/admin/inspect/inspect.go (outdated)

```diff
 // gatherOperatorResourceData gathers all kubeapiserver.operator.openshift.io resources
-func (o *InspectOptions) gatherOperatorResourceData(destDir string, ctx *resourceContext) error {
+func (o *InspectOptions) gatherOperatorResourceData(ctx context.Context, destDir string, rCtx *resourceContext) error {
```
Renaming the resourceContext parameter from ctx to rCtx is an innocent, safe change that could easily be handled in a separate PR.
Force-pushed from 71dd73e to cfbad27 (compare)
Walkthrough

Refactors must-gather to be context- and signal-aware, with a cleanup-hook mechanism, a typed workqueue, and context propagation across all operations. Introduces an internal logs package implementing configurable, concurrent log streaming with a new LogsOptions API, command wiring, and extensive tests. Updates must-gather tests for the new context-driven signatures and cleanup flow.

Changes
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120+ minutes

Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/cli/admin/mustgather/mustgather.go (2)
863-881: Bug: path.Join with absolute component writes to /gather.logs

Using path.Join(destDir, "/gather.logs") discards destDir and attempts to write to /gather.logs. This will often fail and blocks data collection.

Apply this fix:

```diff
-	dest, err := os.OpenFile(path.Join(destDir, "/gather.logs"), os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0o755)
+	dest, err := os.OpenFile(path.Join(destDir, "gather.logs"), os.O_RDWR|os.O_CREATE|os.O_TRUNC, 0o755)
```
93-106: Shell script: incorrect self-session detection prevents correct kill loop

The disk-usage checker uses [[ "$sid" -eq "${$}" ]], which is invalid and also compares session IDs to a PID. Compute the current session id and compare properly.

```diff
-ps -o sess --no-headers | sort -u | while read sid; do
-  [[ "$sid" -eq "${$}" ]] && continue
-  pkill --signal SIGKILL --session "$sid"
+self_sid="$(ps -o sess= -p $$)"
+ps -o sess --no-headers | sort -u | while read -r sid; do
+  [[ "$sid" -eq "$self_sid" ]] && continue
+  pkill --signal SIGKILL --session "$sid"
```
🧹 Nitpick comments (4)
pkg/cli/internal/logs/doc.go (1)
1-6: Add license header and clarify temporary fork scope

Include the Apache 2.0 license header (consistent with logs.go) and note this copy is for internal use only and will be removed once kubectl v34+ is available.

```diff
+/*
+Copyright 2014 The Kubernetes Authors.
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+*/
```

pkg/cli/admin/mustgather/mustgather.go (3)
710-723: Queue shutdown timing: optional simplification

You can defer ShutDownWithDrain after starting workers (and omit the extra goroutine) to follow the common workqueue pattern; the current approach works but is atypical. A generic sketch of that pattern follows the nitpick list below.

```diff
-	queue.ShutDownWithDrain()
-	go func() {
-		<-ctx.Done()
-		queue.ShutDown()
-	}()
+	defer queue.ShutDownWithDrain()
```
1057-1063: Use structured log writer for cleanup error

Prefer writing to o.LogOut instead of fmt.Printf for consistency and testability.

```diff
-	if err := o.Client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
-		fmt.Printf("%v\n", err)
+	if err := o.Client.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{}); err != nil {
+		fmt.Fprintf(o.LogOut, "%v\n", err)
 	} else {
 		o.PrinterDeleted.PrintObj(ns, o.LogOut)
 	}
```
231-246: Minor: flag name typo in error message

The flag is --image-stream (singular). The error text says --image-streams.

```diff
-	if len(o.ImageStreams) != 0 {
-		return fmt.Errorf("--image-streams %s", errStr)
-	}
+	if len(o.ImageStreams) != 0 {
+		return fmt.Errorf("--image-stream %s", errStr)
+	}
```
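For reference, the sketch promised in the queue-shutdown nitpick above: a self-contained example of the common typed-workqueue drain pattern. It is not the must-gather code; the item type, worker count, and client-go availability of workqueue.NewTyped are assumptions.

```go
package main

import (
	"fmt"
	"sync"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	queue := workqueue.NewTyped[string]()

	// Enqueue the work up front.
	for _, pod := range []string{"pod-a", "pod-b", "pod-c"} {
		queue.Add(pod)
	}

	// Start a bounded pool of workers running the usual Get/Done loop.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				item, shutdown := queue.Get()
				if shutdown {
					return
				}
				fmt.Println("processing", item)
				queue.Done(item)
			}
		}()
	}

	// Drain instead of hard shutdown: queued and in-flight items are
	// finished, then Get reports shutdown and the workers exit.
	queue.ShutDownWithDrain()
	wg.Wait()
}
```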
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (5)
pkg/cli/admin/mustgather/mustgather.go (29 hunks)
pkg/cli/admin/mustgather/mustgather_test.go (3 hunks)
pkg/cli/internal/logs/doc.go (1 hunks)
pkg/cli/internal/logs/logs.go (1 hunks)
pkg/cli/internal/logs/logs_test.go (1 hunks)
🔇 Additional comments (10)
pkg/cli/admin/mustgather/mustgather_test.go (3)
122-122: Tests updated for context-aware completeImages: LGTM
217-222: Tests updated for context-aware getNamespace: LGTM. Good switch to ctx; assertions read cleanly.
234-235: Explicit cleanup(ctx) call: LGTM. Using context.Background() here is appropriate to ensure cleanup proceeds even if the run ctx was cancelled.
pkg/cli/internal/logs/logs_test.go (2)
204-275: Concurrency limit tests cover MaxFollowConcurrency well. Good coverage for both success and failure paths when following multiple streams.
757-822: DefaultConsumeRequest error/partial read handling: LGTM. Solid cases for timeouts and newline edge cases.
pkg/cli/admin/mustgather/mustgather.go (2)
587-611: Signal-aware shutdown: LGTM. Nice use of signal.NotifyContext with a second-signal fast exit. The message-once guard via atomic.Bool is correct.
613-618: Background cleanup on exit: LGTM. Running cleanup with context.Background() avoids cancellation-related skips. A generic sketch of this shutdown-and-cleanup shape follows this comment list.
pkg/cli/internal/logs/logs.go (3)
360-368: Signal-friendly RunLogs wrapper: LGTM
398-427: Parallel consumption with per-source prefixes: LGTM. Pipe fan-in with buffered, line-bounded writes is appropriate; CloseWithError correctly propagates the first error when IgnoreLogErrors=false.
474-494: DefaultConsumeRequest reads: LGTM. The line-buffered approach prevents sub-line interleaving across concurrent streams.
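As referenced in the must-gather comments above, a compact, self-contained sketch of the two-signal shutdown and background-cleanup flow. This is not the PR's code (which uses signal.NotifyContext and an atomic.Bool guard); it only approximates the shape, with illustrative helpers.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// The first interrupt cancels the run context; a second one exits
	// immediately without waiting for cleanup.
	sigCh := make(chan os.Signal, 2)
	signal.Notify(sigCh, os.Interrupt, syscall.SIGTERM)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		<-sigCh
		fmt.Fprintln(os.Stderr, "interrupt received, cleaning up; interrupt again to exit immediately")
		cancel()
		<-sigCh
		os.Exit(1)
	}()

	runGather(ctx) // stand-in for the gathering work; honors cancellation

	// Cleanup runs on a fresh context so a cancelled run context cannot
	// skip deleting the temporary namespace or pods.
	cleanup(context.Background())
}

func runGather(ctx context.Context) {
	select {
	case <-ctx.Done():
	case <-time.After(2 * time.Second):
	}
}

func cleanup(ctx context.Context) {
	_ = ctx // delete temporary resources here
	fmt.Println("cleanup finished")
}
```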
```go
func (o *MustGatherOptions) cleanup(ctx context.Context) {
	o.cleanupMux.Lock()
	hooks := append([]func(context.Context){}, o.cleanupHooks...)
	o.cleanupMux.Unlock()

	var wg sync.WaitGroup
	wg.Add(len(o.cleanupHooks))
	for _, hook := range hooks {
		go func() {
			defer wg.Done()
			hook(ctx)
		}()
	}
	wg.Wait()
}
```
Race/deadlock risk: WaitGroup uses len(o.cleanupHooks) after unlock
cleanup() copies hooks under lock but uses len(o.cleanupHooks) for wg.Add after unlocking. Concurrent mutation would miscount and deadlock. Also capture hook in the goroutine arg to avoid loop-variable capture.
Apply this safer pattern:

```diff
-	var wg sync.WaitGroup
-	wg.Add(len(o.cleanupHooks))
-	for _, hook := range hooks {
-		go func() {
-			defer wg.Done()
-			hook(ctx)
-		}()
-	}
+	var wg sync.WaitGroup
+	for _, hook := range hooks {
+		wg.Add(1)
+		go func(h func(context.Context)) {
+			defer wg.Done()
+			h(ctx)
+		}(hook)
+	}
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 func (o *MustGatherOptions) cleanup(ctx context.Context) {
 	o.cleanupMux.Lock()
 	hooks := append([]func(context.Context){}, o.cleanupHooks...)
 	o.cleanupMux.Unlock()

 	var wg sync.WaitGroup
-	wg.Add(len(o.cleanupHooks))
 	for _, hook := range hooks {
-		go func() {
+		wg.Add(1)
+		go func(h func(context.Context)) {
 			defer wg.Done()
-			hook(ctx)
-		}()
+			h(ctx)
+		}(hook)
 	}
 	wg.Wait()
 }
```
🤖 Prompt for AI Agents
In pkg/cli/admin/mustgather/mustgather.go around lines 1297 to 1311, the
WaitGroup is being initialized with len(o.cleanupHooks) after releasing the lock
which can race if o.cleanupHooks is mutated concurrently, and the goroutine
captures the loop variable unsafely. Fix by computing and using the length of
the local copied slice (hooks) for wg.Add before launching goroutines, and
capture each hook in the goroutine signature (e.g., go func(h
func(context.Context)) { ... }(hook)) so each goroutine invokes the correct
hook; keep the copy-under-lock and unlock before running hooks but base wg.Add
and the iteration on the local hooks slice.
```go
		containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
	}
}
```
Regex bug: unescaped braces cause compile-time panic and break prefixing
containerNameFromRefSpecRegexp must escape literal braces to match spec.containers{...}. As written, it’s invalid and will panic at init.
```diff
-		containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
+		containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers)\{(.+)\}`),
```

📝 Committable suggestion

‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-		containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers){(.+)}`),
+		containerNameFromRefSpecRegexp: regexp.MustCompile(`spec\.(?:initContainers|containers|ephemeralContainers)\{(.+)\}`),
 	}
 }
```
🤖 Prompt for AI Agents
In pkg/cli/internal/logs/logs.go around lines 167 to 169, the regex uses
unescaped braces which are treated as quantifiers and will panic at init; update
the pattern to escape the literal braces so it matches "spec.containers{...}"
(e.g. replace `{`/`}` with `\{`/`\}` in the regexp literal) and keep using the
backtick raw string so the backslashes are preserved for the regexp engine.
@tchap: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
The command now removes any temporary resources on signal received, i.e. the whole temp namespace generally, or just the pods in case `--run-namespace` is specified.

Since there is no way to interrupt log streaming in `kubectl` currently, I ended up just pulling `kubectl logs` in as an internal package and adding `RunLogsContext`. I sent a PR to add the method into upstream as well, but that will only be available once we rebase onto the next release.

Let me know if you have any better idea. I did that because some of the methods on `LogOptions` are private, so I couldn't reimplement the logic without basically rewriting the whole thing anyway.
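For context, a rough sketch of how the new context-aware entry point is meant to be consumed. The exact `RunLogsContext` signature and the `LogsOptions` construction below are assumptions based on this description, not the final API.

```go
package main

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"github.com/openshift/oc/pkg/cli/internal/logs"
)

// streamUntilCancelled assumes the internal package exposes
// (*LogsOptions).RunLogsContext(ctx) error; when ctx is cancelled (e.g. on
// the first SIGINT), streaming stops and must-gather can proceed to cleanup.
func streamUntilCancelled(ctx context.Context, o *logs.LogsOptions) error {
	return o.RunLogsContext(ctx)
}

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	var o *logs.LogsOptions // built elsewhere from pod/container parameters
	if o != nil {
		_ = streamUntilCancelled(ctx, o)
	}
}
```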