Skip to content

feat(executor): add garbage collector Prometheus metrics#7550

Open
davidlin20dev wants to merge 1 commit into
flyteorg:mainfrom
davidlin20dev:feat/executor-gc-metrics
Open

feat(executor): add garbage collector Prometheus metrics#7550
davidlin20dev wants to merge 1 commit into
flyteorg:mainfrom
davidlin20dev:feat/executor-gc-metrics

Conversation

@davidlin20dev

Copy link
Copy Markdown
Contributor

Tracking issue

Related to #7455 (part of #7445; does not close the issue).

Why are the changes needed?

The garbage collector emits no metrics, so there is no visibility into whether terminal-TaskAction cleanup is happening, failing, or keeping up. A broken or slow GC silently lets TaskAction CRDs pile up in etcd until the cluster is already degraded.

What changes were proposed in this pull request?

Three Prometheus instruments under a dedicated executor:gc: sub-scope, created once in the constructor and updated each sweep:

  • objects_deleted (counter): per successful delete.
  • deletion_errors (counter): per failed delete.
  • sweep_duration (stopwatch): full-sweep duration, timed via defer.

The scope is injected through NewGarbageCollector(..., scope) from setup.go (executorScope.NewSubScope("gc")). No labels, to keep cardinality bounded.

How was this patch tested?

New unit tests in garbage_collector_metrics_test.go (controller-runtime fake client, no envtest):

  • TestGarbageCollectorDeletedMetric: a successful sweep moves objects_deleted 0 to 1, records a sweep, and leaves deletion_errors at 0.
  • TestGarbageCollectorDeletionErrorMetric: an injected delete failure moves deletion_errors 0 to 1, and objects_deleted stays 0.

Verified via the default registry, since endpoint exposure depends on #7453.

go test ./executor/pkg/controller/ -run TestGarbageCollector -count=1 -v

Labels

added

Setup process

N/A

Screenshots

N/A

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

N/A

Docs link

N/A

Emit objects_deleted, deletion_errors, and sweep_duration under a dedicated
executor:gc: sub-scope, created once in the constructor and updated during each
collect() sweep. Unit tests assert the counters move and the sweep is timed.

Part of flyteorg#7455.

Signed-off-by: davidlin20dev <davidlin20.dev@gmail.com>
@davidlin20dev davidlin20dev force-pushed the feat/executor-gc-metrics branch from 71636b1 to 26030c3 Compare June 18, 2026 00:46
@Sovietaced

Copy link
Copy Markdown
Member

Is it possible to rework these using OTEL instead of prometheus?

@davidlin20dev

Copy link
Copy Markdown
Contributor Author

Is it possible to rework these using OTEL instead of prometheus?

Hi @Sovietaced, happy to switch these to OTel.

I went with Prometheus because #7445 and #7455 (umbrella issue) are both centered around promutils, so that's
the direction I followed. Before I rework it, though, I just want to make sure I'm building the right thing.

@pingsutw, since you opened these issues: could you confirm whether OTel is the preferred
direction now, and whether the issues should be updated to reflect that? Happy to rework this
on the meter provider (mirroring #7483) once I know which way to go.

Thanks both, appreciate the help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants