Add K8s-specific files to elastic-agent diagnostics bundle #9103


Open

@pchila wants to merge 31 commits into main from k8s-diagnostics-spacetime
Conversation

@pchila (Member) commented Jul 23, 2025

What does this PR do?

This PR adds kubernetes data to the elastic-agent diagnostics bundle, specifically:

  • pod/replicaset/deployment/daemonset/statefulset k8s manifests
  • helm chart manifest and user values
  • namespace leases dump
  • current and previous (if the agent restarted) elastic-agent container logs
  • cgroup memory stats
  • kubernetes events for the elastic-agent pod

The new diagnostics files are compressed into a .zip archive, elastic-agent-k8s.zip, nested within the elastic-agent diagnostics bundle.
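For illustration, here is a minimal sketch of the general approach, not the PR's actual implementation (which lives in internal/pkg/diagnostics/diagnostics_k8s.go and covers far more resources): fetch the agent's own pod manifest with client-go and write it into the nested archive. The downward-API env var names and the output path are assumptions.

    package main

    import (
        "archive/zip"
        "context"
        "fmt"
        "log"
        "os"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "sigs.k8s.io/yaml"
    )

    func main() {
        // The agent runs inside a pod, so the in-cluster config applies.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // Assumed to be injected via the downward API; the names are illustrative.
        ns, name := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

        pod, err := client.CoreV1().Pods(ns).Get(context.Background(), name, metav1.GetOptions{})
        if err != nil {
            log.Fatal(err)
        }
        manifest, err := yaml.Marshal(pod)
        if err != nil {
            log.Fatal(err)
        }

        // Write the manifest into the nested zip, mirroring the k8s/ layout
        // seen in the extraction example further down.
        out, err := os.Create("elastic-agent-k8s.zip")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()
        zw := zip.NewWriter(out)
        defer zw.Close()
        w, err := zw.Create(fmt.Sprintf("k8s/pod-%s.yaml", name))
        if err != nil {
            log.Fatal(err)
        }
        if _, err := w.Write(manifest); err != nil {
            log.Fatal(err)
        }
    }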

Why is it important?

This should reduce initial investigation time for elastic-agent issues on k8s by collecting as much information as possible about the affected agents, alongside the usual diagnostics files.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

  • Build elastic-agent docker image for your CPU architecture from this PR
    SNAPSHOT=true EXTERNAL=true PACKAGES=docker DOCKER_VARIANTS=basic  PLATFORMS="linux/amd64" mage -v clean package
  • Load the new docker image on your target k8s cluster. This step is slightly different depending on the k8s cluster:
    • k3d example:
      ➜  elastic-agent git:(k8s-diagnostics-spacetime) ✗ k3d image import docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT
      INFO[0000] Importing image(s) into cluster 'k3s-default'
      INFO[0000] Starting new tools node...
      INFO[0000] Starting node 'k3d-k3s-default-tools'
      INFO[0000] Saving 1 image(s) from runtime...
      INFO[0010] Importing images into nodes...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-server-0'...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-agent-1'...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-agent-0'...
      INFO[0025] Removing the tarball(s) from image volume...
      INFO[0026] Removing k3d-tools node...
      INFO[0026] Successfully imported image(s)
      INFO[0026] Successfully imported 1 image(s) into 1 cluster(s)
    • kind example
      ➜  elastic-agent git:(k8s-diagnostics-spacetime) ✗ kind load docker-image docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT
      Image: "docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT" with ID "sha256:1914d96cd13b60a50948d913824679090605298d85e49987c1304c4856670cb9" not yet present on node "kind-control-plane", loading...
  • Install elastic-agent using the included helm chart (this is not strictly necessary but it's a convenient way to demo most of the enhancements):
    helm install -n kube-system elastic-agent ./deploy/helm/elastic-agent --set agent.presets.perNode.role.create=true --set agent.presets.clusterWide.role.create=true
    NAME: elastic-agent
    LAST DEPLOYED: Fri Jul 25 09:29:17 2025
    NAMESPACE: kube-system
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    Release "elastic-agent" is installed at "kube-system" namespace
    
    Installed agents:
      - clusterWide [deployment - standalone mode]
      - perNode [daemonset - standalone mode]
    
    Installed kube-state-metrics at "kube-system" namespace.
    
    Installed integrations:
      - kubernetes [built-in chart integration]
    
    👀 Make sure you have installed the corresponding assets in Kibana for all the above integrations!
  • Check that the elastic-agent pods are correctly created and running
    kubectl -n kube-system get pods
    NAME                                               READY   STATUS      RESTARTS   AGE
    agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s   1/1     Running     0          48s
    agent-pernode-elastic-agent-4hlgd                  1/1     Running     0          48s
    agent-pernode-elastic-agent-8pqsb                  1/1     Running     0          48s
    agent-pernode-elastic-agent-q8gld                  1/1     Running     0          48s
    coredns-ccb96694c-xvc24                            1/1     Running     0          2m7s
    helm-install-traefik-5lkd6                         0/1     Completed   2          2m7s
    helm-install-traefik-crd-qv69z                     0/1     Completed   0          2m7s
    kube-state-metrics-7f9d79b77d-7s5l5                1/1     Running     0          48s
    local-path-provisioner-5cf85fd84d-mnw7q            1/1     Running     0          2m7s
    metrics-server-5985cbc9d7-mffhf                    1/1     Running     0          2m7s
    svclb-traefik-f9913214-drj2p                       2/2     Running     0          82s
    svclb-traefik-f9913214-kvhw7                       2/2     Running     0          82s
    svclb-traefik-f9913214-smxpm                       2/2     Running     0          82s
    traefik-5d45fc8cc9-q4wxq                           1/1     Running     0          82s
  • Exec into the elastic-agent container and run the elastic-agent diagnostics command
    kubectl -n kube-system exec agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s -c agent -- elastic-agent diagnostics -f /tmp/diag.zip
    Created diagnostics archive "/tmp/diag.zip"
    ***** WARNING *****
    Created archive may contain plain text credentials.
    Ensure that files in archive are redacted before sharing.
    *******************
  • Copy the diagnostics bundle to the host
    kubectl cp -n kube-system -c agent agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s:/tmp/diag.zip ./scratch/diag.zip
  • Extract the diagnostics bundle, then extract the nested k8s diagnostics archive
    unzip -d diag ./diag.zip
    Archive:  ./diag.zip
      inflating: diag/version.txt
      inflating: diag/package.version
      inflating: diag/goroutine.pprof.gz
      ... more files here ...
      inflating: diag/elastic-agent-k8s.zip  # <--- new file added by this PR
      ... more files here ...
      creating: diag/logs/
      creating: diag/logs/data/
      inflating: diag/logs/data/elastic-agent-20250725.ndjson
    unzip -d diag/elastic-agent-k8s ./diag/elastic-agent-k8s.zip
    Archive:  ./diag/elastic-agent-k8s.zip
      creating: diag/elastic-agent-k8s/cgroup/
      inflating: diag/elastic-agent-k8s/cgroup/memory.events
      inflating: diag/elastic-agent-k8s/cgroup/memory.high
      inflating: diag/elastic-agent-k8s/cgroup/memory.low
      inflating: diag/elastic-agent-k8s/cgroup/memory.max
      inflating: diag/elastic-agent-k8s/cgroup/memory.min
      inflating: diag/elastic-agent-k8s/cgroup/memory.stat
      creating: diag/elastic-agent-k8s/k8s/
      inflating: diag/elastic-agent-k8s/k8s/deployment-agent-clusterwide-elastic-agent.yaml
      inflating: diag/elastic-agent-k8s/k8s/elastic-agent-9.2.0-beta.tgz
      inflating: diag/elastic-agent-k8s/k8s/leases.yaml
      creating: diag/elastic-agent-k8s/k8s/logs/
      inflating: diag/elastic-agent-k8s/k8s/logs/agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s-agent-current.log
      inflating: diag/elastic-agent-k8s/k8s/pod-agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s.yaml
      inflating: diag/elastic-agent-k8s/k8s/replicaset-agent-clusterwide-elastic-agent-66cb9c54b7.yaml
      inflating: diag/elastic-agent-k8s/k8s/values.yaml
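The cgroup/ entries above are read from the container's own cgroup; with cgroup v2 and a cgroup namespace those files appear directly under /sys/fs/cgroup. A minimal sketch of that collection step, assuming the helper name and the skip-on-missing behavior (neither is taken from the PR):

    package diagnostics

    import (
        "archive/zip"
        "os"
        "path/filepath"
    )

    // dumpCgroupMemory copies the cgroup v2 memory files into the archive.
    // Files that do not exist (e.g. on a cgroup v1 host) are skipped so the
    // rest of the bundle is still produced.
    func dumpCgroupMemory(zw *zip.Writer) error {
        names := []string{
            "memory.events", "memory.high", "memory.low",
            "memory.max", "memory.min", "memory.stat",
        }
        for _, name := range names {
            data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", name))
            if err != nil {
                continue // file absent: skip gracefully
            }
            w, err := zw.Create("cgroup/" + name)
            if err != nil {
                return err
            }
            if _, err := w.Write(data); err != nil {
                return err
            }
        }
        return nil
    }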
    

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jul 23, 2025
mergify bot (Contributor) commented Jul 23, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label that automatically backports to the 8./d branch, where /d is the minor version digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pchila force-pushed the k8s-diagnostics-spacetime branch from 2a07a04 to b88d866 on July 24, 2025 07:59
@pchila force-pushed the k8s-diagnostics-spacetime branch from 9f05aa5 to 0f17d47 on July 25, 2025 08:12
@elasticmachine (Collaborator) commented

💚 Build Succeeded


cc @pkoutsovasilis @pchila

@pchila pchila marked this pull request as ready for review July 25, 2025 10:26
@pchila pchila requested a review from a team as a code owner July 25, 2025 10:26
@elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila pchila requested a review from cmacknz July 25, 2025 10:28
Review comment on deploy/helm/elastic-agent/values.schema.json:
@@ -1507,6 +1507,9 @@
"clusterRole": {
"$ref": "#/definitions/AgentPresetClusterRole"
},
"role": {
@swiatekm (Contributor) commented

Are all of these Helm Chart changes necessary in this PR?

@pkoutsovasilis (Contributor) commented Jul 29, 2025

oh @swiatekm you are probably onto something here, tell me which changes you think can go away? 🙂

@swiatekm (Contributor) commented

Well, I'm mostly asking for the Helm Chart changes to be its own PR, because this one is already quite big, and giving agent permissions by default is something we should more carefully review. Especially if it involves permissions to read Secrets in the namespace.

A contributor commented

We can merge the diagnostics independently of the permissions, as they need to work gracefully even in their absence.
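A minimal sketch of that graceful-degradation idea, assuming a hypothetical writeErrorFile helper (not the PR's actual code): a Forbidden error from the API server is recorded inside the archive instead of aborting the whole bundle.

    package diagnostics

    import (
        "archive/zip"
        "context"
        "fmt"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // collectPodManifest tolerates missing RBAC permissions: a Forbidden
    // response becomes a note in the archive rather than a fatal error.
    func collectPodManifest(ctx context.Context, client kubernetes.Interface, zw *zip.Writer, ns, name string) error {
        pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            if apierrors.IsForbidden(err) {
                return writeErrorFile(zw, "k8s/pod.yaml.err", err)
            }
            return err
        }
        _ = pod // marshal and write the manifest as in the earlier sketch
        return nil
    }

    // writeErrorFile is a hypothetical helper that stores the error text in the archive.
    func writeErrorFile(zw *zip.Writer, path string, err error) error {
        w, cerr := zw.Create(path)
        if cerr != nil {
            return cerr
        }
        _, werr := fmt.Fprintf(w, "failed to collect: %v\n", err)
        return werr
    }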

@pkoutsovasilis (Contributor) commented

So, if git isn't lying to me, we have the following file changes:
NOTICE-fips.txt 873 +++++++++++++++++++++------------
NOTICE.txt 873 +++++++++++++++++++++------------
deploy/helm/elastic-agent/examples/eck/rendered/manifest.yaml 16 +
deploy/helm/elastic-agent/examples/fleet-managed-certificates/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/fleet-managed/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/kubernetes-custom-output/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-default/rendered/manifest.yaml 10 +
.../helm/elastic-agent/examples/kubernetes-hints-autodiscover/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-ksm-sharding/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-onboarding/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-only-logs/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/multiple-integrations/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/netflow-service/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/nginx-custom-integration/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/priority-class/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/statefulset-preset/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/system-custom-auth-paths/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/user-cluster-role/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/user-service-account/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/templates/agent/_helpers.tpl 2 +
deploy/helm/elastic-agent/templates/agent/cluster-role.yaml 1 +
deploy/helm/elastic-agent/templates/agent/eck/daemonset.yaml 7 +
deploy/helm/elastic-agent/templates/agent/eck/deployment.yaml 7 +
deploy/helm/elastic-agent/templates/agent/eck/statefulset.yaml 7 +
deploy/helm/elastic-agent/templates/agent/k8s/daemonset.yaml 4 +
deploy/helm/elastic-agent/templates/agent/k8s/deployment.yaml 4 +
deploy/helm/elastic-agent/templates/agent/k8s/statefulset.yaml 4 +
deploy/helm/elastic-agent/templates/agent/role-binding.yaml 38 ++
deploy/helm/elastic-agent/templates/agent/role.yaml 37 ++
deploy/helm/elastic-agent/values.schema.json 73 +++
go.mod 2 +-
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go 19 +-
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics_test.go 4 +-
internal/pkg/agent/cmd/run.go 2 +-
internal/pkg/diagnostics/diagnostics.go 10 +-
internal/pkg/diagnostics/diagnostics_k8s.go 579 ++++++++++++++++++++++
internal/pkg/diagnostics/diagnostics_k8s_test.go 1060 ++++++++++++++++++++++++++++++++++++++++
internal/pkg/diagnostics/diagnostics_test.go 34 +-
internal/pkg/diagnostics/testdata/helm.release.v1.secret.data 1 +
internal/pkg/diagnostics/testdata/helm.release.v2.secret.data 1 +
pkg/control/v2/server/server.go 18 +-
testing/integration/k8s/common.go 94 ++++
testing/integration/k8s/kubernetes_agent_standalone_test.go 88 ++++

So, with some calculations, 3032 of those changes come from generated files and testing code.

Now, if we split the diagnostics and the Helm chart changes into separate PRs, this one shrinks by only ~500 changes, which isn't that dramatic a difference; that said, no problem, just say the word and I'll split this PR in two 🙂
