Add K8s-specific files to elastic-agent diagnostics bundle #9103


Open

@pchila wants to merge 31 commits into main from k8s-diagnostics-spacetime
Conversation

@pchila (Member) commented Jul 23, 2025

What does this PR do?

This PR adds kubernetes data to the elastic-agent diagnostics bundle, specifically:

  • pod/replicaset/deployment/daemonset/statefulset k8s manifests
  • helm chart manifest and user values
  • namespace leases dump
  • current and previous (if the agent restarted) elastic-agent container logs
  • cgroup memory stats
  • kubernetes events for the elastic-agent pod

The new diagnostics files are compressed into a .zip archive, elastic-agent-k8s.zip, nested within the elastic-agent diagnostics bundle.
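For illustration, here is a minimal sketch of the general approach, not the PR's actual implementation (which lives in internal/pkg/diagnostics/diagnostics_k8s.go and covers far more resources): fetch the agent's own pod manifest with client-go and write it into the nested archive. The downward-API env var names and the output path are assumptions.

    package main

    import (
        "archive/zip"
        "context"
        "fmt"
        "log"
        "os"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "sigs.k8s.io/yaml"
    )

    func main() {
        // The agent runs inside a pod, so the in-cluster config applies.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // Assumed to be injected via the downward API; the names are illustrative.
        ns, name := os.Getenv("POD_NAMESPACE"), os.Getenv("POD_NAME")

        pod, err := client.CoreV1().Pods(ns).Get(context.Background(), name, metav1.GetOptions{})
        if err != nil {
            log.Fatal(err)
        }
        manifest, err := yaml.Marshal(pod)
        if err != nil {
            log.Fatal(err)
        }

        // Write the manifest into the nested zip, mirroring the k8s/ layout
        // seen in the extraction example further down.
        out, err := os.Create("elastic-agent-k8s.zip")
        if err != nil {
            log.Fatal(err)
        }
        defer out.Close()
        zw := zip.NewWriter(out)
        defer zw.Close()
        w, err := zw.Create(fmt.Sprintf("k8s/pod-%s.yaml", name))
        if err != nil {
            log.Fatal(err)
        }
        if _, err := w.Write(manifest); err != nil {
            log.Fatal(err)
        }
    }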

Why is it important?

This should reduce initial investigation time for elastic-agent issues on k8s by collecting as much information as possible about the affected agents, alongside the usual diagnostics files.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

How to test this PR locally

  • Build elastic-agent docker image for your CPU architecture from this PR
    SNAPSHOT=true EXTERNAL=true PACKAGES=docker DOCKER_VARIANTS=basic  PLATFORMS="linux/amd64" mage -v clean package
  • Load the new docker image on your target k8s cluster. This step is slightly different depending on the k8s cluster:
    • k3d example:
      ➜  elastic-agent git:(k8s-diagnostics-spacetime) ✗ k3d image import docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT
      INFO[0000] Importing image(s) into cluster 'k3s-default'
      INFO[0000] Starting new tools node...
      INFO[0000] Starting node 'k3d-k3s-default-tools'
      INFO[0000] Saving 1 image(s) from runtime...
      INFO[0010] Importing images into nodes...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-server-0'...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-agent-1'...
      INFO[0010] Importing images from tarball '/k3d/images/k3d-k3s-default-images-20250725093558.tar' into node 'k3d-k3s-default-agent-0'...
      INFO[0025] Removing the tarball(s) from image volume...
      INFO[0026] Removing k3d-tools node...
      INFO[0026] Successfully imported image(s)
      INFO[0026] Successfully imported 1 image(s) into 1 cluster(s)
    • kind example
      ➜  elastic-agent git:(k8s-diagnostics-spacetime) ✗ kind load docker-image docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT
      Image: "docker.elastic.co/elastic-agent/elastic-agent:9.2.0-SNAPSHOT" with ID "sha256:1914d96cd13b60a50948d913824679090605298d85e49987c1304c4856670cb9" not yet present on node "kind-control-plane", loading...
  • Install elastic-agent using the included helm chart (this is not strictly necessary but it's a convenient way to demo most of the enhancements):
    helm install -n kube-system elastic-agent ./deploy/helm/elastic-agent --set agent.presets.perNode.role.create=true --set agent.presets.clusterWide.role.create=true
    NAME: elastic-agent
    LAST DEPLOYED: Fri Jul 25 09:29:17 2025
    NAMESPACE: kube-system
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    Release "elastic-agent" is installed at "kube-system" namespace
    
    Installed agents:
      - clusterWide [deployment - standalone mode]
      - perNode [daemonset - standalone mode]
    
    Installed kube-state-metrics at "kube-system" namespace.
    
    Installed integrations:
      - kubernetes [built-in chart integration]
    
    👀 Make sure you have installed the corresponding assets in Kibana for all the above integrations!
  • Check that the elastic-agent pods are correctly created and running
    kubectl -n kube-system get pods
    NAME                                               READY   STATUS      RESTARTS   AGE
    agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s   1/1     Running     0          48s
    agent-pernode-elastic-agent-4hlgd                  1/1     Running     0          48s
    agent-pernode-elastic-agent-8pqsb                  1/1     Running     0          48s
    agent-pernode-elastic-agent-q8gld                  1/1     Running     0          48s
    coredns-ccb96694c-xvc24                            1/1     Running     0          2m7s
    helm-install-traefik-5lkd6                         0/1     Completed   2          2m7s
    helm-install-traefik-crd-qv69z                     0/1     Completed   0          2m7s
    kube-state-metrics-7f9d79b77d-7s5l5                1/1     Running     0          48s
    local-path-provisioner-5cf85fd84d-mnw7q            1/1     Running     0          2m7s
    metrics-server-5985cbc9d7-mffhf                    1/1     Running     0          2m7s
    svclb-traefik-f9913214-drj2p                       2/2     Running     0          82s
    svclb-traefik-f9913214-kvhw7                       2/2     Running     0          82s
    svclb-traefik-f9913214-smxpm                       2/2     Running     0          82s
    traefik-5d45fc8cc9-q4wxq                           1/1     Running     0          82s
  • Exec into the elastic-agent container and run the elastic-agent diagnostics command
    kubectl -n kube-system exec agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s -c agent -- elastic-agent diagnostics -f /tmp/diag.zip
    Created diagnostics archive "/tmp/diag.zip"
    ***** WARNING *****
    Created archive may contain plain text credentials.
    Ensure that files in archive are redacted before sharing.
    *******************
  • Copy the diagnostics bundle to the host
    kubectl cp -n kube-system -c agent agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s:/tmp/diag.zip ./scratch/diag.zip
  • Extract the diagnostics bundle, then extract the nested k8s diagnostics archive
    unzip -d diag ./diag.zip
    Archive:  ./diag.zip
      inflating: diag/version.txt
      inflating: diag/package.version
      inflating: diag/goroutine.pprof.gz
      ... more files here ...
      inflating: diag/elastic-agent-k8s.zip  # <--- new file added by this PR
      ... more files here ...
      creating: diag/logs/
      creating: diag/logs/data/
      inflating: diag/logs/data/elastic-agent-20250725.ndjson
    unzip -d diag/elastic-agent-k8s ./diag/elastic-agent-k8s.zip
    Archive:  ./diag/elastic-agent-k8s.zip
      creating: diag/elastic-agent-k8s/cgroup/
      inflating: diag/elastic-agent-k8s/cgroup/memory.events
      inflating: diag/elastic-agent-k8s/cgroup/memory.high
      inflating: diag/elastic-agent-k8s/cgroup/memory.low
      inflating: diag/elastic-agent-k8s/cgroup/memory.max
      inflating: diag/elastic-agent-k8s/cgroup/memory.min
      inflating: diag/elastic-agent-k8s/cgroup/memory.stat
      creating: diag/elastic-agent-k8s/k8s/
      inflating: diag/elastic-agent-k8s/k8s/deployment-agent-clusterwide-elastic-agent.yaml
      inflating: diag/elastic-agent-k8s/k8s/elastic-agent-9.2.0-beta.tgz
      inflating: diag/elastic-agent-k8s/k8s/leases.yaml
      creating: diag/elastic-agent-k8s/k8s/logs/
      inflating: diag/elastic-agent-k8s/k8s/logs/agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s-agent-current.log
      inflating: diag/elastic-agent-k8s/k8s/pod-agent-clusterwide-elastic-agent-66cb9c54b7-v6k2s.yaml
      inflating: diag/elastic-agent-k8s/k8s/replicaset-agent-clusterwide-elastic-agent-66cb9c54b7.yaml
      inflating: diag/elastic-agent-k8s/k8s/values.yaml
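The cgroup/ entries above are read from the container's own cgroup; with cgroup v2 and a cgroup namespace those files appear directly under /sys/fs/cgroup. A minimal sketch of that collection step, assuming the helper name and the skip-on-missing behavior (neither is taken from the PR):

    package diagnostics

    import (
        "archive/zip"
        "os"
        "path/filepath"
    )

    // dumpCgroupMemory copies the cgroup v2 memory files into the archive.
    // Files that do not exist (e.g. on a cgroup v1 host) are skipped so the
    // rest of the bundle is still produced.
    func dumpCgroupMemory(zw *zip.Writer) error {
        names := []string{
            "memory.events", "memory.high", "memory.low",
            "memory.max", "memory.min", "memory.stat",
        }
        for _, name := range names {
            data, err := os.ReadFile(filepath.Join("/sys/fs/cgroup", name))
            if err != nil {
                continue // file absent: skip gracefully
            }
            w, err := zw.Create("cgroup/" + name)
            if err != nil {
                return err
            }
            if _, err := w.Write(data); err != nil {
                return err
            }
        }
        return nil
    }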
    

Related issues

Questions to ask yourself

  • How are we going to support this in production?
  • How are we going to measure its adoption?
  • How are we going to debug this?
  • What are the metrics I should take care of?
  • ...

@pchila pchila added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team labels Jul 23, 2025
mergify bot (Contributor) commented Jul 23, 2025

This pull request does not have a backport label. Could you fix it @pchila? 🙏
To fix up this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-8./d is the label that automatically backports to the 8./d branch, where /d is the minor version digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pchila force-pushed the k8s-diagnostics-spacetime branch from 2a07a04 to b88d866 on July 24, 2025 07:59
@pchila force-pushed the k8s-diagnostics-spacetime branch from 9f05aa5 to 0f17d47 on July 25, 2025 08:12
@elasticmachine (Collaborator) commented

💚 Build Succeeded


cc @pkoutsovasilis @pchila

@pchila pchila marked this pull request as ready for review July 25, 2025 10:26
@pchila pchila requested a review from a team as a code owner July 25, 2025 10:26
@elasticmachine (Collaborator) commented

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@pchila pchila requested a review from cmacknz July 25, 2025 10:28
Review comment on deploy/helm/elastic-agent/values.schema.json:
@@ -1507,6 +1507,9 @@
"clusterRole": {
"$ref": "#/definitions/AgentPresetClusterRole"
},
"role": {
@swiatekm (Contributor) commented

Are all of these Helm Chart changes necessary in this PR?

@pkoutsovasilis (Contributor) commented Jul 29, 2025

oh @swiatekm you are probably onto something here, tell me which changes you think can go away? 🙂

@swiatekm (Contributor) commented

Well, I'm mostly asking for the Helm Chart changes to be its own PR, because this one is already quite big, and giving agent permissions by default is something we should more carefully review. Especially if it involves permissions to read Secrets in the namespace.

A contributor commented

We can merge the diagnostics independently of the permissions, as they need to work gracefully even in their absence.
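A minimal sketch of that graceful-degradation idea, assuming a hypothetical writeErrorFile helper (not the PR's actual code): a Forbidden error from the API server is recorded inside the archive instead of aborting the whole bundle.

    package diagnostics

    import (
        "archive/zip"
        "context"
        "fmt"

        apierrors "k8s.io/apimachinery/pkg/api/errors"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // collectPodManifest tolerates missing RBAC permissions: a Forbidden
    // response becomes a note in the archive rather than a fatal error.
    func collectPodManifest(ctx context.Context, client kubernetes.Interface, zw *zip.Writer, ns, name string) error {
        pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            if apierrors.IsForbidden(err) {
                return writeErrorFile(zw, "k8s/pod.yaml.err", err)
            }
            return err
        }
        _ = pod // marshal and write the manifest as in the earlier sketch
        return nil
    }

    // writeErrorFile is a hypothetical helper that stores the error text in the archive.
    func writeErrorFile(zw *zip.Writer, path string, err error) error {
        w, cerr := zw.Create(path)
        if cerr != nil {
            return cerr
        }
        _, werr := fmt.Fprintf(w, "failed to collect: %v\n", err)
        return werr
    }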

@pkoutsovasilis (Contributor) commented

So, if git isn't lying to me, we have the following file changes:
NOTICE-fips.txt 873 +++++++++++++++++++++------------
NOTICE.txt 873 +++++++++++++++++++++------------
deploy/helm/elastic-agent/examples/eck/rendered/manifest.yaml 16 +
deploy/helm/elastic-agent/examples/fleet-managed-certificates/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/fleet-managed/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/kubernetes-custom-output/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-default/rendered/manifest.yaml 10 +
.../helm/elastic-agent/examples/kubernetes-hints-autodiscover/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-ksm-sharding/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-onboarding/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/kubernetes-only-logs/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/multiple-integrations/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/netflow-service/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/nginx-custom-integration/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/priority-class/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/examples/statefulset-preset/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/system-custom-auth-paths/rendered/manifest.yaml 5 +
deploy/helm/elastic-agent/examples/user-cluster-role/rendered/manifest.yaml 4 +
deploy/helm/elastic-agent/examples/user-service-account/rendered/manifest.yaml 10 +
deploy/helm/elastic-agent/templates/agent/_helpers.tpl 2 +
deploy/helm/elastic-agent/templates/agent/cluster-role.yaml 1 +
deploy/helm/elastic-agent/templates/agent/eck/daemonset.yaml 7 +
deploy/helm/elastic-agent/templates/agent/eck/deployment.yaml 7 +
deploy/helm/elastic-agent/templates/agent/eck/statefulset.yaml 7 +
deploy/helm/elastic-agent/templates/agent/k8s/daemonset.yaml 4 +
deploy/helm/elastic-agent/templates/agent/k8s/deployment.yaml 4 +
deploy/helm/elastic-agent/templates/agent/k8s/statefulset.yaml 4 +
deploy/helm/elastic-agent/templates/agent/role-binding.yaml 38 ++
deploy/helm/elastic-agent/templates/agent/role.yaml 37 ++
deploy/helm/elastic-agent/values.schema.json 73 +++
go.mod 2 +-
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics.go 19 +-
internal/pkg/agent/application/actions/handlers/handler_action_diagnostics_test.go 4 +-
internal/pkg/agent/cmd/run.go 2 +-
internal/pkg/diagnostics/diagnostics.go 10 +-
internal/pkg/diagnostics/diagnostics_k8s.go 579 ++++++++++++++++++++++
internal/pkg/diagnostics/diagnostics_k8s_test.go 1060 ++++++++++++++++++++++++++++++++++++++++
internal/pkg/diagnostics/diagnostics_test.go 34 +-
internal/pkg/diagnostics/testdata/helm.release.v1.secret.data 1 +
internal/pkg/diagnostics/testdata/helm.release.v2.secret.data 1 +
pkg/control/v2/server/server.go 18 +-
testing/integration/k8s/common.go 94 ++++
testing/integration/k8s/kubernetes_agent_standalone_test.go 88 ++++

So, with some calculations, 3032 of those changes come from generated files and testing code.

Now, if we split the diagnostics and the Helm chart changes into separate PRs, this one shrinks by only ~500 changes, which isn't that dramatic a difference; that said, no problem, just say the word and I'll split this PR in two 🙂
