
Agent Substrate subchart: atenet dns-controller hardcodes atenet-router/ate-system, so actor DNS never programs and agents never become Ready #2104

Description

@jmunozro

Summary

When Agent Substrate is installed via the subchart path (substrate.enabled=true on kagent, OSS 0.9.10, substrate fork v0.0.6) into any namespace other than ate-system, the control plane installs and reports healthy, but no agent can ever run. SandboxAgent stays Ready=False ("ActorTemplate golden snapshot is not ready"), and A2A calls fail with "not reachable via atenet-router: actor status Resuming: context deadline exceeded".

Root cause: the atenet dns-controller (the dns subcommand of the atenet image, running as a sidecar in the *-dns Deployment) looks up the atenet-router Service by the hardcoded standalone name atenet-router in the hardcoded namespace ate-system. As a subchart the Service is release-prefixed (<release>-atenet-router) and lives in the release namespace, so the lookup fails forever and actor DNS is never programmed. This is the same family as #2092 (the Redis address), but unlike that one there is no values workaround.

Evidence (kagent 0.9.10, release kagent, namespace kagent)

kubectl logs deploy/kagent-dns -c dns-controller repeats every 10s:

ERROR  Error during DNS reconciliation
error: failed to get atenet-router service: services "atenet-router" is forbidden:
  User "system:serviceaccount:kagent:kagent-atenet-dns" cannot get resource "services"
  in API group "" in the namespace "ate-system"

Two things are wrong in that one lookup:

  1. Service name — it queries atenet-router, but the actual Service is kagent-atenet-router (the chart names it via substrate.fullname, which prefixes with the release name whenever Release.Name != Chart.Name).
  2. Namespace — it queries ate-system, but the release namespace is kagent. The kagent-atenet-dns ServiceAccount only has a Role in kagent (and kube-system), so it's also RBAC-forbidden in ate-system.
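
The naming mismatch can be illustrated with a small Go sketch of the prefixing rule as this issue describes it. The function `componentName` is hypothetical, and the chart name `substrate` is an assumption; this is not the chart's actual substrate.fullname template, only the behavior observed above.

```go
package main

import "fmt"

// componentName mirrors the naming rule described in this issue: a component
// keeps its bare name when the release is named after the chart, and gets a
// "<release>-" prefix otherwise. It is an illustrative stand-in, not the
// chart's real substrate.fullname helper.
func componentName(releaseName, chartName, component string) string {
	if releaseName == chartName {
		return component
	}
	return releaseName + "-" + component
}

func main() {
	// Release named after the chart (chart name "substrate" is assumed):
	// the Service keeps the bare name the dns-controller expects.
	fmt.Println(componentName("substrate", "substrate", "atenet-router"))

	// Subchart install under release "kagent": the Service is prefixed,
	// so a lookup for the bare name cannot match it.
	fmt.Println(componentName("kagent", "substrate", "atenet-router"))
}
```

Under these assumptions the second call yields kagent-atenet-router, which is the Service that actually exists in the kagent namespace, while the controller keeps asking for atenet-router.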

The dns-controller container has no env and no flags for the router service name/namespace (args are only dns --log-level=debug --interval=10s --corefile-path=…), and the substrate chart exposes no values for it (only atelet has extraArgs/extraEnv). So it cannot be corrected from Helm.

Downstream effects (all symptoms of the above)

  • Actor names (e.g. hello-substrate.kagent) never get DNS records → gRPC dials fail with name resolver error: produced zero addresses.
  • ate-controller logs: while creating golden actor: rpc error: code = Unavailable desc = name resolver error: produced zero addresses (every reconcile).
  • The worker pod (ateom) logs only "ateom booting" and never boots the actor.
  • SandboxAgent Ready=False :: ActorTemplate golden snapshot is not ready.
  • Controller A2A → failed to send HTTP request: Post "http://hello-substrate.kagent:8080": substrate session actor … not reachable via atenet-router: actor status Resuming: context deadline exceeded.

Reproduce

helm upgrade --install kagent-crds oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds --version 0.9.10 \
  -n kagent --create-namespace --set substrate.enabled=true
helm upgrade --install kagent oci://ghcr.io/kagent-dev/kagent/helm/kagent --version 0.9.10 \
  -n kagent --set substrate.enabled=true \
  --set substrate.redis.clusterAddress=kagent-valkey-cluster.kagent.svc:6379 \
  --set controller.substrate.enabled=true \
  --set controller.substrate.ateApiEndpoint=dns:///kagent-api.kagent.svc:443 \
  --set controller.substrate.ateApiInsecure=true \
  --set substrateWorkerPool.create=true --set substrateWorkerPool.name=kagent-default \
  --set substrateWorkerPool.ateomImage=ghcr.io/kagent-dev/substrate/ateom-gvisor:v0.0.6 \
  --set providers.openAI.apiKey="$OPENAI_API_KEY"

kubectl logs -n kagent deploy/kagent-dns -c dns-controller   # → "atenet-router … in namespace ate-system" forbidden, forever

(The substrate.redis.clusterAddress override above is the #2092 workaround, needed just to get ate-api up.)

Suggested fix

In the atenet dns controller, resolve the atenet-router Service using the release namespace and the release-prefixed Service name (the same substrate.fullname the chart uses), or expose both as flags/env that the chart wires from .Release.Namespace + the rendered Service name. The kagent-atenet-dns Role/RoleBinding must also cover whatever namespace it ends up querying. More broadly, every substrate component that addresses a sibling by a fixed ate-system/unprefixed name has this latent bug on the subchart path (see also #2092 for Redis).
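
One possible shape for the flags/env option, as a minimal sketch assuming the controller is written in Go. The flag names (`--router-service-name`, `--router-service-namespace`) and environment variables (`ATENET_ROUTER_SERVICE_NAME`, `ATENET_ROUTER_SERVICE_NAMESPACE`) are hypothetical and do not exist in atenet today. The namespace fallback reads the file Kubernetes mounts alongside the ServiceAccount token, so the controller defaults to its own namespace before falling back to ate-system.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// RouterRef identifies the Service the dns-controller should look up.
type RouterRef struct {
	Name      string
	Namespace string
}

// Kubernetes mounts the pod's own namespace here with the ServiceAccount token.
const saNamespaceFile = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveRouterRef applies the precedence flag > env > default.
// All flag and env names are hypothetical. A real binary would register
// these flags on its existing flag set next to --log-level, --interval, etc.
func resolveRouterRef(
	args []string,
	getenv func(string) string,
	readFile func(string) ([]byte, error),
) (RouterRef, error) {
	fs := flag.NewFlagSet("dns", flag.ContinueOnError)
	name := fs.String("router-service-name", "", "name of the atenet-router Service")
	ns := fs.String("router-service-namespace", "", "namespace of the atenet-router Service")
	if err := fs.Parse(args); err != nil {
		return RouterRef{}, err
	}

	ref := RouterRef{Name: *name, Namespace: *ns}
	if ref.Name == "" {
		ref.Name = getenv("ATENET_ROUTER_SERVICE_NAME")
	}
	if ref.Name == "" {
		ref.Name = "atenet-router" // current hardcoded standalone name
	}
	if ref.Namespace == "" {
		ref.Namespace = getenv("ATENET_ROUTER_SERVICE_NAMESPACE")
	}
	if ref.Namespace == "" {
		// Prefer the namespace the controller itself runs in.
		if b, err := readFile(saNamespaceFile); err == nil {
			ref.Namespace = strings.TrimSpace(string(b))
		}
	}
	if ref.Namespace == "" {
		ref.Namespace = "ate-system" // current hardcoded standalone namespace
	}
	return ref, nil
}

func main() {
	// Values the chart could render from the Service name and .Release.Namespace.
	args := []string{
		"--router-service-name=kagent-atenet-router",
		"--router-service-namespace=kagent",
	}
	ref, err := resolveRouterRef(args, os.Getenv, os.ReadFile)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s/%s\n", ref.Namespace, ref.Name)
}
```

Keeping the current values as the last-resort defaults would leave standalone installs into ate-system unchanged, while the subchart path only needs the chart to pass the two rendered values.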

Environment
