Summary
When Agent Substrate is installed via the subchart path (substrate.enabled=true on kagent, OSS 0.9.10, substrate fork v0.0.6) into any namespace other than ate-system, the control plane installs and reports healthy, but no agent can ever run. SandboxAgent stays Ready=False ("ActorTemplate golden snapshot is not ready") and A2A calls fail with "not reachable via atenet-router: actor status Resuming: context deadline exceeded".
Root cause: the atenet dns-controller (the dns subcommand of the atenet image, running as a sidecar in the *-dns Deployment) looks up the atenet-router Service by the hardcoded standalone name atenet-router in the hardcoded namespace ate-system. As a subchart the Service is release-prefixed (<release>-atenet-router) and lives in the release namespace, so the lookup fails forever and actor DNS is never programmed. This is the same family as #2092 (the Redis address), but unlike that one there is no values workaround.
Evidence (kagent 0.9.10, release kagent, namespace kagent)
kubectl logs deploy/kagent-dns -c dns-controller repeats every 10s:
ERROR Error during DNS reconciliation
error: failed to get atenet-router service: services "atenet-router" is forbidden:
User "system:serviceaccount:kagent:kagent-atenet-dns" cannot get resource "services"
in API group "" in the namespace "ate-system"
Two things are wrong in that one line:
- Service name: it queries atenet-router, but the actual Service is kagent-atenet-router (the chart names it via substrate.fullname, which prefixes with the release name whenever Release.Name != Chart.Name).
- Namespace: it queries ate-system, but the release namespace is kagent. The kagent-atenet-dns ServiceAccount only has a Role in kagent (and kube-system), so it's also RBAC-forbidden in ate-system.
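The naming mismatch can be illustrated with a minimal Go sketch. It assumes the prefixing rule exactly as described above (prefix with the release name unless it equals the chart name) and assumes the subchart is named "substrate"; componentName is a hypothetical helper, not the chart's actual template.

```go
package main

import "fmt"

// componentName mirrors the naming behavior described in this report:
// the component name is prefixed with the release name unless the release
// name equals the chart name. Illustrative assumption, not the chart's
// real substrate.fullname template.
func componentName(releaseName, chartName, component string) string {
	if releaseName == chartName {
		return component
	}
	return releaseName + "-" + component
}

func main() {
	// Subchart install from this report: release "kagent", chart assumed
	// to be "substrate".
	fmt.Println(componentName("kagent", "substrate", "atenet-router"))
	// Standalone install where the release is named after the chart.
	fmt.Println(componentName("substrate", "substrate", "atenet-router"))
}
```

Under those assumptions the first call yields kagent-atenet-router and the second yields atenet-router, which is the only name the dns-controller currently looks for.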
The dns-controller container has no env and no flags for the router service name/namespace (args are only dns --log-level=debug --interval=10s --corefile-path=…), and the substrate chart exposes no values for it (only atelet has extraArgs/extraEnv). So it cannot be corrected from Helm.
Downstream effects (all symptoms of the above)
- Actor names (e.g. hello-substrate.kagent) never get DNS records → gRPC dials fail with "name resolver error: produced zero addresses".
- ate-controller logs: "while creating golden actor: rpc error: code = Unavailable desc = name resolver error: produced zero addresses" (every reconcile).
- The worker pod (ateom) logs only "ateom booting" and never boots the actor.
- SandboxAgent → Ready=False :: ActorTemplate golden snapshot is not ready.
- Controller A2A → failed to send HTTP request: Post "http://hello-substrate.kagent:8080": substrate session actor … not reachable via atenet-router: actor status Resuming: context deadline exceeded.
Reproduce
helm upgrade --install kagent-crds oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds --version 0.9.10 \
  -n kagent --create-namespace --set substrate.enabled=true
helm upgrade --install kagent oci://ghcr.io/kagent-dev/kagent/helm/kagent --version 0.9.10 \
  -n kagent --set substrate.enabled=true \
  --set substrate.redis.clusterAddress=kagent-valkey-cluster.kagent.svc:6379 \
  --set controller.substrate.enabled=true \
  --set controller.substrate.ateApiEndpoint=dns:///kagent-api.kagent.svc:443 \
  --set controller.substrate.ateApiInsecure=true \
  --set substrateWorkerPool.create=true --set substrateWorkerPool.name=kagent-default \
  --set substrateWorkerPool.ateomImage=ghcr.io/kagent-dev/substrate/ateom-gvisor:v0.0.6 \
  --set providers.openAI.apiKey="$OPENAI_API_KEY"
kubectl logs -n kagent deploy/kagent-dns -c dns-controller # → "atenet-router … in namespace ate-system" forbidden, forever
(The substrate.redis.clusterAddress override above is the #2092 workaround, needed just to get ate-api up.)
Suggested fix
In the atenet dns controller, resolve the atenet-router Service using the release namespace and the release-prefixed Service name (the same substrate.fullname the chart uses), or expose both as flags/env that the chart wires from .Release.Namespace + the rendered Service name. The kagent-atenet-dns Role/RoleBinding must also cover whatever namespace it ends up querying. More broadly, every substrate component that addresses a sibling by a fixed ate-system/unprefixed name has this latent bug on the subchart path (see also #2092 for Redis).
Environment
- kagent OSS 0.9.10
- substrate fork v0.0.6 (ghcr.io/kagent-dev/substrate/*:v0.0.6)
- substrate.enabled=true, release kagent, namespace kagent (kind)