
Passive healthchecking/retries don't alleviate race condition on Pod deletion #4997

Open · coro opened this issue Jan 2, 2025 · 2 comments

coro (Contributor) commented Jan 2, 2025

Description:
During periods of Pod churn (especially around Spot terminations in AWS EC2, for example), the EndpointSlice in k8s is updated to remove the IPs of Pods that have gone away as a result of Pod deletion. There is a race condition (see this comment) between this update and Envoy being updated via xDS to remove the Pod IP from the set of possible backends for its routes.

It is not currently possible to work around this race condition with passive healthchecking and retryOn policies via a BackendTrafficPolicy.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: evict-deleted-backends
  namespace: kube-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway
  healthCheck:
    passive:
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 40
      consecutive5XxErrors: 1
  retry:
    numRetries: 2
    perRetry:
      backOff:
        baseInterval: 500ms
        maxInterval: 2000ms
      timeout: 1000ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes

See the attached screenshot of a packet capture taken on our Envoy Pod:
[Screenshot of packet capture, 2025-01-02 at 15:28:04]
The IPs in question are:

  • 10.12.44.215: client (with a 5s TCP idle timeout configured)
  • 10.12.46.156: Envoy Pod
  • 10.12.106.76: backend target server, reached via an HTTPRoute on our Gateway

You can see the following flow in this screenshot:

  • Packets 920 & 928: A successful HTTP flow between client & Envoy
  • Packets 994 & 1064: The backend server Pods are deleted and send TCP RSTs to close the connections between Envoy and the backend server Pods
  • Packet 1084: The client sends another HTTP request to Envoy
  • Packet 1085: Envoy attempts to open a new TCP connection to the backend (TCP SYN), even though that Pod is already going down
  • Packets 1841 & 2931: Envoy retransmits the TCP SYN after 0.5s and then 2s
  • Packet 3262: The client hits its 5-second idle timeout and closes the connection to Envoy with a TCP FIN

The HTTPRoute points at a Kubernetes Service for our backend server with 3 replicas, one in each AZ. In theory the client should not need to know which backend servers are available, and should be able to trust Envoy to handle the backend pool and route around a single backend server being down.
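For reference, the routing setup is roughly the following (a sketch only; the Service and HTTPRoute names here are illustrative, not our real manifests):

apiVersion: v1
kind: Service
metadata:
  name: backend-server        # illustrative name
  namespace: kube-system
spec:
  selector:
    app: backend-server
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend-route         # illustrative name
  namespace: kube-system
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - backendRefs:
        - name: backend-server
          port: 80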

Workaround:
It is possible to work around this issue by adding a preStop lifecycle hook to the target backend server:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: application
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - sleep 25

Deleting the Pod immediately puts it into the Terminating state, which removes its IP from the EndpointSlice as soon as the preStop hook starts, but the underlying Pod keeps running for the duration of the sleep 25. This gives EG enough time to remove the Pod IP from xDS before it actually goes away.
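Note that the Pod's terminationGracePeriodSeconds has to be longer than the preStop sleep (the default 30s covers the sleep 25 above); a sketch making that explicit:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      # The grace period countdown includes the preStop hook,
      # so it must exceed the sleep (the default is already 30s).
      terminationGracePeriodSeconds: 30
      containers:
        - name: application
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 25']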

Expectation:

I would like EG to fall back to a remaining backend endpoint when one of them is going down. In the packet capture we see Envoy retrying the same backend that has already closed its connections and gone down. We thought that the passive healthchecking and retries would enable this, but instead it seems the retry applies only to the one TCP stream, and the passive healthchecking does not solve our issue.

We would like to set config at the Gateway level to bypass this, rather than having to set preStop hooks on every backend server.

Environment:
EG: 1.2.4
Envoy: envoyproxy/envoy:distroless-v1.32.1
K8s: 1.31

Logs:
No logs were emitted during this occurrence.

cc @evilr00t @sam-burrell

coro added the triage label Jan 2, 2025
arkodg (Contributor) commented Jan 2, 2025

@coro what's the rollout strategy for your backend? It should be maxUnavailable: 0 for hitless upgrades
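For reference, that suggestion corresponds to something like this on the backend Deployment (a sketch; the maxSurge value is illustrative):

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a replica down before its replacement is Ready
      maxSurge: 1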

coro (Contributor, Author) commented Jan 6, 2025

Hey @arkodg, thanks for this. From our experimentation, maxUnavailable doesn't have any impact on this timing window. The Pods still receive SIGTERM at the same time that they are removed from the EndpointSlice, and there is still a window during which Envoy Gateway has not picked up that the IP of the Pod that has just been SIGTERMed should no longer receive traffic. If a client sends a request via an HTTPRoute to Envoy during that window, Envoy will try to connect to the IP that is no longer serving traffic (and you can see it doing TCP retransmissions when it gets no response).

In our case, our clients have a 5 second timeout on their requests. We actually managed to use this to solve our issue:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
  namespace: kube-system
spec:
  timeout:
    tcp:
      connectTimeout: 4s
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway

We didn't need retryOn or a passive healthcheck, only a backend TCP connect timeout that was shorter than the clients' timeout. In this scenario, we can see that Envoy tries to connect to the backend, retransmits a couple of times, then times out and tries a different backend Pod's IP, successfully returning a response to the client, which is exactly what we were hoping for.
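In case it's useful to others, a single BackendTrafficPolicy combining the connect timeout with the connect-failure retry trigger from the original policy would look roughly like this (a sketch only; we currently run with just the timeout shown above):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout-and-retry   # illustrative name
  namespace: kube-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway
  timeout:
    tcp:
      connectTimeout: 4s
  retry:
    numRetries: 2
    retryOn:
      triggers:
        - connect-failure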
