
Passive healthchecking/retries don't alleviate race condition on Pod deletion #4997

Open · coro opened this issue Jan 2, 2025 · 2 comments

coro (Contributor) commented Jan 2, 2025

Description:
During periods of Pod churn (especially around Spot terminations in AWS EC2, for example), the EndpointSlice in k8s is updated to remove the IPs of Pods that have gone away as a result of Pod deletion. There is a race condition (see this comment) between this update and Envoy being updated via xDS to remove the Pod IP from the set of possible backends for its routes.

It is not currently possible to work around this race condition with passive healthchecking and retryOn policies via a BackendTrafficPolicy.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: evict-deleted-backends
  namespace: kube-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway
  healthCheck:
    passive:
      baseEjectionTime: 10s
      interval: 2s
      maxEjectionPercent: 40
      consecutive5XxErrors: 1
  retry:
    numRetries: 2
    perRetry:
      backOff:
        baseInterval: 500ms
        maxInterval: 2000ms
      timeout: 1000ms
    retryOn:
      httpStatusCodes:
        - 500
      triggers:
        - connect-failure
        - retriable-status-codes

See the attached screenshot of a packet capture taken on our Envoy Pod:
[Screenshot of packet capture, 2025-01-02 at 15:28:04]
The IPs in question are:

  • 10.12.44.215: client (with a 5s TCP idle timeout configured)
  • 10.12.46.156: Envoy Pod
  • 10.12.106.76: backend target server, reached via an HTTPRoute on our Gateway

You can see the following flow in this screenshot:

  • Packets 920 & 928: A successful HTTP flow between client & Envoy
  • Packets 994 & 1064: The backend server Pods are deleted and send TCP RSTs to close the connections between Envoy and the backend server Pods
  • Packet 1084: The client sends another HTTP request to Envoy
  • Packet 1085: Envoy attempts to open a new TCP connection to the backend (TCP SYN), even though that Pod is already going down
  • Packets 1841 & 2931: Envoy retransmits the TCP SYN after 0.5s and then 2s
  • Packet 3262: The client hits its 5-second idle timeout and closes the connection to Envoy with a TCP FIN

The HTTPRoute points at a Kubernetes Service for our backend server with 3 replicas, one in each AZ. In theory the client should not need to know which backend servers are available, and should be able to trust Envoy to handle the backend pool and route around a single backend server being down.
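For reference, the routing setup is roughly the following (a sketch only; the Service and HTTPRoute names here are illustrative, not our real manifests):

apiVersion: v1
kind: Service
metadata:
  name: backend-server        # illustrative name
  namespace: kube-system
spec:
  selector:
    app: backend-server
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend-route         # illustrative name
  namespace: kube-system
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - backendRefs:
        - name: backend-server
          port: 80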

Workaround:
It is possible to work around this issue by adding a preStop lifecycle hook to the target backend server:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: application
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - '-c'
                  - sleep 25

Deleting the Pod immediately puts it into the Terminating state, which removes its IP from the EndpointSlice as soon as the preStop hook starts, but the underlying Pod keeps running for the duration of the sleep 25. This gives EG enough time to remove the Pod IP from xDS before it actually goes away.
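Note that the Pod's terminationGracePeriodSeconds has to be longer than the preStop sleep (the default 30s covers the sleep 25 above); a sketch making that explicit:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      # The grace period countdown includes the preStop hook,
      # so it must exceed the sleep (the default is already 30s).
      terminationGracePeriodSeconds: 30
      containers:
        - name: application
          lifecycle:
            preStop:
              exec:
                command: ['/bin/sh', '-c', 'sleep 25']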

Expectation:

I would like EG to fall back to a remaining backend endpoint when one of them is going down. In the packet capture we see Envoy retrying the same backend that has already closed its connections and gone down. We thought that the passive healthchecking and retries would enable this, but instead it seems the retry applies only to the one TCP stream, and the passive healthchecking does not solve our issue.

We would like to set config at the Gateway level to bypass this, rather than having to set preStop hooks on every backend server.

Environment:
EG: 1.2.4
Envoy: envoyproxy/envoy:distroless-v1.32.1
K8s: 1.31

Logs:
No logs were emitted during this occurrence.

cc @evilr00t @sam-burrell

coro added the triage label Jan 2, 2025
arkodg (Contributor) commented Jan 2, 2025

@coro what's the rollout strategy for your backend? It should be maxUnavailable: 0 for hitless upgrades
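For reference, that suggestion corresponds to something like this on the backend Deployment (a sketch; the maxSurge value is illustrative):

apiVersion: apps/v1
kind: Deployment
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take a replica down before its replacement is Ready
      maxSurge: 1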

coro (Contributor, Author) commented Jan 6, 2025

Hey @arkodg, thanks for this. From our experimentation, maxUnavailable doesn't have any impact on this timing window. The Pods still receive SIGTERM at the same time that they are removed from the EndpointSlice, and there is still a window during which Envoy Gateway has not picked up that the IP of the Pod that has just been SIGTERMed should no longer receive traffic. If a client sends a request via an HTTPRoute to Envoy during that window, Envoy will try to connect to the IP that is no longer serving traffic (and you can see it doing TCP retransmissions when it gets no response).

In our case, our clients have a 5 second timeout on their requests. We actually managed to use this to solve our issue:

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout
  namespace: kube-system
spec:
  timeout:
    tcp:
      connectTimeout: 4s
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway

We didn't need retryOn or a passive healthcheck, only a backend TCP connect timeout that was shorter than the clients' timeout. In this scenario, we can see that Envoy tries to connect to the backend, retransmits a couple of times, then times out and tries a different backend Pod's IP, successfully returning a response to the client, which is exactly what we were hoping for.
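In case it's useful to others, a single BackendTrafficPolicy combining the connect timeout with the connect-failure retry trigger from the original policy would look roughly like this (a sketch only; we currently run with just the timeout shown above):

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeout-and-retry   # illustrative name
  namespace: kube-system
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway
  timeout:
    tcp:
      connectTimeout: 4s
  retry:
    numRetries: 2
    retryOn:
      triggers:
        - connect-failure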
