Description:
During moments of Pod churn (for example around Spot terminations in AWS EC2), the EndpointSlice in k8s is updated to remove the IPs of Pods that have gone away as a result of Pod deletion. There is a race condition (see this comment) between this update and Envoy being updated via xDS to remove the Pod IP from the possible backends of routes.
It is not currently possible to work around this race condition with passive healthchecking and retryOn policies via a BackendTrafficPolicy.
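For reference, the kind of BackendTrafficPolicy we mean is along these lines; the resource names are illustrative and the field names are written from memory of the BackendTrafficPolicy API, so treat this as a sketch rather than a verified manifest:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-resilience       # illustrative name
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend-route        # illustrative route name
  retry:
    numRetries: 3
    retryOn:
      triggers:
        - connect-failure
        - retriable-status-codes
      httpStatusCodes:
        - 503
  healthCheck:
    passive:
      # Eject an endpoint after a single 5xx; values here are illustrative.
      consecutive5XxErrors: 1
      baseEjectionTime: 30s
      interval: 2s
      maxEjectionPercent: 100
```

Even with a policy like this applied, a request that hits the window described above still ends up retrying the same terminating endpoint.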
See the attached screenshot of a packet capture on our Envoy Pod:
The IPs in question are:
10.12.44.215: client (with a 5s TCP idle timeout configured)
10.12.46.156: Envoy Pod
10.12.106.76: backend target server, configured as an HTTPRoute on our Gateway
You can see the following flow in this screenshot:
Packets 920 & 928: A successful HTTP flow between client & Envoy
Packets 994 & 1064: The backend server Pods are deleted and send TCP RSTs to close the connections between Envoy and the backend server Pods
Packet 1084: The client sends another HTTP request to Envoy
Packet 1085: Envoy attempts to create a new TCP connection to the backend (TCP SYN), even though that Pod is already going down
Packets 1841 & 2931: Envoy retransmits the TCP SYN after 0.5s and then 2s
Packet 3262: The client hits its idle timeout of 5 seconds and closes the connection to Envoy with a TCP FIN
The HTTPRoute points at a Kubernetes Service for our backend server with 3 replicas, one in each AZ. Theoretically the client need not know which backend servers are available, and should trust Envoy to do the pooling and work around one backend server being down.
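For context, the route itself is unremarkable; a minimal sketch (names and port are illustrative, not our actual manifest):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: backend-route            # illustrative name
spec:
  parentRefs:
    - name: eg                   # illustrative Gateway name
  rules:
    - backendRefs:
        - name: backend-server   # Service in front of the 3 replicas
          port: 8080             # illustrative port
```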
Workaround:
It is possible to work around this issue by adding a preStop lifecycle hook to the target backend server:
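The hook is just a sleep, roughly (container spec excerpt; this assumes the image ships a sleep binary):

```yaml
lifecycle:
  preStop:
    exec:
      # Keep the container alive for 25s after the Pod is marked Terminating,
      # so EG has time to push the updated endpoint set to Envoy.
      command: ["sleep", "25"]
```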
This immediately puts the Pod into the Terminating state, which removes the IP from the EndpointSlice at the start of the preStop hook, but doesn't remove the underlying Pod for the duration of the sleep 25. This also gives EG enough time to remove the Pod IP from xDS before it goes away.
Expectation:
I would like EG to fall back to an existing backend endpoint in the case where one of them is going down. In the packet capture we see Envoy retrying the same backend that has already closed its connections and has gone down. We thought that the passive healthchecking and retrying would enable this, but instead it seems the retry applies only to one TCP stream and the passive healthchecking does not solve our issue.
We would like to set config at the Gateway level to bypass this, rather than having to set preStop hooks on every backend server.
Environment:
EG: 1.2.4
Envoy: envoyproxy/envoy:distroless-v1.32.1
K8s: 1.31
Logs:
No logs were emitted during this occurrence.
cc @evilr00t @sam-burrell

Hey @arkodg, thanks for this. From our experimentation, maxUnavailable doesn't have any impact on this timing window. The Pods will still receive SIGTERM at the same time that they are removed from the EndpointSlice, and there is still a window where Envoy Gateway has not yet picked up that the IP of the Pod that has just been SIGTERMed should no longer receive traffic. If a client sends a request via an HTTPRoute to Envoy during that window, Envoy will try to connect to the IP that is no longer serving traffic (and you can see it doing TCP retransmissions when this gets no response).
In our case, our clients have a 5 second timeout on their requests. We actually managed to use this to solve our issue: we didn't need a retryOn or a passive healthcheck, only a backend TCP timeout that was shorter than the clients'. With this in place, we can see that Envoy tries to connect to the backend, retransmits a couple of times, then times out and tries a different backend Pod's IP, successfully returning a response to the client, which is exactly what we were hoping for.
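A minimal sketch of what that looks like, assuming the knob in question is timeout.tcp.connectTimeout on a BackendTrafficPolicy (names and the exact value are illustrative):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: backend-timeouts         # illustrative name
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: backend-route        # illustrative route name
  timeout:
    tcp:
      # Shorter than the clients' 5s timeout, so Envoy gives up on the dead
      # endpoint and moves on before the client closes the connection.
      connectTimeout: 2s
```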