Background
GKE uses Antrea and OVS on Windows nodes. A customer reported that DNS resolution failed intermittently. From a GKE cluster with Windows nodes they created a Pod which ran the following:
while true; do nslookup www.google.com; done;
and reported that the lookup sometimes failed. We noted that the failing DNS requests went to port 1024 instead of 53.
Reproducible Scenario
GKE teams created the following reproducible scenario. While we reproduced it in a GKE cluster, it can also be reproduced in a vanilla Kubernetes distribution.
Open a shell in the Pod:
kubectl exec -it POD_NAME -- cmd
Execute the previously created file infinite_nslookup.bat. It will take around a minute for DNS requests to start failing. I can see the failed DNS requests going to port 1024.
Debugging
While debugging with a GKE networking expert, we captured a trace, attached as ovs-trace.txt:
Before the issue is observed, there are 1274 open UDP connections. Once the issue is observed, there are 7849 connections. That's a very noticeable increase in connections and likely confirms my theory: we are overflowing some table due to too many UDP connections. The number of TCP connections actually decreases slightly (233 -> 188).
The flow dump reveals the smoking gun: a packet that is clearly being sent to port 53 is remapped by OVS to port 1024. Notice how we initially have tp_dst=53, but then UDP(src=55691,dst=1024). It is trivial to find more instances with:
egrep =53.*=1024 ovs-trace.txt
Source code debugging
Ok, so I think I found the code path that sets the port to 1024. Here, MIN_NAT_EPHEMERAL_PORT is set to 1024:
https://github.com/openvswitch/ovs/blob/d92003120dd9213de9452dff3fca0b46af79c68a/datapath-windows/ovsext/Conntrack-nat.c#L288
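For reference, the two range bounds look roughly like this (the 1024 value is stated above and the 1024-65535 range is described below; the exact name and value of the upper-bound constant are my assumption, so treat the linked file as authoritative):

#define MIN_NAT_EPHEMERAL_PORT 1024    /* confirmed above: matches the 1024 seen in the trace */
#define MAX_NAT_EPHEMERAL_PORT 65535   /* assumed name; value taken from the 1024-65535 range described below */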
And then here, the port is overridden with minPort, which is set to MIN_NAT_EPHEMERAL_PORT:
https://github.com/openvswitch/ovs/blob/d92003120dd9213de9452dff3fca0b46af79c68a/datapath-windows/ovsext/Conntrack-nat.c#L398
This code seems to cycle through all ports from 1024 to 65535, and once it runs out of ports, the code that overrides the port with 1024 is triggered:
https://github.com/openvswitch/ovs/blob/d92003120dd9213de9452dff3fca0b46af79c68a/datapath-windows/ovsext/Conntrack-nat.c#L365
Also, it seems odd that the destination port is being overridden. In a NAT scenario I would expect the source port to be overridden, yet somehow we end up with a destination-port override.
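To make the suspected failure mode concrete, here is a minimal, self-contained C sketch of the behaviour described above. It is not the OVS code: pick_nat_port and nat_port_in_use are made-up names, and the real datapath derives its first candidate from a hash and consults the conntrack NAT table. The sketch only shows how a search over the 1024-65535 range that falls back to the range minimum when every port is taken would rewrite the translated port (the destination port, for a DNAT such as the DNS service translation) to exactly 1024:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MIN_NAT_EPHEMERAL_PORT 1024
#define MAX_NAT_EPHEMERAL_PORT 65535

/* Hypothetical stand-in for "is this port already claimed by an existing
 * conntrack NAT entry?". We simulate a fully exhausted range to trigger
 * the fallback; in the real datapath this is a NAT-table lookup. */
static bool
nat_port_in_use(uint16_t port)
{
    (void) port;
    return true;                /* pretend every port in the range is taken */
}

/* Simplified version of the pattern described above: scan the ephemeral
 * range for a free port and, if the whole range is exhausted, fall back
 * to the range minimum (1024). */
static uint16_t
pick_nat_port(uint16_t hint, uint16_t min_port, uint16_t max_port)
{
    uint32_t range = (uint32_t) max_port - min_port + 1;
    uint16_t port = (hint >= min_port && hint <= max_port) ? hint : min_port;

    for (uint32_t tries = 0; tries < range; tries++) {
        if (!nat_port_in_use(port)) {
            return port;
        }
        port = (port == max_port) ? min_port : (uint16_t) (port + 1);
    }
    /* Every port in [min_port, max_port] was tried: clamp to min_port. */
    return min_port;
}

int
main(void)
{
    /* DNAT case: the packet is addressed to UDP port 53, but the picker is
     * constrained to 1024-65535 and the range is exhausted, so the
     * rewritten destination port comes out as 1024 instead of 53. */
    uint16_t orig_dst = 53;
    uint16_t new_dst = pick_nat_port(orig_dst,
                                     MIN_NAT_EPHEMERAL_PORT,
                                     MAX_NAT_EPHEMERAL_PORT);

    printf("tp_dst=%u rewritten to dst=%u\n", orig_dst, new_dst);
    return 0;
}

If this is roughly what the linked code does, it would be consistent with the failures appearing only after the number of open UDP connections grows into the thousands, as seen in the trace.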
Platform details
OS: Windows Server LTSC2019 or LTSC2022
OVS Version: 2.14.4.
Attached file: ovs-trace.txt