Bug: Spike in TCP connections when using OCI HelmRepository in v1.7.x #2002

@alex5517

What is the bug?

We recently upgraded our source-controller from v1.6.2 to v1.7.4 and noticed a spike in outbound network traffic, specifically TCP connections and TLS handshakes directed at our Azure Container Registry. This happens when using HelmRepository resources configured with type: oci and provider: azure, authenticated via Azure Workload Identity (with the AZURE_CLIENT_ID environment variable configured directly on the source-controller pod).

After reviewing the changelog and the code differences between these versions, I suspect this regression is related to the recent object-level workload identity authentication changes. As mentioned in PR #1790, the new authentication implementation appears to have been omitted for HelmRepository in favor of OCIRepository.

How to reproduce it?

We set up a test environment to isolate the behavior. We created 20 namespaces, and inside each namespace, we deployed one OCI Azure HelmRepository and 7 HelmRelease resources referencing it. All resources were set to a 1-minute interval.

Here is the generic setup we used for the repository:

apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: chart-oci
  namespace: test-ns-1
spec:
  interval: 1m0s
  provider: azure
  type: oci
  url: oci://example.azurecr.io/helm

And the associated releases:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: test1
  namespace: test-ns-1
spec:
  interval: 1m0s
  chart:
    spec:
      chart: example/chart
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: chart-oci
      version: '>1.0.0-0'
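
The setup described above (20 namespaces, each with one repository and 7 releases) could be generated with a small script along these lines. This is a sketch, not necessarily the tooling used for the report: the `gen_manifests` function name, the `test-ns-N` and `testN` naming scheme, and the explicit `interval: 1m0s` on each HelmRelease are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: print manifests for N namespaces, each holding one OCI
# HelmRepository and M HelmReleases that reference it.
# Naming scheme and defaults are illustrative assumptions.
gen_manifests() {
  namespaces="${1:-20}"
  releases="${2:-7}"
  ns=1
  while [ "$ns" -le "$namespaces" ]; do
    cat <<EOF
---
apiVersion: v1
kind: Namespace
metadata:
  name: test-ns-${ns}
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: chart-oci
  namespace: test-ns-${ns}
spec:
  interval: 1m0s
  provider: azure
  type: oci
  url: oci://example.azurecr.io/helm
EOF
    i=1
    while [ "$i" -le "$releases" ]; do
      cat <<EOF
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: test${i}
  namespace: test-ns-${ns}
spec:
  interval: 1m0s  # assumed: the report says all resources used 1 minute
  chart:
    spec:
      chart: example/chart
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: chart-oci
      version: '>1.0.0-0'
EOF
      i=$((i + 1))
    done
    ns=$((ns + 1))
  done
}
```

For example, `gen_manifests 20 7 | kubectl apply -f -` would apply 20 repositories and 140 releases in one go.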

With this setup applied, we captured 5 minutes of traffic with tcpdump and counted the TLS ClientHello messages by SNI server name. We also checked the TCP connections with netstat.
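
A measurement like this could be scripted roughly as follows. It is a sketch under assumptions: the capture interface (`any`), the `tcp port 443` filter, the helper names, and the file names are illustrative and not taken from the report. Only the tshark filter and field match the commands used for the results.

```shell
#!/bin/sh
# Sketch of the measurement; interface, port filter and names are assumptions.

# Capture HTTPS traffic for a fixed duration (needs root or CAP_NET_RAW).
# $1 = output pcap file, $2 = duration in seconds (default 300).
capture() {
  timeout "${2:-300}" tcpdump -i any -w "$1" 'tcp port 443'
}

# Print the SNI server name of every TLS ClientHello (handshake type 1).
# $1 = pcap file to read.
list_sni() {
  tshark -r "$1" -Y "tls.handshake.type == 1" \
    -T fields -e tls.handshake.extensions_server_name
}

# Count occurrences per name read from stdin. Sorting first lets uniq
# group duplicates that are not adjacent in the capture.
count_sni() {
  sort | uniq -c | sort -rn
}
```

Usage would look like `capture 5m.pcap 300; list_sni 5m.pcap | count_sni`. Note that `timeout` exits with status 124 when it stops the capture, so the steps are chained with `;` rather than `&&`.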

Any additional context to share?

Here are the results of the tests we ran, comparing v1.7.4 with the older v1.6.2, as well as comparing HelmRepository against OCIRepository.

When running the test setup on v1.7.4 using HelmRepository, we saw 4382 TLS handshakes to the registry over a 5-minute period, compared with 97 in our baseline capture:

➜ v1.7.4 tshark -r 5m_baseline_before-tests_traffic.pcap -Y "tls.handshake.type == 1" -T fields -e tls.handshake.extensions_server_name | uniq -c 
  97 example.azurecr.io # existing stuff running in the cluster

➜ v1.7.4 tshark -r 5m_helmrepo-tests_traffic.pcap -Y "tls.handshake.type == 1" -T fields -e tls.handshake.extensions_server_name | uniq -c 
4382 example.azurecr.io

Checking netstat -an | grep 1.2.3.4 during this time showed 807 connections in various states (mostly TIME_WAIT and ESTABLISHED) to the ACR IP.
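
To break a count like this down by connection state, the netstat output can be grouped with awk. This is a sketch that assumes the Linux net-tools column layout (foreign address in column 5 as `ip:port`, state in column 6); `count_states` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count connections to one peer IP, grouped by TCP state.
# Assumes Linux `netstat -an` columns: $5 = foreign address, $6 = state.
# $1 = peer IP; netstat output is read from stdin.
count_states() {
  # Match "ip:" as a prefix so 1.2.3.4 does not also match 1.2.3.40.
  awk -v peer="$1:" 'index($5, peer) == 1 { n[$6]++ }
    END { for (s in n) print n[s], s }'
}
```

For example: `netstat -an | count_states 1.2.3.4`.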

We then tore down the test namespaces and recreated the exact same setup using OCIRepository instead of HelmRepository. The traffic dropped back close to the baseline:

➜ v1.7.4 tshark -r 5m_ocirepo-tests_traffic.pcap -Y "tls.handshake.type == 1" -T fields -e tls.handshake.extensions_server_name | uniq -c 
 184 example.azurecr.io

Netstat for the OCIRepository test only showed 39 connections.
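
The OCIRepository manifests are not included in this report. As a sketch of what the equivalent setup for one namespace might look like, the function below prints one OCIRepository and the releases referencing it through `chartRef`. The resource names, the full chart URL (the repository URL joined with the chart name), and the `layerSelector` settings are assumptions, and `gen_oci_manifests` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: print an OCIRepository-based equivalent for one namespace.
# $1 = namespace (default test-ns-1), $2 = number of releases (default 7).
# Names, URL and layerSelector are assumptions, not taken from the report.
gen_oci_manifests() {
  ns="${1:-test-ns-1}"
  releases="${2:-7}"
  cat <<EOF
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: OCIRepository
metadata:
  name: chart-oci
  namespace: ${ns}
spec:
  interval: 1m0s
  provider: azure
  url: oci://example.azurecr.io/helm/example/chart
  ref:
    semver: '>1.0.0-0'
  layerSelector:
    mediaType: application/vnd.cncf.helm.chart.content.v1.tar+gzip
    operation: copy
EOF
  i=1
  while [ "$i" -le "$releases" ]; do
    cat <<EOF
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: test${i}
  namespace: ${ns}
spec:
  interval: 1m0s
  chartRef:
    kind: OCIRepository
    name: chart-oci
EOF
    i=$((i + 1))
  done
}
```

With this shape, the chart version constraint moves from the HelmRelease into `spec.ref.semver` on the OCIRepository, and no HelmChart object is created per release.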

Out of curiosity, we ran the same HelmRepository test on v1.6.2. While there was still an increase over the baseline, it was lower than v1.7.4, generating 2528 handshakes and 212 netstat connections over 5 minutes.

We are aware that using HelmRepository + HelmChart inherently generates more connections than OCIRepository. However, the number of TCP connections reported by netstat rose from 212 to 807 (roughly 3.8x) after the upgrade to v1.7.4, and TLS handshakes over 5 minutes rose from 2528 to 4382 (roughly 1.7x). That is an unexpected increase in network overhead for the exact same workload.

I understand that OCI support in HelmRepository is in maintenance mode and OCIRepository is the recommended approach (which our tests confirm resolves the issue). We are planning to migrate our manifests accordingly. However, I thought it would be best to report this behavior change in case the extent of the network overhead was unintended.
