Conversation

@lixmal lixmal commented Nov 21, 2025

Describe your changes

Under certain conditions (for example, after a sleep/wake cycle), macOS can get stuck in getaddrinfo calls, which leaves system DNS resolution broken.

To avoid this, we use Go's resolver. Go's resolver has the drawback that it doesn't take split DNS or any other scutil --dns settings into account. So we can only use that when it is safe to use /etc/resolv.conf directly, e.g., control plane traffic (our custom nbnet.Dialer plus stdnet's resolve methods internally used by ICE).

This behavior can be overridden by using the NB_DNS_RESOLVER environment variable.
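
The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `EnvResolver` and `NewResolver` are names taken from this conversation, while `resolverFor` is a hypothetical helper introduced here so the decision can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"runtime"
	"strings"
)

// EnvResolver is the override variable named in this PR.
const EnvResolver = "NB_DNS_RESOLVER"

// resolverFor is a hypothetical pure helper (not from the PR): it picks a
// resolver from an explicit env value and OS name.
func resolverFor(envValue, goos string) *net.Resolver {
	switch strings.ToLower(envValue) {
	case "system":
		return net.DefaultResolver
	case "go":
		return &net.Resolver{PreferGo: true}
	}
	// Empty or unrecognized value: fall through to the platform default.
	if goos == "darwin" {
		// Assumed default on macOS: prefer Go's resolver to avoid getaddrinfo.
		return &net.Resolver{PreferGo: true}
	}
	return net.DefaultResolver
}

// NewResolver sketches the behavior described in the PR text.
func NewResolver() *net.Resolver {
	return resolverFor(os.Getenv(EnvResolver), runtime.GOOS)
}

func main() {
	r := NewResolver()
	fmt.Printf("GOOS=%s %s=%q PreferGo=%v\n",
		runtime.GOOS, EnvResolver, os.Getenv(EnvResolver), r.PreferGo)
}
```

Running it with `NB_DNS_RESOLVER=system` on macOS should report `PreferGo=false`, which is the escape hatch for setups that depend on scutil --dns settings.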

Issue ticket number and link

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • New Features
    • DNS resolver selection can now be controlled via the NB_DNS_RESOLVER environment variable, supporting "system" and "go" options.
    • macOS now defaults to an optimized DNS resolution mode for more reliable name lookups.
    • Non-root Linux and general client networking now use the configured resolver for more consistent DNS behavior.


coderabbitai bot commented Nov 21, 2025

Walkthrough

Introduces an environment- and platform-aware DNS resolver and wires it into dialing paths: adds NewResolver() and EnvResolver, sets net.Dialer.Resolver at creation sites, and stores/uses a resolver on the stdnet Net for DNS lookups; gRPC non-root dialer now uses the custom resolver.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Resolver implementation (client/net/resolver.go) | Adds EnvResolver = "NB_DNS_RESOLVER" and NewResolver() *net.Resolver, which selects the resolver based on the env var and platform (PreferGo on Darwin or when the env var says "go", system otherwise). |
| Dialer construction (client/net/dialer.go) | NewDialer() now initializes the embedded net.Dialer with Resolver: NewResolver(). |
| Standard net wrapper (client/internal/stdnet/stdnet.go) | Net gains a resolver *net.Resolver field; constructors call NewResolver(); DNS lookups use the instance resolver rather than net.DefaultResolver. |
| gRPC dialer integration (client/grpc/dialer_generic.go) | Non-root Linux path now uses a net.Dialer with Resolver: nbnet.NewResolver() when returning DialContext, instead of a zero-value net.Dialer. |
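
As a rough illustration of the dialer wiring summarized above, the sketch below attaches a resolver to an embedded `net.Dialer`. The `Dialer` type, `newDialer`, and `newGoDialer` are hypothetical stand-ins for the PR's own types, and the timeout value is an assumption made only for this example.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// Dialer is a hypothetical stand-in for the PR's dialer type: a thin wrapper
// that embeds the standard library dialer.
type Dialer struct {
	*net.Dialer
}

// newDialer attaches the given resolver so that DialContext resolves
// hostnames through it instead of net.DefaultResolver.
func newDialer(resolver *net.Resolver) *Dialer {
	return &Dialer{
		Dialer: &net.Dialer{
			Timeout:  10 * time.Second, // illustrative value, not from the PR
			Resolver: resolver,
		},
	}
}

// newGoDialer builds a dialer that prefers Go's built-in resolver.
func newGoDialer() *Dialer {
	return newDialer(&net.Resolver{PreferGo: true})
}

func main() {
	d := newGoDialer()
	// The Resolver field is promoted from the embedded *net.Dialer.
	fmt.Println("dialer prefers Go resolver:", d.Resolver.PreferGo)
}
```

A zero-value `net.Dialer` leaves `Resolver` nil, in which case the standard library falls back to `net.DefaultResolver`; setting the field explicitly is what routes lookups through the chosen resolver.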

Sequence Diagram(s)

sequenceDiagram
    actor Caller
    participant DialerFactory as NewDialer / gRPC dialer
    participant Resolver as NewResolver()
    participant NetDialer as net.Dialer
    participant DNS as net.Resolver.LookupNetIP
    participant Remote as RemoteHost

    Caller->>DialerFactory: request DialContext(...)
    DialerFactory->>Resolver: NewResolver()
    Resolver-->>DialerFactory: *net.Resolver
    DialerFactory->>NetDialer: create with Resolver (assigned)
    Caller->>NetDialer: DialContext(ctx, network, addr)
    NetDialer->>DNS: Resolver.LookupNetIP(ctx, ipNet, host)
    DNS-->>NetDialer: resolved IPs
    NetDialer->>Remote: TCP connect to resolved IP
    Remote-->>Caller: connection established / error

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Focused cross-cutting change touching resolver creation and its uses.
  • Review attention recommended for:
    • client/net/resolver.go (env parsing, platform logic).
    • client/internal/stdnet/stdnet.go (correct usage of instance resolver for all lookup paths).
    • client/grpc/dialer_generic.go and client/net/dialer.go (ensure behavior parity and no regressions in non-root vs root dialing).

Possibly related PRs

Suggested reviewers

  • pappz

Poem

🐰 I nibble code and sniff the air,
A resolver found with thoughtful care.
Env or Darwin tells me which way to go,
Connections follow where the DNS bunnies know. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and accurately summarizes the main change: using Go's resolver on macOS for control plane and ICE traffic to address DNS resolution issues. |
| Description check | ✅ Passed | The description covers the problem, solution, limitations, and environment variable override. It follows the template structure with appropriate checklist selections and CLA confirmation. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.00%, which is sufficient. The required threshold is 80.00%. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
client/net/resolver.go (1)

24-43: Consider handling invalid environment variable values explicitly.

The switch statement on lines 26-33 silently falls through for unrecognized values (e.g., NB_DNS_RESOLVER=foo), which will then proceed to the darwin check. While this may be intentional, it could lead to confusion during debugging. Consider either logging a warning for invalid values or documenting this fall-through behavior more explicitly in the function comment.

Example approach with logging:

 func NewResolver() *net.Resolver {
 	if resolver := os.Getenv(EnvResolver); resolver != "" {
 		switch strings.ToLower(resolver) {
 		case "system":
 			return net.DefaultResolver
 		case "go":
 			return &net.Resolver{
 				PreferGo: true,
 			}
+		default:
+			// Invalid value, fall through to platform defaults
+			log.Debugf("Invalid %s value: %s, using platform defaults", EnvResolver, resolver)
 		}
 	}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32146e5 and 1dcd92c.

📒 Files selected for processing (4)
  • client/grpc/dialer_generic.go (1 hunks)
  • client/internal/stdnet/stdnet.go (5 hunks)
  • client/net/dialer.go (1 hunks)
  • client/net/resolver.go (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
client/grpc/dialer_generic.go (2)
client/net/dialer.go (1)
  • Dialer (9-11)
client/net/resolver.go (1)
  • NewResolver (24-43)
client/internal/stdnet/stdnet.go (1)
client/net/resolver.go (1)
  • NewResolver (24-43)
client/net/dialer.go (1)
client/net/resolver.go (1)
  • NewResolver (24-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Signal / Unit (amd64)
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Darwin
  • GitHub Check: JS / Lint
  • GitHub Check: release_ui_darwin
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: release
  • GitHub Check: Client / Unit
🔇 Additional comments (6)
client/net/resolver.go (1)

10-14: LGTM! Clear environment variable definition.

The constant is well-named and properly documented with the allowed values.

client/net/dialer.go (1)

16-18: LGTM! Resolver integration is correct.

The change properly initializes the dialer with the custom resolver, aligning with the PR objectives to use Go's resolver on darwin for control plane traffic.

client/internal/stdnet/stdnet.go (3)

21-21: LGTM! Resolver field and import added correctly.

The import alias and resolver field are properly integrated into the Net struct. The field is safe without mutex protection since it's only set during initialization and remains immutable afterwards.

Also applies to: 47-47


58-58: LGTM! Consistent resolver initialization.

Both constructors properly initialize the resolver using nbnet.NewResolver(), ensuring the custom DNS resolution behavior is applied to ICE traffic as intended by the PR.

Also applies to: 79-79


118-118: LGTM! Resolver usage is correct.

The change from net.DefaultResolver.LookupNetIP to n.resolver.LookupNetIP properly applies the custom resolver to DNS lookups. Since the resolver is initialized in both constructors, no nil check is needed here.

client/grpc/dialer_generic.go (1)

32-34: LGTM! Consistent resolver usage in non-root path.

The change ensures the non-root Linux dialing path also uses the custom resolver, maintaining consistency with the root path (line 39) which uses nbnet.NewDialer() that was updated to use the same resolver in client/net/dialer.go. Both paths now properly apply the darwin-specific resolver workaround for control plane traffic.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
client/net/resolver.go (2)

27-38: Consider a more visible log level for invalid configuration.

Invalid NB_DNS_RESOLVER values are logged at Debug level, which may be too quiet for operators troubleshooting DNS issues. Consider using log.Warnf so configuration mistakes are more visible.

Also, the function documentation mentions "GODEBUG" as an override mechanism but doesn't explain how to use it. Either document the GODEBUG usage pattern or remove the reference to avoid confusion.

Apply this diff to improve logging visibility:

-		log.Debugf("Invalid %s value: %s, using platform defaults", EnvResolver, resolver)
+		log.Warnf("Invalid %s value: %s, using platform defaults", EnvResolver, resolver)

26-47: Verify the fix addresses the macOS getaddrinfo hang issue and add unit tests.

This implementation assumes that switching to the pure Go resolver (PreferGo: true) will prevent the macOS getaddrinfo hang after sleep/wake. Please verify this approach actually resolves the issue through testing or provide references.

Additionally, consider adding unit tests to cover:

  • Environment variable handling (valid values, invalid values, empty)
  • Platform detection logic
  • Return value correctness for each code path
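
The cases listed above could be covered with a table-driven check along these lines. This is only a sketch: `resolverFor` is a hypothetical pure helper that takes the env value and OS name as arguments (the PR's `NewResolver` reads them itself), and in a real `_test.go` file the loop would more likely live in a `TestNewResolver` function using `t.Setenv`.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// resolverFor is a hypothetical stand-in for the selection logic under test.
func resolverFor(envValue, goos string) *net.Resolver {
	switch strings.ToLower(envValue) {
	case "system":
		return net.DefaultResolver
	case "go":
		return &net.Resolver{PreferGo: true}
	}
	if goos == "darwin" {
		return &net.Resolver{PreferGo: true}
	}
	return net.DefaultResolver
}

// resolverCase describes one input combination and the expected outcome.
type resolverCase struct {
	name         string
	env          string
	goos         string
	wantPreferGo bool
}

func cases() []resolverCase {
	return []resolverCase{
		{"empty env on darwin", "", "darwin", true},
		{"empty env on linux", "", "linux", false},
		{"go override on linux", "go", "linux", true},
		{"system override on darwin", "system", "darwin", false},
		{"mixed-case value", "Go", "windows", true},
		{"invalid value on darwin", "foo", "darwin", true},
		{"invalid value on linux", "foo", "linux", false},
	}
}

// failedCases returns the names of cases whose result differs from the table.
func failedCases() []string {
	var failed []string
	for _, c := range cases() {
		if got := resolverFor(c.env, c.goos).PreferGo; got != c.wantPreferGo {
			failed = append(failed, c.name)
		}
	}
	return failed
}

func main() {
	if failed := failedCases(); len(failed) > 0 {
		fmt.Println("failed:", strings.Join(failed, ", "))
		return
	}
	fmt.Println("all resolver cases passed")
}
```

Passing the OS name in as a parameter is a design choice made for this sketch; it lets the darwin path be tested on any CI platform.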
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1dcd92c and 4b4d00d.

📒 Files selected for processing (1)
  • client/net/resolver.go (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Linux
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Android / Build
  • GitHub Check: release_ui_darwin
  • GitHub Check: release_ui
  • GitHub Check: release
🔇 Additional comments (1)
client/net/resolver.go (1)

12-16: LGTM! Clear constant definition.

The environment variable constant is well-documented with acceptable values clearly specified.

Comment on lines +18 to +25
// NewResolver creates a DNS resolver with appropriate settings based on platform and configuration.
// On Darwin (macOS), it defaults to the pure Go resolver to avoid getaddrinfo hangs after sleep/wake.
// This is particularly important for connections using this package's Dialer, which bypasses the NetBird
// overlay network for control plane traffic. Since these connections target external infrastructure
// (management, signal, relay servers), it is safe to ignore split DNS configurations that would
// normally be provided by the system resolver.
// On other platforms, it uses the system resolver (cgo).
// This behavior can be overridden using the NB_DNS_RESOLVER environment variable or GODEBUG.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

Verify the split DNS assumption for control plane traffic.

The comment states it's "safe to ignore split DNS configurations" for control plane traffic because these connections "target external infrastructure." However, in enterprise or self-hosted scenarios, management/signal/relay servers may be internal resources that rely on split DNS. Users with private deployments behind split DNS could experience connectivity issues.

Consider documenting this limitation or adding a configuration option for environments where control plane servers are internal.



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the NewResolver function implementation
cat client/net/resolver.go | head -60
```

Length of output: 1472

🏁 Script executed:

```shell
#!/bin/bash
# Search for NewResolver usage across the codebase
rg -n "NewResolver" -A3 -B3 --type=go
```

Length of output: 15091

🏁 Script executed:

```shell
#!/bin/bash
# Search for split DNS references
rg -n "split.*dns|DNS.*split" -i --type=go
```

Length of output: 174

🏁 Script executed:

```shell
#!/bin/bash
# Search for control plane related DNS or resolver usage
rg -n "control.*plane|management.*server|signal.*server|relay.*server" -i --type=go -C2 | head -100
```

Length of output: 6138


Update documentation to guide users on split DNS scenarios in enterprise deployments.

The concern about split DNS is valid. While the code provides override via NB_DNS_RESOLVER=system environment variable, the documentation doesn't explain when or why users should use it. In self-hosted or enterprise deployments where management/signal/relay servers are internal and rely on split DNS, users will encounter connectivity issues without knowing how to resolve them.

Recommended improvements:

  • Document in the comment that NB_DNS_RESOLVER=system should be used for deployments with internal servers behind split DNS
  • Explain what the GODEBUG override does and provide an example
  • Add troubleshooting guidance for split DNS scenarios

This allows users to self-serve when they encounter the issue rather than discovering it only through trial and error.
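
One way to check whether a given hostname depends on split DNS is to resolve it through both resolvers and compare the results. The program below is a minimal diagnostic sketch and not part of the PR; `lookup` and `onlyIn` are hypothetical helpers, and what `net.DefaultResolver` actually does depends on the platform and on how the binary was built.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// lookup resolves host with r and returns the addresses as strings.
func lookup(ctx context.Context, r *net.Resolver, host string) ([]string, error) {
	addrs, err := r.LookupNetIP(ctx, "ip", host)
	if err != nil {
		return nil, err
	}
	out := make([]string, 0, len(addrs))
	for _, a := range addrs {
		out = append(out, a.Unmap().String())
	}
	return out, nil
}

// onlyIn returns the entries of a that do not appear in b.
func onlyIn(a, b []string) []string {
	seen := make(map[string]struct{}, len(b))
	for _, s := range b {
		seen[s] = struct{}{}
	}
	var out []string
	for _, s := range a {
		if _, ok := seen[s]; !ok {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: pass the hostname to check as the first argument")
		return
	}
	host := os.Args[1]
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	sys, sysErr := lookup(ctx, net.DefaultResolver, host)
	pure, goErr := lookup(ctx, &net.Resolver{PreferGo: true}, host)
	fmt.Printf("system resolver: %v (err: %v)\n", sys, sysErr)
	fmt.Printf("go resolver:     %v (err: %v)\n", pure, goErr)

	// Addresses seen only by the system resolver hint at split DNS.
	if diff := onlyIn(sys, pure); len(diff) > 0 {
		fmt.Println("only via system resolver (possible split DNS):", diff)
	}
}
```

If the management, signal, or relay hostname resolves only through the system resolver, that suggests setting NB_DNS_RESOLVER=system for that deployment.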

🤖 Prompt for AI Agents
In client/net/resolver.go around lines 18 to 25, expand the existing comment to
document split DNS guidance: add a short note that in self-hosted or enterprise
deployments where management/signal/relay servers are internal and reachable
only via split DNS, users should set NB_DNS_RESOLVER=system to force the system
resolver; briefly explain the GODEBUG override (e.g., how to force the pure Go
resolver via GODEBUG=netdns=go or use system via netdns=cgo) and include a
concrete example of setting the env var, and append a one-paragraph
troubleshooting tip listing symptoms (failed resolution to internal hostnames,
connection timeouts after suspend/resume) and the recommended steps (check split
DNS, try NB_DNS_RESOLVER=system, then GODEBUG variations, and where to file
logs).

@lixmal lixmal closed this Dec 3, 2025