Commit 493bd39

cstockton (Chris Stockton) and samrose authored
feat: tighten gotrue.service deps and startup behavior (#1783)
* systemd: tighten gotrue.service deps and startup behavior

  Add stronger ordering and dependency constraints to reduce startup race
  conditions and noisy flapping:

  - Wait for `cloud-init`, `supabase-admin-agent_salt`, `apparmor`,
    `systemd-sysctl`, and `ufw` to complete before starting.
  - Require `network-online.target` and `systemd-resolved` for stable DNS
    resolution; note Go's resolver can race with early boot DNS.
  - Ensure `postgresql.service` is online before starting auth to avoid
    misleading error noise during slow boots.
  - Lower `StartLimitIntervalSec` and `StartLimitBurst` to reduce repeated
    restarts in failure scenarios.
  - Switch service type to `exec` instead of `simple`. This removes the tiny
    window in which systemd is supervising the wrapper process instead of the
    Go binary.

  These changes aim to rule out capability changes, socket reuse races, and
  incomplete firewall/network config as causes of EADDRINUSE errors and
  unstable startup.

* chore: add newline to end of file
* feat: enable notify support and cleanup for v3 support.
* chore: add testing suffix for local infra test
* feat: add info commands for when health checks fail
* fix: better approach for this test
* fix: formatting I believe is why test failed
* chore: add a little more log output and remove -r
* chore: fmt
* chore: strip deps to see if orioledb test passes
* chore: restore gotrue service unit deps

---------

Co-authored-by: Chris Stockton <[email protected]>
Co-authored-by: Sam Rose <[email protected]>
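
For readers unfamiliar with how EADDRINUSE surfaces, here is a minimal Python sketch (not gotrue code, which is written in Go) of a second bind failing while another socket still holds the same address and port. The function name `bind_conflict_errno` is hypothetical, and the sketch only shows the basic conflict, not the early-boot capability or DNS races the commit is trying to rule out.

```python
import errno
import socket
from typing import Optional


def bind_conflict_errno() -> Optional[int]:
    """Bind two TCP sockets to the same loopback port; return the second bind's errno.

    Returns None if the second bind unexpectedly succeeds.
    """
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the kernel for any free port, so the sketch never
        # collides with a real service.
        first.bind(("127.0.0.1", 0))
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same (proto, family, addr, port) tuple while `first` is listening.
            second.bind(("127.0.0.1", port))
        except OSError as exc:
            return exc.errno
        return None
    finally:
        second.close()
        first.close()
```

Comparing the return value with `errno.EADDRINUSE` shows the same error class the commit message refers to.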
1 parent 8a8fac4 commit 493bd39

2 files changed, +80 -7 lines changed

ansible/files/gotrue.service.j2

Lines changed: 72 additions & 5 deletions
@@ -1,14 +1,56 @@
 [Unit]
 Description=Gotrue
 
+# Avoid starting gotrue while cloud-init is running. It makes a lot of changes
+# and I would like to rule out side effects of it running concurrently along
+# side services.
+After=cloud-init.service
+Wants=cloud-init.target
+
+# Given the fact that auth uses SO_REUSEADDR, I want to rule out capabilities
+# being modified between restarts early in boot. This plugs up the scenario that
+# EADDRINUSE errors originate from a previous gotrue process starting without
+# the SO_REUSEADDR flag (due to lacking capability at that point in boot proc)
+# so when the next gotrue starts it can't re-use a slow releasing socket.
+After=apparmor.service
+
+# We want sysctl's to be applied
+After=systemd-sysctl.service
+
+# UFW Is modified by cloud init, but started non-blocking, so configuration
+# could be in-flight while gotrue is starting. I want to ensure future rules
+# that are relied on for security posture are applied before gotrue runs.
+After=ufw.service
+
+# We need networking & resolution, auth uses the Go DNS resolver (not libc)
+# so it's possible `localhost` resolution could be unstable early in startup. We
+# care about this because SO_REUSEADDR eligibility checks the tuple
+# (proto, family, addr, port) meaning the AF_INET (ipv4, ipv6) could affect the
+# binding resulting in a second way for EADDRINUSE errors to surface.
+#
+# Note: We should consider removing localhost usage given `localhost` resolution
+# can often be racey early in boot, can be difficult to debug and offers no real
+# advantage in our infra. At the very least avoiding DNS resolved binding would
+# be a good idea.
+Wants=network-online.target systemd-resolved.service
+After=network-online.target systemd-resolved.service
+
+# Auth server can't start unless postgres is online, lets remove a lot of auth
+# server noise during slow starts by requiring it.
+Wants=postgresql.service
+After=postgresql.service
+
+# Lower start limit ival and burst to prevent the noisy flapping
+StartLimitIntervalSec=10
+StartLimitBurst=5
+
 [Service]
-Type=simple
+Type=exec
 WorkingDirectory=/opt/gotrue
-{% if qemu_mode is defined and qemu_mode %}
-ExecStart=/opt/gotrue/gotrue
-{% else %}
+
+# Both v2 & v3 need a config-dir for reloading support.
 ExecStart=/opt/gotrue/gotrue --config-dir /etc/auth.d
-{% endif %}
+ExecReload=/bin/kill -10 $MAINPID
 
 User=gotrue
 Restart=always
@@ -17,11 +59,36 @@ RestartSec=3
 MemoryAccounting=true
 MemoryMax=50%
 
+# These are the historical location of env files. The /etc/auth.d dir will
+# override them when present.
 EnvironmentFile=-/etc/gotrue.generated.env
 EnvironmentFile=/etc/gotrue.env
 EnvironmentFile=-/etc/gotrue.overrides.env
 
+# Both v2 & v3 support reloading via signals, on linux this is SIGUSR1.
+Environment=GOTRUE_RELOADING_SIGNAL_ENABLED=true
+Environment=GOTRUE_RELOADING_SIGNAL_NUMBER=10
+
+# Both v2 & v3 disable the poller. While gotrue sets it to off by default we
+# defensively set it to false here.
+Environment=GOTRUE_RELOADING_POLLER_ENABLED=false
+
+# Determines how much idle time must pass before triggering a reload. This
+# ensures only 1 reload operation occurs during a burst of config updates.
+Environment=GOTRUE_RELOADING_GRACE_PERIOD_INTERVAL=2s
+
+{% if qemu_mode is defined and qemu_mode %}
+# v3 does not use filesystem notifications for config reloads.
+Environment=GOTRUE_RELOADING_NOTIFY_ENABLED=false
+{% else %}
+# v2 currently relies on notify support, so we will enable it until both v2 / v3
+# have migrated to strictly use signals across all projects. The default is true
+# in gotrue but we will set it defensively here.
+Environment=GOTRUE_RELOADING_NOTIFY_ENABLED=true
+{% endif %}
+
 Slice=services.slice
 
 [Install]
 WantedBy=multi-user.target
+
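
The grace-period comment in the unit describes a debounce: a reload fires only after a stretch of idle time, so a burst of config updates collapses into one reload. The sketch below illustrates that idea in Python under stated assumptions. It is not gotrue's implementation, which the diff does not show; `ReloadDebouncer`, `on_signal`, and `tick` are hypothetical names, and time is passed in explicitly so the behavior is easy to follow.

```python
from typing import Optional


class ReloadDebouncer:
    """Coalesce a burst of reload requests into a single reload.

    Hypothetical helper for illustration only; it models the idea behind
    GOTRUE_RELOADING_GRACE_PERIOD_INTERVAL, not gotrue's actual code.
    """

    def __init__(self, grace_period: float) -> None:
        self.grace_period = grace_period  # seconds of required idle time
        self._last_signal: Optional[float] = None
        self.reload_count = 0

    def on_signal(self, now: float) -> None:
        # Every reload request (for example a SIGUSR1) restarts the idle timer.
        self._last_signal = now

    def tick(self, now: float) -> bool:
        # Reload only when a request is pending and the grace period has
        # passed without any newer request arriving.
        if self._last_signal is None:
            return False
        if now - self._last_signal < self.grace_period:
            return False
        self._last_signal = None
        self.reload_count += 1
        return True
```

With a 2 second grace period, requests at t=0.0, 0.5 and 1.0 followed by ticks at t=2.5 and t=3.0 produce exactly one reload, on the second tick, because only then have 2 seconds elapsed since the last request.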

testinfra/test_ami_nix.py

Lines changed: 8 additions & 2 deletions
@@ -374,12 +374,18 @@ def is_healthy(ssh) -> bool:
             try:
                 result = run_ssh_command(ssh, command)
                 if not result["succeeded"]:
-                    logger.warning(f"{service} not ready")
+                    info_text = ""
+                    info_command = f"sudo journalctl -b -u {service} -n 20 --no-pager"
+                    info_result = run_ssh_command(ssh, info_command)
+                    if info_result["succeeded"]:
+                        info_text = "\n" + info_result["stdout"].strip()
+
+                    logger.warning(f"{service} not ready{info_text}")
                     return False
+
             except Exception:
                 logger.warning(f"Connection failed during {service} check")
                 return False
-
         return True
 
     while True:
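
To make the new health-check behavior easier to try in isolation, here is a self-contained sketch of the same pattern: when a check fails, fetch the last 20 journal lines for the unit and append them to the warning. The `check_service` function and the `run` callable are hypothetical stand-ins for the test's `is_healthy` loop body and `run_ssh_command`; the only assumption carried over from the diff is that results are dicts with `succeeded` and `stdout` keys.

```python
import logging
from typing import Callable, Dict

logger = logging.getLogger(__name__)

# Assumed result shape, mirroring the keys used in the diff.
Runner = Callable[[str], Dict[str, object]]


def check_service(run: Runner, service: str, command: str) -> bool:
    """Run one health check; on failure, log recent journal output for the unit."""
    try:
        result = run(command)
        if not result["succeeded"]:
            info_text = ""
            # Same journalctl invocation as in the diff: current boot, one unit,
            # last 20 lines, no pager.
            info = run(f"sudo journalctl -b -u {service} -n 20 --no-pager")
            if info["succeeded"]:
                info_text = "\n" + str(info["stdout"]).strip()
            logger.warning(f"{service} not ready{info_text}")
            return False
    except Exception:
        # Any runner failure (for example a dropped connection) counts as unhealthy.
        logger.warning(f"Connection failed during {service} check")
        return False
    return True
```

Passing the runner in as a parameter keeps the sketch testable with a fake; the real test calls `run_ssh_command(ssh, ...)` directly.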
