Commit f3bbe9c

Author: Chris Stockton
systemd: tighten gotrue.service deps and startup behavior
Add stronger ordering and dependency constraints to reduce startup race conditions and noisy flapping:

- Wait for `cloud-init`, `supabase-admin-agent_salt`, `apparmor`, `systemd-sysctl`, and `ufw` to complete before starting.
- Require `network-online.target` and `systemd-resolved` for stable DNS resolution; note Go's resolver can race with early boot DNS.
- Ensure `postgresql.service` is online before starting auth to avoid misleading error noise during slow boots.
- Lower `StartLimitIntervalSec` and `StartLimitBurst` to reduce repeated restarts in failure scenarios.
- Switch service type to `exec` instead of `simple`. This removes the tiny window in which systemd is supervising the wrapper process instead of the Go binary.

These changes aim to rule out capability changes, socket reuse races, and incomplete firewall/network config as causes of EADDRINUSE errors and unstable startup.
1 parent fae5be1 commit f3bbe9c

1 file changed, +49 -2 lines changed

ansible/files/gotrue.service.j2

Lines changed: 49 additions & 2 deletions
@@ -1,8 +1,55 @@
 [Unit]
 Description=Gotrue
 
+# Avoid starting gotrue while cloud-init is running. It makes a lot of changes
+# and I would like to rule out side effects of it running concurrently along
+# side services.
+After=cloud-init.service
+Wants=cloud-init.target
+
+# I'm not certain waiting for this service will allow it to apply its changes
+# but I think it's prudent to do our best to let salt apply formula.
+After=supabase-admin-agent_salt.service
+
+# Given the fact that auth uses SO_REUSEADDR, I want to rule out capabilities
+# being modified between restarts early in boot. This plugs up the scenario that
+# EADDRINUSE errors originate from a previous gotrue process starting without
+# the SO_REUSEADDR flag (due to lacking capability at that point in boot proc)
+# so when the next gotrue starts it can't re-use a slow releasing socket.
+After=apparmor.service
+
+# We want sysctl's to be applied
+After=systemd-sysctl.service
+
+# UFW Is modified by cloud init, but started non-blocking, so configuration
+# could be in-flight while gotrue is starting. I want to ensure future rules
+# that are relied on for security posture are applied before gotrue runs.
+After=ufw.service
+
+# We need networking & resolution, auth uses the Go DNS resolver (not libc)
+# so it's possible `localhost` resolution could be unstable early in startup. We
+# care about this because SO_REUSEADDR eligibility checks the tuple
+# (proto, family, addr, port) meaning the AF_INET (ipv4, ipv6) could affect the
+# binding resulting in a second way for EADDRINUSE errors to surface.
+#
+# Note: We should consider removing localhost usage given `localhost` resolution
+# can often be racey early in boot, can be difficult to debug and offers no real
+# advantage in our infra. At the very least avoiding DNS resolved binding would
+# be a good idea.
+Wants=network-online.target systemd-resolved.service
+After=network-online.target systemd-resolved.service
+
+# Auth server can't start unless postgres is online, lets remove a lot of auth
+# server noise during slow starts by requiring it.
+Wants=postgresql.service
+After=postgresql.service
+
+# Lower start limit ival and burst to prevent the noisy flapping
+StartLimitIntervalSec=10
+StartLimitBurst=5
+
 [Service]
-Type=simple
+Type=exec
 WorkingDirectory=/opt/gotrue
 {% if qemu_mode is defined and qemu_mode %}
 ExecStart=/opt/gotrue/gotrue
@@ -24,4 +71,4 @@ EnvironmentFile=-/etc/gotrue.overrides.env
 Slice=services.slice
 
 [Install]
-WantedBy=multi-user.target
+WantedBy=multi-user.target
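
The note in the unit file suggests avoiding a DNS-resolved bind address such as `localhost`. A minimal sketch of that idea follows, written in Python for brevity rather than in the service's Go code. The helper names (`resolved_addresses`, `listen_explicit`, `second_bind_errno`) and the choice of `127.0.0.1` are assumptions for illustration, not anything taken from gotrue. It shows that a name can resolve to more than one address family, that an IP literal pins the family, and that a second bind on the same (family, addr, port) while a listener is active is expected to fail with EADDRINUSE on Linux even with SO_REUSEADDR set.

```python
import errno
import socket
from typing import Optional


def resolved_addresses(host, port):
    """Return (family name, address) pairs the system resolver offers for host.

    For a name such as "localhost" this may include both AF_INET and AF_INET6,
    depending on the host's configuration.
    """
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(socket.AddressFamily(family).name, sockaddr[0])
            for family, _type, _proto, _canon, sockaddr in infos]


def listen_explicit(addr="127.0.0.1", port=0):
    """Listen on an IPv4 literal so no name resolution picks the address family."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((addr, port))  # port 0 lets the kernel choose a free port
    sock.listen()
    return sock


def second_bind_errno(addr, port) -> Optional[int]:
    """Try to bind the same (addr, port) again; return the errno, or None on success."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        probe.bind((addr, port))
        return None
    except OSError as exc:
        return exc.errno
    finally:
        probe.close()


if __name__ == "__main__":
    # Output depends on the host's resolver configuration.
    print(resolved_addresses("localhost", 0))
    server = listen_explicit()
    host, port = server.getsockname()
    print(second_bind_errno(host, port) == errno.EADDRINUSE)
    server.close()
```

Binding to a literal keeps the address family fixed across restarts, so one variable is removed when chasing EADDRINUSE; whether that matches how gotrue is configured in this infra is not something the commit states.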
