Commit 8847978

Add a simple system startup guard
Since it's going to be possible in the official Nerves systems to try out firmware that needs logic to validate it, there needs to be a simple way for new users to use it. This is a really basic startup guard that waits for all OTP applications in the start script to be running and then validates the running firmware. Applications not starting result in a reboot after 15 minutes which will either revert or go through the process again. A warning message is printed every minute to hopefully clue people into what's happening since it's guaranteed that 15 minutes won't work for everyone.
1 parent 633ebdb commit 8847978

8 files changed: +426 -64 lines changed

README.md

Lines changed: 56 additions & 60 deletions
Original file line number | Diff line number | Diff line change
@@ -176,38 +176,53 @@ encrypted firmware storage. See `Nerves.Runtime.FwupOps.prevent_revert/0`.
176176

177177
### Assisted firmware validation and automatic revert
178178

179-
Nerves firmware updates protect against update corruption and power loss
180-
midway into the update procedure. However, what happens if the firmware update
181-
contains bad code that hangs the device or breaks something important like
182-
networking? Some Nerves systems support tentative runs of new firmware and if
183-
something goes wrong, they'll revert back.
179+
Nerves firmware updates protect against update corruption and power loss midway
180+
into the update procedure. However, what happens if the firmware update contains
181+
bad code that hangs the device or breaks something important like networking?
182+
Some Nerves systems support tentative runs of new firmware and if something goes
183+
wrong, they'll revert back.
184184

185185
At a high level, this involves some additional code from the developer that
186-
knows what constitutes "working". This could be "is it possible to connect to
187-
the firmware update server within 5 minutes of boot?"
188-
189-
Here's the process:
190-
191-
1. New firmware is installed in the normal manner. The `Nerves.Runtime.KV`
192-
variable, `nerves_fw_validated` is set to 0. (The systems `fwup.conf` does
193-
this)
194-
2. The system reboots like normal.
195-
3. The device starts a five minute reboot timer (your code needs to do this if
196-
you want to catch hangs or super-slow boots)
197-
4. The application attempts to make a connection to the firmware update server.
198-
5. On a good connection, the application sets `nerves_fw_validated` to 1 by
199-
calling `Nerves.Runtime.validate_firmware/0` and cancels the reboot timer.
200-
6. On error, the reboot timer failing, or a hardware watchdog timeout, the
201-
system reboots. The bootloader reverts to the previous firmware.
202-
203-
Some Nerves systems support a KV variable called `nerves_fw_autovalidate`. The
204-
intention of this variable was to make that system support scenarios that
205-
require validate and ones that don't. If the system supports this variable then
206-
you should make sure that it is set to 0 (either via a custom fwup.conf or via
207-
the provisioning hooks for writing serial numbers to MicroSD cards). Support for
208-
the `nerves_fw_autovalidate` variable will likely go away in the future as steps
209-
are made to make automatic revert on bad firmware a default feature of Nerves
210-
rather than an add-on.
186+
knows what constitutes "working". `Nerves.Runtime` comes with a module,
187+
`Nerves.Runtime.StartupGuard`, that handles this by waiting for all OTP
188+
applications to start and then validates the new firmware.
189+
190+
To use `Nerves.Runtime.StartupGuard`, first check whether your Nerves system
191+
automatically validates firmware after it is written successfully. All official
192+
systems previously did this for simplicity, and we're in the process of changing
193+
that. To check, upload new firmware built from your project and run
194+
`Nerves.Runtime.firmware_validation_status/0`. If it reports validated and you
195+
don't have `Nerves.Runtime.StartupGuard` enabled, then the system auto-validates.
196+
Otherwise, run `Nerves.Runtime.validate_firmware/0` to validate manually. To have
197+
`Nerves.Runtime.StartupGuard` validate the firmware for you, add the
198+
following to your project's `target.exs` or `config.exs`:
199+
200+
```elixir
201+
config :nerves_runtime, startup_guard_enabled: true
202+
```
203+
204+
Then add the following to your project's `rel/vm.args.eex`:
205+
206+
```text
207+
## Require an initialization handshake within 10 minutes
208+
-env HEART_INIT_TIMEOUT 600
209+
```
210+
211+
Of course, there's much room for improvement. For example, if your Nerves device
212+
connects to a firmware update server, the criteria for validating new firmware
213+
could be connecting to that server.
214+
215+
Recommendations for this process are:
216+
217+
1. Allow enough time in a bad state for remote debugging if that's
218+
possible. Rebooting immediately can limit diagnostic options when unexpected
219+
things happen remotely.
220+
2. Link the validation code to Nerves Heart. This can protect against failures
221+
and hangs that occur before the validation process starts.
222+
3. Keep the heart callback code as simple as possible since heart is very
223+
unforgiving to errors, exceptions, and slow code.
224+
225+
One way to start is to copy/paste `Nerves.Runtime.StartupGuard` and modify.
211226

212227
### U-Boot assisted automatic revert
213228

@@ -232,23 +247,6 @@ environment variable to `"1"` to indicate that boot counting should start.
232247
you call it to indicate that the firmware is ok, it will set `upgrade_available`
233248
back to `"0"` and reset `"bootcount"`.
234249

235-
### Best effort automatic revert
236-
237-
Unfortunately, the bootloader for platforms like the Raspberry Pi makes it
238-
difficult to implement the above mechanism. The following strategy cannot
239-
protect against kernel and early boot issues, but it can still provide value:
240-
241-
1. Upgrade firmware the normal way. Record that the next boot will be the first
242-
one in the application data partition.
243-
2. On the reboot, if this is the first one, record that the boot happened and
244-
revert the firmware with `reboot: false`. If this is not the first boot,
245-
carry on.
246-
3. When you're happy with the new firmware, revert the firmware again with
247-
`reboot: false`. I.e., revert the revert. It is critical that `revert` is
248-
only called once.
249-
250-
To make this handle hangs, you'll want to enable a hardware watchdog.
251-
252250
## Serial numbers
253251

254252
Finding the serial number of a device is both hardware specific and influenced
@@ -299,19 +297,18 @@ Task | Description
299297

300298
## Application environment
301299

302-
This section documents officially supported application environment keys.
300+
This section documents officially supported application environment keys that
301+
can be added to your `config.exs`, `target.exs`, or the like.
303302

304-
Most users shouldn't need to modify the application environment for
305-
`nerves_runtime` except for unit testing. See the next section for testing.
306-
307-
Key | Default | Description
308-
--------------- | ----------------------------------- | ------------
309-
`:boardid_path` | `"/usr/bin/boardid"` | Path to the `boardid` binary for determining the device's serial number
310-
`:devpath` | `/dev/rootdisk0` | The block device that firmware is stored on. `/dev/rootdisk0` is a symlink on Nerves to the real location, so this really shouldn't need to be changed.
311-
`:fwup_env` | `%{}` | Additional environment variables to pass to `fwup`
312-
`:fwup_path` | `"fwup"` | Path to the `fwup` binary for querying or modifying firmware status
313-
`:kv_backend` | `Nerves.Runtime.KVBackend.UBootEnv` | The backing store for firmware slot and other low level key-value pairs. This is almost always a U-Boot environment block for Nerves
314-
`:ops_fw_path` | `"/usr/share/fwup/ops.fw"` | Path to the `ops.fw` file for passing to `fwup` for firmware status tasks
303+
Key | Default | Description
304+
------------------------- | ----------------------------------- | ------------
305+
`:boardid_path` | `"/usr/bin/boardid"` | Path to the `boardid` binary for determining the device's serial number (useful for unit tests)
306+
`:devpath` | `/dev/rootdisk0` | The block device that firmware is stored on. `/dev/rootdisk0` is a symlink on Nerves to the real location, so this really shouldn't need to be changed. (useful for unit tests)
307+
`:fwup_env` | `%{}` | Additional environment variables to pass to `fwup`. (useful for unit tests)
308+
`:fwup_path` | `"fwup"` | Path to the `fwup` binary for querying or modifying firmware status. (useful for unit tests)
309+
`:kv_backend` | `Nerves.Runtime.KVBackend.UBootEnv` | The backing store for firmware slot and other low level key-value pairs. This is almost always a U-Boot environment block for Nerves. (useful for unit tests)
310+
`:ops_fw_path` | `"/usr/share/fwup/ops.fw"` | Path to the `ops.fw` file for passing to `fwup` for firmware status tasks. (useful for unit tests)
311+
`:startup_guard_enabled` | `false` | Check that all OTP applications start up and then validate the firmware if needed. Reboot after 15 minutes if start up isn't successful.
315312

316313
## Using nerves_runtime in tests
317314

@@ -357,4 +354,3 @@ All original source code in this project is licensed under Apache-2.0.
357354

358355
Additionally, this project follows the [REUSE recommendations](https://reuse.software)
359356
and labels so that licensing and copyright are clear at the file level.
360-
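
The README's suggestion that connecting to a firmware update server could be the validation criterion can be sketched roughly as follows. This is a minimal illustration, not part of this commit: `MyApp.ConnectivityGuard` and its `:check`, `:status`, `:validate`, `:attempts`, and `:delay_ms` options are hypothetical names. The three functions are injected so the sketch runs off-device; on a target, `:status` and `:validate` would presumably be `&Nerves.Runtime.firmware_validation_status/0` and `&Nerves.Runtime.validate_firmware/0`.

```elixir
defmodule MyApp.ConnectivityGuard do
  @moduledoc """
  Hypothetical sketch: validate firmware only after a caller-supplied health
  check passes. Not part of nerves_runtime.
  """

  @spec run(keyword()) :: :ok | {:error, :check_failed}
  def run(opts) do
    # Injected functions (assumptions of this sketch, not a real API).
    check = Keyword.fetch!(opts, :check)
    status = Keyword.fetch!(opts, :status)
    validate = Keyword.fetch!(opts, :validate)
    attempts = Keyword.get(opts, :attempts, 10)
    delay_ms = Keyword.get(opts, :delay_ms, 10_000)

    if wait_for(check, attempts, delay_ms) do
      # Only touch the validation state when the firmware still needs it.
      if status.() == :unvalidated, do: :ok = validate.()
      :ok
    else
      {:error, :check_failed}
    end
  end

  # Out of attempts: report failure instead of looping forever.
  defp wait_for(_check, 0, _delay_ms), do: false

  defp wait_for(check, attempts, delay_ms) do
    if check.() do
      true
    else
      Process.sleep(delay_ms)
      wait_for(check, attempts - 1, delay_ms)
    end
  end
end

# Off-device usage with stand-in functions.
result =
  MyApp.ConnectivityGuard.run(
    check: fn -> true end,
    status: fn -> :unvalidated end,
    validate: fn -> :ok end,
    delay_ms: 0
  )

IO.inspect(result)
```

Returning `{:error, :check_failed}` rather than rebooting leaves the reboot decision to the caller, in line with the recommendation to allow time for remote debugging.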

lib/nerves_runtime.ex

Lines changed: 13 additions & 0 deletions
@@ -264,4 +264,17 @@ defmodule Nerves.Runtime do
264264
# it's come up so far.
265265
Function.identity(@mix_target)
266266
end
267+
268+
@doc false
269+
@spec get_expected_started_apps() :: {:ok, [atom()]} | :error
270+
def get_expected_started_apps() do
271+
{:ok, [[boot]]} = :init.get_argument(:boot)
272+
contents = File.read!("#{boot}.boot")
273+
{:script, _name, instructions} = :erlang.binary_to_term(contents)
274+
275+
apps = for {:apply, {:application, :start_boot, [app | _]}} <- instructions, do: app
276+
{:ok, apps}
277+
rescue
278+
_ -> :error
279+
end
267280
end
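
To illustrate what `get_expected_started_apps/0` extracts, here is a small self-contained sketch that runs the same comprehension over a hand-written boot script term. The script contents and the `:my_app` application are made up for illustration; a real `.boot` file contains many more instructions.

```elixir
# Sample boot-script term, shaped like the contents of a release's .boot file.
# The application list here is invented for this example.
script =
  {:script, {"my_app", "0.1.0"},
   [
     {:progress, :preloaded},
     {:apply, {:application, :start_boot, [:kernel, :permanent]}},
     {:apply, {:application, :start_boot, [:stdlib, :permanent]}},
     {:apply, {:application, :start_boot, [:my_app, :permanent]}},
     {:apply, {:c, :erlangrc, []}},
     {:progress, :started}
   ]}

# .boot files store the script in Erlang's external term format, so round-trip
# through a binary to mirror File.read!/1 followed by :erlang.binary_to_term/1.
binary = :erlang.term_to_binary(script)
{:script, _name, instructions} = :erlang.binary_to_term(binary)

# Instructions that don't match the generator pattern are silently skipped.
apps = for {:apply, {:application, :start_boot, [app | _]}} <- instructions, do: app

IO.inspect(apps)
```

Because the generator pattern acts as a filter, progress markers and other `:apply` instructions never reach the result list.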

lib/nerves_runtime/application.ex

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,7 @@ defmodule Nerves.Runtime.Application do
1212

1313
alias Nerves.Runtime.FwupOps
1414
alias Nerves.Runtime.KV
15+
alias Nerves.Runtime.StartupGuard
1516

1617
require Logger
1718

@@ -20,7 +21,11 @@ defmodule Nerves.Runtime.Application do
2021
load_services()
2122

2223
options = Application.get_all_env(:nerves_runtime)
23-
children = [{FwupOps, options}, {KV, options} | target_children()]
24+
25+
startup_guard_children =
26+
if options[:startup_guard_enabled], do: [{StartupGuard, options}], else: []
27+
28+
children = [{FwupOps, options}, {KV, options}] ++ startup_guard_children ++ target_children()
2429

2530
opts = [strategy: :one_for_one, name: Nerves.Runtime.Supervisor]
2631
Supervisor.start_link(children, opts)
Lines changed: 191 additions & 0 deletions
@@ -0,0 +1,191 @@
1+
# SPDX-FileCopyrightText: 2025 Frank Hunleth
2+
#
3+
# SPDX-License-Identifier: Apache-2.0
4+
defmodule Nerves.Runtime.StartupGuard do
5+
@moduledoc """
6+
Monitor system startup and validate firmware
7+
8+
This module provides an easy option for validating firmware for simple use
9+
cases. Whether new firmware even needs to be validated on first boot is
10+
determined by the Nerves system that you're using. When in doubt, an easy way
11+
to know is if you have to run `Nerves.Runtime.validate_firmware/0` every time
12+
you upload new firmware, then your Nerves system requires validation. While
13+
you may eventually want to check that networking or other things work before
14+
validating, using this module should suffice in the meantime.
15+
16+
## Setup
17+
18+
Add the following to your project's `target.exs` or `config.exs`:
19+
20+
```elixir
21+
config :nerves_runtime, startup_guard_enabled: true
22+
```
23+
24+
Add the following to your project's `rel/vm.args.eex`:
25+
26+
```text
27+
## Require an initialization handshake within 10 minutes
28+
-env HEART_INIT_TIMEOUT 600
29+
```
30+
31+
The discussion below explains more about the heart initialization handshake
32+
timer.
33+
34+
## Discussion
35+
36+
Here's the high level summary:
37+
38+
1. New firmware is unvalidated on first boot. If it's not validated, the
39+
next reboot runs the previous firmware again.
40+
2. This module considers firmware good if the OTP release starts all
41+
applications successfully. If this doesn't happen in 15 minutes, the
42+
system reboots.
43+
3. After application startup confirmation, the running firmware is
44+
validated if this is the first boot by calling
45+
`Nerves.Runtime.validate_firmware/0`.
46+
4. `StartupGuard` stops running.
47+
48+
This sounds good, but broken firmware can also hang or not call the code that
49+
gives up after 15 minutes.
50+
51+
Protecting against hung code eventually leads to making use of a hardware
52+
watchdog. Most Nerves systems use one and integrate it with the Erlang
53+
heart feature. The hardware watchdog is still a last resort, so other systems
54+
can certainly try to gracefully reboot before the hardware watchdog kicks in.
55+
56+
This module registers with Erlang's heart. The
57+
`Nerves.Runtime.Heart.init_complete/0` call is a Nerves extension to heart to
58+
cancel a timer on setting the Erlang heart callback. This addresses hangs
59+
before setting the callback or just something skipping the code entirely.
60+
61+
Keep in mind that the heart callback is totally unforgiving to errors and
62+
function calls taking too long. Making it too complicated can backfire and
63+
cause inadvertent reboots. Rebooting too quickly on errors can impact your
64+
ability to debug partial failures. If using this code as a template, try to
65+
keep your code in `Task` or change this to a `GenServer` or anything else
66+
that can be supervised.
67+
"""
68+
use Task, restart: :transient
69+
70+
alias Nerves.Runtime.Heart
71+
72+
require Logger
73+
74+
@retry_delay :timer.seconds(10)
75+
@give_up_minutes 15
76+
@start_warning_minutes 2
77+
78+
@doc false
79+
@spec start_link(keyword()) :: {:ok, pid()}
80+
def start_link(opts) do
81+
Task.start_link(__MODULE__, :run, [opts])
82+
end
83+
84+
@doc false
85+
@spec run(keyword()) :: :ok
86+
def run(opts) do
87+
retry_delay = Keyword.get(opts, :retry_delay, @retry_delay)
88+
89+
# Register with heart to bulletproof against hangs or other weirdness happening
90+
# in this code.
91+
:ok = :heart.set_callback(__MODULE__, :heart_check)
92+
Heart.init_complete()
93+
94+
# Wait for all of the applications specified in the release to start.
95+
{:ok, expected_apps} =
96+
repeat_while(&Nerves.Runtime.get_expected_started_apps/0, :error, 10, retry_delay)
97+
98+
repeat_until(fn -> all_applications_started?(expected_apps) end, 10, retry_delay)
99+
100+
# Try getting the firmware validation status. If :unknown, hope.
101+
status = repeat_while(&Nerves.Runtime.firmware_validation_status/0, :unknown, 10, retry_delay)
102+
103+
# Validate or not.
104+
if status == :unvalidated do
105+
Logger.info("Firmware not validated. Validating now...")
106+
:ok = Nerves.Runtime.validate_firmware()
107+
Logger.info("Firmware validated successfully")
108+
else
109+
Logger.info("Firmware valid and all applications started successfully")
110+
end
111+
112+
# Stop the heart callback since all is good now
113+
:heart.clear_callback()
114+
end
115+
116+
defp repeat_until(_fun, 0, _retry_delay) do
117+
raise RuntimeError, "Exceeded maximum retries"
118+
end
119+
120+
defp repeat_until(fun, retries, retry_delay) do
121+
if !fun.() do
122+
Process.sleep(retry_delay)
123+
repeat_until(fun, retries - 1, retry_delay)
124+
end
125+
end
126+
127+
defp repeat_while(_fun, _unwanted_result, 0, _retry_delay) do
128+
raise RuntimeError, "Exceeded maximum retries"
129+
end
130+
131+
defp repeat_while(fun, unwanted_result, retries, retry_delay) do
132+
result = fun.()
133+
134+
if result == unwanted_result do
135+
Process.sleep(retry_delay)
136+
repeat_while(fun, unwanted_result, retries - 1, retry_delay)
137+
else
138+
result
139+
end
140+
end
141+
142+
@doc false
143+
@spec heart_check() :: :ok | :error
144+
def heart_check() do
145+
uptime_minutes = get_uptime_minutes()
146+
147+
do_heart_check(uptime_minutes)
148+
end
149+
150+
@doc false
151+
@spec do_heart_check(non_neg_integer()) :: :ok | :error
152+
def do_heart_check(uptime_minutes) do
153+
cond do
154+
uptime_minutes >= @give_up_minutes ->
155+
Logger.error("Took too long to validate firmware. Rebooting.")
156+
:error
157+
158+
uptime_minutes < @start_warning_minutes ->
159+
:ok
160+
161+
uptime_minutes != Process.get(:last_warning_minutes) ->
162+
Logger.warning(
163+
"Firmware not validated. Check logs. Rebooting in #{@give_up_minutes - uptime_minutes} minutes if unfixed."
164+
)
165+
166+
Process.put(:last_warning_minutes, uptime_minutes)
167+
:ok
168+
169+
true ->
170+
:ok
171+
end
172+
end
173+
174+
defp get_uptime_minutes() do
175+
{total, _last_call} = :erlang.statistics(:wall_clock)
176+
div(total, 60_000)
177+
end
178+
179+
defp all_applications_started?(expected_apps) do
180+
actual_apps = for {app, _, _} <- Application.started_applications(), do: app
181+
182+
unstarted_apps = expected_apps -- actual_apps
183+
184+
if unstarted_apps != [] do
185+
Logger.warning("Waiting on the following applications to start: #{inspect(unstarted_apps)}")
186+
false
187+
else
188+
true
189+
end
190+
end
191+
end
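
The retry helpers in `StartupGuard` are private, so the following standalone sketch shows how `repeat_while/4` behaves. `RetrySketch` and the `flaky` function are hypothetical stand-ins written for this example. With the module's defaults (10 attempts, 10 second delay), each wait phase appears to give up after roughly 100 seconds by raising. Since the task is declared `restart: :transient`, its supervisor should then restart it, so the overall deadline is set by the 15 minute heart check rather than by a single pass.

```elixir
defmodule RetrySketch do
  # Call `fun` until it returns something other than `unwanted`, sleeping
  # `delay_ms` between attempts and raising once the attempts are used up.
  def repeat_while(_fun, _unwanted, 0, _delay_ms) do
    raise RuntimeError, "Exceeded maximum retries"
  end

  def repeat_while(fun, unwanted, retries, delay_ms) do
    case fun.() do
      ^unwanted ->
        Process.sleep(delay_ms)
        repeat_while(fun, unwanted, retries - 1, delay_ms)

      result ->
        result
    end
  end
end

# A stand-in for a call that fails twice before succeeding.
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  attempt = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  if attempt < 3, do: :error, else: {:ok, attempt}
end

# Use a 1 ms delay so the example finishes immediately.
result = RetrySketch.repeat_while(flaky, :error, 10, 1)
IO.inspect(result)
```

The sketch pins the unwanted value with `^unwanted` instead of comparing with `==`; for atoms such as `:error` and `:unknown` the two behave the same.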
