Allow supervisor to recover after crash #519
Looking into the test failures
All checks but 1 are passing, and the failure was due to an error from Docker. I tried to find a way to re-run the job, but couldn't. Am I missing something obvious? 😅
Co-authored-by: Rosa Gutierrez <rosa.ge@gmail.com>
Hey @darinwilson, sorry for the delay and radio silence on this one. I kept thinking about this, and I'm not convinced this is the right way to handle the problem. I think whatever process manages the supervisor should be in charge of restarting it, much as the supervisor is in charge of restarting the workers when they crash. This means that when the Puma plugin is used, the whole thing shouldn't crash when Solid Queue stops. I'm going to close this PR, and will keep the original issue open.
@rosa OK - no worries. Thanks for the feedback!
This is a first pass at a fix for #512: allowing Solid Queue to recover if the database goes offline (or if it fails for any other reason).
In the case of the database going away, there are a few possible scenarios that can cause the supervisor to fail, but the most common is:
1. The worker tries to deregister (via `after_shutdown` in `Registrable`) and fails.
2. `Process#deregister` re-raises any exceptions that come up during deregistration, so the worker crashes.
3. The supervisor is also a `Process`, so when it fails, it calls `deregister` just like the worker did.
4. The exception escapes `deregister`, so Solid Queue terminates completely.

After a restart, the maintenance tasks performed by the supervisor do a good job of cleaning up the loose ends left behind, so it seemed like the cleanest approach was just to let the supervisor crash, then spin up a new instance. This is handled by a new `Launcher` class that wraps `Supervisor#start` in a retry block with exponential backoff.

This also adds a new config parameter, `max_restart_attempts`, that allows the user to limit the number of restart attempts. If `nil`, it will retry forever, and if `0`, it won't try at all. (I made `0` the default since that's the current behavior.)

I tested with Postgres and MySQL, but didn't really know how to test SQLite or if it even made sense to. Again, this is just a first attempt - happy to try a different approach if this doesn't seem quite right.
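For illustration, here is a minimal sketch of the retry-with-backoff behavior described above. It is not the code from this PR: the class name `RetryingLauncher`, the `base_delay` and `sleeper` parameters, and the doubling schedule are all assumptions made for the example. Only the `max_restart_attempts` semantics (`nil` retries forever, `0` never restarts) come from the description, and the block passed to `start` stands in for `Supervisor#start`.

```ruby
# Hypothetical sketch; not Solid Queue's actual Launcher.
class RetryingLauncher
  attr_reader :attempts

  # max_restart_attempts: nil retries forever, 0 never restarts.
  # base_delay and sleeper are illustrative; sleeper is injectable so the
  # backoff can be observed without actually sleeping.
  def initialize(max_restart_attempts: 0, base_delay: 1.0,
                 sleeper: ->(seconds) { sleep(seconds) })
    @max_restart_attempts = max_restart_attempts
    @base_delay = base_delay
    @sleeper = sleeper
    @attempts = 0
  end

  # Runs the block (standing in for Supervisor#start) and, if it raises,
  # waits and runs it again until the restart limit is reached.
  def start
    yield
  rescue StandardError
    raise if exhausted? # re-raise the original error once out of attempts

    @attempts += 1
    # Exponential backoff: base_delay, then 2x, 4x, ...
    @sleeper.call(@base_delay * (2**(@attempts - 1)))
    retry
  end

  private

  def exhausted?
    !@max_restart_attempts.nil? && @attempts >= @max_restart_attempts
  end
end

# Usage: the block fails twice (as if the database were offline), then succeeds.
delays = []
calls = 0
launcher = RetryingLauncher.new(max_restart_attempts: 3, base_delay: 0.5,
                                sleeper: ->(seconds) { delays << seconds })
launcher.start do
  calls += 1
  raise "database offline" if calls < 3
end
```

With the default of `0`, the first failure is re-raised immediately, which matches the current behavior of letting the crash propagate.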
Thanks!