Skip to content

Commit 3efed5f

Browse files
alexmvtimabbott
authored andcommitted
queue_processors: Shut down background missedmessage_emails thread.
Python's behaviour on `sys.exit` is to wait for all non-daemon threads to exit. In the context of the missedmessage_emails worker, if any work is pending, a non-daemon Timer thread exists, which is waiting for 5 seconds. As soon as that thread is serviced, it sets up another 5-second Timer, a process which repeats until all ScheduledMessageNotificationEmail records have been handled. This likely takes two minutes, but may theoretically take up to a week until the thread exits, and thus sys.exit can complete. Supervisor only gives the process 30 seconds to shut down, so something else must prevent this endless Timer. When `stop` is called, take the lock so we can mutate the timer. However, since `stop` may have been called from a signal handler, our thread may _already_ have the lock. As Python provides no way to know if our thread is the one which has the lock, make the lock a re-entrant one, allowing us to always try to take it. With the lock in hand, cancel any outstanding timers. A race exists where the timer may not be able to be canceled because it has finished, maybe_send_batched_emails has been called, and is itself blocked on the lock. Handle this case by timing out the thread join in `stop()`, and signal the running thread to exit by unsetting the timer event, which will be detected once it claims the lock.
1 parent acbeeac commit 3efed5f

File tree

1 file changed

+32
-4
lines changed

1 file changed

+32
-4
lines changed

zerver/worker/queue_processors.py

Lines changed: 32 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -16,7 +16,7 @@
1616
from collections import deque
1717
from email.message import EmailMessage
1818
from functools import wraps
19-
from threading import Lock, Timer
19+
from threading import RLock, Timer
2020
from types import FrameType
2121
from typing import (
2222
Any,
@@ -582,8 +582,10 @@ class MissedMessageWorker(QueueProcessingWorker):
582582
# This lock protects access to all of the data structures declared
583583
# above. A lock is required because maybe_send_batched_emails, as
584584
# the argument to Timer, runs in a separate thread from the rest
585-
# of the consumer.
586-
lock = Lock()
585+
# of the consumer. This is a _re-entrant_ lock because we may
586+
# need to take the lock when we already have it during shutdown
587+
# (see the stop method).
588+
lock = RLock()
587589

588590
# Because the background `maybe_send_batched_email` thread can
589591
# hold the lock for an indeterminate amount of time, the `consume`
@@ -640,7 +642,10 @@ def maybe_send_batched_emails(self) -> None:
640642
# self.timer_event just triggered execution of this
641643
# function in a thread, so now that we hold the lock, we
642644
# clear the timer_event attribute to record that no Timer
643-
# is active.
645+
# is active. If it is already None, stop() is shutting us
646+
# down.
647+
if self.timer_event is None:
648+
return
644649
self.timer_event = None
645650

646651
current_time = timezone_now()
@@ -696,6 +701,29 @@ def maybe_send_batched_emails(self) -> None:
696701
if ScheduledMessageNotificationEmail.objects.exists():
697702
self.ensure_timer()
698703

704+
def stop(self) -> None:
705+
# This may be called from a signal handler when we _already_
706+
# have the lock. Python doesn't give us a way to check if our
707+
# thread has the lock, so we instead use a re-entrant lock to
708+
# always take it.
709+
with self.lock:
710+
# With the lock,we can safely inspect the timer_event and
711+
# cancel it if it is still pending.
712+
if self.timer_event is not None:
713+
# We cancel and then join the timer with a timeout to
714+
# prevent deadlock, where we took the lock, the timer
715+
# then ran out and started maybe_send_batched_emails,
716+
# and then it started waiting for the lock. The timer
717+
# isn't running anymore so can't be canceled, and the
718+
# thread is blocked on the lock, so will never join().
719+
self.timer_event.cancel()
720+
self.timer_event.join(timeout=1)
721+
# In case we did hit this deadlock, we signal to
722+
# maybe_send_batched_emails that it should abort by,
723+
# before releasing the lock, unsetting the timer.
724+
self.timer_event = None
725+
super().stop()
726+
699727

700728
@assign_queue("email_senders")
701729
class EmailSendingWorker(LoopQueueProcessingWorker):

0 commit comments

Comments
 (0)