-
-
Notifications
You must be signed in to change notification settings - Fork 348
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Global deadlock detector #1085
Comments
Minor tricky bit: our |
#456 raises some important questions about how we define "blocked" for purposes of introspective stuff like |
I have a branch to experiment with splitting
|
Some more interesting issues that occurred to me, around the interaction between deadlock detection and
A simple solution would be to have a flag you pass when creating a Alternatively, I think we could use a trick: when you call This sounds a bit elaborate, but there's actually some theoretical benefit too: I don't think you can have a correct system that relies on toggling the DDBNP from False -> True from outside Trio. There's an inherent race condition: if you're relying on setting it to True to block the deadlock detector from firing, then how do you know that the deadlock detector won't fire just before you set it to True? But if people have to go into Trio to get a deadlock-detector-blocking-ticket, then that automatically gives you a synchronization point. Of course, if someone has a ticket with DDBNP = False, they can still use that to enter Trio and get a ticket with DDBNP = True, and that has exactly the same issues as toggling a single ticket in-place. We don't really need to solve this right now – first we need to get the new entry ticket API in place, and then add deadlock detection, and then we can worry about whether toggling this is important. |
OK better idea! Let's add (Or maybe call it Then:
So maybe we don't even need to replace |
or maybe Anyway, the concept seems sound enough. However, I'd like to have the opposite option, i.e. the ability to mark a task as non-deadlock-breaking even if it is (or will shortly be) runnable. Use case: a protocol's keepalive task which runs periodically and sends off an "I am alive" packet. As I understand it, without this option the existence of this task would be sufficient to inhibit deadlock detection. |
That's an interesting idea. I haven't thought too much about cases like a keepalive task. For now I've mostly focused on figuring out how to make sure this never gives false positives, since a false positive is "we just blew up your working program" while a false negative is "your broken program could potentially have gotten better diagnostics of how it was broken, but didn't". And I think the biggest value will be for small programs, like simple scripts and for unit tests; I guess large programs will generally have something somewhere that keeps running, even if some subsystem is deadlocked. But... a small-but-annoying design problem I've been thinking about is, how should we prevent the wakeup-pipe task from blocking the deadlock detector. It's definitely solvable, since this is a special internal task with privileged access to Trio's internals. But right now it just blocks in Another example where this would be useful would be python-trio/hip#125, where if you're using an HTTP library that silently uses background tasks to listen for pings on HTTP/2 and HTTP/3 connections, then having ever used HTTP/2 or HTTP/3 in your program could produce a situation where the deadlock detector never fires. (Depending on how patient the server is.) Not very nice! I guess the implementation could be pretty trivial too... async def wait_task_rescheduled(..., inhibit_deadlock_detector=...):
task = current_task()
...
inhibit_deadlock_detector = (inhibit_deadlock_detector and task.can_inhibit_deadlock_detector)
... |
After #1588, I think actual deadlock detection could be as simple as: # in run loop
idle_primed = False
if runner.waiting_for_idle:
...
elif runner.clock_autojump_threshold < timeout:
...
elif runner.external_wakeup_count == 0 and not runner.guest_mode:
timeout = 0
idle_primed = True
idle_prime_type = IdlePrimedTypes.DEADLOCK
...
if idle_primed and idle:
if idle_primed_type is IdlePrimedTypes.DEADLOCK:
# ... crash with deadlock message ... And @contextmanager
def not_deadlocked():
task = trio.current_task()
if not task.ignored_by_deadlock_detector:
GLOBAL_RUN_STATE.runner.external_wakeup_count += 1
try:
yield
finally:
GLOBAL_RUN_STATE.runner.external_wakeup_count -= 1
else:
yield IO primitives do (Maybe it's also convenient allow a We would also need to update the autojump clock so that if the autojump is triggered but there is no next deadline, then it runs a version of the deadlock-detected code. Though probably with a custom message. A minor open question: what to do with This is all looking pretty straightforward. I think the main remaining blockers are getting code to dump the task tree when a deadlock happens (#927), and some way to kill everything with a custom exception (maybe the |
On further thought, going through the new if timeout == inf and not runner.external_wakeup_count and not runner.guest_mode:
# deadlock detected This also requires moving the Ugh this will be so cool and it's so easy, I can taste it. But we need to sort out those blockers first. |
We have an issue for fancier deadlock detection, and API support to make it more useful (#182). This is about a simpler issue: detecting when the entire program has deadlocked, i.e. no tasks are runnable or will ever be runnable again. This is not nearly as fancy, but it would catch lots of real-world deadlock cases (e.g. in tests), and is potentially wayyy simpler. In particular, I believe a Trio program has deadlocked if:
IOManager
wait_all_tasks_blocked
(Did I miss anything?)
However, there is one practical problem: the
EntryQueue
task is always blocked in theIOManager
, waiting for someone to callrun_sync_soon
.Practical example of why this is important: from the Trio scheduler's point of view,
run_sync_in_worker_thread
puts a task to sleep, and then later a call toreschedule(...)
magically appears throughrun_sync_soon
. So... it's entirely normal to be in a state where the whole program looks deadlocked except for the possibility of getting arun_sync_soon
, and the program actually isn't deadlocked. But, of course, 99% of the time, there is absolutely and definitely norun_sync_soon
call coming. There's just no way for Trio to know that.So I guess to make this viable, we would need some way to recognize the 99% of cases where there is no chance of a
run_sync_soon
. I think that means, we need to refactorTrioToken
so that it uses an acquire/release pattern: you acquire the token only if you plan to callrun_sync_soon
, and then when you're done with it you explicitly close it.This will break the other usage of
TrioToken
, which is that you can compare them withis
to check if two calls totrio.run
are in fact the same. Maybe this is not even that useful? If it is though then we should split it off into a separate class, so that the only reason to acquire therun_sync_soon
-object is because you're going to callrun_sync_soon
.Given that, I think we could implement this by extending the code at the top of the event loop like:
This is probably super-cheap too, because we only do the extra checks when there are no runnable tasks or deadlines. No runnable tasks means we're either about to go to sleep for a while, so taking some extra time here is "free", or else that we're about to detect I/O, but if there's outstanding I/O then you should probably have a deadline set...
The text was updated successfully, but these errors were encountered: