-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
using 100% trying to acquire lock #15
Comments
my reading of finally without an except is that it will never raise the exception because it gets thrown away. Thats obviously not the case. Either way I suspect changing the code to
Might solve the lock issue. But still to understand how the state gets a key error |
Ah ok. the cause of the keyerror is because the original exception is happening in another subscriber called first. But end request handler is still called regardless. So the fix code could be
|
I've changed my mind. Everything says that python guarantees finally is called. so I can't think of any way the del _state[id] exception can be the cause of the lock not being released. Can you @zopyx ? |
Sorry, no idea...ask Dieter |
The |
In principle, the lock management should be correct: the lock is acquired (and released in a A note in Note: If a subscriber to 'IPubSuccess' raises an exception,
then 'IPubFailure' may be notified in addtion to 'IPubSuccess'. i.e. it is possible that there are 2 |
@d-maurer the keyerror is not the problem. I've fixed that in the referenced pr. The problem is that I'm getting 100% CPU from a thread stuck in line 80 trying to acquire a lock and I can't see how your code is wrong to cause a lock to not be released that results in this kind of deadlock. |
I am just writing this to confirm that code like this would be essentially wrong, and |
Dylan Jay wrote at 2023-8-23 06:38 -0700:
...
The problem is that I'm getting 100% CPU from a thread stuck in line 80 trying to acquire a lock and I can't see how your code is wrong to cause a lock to not be released that results in this kind of deadlock.
I explained the only thing I see which could block the lock release
for some time: misbehaving destructors activated during item deletion.
I do not think the probability is high because the request (the main
critical object) is not destroyed at this point (but keep alive
until the request is closed).
Thus, I too, do not have a convincing idea.
|
@d-maurer I'm not sure I understand how a destructor might hold a lock? My reasoning is the only thing that could cause the lock to not be released is an exception in
Since that has no try: finally:. but no evidence this happened. Also that means the loop would have been exited and there is no way for it to get stuck on line 80... By destructors during item deletion do you mean that you think this could happen in when the state is deleted?
however finally should handle the release of the lock so would not cause the deadlock. Anywhere else a destructor would not result in a lock not being released so wouldnt be a problem. So I'm still stumped. |
@d-maurer actually I just did a test. lock.acquire doesn't use hardly any cpu so it can't be the cause. it might be just that when I did py-spy dump its what I keep seeing as the line it's stuck on because it's the slowest. Instead it could be just running that loop fast with no sleep. but the only way that could be is if somehow config.period get reset to 0 halfway through the life of the zope process??? I just can't see how that could happen. |
Dylan Jay wrote at 2023-8-24 03:08 -0700:
@d-maurer I'm not sure I understand how a destructor might hold a lock?
`account_request` is holding the lock while it executes
`del state[id]`. The `del` may invoke arbitrary destructors
and potentially one of them takes a lot of time.
My reasoning is the only thing that could cause the lock to not be released is an exception in
https://github.com/collective/haufe.requestmonitoring/blob/3e1af58eeeb014a3f94aa3fc37893f68cdbf4dd2/haufe/requestmonitoring/monitor.py#L81
`state` is a `dict`; `state.copy()` creates a shallow copy.
The only possible exception should be a `MemoryError`.
…--
Dieter
|
@d-maurer none of this explains what I'm seeing so still stumped. No matter how many times I run py-spy on all of the running instances that switch to 100% cpu I get the same result. Stuck on the lock.acquire on line 80. No other active threads most of the time. and acquire should use no cpu if it's blocked and it has no timeout so it's not running in a tight loop.
So none of it makes sense. But the PR at least fixes the problem of masking of other errors that happen in other subscribers. Once I get it into production we will see if helps with the 100% cpu issue or not, despite that in theory it should not. |
Dylan Jay wrote at 2023-8-30 22:41 -0700:
...
But the PR at least fixes the problem of masking of other errors that happen in other subscribers.
I have not looked at your PR. But, if it not already does, it could
close the small hole where an exception in `dict.copy`
(only `MemoryError` should be possible) prevents the lock release.
Enclose the `copy` in a `try: ... finally:` with the lock release in
the `finally` clause.
I do not expect that this is the cause of your problem but
nevertheless, it would be an improvement.
I remember to have had analysis problems similar to yours
(I observed a thread apparently continuously at a harmless place)
in the past. In my case, the garbage collector was responsible:
some third party component had created hundreds of thousand
temporary objects causing frequent garbage collector activations.
Running in "C", the garbage collector is not seen by stack
analysis targeting Python code.
During garbage collector runs, 100 % CPU usage is not surprising.
Of course, such runs should end (but if you have a large number
of objects, a complete GC turn can take some time).
|
@d-maurer yes of course I wrapped the copy. The 100% cpu goes on forever if not restarted. Interesting possible cause but don't think that can be it. |
@d-maurer after deployment the 100% cpu problem still happens.
except now the line it's hanging on is
but this is just as mysterious since I sleep shouldn't be using any cpu. Unless somehow py-spy is lumping GIL/async stuff into the sleep call and something else is causing the cpu usuage? or perhaps the other explanation is it's a super tight loop and period is very small but still big enough that the thread always seems to end up on that line. |
Dylan Jay wrote at 2023-9-3 20:03 -0700:
@d-maurer after deployment the 100% cpu problem still happens.
That was to be expected: there are really very few conditions for a
`dict.copy` failure.
...
lumping GIL/async stuff into the sleep call and something else is causing the cpu usuage?
That depends how "py-spy" determines its data.
The "py-spy" output suggests that it is based on "samples"
(--> "Samples 10491"), i.e. snapshots.
It is natural that snapshots of `Monitor.run` will typically
see it in the `sleep` call (where is is almost all the time).
You have not yet told us (I think) on what platform you
are observing the problem.
Should you see it on Linux, then threads are mapped to processes
and you should be able to find out which process is continously running.
Hopefully, you can tell "py-spy" to include the process/thread relation
into its output (or add appropriate diagnostic output) to find out which
thread is responsible. In this way, you might be able to determine
whether the monitor is responsible or not.
|
@d-maurer I included the thread dump but in an edit so you wouldn't have seen it in the notification.
I think thats what you mean. if not, not sure what kind of process dump you mean. top just says python. I did look at the sentry thread to see if it could be related. it's idle on this line - ttps://github.com/python/cpython/blob/3.9/Lib/threading.py#L312 The platform is linux running inside docker.
True, but I wouldn't except to see it at 100% CPU in top and 100% in py-spy... |
Dylan Jay wrote at 2023-9-4 00:28 -0700:
@d-maurer I included the thread dump but in an edit so you wouldn't have seen it in the notification.
...
Thread 0x7F39770AB700 (active): "Dummy-1"
run (haufe/requestmonitoring/monitor.py:78)
It apparently tells us that `monitor.run` is active.
As you wrote, it should not be active in a `sleep` (at least not
for a longer period).
You you verify that the sleep time (i.e. the monitor's `period`)
still is reasonable.
For a client, I had created a set of views which allows inline debugging
of a running Zope process -- similar to what the so called
"monitor server" of `ZServer` allowed (but safer as I used it in a
production environment). They allow (under special setup conditions)
to interactively [re]import modules and call functions/views.
Something like this might help you to analyse the state.
...
I think thats what you mean. if not, not sure what kind of process dump you mean. top just says python.
Not precisely: Processes are identified by process ids (i.e. small
integers). Many Linux tools work with those process ids (e.g. "kill",
"ps", "top", "strace").
...
True, but I wouldn't except to see it at 100% CPU in top and 100% in py-spy...
Once you have the process id of the 100 % CPU eater, you can use
`strace -p` to learn which system services it calls
(not sure how easy this will be with `docker`).
You can also try to kill this process with various signals
(not sure this will work: when I remember right, Python handles
all signals in the main thread).
|
@d-maurer I understand processes (or think I did) but single process is all the threads. hense why I used py-spy. This is what two of the waitress instances look like from the os level.
strace gives us a lot of
Does that tell is anything? |
@d-maurer if I look at one of the normal instances I get something similar but much slower output.
so perhaps it does look like somehow that sleep delay is somehow getting reset down to some really small number like 0 and then we get a tight loop? |
Dylan Jay wrote at 2023-9-4 20:07 -0700:
@d-maurer I understand processes (or think I did) but single process is all the threads. hense why I used py-spy.
You are right. Apparently, a thread is no longer (since Linux 2.4)
identified by a pid, but now has an identifying `tid` (obtained via the system
call `gettid` and shown as `SPID` for `ps -Tf` and as `LWP` for `ps -Lf`).
You likely could call `gettid` via `ctypes` to find out the
tid of the monitor thread and one of the `ps` thread options to
determine whether the monitor thread is really the one eating all
the CPU time.
You can use `strace` (with the `-p` option) to trace
a thread. This way, you can determine which system calls the
CPU eating thread performs. Maybe, this provides sufficient hints.
... but py-spy should give the same result I believe? or I guess it's sampling so maybe not...
I have used "py-spy" only once (to analyse an apparent memory leak).
But we both agree that what "py-spy" tells us (a "sleep" using 100 % CPU)
should not be possible. Thus, either "py-spy" is mistaken
or something really strange happens.
This is what two of the waitress instances look like from the os level.
...
This looks like `top` output. You might be able to add a "tid" output column.
strace gives us a lot of
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997})
This looks like a system call from the IO thread
(waiting on 2 input and one output channel with a 1 s timeout
returning because output is possible).
You can add a time column to the `strace` command to better understand
what goes on.
You would want to `strace` the thread eating all the CPU time,
not necessarily the IO thread.
|
Dylan Jay wrote at 2023-9-4 20:14 -0700:
@d-maurer if I look at one of the normal instances I get something similar but much slower output.
```
strace: Process 3051017 attached
select(17, [13 16], [], [13 16], {tv_sec=0, tv_usec=187813}) = 0 (Timeout)
Those `select`s are terminated by a timeout, not by possible output.
In your previous `strace` output, the `select` reported possible
output but was not followed by a corresponding `write/send` operation.
If the `select` is called again, it will see the same situation
and immediately return -- resulting in 100 % CPU usage.
Your description above ("much slower") indicates that the `select`s
follow rapidly in the 100 % CPU usage case.
You can add a time column to the `strace` output and quantify this.
I think you have found the 100 % CPU usage cause:
`select` reports "output possible" but the output is not performed.
You now must identify which thread/component is responsible
(I suspect the IO thread).
I think the monitor thread not responsible: while the `sleep`
might be implemented via `select`, it is unlikely that it wants to
perform output.
|
@d-maurer thank you. now I have something to go on. btw https://unixhealthcheck.com/blog?id=465 says how to get the TID and strace -fp TID seems to work. just got to wait for it to trip into 100% cpu mode to try it. but not sure how match the TID back to the py-spy output to see what python code is responsible. still not sure how if py-spy says all the other threads are idle we could having this select loop. Or perhaps its still in the monitor thread but the sleep is switching back to the asyc stuff and that would be doing the select stuff. |
Dylan Jay wrote at 2023-9-5 00:13 -0700:
... still not sure how if py-spy says all the other threads are idle we could having this select loop.
I am already convinced that you can forget "py-spy" for this case:
its sampling rate is not high enough and therefore, its output is not
reliable.
Or perhaps its still in the monitor thread but the sleep is switching back to the asyc stuff and that would be doing the select stuff.
Why should `sleep` do that?
It would mean that `waitress` has replaced the standard `sleep`
by something very strange (the "async world" has its own `sleep`,
there is no real need to replace the standard `sleep`).
Even if `sleep` is connected to the "async stuff", if the `select`
reports possible IO, the IO must be handled before the next identical `select`
-- otherwise, we get an active infinite loop.
|
Dylan Jay wrote at 2023-9-4 20:07 -0700:
...
strace gives us a lot of
...
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728355736}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718362989}) = 0
I have looked at the `sleep` implementation (in Python 3.10
--> "Modules/timemodule.c:pysleep").
It implements `sleep` via
err = select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &timeout);
This means: the `select` we see above does not come from `time.sleep`.
|
@d-maurer I got an strace with just the threadid that is high cpu and relative timestamps but I don't think it tells us much
I couldn't get py-spy to tell us the threadid due to docker issues so that way didn't tell us python code might be causing this. I did find another sampling profiler I could get to work but it gives a completely different answer.
but this seems equally unlikely since a wait should also be no cpu. One suspicious thing might be the sentry-sdk code as I think we might not have got this issue before we installed sentry... but I'd have to go back and check. So I still wonder if somehow something happing in c code is running while these threads have released the GIL and these sampling profilers somehow associate that activity with the waiting thread. I don't know how possible that is, or another way to get a more accurate picture of whats python code is the cause. . |
I wonder if getting a stack trace using gdb would give any clues about what's happening outside of the Python stack. https://wiki.python.org/moin/DebuggingWithGdb |
Dylan Jay wrote at 2023-9-10 21:28 -0700:
@d-maurer I got an strace with just the threadid that is high cpu and relative timestamps but I don't think it tells us much
```
[pid 3339237] 0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706467631}) = 0
[pid 3339237] 0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345697557}) = 0
[pid 3339237] 0.000076 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237] 0.000095 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706725008}) = 0
[pid 3339237] 0.000085 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345955002}) = 0
[pid 3339237] 0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
It tells us that the problem is in an IO thread:
`select` reports ready output
but not output is tried (no "write/send") before the next "select".
The IO thread is likely controlled by `waitress` (`ZEO`, too, has
IO threads but they usually talk only to a single output channel,
the `select` above shows 4).
Potentially, an exception has intervened preventing the output
operation. I would try to learn how `waitress` reports
loop callback exceptions and how to ensure that you would see such reports.
...
One suspicious thing might be the sentry-sdk code as I think we might not have got this issue before we installed sentry... but I'd have to go back and check.
You are looking for an IO thread which talks over several output
channels. Not sure whether `sentry-sdk` uses such a thread.
If something knows running in an "async" environment, then
the IO thread may not itself be responsible:
the "something" might have put a buggy output callback into
the loop. But the IO threads exception reporting should
nevertheless give hints.
...
So I still wonder if somehow something happing in c code is running while these threads have released the GIL and these sampling profilers somehow associate that activity with the waiting thread.
This might be possible. But, it should be quite rare that
C code is interested in IO on 2 input and 4 output channels.
I would expect typical C code (e.g. for database connections)
to handle a single connection.
Potentially interesting: the input "fileno"s referenced in the `select`
are small (likely opened very early) while the output "fileno"s
are quite large (likely opened significantly later).
Another potentially interesting analysis approach:
if you have set up your system appropriately, you
can attach a running process with `gdb` and analyze its state
at the C level interactively.
Once the 100 % situation occurs,
I would attach `gdb` to the Zope process,
put a breakpoint on "select" and analyze the (C level) traceback
of the thread stopped by the breakpoint.
At least, this should show whether or not the problem comes from C code.
There is a `.gdbinit` (somewhere) which defines `gdb` commands
allowing to analyze the Python state (e.g. the callback structure)
from C level (in the case that the `select` comes from Python).
I see aother approach:
You can guess which IO thread might be responsible (my guess is the
one controlled by `waitress`) and verify the guess by introducing
diagnostic output into this thread (e.g. logging the `gettid` value
(called via `ctype`)). You can then compare this "TID" to the "PID" of
the 100 % CPU thread.
My guess above comes from the "fileno"s referenced in the "select".
Using the "tid"s might provide another guess approach:
the "tid" sequence reveals the order in which the threads have been
created.
The "main thread" will always have the "process id" as "tid" (the
primary info shown by e.g. "ps").
I do not know how Linux assigns the "tid"s but I suspect
it will use the first "available" number (tasks and processes share the
name domain: small integers). Thus, on a system started recently,
a task started earlier will likely get a smaller "tid".
This will allow you to guess the order in which the tasks have been
started. And this order will be quite stable (at least for the threads
started while Zope is starting up).
If you know the start index of the problematic thread, you
can analyse it outside your production environment (or `strace` it
in your production enviroment before it enters the 100 % CPU usage
to lean what it does under "normal" conditions).
|
@d-maurer @d-maurer I was using austin wrong. You can move through the threads and the one with the internal threadid that corresponded with strace was this thread
EDIT: note this thread wasn't stuck. it was switching between reading and polling etc. if you match it to py-spy its the one that said "Thread 0x7F39883D4740 (active+gil): "MainThread". You can see the internal thread id here (by running top -H inside the running container)
|
@d-maurer @davisagli not sure why but I couldn't get gdb to load the python symbols
instead I tried pdb-attach but it seemed to crash the process so now I have to wait some days before it go back to switch into this 100% mode again :(
|
Dylan Jay wrote at 2023-9-10 22:41 -0700:
@d-maurer @d-maurer I was using austin wrong. You can move through the threads and the one with the internal threadid that corresponded with strace was this thread
...
00" 5'05" 0.0% 100.0% run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)
03" 5'05" 0.9% 99.9% loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)
17" 36" 5.7% 11.7% poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:159)
02" 02" 0.7% 0.7% readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)
Good!
This shows: the main thread (i.e. `waitress`) is suffering the active loop.
We now must find out why.
The `strace` shows us: `select` reports possible `write` but
no write is performed before the next `select`.
`waitress.wasyncore.write` is responsible for the `write` operation:
```python
def write(obj):
try:
obj.handle_write_event()
except _reraised_exceptions:
raise
except:
obj.handle_error()
```
I assume that the `handle_write_event` was prevented by an exception
from performing the `write` and a bug in `handle_error` caused
a cleanup failure (removing the object from `socket_map`).
I suggest to change the code above to get diagnostic information
about any exception in `handle_write_event` (which exception,
object class, object).
|
@d-maurer a little more info. This shows time spent on calls over a period of time, so it's not just static in time.
Before I go changing the waitress code in prod (since I don't know the cause so can't reproduce this), I'll look at the code and logs too see if anything could be causing what you describe. |
Dylan Jay wrote at 2023-9-11 01:10 -0700:
...
Before I go changing the waitress code in prod (since I don't know the cause so can't reproduce this), I'll look at the code and logs too see if anything could be causing what you describe.
One idea: potentially, you are seeing a DOS attack where the attacker
closes the socket in a non standard way (e.g. only partially)
and `waitress` gets in an inconsistent state in that case.
We do not see a `write/send`, i.e. the inconsistency would get effective
earlier (preventing the OS level output operation).
When you change the code, try also to get at the traceback.
|
@d-maurer looks like you are right https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98 is visited a lot which means it's in a loop where its trying to write but can't because the socket internally marked as not in connected state. def handle_write(self):
# Precondition: there's data in the out buffer to be sent, or
# there's a pending will_close request
if not self.connected:
# we dont want to close the channel twice
return EDIT: seems it could be due to an empty read here https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L176. I can't see how it gets out of the loop and actually closes the connection if it hits that line. |
I'm 90% sure thats the cause so I've created issue at Pylons/waitress#418 but not sure on the best solution. |
Dylan Jay wrote at 2023-9-11 03:50 -0700:
I'm 90% sure thats the cause so I've created issue at Pylons/waitress#418 but not sure on the best solution.
The error is where `self.connected` was set tu `False`.
There, it should have been ensured that the corresponding "fileno"
is removed von `socket_map` and that it will not be put there again
(as long as `self.connected` remains `False`).
Something exceptional must have brought `waitress` into this state
(otherwise, we would have lots of 100 % CPU usage reports).
I assume that some bad client has used the system call `shutdown`
to close only part of the socket connection and that `waitress`does not
anticipate something like that.
|
We are seeing instances randomly go to 100%
With a bit of py-spy we managed to track it down to
haufe.requestmonitoring/haufe/requestmonitoring/monitor.py
Line 80 in 3e1af58
Most likely related we are also seeing this exception which is likely to be related.
But I can't yet work out why the
finally
wouldn't work in releasing the the lock. or perhaps exception is symptom rather than a causeThe text was updated successfully, but these errors were encountered: