Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

using 100% trying to acquire lock #15

Open
djay opened this issue Aug 22, 2023 · 37 comments · May be fixed by #16
Open

using 100% trying to acquire lock #15

djay opened this issue Aug 22, 2023 · 37 comments · May be fixed by #16

Comments

@djay
Copy link
Member

djay commented Aug 22, 2023

We are seeing instances randomly go to 100%

With a bit of py-spy we managed to track it down to

Most likely related we are also seeing this exception which is likely to be related.

ValueError: too many values to unpack (expected 2)
  File "ZPublisher/WSGIPublisher.py", line 165, in transaction_pubevents
    notify(pubevents.PubStart(request))
  File "/app/lib/python3.9/site-packages/zope/event/__init__.py", line 32, in notify
    subscriber(event)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/event.py", line 27, in dispatch
    component_subscribers(event, None)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/_api.py", line 134, in subscribers
    return sitemanager.subscribers(objects, interface)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/registry.py", line 448, in subscribers
    return self.adapters.subscribers(objects, provided)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/adapter.py", line 899, in subscribers
    subscription(*objects)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/events.py", line 32, in subscriber_mark_as_api_request
    mark_as_api_request(
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/events.py", line 18, in mark_as_api_request
    service_id = lookup_service_id(method, accept)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/negotiation.py", line 26, in lookup_service_id
    media_types = parse_accept_header(accept)
  File "app/eggs/plone.rest-2.0.0a5-py3.9.egg/plone/rest/negotiation.py", line 16, in parse_accept_header
    type_, subtype = media_type.split("/")
KeyError: 3738
  File "waitress/channel.py", line 428, in service
    task.service()
  File "waitress/task.py", line 168, in service
    self.execute()
  File "waitress/task.py", line 434, in execute
    app_iter = self.channel.server.application(environ, start_response)
  File "ZPublisher/httpexceptions.py", line 30, in __call__
    return self.application(environ, start_response)
  File "ZPublisher/WSGIPublisher.py", line 375, in publish_module
    with transaction_pubevents(request, response):
  File "contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "ZPublisher/WSGIPublisher.py", line 221, in transaction_pubevents
    notify(pubevents.PubFailure(request, exc_info, retry))
  File "/app/lib/python3.9/site-packages/zope/event/__init__.py", line 32, in notify
    subscriber(event)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/event.py", line 27, in dispatch
    component_subscribers(event, None)
  File "/app/eggs/zope.component-5.0.1-py3.9.egg/zope/component/_api.py", line 134, in subscribers
    return sitemanager.subscribers(objects, interface)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/registry.py", line 448, in subscribers
    return self.adapters.subscribers(objects, provided)
  File "/app/eggs/zope.interface-5.4.0-py3.9-linux-x86_64.egg/zope/interface/adapter.py", line 899, in subscribers
    subscription(*objects)
  File "haufe/requestmonitoring/monitor.py", line 182, in handle_request_end
    account_request(event.request, True)
  File "haufe/requestmonitoring/monitor.py", line 63, in account_request
    del _state[id]

But I can't yet work out why the finally wouldn't work in releasing the the lock. or perhaps exception is symptom rather than a cause

@djay
Copy link
Member Author

djay commented Aug 22, 2023

my reading of finally without an except is that it will never raise the exception because it gets thrown away. Thats obviously not the case. Either way I suspect changing the code to

    try:
        if end:
            del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()

Might solve the lock issue. But still to understand how the state gets a key error

@djay
Copy link
Member Author

djay commented Aug 22, 2023

Ah ok. the cause of the keyerror is because the original exception is happening in another subscriber called first. But end request handler is still called regardless. So the fix code could be

    try:
        if end:
            if id in _state:
                del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()

@djay
Copy link
Member Author

djay commented Aug 23, 2023

I've changed my mind. Everything says that python guarantees finally is called. so I can't think of any way the del _state[id] exception can be the cause of the lock not being released. Can you @zopyx ?
So the only other explanation is the _state.copy() had an exception so I've wrapped that. The problem is I can find no evidence for this in our logs. e.g. there are no errors "monitor thread died with exception".
Will see if this PR fixes things or not.

@zopyx
Copy link
Contributor

zopyx commented Aug 23, 2023

Sorry, no idea...ask Dieter

@d-maurer
Copy link

...
except:
_lock.release()
raise
finally:
_lock.release()

The release in the except block is wrong: the finally will try again to release the lock and fail (because it is not in an acquired state).

@d-maurer
Copy link

In principle, the lock management should be correct: the lock is acquired (and released in a finally block) in account_request (called by worker threads) and in _Monitor.run (called by the monitor thread). It protects simple dict operations (shallow copy, adding an item, removing an item). The only potentially critical operation is the item deletion; it may call arbitrary destructor code, potentially misbehaving one (causing a long (potentially infinite) delay before the release).

A note in ZPublisher.interfaces.IPubFailure reveals a potential cause for the KeyError:

    Note: If a subscriber to 'IPubSuccess' raises an exception,
    then 'IPubFailure' may be notified in addtion to 'IPubSuccess'.

i.e. it is possible that there are 2 IPubEnd events for the same request. This may cause the observed KeyError.

@djay
Copy link
Member Author

djay commented Aug 23, 2023

@d-maurer the keyerror is not the problem. I've fixed that in the referenced pr.

The problem is that I'm getting 100% CPU from a thread stuck in line 80 trying to acquire a lock and I can't see how your code is wrong to cause a lock to not be released that results in this kind of deadlock.

@jsbueno
Copy link
Member

jsbueno commented Aug 23, 2023

my reading of finally without an except is that it will never raise the exception because it gets thrown away. Thats obviously not the case. Either way I suspect changing the code to

    try:
        if end:
            del _state[id]
        else:
            _state[id] = Request(id, info, request, ticket.time, get_ident())
    except:
       _lock.release()
       raise
    finally:
        _lock.release()

Might solve the lock issue. But still to understand how the state gets a key error
Hi - sorry for pinging in without all the context - it looks like you gave up this approach, so ok.

I am just writing this to confirm that code like this would be essentially wrong, and .release() would be called twice in a row when any exception occurred. The finally clause will always run.

@d-maurer
Copy link

d-maurer commented Aug 23, 2023 via email

@djay
Copy link
Member Author

djay commented Aug 24, 2023

@d-maurer I'm not sure I understand how a destructor might hold a lock?

My reasoning is the only thing that could cause the lock to not be released is an exception in

Since that has no try: finally:. but no evidence this happened.

Also that means the loop would have been exited and there is no way for it to get stuck on line 80...

By destructors during item deletion do you mean that you think this could happen in when the state is deleted?

however finally should handle the release of the lock so would not cause the deadlock.

Anywhere else a destructor would not result in a lock not being released so wouldnt be a problem.

So I'm still stumped.

@djay
Copy link
Member Author

djay commented Aug 24, 2023

@d-maurer actually I just did a test. lock.acquire doesn't use hardly any cpu so it can't be the cause. it might be just that when I did py-spy dump its what I keep seeing as the line it's stuck on because it's the slowest.

Instead it could be just running that loop fast with no sleep. but the only way that could be is if somehow config.period get reset to 0 halfway through the life of the zope process???

https://github.com/collective/haufe.requestmonitoring/blob/3e1af58eeeb014a3f94aa3fc37893f68cdbf4dd2/haufe/requestmonitoring/monitor.py#L79C1-L79C1

I just can't see how that could happen.

@d-maurer
Copy link

d-maurer commented Aug 24, 2023 via email

@djay
Copy link
Member Author

djay commented Aug 31, 2023

@d-maurer none of this explains what I'm seeing so still stumped.

No matter how many times I run py-spy on all of the running instances that switch to 100% cpu I get the same result. Stuck on the lock.acquire on line 80. No other active threads most of the time. and acquire should use no cpu if it's blocked and it has no timeout so it's not running in a tight loop.

Thread 0x7EFF19C52740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:323)
Thread 0x7EFF088E9700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:80)
100.00% 100.00%   13.00s    13.00s   run (haufe/requestmonitoring/monitor.py)
 40.00%  40.00%    5.16s     5.16s   readable (waitress/server.py)

So none of it makes sense.

But the PR at least fixes the problem of masking of other errors that happen in other subscribers. Once I get it into production we will see if helps with the 100% cpu issue or not, despite that in theory it should not.

@d-maurer
Copy link

d-maurer commented Aug 31, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 2, 2023

@d-maurer yes of course I wrapped the copy.

The 100% cpu goes on forever if not restarted. Interesting possible cause but don't think that can be it.

@djay
Copy link
Member Author

djay commented Sep 4, 2023

@d-maurer after deployment the 100% cpu problem still happens.

Collecting samples from '/app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini' (python v3.9.5)amples 10391
Total Samples 10491
GIL: 42.00%, Active: 142.00%, Threads: 5

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                 
100.00% 100.00%   104.9s    104.9s   run (haufe/requestmonitoring/monitor.py:78)
 42.00%  42.00%   39.64s    39.64s   readable (waitress/server.py:266)
  0.00%   0.00%   0.610s    0.610s   readable (waitress/server.py:274)
  0.00%  42.00%   0.400s    42.92s   loop (waitress/wasyncore.py:245)
  0.00%   0.00%   0.350s    0.350s   writable (waitress/channel.py:89)
  0.00%   0.00%   0.330s    0.460s   handle_write_event (waitress/wasyncore.py:517)
  0.00%   0.00%   0.320s     1.10s   write (waitress/wasyncore.py:113)
  0.00%   0.00%   0.170s    0.170s   execute (relstorage/adapters/sql/query.py:221)
  0.00%   0.00%   0.160s    0.160s   readable (waitress/channel.py:155)
  0.00%   0.00%   0.130s    0.130s   handle_write (waitress/channel.py:95)
  0.00%   0.00%   0.120s    0.120s   handle_write_event (waitress/wasyncore.py:514)
  0.00%   0.00%   0.110s    0.110s   rea

Thread 0x7F39883D4740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:321)
Thread 0x7F39770AB700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:78)
Thread 0x7F39768AA700 (idle): "waitress-0"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39760A9700 (idle): "waitress-1"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39758A8700 (idle): "raven-sentry.BackgroundWorker"
    wait (threading.py:312)
    get (sentry_sdk/_queue.py:240)
    _target (sentry_sdk/worker.py:127)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)

except now the line it's hanging on is sleep(self.period).
maybe it was before and I was just looking at the wrong version of the code

            while 1:
                sleep(self.period)
                with _lock:
                    pending = _state.copy()
                if not pending:
                    continue
                monitorTime = time()

but this is just as mysterious since I sleep shouldn't be using any cpu. Unless somehow py-spy is lumping GIL/async stuff into the sleep call and something else is causing the cpu usuage?

or perhaps the other explanation is it's a super tight loop and period is very small but still big enough that the thread always seems to end up on that line.

@d-maurer
Copy link

d-maurer commented Sep 4, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 4, 2023

@d-maurer I included the thread dump but in an edit so you wouldn't have seen it in the notification.


Thread 0x7F39883D4740 (active+gil): "MainThread"
    readable (waitress/server.py:266)
    poll (waitress/wasyncore.py:158)
    loop (waitress/wasyncore.py:245)
    run (waitress/server.py:322)
    serve (waitress/__init__.py:19)
    serve_paste (plone/recipe/zope2instance/ctl.py:917)
    serve (plone/recipe/zope2instance/ctl.py:942)
    serve (Zope2/Startup/serve.py:203)
    run (Zope2/Startup/serve.py:217)
    main (Zope2/Startup/serve.py:251)
    <module> (Zope2/Startup/serve.py:255)
    <module> (interpreter:321)
Thread 0x7F39770AB700 (active): "Dummy-1"
    run (haufe/requestmonitoring/monitor.py:78)
Thread 0x7F39768AA700 (idle): "waitress-0"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39760A9700 (idle): "waitress-1"
    wait (threading.py:312)
    handler_thread (waitress/task.py:72)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)
Thread 0x7F39758A8700 (idle): "raven-sentry.BackgroundWorker"
    wait (threading.py:312)
    get (sentry_sdk/_queue.py:240)
    _target (sentry_sdk/worker.py:127)
    run (threading.py:892)
    run (sentry_sdk/integrations/threading.py:70)
    _bootstrap_inner (threading.py:954)
    _bootstrap (threading.py:912)

I think thats what you mean. if not, not sure what kind of process dump you mean. top just says python.

I did look at the sentry thread to see if it could be related. it's idle on this line - ttps://github.com/python/cpython/blob/3.9/Lib/threading.py#L312

The platform is linux running inside docker.

It is natural that snapshots of Monitor.run will typically
see it in the sleep call (where is is almost all the time).

True, but I wouldn't except to see it at 100% CPU in top and 100% in py-spy...

@d-maurer
Copy link

d-maurer commented Sep 4, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 5, 2023

@d-maurer I understand processes (or think I did) but single process is all the threads. hense why I used py-spy.
Zope used to dump the threads when you send it sigusr1 but that is zserver not waitress and waitress doesn't have the same feature. but py-spy should give the same result I believe? or I guess it's sampling so maybe not...

This is what two of the waitress instances look like from the os level.

 167393 ubuntu    20   0  940548 375200   8928 R 100.3   2.3   1334:40 python3.9                                                                                             
2274489 ubuntu    20   0  739124 375896  10416 R 100.0   2.3   1149:26 python3.9  

strace gives us a lot of

select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999997})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728355736}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718362989}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728501947}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718486917}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728620701}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718605166}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728737692}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718721432}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728853086}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718837226}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=728969206}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=718953381}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=729085570}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=719069606}) = 0
select(115, [13 16], [114], [13 16 114], {tv_sec=1, tv_usec=0}) = 1 (out [114], left {tv_sec=0, tv_usec=999998})
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883161, tv_nsec=729201195}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392565, tv_nsec=719206793}) = 0

Does that tell is anything?

@djay
Copy link
Member Author

djay commented Sep 5, 2023

@d-maurer if I look at one of the normal instances I get something similar but much slower output.

strace: Process 3051017 attached
select(17, [13 16], [], [13 16], {tv_sec=0, tv_usec=187813}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883492, tv_nsec=923946154}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392896, tv_nsec=914070438}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883493, tv_nsec=925595671}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392897, tv_nsec=915645056}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883494, tv_nsec=926326532}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392898, tv_nsec=916442025}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 1 (in [13], left {tv_sec=0, tv_usec=540079})
accept4(13, {sa_family=AF_INET, sin_port=htons(39856), sin_addr=inet_addr("10.42.105.85")}, [16], SOCK_CLOEXEC) = 8
getsockname(8, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("10.42.113.157")}, [128->16]) = 0
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=387151742}) = 0
getsockopt(8, SOL_SOCKET, SO_SNDBUF, [46080], [4]) = 0
ioctl(8, FIONBIO, [1])                  = 0
getpeername(8, {sa_family=AF_INET, sin_port=htons(39856), sin_addr=inet_addr("10.42.105.85")}, [16]) = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=387702332}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=377760856}) = 0
select(17, [8 13 16], [], [8 13 16], {tv_sec=1, tv_usec=0}) = 1 (in [8], left {tv_sec=0, tv_usec=999997})
recvfrom(8, "GET /++api++/@subfooter HTTP/1.1"..., 8192, 0, NULL, NULL) = 601
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=388329229}) = 0
futex(0x7f5c601b11a0, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=388931506}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=378963078}) = 0
futex(0x9a5dec, FUTEX_WAKE_PRIVATE, 1)  = 1
futex(0x9a5df0, FUTEX_WAKE_PRIVATE, 1)  = 1
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 1 (in [16], left {tv_sec=0, tv_usec=730065})
read(16, "x", 8192)                     = 1
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=660095688}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=650290069}) = 0
select(17, [13 16], [8], [8 13 16], {tv_sec=1, tv_usec=0}) = 1 (out [8], left {tv_sec=0, tv_usec=999996})
close(8)                                = 0
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883495, tv_nsec=660811840}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392899, tv_nsec=650884792}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883496, tv_nsec=662340141}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392900, tv_nsec=652385383}) = 0
select(17, [13 16], [], [13 16], {tv_sec=1, tv_usec=0}) = 0 (Timeout)
clock_gettime(CLOCK_REALTIME, {tv_sec=1693883497, tv_nsec=663832251}) = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=392901, tv_nsec=653892531}) = 0

so perhaps it does look like somehow that sleep delay is somehow getting reset down to some really small number like 0 and then we get a tight loop?

@d-maurer
Copy link

d-maurer commented Sep 5, 2023 via email

@d-maurer
Copy link

d-maurer commented Sep 5, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 5, 2023

@d-maurer thank you. now I have something to go on.

btw https://unixhealthcheck.com/blog?id=465 says how to get the TID and strace -fp TID seems to work. just got to wait for it to trip into 100% cpu mode to try it. but not sure how match the TID back to the py-spy output to see what python code is responsible. still not sure how if py-spy says all the other threads are idle we could having this select loop. Or perhaps its still in the monitor thread but the sleep is switching back to the asyc stuff and that would be doing the select stuff.

@d-maurer
Copy link

d-maurer commented Sep 5, 2023 via email

@d-maurer
Copy link

d-maurer commented Sep 6, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 11, 2023

@d-maurer I got an strace with just the threadid that is high cpu and relative timestamps but I don't think it tells us much

[pid 3339237]      0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706467631}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345697557}) = 0
[pid 3339237]      0.000076 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000095 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706725008}) = 0
[pid 3339237]      0.000085 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=345955002}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000093 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=706980762}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346210463}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000093 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=707235936}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346465653}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})
[pid 3339237]      0.000094 clock_gettime(CLOCK_REALTIME, {tv_sec=1694343800, tv_nsec=707492042}) = 0
[pid 3339237]      0.000084 clock_gettime(CLOCK_MONOTONIC, {tv_sec=854466, tv_nsec=346721763}) = 0
[pid 3339237]      0.000077 select(145, [13 16], [137 142 143 144], [13 16 137 142 143 144], {tv_sec=1, tv_usec=0}) = 4 (out [137 142 143 144], left {tv_sec=0, tv_usec=999997})

I couldn't get py-spy to tell us the threadid due to docker issues so that way didn't tell us python code might be causing this.

I did find another sampling profiler I could get to work but it gives a completely different answer.


  Austin  TUI   Wall Time Profile                                                                                             CPU  99% ▇▆▇▇▇▇█▇   MEM 360M ████████    1/7
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3339237     PID:TID 3339237:31          
               Samples 62252    ⏲️   12"         Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"     12"      0.0%   99.8%  └─ _bootstrap (/usr/lib/python3.9/threading.py:912)                                                                                         │
  00"     12"      0.0%   99.8%     └─ _bootstrap_inner (/usr/lib/python3.9/threading.py:954)                                                                                │
  00"     12"      0.0%   99.8%        └─ run (/app/lib/python3.9/site-packages/sentry_sdk/integrations/threading.py:70)                                                     │
  00"     12"      0.0%   99.8%           └─ run (/app/eggs/plone.cachepurging-3.0.0a1-py3.9.egg/plone/cachepurging/purger.py:170)                                           │
  00"     12"      0.0%   99.8%              └─ get (/usr/lib/python3.9/queue.py:171)                                                                                        │
  12"     12"     99.8%   99.8%                 └─ wait (/usr/lib/python3.9/threading.py:312)                                                                                │
                                                                                                                                                                             │

but this seems equally unlikely since a wait should also be no cpu.
It is another worker thread I wasn't aware of in
https://github.com/plone/plone.cachepurging/blob/3.0.0a1/plone/cachepurging/purger.py#L170.

One suspicious thing might be the sentry-sdk code as I think we might not have got this issue before we installed sentry... but I'd have to go back and check.
But looking at the code I don't see what it could be doing to cause this.

https://github.com/getsentry/sentry-python/blob/a6e1cbe1e8daf0ff5532ad4dd7eea17b086bfe9d/sentry_sdk/integrations/threading.py#L70

So I still wonder if somehow something happing in c code is running while these threads have released the GIL and these sampling profilers somehow associate that activity with the waiting thread. I don't know how possible that is, or another way to get a more accurate picture of whats python code is the cause.
Previously we looked at the logs to see if there was any correlation between the requests before the cpu went up and couldn't see anything that stands out...

.

@davisagli
Copy link
Member

I wonder if getting a stack trace using gdb would give any clues about what's happening outside of the Python stack. https://wiki.python.org/moin/DebuggingWithGdb

@d-maurer
Copy link

d-maurer commented Sep 11, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 11, 2023

@d-maurer @d-maurer I was using austin wrong. You can move through the threads and the one with the internal threadid that corresponded with strace was this thread

  Austin  TUI   Wall Time Profile                                                                                             CPU  98% ▇█▇▇▇▇█▇   MEM 366M ████████    7/7
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3339237     PID:TID 3339237:12          
               Samples 2374164  ⏲️   5'05"       Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"    5'05"     0.0%  100.0%  <module> (/app/parts/instance/bin/interpreter:321)                                                                                          │
  00"    5'05"     0.0%  100.0%  <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)                                                                          │
  00"    5'05"     0.0%  100.0%  main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)                                                                              │
  00"    5'05"     0.0%  100.0%  run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)                                                                               │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)                                                                             │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)                                         │
  00"    5'05"     0.0%  100.0%  serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)                                   │
  00"    5'05"     0.0%  100.0%  serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)                                                                          │
  00"    5'05"     0.0%  100.0%  run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)                                                                             │
  03"    5'05"     0.9%   99.9%  loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)                                                                         │
  17"     36"      5.7%   11.7%  poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:159)                                                                         │
  02"     02"      0.7%    0.7%  readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)                                                                        │                                                                                                                            

EDIT: note this thread wasn't stuck. it was switching between reading and polling etc. if you match it to py-spy its the one that said "Thread 0x7F39883D4740 (active+gil): "MainThread".

You can see the internal thread id here (by running top -H inside the running container)

top - 15:39:32 up 10 days, 15:57,  0 users,  load average: 1.22, 1.69, 1.88
Threads:  10 total,   2 running,   8 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.5 us,  3.8 sy,  0.0 ni, 71.5 id,  0.0 wa,  0.0 hi,  1.0 si,  0.2 st
MiB Mem :  15986.9 total,    415.6 free,  13702.8 used,   1868.5 buff/cache
MiB Swap:  16384.0 total,  15806.7 free,    577.2 used.   1949.4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                               
     12 plone     20   0 1127736 346516  14464 R  99.0   2.1   1785:12 python3.9                                                                                             
     27 plone     20   0 1127736 346516  14464 S   1.7   2.1  14:17.64 python3.9                                                                                             
      1 plone     20   0    2616      0      0 S   0.0   0.0   0:00.03 sh                                                                                                    
     26 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:03.91 python3.9                                                                                             
     28 plone     20   0 1127736 346516  14464 S   0.0   2.1  14:22.74 python3.9                                                                                             
     29 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:58.88 python3.9                                                                                             
     30 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:00.22 python3.9                                                                                             
     31 plone     20   0 1127736 346516  14464 S   0.0   2.1   0:00.22 python3.9                                                                                             
     32 plone     20   0    4116   3044   2668 S   0.0   0.0   0:00.02 bash                                                                                                  
     42 plone     20   0    6124   3132   2632 R   0.0   0.0   0:00.00 top                                                                                                   

@djay
Copy link
Member Author

djay commented Sep 11, 2023

@d-maurer @davisagli not sure why but I couldn't get gdb to load the python symbols

$ sudo docker run --rm --pid=host -it --cap-add=SYS_PTRACE -p 3000:3000 python:3.9.0 bash -c 'apt-get update && apt-get install gdb && gdb python  3339237'
...
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
Attaching to program: /usr/local/bin/python, process 3339237
[New LWP 3339497]
[New LWP 3339498]
[New LWP 3339499]
[New LWP 3342706]
[New LWP 611493]
[New LWP 611494]

warning: Unable to find dynamic linker breakpoint function.
GDB will be unable to debug shared library initializers
and track explicitly loaded dynamic code.

warning: Target and debugger are in different PID namespaces; thread lists and other data are likely unreliable.  Connect to gdbserver inside the container.
0x000000000059938c in ?? ()
(gdb) bt
#0  0x000000000059938c in ?? ()
#1  0x00000000022a4310 in ?? ()
#2  0x00007f80d71df050 in ?? ()
#3  0x00000000022a4358 in ?? ()
#4  0x00000000047226f8 in ?? ()
#5  0x00007f80d70b07d0 in ?? ()
#6  0x00007f80d70c6980 in ?? ()
#7  0x00007f80d70c7710 in ?? ()
#8  0x00007f80d71df07c in ?? ()
#9  0xac9d42930de1dc62 in ?? ()
#10 0x00007f80d7e74500 in ?? ()
#11 0x00007f80d71df1a8 in ?? ()
#12 0xffffffff00000000 in ?? ()
#13 0x0000000004722748 in ?? ()
#14 0x0000000000000250 in ?? ()
#15 0x0000000000000008 in ?? ()
#16 0x0000003800000023 in ?? ()
#17 0x00000000ffffffff in ?? ()
#18 0x0000000000000000 in ?? ()
(gdb) py-bt
Undefined command: "py-bt".  Try "help".

instead I tried pdb-attach but it seemed to crash the process so now I have to wait some days before it go back to switch into this 100% mode again :(

sudo docker run --rm --pid=host -it --cap-add=SYS_PTRACE -p 30000:30000 python:3.9.0 bash -c 'pip install pdb-attach && python -m pdb_attach  3339237 30000'

@d-maurer
Copy link

d-maurer commented Sep 11, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 11, 2023

@d-maurer a little more info. This shows time spent on calls over a period of time, so it's not just static in time.

 Austin  TUI   Wall Time Profile                                                                                             CPU  99% ▇█▇█▇▇▇▇   MEM 263M ████████    5/5
   _________   Command /app/bin/python3.9 /app/parts/instance/bin/interpreter /app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py /app/parts/instance/etc/wsgi.ini
   ⎝__⎠ ⎝__⎠   Python 3.9.0     PID 3351466     PID:TID 3351466:11          
               Samples 1451365  ⏲️   3'16"       Threshold 0%   
  OWN    TOTAL    %OWN   %TOTAL  FUNCTION                                                                                                                                     
  00"    3'16"     0.0%  100.0%  └─ <module> (/app/parts/instance/bin/interpreter:326)                                                                                       ▒
  00"    3'16"     0.0%  100.0%     └─ <module> (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:255)                                                                    ▒
  00"    3'16"     0.0%  100.0%        └─ main (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:251)                                                                     ▒
  00"    3'16"     0.0%  100.0%           └─ run (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:217)                                                                   │
  00"    3'16"     0.0%  100.0%              └─ serve (/app/eggs/Zope-5.6-py3.9.egg/Zope2/Startup/serve.py:203)                                                              │
  00"    3'16"     0.0%  100.0%                 └─ serve (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:942)                       │
  00"    3'16"     0.0%  100.0%                    └─ serve_paste (/app/eggs/plone.recipe.zope2instance-6.11.0-py3.9.egg/plone/recipe/zope2instance/ctl.py:917)              │
  00"    3'16"     0.0%  100.0%                       └─ serve (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/__init__.py:19)                                                  │
  00"    3'16"     0.0%  100.0%                          └─ run (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:322)                                                  │
  05"    3'16"     2.5%   99.9%                             ├─ loop (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:245)                                           │
  36"     44"     18.3%   22.4%                             │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:158)                                        │
  05"     05"      2.4%    2.4%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/server.py:290)                                    │
  01"     02"      0.4%    0.9%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:117)                                    │
  01"     01"      0.4%    0.5%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  ├─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:95)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:514)                    │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:515)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  ├─ poll (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:150)                                  │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.0%    0.0%                             │  │  │     └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.1%                             │  │  ├─ write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:113)                                    │
  00"     00"      0.1%    0.1%                             │  │  │  ├─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:517)                    │
  00"     00"      0.0%    0.0%                             │  │  │  │  └─ handle_write (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:98)                          │
  00"     00"      0.0%    0.0%                             │  │  │  └─ handle_write_event (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/wasyncore.py:509)                    │
  00"     00"      0.1%    0.1%                             │  │  ├─ readable (/app/eggs/waitress-2.1.2-py3.9.egg/waitress/channel.py:154)   

Before I go changing the waitress code in prod (since I don't know the cause so can't reproduce this), I'll look at the code and logs too see if anything could be causing what you describe.

@d-maurer
Copy link

d-maurer commented Sep 11, 2023 via email

@djay
Copy link
Member Author

djay commented Sep 11, 2023

@d-maurer looks like you are right

https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L98

is visited a lot which means it's in a loop where its trying to write but can't because the socket internally marked as not in connected state.

    def handle_write(self):
        # Precondition: there's data in the out buffer to be sent, or
        # there's a pending will_close request

        if not self.connected:
            # we dont want to close the channel twice

            return

EDIT: seems it could be due to an empty read here https://github.com/Pylons/waitress/blob/main/src/waitress/channel.py#L176. I can't see how it gets out of the loop and actually closes the connection if it hits that line.

@djay
Copy link
Member Author

djay commented Sep 11, 2023

I'm 90% sure thats the cause so I've created issue at Pylons/waitress#418 but not sure on the best solution.

@d-maurer
Copy link

d-maurer commented Sep 11, 2023 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants