Nudge function leaking FDs, leading to all kernels stopping due to "Too many open files" #1506

Open
ibdafna opened this issue Mar 21, 2025 · 2 comments

ibdafna commented Mar 21, 2025

I'm still working on a reliable repro, but I thought I'd raise this sooner rather than later. Under some conditions, rogue kernels can get Jupyter Server into a state where it exhausts all available file descriptors. The culprit seems to be the nudge function, which accumulates file descriptors without cleaning them up.

lsof -p $(pgrep -f jupyter-lab) | awk '{print $5}' | sort | uniq -c | sort -nr
   1031 a_inode
    432 IPv4
    197 REG
      4 unix
      3 CHR
      2 FIFO
      2 DIR
      1 TYPE
      1 sock
      1 IPv6
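
For anyone who wants to track the same breakdown over time without shelling out to `lsof`, here is a minimal Linux-only sketch that groups a process's descriptors by reading the `/proc/<pid>/fd` symlinks. The function names and the coarse labels are my own, and unlike `lsof` it does not split sockets into IPv4/IPv6/unix. The `a_inode` bucket covers anonymous-inode descriptors such as eventfd and epoll handles; as far as I know libzmq uses eventfd for internal signaling on Linux, so a growing `a_inode` count may track leaked ZMQ sockets, but treat that as an assumption.

```python
import os
from collections import Counter


def classify_fd_target(target: str) -> str:
    """Map a /proc/<pid>/fd symlink target to a coarse type label."""
    if target.startswith("anon_inode:"):
        return "a_inode"  # eventfd, epoll, timerfd, ...
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "FIFO"
    return "file"  # regular files, directories, character devices


def count_fd_types(pid: int) -> Counter:
    """Count open descriptors of `pid` by coarse type (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    counts: Counter = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
        counts[classify_fd_target(target)] += 1
    return counts
```

Calling `count_fd_types(pid)` periodically with the server's PID and logging the result should show which bucket is growing.
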
[
  {
    "id": "11421a39-5726-49f0-9b24-7ee6c1a02190",
    "name": "python3",
    "last_activity": "2025-03-21T20:42:09.082165Z",
    "execution_state": "idle",
    "connections": 26715
  },
  {
    "id": "a54d1764-8b84-4ea0-907a-27dc64289436",
    "name": "python310",
    "last_activity": "2025-03-21T20:42:07.464654Z",
    "execution_state": "idle",
    "connections": 26656
  },
  {
    "id": "3a53b14e-e44b-4f0f-ad44-6e5e801d8923",
    "name": "python310",
    "last_activity": "2025-03-21T20:45:46.949561Z",
    "execution_state": "idle",
    "connections": 26622
  },
  {
    "id": "982322be-1070-4612-bf59-07d5f4702ade",
    "name": "scala",
    "last_activity": "2025-03-20T23:59:47.957756Z",
    "execution_state": "idle",
    "connections": 26665
  },
  {
    "id": "8ba2719f-ec00-4ffe-a868-14ec6339e2d6",
    "name": "spark33-scala",
    "last_activity": "2025-03-20T23:59:40.736254Z",
    "execution_state": "idle",
    "connections": 29112
  }
]
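
The `connections` values in the list above are far beyond what a handful of browser tabs would produce, so they could serve as an early warning. A small sketch that flags kernels in such a list; the function name and the threshold of 100 are arbitrary choices for illustration, not anything Jupyter Server defines:

```python
from typing import Dict, List, Tuple


def kernels_over_threshold(
    kernels: List[Dict], max_connections: int = 100
) -> List[Tuple[str, int]]:
    """Return (kernel id, connections) pairs above the threshold, worst first."""
    flagged = [
        (k["id"], k["connections"])
        for k in kernels
        if k.get("connections", 0) > max_connections
    ]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)
```

Feeding it the parsed JSON above would flag all five kernels, with the `spark33-scala` kernel first.
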

The rogue kernel here is the Scala kernel: it triggers the leak, which ultimately affects the Python kernels as well.

This is the intermediate state:

2025-03-21 04:34:06.580115500  [W 2025-03-21 04:34:06.580 ServerApp] WebSocket ping timeout after 90000 ms.
2025-03-21 04:34:07.540819500  [W 2025-03-21 04:34:07.540 ServerApp] Nudge: attempt 550 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:08.866015500  [W 2025-03-21 04:34:08.865 ServerApp] Replacing stale connection: 8ba2719f-ec00-4ffe-a868-14ec6339e2d6:baf135be-1cf9-4ba7-b2d0-8164caa4a2a9
2025-03-21 04:34:08.867461500  [I 2025-03-21 04:34:08.867 ServerApp] Adapting from protocol version 5.4 (kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6) to 5.3 (client).
2025-03-21 04:34:08.869069500  [I 2025-03-21 04:34:08.869 ServerApp] Connecting to kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6.
2025-03-21 04:34:09.155150500  [W 2025-03-21 04:34:09.155 ServerApp] Nudge: attempt 370 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.047986500  [W 2025-03-21 04:34:11.047 ServerApp] Nudge: attempt 740 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:11.308202500  [W 2025-03-21 04:34:11.308 ServerApp] Nudge: attempt 190 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:12.555067500  [W 2025-03-21 04:34:12.554 ServerApp] Nudge: attempt 560 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:13.380044500  [W 2025-03-21 04:34:13.379 ServerApp] Nudge: attempt 10 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:14.166065500  [W 2025-03-21 04:34:14.165 ServerApp] Nudge: attempt 380 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.062188500  [W 2025-03-21 04:34:16.062 ServerApp] Nudge: attempt 750 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:16.321077500  [W 2025-03-21 04:34:16.320 ServerApp] Nudge: attempt 200 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:17.568264500  [W 2025-03-21 04:34:17.568 ServerApp] Nudge: attempt 570 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:18.390559500  [W 2025-03-21 04:34:18.390 ServerApp] Nudge: attempt 20 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:19.176689500  [W 2025-03-21 04:34:19.176 ServerApp] Nudge: attempt 390 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.076157500  [W 2025-03-21 04:34:21.076 ServerApp] Nudge: attempt 760 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:21.333889500  [W 2025-03-21 04:34:21.333 ServerApp] Nudge: attempt 210 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:22.582934500  [W 2025-03-21 04:34:22.582 ServerApp] Nudge: attempt 580 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:23.401344500  [W 2025-03-21 04:34:23.401 ServerApp] Nudge: attempt 30 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:24.187958500  [W 2025-03-21 04:34:24.187 ServerApp] Nudge: attempt 400 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.090278500  [W 2025-03-21 04:34:26.090 ServerApp] Nudge: attempt 770 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:26.344238500  [W 2025-03-21 04:34:26.344 ServerApp] Nudge: attempt 220 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:27.598137500  [W 2025-03-21 04:34:27.598 ServerApp] Nudge: attempt 590 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:28.411274500  [W 2025-03-21 04:34:28.411 ServerApp] Nudge: attempt 40 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
2025-03-21 04:34:29.199767500  [W 2025-03-21 04:34:29.199 ServerApp] Nudge: attempt 410 on kernel 8ba2719f-ec00-4ffe-a868-14ec6339e2d6
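
Reading the excerpt above, the attempt counters do not form one sequence. They look like several interleaved ones (550, 560, 570, ...; 370, 380, 390, ...; 740, 750, ...; 190, 200, ...; 10, 20, ...), each advancing by 10 about every five seconds. That would be consistent with several nudge loops running at the same time against the same kernel. A minimal sketch for counting such sequences in a log, assuming each loop logs every 10th attempt (inferred from this excerpt only, not checked against the Jupyter Server source):

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

NUDGE_RE = re.compile(r"Nudge: attempt (\d+) on kernel ([0-9a-f-]+)")


def count_nudge_loops(lines: Iterable[str], step: int = 10) -> Dict[str, int]:
    """Estimate how many distinct nudge loops appear per kernel id.

    A log line continues an existing loop if its attempt number is exactly
    `step` above that loop's last seen attempt; otherwise it starts a new one.
    """
    last_attempts: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        match = NUDGE_RE.search(line)
        if match is None:
            continue
        attempt, kernel_id = int(match.group(1)), match.group(2)
        loops = last_attempts[kernel_id]
        if attempt - step in loops:
            loops[loops.index(attempt - step)] = attempt
        else:
            loops.append(attempt)
    return {kernel_id: len(loops) for kernel_id, loops in last_attempts.items()}
```

If the number of loops per kernel keeps growing across reconnects, that would point at loops that are never cancelled.
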

Finally, we end up here:

2025-03-21 22:20:12.583433500  [E 2025-03-21 22:20:12.582 ServerApp] Uncaught exception GET /api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8 (2607:fb10:7011:1::c61)
2025-03-21 22:20:12.583439500      HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8', version='HTTP/1.1', remote_ip='2607:fb10:7011:1::c61')
2025-03-21 22:20:12.583439500      Traceback (most recent call last):
2025-03-21 22:20:12.583439500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.583440500          await open_result
2025-03-21 22:20:12.583440500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.583440500          await self.connection.connect()
2025-03-21 22:20:12.583441500                ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583441500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.583441500          self.create_stream()
2025-03-21 22:20:12.583442500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.583442500          self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.583442500                                            ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583443500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.583443500          socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.583443500                   ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.583444500          sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.583444500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.583445500          sock = self.context.socket(socket_type)
2025-03-21 22:20:12.583445500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583445500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.583446500          socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.583446500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.583446500          super().__init__(
2025-03-21 22:20:12.583447500        File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.583447500      zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.710088500  [W 2025-03-21 22:20:12.710 ServerApp] Replacing stale connection: 11421a39-5726-49f0-9b24-7ee6c1a02190:c7ab78fd-148f-4c1e-aafc-cbd646628767
2025-03-21 22:20:12.712058500  [I 2025-03-21 22:20:12.712 ServerApp] Connecting to kernel 11421a39-5726-49f0-9b24-7ee6c1a02190.
2025-03-21 22:20:12.716564500  [E 2025-03-21 22:20:12.716 ServerApp] Uncaught exception GET /api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767 (172.24.9.135)
2025-03-21 22:20:12.716565500      HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767', version='HTTP/1.1', remote_ip='172.24.9.135')
2025-03-21 22:20:12.716566500      Traceback (most recent call last):
2025-03-21 22:20:12.716566500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.716567500          await open_result
2025-03-21 22:20:12.716567500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.716567500          await self.connection.connect()
2025-03-21 22:20:12.716567500                ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716568500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.716568500          self.create_stream()
2025-03-21 22:20:12.716568500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.716569500          self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.716569500                                            ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716569500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.716570500          socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.716570500                   ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716570500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.716571500          sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.716571500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716571500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.716572500          sock = self.context.socket(socket_type)
2025-03-21 22:20:12.716572500                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716572500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.716572500          socket_class(  # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.716573500        File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.716573500          super().__init__(
2025-03-21 22:20:12.716573500        File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.716574500      zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.821062500  [W 2025-03-21 22:20:12.821 ServerApp] Replacing stale connection: 982322be-1070-4612-bf59-07d5f4702ade:c34b488e-ca59-42a0-a61a-e505e79d9e03
2025-03-21 22:20:12.822449500  [I 2025-03-21 22:20:12.822 ServerApp] Adapting from protocol version 5.4 (kernel 982322be-1070-4612-bf59-07d5f4702ade) to 5.3 (client).
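
`Too many open files` is the message for `EMFILE`. A sketch for checking how close a process is to its descriptor limit, using only the standard library (Unix only; the function names are mine). It measures the process it runs in, so it would have to run inside the server, for example from an extension or a debug endpoint. One caveat, offered as a possibility rather than a diagnosis: libzmq also reports `EMFILE` when a context reaches its own socket cap (`ZMQ_MAX_SOCKETS`, 1023 by default as far as I know), so the OS limit may not be the only one worth checking.

```python
import os
import resource
from typing import Optional


def open_fd_count() -> int:
    """Number of descriptors open in the current process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # The listing itself briefly holds one descriptor for the
            # directory, so the result is consistently one too high.
            return len(os.listdir(fd_dir))
    raise RuntimeError("no per-process fd directory on this platform")


def fd_headroom() -> Optional[int]:
    """Descriptors left before the soft RLIMIT_NOFILE, or None if unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - open_fd_count()
```
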

I'm still working on a repro, but I wanted to share this sooner rather than later so that anyone who knows how to reproduce it, or has seen it before, can chime in.

Thanks!

ibdafna added the bug label Mar 21, 2025
minrk (Contributor) commented Mar 22, 2025

Thanks. If it's not happening all the time, it's probably a failure to clean up after a particular error or when the kernel fails to respond. Hopefully it's not too hard to track down.
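
To illustrate the kind of bug this would be (not the actual Jupyter Server code): a probe that opens transient channels, waits for a reply, and only closes them on the success path will leak on every timeout or exception. A minimal sketch of the leak-free shape, where `probe_kernel`, `wait_for_reply`, and `open_channels` are hypothetical names and a `socket.socketpair()` stands in for the transient ZMQ channels:

```python
import asyncio
import socket
from typing import Awaitable, Callable, Iterable


async def probe_kernel(
    wait_for_reply: Callable[[], Awaitable[object]],
    open_channels: Callable[[], Iterable[socket.socket]] = socket.socketpair,
    timeout: float = 0.1,
) -> bool:
    """Return True if a reply arrives in time; always close the channels."""
    channels = open_channels()
    try:
        await asyncio.wait_for(wait_for_reply(), timeout)
        return True
    except asyncio.TimeoutError:
        return False  # unresponsive kernel: report it, but still clean up
    finally:
        # Runs on success, timeout, cancellation, and unexpected errors.
        for channel in channels:
            channel.close()
```

The point is only that cleanup lives in `finally`, so an unresponsive kernel cannot leave descriptors behind.
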

Zsailer (Member) commented Apr 2, 2025

In my experience, the nudging logic in Jupyter Server is really sensitive and often leads to memory leaks. I think the kernel APIs in Jupyter Server need a major revamp.

I did some work on this last year, with the hope of eventually replacing parts of Jupyter Server with a simpler, more robust architecture: https://github.com/Zsailer/nextgen-kernels-api

The goals of this work:

  • Tracks kernel lifecycle state and execution state server-side.
  • Uses a single kernel client (and thus a single set of ZMQ channels) to communicate with the kernel. No need to open ZMQ sockets outside of this client.
  • Uses a completely native asyncio approach to poll messages from the kernel, dropping the tornado IOLoop and ZMQStream logic.
  • Simplifies the websocket connection logic:
    • Removes all nudging logic from the websocket handler, since the kernel manager owns this now.
    • The websocket handler registers itself as a listener on the kernel client.
    • The websocket can connect even if the kernel is busy. I think this eliminates the need for "pending".
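
The listener idea in the list above can be sketched roughly as follows. This is not the nextgen-kernels-api code: `SharedKernelClient`, `add_listener`, `remove_listener`, and the `asyncio.Queue` standing in for the kernel's ZMQ channels are all hypothetical, chosen only to show the shape of "one client per kernel, many websocket listeners".

```python
import asyncio
from typing import Callable, Optional, Set

Listener = Callable[[dict], None]


class SharedKernelClient:
    """Hypothetical: one message source per kernel, fanned out to listeners."""

    def __init__(self, incoming: "asyncio.Queue[Optional[dict]]") -> None:
        self._incoming = incoming  # stand-in for the single set of channels
        self._listeners: Set[Listener] = set()

    def add_listener(self, listener: Listener) -> None:
        self._listeners.add(listener)

    def remove_listener(self, listener: Listener) -> None:
        self._listeners.discard(listener)

    async def poll(self) -> None:
        """Forward each message to every listener; None stops the loop."""
        while True:
            msg = await self._incoming.get()
            try:
                if msg is None:
                    return
                for listener in list(self._listeners):
                    listener(msg)
            finally:
                self._incoming.task_done()


async def demo():
    queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()
    client = SharedKernelClient(queue)
    tab_a, tab_b = [], []
    on_a, on_b = tab_a.append, tab_b.append
    client.add_listener(on_a)
    client.add_listener(on_b)
    poller = asyncio.create_task(client.poll())
    await queue.put({"msg_type": "status", "state": "busy"})
    await queue.join()  # wait until the first message was dispatched
    client.remove_listener(on_b)  # a websocket disconnects: no sockets to close
    await queue.put({"msg_type": "status", "state": "idle"})
    await queue.put(None)
    await poller
    return tab_a, tab_b
```

In this shape a connecting or disconnecting websocket only adds or removes a callback, so there are no per-connection sockets that could be left open.
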

Unfortunately, it's been difficult to find people outside our own team to review and validate this approach, so these things haven't really made it back into Jupyter Server (yet). This approach would eliminate the leaks you're seeing.
