I'm still working on a reliable repro, but I thought I'd raise this sooner rather than later. Under some conditions, rogue kernels can get Jupyter Server into a state where it exhausts all available file descriptors. The culprit seems to be the nudge function, which accumulates file descriptors without ever cleaning them up. In our case the rogue kernel is the Scala kernel, which kicks off the process and ultimately affects the Python kernels as well. This is where we finally end up:
2025-03-21 22:20:12.583433500 [E 2025-03-21 22:20:12.582 ServerApp] Uncaught exception GET /api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8 (2607:fb10:7011:1::c61)
2025-03-21 22:20:12.583439500 HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/a54d1764-8b84-4ea0-907a-27dc64289436/channels?session_id=61927800-9c95-416b-b48f-b339ed5030c8', version='HTTP/1.1', remote_ip='2607:fb10:7011:1::c61')
2025-03-21 22:20:12.583439500 Traceback (most recent call last):
2025-03-21 22:20:12.583439500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.583440500 await open_result
2025-03-21 22:20:12.583440500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.583440500 await self.connection.connect()
2025-03-21 22:20:12.583441500 ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583441500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.583441500 self.create_stream()
2025-03-21 22:20:12.583442500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.583442500 self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.583442500 ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583443500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.583443500 socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.583443500 ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.583444500 sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.583444500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583444500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.583445500 sock = self.context.socket(socket_type)
2025-03-21 22:20:12.583445500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.583445500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.583446500 socket_class( # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.583446500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.583446500 super().__init__(
2025-03-21 22:20:12.583447500 File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.583447500 zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.710088500 [W 2025-03-21 22:20:12.710 ServerApp] Replacing stale connection: 11421a39-5726-49f0-9b24-7ee6c1a02190:c7ab78fd-148f-4c1e-aafc-cbd646628767
2025-03-21 22:20:12.712058500 [I 2025-03-21 22:20:12.712 ServerApp] Connecting to kernel 11421a39-5726-49f0-9b24-7ee6c1a02190.
2025-03-21 22:20:12.716564500 [E 2025-03-21 22:20:12.716 ServerApp] Uncaught exception GET /api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767 (172.24.9.135)
2025-03-21 22:20:12.716565500 HTTPServerRequest(protocol='https', host='idafna-workbench-dev.workbench.prod.netflix.net:8888', method='GET', uri='/api/kernels/11421a39-5726-49f0-9b24-7ee6c1a02190/channels?session_id=c7ab78fd-148f-4c1e-aafc-cbd646628767', version='HTTP/1.1', remote_ip='172.24.9.135')
2025-03-21 22:20:12.716566500 Traceback (most recent call last):
2025-03-21 22:20:12.716566500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/tornado/websocket.py", line 940, in _accept_connection
2025-03-21 22:20:12.716567500 await open_result
2025-03-21 22:20:12.716567500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/websocket.py", line 75, in open
2025-03-21 22:20:12.716567500 await self.connection.connect()
2025-03-21 22:20:12.716567500 ^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716568500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 364, in connect
2025-03-21 22:20:12.716568500 self.create_stream()
2025-03-21 22:20:12.716568500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_server/services/kernels/connection/channels.py", line 155, in create_stream
2025-03-21 22:20:12.716569500 self.channels[channel] = stream = meth(identity=identity)
2025-03-21 22:20:12.716569500 ^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716569500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/ioloop/manager.py", line 25, in wrapped
2025-03-21 22:20:12.716570500 socket = f(self, *args, **kwargs)
2025-03-21 22:20:12.716570500 ^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716570500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 664, in connect_iopub
2025-03-21 22:20:12.716571500 sock = self._create_connected_socket("iopub", identity=identity)
2025-03-21 22:20:12.716571500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716571500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/jupyter_client/connect.py", line 654, in _create_connected_socket
2025-03-21 22:20:12.716572500 sock = self.context.socket(socket_type)
2025-03-21 22:20:12.716572500 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-03-21 22:20:12.716572500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/context.py", line 354, in socket
2025-03-21 22:20:12.716572500 socket_class( # set PYTHONTRACEMALLOC=2 to get the calling frame
2025-03-21 22:20:12.716573500 File "/apps/bdi-internal-venv-jupyter/lib/python3.11/site-packages/zmq/sugar/socket.py", line 159, in __init__
2025-03-21 22:20:12.716573500 super().__init__(
2025-03-21 22:20:12.716573500 File "_zmq.py", line 690, in zmq.backend.cython._zmq.Socket.__init__
2025-03-21 22:20:12.716574500 zmq.error.ZMQError: Too many open files
2025-03-21 22:20:12.821062500 [W 2025-03-21 22:20:12.821 ServerApp] Replacing stale connection: 982322be-1070-4612-bf59-07d5f4702ade:c34b488e-ca59-42a0-a61a-e505e79d9e03
2025-03-21 22:20:12.822449500 [I 2025-03-21 22:20:12.822 ServerApp] Adapting from protocol version 5.4 (kernel 982322be-1070-4612-bf59-07d5f4702ade) to 5.3 (client).
I'm still working on a repro, but I wanted to share this sooner rather than later so that anyone who has an idea of how to reproduce it, or has seen this before, can chime in.
Thanks!
Thanks. If it's not happening all the time, it's probably a failure to clean up when a particular error occurs or the kernel fails to respond. Hopefully it's not too hard to track down.
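For illustration only (this is not the actual jupyter_server code), the fix would look like treating the transient nudge sockets as resources that are always released, even when the kernel never replies:

```python
import asyncio

import zmq
import zmq.asyncio


async def nudge_once(shell_url: str, timeout: float = 1.0) -> bool:
    """Hypothetical sketch: poke a kernel once and always release the socket.

    shell_url is the kernel's shell channel address, e.g. "tcp://127.0.0.1:5555".
    The request frames below are placeholders; a real nudge sends a
    kernel_info_request serialized with the Jupyter wire protocol.
    """
    ctx = zmq.asyncio.Context.instance()
    sock = ctx.socket(zmq.DEALER)
    try:
        sock.connect(shell_url)
        await sock.send_multipart([b"<kernel_info_request frames>"])
        try:
            await asyncio.wait_for(sock.recv_multipart(), timeout)
            return True  # the kernel answered, so it is alive
        except asyncio.TimeoutError:
            return False  # unresponsive kernel; the caller may retry
    finally:
        # The important part: the socket (and its file descriptor) is released
        # on every exit path, including timeouts and exceptions.
        sock.close(linger=0)
```

If the close only happens on the happy path, every nudge against an unresponsive kernel strands one descriptor, which matches the pattern in the logs above.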
In my experience, the nudging logic in Jupyter Server is really sensitive and often leads to leaks like this. I think the kernel APIs in Jupyter Server need a major revamp.
I did some work on this last year, with the hope of eventually replacing parts of Jupyter Server with a simpler, more robust architecture: https://github.com/Zsailer/nextgen-kernels-api
The goals of this work:
- tracks kernel lifecycle state and execution state server-side
- uses a single kernel client (and thus a single set of ZMQ channels) to communicate with the kernel, so there's no need to open ZMQ sockets outside of this client (a rough sketch of this pattern follows the list)
- uses a completely native asyncio approach to poll messages from the kernel, dropping the tornado IOLoop and ZMQStream logic
- simplifies the websocket connection logic
- removes all nudging logic from the websocket handler, since the kernel manager owns this now
- the WS handler registers itself as a listener on the kernel client
- the websocket can connect even if the kernel is busy, which (I think) eliminates the need for "pending"
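Here's roughly what that single-client pattern looks like (heavily simplified, and not the actual code from the repo above): one asyncio task owns the kernel's iopub socket and fans each message out to whichever listeners, e.g. websocket handlers, have registered:

```python
import asyncio
from typing import Awaitable, Callable

import zmq
import zmq.asyncio

# A listener receives the raw multipart message frames from the kernel.
Listener = Callable[[list[bytes]], Awaitable[None]]


class KernelClientHub:
    """Sketch: one ZMQ connection per kernel, many listeners (e.g. websocket handlers)."""

    def __init__(self, iopub_url: str) -> None:
        self._iopub_url = iopub_url  # e.g. "tcp://127.0.0.1:5556" (placeholder)
        self._listeners: set[Listener] = set()
        self._ctx = zmq.asyncio.Context.instance()

    def add_listener(self, listener: Listener) -> None:
        # A websocket handler registers itself here instead of opening its own sockets.
        self._listeners.add(listener)

    def remove_listener(self, listener: Listener) -> None:
        self._listeners.discard(listener)

    async def run(self) -> None:
        sock = self._ctx.socket(zmq.SUB)
        sock.setsockopt(zmq.SUBSCRIBE, b"")  # subscribe to all iopub topics
        sock.connect(self._iopub_url)
        try:
            while True:
                frames = await sock.recv_multipart()
                # Fan the raw message out; each listener decides how to relay it.
                for listener in list(self._listeners):
                    await listener(frames)
        finally:
            sock.close(linger=0)
```

Because clients only register and unregister listeners, the number of ZMQ sockets per kernel stays constant no matter how many websockets connect or reconnect, which is exactly the class of leak this issue is about.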
Unfortunately, it's been difficult to find people to review + validate this approach outside our own team, so these things didn't really make it back into server (yet). This approach would eliminate the leaks you're seeing.