An analysis got stuck waiting for the controller to spin up a worker, but the worker-controller hit an error and got stuck in a state where the number of replicas = 0.
oasis-worker-controller-574bbc4c94-8n9xv
Defaulted container "main" out of: main, init-tcp-wait-by-secret (init)
2025-01-22 12:08:16,451 INFO: Deployment XXX-YYY-2-v2: New
2025-01-22 12:08:16,452 INFO: Current list of worker deployments:
2025-01-22 12:08:16,452 INFO: - worker-XXX-YYY-2-v2 (replicas: 12)
2025-01-22 12:08:16,592 INFO: Connected to ws: oasis-websocket:8001
2025-01-22 12:08:16,726 INFO: Get oasis model id from API for model XXX-YYY-2-v2
2025-01-22 13:38:20,291 INFO: Start cleanup of: {'worker-XXX-YYY-2-v2'}
2025-01-22 13:38:20,292 INFO: Scale worker-XXX-YYY-2-v2 to 0 replicas
2025-01-22 13:38:20,343 INFO: Deployment XXX-YYY-2-v2: updated replicas: 0
2025-01-22 13:57:08,412 ERROR: Task exception was never retrieved
future: <Task finished name='Task-4' coro=<DeploymentWatcher.watch() done, defined at /app/worker-controller/cluster_client.py:77> exception=ServerDisconnectedError('Server disconnected')>
Traceback (most recent call last):
  File "/app/worker-controller/cluster_client.py", line 89, in watch
    async for event in w.stream(apps_v1.list_namespaced_deployment, namespace=self.namespace,
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/watch/watch.py", line 131, in __anext__
    return await self.next()
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/watch/watch.py", line 143, in next
    self.resp = await self.func()
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/api_client.py", line 182, in __call_api
    response_data = await self.request(
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
    r = await self.pool_manager.request(**args)
  File "/usr/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/usr/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read() # type: ignore[union-attr]
  File "/usr/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
When on fixed workers, an analysis got stuck waiting for the controller to spin up a worker, but the worker-controller hit an error and got stuck in a state where the number of replicas = 0.
worker-controller-logs.txt
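The traceback suggests that the `DeploymentWatcher.watch()` task ended with `ServerDisconnectedError` and was never restarted ("Task exception was never retrieved" means nothing awaited the failed task). One possible mitigation, sketched below, is to run the watch inside a restart loop. This is only a sketch and not the actual worker-controller code: `run_with_restart`, its parameters, and the `ServerDisconnectedError` stand-in class are hypothetical names introduced for illustration, and the real fix may need to re-list deployments and re-sync replica counts after reconnecting.

```python
import asyncio
import logging

logger = logging.getLogger("worker-controller-sketch")


class ServerDisconnectedError(Exception):
    """Stand-in (assumption) for aiohttp.client_exceptions.ServerDisconnectedError."""


async def run_with_restart(watch_once, retry_on, delay=1.0, max_restarts=None):
    """Hypothetical helper: await watch_once(), restarting it on retry_on errors.

    Returns the number of restarts once watch_once() finishes without error.
    Re-raises the error if max_restarts is set and has been used up.
    """
    restarts = 0
    while True:
        try:
            await watch_once()
            return restarts
        except retry_on as exc:
            if max_restarts is not None and restarts >= max_restarts:
                raise
            restarts += 1
            logger.warning("Watch failed (%r); restart %d in %.1fs", exc, restarts, delay)
            await asyncio.sleep(delay)


async def demo():
    """Simulate a watch that disconnects twice before completing."""
    attempts = []

    async def flaky_watch():
        attempts.append(1)
        if len(attempts) < 3:
            raise ServerDisconnectedError("Server disconnected")

    restarts = await run_with_restart(flaky_watch, (ServerDisconnectedError,), delay=0)
    return restarts, len(attempts)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In the controller, `watch_once` would presumably be the existing watch coroutine and `retry_on` would include the real aiohttp exception. A Kubernetes watch stream can also end without an error, so a long-running controller would likely want to reconnect in that case too instead of returning.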