Worker controller can hit a stuck state on error #1156

Open
sambles opened this issue Jan 20, 2025 · 1 comment

sambles commented Jan 20, 2025

With the fixed workers scaling strategy configured as

{
  "scaling_strategy": "FIXED_WORKERS",
  "worker_count_fixed": 12,
  "worker_count_max": 12,
  "worker_count_min": 0,
  "chunks_per_worker": 10
}

an analysis got stuck waiting for the controller to spin up a worker, but the worker-controller hit an error and was left in a state where the number of replicas was 0.

worker-controller-logs.txt


sambles commented Jan 22, 2025

oasis-worker-controller-574bbc4c94-8n9xv
Defaulted container "main" out of: main, init-tcp-wait-by-secret (init)
2025-01-22 12:08:16,451 INFO: Deployment XXX-YYY-2-v2: New
2025-01-22 12:08:16,452 INFO: Current list of worker deployments:
2025-01-22 12:08:16,452 INFO: - worker-XXX-YYY-2-v2 (replicas: 12)
2025-01-22 12:08:16,592 INFO: Connected to ws: oasis-websocket:8001
2025-01-22 12:08:16,726 INFO: Get oasis model id from API for model XXX-YYY-2-v2
2025-01-22 13:38:20,291 INFO: Start cleanup of: {'worker-XXX-YYY-2-v2'}
2025-01-22 13:38:20,292 INFO: Scale worker-XXX-YYY-2-v2 to 0 replicas
2025-01-22 13:38:20,343 INFO: Deployment XXX-YYY-2-v2: updated replicas: 0
2025-01-22 13:57:08,412 ERROR: Task exception was never retrieved
future: <Task finished name='Task-4' coro=<DeploymentWatcher.watch() done, defined at /app/worker-controller/cluster_client.py:77> exception=ServerDisconnectedError('Server disconnected')>
Traceback (most recent call last):
  File "/app/worker-controller/cluster_client.py", line 89, in watch
    async for event in w.stream(apps_v1.list_namespaced_deployment, namespace=self.namespace,
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/watch/watch.py", line 131, in __anext__
    return await self.next()
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/watch/watch.py", line 143, in next
    self.resp = await self.func()
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/api_client.py", line 182, in __call_api
    response_data = await self.request(
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
    return (await self.request("GET", url,
  File "/usr/lib/python3.10/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
    r = await self.pool_manager.request(**args)
  File "/usr/lib/python3.10/site-packages/aiohttp/client.py", line 605, in _request
    await resp.start(conn)
  File "/usr/lib/python3.10/site-packages/aiohttp/client_reqrep.py", line 976, in start
    message, payload = await protocol.read()  # type: ignore[union-attr]
  File "/usr/lib/python3.10/site-packages/aiohttp/streams.py", line 640, in read
    await self._waiter
aiohttp.client_exceptions.ServerDisconnectedError: Server disconnected
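
The "Task exception was never retrieved" line suggests the `DeploymentWatcher.watch()` task ended on the `ServerDisconnectedError` and nothing reopened the stream. One possible mitigation, sketched below and not confirmed as the project's fix, is to wrap the watch loop so a dropped connection reopens the stream with a backoff. The names `watch_with_restart`, `open_stream`, `handle_event` and the stand-in `ServerDisconnected` class are hypothetical; in the controller, `open_stream` would wrap the `w.stream(apps_v1.list_namespaced_deployment, ...)` call from the traceback and `retry_on` would include `aiohttp.client_exceptions.ServerDisconnectedError`.

```python
import asyncio
import logging


class ServerDisconnected(Exception):
    """Stand-in (assumption) for aiohttp's ServerDisconnectedError."""


async def watch_with_restart(open_stream, handle_event, *,
                             retry_on=(ServerDisconnected,),
                             base_delay=1.0, max_delay=30.0,
                             max_restarts=None):
    """Consume events from open_stream(), reopening it when it drops.

    open_stream: callable returning a fresh async iterator of events.
    handle_event: coroutine function called once per event.
    Returns the number of restarts once max_restarts is reached; with
    max_restarts=None it loops until the task is cancelled.
    """
    restarts = 0
    delay = base_delay
    while True:
        try:
            async for event in open_stream():
                # A delivered event means the connection works again,
                # so reset the backoff.
                delay = base_delay
                await handle_event(event)
        except retry_on as exc:
            logging.warning("watch stream dropped (%r), reopening in %.1fs",
                            exc, delay)
        # A clean end of stream also falls through to the reopen below.
        if max_restarts is not None and restarts >= max_restarts:
            return restarts
        restarts += 1
        await asyncio.sleep(delay)
        delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    attempts = {"n": 0}

    def demo_stream():
        # First stream fails after one event, the second ends cleanly.
        attempts["n"] += 1
        attempt = attempts["n"]

        async def gen():
            if attempt == 1:
                yield "replicas: 0"
                raise ServerDisconnected("Server disconnected")
            yield "replicas: 12"

        return gen()

    async def print_event(event):
        print("event:", event)

    asyncio.run(watch_with_restart(demo_stream, print_event,
                                   base_delay=0, max_restarts=1))
```

Exceptions outside `retry_on` still propagate, so the task that runs this loop would also need to be awaited or given a done callback; otherwise a different failure would again end the watcher silently.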
