What happened + What you expected to happen
While running a workload, I found a type of job that reliably triggers a resource leak after the Ray job ends. I reduced the code to a minimal reproduction and have provided it below. The Ray cluster used for testing has a single worker node type with 128 CPUs and 896 GB of memory, minWorkerNum set to 0, maxWorkerNum set to 1, and no environment variables configured. After the job has been running for one minute, you need to manually run ray job stop to trigger the exception.
The job shows the following abnormal behaviors, which I think are worth an in-depth investigation to see what bug is being triggered:
1. While the job is running, the Resource Status panel on the Overview page always shows the CPU resources as fully occupied, i.e. 128 tasks running in parallel, but only a dozen or so tasks show as running on the Job Detail page.
2. After the run completes, a logical resource leak is triggered almost every time: the Resource Status panel keeps showing the resources as occupied, so the worker node cannot be scaled down.
3. After the run completes, a Ray task leak is also triggered almost every time: many pending tasks remain in the Demands section of Resource Status, causing the node to scale up and down repeatedly.
These three problems occasionally appear individually in other jobs, but with the code given here they are almost always triggered together. Please take a look at where the problem occurs. Thank you!
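As a way to double-check the logical leak without relying on the dashboard, the cluster's total and available logical resources can be compared after the job is stopped. This snippet is not part of the original report; the file name and the use of address="auto" from a node inside the cluster are assumptions.

# check_leak.py -- minimal sketch (not from the original report) for confirming
# the logical resource leak outside the dashboard.
import ray

ray.init(address="auto")  # attach to the existing cluster

total = ray.cluster_resources()        # logical resources registered by all nodes
available = ray.available_resources()  # logical resources not currently claimed

# After `ray job stop`, once all workers have exited, available CPU should
# return to the total. If it stays lower indefinitely, the CPUs are leaked.
print("total CPU:    ", total.get("CPU", 0))
print("available CPU:", available.get("CPU", 0))
print("leaked CPU:   ", total.get("CPU", 0) - available.get("CPU", 0))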
Versions / Dependencies
Ray v2.40.0
Kuberay v1.2.2
Reproduction script
import ray
import time
import random


@ray.remote
def son():
    time.sleep(10)


@ray.remote
def father():
    futures = []
    # Launch many short child tasks, each requesting a random amount of memory.
    for i in range(9000):
        futures.append(son.options(memory=random.randint(1, 1024**3)).remote())
    res_list = []
    # Drain the futures one at a time as they complete.
    while len(futures) > 0:
        ready_futures, futures = ray.wait(futures, num_returns=1)
        res_list.extend(ray.get(ready_futures))


if __name__ == '__main__':
    ray.init()
    ray.get(father.remote())
Submit job: ray job submit --address=http://localhost:8265 --working-dir=. -- python3 test_resource_leak.py
Then, after the job has been running for one minute, execute ray job stop 02000000 --address=http://localhost:8265.
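If it helps with reproducing, the same submit/wait/stop sequence can also be driven through the Ray Job Submission SDK instead of the CLI. This is a sketch under the assumptions that the dashboard is reachable at http://localhost:8265 and that about 60 seconds of runtime is enough for the symptoms to appear.

# reproduce.py -- hedged sketch of the submit/stop sequence via the
# Job Submission SDK; the address and the 60-second delay are assumptions.
import time
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://localhost:8265")

# Submit the reproduction script from the current working directory.
submission_id = client.submit_job(
    entrypoint="python3 test_resource_leak.py",
    runtime_env={"working_dir": "."},
)
print("submitted:", submission_id)

# Let the job run for about a minute, then stop it to trigger the leak.
time.sleep(60)
client.stop_job(submission_id)
print("stop requested for:", submission_id)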
Issue Severity
High: It blocks me from completing my task.