Ray Head Node leave zombie processes after job is finished #50031

Open
wangxin201492 opened this issue Jan 23, 2025 · 0 comments

I run a RayCluster (Ray 2.39.0) using KubeRay (1.2.2) and submit many jobs to it. I discovered that many zombie processes are left behind after the jobs finish.

The zombie processes cause some psutil methods to run very slowly.
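To quantify how many defunct processes have accumulated on the head node, something like the following could be used. This is only a sketch under assumptions: it assumes a Linux `/proc` filesystem, and the helper names `parse_stat` and `list_zombies` are hypothetical, not part of Ray or psutil.

```python
import os


def parse_stat(stat_text):
    """Parse the text of /proc/<pid>/stat into (name, state, ppid)."""
    # Layout is "pid (comm) state ppid ..."; comm may contain spaces or ')',
    # so locate it via the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    name = stat_text[open_paren + 1:close_paren]
    fields = stat_text[close_paren + 1:].split()
    return name, fields[0], int(fields[1])


def list_zombies(proc_root="/proc"):
    """Return (pid, ppid, name) for every process whose state is 'Z'."""
    zombies = []
    if not os.path.isdir(proc_root):
        return zombies  # not Linux, or /proc is not mounted
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat"), errors="replace") as f:
                name, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process vanished or stat was unreadable mid-scan
        if state == "Z":
            zombies.append((int(entry), ppid, name))
    return zombies


if __name__ == "__main__":
    for pid, ppid, name in list_zombies():
        print(f"zombie pid={pid} ppid={ppid} name={name}")
```

Running this before and after a job submission would show whether the count grows by two per job, and the `ppid` column shows which process is currently responsible for reaping them.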

Each submitted job leaves 2 zombie processes. In more detail: when I submit a job, a JobSupervisor starts on the head node to manage the job. The JobSupervisor (pid=152424) runs 2 subprocesses:

  1. /bin/bash -c python numpy-cpu-job-actor.py, pid is 152834
  2. /bin/bash -c while kill -s 0 152424; do sleep 1; done; kill -9 -152824, pid is 152836

When the job finishes, 152424 and 152834 exit, but 152836 and its subprocess are left as zombies: 1) [sh] <defunct>; 2) [sleep] <defunct>
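For background on what a `<defunct>` entry means in general: a child that has exited stays in the process table as a zombie until its parent (or whichever process adopts it) calls wait on it. The sketch below illustrates that on Linux with a short-lived shell child. It is not Ray's code, the `proc_state` helper is hypothetical, and whether a missing wait is the actual mechanism behind the leftovers reported here is an assumption.

```python
import subprocess
import time


def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/stat", errors="replace") as f:
            stat = f.read()
    except OSError:
        return None  # pid does not exist, or /proc is not available
    return stat[stat.rindex(")") + 1:].split()[0]


# Start a short-lived shell child, launched through a shell like the commands above.
child = subprocess.Popen("sleep 0.2", shell=True)
time.sleep(1.0)  # the child has exited by now, but nobody has waited on it yet

state_before = proc_state(child.pid)  # on Linux this is typically 'Z' (exited, unreaped)
exit_code = child.wait()              # reaps the child and clears the zombie entry
state_after = proc_state(child.pid)   # the pid should no longer be listed

print("before wait:", state_before, "| after wait:", state_after, "| exit code:", exit_code)
```

If the parent itself exits without waiting, the kernel hands the child to pid 1 (or a subreaper), and the zombie persists until that process reaps it.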

The code of numpy-cpu-job-actor.py is:

import ray
import numpy as np
import datetime

t0 = datetime.datetime.now()
formatted_time = t0.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Starting at ", formatted_time)

ray.init()
t1 = datetime.datetime.now()
formatted_time = t1.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Ray initialized at ", formatted_time)

@ray.remote
def cpu_intensive_task():
    result = 0
    tt1 = datetime.datetime.now()
    print("Start at ", tt1.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3])
    for _ in range(int(5e6)):
        result += np.random.rand()
    tt2 = datetime.datetime.now()
    formatted_time1 = tt2.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Finished at %s, cost %.2f second." % (formatted_time1, (tt2-tt1).total_seconds()))
    return result


t2 = datetime.datetime.now()
formatted_time = t2.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Placement group ready at ", formatted_time)


t3 = datetime.datetime.now()
formatted_time = t3.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Actor scheduled at ", formatted_time)

result_ids = [cpu_intensive_task.options().remote() for _ in range(2)]
t4 = datetime.datetime.now()
formatted_time = t4.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
print("Actor task scheduled at ", formatted_time)

try:
    results = ray.get(result_ids)
    t5 = datetime.datetime.now()
    formatted_time = t5.strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    print("Finished at ", formatted_time)
    print("Result: ")
    print(results)
    print("t0 - t1 - t2 - t3 - t4 - t5: %.2f - %.2f - %.2f - %.2f - %.2f" % ((t1-t0).total_seconds(), (t2-t1).total_seconds(), (t3-t2).total_seconds(), (t4-t3).total_seconds(), (t5-t4).total_seconds()))
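except KeyboardInterrupt:
    print("terminated.")
finally:
    ray.shutdown()
    # Take a fresh timestamp; reusing formatted_time would print a stale value.
    print("Ray shutdown at ", datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S,%f")[:-3])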

The submit command is: ray job submit --working-dir . -- python numpy-cpu-job-actor.py

I'm wondering whether I did something wrong that caused this, or whether this is a bug in Ray?
