[Ray Core] After the ray job is finished, it will stably trigger resource leakage #49999

Open
Moonquakes opened this issue Jan 22, 2025 · 2 comments
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
P0: Issues that should be fixed in short order

Comments


Moonquakes commented Jan 22, 2025

What happened + What you expected to happen

While running a workload, I found a type of job that reliably triggers a resource leak after the Ray job ends. I reduced the code to a minimal reproduction, which is provided below. The Ray cluster used in the test is configured with a 128c896g worker node, minWorkerNum set to 0, maxWorkerNum set to 1, and no environment variables. To trigger the problem, let the job run for one minute and then manually run ray job stop.

The job shows the following abnormal behaviors, which I think are worth an in-depth investigation to find out which bug is being triggered.

  1. While the job is running, the Resource Status panel on the Overview page always shows the CPU resources as fully occupied, which implies that 128 tasks are running in parallel, but the Job Detail page shows only about a dozen tasks in the running state.
  2. After the job ends, a logical resource leak is triggered almost every time: the Resource Status panel on the Overview page keeps showing the resources as occupied, so the worker node cannot be scaled down.
  3. After the job ends, a Ray task leak is triggered almost every time: many pending tasks remain under Demands in the Resource Status panel on the Overview page, which causes the node to scale up and down continuously.

Image

Each of these three problems occasionally shows up on its own in other jobs, but with the code given here all three are triggered together almost every time. Please take a look at where the problem occurs. Thank you!
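For anyone trying to confirm the second symptom without the dashboard, here is a minimal sketch of a check that compares total and available logical resources once no job should be running. The helper names `occupied_resources` and `check_cluster_for_leak` are hypothetical and not part of the original report; `ray.cluster_resources()` and `ray.available_resources()` are the public Ray calls it relies on, and `address="auto"` assumes the script runs on a node of the existing cluster.

```python
def occupied_resources(total, available):
    """Return {resource: amount in use} for every resource that is not fully free.

    A resource missing from `available` is treated as fully occupied.
    """
    occupied = {}
    for name, capacity in total.items():
        used = capacity - available.get(name, 0.0)
        if used > 1e-9:  # ignore floating-point noise
            occupied[name] = used
    return occupied


def check_cluster_for_leak():
    """Hypothetical helper: connect to a running cluster and report usage."""
    import ray  # imported lazily so occupied_resources works without Ray installed

    ray.init(address="auto")  # assumption: run from a node of the existing cluster
    return occupied_resources(ray.cluster_resources(), ray.available_resources())
```

If `check_cluster_for_leak()` returns a non-empty dict some time after the job was stopped and no other job is running, that would be consistent with the leak described above, though it does not by itself identify the cause.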

Versions / Dependencies

Ray v2.40.0
Kuberay v1.2.2

Reproduction script

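import random
import time

import ray


@ray.remote
def son():
    # Each child task just holds its resources for 10 seconds.
    time.sleep(10)


@ray.remote
def father():
    # Launch 9000 child tasks, each requesting a random amount of memory
    # between 1 byte and 1 GiB.
    futures = []
    for _ in range(9000):
        futures.append(son.options(memory=random.randint(1, 1024**3)).remote())

    # Collect results one at a time as child tasks finish.
    res_list = []
    while len(futures) > 0:
        ready_futures, futures = ray.wait(futures, num_returns=1)
        res_list.extend(ray.get(ready_futures))


if __name__ == '__main__':
    ray.init()
    ray.get(father.remote())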
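As a rough sanity check on the resource requests in this script, the arithmetic below sketches why CPU, rather than memory, would be expected to cap concurrency at 128 tasks. It assumes that "896g" means 896 GiB available to Ray (in practice Ray exposes somewhat less than the physical memory) and that each `son` task uses Ray's default of 1 CPU; the constant and function names are illustrative only.

```python
import random

GIB = 1024**3
NODE_CPUS = 128            # from the "128c896g" node description
NODE_MEMORY = 896 * GIB    # assumption: "896g" means 896 GiB usable by Ray
CPUS_PER_TASK = 1          # Ray's default num_cpus for a remote task
NUM_TASKS = 9000


def max_concurrent_by_cpu(node_cpus, cpus_per_task):
    """Number of tasks that fit on the node if CPU is the only limit."""
    return node_cpus // cpus_per_task


def worst_case_memory(concurrent_tasks, max_task_memory):
    """Memory requested if every concurrent task asked for the maximum."""
    return concurrent_tasks * max_task_memory


concurrent = max_concurrent_by_cpu(NODE_CPUS, CPUS_PER_TASK)
peak = worst_case_memory(concurrent, GIB)

# Sample the same distribution the script uses to see how varied the
# per-task memory requests are (seeded only to make the sketch repeatable).
rng = random.Random(0)
requests = [rng.randint(1, GIB) for _ in range(NUM_TASKS)]
distinct_requests = len(set(requests))

print("CPU-bound concurrency:", concurrent)
print("worst-case memory fits on node:", peak <= NODE_MEMORY)
print("distinct memory requests among", NUM_TASKS, "tasks:", distinct_requests)
```

Under these assumptions the memory requests never become the bottleneck, so the script's distinguishing feature is that nearly every task carries a different memory request. Whether that is related to the leak is not established in this report.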

Submit the job with: ray job submit --address=http://localhost:8265 --working-dir=. -- python3 test_resource_leak.py
Then, after the job has been running for one minute, run: ray job stop 02000000 --address=http://localhost:8265
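The submit, wait, and stop steps can also be scripted, which may make the reproduction easier to repeat. The sketch below uses `JobSubmissionClient` from Ray's job submission SDK; the function names `run_then_stop` and `reproduce` are hypothetical, and unlike the CLI command above it stops the job by the submission ID that `submit_job` returns, which is an assumption about how one would adapt the steps rather than what was run in the report.

```python
import time


def run_then_stop(client, entrypoint, run_seconds=60, working_dir=".", sleep=time.sleep):
    """Submit a job, let it run for `run_seconds`, then request a stop.

    `client` is expected to behave like ray.job_submission.JobSubmissionClient.
    Returns the submission ID so the caller can inspect the job afterwards.
    """
    submission_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": working_dir},
    )
    sleep(run_seconds)
    client.stop_job(submission_id)
    return submission_id


def reproduce():
    """Hypothetical driver: run against the dashboard address from the report."""
    from ray.job_submission import JobSubmissionClient  # lazy import

    client = JobSubmissionClient("http://localhost:8265")
    submission_id = run_then_stop(client, "python3 test_resource_leak.py")
    print(submission_id, client.get_job_status(submission_id))
```

Calling `reproduce()` from the directory that contains test_resource_leak.py should mirror the manual steps; the Resource Status panel can then be checked as described in the report.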

Issue Severity

High: It blocks me from completing my task.

Moonquakes added the bug and triage labels Jan 22, 2025
jcotant1 added the core label Jan 22, 2025
jjyao added the P0 label and removed the triage label Jan 22, 2025
Collaborator

jjyao commented Jan 23, 2025

I'm able to repro. Thanks for reporting.

@Moonquakes
Author

@jjyao Thanks for your quick confirmation and looking forward to having this fixed!
