[Ray Core] After the ray job is finished, it will stably trigger resource leakage #49999

Moonquakes · 2025-01-22T05:40:27Z

What happened + What you expected to happen

While running a workload, I found a type of job that reliably triggers a resource leak after the Ray job ends. I reduced the code to a minimal reproduction, provided below. The Ray cluster used in the test is configured with a 128c896g worker node (128 CPUs, 896 GB of memory), minWorkerNum set to 0, maxWorkerNum set to 1, and no environment variables. To trigger the problem, let the job run for about one minute and then stop it manually with ray job stop.

This job shows the following abnormal behavior, which I think is worth investigating in depth to find the underlying bug:

1. While the job is running, the Resource Status panel on the Overview page always shows the CPU resources as fully occupied, which would mean 128 tasks are running in parallel, but the Job Detail page shows only about a dozen tasks in the running state.
2. After the job ends, a logical resource leak is triggered almost every time: the Resource Status panel on the Overview page keeps showing the resources as occupied, so the worker node cannot be scaled down.
3. After the job ends, a Ray task leak is triggered almost every time: many pending tasks remain under Demands in the Resource Status panel on the Overview page, which causes the node to scale up and down continuously.

Each of these three problems occasionally shows up on its own in other jobs, but with the code in this report they are triggered together almost every time. Please take a look at where the problem occurs. Thank you!
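
The second symptom can also be checked from a script rather than the dashboard, by comparing the cluster's total logical resources with what Ray reports as available once no job is running. The following is a minimal sketch, not part of the original report. `leaked_resources` and `report_leak` are hypothetical helper names, and connecting with `address="auto"` is an assumption about the environment. `ray.cluster_resources()` and `ray.available_resources()` are existing Ray calls.

```python
from typing import Dict


def leaked_resources(total: Dict[str, float],
                     available: Dict[str, float],
                     tolerance: float = 1e-6) -> Dict[str, float]:
    """Return the resources still held, i.e. total minus available per key.

    A key that is missing from `available` is treated as zero available.
    """
    held = {}
    for name, amount in total.items():
        diff = amount - available.get(name, 0.0)
        if diff > tolerance:
            held[name] = diff
    return held


def report_leak() -> Dict[str, float]:
    """Query a running cluster (assumption: reachable via address="auto")."""
    import ray  # imported lazily so the pure helper above works without Ray

    ray.init(address="auto", ignore_reinit_error=True)
    return leaked_resources(ray.cluster_resources(), ray.available_resources())


# Example with made-up numbers: 128 CPUs in total, none reported as available.
print(leaked_resources({"CPU": 128.0, "memory": 100.0}, {"memory": 100.0}))
```

A non-empty result is only suspicious when no job is running. Some entries, such as object store memory, may legitimately stay non-zero for a while.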

Versions / Dependencies

Ray v2.40.0
Kuberay v1.2.2

Reproduction script

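import ray
import time
import random


@ray.remote
def son():
    # Each child task simply holds its resources for 10 seconds.
    time.sleep(10)


@ray.remote
def father():
    # Launch 9000 child tasks, each with a random memory request
    # between 1 byte and 1 GiB.
    futures = []
    for _ in range(9000):
        futures.append(son.options(memory=random.randint(1, 1024**3)).remote())

    # Collect the results one at a time as child tasks finish.
    res_list = []
    while len(futures) > 0:
        ready_futures, futures = ray.wait(futures, num_returns=1)
        res_list.extend(ray.get(ready_futures))


if __name__ == '__main__':
    ray.init()
    ray.get(father.remote())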

Submit the job: ray job submit --address=http://localhost:8265 --working-dir=. -- python3 test_resource_leak.py
After the job has run for one minute, stop it: ray job stop 02000000 --address=http://localhost:8265
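
The submit, wait, and stop steps can be scripted so that the timing is the same on every run. The following is a rough sketch, not taken from the original report. It assumes the Ray Jobs Python SDK (`ray.job_submission.JobSubmissionClient`, with its `submit_job` and `stop_job` methods) and the same dashboard address as above. `submit_then_stop` and `run_against_local_cluster` are hypothetical helper names.

```python
import time
from typing import Callable


def submit_then_stop(client,
                     entrypoint: str,
                     run_seconds: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Submit a job, let it run for `run_seconds`, then stop it.

    `client` is expected to behave like
    ray.job_submission.JobSubmissionClient.
    """
    submission_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": "."},  # same as --working-dir=.
    )
    sleep(run_seconds)
    client.stop_job(submission_id)
    return submission_id


def run_against_local_cluster() -> str:
    """Assumption: the Ray dashboard is reachable at localhost:8265."""
    from ray.job_submission import JobSubmissionClient  # lazy import

    client = JobSubmissionClient("http://localhost:8265")
    return submit_then_stop(client, "python3 test_resource_leak.py")
```

Calling `run_against_local_cluster()` from the directory that contains test_resource_leak.py should mirror the two commands above. The ID passed to `stop_job` is the submission ID returned by `submit_job`.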

Issue Severity

High: It blocks me from completing my task.

jjyao · 2025-01-23T19:08:11Z

I'm able to repro. Thanks for reporting.

Moonquakes · 2025-01-24T00:58:32Z

@jjyao Thanks for your quick confirmation and looking forward to having this fixed!

Moonquakes added the bug (something that is supposed to be working, but isn't) and triage (needs triage, e.g. priority, bug/not-bug, and owning component) labels Jan 22, 2025

jcotant1 added the core (issues that should be addressed in Ray Core) label Jan 22, 2025

jjyao added the P0 (issues that should be fixed in short order) label and removed the triage label Jan 22, 2025
