vine: add current_libraries to the worker's data structure #4046

Open
JinZhou5042 opened this issue Jan 28, 2025 · 3 comments

@JinZhou5042

JinZhou5042 commented Jan 28, 2025

There are many places that may need to get the running libraries on a worker and operate on them. Currently, the only way to do that is to traverse all running tasks and pick out the libraries.

For example, in kill_empty_libraries_on_worker, we kill unused libraries on a worker to reclaim resources:

static void kill_empty_libraries_on_worker(struct vine_manager *q, struct vine_worker_info *w, struct vine_task *t)
{
	uint64_t task_id;
	struct vine_task *task;
	ITABLE_ITERATE(w->current_tasks, task_id, task)
	{
		if (task->provides_library && task->function_slots_inuse == 0) {
			vine_cancel_by_task_id(q, task->task_id);
		}
	}
}

In check_worker_have_enough_resources, we subtract the in-use resources of libraries that are not running any functions at all:

uint64_t task_id;
struct vine_task *ti;
ITABLE_ITERATE(w->current_tasks, task_id, ti)
{
	if (ti->provides_library && ti->function_slots_inuse == 0) {
		worker_net_resources->disk.inuse -= ti->current_resource_box->disk;
		worker_net_resources->cores.inuse -= ti->current_resource_box->cores;
		worker_net_resources->memory.inuse -= ti->current_resource_box->memory;
		worker_net_resources->gpus.inuse -= ti->current_resource_box->gpus;
	}
}

When scheduling a function, we could terminate early by checking whether any library on the worker has a free slot.

Does it make sense to add a current_libraries table to the worker's data structure, so that we don't spend time traversing tasks that are not libraries? It would be as simple as calling itable_insert(w->current_libraries, t->task_id, t); on committing and itable_remove(w->current_libraries, t->task_id); on reaping, and it would make those operations much more convenient.
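The bookkeeping proposed above could look roughly like the following. This is a minimal, self-contained sketch rather than the actual cctools code: `struct task`, `struct worker`, `MAX_TASKS`, `commit_task`, `reap_task`, and `count_empty_libraries` are hypothetical stand-ins, and flat arrays replace the real itables so the example compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 64 /* hypothetical fixed capacity standing in for an itable */

/* Hypothetical, trimmed-down stand-in for struct vine_task. */
struct task {
	uint64_t task_id;
	int provides_library;
	int function_slots_inuse;
};

/* Hypothetical stand-in for struct vine_worker_info. */
struct worker {
	struct task *current_tasks[MAX_TASKS];
	size_t n_tasks;
	struct task *current_libraries[MAX_TASKS]; /* the proposed extra index */
	size_t n_libraries;
};

/* On commit: every task is recorded; libraries are also added to the index. */
void commit_task(struct worker *w, struct task *t)
{
	assert(w->n_tasks < MAX_TASKS);
	w->current_tasks[w->n_tasks++] = t;
	if (t->provides_library) {
		w->current_libraries[w->n_libraries++] = t;
	}
}

/* Remove the entry with the given id by overwriting it with the last entry. */
void remove_from(struct task **arr, size_t *n, uint64_t task_id)
{
	for (size_t i = 0; i < *n; i++) {
		if (arr[i]->task_id == task_id) {
			arr[i] = arr[--(*n)];
			return;
		}
	}
}

/* On reap: undo both insertions made at commit time. */
void reap_task(struct worker *w, struct task *t)
{
	remove_from(w->current_tasks, &w->n_tasks, t->task_id);
	if (t->provides_library) {
		remove_from(w->current_libraries, &w->n_libraries, t->task_id);
	}
}

/* Visits only libraries, never ordinary tasks. */
size_t count_empty_libraries(const struct worker *w)
{
	size_t n = 0;
	for (size_t i = 0; i < w->n_libraries; i++) {
		if (w->current_libraries[i]->function_slots_inuse == 0) {
			n++;
		}
	}
	return n;
}
```

The same loop shape as in count_empty_libraries would serve kill_empty_libraries_on_worker and the resource subtraction in check_worker_have_enough_resources, with the provides_library test dropped because the index holds only libraries.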

@btovar

btovar commented Jan 29, 2025

It makes sense to me. Let's evaluate it with a PR, since the changes to the code should be small.

@dthain

dthain commented Jan 29, 2025

Let's keep in mind the expected orders of magnitude in each data structure:

  • The manager may have millions of tasks overall in q->tasks.
  • The manager may have millions of tasks in q->ready_list.
  • The manager may have thousands of running tasks in q->running_table.
  • The manager may have hundreds of workers in q->worker_table.
  • Each worker may have a handful of ready/running tasks in w->current_tasks

Because of the sheer number of tasks in q->tasks, there is a lot gained by segregating the tasks by state into q->ready_list and q->running_table, even though that adds complexity.

But if there are only a handful of items at any given time in w->current_tasks, I'm not sure that we gain a lot by dividing it further into several data structures.

Is there some other consideration?

@JinZhou5042

@dthain One benefit I can see arises when there are hundreds of running tasks. During task scheduling, send_one_task considers tasks up to a fixed depth (100) until one is runnable, select_worker_by_files typically traverses all workers to find the best one, and check_worker_have_enough_resources traverses every task to subtract the resources used by empty libraries. In the worst case, we end up traversing 100*10000 tasks, which might be expensive. If we could access the running libraries on each worker directly, the number of traversals would be reduced by 99%.
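The worst-case figures above can be checked with a little arithmetic. The sketch below is illustrative only: `worst_case_visits` and `percent_reduction` are hypothetical helpers, and the 99% figure is reproduced by assuming that roughly 1% of the scanned entries are libraries, a ratio the comment implies but does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Entries visited in the worst case: each of `depth` candidate tasks
 * triggers a scan over `entries_per_attempt` entries. */
uint64_t worst_case_visits(uint64_t depth, uint64_t entries_per_attempt)
{
	return depth * entries_per_attempt;
}

/* Whole-number percentage saved when going from `before` to `after` visits. */
uint64_t percent_reduction(uint64_t before, uint64_t after)
{
	assert(before > 0 && after <= before);
	return 100 * (before - after) / before;
}
```

With a depth of 100 and 10000 entries per attempt, the full scan visits 1,000,000 entries. If only 100 of those 10000 entries are libraries (the assumed 1%), the indexed scan visits 10,000 entries, which is the 99% reduction quoted.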

It also gives us a way to keep track of all running libraries across all workers: traverse each worker and collect the running libraries on it. This saves time because workers without libraries can be skipped immediately.
