TaskVine: Fix library deployment method #3560

tphung3 · 2023-10-26T18:34:52Z

Proposed changes

Before this PR the manager tries to send a library to all known workers, which takes a long time and breaks the logic of the internal wait loop. Now the manager sends the library as necessary only.

Post-change actions

Put an 'x' in the boxes that describe post-change actions that you have done.
The more 'x' ticked, the faster your changes are accepted by maintainers.

make test Run local tests prior to pushing.
make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
make lint Run lint on source code prior to pushing.
Manual Update Did you update the manual to reflect your changes, if appropriate? This action should be done after your changes are approved but not merged.
Type Labels Select github labels for the type of this change: bug, enhancement, etc.
Product Labels Select github labels for the product affected: TaskVine, Makeflow, etc.
PR RTM Mark your PR as ready to merge.

dthain · 2023-10-26T18:50:46Z

Could you summarize the principle of operation here?
Previously, the idea was to make library deployment the lowest priority activity, and then to only deploy one library per worker.
I'm sure one-library-per-worker is no longer what we want.
But not sure if we want to increase the priority of deploying libraries.

tphung3 · 2023-10-26T18:54:07Z

The idea here is to piggyback the library with a function call to send to the worker. Instead of sending all libraries to all known workers at once, we send one function call to one worker as usual, and if that worker doesn't already have the library running, then we send the library (lazily).

dthain · 2023-10-26T19:42:09Z

Per our conversation, please do some big runs with this new configuration (large number of tasks and workers) and plot the results to make sure it has the desired effect...

dthain · 2023-10-27T13:16:54Z

Another fundamental problem that I see here is that we are not checking for the compatibility of the library task with the worker, nor are we checking to see if the worker has the appropriate resources available for the library. For example, perhaps the library needs a GPU.

We need a scheme by which library tasks are scheduled to workers in a "mostly normal" way so that we are checking all of the properties that are important to the task.

tphung3 · 2023-10-27T20:01:44Z

A quick experiment shows that the proposed way delivers minimal improvements, even though it is technically more sound than the previous way of delivering libraries to workers. This might be because the scenario when a manager spends a lot of time to send libraries to a large number (100s) of workers isn't real in practice: instead the manager sends to small chunks of workers (in 20-30) as they arrive. Some numbers are provided below to back up this claim. Note that the experiment setup is exactly the same, except the way to deliver libraries:

With 20,000 function calls, each running for 5 seconds, and 100 16-core workers, each can run 16 concurrent function tasks:

old: 105.88 seconds
new: 104.21 seconds

Same set of function calls, and 400 4-core workers, each can run 4 concurrent function tasks:

old: 178.83 seconds
new: 175.46 seconds

On the second thought, if a user has rounds of experiments within one workflow, which installs one library then installs another library after 500 workers are connected, then the old way of sending libraries will clearly block the main wait loop. Maybe another experiment can confirm this.

Tagging relevant people @dthain @btovar.

dthain · 2023-10-30T13:04:54Z

taskvine/src/manager/vine_manager.c

+					vine_task_copy(hash_table_lookup(q->libraries, t->needs_library)));
+			debug(D_VINE, "Sending library %s to worker %s\n", t->needs_library, w->workerid);
+		} else {
+			debug(D_VINE, "Cannot send library %s to worker %s\n", t->needs_library, w->workerid);


This debug could be misleading b/c vine_manager_send_library_to_worker may return zero if the library simply does not fit the worker, or is incompatible. In short, it's not really a failure, it's just the manager realizing (late in the game) that it doesn't make sense to schedule this task here.

dthain · 2023-10-30T13:06:37Z

taskvine/src/manager/vine_manager.c

@@ -2723,6 +2728,11 @@ static void commit_task_to_worker(struct vine_manager *q, struct vine_worker_inf
 	if (result != VINE_SUCCESS) {
 		debug(D_VINE, "Failed to send task %d to worker %s (%s).", t->task_id, w->hostname, w->addrport);
 		handle_failure(q, w, t, result);
+	} else {
+		if (t->needs_library) {
+			struct vine_task *library_task = hash_table_lookup(w->libraries, t->needs_library);


Use vine_manager_find_library_on_worker to find one of the running libraries on the worker with an available slot, and then store it in t->library_task so that you can correctly un-do the binding when the task ends.

dthain

I think that this approach can work, but it is a mistake to remove the t->library_task pointer. The reason for this pointer is to associate a function call task with one specific library task on that host during scheduling. (This is because a host could run more than one library task at once.). Then the t->library_task pointer is used to perform the needed reference counting on the matched library.

Second, I'm not sure that is necessary to introduce the w->libraries table since the library task is already represented in w->current_tasks. And every table of aliases becomes a new complication to keep track of. You can use the existing vine_manager_find_library_on_worker to find a suitable library, and then keep that binding in t->library_task

dthain · 2023-11-02T16:51:52Z

Per our recent discussions, please fix up this PR as follows:

1 - Libraries should consume resources in the same way as normal tasks, using the same routines. (Use explicit resources if given, otherwise fill the worker.)
2 - It is allowed to have multiple libraries per worker, so function call tasks should continue to track what library they are connected to via t->library_task
3 - Libraries should be deployed when needed by function calls, and then removed when the need no longer exists.

I think this can be done with relatively minor changes. And then let's focus on getting this working with Parsl applications at large scale, and evaluate the performance. (And probably there will be some bugs to fix along the way.)

dthain

Ok, I think you have convinced me that this is a workable idea. Since it is core to the scheduling loop, please make a few changes to clarify for the next reader.

taskvine/src/manager/vine_manager.c

taskvine/src/manager/vine_schedule.c

dthain · 2023-11-06T18:10:25Z

Looks like the OSX build just timed out and preahps will succeed with a simple retry.

tphung3 · 2023-11-06T21:00:41Z

It's the second time the OSX build fails. Can this be related to the recent OSX build changes @dthain?

dthain · 2023-11-07T16:44:38Z

Please rebase on master to get a workaround for the OSX build.

dthain · 2023-12-05T21:37:30Z

RTM?

tphung3 · 2023-12-05T21:44:48Z

Yes it is

tphung3 added 2 commits October 26, 2023 14:32

better library dispatch

009f50b

lint

13d9925

tphung3 added TaskVine bug For modifications that fix a flaw in the code. labels Oct 26, 2023

delete table

937cf1a

dthain reviewed Oct 30, 2023

View reviewed changes

dthain requested changes Oct 30, 2023

View reviewed changes

tphung3 added 4 commits November 3, 2023 10:51

send lib when needed

39654e4

remove lib as needed

0c04d7b

lint

221f703

fix valgrind

7963873

tphung3 requested a review from dthain November 3, 2023 20:23

dthain requested changes Nov 6, 2023

View reviewed changes

fix

7c5d141

Merge branch 'master' into fix_lib_deploy

ad3fcef

dthain approved these changes Dec 5, 2023

View reviewed changes

dthain merged commit f0248ca into cooperative-computing-lab:master Dec 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TaskVine: Fix library deployment method #3560

TaskVine: Fix library deployment method #3560

tphung3 commented Oct 26, 2023 •

edited

Loading

dthain commented Oct 26, 2023

tphung3 commented Oct 26, 2023 •

edited

Loading

dthain commented Oct 26, 2023

dthain commented Oct 27, 2023

tphung3 commented Oct 27, 2023

dthain Oct 30, 2023

dthain Oct 30, 2023

dthain left a comment

dthain commented Nov 2, 2023

dthain left a comment

dthain commented Nov 6, 2023

tphung3 commented Nov 6, 2023

dthain commented Nov 7, 2023

dthain commented Dec 5, 2023

tphung3 commented Dec 5, 2023

TaskVine: Fix library deployment method #3560

TaskVine: Fix library deployment method #3560

Conversation

tphung3 commented Oct 26, 2023 • edited Loading

Proposed changes

Post-change actions

dthain commented Oct 26, 2023

tphung3 commented Oct 26, 2023 • edited Loading

dthain commented Oct 26, 2023

dthain commented Oct 27, 2023

tphung3 commented Oct 27, 2023

dthain Oct 30, 2023

Choose a reason for hiding this comment

dthain Oct 30, 2023

Choose a reason for hiding this comment

dthain left a comment

Choose a reason for hiding this comment

dthain commented Nov 2, 2023

dthain left a comment

Choose a reason for hiding this comment

dthain commented Nov 6, 2023

tphung3 commented Nov 6, 2023

dthain commented Nov 7, 2023

dthain commented Dec 5, 2023

tphung3 commented Dec 5, 2023

tphung3 commented Oct 26, 2023 •

edited

Loading

tphung3 commented Oct 26, 2023 •

edited

Loading