vine: tasks stuck in READY → RUNNING → WAITING_RETRIEVAL → READY loop #4038
I've occasionally run into the case where most of the tasks were scheduled and retrieved normally, while the last reduction task gets stuck in the waiting list.

Comments
What else do you know about the final task? Does the log show that it executed and was evicted, or never ran at all?
It seems that the final task went through the resource allocation process but was ultimately rejected somehow. I ran the same application about 10 times, and 8 of those runs didn't have this issue.
A few tasks are getting blocked in the waiting queue. I am using ...
It is the same for task 8225; they have been repeating this cycle for quite a while.
Is #4000 possibly relevant?
There were a handful of tasks going through this loop; most of them eventually got out of it, but several were permanently stuck.
There are valid reasons for going back to READY.
The common pattern for the ...
The last number is the task id. The first number is the reason why the task did not work. Since everything else is 0, my guess is that the tasks are being forsaken.
Right, 40 is RESULT_FORSAKEN, which may mean that an input file could not be transmitted to the worker?
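For context, the loop in the title follows from how a forsaken result is handled: the task is not reported to the application as failed, but is put back on the ready list to be retried. A minimal sketch, with illustrative names rather than the actual TaskVine definitions:

```c
/* Illustrative sketch only: hypothetical names, not the real TaskVine manager code. */
#include <stdio.h>

typedef enum { RESULT_SUCCESS = 0, RESULT_FORSAKEN = 40 } task_result_t;

/* A forsaken result means "the task could not run on that worker (for example,
   an input file never arrived)", so the manager re-queues the task instead of
   returning it to the application as failed. */
static const char *next_state(task_result_t r) {
    return (r == RESULT_FORSAKEN) ? "READY (retry)" : "RETRIEVED (done)";
}

int main(void) {
    printf("result 40 -> %s\n", next_state(RESULT_FORSAKEN));
    printf("result  0 -> %s\n", next_state(RESULT_SUCCESS));
    return 0;
}
```

If the missing input never materializes on a suitable worker, the task keeps cycling through READY -> RUNNING -> WAITING_RETRIEVAL -> READY, which matches the behavior reported above.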
You are right. I am bringing back debug files from the workers, and this is what they say:
Though my problem now becomes a straggler stuck on a worker for a long time (which was reported in #4007). That task was eventually forsaken, but it really lingered with the worker for a while.
Jin, you are looking at the big picture when you should be digging into individual facts.
The next step is not to look at the big visualization, but to track down the next detail. Why is the transfer failing?
Copy that, looking at the logs...
As the problem typically gets worse when I increase the number of replicas per temp file, I was investigating whether temp file replication interacts poorly with peer transfers. One problem seems to be that the manager is too optimistic about file replication and aggressively dispatches waiting tasks even if their input files do not yet physically exist on the target worker. On temporary file replication, the manager calls ... Then, when it finds a file replicable, it calls ... However, at this stage there are various reasons a replication request can fail, so there is no guarantee that the file will be successfully replicated within a reasonable time frame; for instance, if the worker is busy with numerous transfers, several of them simply time out. In such cases, tasks needing that input file are eventually forsaken, and the worker is blocked for a while.
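To make the suspected failure mode concrete, here is a minimal, self-contained C sketch (all names here, such as `replica_state_t` and `can_dispatch_guarded`, are hypothetical and not the actual TaskVine code) contrasting a dispatch check that trusts a replication request with one that waits for the replica to be confirmed on the worker:

```c
/* Illustrative sketch only: hypothetical types and helpers, not the TaskVine API. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    REPLICA_PENDING,  /* replication was requested but not yet confirmed */
    REPLICA_READY     /* the worker reported the file as physically present */
} replica_state_t;

typedef struct {
    const char     *filename;
    replica_state_t state;
} replica_t;

/* Optimistic check: treats any known replica as usable, even though the
   pending transfer may still fail or time out. */
static bool can_dispatch_optimistic(const replica_t *r) {
    return r != NULL;
}

/* Guarded check: only dispatch once the worker has confirmed the replica. */
static bool can_dispatch_guarded(const replica_t *r) {
    return r != NULL && r->state == REPLICA_READY;
}

int main(void) {
    replica_t temp = { "temp-file-1", REPLICA_PENDING };
    printf("optimistic: %d, guarded: %d\n",
           can_dispatch_optimistic(&temp), can_dispatch_guarded(&temp));
    /* The optimistic path dispatches the task now; if the pending transfer
       later times out, the task is forsaken and retried. */
    return 0;
}
```

Under the optimistic check, a task can be dispatched while its input transfer is still in flight; if that transfer times out, the task is forsaken and re-queued, which is consistent with the loop described above.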
Ok, it looks like you are getting closer. Note that a replica has two states to address exactly this problem:
A replica should be entered in the ... Is it possible that the replica is getting entered with the wrong state? Or perhaps the state is not being checked when necessary?
I opened a tentative PR in #4048. I think simply delaying the replica registering process from ...
Or do we need to register it when sending ...?
No, that's not right. If you fail to register the replica, then the manager will keep sending it over and over. The manager needs to know that the replica attempt was sent but it hasn't materialized yet. That's what ... Something else is wrong with respect to the checking of the replica state.
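A minimal sketch of that lifecycle, with made-up names (`request_replication`, `on_cache_update`) rather than the real TaskVine functions: the replica is recorded as pending the moment the transfer is requested, which prevents the manager from re-sending it, and it is only promoted to ready once the worker confirms the file is in its cache.

```c
/* Hypothetical sketch of the intended replica lifecycle; not the real TaskVine code. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { REPLICA_NONE, REPLICA_PENDING, REPLICA_READY } replica_state_t;

#define MAX_REPLICAS 16

/* One worker's view of which files the manager has sent or is sending to it. */
struct replica_table {
    const char     *filename[MAX_REPLICAS];
    replica_state_t state[MAX_REPLICAS];
    int             count;
};

static replica_state_t lookup(const struct replica_table *t, const char *f) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->filename[i], f) == 0) return t->state[i];
    return REPLICA_NONE;
}

/* Called when the manager decides to replicate a file to this worker. Recording
   PENDING here is what stops the manager from requesting the same transfer again
   on every scheduling pass, even though the file is not usable yet. */
static bool request_replication(struct replica_table *t, const char *f) {
    if (t->count >= MAX_REPLICAS) return false;
    if (lookup(t, f) != REPLICA_NONE) return false;  /* already pending or present */
    t->filename[t->count] = f;
    t->state[t->count]    = REPLICA_PENDING;
    t->count++;
    return true;                                     /* actually send the request */
}

/* Called when the worker reports that the file is now physically in its cache. */
static void on_cache_update(struct replica_table *t, const char *f) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->filename[i], f) == 0) t->state[i] = REPLICA_READY;
}

int main(void) {
    struct replica_table w = {0};
    printf("first request sent:  %d\n", request_replication(&w, "temp-1")); /* 1 */
    printf("second request sent: %d\n", request_replication(&w, "temp-1")); /* 0 */
    on_cache_update(&w, "temp-1");
    printf("state after update:  %d\n", lookup(&w, "temp-1"));              /* READY */
    return 0;
}
```

The key point is that pending entries suppress duplicate transfer requests without being counted as usable inputs for scheduling.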
Ah, I see what I am missing there.
Hmm, is this a bug? After sending the file, we set the state to ... Does ...
I was referring to this line of code: ... But now that I think about it, it's correct. Mini-tasks only get executed on demand, and they produce data that should not be moved between hosts. That's not the problem.
How about this? In this line of code, we check if the state of the input file is pending, and the task is rejected if any of its inputs is pending. But do we need to check the replica state as well? Like ...
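As the next reply points out, a check like this only helps if it is made against the specific worker being considered, not a global per-file state. A hypothetical sketch of such a per-worker readiness check (none of these names are from the TaskVine source):

```c
/* Hypothetical per-worker readiness check; illustrative only, not the TaskVine scheduler. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { REPLICA_NONE, REPLICA_PENDING, REPLICA_READY } replica_state_t;

/* Each candidate worker keeps its own view of which files it actually holds. */
struct worker {
    const char     *files[8];
    replica_state_t states[8];
    int             nfiles;
};

static replica_state_t replica_state_on(const struct worker *w, const char *f) {
    for (int i = 0; i < w->nfiles; i++)
        if (strcmp(w->files[i], f) == 0) return w->states[i];
    return REPLICA_NONE;
}

/* A task is only dispatchable to worker w if every input already has a READY
   replica there. A PENDING replica is not enough, since that transfer may
   still fail or time out, leaving the task to be forsaken. */
static bool inputs_ready_on(const struct worker *w, const char *inputs[], int n) {
    for (int i = 0; i < n; i++)
        if (replica_state_on(w, inputs[i]) != REPLICA_READY) return false;
    return true;
}

int main(void) {
    struct worker w = { { "temp-1", "temp-2" },
                        { REPLICA_READY, REPLICA_PENDING }, 2 };
    const char *inputs[] = { "temp-1", "temp-2" };
    printf("dispatch to this worker? %d\n", inputs_ready_on(&w, inputs, 2)); /* 0 */
    return 0;
}
```

Keying the check by worker is what makes it meaningful when the same temp file has replicas in different states on different workers.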
NVM, that function was not considering any workers...
Hmm, I'm free now for a bit; come on over and we can look at it together.
Is this solved with #4050?
I believe so.
Solved with #4076.