Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vine: tasks stuck in READY → RUNNING → WAITING_RETRIEVAL → READY loop #4038

Closed
JinZhou5042 opened this issue Jan 26, 2025 · 27 comments
Closed
Assignees
Labels
bug For modifications that fix a flaw in the code. critical TaskVine

Comments

@JinZhou5042
Copy link
Member

I've occasionally run into the case where most of the tasks were scheduled and retrieved normally, while the last reduction task gets stuck in the waiting list.

Image
@JinZhou5042 JinZhou5042 changed the title vine: occasionally the last reduction task is retained vine: occasionally the last reduction task is retained forever Jan 27, 2025
@dthain
Copy link
Member

dthain commented Jan 27, 2025

What else do you know about the final task? Does the log show that it executed and was evicted, or never ran at all?

@JinZhou5042
Copy link
Member Author

It seems that the final task went through the resource allocation process but was finally rejected somehow, I ran the same application for ~10 times but 8 of them didn't have such issue.

@JinZhou5042
Copy link
Member Author

A few tasks are getting blocked in the waiting queue

Image

I am using tail -f to monitor the debug file, it is weird that task 5604 is repeating the state circle READY -> RUNNING -> WAITING_RETRIEVAL -> READY over and over again

2025/01/30 09:47:31.76 vine_manager[569695] vine: Task 5604 state change: READY (1) to RUNNING (2)
2025/01/30 09:47:31.76 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): alive
2025/01/30 09:47:31.76 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): resources
2025/01/30 09:47:31.76 vine_manager[569695] vine: cores 20
2025/01/30 09:47:31.76 vine_manager[569695] vine: memory 20480
2025/01/30 09:47:31.76 vine_manager[569695] vine: disk 307201
2025/01/30 09:47:31.76 vine_manager[569695] vine: gpus 0
2025/01/30 09:47:31.76 vine_manager[569695] vine: workers 1
2025/01/30 09:47:31.76 vine_manager[569695] vine: tag 8225
2025/01/30 09:47:31.76 vine_manager[569695] vine: end
2025/01/30 09:47:31.76 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): complete 40 0 0 0 0 0 0 8225
2025/01/30 09:47:31.76 vine_manager[569695] vine: Task 8225 state change: RUNNING (2) to WAITING_RETRIEVAL (3)
2025/01/30 09:47:31.76 vine_manager[569695] vine: tx to d12chas054.crc.nd.edu (10.32.81.233:52918): unlink file-meta-9e6a98e2629a469117822e01c539c01a
2025/01/30 09:47:31.76 vine_manager[569695] vine: tx to d12chas054.crc.nd.edu (10.32.81.233:52918): kill 8225
2025/01/30 09:47:31.76 vine_manager[569695] vine: Task 8225 state change: WAITING_RETRIEVAL (3) to READY (1)
2025/01/30 09:47:31.77 vine_manager[569695] vine: d12chas054.crc.nd.edu (10.32.81.233:52918) done in 0.00s total tasks 50 average 8.90s
2025/01/30 09:47:31.77 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): info tasks_running 0
2025/01/30 09:47:31.77 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): alive
2025/01/30 09:47:31.77 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): resources
2025/01/30 09:47:31.77 vine_manager[569695] vine: cores 20
2025/01/30 09:47:31.77 vine_manager[569695] vine: memory 20480
2025/01/30 09:47:31.77 vine_manager[569695] vine: disk 307201
2025/01/30 09:47:31.77 vine_manager[569695] vine: gpus 0
2025/01/30 09:47:31.77 vine_manager[569695] vine: workers 1
2025/01/30 09:47:31.77 vine_manager[569695] vine: tag 8225
2025/01/30 09:47:31.77 vine_manager[569695] vine: end
2025/01/30 09:47:31.77 vine_manager[569695] vine: rx from d12chas054.crc.nd.edu (10.32.81.233:52918): complete 40 0 0 0 0 0 0 5604
2025/01/30 09:47:31.77 vine_manager[569695] vine: Task 5604 state change: RUNNING (2) to WAITING_RETRIEVAL (3)
2025/01/30 09:47:31.77 vine_manager[569695] vine: tx to d12chas054.crc.nd.edu (10.32.81.233:52918): unlink file-meta-3af82e4487bb3730086b0ccff0fa6064
2025/01/30 09:47:31.77 vine_manager[569695] vine: tx to d12chas054.crc.nd.edu (10.32.81.233:52918): kill 5604
2025/01/30 09:47:31.77 vine_manager[569695] vine: Task 5604 state change: WAITING_RETRIEVAL (3) to READY (1)

It is the same for task 8225, they have been repeating the circle for quite a while

2025/01/30 09:50:47.37 vine_manager[569695] vine: Task 8225 state change: READY (1) to RUNNING (2)
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas079.crc.nd.edu (10.32.82.27:38006): alive
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas079.crc.nd.edu (10.32.82.27:38006): resources
2025/01/30 09:50:47.37 vine_manager[569695] vine: cores 20
2025/01/30 09:50:47.37 vine_manager[569695] vine: memory 20480
2025/01/30 09:50:47.37 vine_manager[569695] vine: disk 307201
2025/01/30 09:50:47.37 vine_manager[569695] vine: gpus 0
2025/01/30 09:50:47.37 vine_manager[569695] vine: workers 1
2025/01/30 09:50:47.37 vine_manager[569695] vine: tag 8225
2025/01/30 09:50:47.37 vine_manager[569695] vine: end
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas079.crc.nd.edu (10.32.82.27:38006): complete 40 0 0 0 0 0 0 5604
2025/01/30 09:50:47.37 vine_manager[569695] vine: Task 5604 state change: RUNNING (2) to WAITING_RETRIEVAL (3)
2025/01/30 09:50:47.37 vine_manager[569695] vine: tx to d12chas079.crc.nd.edu (10.32.82.27:38006): unlink file-meta-3af82e4487bb3730086b0ccff0fa6064
2025/01/30 09:50:47.37 vine_manager[569695] vine: tx to d12chas079.crc.nd.edu (10.32.82.27:38006): kill 5604
2025/01/30 09:50:47.37 vine_manager[569695] vine: Task 5604 state change: WAITING_RETRIEVAL (3) to READY (1)
2025/01/30 09:50:47.37 vine_manager[569695] vine: d12chas079.crc.nd.edu (10.32.82.27:38006) done in 0.00s total tasks 691 average 5.15s
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas079.crc.nd.edu (10.32.82.27:38006): info tasks_running 0
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas083.crc.nd.edu (10.32.82.35:52014): alive
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas083.crc.nd.edu (10.32.82.35:52014): resources
2025/01/30 09:50:47.37 vine_manager[569695] vine: cores 20
2025/01/30 09:50:47.37 vine_manager[569695] vine: memory 20480
2025/01/30 09:50:47.37 vine_manager[569695] vine: disk 307201
2025/01/30 09:50:47.37 vine_manager[569695] vine: gpus 0
2025/01/30 09:50:47.37 vine_manager[569695] vine: workers 1
2025/01/30 09:50:47.37 vine_manager[569695] vine: tag 8225
2025/01/30 09:50:47.37 vine_manager[569695] vine: end
2025/01/30 09:50:47.37 vine_manager[569695] vine: rx from d12chas083.crc.nd.edu (10.32.82.35:52014): complete 40 0 0 0 0 0 0 8225
2025/01/30 09:50:47.37 vine_manager[569695] vine: Task 8225 state change: RUNNING (2) to WAITING_RETRIEVAL (3)
2025/01/30 09:50:47.37 vine_manager[569695] vine: tx to d12chas083.crc.nd.edu (10.32.82.35:52014): unlink file-meta-9e6a98e2629a469117822e01c539c01a
2025/01/30 09:50:47.37 vine_manager[569695] vine: tx to d12chas083.crc.nd.edu (10.32.82.35:52014): kill 8225
2025/01/30 09:50:47.37 vine_manager[569695] vine: Task 8225 state change: WAITING_RETRIEVAL (3) to READY (1)

@JinZhou5042
Copy link
Member Author

Is #4000 possibly relavant?

@JinZhou5042 JinZhou5042 added bug For modifications that fix a flaw in the code. TaskVine labels Jan 30, 2025
@JinZhou5042 JinZhou5042 changed the title vine: occasionally the last reduction task is retained forever vine: tasks stuck in READY → RUNNING → WAITING_RETRIEVAL → READY loop Jan 30, 2025
@btovar
Copy link
Member

btovar commented Jan 30, 2025

#4000 should be fixed with #4002 ?

@JinZhou5042
Copy link
Member Author

There were a handful of tasks going through this loop, most of them eventually went out of it, but several were permanantly stuck.

@dthain
Copy link
Member

dthain commented Jan 30, 2025

There are valid reasons for going back to READy.
Look at the ‘complete’ message coming back from the worker. What does it indicate?

@JinZhou5042
Copy link
Member Author

The common pattern for the complete messages is complete 40 0 0 0 0 0 0 xxx where the last item is a changeable number, but the others (40 0 0 0 0 0 0) are seemingly to be fixed.

@btovar
Copy link
Member

btovar commented Jan 30, 2025

The last number is the task id. The first number is the reason why the task did not work. Since everything else is 0, my guess is that the tasks are being forsaken.

@btovar
Copy link
Member

btovar commented Jan 30, 2025

Right, 40 is RESULT_FORSAKEN, which may mean that an input file could not be transmitted to the worker?

@JinZhou5042
Copy link
Member Author

JinZhou5042 commented Jan 30, 2025

You are right, I am bringing back debug files from workers, this is what it says:

2025/01/30 11:28:21.86 vine_worker[1478313] vine: cache: error when creating buffer-rnd-rjjksxpzaasgklw via mini task: Could not establish connection with worker at: 10.32.81.233:1024
2025/01/30 11:28:21.86 vine_worker[1478299] vine: cache: transfer process for buffer-rnd-rjjksxpzaasgklw (pid 1478313) exited normally with exit code 1
2025/01/30 11:28:21.86 vine_worker[1478299] vine: cache: transfer process for buffer-rnd-rjjksxpzaasgklw failed
2025/01/30 11:28:21.86 vine_worker[1478299] vine: cache: command failed to complete for buffer-rnd-rjjksxpzaasgklw
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending Cache invalid transfer id: 712CAA62-0567-4854-BF45-54EA6AD66821
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: cache-invalid buffer-rnd-rjjksxpzaasgklw 96 712CAA62-0567-4854-BF45-54EA6AD66821
2025/01/30 11:28:21.86 vine_worker[1478299] debug: trashing file /var/condor/execute/dir_1477813/worker-241627-1478299/transfer/buffer-rnd-rjjksxpzaasgklw.error
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 220 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 220 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 131 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 131 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 173 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 173 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 134 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 134 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 145 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 145 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 219 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 219 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.86 vine_worker[1478299] vine: task 209 requires failed input cache transfers
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Waiting task 209 has been forsaken.
2025/01/30 11:28:21.86 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.86 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.86 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 162 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 162 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 163 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 163 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 139 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 139 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 205 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 205 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 140 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 140 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 161 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 161 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 181 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 181 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 201 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 201 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 158 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 158 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 159 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 159 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 180 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 180 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 165 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 165 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: task 138 requires failed input cache transfers
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Waiting task 138 has been forsaken.
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: alive
2025/01/30 11:28:21.87 vine_worker[1478299] vine: Sending resource description to manager:
2025/01/30 11:28:21.87 vine_worker[1478299] vine:    cores      0 inuse     20 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:   memory      0 inuse  20480 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     disk     49 inuse 307201 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:     gpus      0 inuse      0 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine:  workers      0 inuse      1 total
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 131
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 134
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 138
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 139
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 140
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 145
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 158
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 159
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 161
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 162
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 163
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 165
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 173
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 180
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 181
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 201
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 205
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 209
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 219
2025/01/30 11:28:21.87 vine_worker[1478299] vine: tx: complete 40 0 0 0 0 0 0 220

Though my problem becomes to a straggler stuck on a worker for a long time (which was reported in #4007)

Image

That task was eventually forsaken, but it really lingered with the worker for a while

Image

@dthain
Copy link
Member

dthain commented Jan 30, 2025

Jin, are looking at the big picture when you should be digging into individual facts.

  • The task never returns. Why? Look at the log.
  • The manager says it sends it to the worker, but the workers sends it back. Why? Look at the log.
  • The worker says that it can't transfer the input file, and that's why it sends it back. Why? Look at the log.

The next step is not to look at the big visualization, but to track down the next detail. Why is the transfer failing?

@JinZhou5042
Copy link
Member Author

Copy that, looking at the logs...

@JinZhou5042
Copy link
Member Author

As it typically severes when I increase the number of replicas per temp file, I was investigating if the temp file replication interacts poorly with peer transfers. And one problem seems to be that the manager is too optimistical about file replication and is agreesively dispatching waiting tasks even if their input files have not been physically existing on the target worker.

On temporary file replication, the manager calls vine_file_replica_table_replicate to determine whether any file that has not yet reached its required replica count can be replicated. This decision is made with respect to a series of constraints, such as m->file_source_max_transfers and m->worker_source_max_transfers.

Then, when it finds a file replicable, it calls vine_manager_put_url_now, which first sends a puturl_now message to the destination worker with the source IP address and the file cache name, then puts the replica in the file_replica_table, which updates the destination worker’s current_files data structure as if the file already exists in its cache directory. Consequently, tasks needing the file can be dispatched since it appears in w->current_files, even though the manager has not yet received a cache-update message from that worker about that input file.

However, at this stage, there are various reasons that a replication request can be failed and therefore is no guarantee that the file will be successfully replicated within a reasonable time frame, for instance, if the worker is busy with numerous transfers, several of them just time out.

In such cases, tasks needing that input file are eventually forsaken, and the worker will be blocked for a while.

@dthain
Copy link
Member

dthain commented Jan 30, 2025

Ok, it looks like you are getting closer.

Note that a replica has two states to address exactly this problem:

A replica should be entered in the PENDING state when the puturl is sent, and then updated to READY when cache-update is received.

Is it possible that the replica is getting entered with the wrong state?

Or perhaps the state is not being checked when necessary?

@JinZhou5042
Copy link
Member Author

I opened a tentative pr in #4048, I think simply delaying the replica registering process from put_url to cache-update would solve it?

@JinZhou5042
Copy link
Member Author

Or do we need to register it when sending put_url, but should consider the replicating state at task scheduling?

@dthain
Copy link
Member

dthain commented Jan 30, 2025

No, that's not right. If you fail to register the replica, then the manager will keep sending it over and over. The manager needs to know that the replica attempt was sent put it hasn't materialized yet. That's what PENDING vs READY is for.

Something else is wrong wrt to the checking of the replica state.

@JinZhou5042
Copy link
Member Author

Ah, I see what I am missing there.

@dthain
Copy link
Member

dthain commented Jan 30, 2025

Hmm, is this a bug?

After sending the file, we set the state to PENDING (if it will be created by the worker) or READY if we are sending the literal data and not expecting a cache-update.

Does VINE_MINI_TASK belong in the first case instead of the second case?

@dthain
Copy link
Member

dthain commented Jan 30, 2025

I was referring to this line of code:
https://github.com/cooperative-computing-lab/cctools/blob/master/taskvine/src/manager/vine_manager_put.c#L451

But now that I think about it, it's correct. Mini-tasks only get executed on demand, and also produce data that should not be moved between hosts. That's not the problem.

@JinZhou5042
Copy link
Member Author

How about this?

In this line of code, we check if the state of the input file is pending, task is rejected if any of the inputs is pending.

But we need to check the replica state as well?

Like

struct vine_file_replica *replica = vine_file_replica_table_lookup(w, f->cached_name);
if (f->state != VINE_FILE_STATE_CREATED) {
    all_available = 0;
}

@JinZhou5042
Copy link
Member Author

JinZhou5042 commented Jan 30, 2025

NVM, that function was not considering any workers... check_worker_against_task seems to be relavent, but it seems to me that sending tasks to a worker without inputs guarantee is a feature rather than a bug?

@dthain
Copy link
Member

dthain commented Jan 30, 2025

Hmm, I'm free now for a bit come on over and we can look at it together.

@dthain
Copy link
Member

dthain commented Feb 5, 2025

Is this solved with #4050?

@JinZhou5042
Copy link
Member Author

I believe so

@JinZhou5042
Copy link
Member Author

solved with #4076

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug For modifications that fix a flaw in the code. critical TaskVine
Projects
None yet
Development

No branches or pull requests

3 participants