vine: tasks stuck in READY → RUNNING → WAITING_RETRIEVAL → READY loop #4038
I've occasionally run into the case where most of the tasks were scheduled and retrieved normally, while the last reduction task gets stuck in the waiting list.

Comments
What else do you know about the final task? Does the log show that it executed and was evicted, or never ran at all?
It seems that the final task went through the resource allocation process but was ultimately rejected somehow. I ran the same application about 10 times, and 8 of those runs didn't have this issue.
A few tasks are getting blocked in the waiting queue. I am using ...
It is the same for task 8225; they have been repeating this cycle for quite a while.
Is #4000 possibly relevant?
There were a handful of tasks going through this loop; most of them eventually got out of it, but several were permanently stuck.
There are valid reasons for going back to READY.
The common pattern for the ...
The last number is the task id. The first number is the reason why the task did not work. Since everything else is 0, my guess is that the tasks are being forsaken.
Right, 40 is RESULT_FORSAKEN, which may mean that an input file could not be transmitted to the worker?
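For context, the loop in the title follows from how a forsaken result is handled: the task is not reported to the application as failed, but is put back on the ready list to be retried. A minimal sketch, with illustrative names rather than the actual TaskVine definitions:

```c
/* Illustrative sketch only: hypothetical names, not the real TaskVine manager code. */
#include <stdio.h>

typedef enum { RESULT_SUCCESS = 0, RESULT_FORSAKEN = 40 } task_result_t;

/* A forsaken result means "the task could not run on that worker (for example,
   an input file never arrived)", so the manager re-queues the task instead of
   returning it to the application as failed. */
static const char *next_state(task_result_t r) {
    return (r == RESULT_FORSAKEN) ? "READY (retry)" : "RETRIEVED (done)";
}

int main(void) {
    printf("result 40 -> %s\n", next_state(RESULT_FORSAKEN));
    printf("result  0 -> %s\n", next_state(RESULT_SUCCESS));
    return 0;
}
```

If the missing input never materializes on a suitable worker, the task keeps cycling through READY -> RUNNING -> WAITING_RETRIEVAL -> READY, which matches the behavior reported above.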
You are right. I am bringing back debug files from the workers, and this is what they say:
Though my problem now becomes a straggler stuck on a worker for a long time (which was reported in #4007). That task was eventually forsaken, but it really lingered with the worker for a while.
Jin, you are looking at the big picture when you should be digging into individual facts.
The next step is not to look at the big visualization, but to track down the next detail. Why is the transfer failing?
Copy that, looking at the logs...
As the problem typically gets worse when I increase the number of replicas per temp file, I was investigating whether temp file replication interacts poorly with peer transfers. One problem seems to be that the manager is too optimistic about file replication and aggressively dispatches waiting tasks even if their input files do not yet physically exist on the target worker. On temporary file replication, the manager calls ... Then, when it finds a file replicable, it calls ... However, at this stage there are various reasons a replication request can fail, so there is no guarantee that the file will be successfully replicated within a reasonable time frame; for instance, if the worker is busy with numerous transfers, several of them simply time out. In such cases, tasks needing that input file are eventually forsaken, and the worker is blocked for a while.
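To make the suspected failure mode concrete, here is a minimal, self-contained C sketch (all names here, such as `replica_state_t` and `can_dispatch_guarded`, are hypothetical and not the actual TaskVine code) contrasting a dispatch check that trusts a replication request with one that waits for the replica to be confirmed on the worker:

```c
/* Illustrative sketch only: hypothetical types and helpers, not the TaskVine API. */
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    REPLICA_PENDING,  /* replication was requested but not yet confirmed */
    REPLICA_READY     /* the worker reported the file as physically present */
} replica_state_t;

typedef struct {
    const char     *filename;
    replica_state_t state;
} replica_t;

/* Optimistic check: treats any known replica as usable, even though the
   pending transfer may still fail or time out. */
static bool can_dispatch_optimistic(const replica_t *r) {
    return r != NULL;
}

/* Guarded check: only dispatch once the worker has confirmed the replica. */
static bool can_dispatch_guarded(const replica_t *r) {
    return r != NULL && r->state == REPLICA_READY;
}

int main(void) {
    replica_t temp = { "temp-file-1", REPLICA_PENDING };
    printf("optimistic: %d, guarded: %d\n",
           can_dispatch_optimistic(&temp), can_dispatch_guarded(&temp));
    /* The optimistic path dispatches the task now; if the pending transfer
       later times out, the task is forsaken and retried. */
    return 0;
}
```

Under the optimistic check, a task can be dispatched while its input transfer is still in flight; if that transfer times out, the task is forsaken and re-queued, which is consistent with the loop described above.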
Ok, it looks like you are getting closer. Note that a replica has two states to address exactly this problem:
A replica should be entered in the ... Is it possible that the replica is getting entered with the wrong state? Or perhaps the state is not being checked when necessary?
I opened a tentative PR in #4048. I think simply delaying the replica registering process from ...
Or do we need to register it when sending ...?
No, that's not right. If you fail to register the replica, then the manager will keep sending it over and over. The manager needs to know that the replica attempt was sent but it hasn't materialized yet. That's what ... Something else is wrong with respect to the checking of the replica state.
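A minimal sketch of that lifecycle, with made-up names (`request_replication`, `on_cache_update`) rather than the real TaskVine functions: the replica is recorded as pending the moment the transfer is requested, which prevents the manager from re-sending it, and it is only promoted to ready once the worker confirms the file is in its cache.

```c
/* Hypothetical sketch of the intended replica lifecycle; not the real TaskVine code. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { REPLICA_NONE, REPLICA_PENDING, REPLICA_READY } replica_state_t;

#define MAX_REPLICAS 16

/* One worker's view of which files the manager has sent or is sending to it. */
struct replica_table {
    const char     *filename[MAX_REPLICAS];
    replica_state_t state[MAX_REPLICAS];
    int             count;
};

static replica_state_t lookup(const struct replica_table *t, const char *f) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->filename[i], f) == 0) return t->state[i];
    return REPLICA_NONE;
}

/* Called when the manager decides to replicate a file to this worker. Recording
   PENDING here is what stops the manager from requesting the same transfer again
   on every scheduling pass, even though the file is not usable yet. */
static bool request_replication(struct replica_table *t, const char *f) {
    if (t->count >= MAX_REPLICAS) return false;
    if (lookup(t, f) != REPLICA_NONE) return false;  /* already pending or present */
    t->filename[t->count] = f;
    t->state[t->count]    = REPLICA_PENDING;
    t->count++;
    return true;                                     /* actually send the request */
}

/* Called when the worker reports that the file is now physically in its cache. */
static void on_cache_update(struct replica_table *t, const char *f) {
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->filename[i], f) == 0) t->state[i] = REPLICA_READY;
}

int main(void) {
    struct replica_table w = {0};
    printf("first request sent:  %d\n", request_replication(&w, "temp-1")); /* 1 */
    printf("second request sent: %d\n", request_replication(&w, "temp-1")); /* 0 */
    on_cache_update(&w, "temp-1");
    printf("state after update:  %d\n", lookup(&w, "temp-1"));              /* READY */
    return 0;
}
```

The key point is that pending entries suppress duplicate transfer requests without being counted as usable inputs for scheduling.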
Ah, I see what I am missing there.
Hmm, is this a bug? After sending the file, we set the state to ... Does ...
I was referring to this line of code: ... But now that I think about it, it's correct. Mini-tasks only get executed on demand, and they produce data that should not be moved between hosts. That's not the problem.
How about this? In this line of code, we check if the state of the input file is pending, and the task is rejected if any of its inputs is pending. But do we need to check the replica state as well? Like ...
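As the next reply points out, a check like this only helps if it is made against the specific worker being considered, not a global per-file state. A hypothetical sketch of such a per-worker readiness check (none of these names are from the TaskVine source):

```c
/* Hypothetical per-worker readiness check; illustrative only, not the TaskVine scheduler. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { REPLICA_NONE, REPLICA_PENDING, REPLICA_READY } replica_state_t;

/* Each candidate worker keeps its own view of which files it actually holds. */
struct worker {
    const char     *files[8];
    replica_state_t states[8];
    int             nfiles;
};

static replica_state_t replica_state_on(const struct worker *w, const char *f) {
    for (int i = 0; i < w->nfiles; i++)
        if (strcmp(w->files[i], f) == 0) return w->states[i];
    return REPLICA_NONE;
}

/* A task is only dispatchable to worker w if every input already has a READY
   replica there. A PENDING replica is not enough, since that transfer may
   still fail or time out, leaving the task to be forsaken. */
static bool inputs_ready_on(const struct worker *w, const char *inputs[], int n) {
    for (int i = 0; i < n; i++)
        if (replica_state_on(w, inputs[i]) != REPLICA_READY) return false;
    return true;
}

int main(void) {
    struct worker w = { { "temp-1", "temp-2" },
                        { REPLICA_READY, REPLICA_PENDING }, 2 };
    const char *inputs[] = { "temp-1", "temp-2" };
    printf("dispatch to this worker? %d\n", inputs_ready_on(&w, inputs, 2)); /* 0 */
    return 0;
}
```

Keying the check by worker is what makes it meaningful when the same temp file has replicas in different states on different workers.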
NVM, that function was not considering any workers...
Hmm, I'm free now for a bit; come on over and we can look at it together.
Is this solved with #4050?
I believe so.
Solved with #4076.