Do not delete K8s jobs when ttlSecondsAfterFinished is set #6597
Conversation
Signed-off-by: Ben Sherman <[email protected]>
✅ Deploy Preview for nextflow-docs-staging canceled.
Happy to test! How do I get the executable to do so? Should I build from the PR branch?
Hmmm, I'm not sure I've built Nextflow correctly, but I'm still getting this issue. I think I have the right commit checked out too.
One thing: is Nextflow removing successful/failed jobs from the monitoring loop as soon as it sees they have finished or errored and has handled those states from a workflow perspective? That seems to me to be where the performance improvements would come from.
@BioWilko since the K8s executor is in a core plugin, you'll need to build with …

From your error trace, it looks like the task monitor encountered an exception while polling and is trying to clean up all the jobs:

nextflow/modules/nextflow/src/main/groovy/nextflow/processor/TaskPollingMonitor.groovy (lines 624 to 636 in d440040)
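For context, a rough Groovy sketch of the failure path being described. This is not the actual TaskPollingMonitor source; the class and method names (PollingMonitorSketch, pollLoop, cleanup, kill) are illustrative stand-ins for the pattern of catching a polling exception and then synchronously deleting every job left in the running queue.

```groovy
// Illustrative sketch only -- not the real Nextflow code.
// If the polling loop dies with an exception, a cleanup step walks the
// running queue and kills (i.e. deletes) every remaining job, one by one.
class PollingMonitorSketch {

    private final List<String> runningQueue = ['job-a', 'job-b', 'job-c']

    void run() {
        try {
            pollLoop()
        }
        catch( Exception e ) {
            // the original exception is swallowed here, which is why the
            // error trace only shows the cleanup, not the real cause
            cleanup()
        }
    }

    private void pollLoop() {
        // placeholder for the real polling logic
        throw new IllegalStateException('simulated polling failure')
    }

    private void cleanup() {
        // each kill() is a synchronous delete call, so a slow or
        // already-deleted job holds up everything behind it
        runningQueue.each { jobName -> kill(jobName) }
    }

    private void kill(String jobName) {
        println "Deleting job: $jobName"
    }

    static void main(String[] args) {
        new PollingMonitorSketch().run()
    }
}
```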
Unfortunately we can't see the original exception that interrupted the polling, but that is easily fixed 😄 I will push a few more changes for you to try.
Signed-off-by: Ben Sherman <[email protected]>
The latest changes should allow Nextflow to recover when trying to delete a K8s job that was already deleted, which should make testing easier.
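A minimal sketch of that recovery behaviour, assuming the delete call surfaces the HTTP status code through an exception. The names used here (JobCleanupSketch, K8sResponseException, deleteJob) are hypothetical, not the real Nextflow K8s client API.

```groovy
// Hypothetical exception type carrying the HTTP status of a failed K8s call
class K8sResponseException extends RuntimeException {
    final int statusCode
    K8sResponseException(int statusCode, String message) {
        super(message)
        this.statusCode = statusCode
    }
}

class JobCleanupSketch {

    void safeDelete(String jobName) {
        try {
            deleteJob(jobName)
        }
        catch( K8sResponseException e ) {
            if( e.statusCode == 404 ) {
                // the cluster (e.g. via ttlSecondsAfterFinished) already
                // removed the job -- nothing left to do, don't fail the run
                println "Job $jobName already deleted, ignoring"
                return
            }
            throw e
        }
    }

    private void deleteJob(String jobName) {
        // stand-in for the real API call; pretend the job is already gone
        throw new K8sResponseException(404, "jobs.batch \"$jobName\" not found")
    }

    static void main(String[] args) {
        new JobCleanupSketch().safeDelete('nf-abc123')
    }
}
```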
All seems to be working on our infra :) Thanks for looking at this, big QOL improvements for me!
Close #6452
This PR changes the K8s task handler to not delete jobs when ttlSecondsAfterFinished is set. When this option is set, the K8s cluster will delete these jobs on its own (see the config sketch after the list below). This change should solve two issues:
1. It is possible for the K8s cluster to delete the job before Nextflow tries to delete it, which causes an error as shown in the linked issue.
2. The current approach to deleting jobs can block the task polling monitor, since the monitor must wait for the job to be deleted before proceeding to the next task in the running queue. This can delay the submission of pending jobs when the running queue is full.
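For anyone testing, a minimal nextflow.config sketch of the setup this PR targets: tasks submitted as K8s Jobs with a TTL so the cluster cleans them up itself. The pod option name is assumed here to mirror the K8s Job field ttlSecondsAfterFinished; check the option names against your Nextflow version's documentation before relying on this.

```groovy
// nextflow.config -- minimal sketch, option names to be verified
process {
    executor = 'k8s'
    // let the cluster delete finished jobs itself after 5 minutes
    pod = [ [ttlSecondsAfterFinished: 300] ]
}

k8s {
    // submit tasks as Jobs rather than bare Pods
    computeResourceType = 'Job'
}
```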
cc @BioWilko, it would be good if you could test this change in your environment.