Conversation

@bentsherman
Member

Close #6452

This PR changes the K8s task handler to not delete jobs when ttlSecondsAfterFinished is set. When this option is set, the K8s cluster will delete these jobs on its own.

This change should solve two issues:

  • The K8s cluster may delete a job before Nextflow tries to delete it, which causes an error, as shown in the linked issue.

  • The current approach to deleting jobs can block the task polling monitor, since the monitor must wait for each job to be deleted before proceeding to the next task in the running queue. This can delay the submission of pending jobs when the running queue is full.
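For context, ttlSecondsAfterFinished is a standard field on the Kubernetes batch/v1 Job spec: once a Job finishes, the cluster's TTL controller deletes it after the given delay. A minimal sketch of a manifest using it (names and values are illustrative, not what Nextflow actually generates):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nf-example-task          # illustrative name
spec:
  ttlSecondsAfterFinished: 300   # TTL controller deletes the Job 5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: ubuntu:22.04
          command: ["bash", "-c", "echo done"]
```

With this field set, an external DELETE issued after the TTL has fired will get a 404, which is exactly the race described in the linked issue.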

cc @BioWilko, it would be good if you could test this change in your environment.

@netlify

netlify bot commented Nov 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: a4ba6cf
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/6920933c8bc71b0008941941

@BioWilko
Contributor

Happy to test! How do I get the executable to do so? Should I build from the PR branch?

@BioWilko
Contributor

BioWilko commented Nov 21, 2025

Hmm, I'm not sure I've built Nextflow correctly, but I'm still getting this issue:

Caused by: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/ns-loman-labz/jobs/nf-6e75f8df6a75defebb479938727963c1-7e495 returned an error code=404

  {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
          
      },
      "status": "Failure",
      "message": "jobs.batch \"nf-6e75f8df6a75defebb479938727963c1-7e495\" not found",
      "reason": "NotFound",
      "details": {
          "name": "nf-6e75f8df6a75defebb479938727963c1-7e495",
          "group": "batch",
          "kind": "jobs"
      },
      "code": 404
  }

        at nextflow.k8s.client.K8sClient.makeRequestCall(K8sClient.groovy:669)
        at nextflow.k8s.client.K8sClient.access$0(K8sClient.groovy)
        at nextflow.k8s.client.K8sClient$_makeRequest_lambda5.doCall(K8sClient.groovy:635)
        at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
        at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
        at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
        at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
        at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
        at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
        at nextflow.k8s.client.K8sClient.apply(K8sClient.groovy:782)
        at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:635)
        at nextflow.k8s.client.K8sClient.delete(K8sClient.groovy:600)
        at nextflow.k8s.client.K8sClient.delete(K8sClient.groovy)
        at nextflow.k8s.client.K8sClient.jobDelete(K8sClient.groovy:260)
        at nextflow.k8s.K8sTaskHandler.killTask(K8sTaskHandler.groovy:500)
        at nextflow.processor.TaskHandler.kill(TaskHandler.groovy:106)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:633)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:498)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        ... 11 common frames omitted

I think I have the right commit checked out too:

jovyan:/shared/team/sam/nextflow$ git rev-parse HEAD
a807cf049225cc53dbb111b08f9232e7ea1d8b6a

@BioWilko
Contributor

BioWilko commented Nov 21, 2025

One thing: is Nextflow removing successful/failed jobs from the monitoring loop as soon as it sees that they have finished or errored and has handled those states from a workflow perspective? This seems to me to be where the performance improvements would come from.

@bentsherman
Member Author

@BioWilko since the k8s executor is in a core plugin, you'll need to build with make and use launch.sh in lieu of the nextflow binary, as described here. Building with make pack won't work because it doesn't bake the k8s plugin into the binary.
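A hedged sketch of the workflow described above (the exact make target is an assumption; check the repository's Makefile and contributor docs for the precise commands):

```sh
# Build Nextflow from source, including the core plugins (target name assumed):
make compile

# Run pipelines via launch.sh instead of the `nextflow` binary:
./launch.sh run <pipeline>
```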

From your error trace, it looks like the task monitor encountered an exception while polling and is trying to clean up all the jobs:

// -- iterate over the task and check the status
for( int i=0; i<queue.size(); i++ ) {
    final handler = queue.get(i)
    try {
        checkTaskStatus(handler)
    }
    catch (Throwable error) {
        // At this point NF assumes job is not running, but there could be errors at monitoring that could leave a job running (#5516).
        // In this case, NF needs to ensure the job is killed.
        handler.kill()
        handleException(handler, error)
    }
}

Unfortunately we can't see the original exception that interrupted the polling, but that is easily fixed 😄 I will push a few more changes for you to try.
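Making the original exception visible presumably means recording it before the cleanup runs, since the kill itself can fail and mask it. A minimal sketch of that pattern in Python (the helper names are hypothetical, not Nextflow's actual API):

```python
import logging

log = logging.getLogger("monitor")

def check_all_tasks(queue, check_fn, kill_fn, handle_fn):
    """Poll each task handler; on a polling error, log it first, then kill and report.

    Logging before kill_fn() preserves the original exception even if the
    cleanup itself fails (e.g. the job was already deleted by the cluster).
    """
    for handler in list(queue):
        try:
            check_fn(handler)
        except Exception as error:
            log.debug("Unexpected error while polling task %s", handler, exc_info=error)
            kill_fn(handler)
            handle_fn(handler, error)
```

The key design point is that the polling loop keeps going after one handler fails, so one bad task cannot stall the rest of the running queue.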

@bentsherman
Member Author

The latest changes should allow Nextflow to recover when trying to delete a K8s job that was already deleted, which should make testing easier.
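The recovery described here amounts to treating an HTTP 404 from the DELETE call as success, since a missing job means the TTL controller already removed it. A minimal sketch in Python (the exception class and function names are hypothetical stand-ins, not Nextflow's actual Groovy API):

```python
class K8sResponseError(Exception):
    """Hypothetical stand-in for nextflow.k8s.client.K8sResponseException."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def delete_job(delete_fn, job_name):
    """Delete a job, treating 'already gone' (HTTP 404) as success.

    When ttlSecondsAfterFinished is set, the cluster's TTL controller may
    delete the job before we do, so a 404 here is not an error.
    """
    try:
        delete_fn(job_name)
        return True
    except K8sResponseError as e:
        if e.code == 404:
            return True  # job was already deleted, e.g. by the TTL controller
        raise
```

Any other error code (403, 500, ...) is still re-raised, so genuine API failures remain visible.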

@BioWilko
Contributor

All seems to be working on our infra :) Thanks for looking at this, it's a big QOL improvement for me!

@pditommaso pditommaso merged commit 51042db into master Nov 25, 2025
24 checks passed
@pditommaso pditommaso deleted the k8s-improve-job-cleanup-ttl branch November 25, 2025 16:06

Development

Successfully merging this pull request may close these issues.

Consider using TTL more extensively in k8s executor

4 participants