Conversation

@bentsherman
Member

Close #6452

This PR changes the K8s task handler to not delete jobs when ttlSecondsAfterFinished is set. When this option is set, the K8s cluster will delete these jobs on its own.

This change should solve two issues:

  • The K8s cluster may delete a job before Nextflow tries to delete it, which causes an error, as shown in the linked issue.

  • The current approach to deleting jobs can block the task polling monitor, since the monitor must wait for each job to be deleted before proceeding to the next task in the running queue. This can delay the submission of pending jobs when the running queue is full.
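For context, ttlSecondsAfterFinished is a standard field on the Kubernetes batch/v1 Job spec: once a Job finishes, the cluster's TTL controller deletes it after the given delay. A minimal sketch of a manifest using it (names and values are illustrative, not what Nextflow actually generates):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nf-example-task          # illustrative name
spec:
  ttlSecondsAfterFinished: 300   # TTL controller deletes the Job 5 minutes after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: task
          image: ubuntu:22.04
          command: ["bash", "-c", "echo done"]
```

With this field set, an external DELETE issued after the TTL has fired will get a 404, which is exactly the race described in the linked issue.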

cc @BioWilko, it would be good if you could test this change in your environment.

@netlify

netlify bot commented Nov 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: a4ba6cf
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/6920933c8bc71b0008941941

@BioWilko
Contributor

Happy to test! How do I get the executable to do so? Should I build from the PR branch?

@BioWilko
Contributor

BioWilko commented Nov 21, 2025

Hmm, I'm not sure I've built Nextflow correctly, but I'm still getting this issue:

Caused by: nextflow.k8s.client.K8sResponseException: Request DELETE /apis/batch/v1/namespaces/ns-loman-labz/jobs/nf-6e75f8df6a75defebb479938727963c1-7e495 returned an error code=404

  {
      "kind": "Status",
      "apiVersion": "v1",
      "metadata": {
          
      },
      "status": "Failure",
      "message": "jobs.batch \"nf-6e75f8df6a75defebb479938727963c1-7e495\" not found",
      "reason": "NotFound",
      "details": {
          "name": "nf-6e75f8df6a75defebb479938727963c1-7e495",
          "group": "batch",
          "kind": "jobs"
      },
      "code": 404
  }

        at nextflow.k8s.client.K8sClient.makeRequestCall(K8sClient.groovy:669)
        at nextflow.k8s.client.K8sClient.access$0(K8sClient.groovy)
        at nextflow.k8s.client.K8sClient$_makeRequest_lambda5.doCall(K8sClient.groovy:635)
        at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
        at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
        at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
        at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
        at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
        at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
        at nextflow.k8s.client.K8sClient.apply(K8sClient.groovy:782)
        at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:635)
        at nextflow.k8s.client.K8sClient.delete(K8sClient.groovy:600)
        at nextflow.k8s.client.K8sClient.delete(K8sClient.groovy)
        at nextflow.k8s.client.K8sClient.jobDelete(K8sClient.groovy:260)
        at nextflow.k8s.K8sTaskHandler.killTask(K8sTaskHandler.groovy:500)
        at nextflow.processor.TaskHandler.kill(TaskHandler.groovy:106)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:633)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:498)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:343)
        ... 11 common frames omitted

I think I have the right commit checked out too:

jovyan:/shared/team/sam/nextflow$ git rev-parse HEAD
a807cf049225cc53dbb111b08f9232e7ea1d8b6a

@BioWilko
Contributor

BioWilko commented Nov 21, 2025

One thing: is Nextflow removing successful/failed jobs from the monitoring loop as soon as it sees that they have finished or errored and has handled those states from a workflow perspective? This seems to me to be where the performance improvements would come from.

@bentsherman
Member Author

@BioWilko since the k8s executor is in a core plugin, you'll need to build with make and use launch.sh in lieu of the nextflow binary, as described here. Building with make pack won't work because it doesn't bake the k8s plugin into the binary.
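A hedged sketch of the workflow described above (the exact make target is an assumption; check the repository's Makefile and contributor docs for the precise commands):

```sh
# Build Nextflow from source, including the core plugins (target name assumed):
make compile

# Run pipelines via launch.sh instead of the `nextflow` binary:
./launch.sh run <pipeline>
```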

From your error trace, it looks like the task monitor encountered an exception while polling and is trying to clean up all the jobs:

// -- iterate over the task and check the status
for( int i=0; i<queue.size(); i++ ) {
    final handler = queue.get(i)
    try {
        checkTaskStatus(handler)
    }
    catch (Throwable error) {
        // At this point NF assumes job is not running, but there could be errors at monitoring that could leave a job running (#5516).
        // In this case, NF needs to ensure the job is killed.
        handler.kill()
        handleException(handler, error)
    }
}

Unfortunately we can't see the original exception that interrupted the polling, but that is easily fixed 😄 I will push a few more changes for you to try.
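Making the original exception visible presumably means recording it before the cleanup runs, since the kill itself can fail and mask it. A minimal sketch of that pattern in Python (the helper names are hypothetical, not Nextflow's actual API):

```python
import logging

log = logging.getLogger("monitor")

def check_all_tasks(queue, check_fn, kill_fn, handle_fn):
    """Poll each task handler; on a polling error, log it first, then kill and report.

    Logging before kill_fn() preserves the original exception even if the
    cleanup itself fails (e.g. the job was already deleted by the cluster).
    """
    for handler in list(queue):
        try:
            check_fn(handler)
        except Exception as error:
            log.debug("Unexpected error while polling task %s", handler, exc_info=error)
            kill_fn(handler)
            handle_fn(handler, error)
```

The key design point is that the polling loop keeps going after one handler fails, so one bad task cannot stall the rest of the running queue.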

@bentsherman
Member Author

The latest changes should allow Nextflow to recover when trying to delete a K8s job that was already deleted, which should make testing easier.
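The recovery described here amounts to treating an HTTP 404 from the DELETE call as success, since a missing job means the TTL controller already removed it. A minimal sketch in Python (the exception class and function names are hypothetical stand-ins, not Nextflow's actual Groovy API):

```python
class K8sResponseError(Exception):
    """Hypothetical stand-in for nextflow.k8s.client.K8sResponseException."""
    def __init__(self, code, message):
        super().__init__(message)
        self.code = code

def delete_job(delete_fn, job_name):
    """Delete a job, treating 'already gone' (HTTP 404) as success.

    When ttlSecondsAfterFinished is set, the cluster's TTL controller may
    delete the job before we do, so a 404 here is not an error.
    """
    try:
        delete_fn(job_name)
        return True
    except K8sResponseError as e:
        if e.code == 404:
            return True  # job was already deleted, e.g. by the TTL controller
        raise
```

Any other error code (403, 500, ...) is still re-raised, so genuine API failures remain visible.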

@BioWilko
Contributor

All seems to be working on our infra :) Thanks for looking at this, it's a big QOL improvement for me!

@pditommaso pditommaso merged commit 51042db into master Nov 25, 2025
24 checks passed
@pditommaso pditommaso deleted the k8s-improve-job-cleanup-ttl branch November 25, 2025 16:06

Development

Successfully merging this pull request may close these issues.

Consider using TTL more extensively in k8s executor

4 participants