@@ -341,13 +341,80 @@ reached before the container successfully exits, it triggers an `Error` status.
timeout begins upon entry into the PostRun phase, allowing the containers the specified period to
execute before the workflow enters an `Error` status.

+ #### Long-Running Processes
+
+ Some containers may run applications that are intended to run indefinitely (for example, an HTTP
+ server listening for requests). The copy-offload user container is one such example. These
+ long-running containers need a mechanism for a caller to stop them; otherwise, the `postRunTimeoutSeconds`
+ timeout will be reached, resulting in a workflow error.
+
+ During the PostRun phase, if containers are still running, the NNF software will attempt to send an
+ HTTPS POST request to the `/shutdown` endpoint on each user container. This process will continue
+ until the container receives the request and gracefully exits, or until the timeout is reached.
+
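The retry behavior described above can be sketched as a deadline-bounded loop. This is a minimal illustration only, not the NNF implementation: the function name, the injected `send_shutdown` and `container_exited` callables, and the retry interval are all hypothetical, and the source does not state how often the request is re-sent.

```python
import time
from typing import Callable


def request_shutdown_until_exit(
    send_shutdown: Callable[[], bool],
    container_exited: Callable[[], bool],
    timeout_seconds: float,
    retry_interval_seconds: float = 1.0,  # assumption: interval is not given in the source
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Keep requesting shutdown until the container exits or the deadline passes.

    Returns True if the container exited in time, False if the timeout was reached.
    """
    deadline = clock() + timeout_seconds
    while clock() < deadline:
        if container_exited():
            return True
        # A failed or unanswered request is not fatal here; it is simply retried.
        send_shutdown()
        sleep(retry_interval_seconds)
    # Final check once the deadline has passed.
    return container_exited()
```

The clock and sleep functions are injected so the loop can be exercised without real delays. The sketch does not model the special case of `postRunTimeoutSeconds` set to 0.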
+ The software inside the container **must** be able to handle this request. The copy-offload user
+ container includes this functionality. If this functionality is not present, the container will need
+ to be terminated by some other means outside the container (for example, by the compute node
+ application when it is time to stop). The request is defined in the next section.
+
+ Alternatively, if `postRunTimeoutSeconds` is set to 0, the container exit codes will not be checked.
+ The software will ignore the result of the containers and proceed to the Teardown phase, where the
+ containers will be destroyed. This can be useful for long-running processes where the exit code is
+ not important.
+
+ #### Shutdown Request via `/shutdown` endpoint
+
+ The request is sent over HTTPS. TLS is required, and the client verifies the server using a CA
+ certificate from the Kubernetes `nnf-dm-usercontainer-server-tls` secret.
+
+ When a token is sent, it is the workflow-specific token generated by the NNF software. See
+ the `requires=user-container-auth` argument in [Command
+ Arguments](../user-interactions/readme.md#command-arguments). Using this keyword in your directive
+ instructs the NNF software to create the workflow-specific token that is used here. If the `requires`
+ argument is not used, then no token will be generated, and no authorization will be sent in the
+ request.
+
+ Headers:
+
+ | Header           | Required | Example Value       | Description                                                         |
+ |------------------|----------|---------------------|---------------------------------------------------------------------|
+ | Content-Type     | Yes      | application/json    | Indicates the request body is JSON                                  |
+ | Authorization    | Optional | Bearer TOKEN...     | Bearer token for authentication (if token is requested by workflow) |
+ | X-Auth-Type      | Optional | XOAUTH2             | Indicates XOAUTH2 token type (if token is requested by workflow)    |
+
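As an illustration of the header rules in the table, the following sketch builds the header set with and without a token. The function name `build_shutdown_headers` is hypothetical and is not part of the NNF software; only the header names and values come from the table above.

```python
from typing import Dict, Optional


def build_shutdown_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build the headers for the shutdown request.

    Content-Type is always present. The two authorization headers are added
    only when the workflow requested a token.
    """
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
        headers["X-Auth-Type"] = "XOAUTH2"
    return headers
```

Calling `build_shutdown_headers()` with no argument models a workflow that did not use `requires=user-container-auth`, so no authorization is included.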
+ Request body:
+
+ ```json
+ {
+   "message": "shutdown"
+ }
+ ```
+
+ The following is an example request that is sent to the copy-offload user containers using TLS:
+
+ ```http
+ POST /shutdown HTTP/1.1
+ Host: nnf-node1:8080
+ Content-Type: application/json
+ Authorization: Bearer eyJhbG...
+ X-Auth-Type: XOAUTH2
+ Content-Length: 23
+
+ {"message": "shutdown"}
+ ```
+
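For a user container that does not already include this functionality, a handler for the request above might look like the following sketch. It is an assumption-laden illustration, not the copy-offload implementation: `make_shutdown_server`, the response status codes, and the response bodies are hypothetical, since the source defines only the request. TLS is left out of the block so it stays self-contained.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Tuple


def make_shutdown_server(
    host: str, port: int, expected_token: Optional[str] = None
) -> Tuple[HTTPServer, threading.Event]:
    """Return (server, stop_event); stop_event is set after a valid POST /shutdown."""
    stop_event = threading.Event()

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the body first so the connection is left in a clean state.
            try:
                length = int(self.headers.get("Content-Length", 0))
                payload = json.loads(self.rfile.read(length))
            except ValueError:
                self._reply(400, {"error": "invalid request body"})
                return
            if self.path != "/shutdown":
                self._reply(404, {"error": "not found"})
                return
            # Token check applies only when the workflow requested a token.
            if expected_token is not None:
                if self.headers.get("Authorization") != f"Bearer {expected_token}":
                    self._reply(401, {"error": "unauthorized"})
                    return
            if not isinstance(payload, dict) or payload.get("message") != "shutdown":
                self._reply(400, {"error": "unexpected message"})
                return
            self._reply(200, {"status": "shutting down"})
            stop_event.set()

        def _reply(self, status: int, obj: dict) -> None:
            data = json.dumps(obj).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def log_message(self, fmt, *args):
            # Silence the default per-request logging to stderr.
            pass

    return HTTPServer((host, port), Handler), stop_event
```

A caller would typically run `server.serve_forever()` in a background thread, wait on `stop_event`, then call `server.shutdown()` and exit with status 0 so the workflow sees a successful exit. To serve over TLS as required, the listening socket could be wrapped with an `ssl.SSLContext` loaded with the container's certificate and key before serving.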
+ #### Recap
+
To recap the PostRun behavior:

- If the container exits successfully, transition to `Completed` status.
- If the container exits unsuccessfully after `retryLimit` number of retries, transition to the
-   `Error` status.
+   `Error` status.
- If the container is running and has not exited after `postRunTimeoutSeconds` seconds, terminate
-   the container and transition to the `Error` status.
+   the container and transition to the `Error` status.
+ - If the container is running, a POST request will be sent to the `/shutdown` endpoint on each
+   container to attempt a graceful shutdown.
+ - If `postRunTimeoutSeconds` is set to zero, the container result will not be checked.
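The recap can also be read as a small decision function. The sketch below is one possible reading, not the NNF controller logic: the names `PostRunOutcome` and `post_run_outcome` are hypothetical, and the order in which the rules are checked is an assumption.

```python
from enum import Enum
from typing import Optional


class PostRunOutcome(Enum):
    COMPLETED = "Completed"
    ERROR = "Error"
    NOT_CHECKED = "result not checked"        # postRunTimeoutSeconds == 0
    RETRYING = "retrying"                     # failed, retries remain
    SHUTDOWN_REQUESTED = "shutdown requested" # still running, POST /shutdown sent


def post_run_outcome(
    exit_code: Optional[int],  # None while the container is still running
    retries_used: int,
    retry_limit: int,
    elapsed_seconds: float,
    post_run_timeout_seconds: float,
) -> PostRunOutcome:
    # A timeout of zero means the container result is not checked at all.
    if post_run_timeout_seconds == 0:
        return PostRunOutcome.NOT_CHECKED
    if exit_code is not None:
        if exit_code == 0:
            return PostRunOutcome.COMPLETED
        # Unsuccessful exit: Error only once the retry limit is exhausted.
        if retries_used >= retry_limit:
            return PostRunOutcome.ERROR
        return PostRunOutcome.RETRYING
    # Still running: Error after the timeout, otherwise ask it to shut down.
    if elapsed_seconds >= post_run_timeout_seconds:
        return PostRunOutcome.ERROR
    return PostRunOutcome.SHUTDOWN_REQUESTED
```

For example, a container that is still running five seconds into a 30-second timeout maps to `SHUTDOWN_REQUESTED`, while the same container at 31 seconds maps to `ERROR`.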

### Failure Retries