Commit 5859fea

User Containers: Add long running process and shutdown information (#277)
Signed-off-by: Blake Devcich <[email protected]>
1 parent 03692e9 commit 5859fea

1 file changed: +69 -2 lines changed

docs/guides/user-containers/readme.md

Lines changed: 69 additions & 2 deletions
@@ -341,13 +341,80 @@ reached before the container successfully exits, it triggers an `Error` status.
timeout begins upon entry into the PostRun phase, allowing the containers the specified period to
execute before the workflow enters an `Error` status.

#### Long-Running Processes

Some containers may run applications that are intended to run indefinitely (for example, an HTTP
server listening for requests). The copy-offload user container is one such example. These
long-running containers need a mechanism for a caller to stop them; otherwise, the `postRunTimeoutSeconds`
timeout will be reached, resulting in a workflow error.

During the PostRun phase, if containers are still running, the NNF software will attempt to send an
HTTPS POST request to the `/shutdown` endpoint on each user container. This process will continue
until the container receives the request and gracefully exits, or until the timeout is reached.

The software inside the container **must** be able to handle this request. The copy-offload user
container includes this functionality. If this functionality is not present, the container will need
to be terminated by some other means outside the container (for example, by the compute node
application when it is time to stop). The request is defined in the next section.

Alternatively, if `postRunTimeoutSeconds` is set to 0, the container exit codes will not be checked.
The software will ignore the result of the containers and proceed to the Teardown phase, where the
containers will be destroyed. This can be useful for long-running processes where the exit code is
not important.
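
The retry behavior described in this section can be pictured as a loop that keeps posting to `/shutdown` until the containers exit or the timeout expires. The sketch below is illustrative only and is not the NNF implementation: `send_shutdown`, `containers_running`, and the one-second retry interval are hypothetical stand-ins, since the actual retry cadence is not specified here. The special case of `postRunTimeoutSeconds` set to 0 is not modeled.

```python
import time
from typing import Callable


def request_shutdown_until_exit(
    send_shutdown: Callable[[], None],       # hypothetical: posts to /shutdown
    containers_running: Callable[[], bool],  # hypothetical: polls container state
    timeout_seconds: float,
    interval_seconds: float = 1.0,           # assumed retry interval
) -> bool:
    """Return True if the containers exited before the timeout, else False."""
    deadline = time.monotonic() + timeout_seconds
    while containers_running():
        if time.monotonic() >= deadline:
            return False  # timeout reached: the workflow would enter Error
        try:
            send_shutdown()
        except OSError:
            pass  # container not reachable yet (or already gone); try again
        time.sleep(interval_seconds)
    return True
```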

#### Shutdown Request via `/shutdown` endpoint

The request is sent over HTTPS. TLS is required, and the client verifies the server using a CA
certificate from the Kubernetes `nnf-dm-usercontainer-server-tls` secret.

When the workflow requests one, the request carries the workflow-specific token generated by the
NNF software. See the `requires=user-container-auth` argument in [Command
Arguments](../user-interactions/readme.md#command-arguments). Using this keyword in your directive
instructs the NNF software to create the workflow-specific token that is used here. If the `requires`
argument is not used, no token is generated and no authorization is sent in the request.

Headers:

| Header           | Required | Example Value       | Description                                                         |
|------------------|----------|---------------------|---------------------------------------------------------------------|
| Content-Type     | Yes      | application/json    | Indicates the request body is JSON                                  |
| Authorization    | Optional | Bearer TOKEN...     | Bearer token for authentication (if token is requested by workflow) |
| X-Auth-Type      | Optional | XOAUTH2             | Indicates XOAUTH2 token type (if token is requested by workflow)    |

Request body:

```json
{
  "message": "shutdown"
}
```
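As a rough illustration of how the headers and body above fit together, the following Python sketch assembles them with and without a token. The function name `build_shutdown_request` is hypothetical and is not part of the NNF software; it only mirrors the table and JSON body shown above.

```python
import json
from typing import Dict, Optional, Tuple


def build_shutdown_request(token: Optional[str] = None) -> Tuple[Dict[str, str], bytes]:
    """Assemble the headers and JSON body for a POST to /shutdown."""
    headers = {"Content-Type": "application/json"}
    if token:
        # Both auth headers are only sent when the workflow requested a token.
        headers["Authorization"] = f"Bearer {token}"
        headers["X-Auth-Type"] = "XOAUTH2"
    body = json.dumps({"message": "shutdown"}).encode("utf-8")
    return headers, body


headers, body = build_shutdown_request("TOKEN")
print(headers["Authorization"])  # Bearer TOKEN
print(len(body))                 # 23
```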
392+
393+
The following is an example request that is sent to the copy-offload user containers using TLS:
394+
395+
```http
396+
POST /shutdown HTTP/1.1
397+
Host: nnf-node1:8080
398+
Content-Type: application/json
399+
Authorization: Bearer eyJhbG...
400+
X-Auth-Type: XOAUTH2
401+
Content-Length: 23
402+
403+
{"message": "shutdown"}
404+
```
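On the receiving side, software in the container could handle a request like the one above roughly as follows. This is a minimal sketch using only the Python standard library, not the copy-offload implementation: the names `ShutdownHandler` and `make_server` are hypothetical, and TLS and token verification are omitted for brevity. A real container would need to serve HTTPS, because the NNF software sends the request over TLS.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ShutdownHandler(BaseHTTPRequestHandler):
    """Accepts POST /shutdown with {"message": "shutdown"}; rejects the rest."""

    def do_POST(self):
        if self.path != "/shutdown":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
        except ValueError:
            self.send_error(400, "invalid JSON")
            return
        if not isinstance(body, dict) or body.get("message") != "shutdown":
            self.send_error(400, "unexpected message")
            return
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()
        # shutdown() waits for serve_forever() to return, so it must be
        # called from a different thread than the one handling the request.
        threading.Thread(target=self.server.shutdown, daemon=True).start()

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet


def make_server(port: int) -> HTTPServer:
    """Create a plain-HTTP server (TLS wrapping omitted in this sketch)."""
    return HTTPServer(("", port), ShutdownHandler)
```

Calling `make_server(8080).serve_forever()` blocks until a valid shutdown request arrives, after which the process can clean up and exit with a zero status.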

#### Recap

To recap the PostRun behavior:

- If the container exits successfully, transition to `Completed` status.
- If the container exits unsuccessfully after `retryLimit` number of retries, transition to the
  `Error` status.
- If the container is running and has not exited after `postRunTimeoutSeconds` seconds, terminate
  the container and transition to the `Error` status.
- If the container is running, a POST request will be sent to the `/shutdown` endpoint on each
  container to attempt a graceful shutdown.
- If `postRunTimeoutSeconds` is set to zero, the container result will not be checked.
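
The recap above can be read as a small decision function. The sketch below is one interpretation of the bullets, not code from the NNF software: the function name, its parameters, and the lowercase labels (`result-ignored`, `retry`, `send-shutdown`) are hypothetical; only `Completed` and `Error` are statuses named in this guide.

```python
def post_run_outcome(
    exited: bool,
    exit_ok: bool,
    retries_exhausted: bool,
    elapsed_seconds: float,
    post_run_timeout_seconds: float,
) -> str:
    """Map the state of one container during PostRun to the recap's outcome."""
    if post_run_timeout_seconds == 0:
        return "result-ignored"   # exit codes are not checked; go to Teardown
    if exited:
        if exit_ok:
            return "Completed"
        # A failed exit only becomes an Error once retryLimit is used up.
        return "Error" if retries_exhausted else "retry"
    if elapsed_seconds >= post_run_timeout_seconds:
        return "Error"            # still running at the timeout: terminated
    return "send-shutdown"        # still running: POST to /shutdown
```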

### Failure Retries