71 changes: 69 additions & 2 deletions docs/guides/user-containers/readme.md
@@ -341,13 +341,80 @@ reached before the container successfully exits, it triggers an `Error` status.
timeout begins upon entry into the PostRun phase, allowing the containers the specified period to
execute before the workflow enters an `Error` status.

#### Long-Running Processes

Some containers may run applications that are intended to run indefinitely (for example, an HTTP
server listening for requests). The copy-offload user container is one such example. These
long-running containers need a mechanism for a caller to stop them; otherwise, the `postRunTimeoutSeconds`
timeout will be reached, resulting in a workflow error.

During the PostRun phase, if containers are still running, the NNF software attempts to send an
HTTPS POST request to the `/shutdown` endpoint on each user container. It repeats the request
until the container receives it and exits gracefully, or until the timeout is reached.
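
The retry behavior described above can be sketched as a simple loop. This is only an illustration,
not the NNF implementation: the function name, the injected callables, and the one-second retry
interval are all assumptions made for the example.

```python
import time
from typing import Callable


def request_shutdown_until_exit(
    send_shutdown: Callable[[], None],
    container_running: Callable[[], bool],
    timeout_seconds: float,
    retry_interval_seconds: float = 1.0,  # assumed value; the real interval is not documented here
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the container exited before the timeout, otherwise False.

    send_shutdown is a hypothetical callable that posts to /shutdown and
    handles its own errors, so a failed attempt is simply retried.
    """
    deadline = clock() + timeout_seconds
    while container_running():
        if clock() >= deadline:
            return False  # timeout reached while the container was still running
        send_shutdown()
        sleep(retry_interval_seconds)
    return True
```

The clock and sleep functions are parameters only so the loop can be exercised without real
delays. A timeout of zero is a separate case, described below, and is not modeled here.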

The software inside the container **must** be able to handle this request. The copy-offload user
container includes this functionality. If this functionality is not present, the container will need
to be terminated by some other means outside the container (for example, by the compute node
application when it is time to stop). The request is defined in the next section.

Alternatively, if `postRunTimeoutSeconds` is set to 0, the container exit codes will not be checked.
The software will ignore the result of the containers and proceed to the Teardown phase, where the
containers will be destroyed. This can be useful for long-running processes where the exit code is
not important.

#### Shutdown Request via the `/shutdown` Endpoint

The request is sent over HTTPS. TLS is required, and the client verifies the server using a CA
certificate from the Kubernetes `nnf-dm-usercontainer-server-tls` secret.

When the workflow requests one, the bearer token is the workflow-specific token generated by the
NNF software. See the `requires=user-container-auth` argument in [Command
Arguments](../user-interactions/readme.md#command-arguments). Using this keyword in your directive
instructs the NNF software to create a workflow-specific token, which is then sent with the
shutdown request. If the `requires` argument is not used, no token is generated and no
authorization is sent in the request.

Headers:

| Header | Required | Example Value | Description |
|------------------|----------|---------------------|---------------------------------------------------------------------|
| Content-Type | Yes | application/json | Indicates the request body is JSON |
| Authorization | Optional | Bearer TOKEN... | Bearer token for authentication (if token is requested by workflow) |
| X-Auth-Type | Optional | XOAUTH2 | Indicates XOAUTH2 token type (if token is requested by workflow) |
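
As a rough illustration of the table above, the headers could be assembled as follows. The helper
name `build_shutdown_headers` is hypothetical and is not part of the NNF software; it only shows
that the two authorization headers appear together, and only when a token exists.

```python
from typing import Dict, Optional


def build_shutdown_headers(token: Optional[str] = None) -> Dict[str, str]:
    """Build the headers for a /shutdown request (hypothetical helper)."""
    # Content-Type is always required because the body is JSON.
    headers = {"Content-Type": "application/json"}
    if token:
        # Both headers are sent only when the workflow requested a token.
        headers["Authorization"] = f"Bearer {token}"
        headers["X-Auth-Type"] = "XOAUTH2"
    return headers
```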

Request body:

```json
{
  "message": "shutdown"
}
```

The following is an example request that is sent to the copy-offload user containers using TLS:

```http
POST /shutdown HTTP/1.1
Host: nnf-node1:8080
Content-Type: application/json
Authorization: Bearer eyJhbG...
X-Auth-Type: XOAUTH2
Content-Length: 23

{"message": "shutdown"}
```
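
For containers that do not already handle this request, the following is a minimal sketch of a
handler using only the Python standard library. It is not the copy-offload implementation. The
names `ShutdownHandler` and `make_server`, the `certfile` and `keyfile` parameters, and the default
port (taken from the example above) are assumptions. The sketch does not validate the bearer
token, which a real service would likely want to do.

```python
import json
import ssl
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class ShutdownHandler(BaseHTTPRequestHandler):
    """Handle POST /shutdown; any other POST path returns 404."""

    def do_POST(self):
        if self.path != "/shutdown":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "body is not valid JSON")
            return
        if not isinstance(body, dict) or body.get("message") != "shutdown":
            self.send_error(400, "unexpected request body")
            return
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()
        # shutdown() blocks until serve_forever() returns, so it must run
        # in a different thread than the one serving this request.
        threading.Thread(target=self.server.shutdown, daemon=True).start()

    def log_message(self, format, *args):
        pass  # suppress per-request logging in this sketch


def make_server(host="0.0.0.0", port=8080, certfile=None, keyfile=None):
    """Create the server, wrapping the socket in TLS when a certificate is given."""
    server = ThreadingHTTPServer((host, port), ShutdownHandler)
    if certfile and keyfile:
        context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
        context.load_cert_chain(certfile=certfile, keyfile=keyfile)
        server.socket = context.wrap_socket(server.socket, server_side=True)
    return server
```

A container entry point could call `make_server(certfile=..., keyfile=...).serve_forever()` with
its own certificate paths. `serve_forever()` returns once a valid shutdown request arrives, and the
process can then exit with status 0.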

#### Recap

To recap the PostRun behavior:

- If the container exits successfully, transition to `Completed` status.
- If the container exits unsuccessfully after `retryLimit` number of retries, transition to the
`Error` status.
- If the container is running and has not exited after `postRunTimeoutSeconds` seconds, terminate
the container and transition to the `Error` status.
- If the container is still running during PostRun, an HTTPS POST request is sent to the
`/shutdown` endpoint on each container to attempt a graceful shutdown.
- If `postRunTimeoutSeconds` is set to zero, the container result will not be checked.
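
The rules in this recap can also be read as a small decision function. This is an illustrative
sketch rather than NNF code. The function name and parameters are hypothetical, and the labels
`NotChecked`, `Retrying`, and `Running` are placeholders invented for the example. Only `Completed`
and `Error` are statuses named in this guide.

```python
from typing import Optional


def post_run_outcome(
    exit_code: Optional[int],       # None means the container is still running
    retries_used: int,
    retry_limit: int,
    elapsed_seconds: float,         # time since entering PostRun
    post_run_timeout_seconds: int,
) -> str:
    """Map a container's state in PostRun to an outcome label."""
    if post_run_timeout_seconds == 0:
        return "NotChecked"  # result ignored; workflow proceeds to Teardown
    if exit_code == 0:
        return "Completed"
    if exit_code is not None:
        # Unsuccessful exit: an error only once the retries are exhausted.
        return "Error" if retries_used >= retry_limit else "Retrying"
    if elapsed_seconds >= post_run_timeout_seconds:
        return "Error"  # still running at the timeout; container is terminated
    return "Running"  # still running; /shutdown requests continue
```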

### Failure Retries
